---
language:
  - id
tags:
  - Twitter
license: apache-2.0
datasets:
  - Twitter 2021
widget:
  - text: "guweehh udh ga' paham lg sm [MASK]"
---

# IndoBERTweet 🐦

## 1. Paper
Fajri Koto, Jey Han Lau, and Timothy Baldwin. [_IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization_](https://arxiv.org/pdf/2109.04607.pdf). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (**EMNLP 2021**), Dominican Republic (virtual).

## 2. About
[IndoBERTweet](https://github.com/indolem/IndoBERTweet) is the first large-scale pretrained model for Indonesian Twitter. It is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.

## 3. Pretraining Data
We crawl Indonesian tweets over a 1-year period, from December 2019 to December 2020, using the official Twitter API with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of **409M word tokens**, two times larger than the training data used to pretrain [IndoBERT](https://aclanthology.org/2020.coling-main.66.pdf). Due to Twitter policy, this pretraining data will not be released to the public.

## 4. How to use
Load the model and tokenizer (tested with transformers==3.5.1):
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```

**Preprocessing Steps:**
* lower-casing all words
* converting user mentions and URLs into @USER and HTTPURL, respectively
* translating emoticons into text using the [emoji package](https://pypi.org/project/emoji/).

## 5. Results over 7 Indonesian Twitter Datasets

| Models | Sentiment (IndoLEM) | Sentiment (SmSA) | Emotion (EmoT) | Hate Speech (HS1) | Hate Speech (HS2) | NER (Formal) | NER (Informal) | Average |
|---|---|---|---|---|---|---|---|---|
| mBERT | 76.6 | 84.7 | 67.5 | 85.1 | 75.1 | 85.2 | 83.2 | 79.6 |
| malayBERT | 82.0 | 84.1 | 74.2 | 85.0 | 81.9 | 81.9 | 81.3 | 81.5 |
| IndoBERT (Willie, et al., 2020) | 84.1 | 88.7 | 73.3 | 86.8 | 80.4 | 86.3 | 84.3 | 83.4 |
| IndoBERT (Koto, et al., 2020) | 84.1 | 87.9 | 71.0 | 86.4 | 79.3 | 88.0 | 86.9 | 83.4 |
| IndoBERTweet (1M steps from scratch) | 86.2 | 90.4 | 76.0 | 88.8 | 87.5 | 88.1 | 85.4 | 86.1 |
| IndoBERT + Voc adaptation + 200k steps | 86.6 | 92.7 | 79.0 | 88.4 | 84.0 | 87.7 | 86.9 | 86.5 |
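
The preprocessing steps listed in Section 4 (lower-casing, replacing user mentions and URLs with @USER and HTTPURL, and converting emoji to text) can be sketched as below. This is a minimal illustration, not the authors' original script: the function name `preprocess_tweet`, the two regular expressions, and the order of the steps are assumptions. `emoji.demojize` comes from the emoji package mentioned above; if the package is not installed, this sketch leaves emoji unchanged.

```python
import re

try:
    import emoji  # third-party package: pip install emoji
except ImportError:
    emoji = None  # fallback so the sketch still runs; emoji are then left as-is

# Assumed patterns: the model card does not specify the exact matching rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess_tweet(text: str) -> str:
    """Hypothetical helper approximating the three preprocessing steps."""
    # Translate emoji into their textual names, e.g. ":red_heart:".
    if emoji is not None:
        text = emoji.demojize(text)
    # Lower-case before inserting placeholders so they stay upper-case.
    text = text.lower()
    # Replace URLs first, then user mentions, with the special tokens.
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_tweet("Halo @Budi_123 cek https://t.co/abc123 YA"))
```

The cleaned string can then be passed to the tokenizer loaded in Section 4, for example `tokenizer(preprocess_tweet(tweet), return_tensors="pt")`.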