# IndoBERTweet 🐦 :indonesia:

## 1. Paper
Fajri Koto, Jey Han Lau, and Timothy Baldwin. [_IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization_](https://arxiv.org/pdf/2109.04607.pdf). In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (**EMNLP 2021**), Dominican Republic (virtual).

## 2. About

[IndoBERTweet](https://github.com/indolem/IndoBERTweet) is the first large-scale pretrained model for Indonesian Twitter. It is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In the paper, we show that initializing domain-specific vocabulary with the average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.

## 3. Pretraining Data

We crawled Indonesian tweets over a one-year period, from December 2019 to December 2020, using the official Twitter API, with 60 keywords covering 4 main topics: economy, health, education, and government. We obtained a total of **409M word tokens**, two times larger than the training data used to pretrain [IndoBERT](https://aclanthology.org/2020.coling-main.66.pdf). Due to Twitter policy, this pretraining data will not be released to the public.

## 4. How to use

Load the model and tokenizer (tested with transformers==3.5.1):
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")
```

**Preprocessing steps:**
* lower-case all words
* convert user mentions and URLs into @USER and HTTPURL, respectively
* translate emoticons into text using the [emoji package](https://pypi.org/project/emoji/)
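The steps above can be sketched as a small helper function. The exact regular expressions below are assumptions for illustration (the official preprocessing script lives in the IndoBERTweet repository); the emoji conversion uses `emoji.demojize` from the third-party emoji package.

```python
import re

try:
    import emoji  # third-party: pip install emoji
except ImportError:
    emoji = None

def preprocess(tweet: str) -> str:
    """Apply the model card's preprocessing: lower-casing, masking
    user mentions and URLs, and translating emoji into text."""
    tweet = tweet.lower()
    tweet = re.sub(r"@\w+", "@USER", tweet)          # mask user mentions
    tweet = re.sub(r"https?://\S+", "HTTPURL", tweet)  # mask URLs
    if emoji is not None:
        tweet = emoji.demojize(tweet)  # e.g. 😀 -> :grinning_face:
    return tweet

print(preprocess("Halo @Jokowi cek https://t.co/abc"))
# -> halo @USER cek HTTPURL
```

Apply this function to each tweet before passing it to the tokenizer, so the input matches the pretraining distribution.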

## 5. Results over 7 Indonesian Twitter Datasets

<table>
  <col>
  <colgroup span="2"></colgroup>
  <colgroup span="2"></colgroup>
  <tr>
    <th rowspan="2">Models</th>
    <th colspan="2" scope="colgroup">Sentiment</th>
    <th colspan="1" scope="colgroup">Emotion</th>
    <th colspan="2" scope="colgroup">Hate Speech</th>
    <th colspan="2" scope="colgroup">NER</th>
    <th rowspan="2" scope="colgroup">Average</th>
  </tr>
  <tr>
    <th scope="col">IndoLEM</th>
    <th scope="col">SmSA</th>
    <th scope="col">EmoT</th>
    <th scope="col">HS1</th>
    <th scope="col">HS2</th>
    <th scope="col">Formal</th>
    <th scope="col">Informal</th>
  </tr>
  <tr>
    <td scope="row">mBERT</td>
    <td>76.6</td>
    <td>84.7</td>
    <td>67.5</td>
    <td>85.1</td>
    <td>75.1</td>
    <td>85.2</td>
    <td>83.2</td>
    <td>79.6</td>
  </tr>
  <tr>
    <td scope="row">malayBERT</td>
    <td>82.0</td>
    <td>84.1</td>
    <td>74.2</td>
    <td>85.0</td>
    <td>81.9</td>
    <td>81.9</td>
    <td>81.3</td>
    <td>81.5</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT (Wilie et al., 2020)</td>
    <td>84.1</td>
    <td>88.7</td>
    <td>73.3</td>
    <td>86.8</td>
    <td>80.4</td>
    <td>86.3</td>
    <td>84.3</td>
    <td>83.4</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT (Koto et al., 2020)</td>
    <td>84.1</td>
    <td>87.9</td>
    <td>71.0</td>
    <td>86.4</td>
    <td>79.3</td>
    <td>88.0</td>
    <td><b>86.9</b></td>
    <td>83.4</td>
  </tr>
  <tr>
    <td scope="row">IndoBERTweet (1M steps from scratch)</td>
    <td>86.2</td>
    <td>90.4</td>
    <td>76.0</td>
    <td><b>88.8</b></td>
    <td><b>87.5</b></td>
    <td><b>88.1</b></td>
    <td>85.4</td>
    <td>86.1</td>
  </tr>
  <tr>
    <td scope="row">IndoBERT + Voc adaptation + 200k steps</td>
    <td><b>86.6</b></td>
    <td><b>92.7</b></td>
    <td><b>79.0</b></td>
    <td>88.4</td>
    <td>84.0</td>
    <td>87.7</td>
    <td><b>86.9</b></td>
    <td><b>86.5</b></td>
  </tr>
</table>