updates
README.md (CHANGED)
The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data, as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient, compute- and data-wise, to train entirely on in-domain data with a specialized domain-specific tokenizer.
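As a quick orientation, the sketch below loads the model and its tweet-focused tokenizer through the 🤗 Transformers `Auto*` classes and runs one tweet through the masked-language-modeling head. The hub identifier `jhu-clsp/bernice` and the example tweet are illustrative assumptions, not taken from this README; substitute the actual repository name if it differs.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed hub ID for illustration; replace with the actual model repository.
model_name = "jhu-clsp/bernice"

# The custom tweet-focused tokenizer is loaded alongside the encoder weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode an example tweet; informal text, emoji, and hashtags are passed as-is.
inputs = tokenizer("Just landed in Tokyo! 🇯🇵 #travel", return_tensors="pt")
outputs = model(**inputs)

# Logits over the tokenizer vocabulary for each position in the tweet.
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```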

## Training data