aadelucia committed on
Commit
1755d74
1 Parent(s): 73134ca
Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -84,10 +84,10 @@ Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if
  The language of Twitter differs significantly from that of other domains commonly included in large language model training.
  While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained
  language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter,
- or are trained on a limited amount of in-domain Twitter data.We introduce Bernice, the first multilingual RoBERTa language
+ or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language
  model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual
  and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models
- adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall.We posit that it is
+ adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is
  more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
 
  ## Training data
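
For context on the abstract being edited above, a minimal usage sketch of Bernice and its tweet-focused tokenizer is shown below. It is not part of this commit; it assumes the model is published on the Hugging Face Hub under the ID `jhu-clsp/bernice` and is loaded with the standard `transformers` Auto classes.

```python
# Minimal sketch (not from this commit): load Bernice and its custom
# tweet-focused tokenizer via the Hugging Face transformers Auto classes.
# The hub ID "jhu-clsp/bernice" is an assumption here.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/bernice")
model = AutoModel.from_pretrained("jhu-clsp/bernice")

# Tokenize a short, informal multilingual tweet-style input (emoji, hashtag)
# and run it through the encoder to get contextual hidden states.
inputs = tokenizer("gm #NLProc 🚀 qué tal?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```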