dqnguyen committed
Commit d59e2e6
1 Parent(s): 09c8f9f

commit from Dat Quoc Nguyen

Files changed (7)
  1. README.md +50 -0
  2. config.json +25 -0
  3. merges.txt +0 -0
  4. pytorch_model.bin +3 -0
  5. tf_model.h5 +3 -0
  6. tokenizer.json +0 -0
  7. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ # <a name="introduction"></a> BERTweet: A pre-trained language model for English Tweets
+
+ BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained using the [RoBERTa](https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md) pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the **COVID-19** pandemic. The general architecture and experimental results of BERTweet can be found in our [paper](https://aclanthology.org/2020.emnlp-demos.2/):
+
+     @inproceedings{bertweet,
+         title     = {{BERTweet: A pre-trained language model for English Tweets}},
+         author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
+         booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
+         pages     = {9--14},
+         year      = {2020}
+     }
+
+ **Please CITE** our paper when BERTweet is used to help produce published results or is incorporated into other software.
+
+ For further information or requests, please go to [BERTweet's homepage](https://github.com/VinAIResearch/BERTweet)!
+
+ ### <a name="models2"></a> Pre-trained models
+
+
+ Model | #params | Arch. | Pre-training data
+ ---|---|---|---
+ `vinai/bertweet-base` | 135M | base | 850M English Tweets (cased)
+ `vinai/bertweet-covid19-base-cased` | 135M | base | 23M COVID-19 English Tweets (cased)
+ `vinai/bertweet-covid19-base-uncased` | 135M | base | 23M COVID-19 English Tweets (uncased)
+ `vinai/bertweet-large` | 355M | large | 873M English Tweets (cased)
+
+
+ ### <a name="usage2"></a> Example usage
+
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ bertweet = AutoModel.from_pretrained("vinai/bertweet-large")
+ tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")
+
+ # INPUT TWEET IS ALREADY NORMALIZED!
+ line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
+
+ input_ids = torch.tensor([tokenizer.encode(line)])
+
+ with torch.no_grad():
+     features = bertweet(input_ids)  # Model outputs are now tuples
+
+ ## With TensorFlow 2.0+:
+ # from transformers import TFAutoModel
+ # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-large")
+ ```
+
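The usage example above assumes the input tweet is already soft-normalized: user mentions replaced with `@USER`, URLs with `HTTPURL`, and emoji converted to text shortcodes such as `:cry:`. The sketch below, assuming the `nltk` and `emoji` packages are installed, approximates the `TweetNormalizer.py` script linked from BERTweet's homepage; it is an illustrative approximation, not the official implementation.

```python
# A sketch of BERTweet-style tweet normalization (an approximation of
# TweetNormalizer.py from the BERTweet repository, not the exact script).
# Assumes: pip install nltk emoji
from emoji import demojize
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()

def normalize_tweet(tweet: str) -> str:
    tokens = tweet_tokenizer.tokenize(tweet)
    out = []
    for token in tokens:
        if token.startswith("@") and len(token) > 1:
            out.append("@USER")            # mask user mentions
        elif token.lower().startswith(("http", "www")):
            out.append("HTTPURL")          # mask URLs
        elif len(token) == 1:
            out.append(demojize(token))    # emoji -> :shortcode:; other chars pass through
        else:
            out.append(token)
    return " ".join(out)

print(normalize_tweet("SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/abc via @user 😢"))
# e.g. "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:"
# (the exact emoji shortcode depends on the emoji package version)
```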
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "RobertaForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.2.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
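The hyperparameters above (24 layers, 16 attention heads, hidden size 1024, intermediate size 4096) are those of a standard RoBERTa-large encoder, which is why `bertweet-large` loads through the generic `roberta` model type. As a quick sanity check, the configuration can be inspected on its own; a minimal sketch:

```python
# Load just the configuration and confirm it matches config.json above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("vinai/bertweet-large")
assert config.model_type == "roberta"
assert config.num_hidden_layers == 24 and config.num_attention_heads == 16
assert config.hidden_size == 1024 and config.intermediate_size == 4096
# 514 = 512 usable positions + 2 reserved by RoBERTa's padding-offset convention
print(config.max_position_embeddings)
```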
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:36af147452ccf8846e38283d143efb1e1e003fd28c895008eed78a0a2e47794c
+ size 1422008553
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7c920d1c6aabf5261079966ccf681b03645b15af9890cf9c06f783049ccb8359
+ size 1630210852
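Both weight files are stored with Git LFS, so the diff shows only a three-line spec-v1 pointer: the spec URL, the SHA-256 of the actual payload, and its size in bytes. A minimal sketch of checking a downloaded blob against such a pointer follows; the file paths and helper names are illustrative, not part of any tool's API.

```python
# Verify a downloaded blob against its Git LFS pointer (spec v1).
# Pointer layout assumed: "version <url>", "oid sha256:<hex>", "size <bytes>".
import hashlib
from pathlib import Path

def read_pointer(pointer_path: str) -> tuple[str, int]:
    lines = Path(pointer_path).read_text().splitlines()
    fields = dict(line.split(" ", 1) for line in lines if line)
    return fields["oid"].split(":", 1)[1], int(fields["size"])

def verify_blob(blob_path: str, pointer_path: str) -> bool:
    expected_oid, expected_size = read_pointer(pointer_path)
    blob = Path(blob_path)
    if blob.stat().st_size != expected_size:
        return False                       # cheap size check first
    digest = hashlib.sha256()
    with blob.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_oid

# e.g. verify_blob("pytorch_model.bin", "pytorch_model.bin.pointer")  # illustrative paths
```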
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff