51la5 committed on
Commit
a1e14ce
1 Parent(s): 46ee6c7

Upload 6 files

Files changed (6)
  1. README.md +267 -0
  2. config.json +43 -0
  3. pytorch_model.bin +3 -0
  4. rust_model.ot +3 -0
  5. sentencepiece.bpe.model +3 -0
  6. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,267 @@
1
+ ---
2
+ language:
3
+ - multilingual
4
+ - af
5
+ - am
6
+ - ar
7
+ - as
8
+ - az
9
+ - be
10
+ - bg
11
+ - bn
12
+ - br
13
+ - bs
14
+ - ca
15
+ - cs
16
+ - cy
17
+ - da
18
+ - de
19
+ - el
20
+ - en
21
+ - eo
22
+ - es
23
+ - et
24
+ - eu
25
+ - fa
26
+ - fi
27
+ - fr
28
+ - fy
29
+ - ga
30
+ - gd
31
+ - gl
32
+ - gu
33
+ - ha
34
+ - he
35
+ - hi
36
+ - hr
37
+ - hu
38
+ - hy
39
+ - id
40
+ - is
41
+ - it
42
+ - ja
43
+ - jv
44
+ - ka
45
+ - kk
46
+ - km
47
+ - kn
48
+ - ko
49
+ - ku
50
+ - ky
51
+ - la
52
+ - lo
53
+ - lt
54
+ - lv
55
+ - mg
56
+ - mk
57
+ - ml
58
+ - mn
59
+ - mr
60
+ - ms
61
+ - my
62
+ - ne
63
+ - nl
64
+ - no
65
+ - om
66
+ - or
67
+ - pa
68
+ - pl
69
+ - ps
70
+ - pt
71
+ - ro
72
+ - ru
73
+ - sa
74
+ - sd
75
+ - si
76
+ - sk
77
+ - sl
78
+ - so
79
+ - sq
80
+ - sr
81
+ - su
82
+ - sv
83
+ - sw
84
+ - ta
85
+ - te
86
+ - th
87
+ - tl
88
+ - tr
89
+ - ug
90
+ - uk
91
+ - ur
92
+ - uz
93
+ - vi
94
+ - xh
95
+ - yi
96
+ - zh
97
+ ---
98
+
99
+ # xlm-roberta-large-finetuned-conll03-english
100
+
101
+ # Table of Contents
102
+
103
+ 1. [Model Details](#model-details)
104
+ 2. [Uses](#uses)
105
+ 3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
106
+ 4. [Training](#training)
107
+ 5. [Evaluation](#evaluation)
108
+ 6. [Environmental Impact](#environmental-impact)
109
+ 7. [Technical Specifications](#technical-specifications)
110
+ 8. [Citation](#citation)
111
+ 9. [Model Card Authors](#model-card-authors)
112
+ 10. [How to Get Started with the Model](#how-to-get-started-with-the-model)
113
+
114
+
115
+ # Model Details
116
+
117
+ ## Model Description
118
+
119
+ The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model, released in 2019, and is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. This model is [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) fine-tuned on the English [conll2003](https://huggingface.co/datasets/conll2003) dataset.
120
+
121
+ - **Developed by:** See [associated paper](https://arxiv.org/abs/1911.02116)
122
+ - **Model type:** Multilingual language model
123
+ - **Language(s) (NLP):** XLM-RoBERTa is a multilingual model trained on 100 different languages (see the [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr) for the full list); this model is fine-tuned on an English dataset
124
+ - **License:** More information needed
125
+ - **Related Models:** [RoBERTa](https://huggingface.co/roberta-base), [XLM](https://huggingface.co/docs/transformers/model_doc/xlm)
126
+ - **Parent Model:** [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large)
127
+ - **Resources for more information:**
128
+   - [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr)
129
+   - [Associated Paper](https://arxiv.org/abs/1911.02116)
130
+
131
+ # Uses
132
+
133
+ ## Direct Use
134
+
135
+ This model is a language model fine-tuned for token classification, a natural language understanding task in which a label is assigned to some tokens in a text.
136
+
137
+ ## Downstream Use
138
+
139
+ Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face [token classification docs](https://huggingface.co/tasks/token-classification).
140
+
141
+ ## Out-of-Scope Use
142
+
143
+ The model should not be used to intentionally create hostile or alienating environments for people.
144
+
145
+ # Bias, Risks, and Limitations
146
+
147
+ **CONTENT WARNING: Readers should be made aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.**
148
+
149
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). In the context of tasks relevant to this model, [Mishra et al. (2020)](https://arxiv.org/pdf/2008.03415.pdf) explore social biases in NER systems for English and find that there is systematic bias in existing NER systems in that they fail to identify named entities from different demographic groups (though this paper did not look at BERT). For example, using a sample sentence from [Mishra et al. (2020)](https://arxiv.org/pdf/2008.03415.pdf):
150
+
151
+ ```python
152
+ >>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
153
+ >>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
154
+ >>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
155
+ >>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
156
+ >>> classifier("Alya told Jasmine that Andrew could pay with cash..")
157
+ [{'end': 2,
158
+ 'entity': 'I-PER',
159
+ 'index': 1,
160
+ 'score': 0.9997861,
161
+ 'start': 0,
162
+ 'word': '▁Al'},
163
+ {'end': 4,
164
+ 'entity': 'I-PER',
165
+ 'index': 2,
166
+ 'score': 0.9998591,
167
+ 'start': 2,
168
+ 'word': 'ya'},
169
+ {'end': 16,
170
+ 'entity': 'I-PER',
171
+ 'index': 4,
172
+ 'score': 0.99995816,
173
+ 'start': 10,
174
+ 'word': '▁Jasmin'},
175
+ {'end': 17,
176
+ 'entity': 'I-PER',
177
+ 'index': 5,
178
+ 'score': 0.9999584,
179
+ 'start': 16,
180
+ 'word': 'e'},
181
+ {'end': 29,
182
+ 'entity': 'I-PER',
183
+ 'index': 7,
184
+ 'score': 0.99998057,
185
+ 'start': 23,
186
+ 'word': '▁Andrew'}]
187
+ ```
188
+
189
+ ## Recommendations
190
+
191
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
192
+
193
+ # Training
194
+
195
+ See the following resources for training data and training procedure details:
196
+ - [XLM-RoBERTa-large model card](https://huggingface.co/xlm-roberta-large)
197
+ - [CoNLL-2003 data card](https://huggingface.co/datasets/conll2003)
198
+ - [Associated paper](https://arxiv.org/pdf/1911.02116.pdf)
199
+
200
+ # Evaluation
201
+
202
+ See the [associated paper](https://arxiv.org/pdf/1911.02116.pdf) for evaluation details.
203
+
204
+ # Environmental Impact
205
+
206
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
207
+
208
+ - **Hardware Type:** 500 32GB Nvidia V100 GPUs (from the [associated paper](https://arxiv.org/pdf/1911.02116.pdf))
209
+ - **Hours used:** More information needed
210
+ - **Cloud Provider:** More information needed
211
+ - **Compute Region:** More information needed
212
+ - **Carbon Emitted:** More information needed
213
+
214
+ # Technical Specifications
215
+
216
+ See the [associated paper](https://arxiv.org/pdf/1911.02116.pdf) for further details.
217
+
218
+ # Citation
219
+
220
+ **BibTeX:**
221
+
222
+ ```bibtex
223
+ @article{conneau2019unsupervised,
224
+ title={Unsupervised Cross-lingual Representation Learning at Scale},
225
+ author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
226
+ journal={arXiv preprint arXiv:1911.02116},
227
+ year={2019}
228
+ }
229
+ ```
230
+
231
+ **APA:**
232
+ - Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
233
+
234
+ # Model Card Authors
235
+
236
+ This model card was written by the team at Hugging Face.
237
+
238
+ # How to Get Started with the Model
239
+
240
+ Use the code below to get started with the model. You can use this model directly within a pipeline for NER.
241
+
242
+ <details>
243
+ <summary> Click to expand </summary>
244
+
245
+ ```python
246
+ >>> from transformers import AutoTokenizer, AutoModelForTokenClassification
247
+ >>> from transformers import pipeline
248
+ >>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
249
+ >>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
250
+ >>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
251
+ >>> classifier("Hello I'm Omar and I live in Zürich.")
252
+
253
+ [{'end': 14,
254
+ 'entity': 'I-PER',
255
+ 'index': 5,
256
+ 'score': 0.9999175,
257
+ 'start': 10,
258
+ 'word': '▁Omar'},
259
+ {'end': 35,
260
+ 'entity': 'I-LOC',
261
+ 'index': 10,
262
+ 'score': 0.9999906,
263
+ 'start': 29,
264
+ 'word': '▁Zürich'}]
265
+ ```
266
+
267
+ </details>
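
The pipeline outputs in the README above list one entry per SentencePiece piece (for example `▁Al` and `ya` for "Alya"), so a consumer usually has to merge pieces back into whole words. The following is a minimal sketch of that merging step in plain Python, using the `start`/`end`/`entity`/`score` values printed in the README. The function name `merge_subword_entities` and the merging rule (same label and touching character offsets) are assumptions made for illustration, not part of the model card. The `transformers` pipeline also accepts an `aggregation_strategy` argument for this purpose, which is likely the better choice in practice.

```python
from typing import Dict, List


def merge_subword_entities(text: str, pieces: List[Dict]) -> List[Dict]:
    """Merge per-piece NER predictions into contiguous character spans.

    Assumption: two pieces belong to the same word when they carry the same
    label and the second starts exactly where the first ends.
    """
    merged: List[Dict] = []
    for piece in sorted(pieces, key=lambda p: p["start"]):
        last = merged[-1] if merged else None
        if last and last["entity"] == piece["entity"] and last["end"] == piece["start"]:
            # Extend the current span and keep the most pessimistic score.
            last["end"] = piece["end"]
            last["score"] = min(last["score"], piece["score"])
        else:
            merged.append({k: piece[k] for k in ("entity", "start", "end", "score")})
    # Recover the surface form from the original text, not from the pieces.
    for span in merged:
        span["word"] = text[span["start"]:span["end"]]
    return merged


# Offsets, labels, and scores copied from the README example output.
text = "Alya told Jasmine that Andrew could pay with cash.."
pieces = [
    {"entity": "I-PER", "start": 0, "end": 2, "score": 0.9997861},
    {"entity": "I-PER", "start": 2, "end": 4, "score": 0.9998591},
    {"entity": "I-PER", "start": 10, "end": 16, "score": 0.99995816},
    {"entity": "I-PER", "start": 16, "end": 17, "score": 0.9999584},
    {"entity": "I-PER", "start": 23, "end": 29, "score": 0.99998057},
]
spans = merge_subword_entities(text, pieces)
print([(span["word"], span["entity"]) for span in spans])
```

Because the rule only joins pieces with touching offsets, multi-word entities separated by whitespace stay as separate spans in this sketch.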
config.json ADDED
@@ -0,0 +1,43 @@
1
+ {
2
+ "_num_labels": 8,
3
+ "architectures": [
4
+ "XLMRobertaForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 1024,
12
+ "id2label": {
13
+ "0": "B-LOC",
14
+ "1": "B-MISC",
15
+ "2": "B-ORG",
16
+ "3": "I-LOC",
17
+ "4": "I-MISC",
18
+ "5": "I-ORG",
19
+ "6": "I-PER",
20
+ "7": "O"
21
+ },
22
+ "initializer_range": 0.02,
23
+ "intermediate_size": 4096,
24
+ "label2id": {
25
+ "B-LOC": 0,
26
+ "B-MISC": 1,
27
+ "B-ORG": 2,
28
+ "I-LOC": 3,
29
+ "I-MISC": 4,
30
+ "I-ORG": 5,
31
+ "I-PER": 6,
32
+ "O": 7
33
+ },
34
+ "layer_norm_eps": 1e-05,
35
+ "max_position_embeddings": 514,
36
+ "model_type": "xlm-roberta",
37
+ "num_attention_heads": 16,
38
+ "num_hidden_layers": 24,
39
+ "output_past": true,
40
+ "pad_token_id": 1,
41
+ "type_vocab_size": 1,
42
+ "vocab_size": 250002
43
+ }
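
The `id2label` and `label2id` maps in the config above determine how the eight output classes of the classification head are named. Note that the label set contains no `B-PER`, which is consistent with every person token in the README examples being tagged `I-PER`. The sketch below shows, under stated assumptions, how per-token scores could be turned into label strings with these maps. The logits are made-up values and `decode_labels` is a hypothetical helper, not a `transformers` API.

```python
import json
from typing import List

# Label maps copied from the config.json diff above.
config_fragment = json.loads("""
{
  "id2label": {"0": "B-LOC", "1": "B-MISC", "2": "B-ORG", "3": "I-LOC",
               "4": "I-MISC", "5": "I-ORG", "6": "I-PER", "7": "O"},
  "label2id": {"B-LOC": 0, "B-MISC": 1, "B-ORG": 2, "I-LOC": 3,
               "I-MISC": 4, "I-ORG": 5, "I-PER": 6, "O": 7}
}
""")

# JSON object keys are always strings, so convert the class ids to int.
id2label = {int(idx): label for idx, label in config_fragment["id2label"].items()}
label2id = config_fragment["label2id"]

# Sanity check: the two maps should be exact inverses of each other.
maps_are_consistent = all(label2id[label] == idx for idx, label in id2label.items())


def decode_labels(logits: List[List[float]]) -> List[str]:
    """Pick the highest-scoring class per token and map it to its label."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logits]


# Hypothetical logits for two tokens (illustrative values, not model output).
example_logits = [
    [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 4.2, 0.5],  # highest score at index 6
    [0.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 5.1],  # highest score at index 7
]
decoded = decode_labels(example_logits)
print(maps_are_consistent, decoded)
```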
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7eb522dbc4c6a8d172dd64fdd274c6450db29b636e258655a8d44049b6091bcb
3
+ size 2239692364
rust_model.ot ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:447953fd8e79dcbbe2945f43589ac037f95dd2a3f1f67080c514ff1744659eda
3
+ size 2239692336
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
3
+ size 5069051
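
The three-line additions for `pytorch_model.bin`, `rust_model.ot`, and `sentencepiece.bpe.model` are Git LFS pointer files rather than the binaries themselves: each records a spec version, a SHA-256 object id, and the size in bytes. As a rough sketch, such a pointer could be parsed as shown below. The helper name `parse_lfs_pointer` and the returned dictionary layout are assumptions for illustration, and the parser only handles the three keys that appear in this commit.

```python
def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the pytorch_model.bin diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:7eb522dbc4c6a8d172dd64fdd274c6450db29b636e258655a8d44049b6091bcb\n"
    "size 2239692364\n"
)
print(pointer["hash_algorithm"], round(pointer["size_bytes"] / 1e9, 2), "GB")
```

The size field gives the download size of the real weights (about 2.24 GB for `pytorch_model.bin`) without fetching them.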
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff