Michael Beukman committed on
Commit
e8e8e45
1 Parent(s): c682ba1

Initial Commit

README.md ADDED
@@ -0,0 +1,119 @@
1
+ ---
2
+ language:
3
+ - sw
4
+ tags:
5
+ - NER
6
+ - token-classification
7
+ datasets:
8
+ - masakhaner
9
+ metrics:
10
+ - f1
11
+ - precision
12
+ - recall
13
+ widget:
14
+ - text: "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
15
+ ---
16
+
17
+
18
+ # xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili
19
+ This is a token classification (specifically NER) model obtained by fine-tuning [xlm-roberta-base-finetuned-amharic](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-amharic) on the Swahili part of the [MasakhaNER](https://arxiv.org/abs/2103.11811) dataset.
20
+
21
+ More information and other similar models can be found in the [main GitHub repository](https://github.com/Michael-Beukman/NERTransfer).
22
+
23
+ ## About
24
+ This model is transformer-based and was fine-tuned on the MasakhaNER dataset, a named entity recognition dataset containing mostly news articles in 10 different African languages.
25
+ The model was fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. This process was repeated 5 times (with different random seeds), and the uploaded model is the one that performed best of those 5 seeds (by aggregate F1 on the test set).
26
+
27
+ This model was fine-tuned by me, Michael Beukman, while doing a project at the University of the Witwatersrand, Johannesburg. This is version 1, as of 20 November 2021.
28
+ This model is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
29
+
30
+ ### Contact & More information
31
+ For more information about the models, including training scripts, detailed results and further resources, you can visit the [main GitHub repository](https://github.com/Michael-Beukman/NERTransfer). You can contact me by filing an issue on this repository.
32
+
33
+ ### Training Resources
34
+ In the interest of openness and of reporting the resources used, we list here how long the training process took and the minimum resources needed to reproduce it. Fine-tuning each model on the NER dataset took between 10 and 30 minutes and was performed on an NVIDIA RTX 3090 GPU. Using a batch size of 32 required at least 14 GB of GPU memory, although it was just possible to fit these models in around 6.5 GB of VRAM when using a batch size of 1.
35
+
36
+
37
+ ## Data
38
+ The train, evaluation and test datasets were taken directly from the MasakhaNER [GitHub](https://github.com/masakhane-io/masakhane-ner) repository, with minimal to no preprocessing, as the original dataset is already of high quality.
39
+ The motivation for the use of this data is that it is the "first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages" ([source](https://arxiv.org/pdf/2103.11811.pdf)). The quality of the data, as well as the groundwork laid by the paper introducing it, are further reasons why this dataset was used. For evaluation, the dedicated test split was used. It comes from the same distribution as the training data, so this model may not generalise to other distributions, and further testing would be needed to investigate this. The exact distribution of the data is covered in detail [here](https://arxiv.org/abs/2103.11811).
40
+
41
+ ## Intended Use
42
+ This model is intended to be used for NLP research into e.g. interpretability or transfer learning. Using this model in production is not supported, as generalisability and raw performance are limited. In particular, it is not designed to be used in any important downstream task that could affect people, as harm could be caused by the limitations of the model, described next.
43
+
44
+ ## Limitations
45
+ This model was only trained on one (relatively small) dataset, covering one task (NER) in one domain (news articles) and in a set span of time. The results may not generalise, and the model may perform badly, or in an unfair / biased way if used on other tasks. Although the purpose of this project was to investigate transfer learning, the performance on languages that the model was not trained for does suffer.
46
+
47
+
48
+ Because this model used xlm-roberta-base as its starting point (potentially with domain adaptive fine-tuning on specific languages), that model's limitations can also apply here. These can include being biased towards the hegemonic viewpoint of most of its training data, being ungrounded, and having subpar results on other languages (possibly due to unbalanced training data).
49
+
50
+ As [Adelani et al. (2021)](https://arxiv.org/abs/2103.11811) showed, the models in general struggled with entities that were either longer than 3 words or not contained in the training data. This could bias the models towards not finding, e.g., names of people that consist of many words, possibly leading to a misrepresentation in the results. Similarly, names that are uncommon and may not have appeared in the training data (due to e.g. different languages) would also be predicted less often.
51
+
52
+
53
+ Additionally, this model has not been verified in practice, and other, more subtle problems may become prevalent if used without any verification that it does what it is supposed to.
54
+
55
+ ### Privacy & Ethical Considerations
56
+ The data comes only from publicly available news sources, so it should cover only public figures and those who agreed to be reported on. See the original MasakhaNER paper for more details.
57
+
58
+ No explicit ethical considerations or adjustments were made during fine-tuning of this model.
59
+
60
+ ## Metrics
61
+ The language adaptive models achieve (mostly) superior performance over starting with xlm-roberta-base. Our main metric was the aggregate F1 score for all NER categories.
62
+
63
+ These metrics are on the test set for MasakhaNER, whose data distribution is similar to that of the training set, so these results do not directly indicate how well these models generalise.
64
+ We do find large variation in transfer results when starting from different seeds (5 different seeds were tested), indicating that the fine-tuning process for transfer might be unstable.
65
+
66
+ The metrics used were chosen to be consistent with previous work, and to facilitate research. Other metrics may be more appropriate for other purposes.
67
+ ## Caveats and Recommendations
68
+ In general, this model performed worse on the 'date' category compared to others, so if dates are a critical factor, then that might need to be taken into account and addressed, by for example collecting and annotating more data.
69
+
70
+ ## Model Structure
71
+ Here are some performance details on this specific model, compared to others we trained.
72
+ All of these metrics were calculated on the test set, and the seed that gave the best overall F1 score was chosen. The first three result columns are averaged over all categories, and the last four provide performance broken down by category.
73
+
74
+ This model can predict the following labels for a token ([source](https://huggingface.co/Davlan/xlm-roberta-large-masakhaner)):
75
+
76
+
77
+ Abbreviation|Description
78
+ -|-
79
+ O|Outside of a named entity
80
+ B-DATE |Beginning of a DATE entity right after another DATE entity
81
+ I-DATE |DATE entity
82
+ B-PER |Beginning of a person’s name right after another person’s name
83
+ I-PER |Person’s name
84
+ B-ORG |Beginning of an organisation right after another organisation
85
+ I-ORG |Organisation
86
+ B-LOC |Beginning of a location right after another location
87
+ I-LOC |Location
88
+
89
+
90
+
91
+ | Model Name | Starting point | Evaluation / Fine-tune Language | F1 | Precision | Recall | F1 (DATE) | F1 (LOC) | F1 (ORG) | F1 (PER) |
92
+ | -------------------------------------------------- | -------------------- | -------------------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
93
+ | [xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili) (This model) | [amh](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-amharic) | swa | 86.66 | 85.23 | 88.13 | 84.00 | 90.00 | 74.00 | 92.00 |
94
+ | [xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili) | [hau](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-hausa) | swa | 88.36 | 86.95 | 89.82 | 86.00 | 91.00 | 77.00 | 94.00 |
95
+ | [xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili) | [ibo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-igbo) | swa | 87.75 | 86.55 | 88.97 | 85.00 | 92.00 | 77.00 | 91.00 |
96
+ | [xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili) | [kin](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-kinyarwanda) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
97
+ | [xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili) | [lug](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luganda) | swa | 88.93 | 87.64 | 90.25 | 83.00 | 92.00 | 79.00 | 95.00 |
98
+ | [xlm-roberta-base-finetuned-luo-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili) | [luo](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-luo) | swa | 87.93 | 86.91 | 88.97 | 83.00 | 91.00 | 76.00 | 94.00 |
99
+ | [xlm-roberta-base-finetuned-naija-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-naija-finetuned-ner-swahili) | [pcm](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-naija) | swa | 87.26 | 85.15 | 89.48 | 83.00 | 91.00 | 75.00 | 93.00 |
100
+ | [xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili) | [swa](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-swahili) | swa | 90.36 | 88.59 | 92.20 | 86.00 | 93.00 | 79.00 | 96.00 |
101
+ | [xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili) | [wol](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-wolof) | swa | 87.80 | 86.50 | 89.14 | 86.00 | 90.00 | 78.00 | 93.00 |
102
+ | [xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili) | [yor](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-yoruba) | swa | 87.73 | 86.67 | 88.80 | 85.00 | 91.00 | 75.00 | 93.00 |
103
+ | [xlm-roberta-base-finetuned-ner-swahili](https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-swahili) | [base](https://huggingface.co/xlm-roberta-base) | swa | 88.71 | 86.84 | 90.67 | 83.00 | 91.00 | 79.00 | 95.00 |
104
+ ## Usage
105
+ To use this model (or others), you can do the following, changing only the model name ([source](https://huggingface.co/dslim/bert-base-NER)):
106
+
107
+ ```
108
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
109
+ from transformers import pipeline
110
+ model_name = 'mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili'
111
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
112
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
113
+
114
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
115
+ example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
116
+
117
+ ner_results = nlp(example)
118
+ print(ner_results)
119
+ ```
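The `ner` pipeline call above returns one dictionary per predicted token. As a rough sketch of how such per-token output could be merged into whole entities, the helper below groups consecutive tags of the same type. The function name `group_entities` and the sample tags are illustrative assumptions, not part of the model repository and not real model output.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token B-/I- tags into (entity_text, entity_type) pairs."""
    entities = []
    current_words: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        # Strip the B-/I- prefix; "O" has no entity type.
        ent_type = tag[2:] if tag != "O" else None
        starts_new = tag.startswith("B-") or ent_type != current_type
        if current_type is not None and starts_new:
            # Close the entity collected so far.
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if ent_type is not None:
            current_words.append(token)
            current_type = ent_type
    if current_type is not None:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical tags for part of the widget sentence (not real model output).
tokens = ["Wizara", "ya", "afya", "ya", "Tanzania", "imeripoti", "Jumatatu"]
tags = ["I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "I-DATE"]
print(group_entities(tokens, tags))
```

The Transformers pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping on real model output, which is likely the more convenient route in practice.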
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "_name_or_path": "Davlan/xlm-roberta-base-finetuned-amharic",
3
+ "architectures": [
4
+ "XLMRobertaForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "classifier_dropout": null,
9
+ "eos_token_id": 2,
10
+ "gradient_checkpointing": false,
11
+ "hidden_act": "gelu",
12
+ "hidden_dropout_prob": 0.1,
13
+ "hidden_size": 768,
14
+ "id2label": {
15
+ "0": "O",
16
+ "1": "B-DATE",
17
+ "2": "I-DATE",
18
+ "3": "B-PER",
19
+ "4": "I-PER",
20
+ "5": "B-ORG",
21
+ "6": "I-ORG",
22
+ "7": "B-LOC",
23
+ "8": "I-LOC"
24
+ },
25
+ "initializer_range": 0.02,
26
+ "intermediate_size": 3072,
27
+ "label2id": {
28
+ "B-DATE": 1,
29
+ "B-LOC": 7,
30
+ "B-ORG": 5,
31
+ "B-PER": 3,
32
+ "I-DATE": 2,
33
+ "I-LOC": 8,
34
+ "I-ORG": 6,
35
+ "I-PER": 4,
36
+ "O": 0
37
+ },
38
+ "layer_norm_eps": 1e-05,
39
+ "max_position_embeddings": 514,
40
+ "model_type": "xlm-roberta",
41
+ "num_attention_heads": 12,
42
+ "num_hidden_layers": 12,
43
+ "output_past": true,
44
+ "pad_token_id": 1,
45
+ "position_embedding_type": "absolute",
46
+ "torch_dtype": "float32",
47
+ "transformers_version": "4.11.3",
48
+ "type_vocab_size": 1,
49
+ "use_cache": true,
50
+ "vocab_size": 250002
51
+ }
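The `id2label` and `label2id` entries in `config.json` above define how the classifier's 9 output indices map to tag names. A minimal sketch of that mapping follows; the `decode` helper and the `scores` values are assumptions made up for illustration only.

```python
# Label mapping copied from the config.json above (keys converted to integers).
id2label = {0: "O", 1: "B-DATE", 2: "I-DATE", 3: "B-PER", 4: "I-PER",
            5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}
label2id = {label: idx for idx, label in id2label.items()}


def decode(scores_per_token):
    """Pick the highest-scoring label index for each token and name it."""
    labels = []
    for scores in scores_per_token:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(id2label[best])
    return labels


# Hypothetical classifier scores for two tokens (one row of 9 values each).
scores = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 2.5],  # highest at index 8
    [3.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # highest at index 0
]
print(decode(scores))
```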
eval_results.txt ADDED
@@ -0,0 +1,15 @@
1
+ f1 = 0.8893709327548807
2
+ loss = 0.22877022493675644
3
+ precision = 0.8674188998589563
4
+ recall = 0.9124629080118695
5
+ report = precision recall f1-score support
6
+
7
+ DATE 0.74 0.90 0.81 80
8
+ LOC 0.92 0.92 0.92 303
9
+ ORG 0.69 0.79 0.74 86
10
+ PER 0.94 0.95 0.95 205
11
+
12
+ micro avg 0.87 0.91 0.89 674
13
+ macro avg 0.82 0.89 0.85 674
14
+ weighted avg 0.87 0.91 0.89 674
15
+
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5210ff7500343b36b6c9e4211dd09db87ba59ecba318d4f046b09a87b9042540
3
+ size 1109921841
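The three `+` lines above are a Git LFS pointer: the repository stores this small text stub, while the actual weights live in LFS storage and are identified by their SHA-256 digest and size in bytes. As a hedged sketch, downloaded bytes could be checked against such a pointer as follows; `parse_pointer` and `matches_pointer` are hypothetical helper names, not part of Git LFS or the Hugging Face tooling.

```python
import hashlib


def parse_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algorithm": algorithm,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that raw bytes have the size and SHA-256 digest the pointer names."""
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["digest"])


# Pointer contents for pytorch_model.bin, copied from the diff above.
pointer = parse_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:5210ff7500343b36b6c9e4211dd09db87ba59ecba318d4f046b09a87b9042540\n"
    "size 1109921841\n"
)
print(pointer["algorithm"], pointer["size"])
```

For a file of this size (about 1.1 GB), hashing in chunks while reading from disk would be preferable to loading everything into memory; the in-memory version is kept here only for brevity.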
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
3
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
test_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
test_results.txt ADDED
@@ -0,0 +1,15 @@
1
+ f1 = 0.866555462885738
2
+ loss = 0.23361517936721038
3
+ precision = 0.85233798195242
4
+ recall = 0.8812553011026294
5
+ report = precision recall f1-score support
6
+
7
+ DATE 0.77 0.93 0.84 162
8
+ LOC 0.88 0.92 0.90 463
9
+ ORG 0.75 0.73 0.74 221
10
+ PER 0.93 0.91 0.92 333
11
+
12
+ micro avg 0.85 0.88 0.87 1179
13
+ macro avg 0.83 0.87 0.85 1179
14
+ weighted avg 0.85 0.88 0.87 1179
15
+
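The summary numbers in `test_results.txt` are related by the usual definition of F1 as the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-class F1 scores. The short check below recomputes both from the values listed above; the helper name `f1_score` is an illustrative choice, and this is a sanity check rather than the evaluation script used for the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall from test_results.txt.
precision = 0.85233798195242
recall = 0.8812553011026294
overall_f1 = f1_score(precision, recall)
print(round(overall_f1, 4))

# Per-class F1 scores (DATE, LOC, ORG, PER) from the same report.
per_class_f1 = [0.84, 0.90, 0.74, 0.92]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 2))
```

The recomputed overall value agrees with the `f1` line of the file, and it is this figure (as a percentage) that appears in the F1 column of the README table for this model.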
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "Davlan/xlm-roberta-base-finetuned-amharic", "sp_model_kwargs": {}, "tokenizer_class": "XLMRobertaTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:022c4df669a59f60ab9a58a25460a7029d881ec69a9936a19706f6583311471e
3
+ size 1583