---
license: apache-2.0
language:
- sw
---
# SwahBERT: Language model of Swahili

The model and all credits belong to the original authors. The model was uploaded to the HuggingFace Hub for convenience.
For more details, please refer to the **[original repository](https://github.com/gatimartin/SwahBERT)**.

SwahBERT is a pretrained monolingual language model for Swahili. <br>
The model was trained for 800K steps on a 105MB corpus collected from news sites, online discussions, and Wikipedia. <br>
The model was evaluated on several downstream tasks: emotion classification, news classification, sentiment classification, and named entity recognition.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("swahbert-base-uncased")

# Tokenize input text
text = "Mlima Kilimanjaro unapatikana Tanzania"
tokenized_text = tokenizer.tokenize(text)

# SwahBERT => ['mlima', 'kilimanjaro', 'unapatikana', 'tanzania']
# mBERT    => ['ml', '##ima', 'ki', '##lima', '##nja', '##ro', 'una', '##patikana', 'tan', '##zania']
```

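The contrast above comes from WordPiece subword splitting: mBERT fragments Swahili words into many `##`-prefixed pieces, while SwahBERT's Swahili-specific vocabulary keeps most words whole. As a minimal illustration (plain Python, not part of the original repository), `##` pieces can be rejoined back into words like this:

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens: a piece starting with '##' attaches to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' continuation marker and append
        else:
            words.append(tok)
    return words

# mBERT's subword output for the example sentence above
mbert_tokens = ['ml', '##ima', 'ki', '##lima', '##nja', '##ro',
                'una', '##patikana', 'tan', '##zania']
print(merge_wordpieces(mbert_tokens))
# ['mlima', 'kilimanjaro', 'unapatikana', 'tanzania']
```

Fewer fragments per word means more of the model's context window is spent on actual words, which is one practical benefit of a monolingual vocabulary.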
## Pre-training data
The text was extracted from different sources:<br>
- News sites: `United Nations news, Voice of America (VoA), Deutsche Welle (DW) and taifaleo`<br>
- Forums: `JaiiForum`<br>
- `Wikipedia`

## Pre-trained Models
Download the models here:<br>
- **[`SwahBERT-Base, Uncased`](https://drive.google.com/drive/folders/1HZTCqxt93F5NcvgAWcbrXZammBPizdxF?usp=sharing)**: 12-layer, 768-hidden, 12-heads, 124M parameters
- **[`SwahBERT-Base, Cased`](https://drive.google.com/drive/folders/1cCcPopqTyzY6AnH9quKcT9kG5zH7tgEZ?usp=sharing)**: 12-layer, 768-hidden, 12-heads, 111M parameters

Steps | Vocab size | MLM acc | NSP acc | Loss |
--- | --- | --- | --- | --- |
**800K** | **50K (uncased)** | **76.54** | **99.67** | **1.0667** |
**800K** | **32K (cased)** | **76.94** | **99.33** | **1.0562** |


## Emotion Dataset
We released the **[`Swahili emotion dataset`](https://github.com/gatimartin/SwahBERT/tree/main/emotion_dataset)**.<br>
The data consists of ~13K emotion-annotated comments from social media platforms and a translated English dataset. <br>
The data is multi-label with six of Ekman's emotions (happy, surprise, sadness, fear, anger, and disgust) or neutral.

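Because the dataset is multi-label, a single comment can carry several of the seven labels at once. A minimal sketch of encoding label sets as binary indicator vectors, the usual input format for multi-label classifiers (the label names follow the list above; the dataset's actual file format may differ):

```python
LABELS = ["happy", "surprise", "sadness", "fear", "anger", "disgust", "neutral"]

def to_multi_hot(label_sets):
    """Encode each example's set of labels as a 0/1 vector over LABELS."""
    index = {name: i for i, name in enumerate(LABELS)}
    vectors = []
    for labels in label_sets:
        vec = [0] * len(LABELS)
        for name in labels:
            vec[index[name]] = 1
        vectors.append(vec)
    return vectors

# A comment may express both anger and disgust at once
print(to_multi_hot([{"anger", "disgust"}, {"happy"}]))
# [[0, 0, 0, 0, 1, 1, 0], [1, 0, 0, 0, 0, 0, 0]]
```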
## Evaluation
The model was tested on four downstream tasks, including our new emotion dataset.

F1-scores of language models on downstream tasks:

Task | SwahBERT | SwahBERT_cased | mBERT |
--- | --- | --- | --- |
Emotion | 64.46 | 64.77 | 60.52 |
News | 90.90 | 89.90 | 89.73 |
Sentiment | 70.94 | 71.12 | 67.20 |
NER | 88.50 | 88.60 | 89.36 |
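For reference, the F1-score reported above is the harmonic mean of precision and recall. A generic sketch of the metric (not the authors' evaluation code, which lives in the original repository):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 80 true positives, 20 false positives, 20 false negatives
print(round(f1_score(80, 20, 20), 2))
# 0.8
```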

## Citation
Please use the following citation if you use the model or dataset:

```
@inproceedings{martin-etal-2022-swahbert,
    title = "{S}wah{BERT}: Language Model of {S}wahili",
    author = "Martin, Gati and Mswahili, Medard Edmund and Jeong, Young-Seob and Woo, Jiyoung",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.23",
    pages = "303--313"
}
```