MoseliMotsoehli committed on
Commit 371e8c3 · 1 Parent(s): 2c47d42

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -9,7 +9,7 @@ Pretrained model on the Tswana language using a masked language modeling (MLM) o
  TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
 
  ## Intended uses & limitations
- The model can be used for either masked language modeling or next word prediction. It can also be fine-tuned on a specific down-stream NLP application.
+ The model can be used for either masked language modeling or next-word prediction. It can also be fine-tuned on a specific downstream NLP application.
 
  #### How to use
 
@@ -45,15 +45,15 @@ The model can be used for either masked language modeling or next word predicti
  ```
 
  #### Limitations and bias
- The model is trained on a relatively small collection of setwana, mostly from news articles and creative writtings, and so is not representative enough of the language as yet.
+ The model is trained on a relatively small collection of sestwana, mostly from news articles and creative writings, and so is not representative enough of the language as yet.
 
  ## Training data
 
  1. The largest portion of this dataset (10k) sentences of text, comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
 
- 2. I Then added SABC news headlines collected by Marivate Vukosi, & Sefara Tshephisho, (2020) that is generously made available on [zenoodo](http://doi.org/10.5281/zenodo.3668495 ). This added 185 tswana sentences to my corpus.
+ 2. We then added SABC news headlines collected by Marivate Vukosi, & Sefara Tshephisho, (2020) that are generously made available on [zenoodo](http://doi.org/10.5281/zenodo.3668495 ). This added 185 tswana sentences to my corpus.
 
- 3. I went on to add 300 more sentences by scrapping following news sites and blogs that mosty originate in Botswana. I actively continue to expand the dataset.
+ 3. We went on to add 300 more sentences by scrapping following news sites and blogs that mostly originate in Botswana. We actively continue to expand the dataset.
 
  * http://setswana.blogspot.com/
  * https://omniglot.com/writing/tswana.php