---
license: apache-2.0
mask_token: "<mask>"
tags:
- generated_from_trainer
model-index:
- name: distilbert-base-nepali
  results: []
widget:
- text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
  example_title: "Example 1"
- text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
  example_title: "Example 2"
- text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
  example_title: "Example 3"
---

# distilbert-base-nepali

This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization, similar to [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), and trains a [DistilBERT model](https://arxiv.org/abs/1910.01108) for language modeling.

It achieves the following results on the evaluation set:

| MLM Probability | Evaluation Loss | Evaluation Perplexity |
|----------------:|----------------:|----------------------:|
| 15%             | 2.349           | 10.479                |
| 20%             | 2.605           | 13.535                |
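
As a quick sanity check, the reported perplexities are simply the exponential of the evaluation (cross-entropy) losses:

```python
import math

# perplexity = exp(cross-entropy loss)
print(math.exp(2.349))  # ≈ 10.48 (15% MLM probability)
print(math.exp(2.605))  # ≈ 13.53 (20% MLM probability)
```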

## Model description

Refer to the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) model card for details of the architecture.

## Intended uses & limitations

This backbone model is intended to be fine-tuned on Nepali-language downstream tasks such as sequence classification, token classification, or question answering.
Because the language model was trained on text grouped into blocks of 512 tokens, it handles sequences of up to 512 tokens, and it may not perform satisfactorily on shorter sequences.
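
As a minimal sketch of that fine-tuning path (the two-label task here is hypothetical, not something this card defines):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pre-trained backbone with a freshly initialized classification head.
# num_labels=2 is a hypothetical binary task; set it to match your dataset.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2)
```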

## Usage

This model can be used directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")

[{'score': 0.04128897562623024,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2605,
  'token_str': 'मौसम'},
 {'score': 0.04100276157259941,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 2792,
  'token_str': 'प्रकृति'},
 {'score': 0.026525357738137245,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 387,
  'token_str': 'पानी'},
 {'score': 0.02340106852352619,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 1313,
  'token_str': 'जल'},
 {'score': 0.02055591531097889,
  'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
  'token': 790,
  'token_str': 'वातावरण'}]
```

Here is how to use the model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')

# prepare input; the placeholder string means "place the desired text here"
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass; output.logits holds the MLM scores over the vocabulary
output = model(**encoded_input)
```

## Training data

This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines [OSCAR](https://huggingface.co/datasets/oscar), [cc100](https://huggingface.co/datasets/cc100), and a set of Nepali articles scraped from Wikipedia.
For language model training, the texts in the training set are grouped into blocks of 512 tokens, as sketched below.
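
A minimal sketch of that grouping step, following the standard `group_texts` pattern from the Hugging Face language-modeling examples (the card does not publish its exact preprocessing code, so treat this as an assumption):

```python
block_size = 512

def group_texts(examples):
    # Concatenate the tokenized texts, then split the result into
    # fixed-size blocks, dropping the incomplete final block.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
```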

## Tokenization

A Sentence Piece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000, and model-max-length=512.
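
For illustration, here is how a tokenizer with those settings could be trained with the `tokenizers` library; this is an assumption, since the card does not publish its tokenizer-training script. The corpus file name and special-token list are hypothetical, and model-max-length is applied later on the `transformers` tokenizer rather than here:

```python
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=['nepalitext_subset.txt'],  # hypothetical path to the training subset
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'],
)
```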

## Training procedure

The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased): 512 tokens per instance, 28 instances per batch, and around 35.7K training steps per epoch.
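
As a quick consistency check against the *Training results* table below, the per-epoch step counts line up with a corpus of roughly one million 512-token blocks:

```python
# steps per epoch x blocks per batch ≈ total training blocks per epoch
print(38864 * 26)  # 1010464 (epoch 1, batch size 26)
print(35715 * 28)  # 1000020 (epochs 2-5, batch size 28)
```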

### Training hyperparameters

The following hyperparameters were used for training the final epoch (refer to the *Training results* table below for the hyperparameters that varied across epochs):
- learning_rate: 5e-05
- train_batch_size: 28
- eval_batch_size: 8
- …
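
For orientation, these values map onto `transformers` `TrainingArguments` roughly as follows; the output directory is hypothetical, and hyperparameters elided above are left at their defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='distilbert-base-nepali',  # hypothetical output path
    learning_rate=5e-05,
    per_device_train_batch_size=28,
    per_device_eval_batch_size=8,
)
```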

### Training results

The model is trained for 5 epochs with varying hyperparameters:

| Training Loss | Epoch | MLM Probability | Train Batch Size | Step  | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
| 3.4477        | 1.0   | 15              | 26               | 38864 | 3.3067          | 27.2949    |
| 2.9451        | 2.0   | 15              | 28               | 35715 | 2.8238          | 16.8407    |
| 2.866         | 3.0   | 20              | 28               | 35715 | 2.7431          | 15.5351    |
| 2.7287        | 4.0   | 20              | 28               | 35715 | 2.6053          | 13.5353    |
| 2.6412        | 5.0   | 20              | 28               | 35715 | 2.5161          | 12.3802    |

The final model is evaluated with an MLM probability of 15%:

| Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:----:|:---------------:|:----------:|
| -             | -     | 15              | -                | -    | 2.3494          | 10.4791    |
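
The *MLM Probability* column is the token-masking rate applied by the data collator; with the standard `transformers` collator (an assumption about the exact training setup), it corresponds to:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
# 15% of tokens are randomly masked; epochs 3-5 above used 0.20 instead.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```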

### Framework versions