File size: 9,457 Bytes
2798d8a 0131f65 2798d8a 0ca8c2f 2798d8a e5bd71a 63703fc f4aec21 723fe4e 2798d8a 723fe4e 2798d8a 723fe4e 2798d8a 723fe4e 2798d8a 723fe4e 63703fc 723fe4e 2798d8a 723fe4e 2798d8a 723fe4e 2798d8a 0ca8c2f 2798d8a 723fe4e 2798d8a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 |
---
license: apache-2.0
tags:
- generated_from_trainer
datasets: Sakonii/nepalitext-language-model-dataset
mask_token: <mask>
widget:
- text: मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ।
परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित
छ।
example_title: Example 1
- text: अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही
वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले
संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।
example_title: Example 2
- text: जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट
गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।
example_title: Example 3
model-index:
- name: distilbert-base-nepali
results: []
---
# distilbert-base-nepali
This model is pre-trained on [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset consisting of over 13 million Nepali text sequences using a masked language modeling (MLM) objective. Our approach trains a Sentence Piece Model (SPM) for text tokenization similar to [XLM-ROBERTa](https://arxiv.org/abs/1911.02116) and trains [distilbert model](https://arxiv.org/abs/1910.01108) for language modeling. Find more details in [this paper](https://aclanthology.org/2022.sigul-1.14/).
It achieves the following results on the evaluation set:
mlm probability|evaluation loss|evaluation perplexity
--:|----:|-----:|
15%|2.349|10.479|
20%|2.605|13.351|
## Model description
Refer to original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)
## Intended uses & limitations
This backbone model intends to be fine-tuned on Nepali language focused downstream task such as sequence classification, token classification or question answering.
The language model being trained on a data with texts grouped to a block size of 512, it handles text sequence up to 512 tokens and may not perform satisfactorily on shorter sequences.
## Usage
This model can be used directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
>>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
[{'score': 0.04128897562623024,
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
'token': 2605,
'token_str': 'मौसम'},
{'score': 0.04100276157259941,
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
'token': 2792,
'token_str': 'प्रकृति'},
{'score': 0.026525357738137245,
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
'token': 387,
'token_str': 'पानी'},
{'score': 0.02340106852352619,
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
'token': 1313,
'token_str': 'जल'},
{'score': 0.02055591531097889,
'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
'token': 790,
'token_str': 'वातावरण'}]
```
Here is how we can use the model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
# prepare input
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')
# forward pass
output = model(**encoded_input)
```
## Training data
This model is trained on [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset which combines the datasets: [OSCAR](https://huggingface.co/datasets/oscar) , [cc100](https://huggingface.co/datasets/cc100) and a set of scraped Nepali articles on Wikipedia.
As for training the language model, the texts in the training set are grouped to a block of 512 tokens.
## Tokenization
A Sentence Piece Model (SPM) is trained on a subset of [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.
## Training procedure
The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
### Training hyperparameters
The following hyperparameters were used for training of the final epoch: [ Refer to the *Training results* table below for varying hyperparameters every epoch ]
- learning_rate: 5e-05
- train_batch_size: 28
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- mixed_precision_training: Native AMP
### Training results
The model is trained for 4 epochs with varying hyperparameters:
| Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
| 3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
| 2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
| 2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
| 2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
| 2.6412 | 5.0 | 20 | 28 | 35715 | 2.5161 | 12.3802 |
Final model evaluated with MLM Probability of 15%:
| Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
|:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
| - | - | 15 | - | - | 2.3494 | 10.4791 |
### Framework versions
- Transformers 4.16.2
- Pytorch 1.9.1
- Datasets 1.18.3
- Tokenizers 0.10.3
|