---
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-nepali
  results: []
widget:

- text: "नेपाल र भारतबीच"
  example_title: "Example 1"
- text: "प्रधानमन्त्री"
  example_title: "Example 2"
- text: "दस वर्ष लामो "
  example_title: "Example 3"
- text: "जापानमा आज "
  example_title: "Example 4"
- text: "नेपालका धेरैजसो चाडपर्वहरूमध्ये,"
  example_title: "Example 5"
  
---

# distilgpt2-nepali

This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a causal language modeling (CLM) objective. Our approach trains a SentencePiece Model (SPM) for text tokenization, similar to [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), and trains [distilgpt2](https://huggingface.co/distilgpt2) for language modeling.

It achieves the following results on the evaluation set:

| Training Loss | Validation Loss | Perplexity |
|:-------------:|:---------------:|:----------:|
| 3.3968        | 3.2705          | 26.3245    |
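
Perplexity here is simply the exponential of the validation loss, i.e. exp(3.2705) ≈ 26.32.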

## Model description

Refer to the original [distilgpt2](https://huggingface.co/distilgpt2) model card.

## Intended uses & limitations

This raw model can be used for Nepali text generation and is intended to be fine-tuned on Nepali-language-focused downstream tasks.
Because the language model was trained on texts grouped into blocks of 512 tokens, it handles sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
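
As a quick illustration of the 512-token limit, the tokenized length of a prompt can be checked before generation (a minimal sketch; the prompt is taken from the widget examples above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilgpt2-nepali')

prompt = "नेपालका धेरैजसो चाडपर्वहरूमध्ये,"
n_tokens = len(tokenizer(prompt)['input_ids'])

# The model was trained on blocks of 512 tokens, so the prompt plus the
# tokens to be generated should stay within that window.
assert n_tokens <= 512, "prompt exceeds the 512-token block size"
```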

## Usage

This model can be used directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> set_seed(42)
>>> generator = pipeline('text-generation', model='Sakonii/distilgpt2-nepali')
>>> generator("नेपालका धेरैजसो चाडपर्वहरूमध्ये,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[{'generated_text': 'नेपालका धेरैजसो चाडपर्वहरूमध्ये, तिहार र छठपर्व विशेष रूपमा मनाइने भएकाले नेपाली मौलिक पर्व पनि हो । हिन्दू धर्म र संस्कृतिक... काठमाडौं ।'},
 {'generated_text': 'नेपालका धेरैजसो चाडपर्वहरूमध्ये, तिहारको मुख्य दिन आज साँझ अस्ताउँदो सूर्यलाई अर्घ्य दिइएको छ । वैदिक विधि...विस्तृतमा पढ्नुस् काठमाडौं । नेपाल चिकित्सक संघका'},
 {'generated_text': 'नेपालका धेरैजसो चाडपर्वहरूमध्ये, चाडपर्व, विवाह,... नेपाली काँग्रेसका प्रवक्ता विश्वप्रकाश शर्माले पार्टीभित्र आन्तरिक झगडा हुने निश्चित भएको र गुटबन्दीका कारण चुनावमा हार बेहोर्नु'},
 {'generated_text': 'नेपालका धेरैजसो चाडपर्वहरूमध्ये, दशैं नेपालीहरूको मौलिक पर्वका रूपमा मनाउँछन् । नेपालीहरूको दोस्रो महान् पर्व तिहार हो । तिहारले दाजुभाइ तथा दिदीबहिनीहरूको बीचमा प्रगाढ सम्बन्ध स्थापित'},
 {'generated_text': 'नेपालका धेरैजसो चाडपर्वहरूमध्ये, माघे संक्रान्ति र माघे संक्रान्तिमा माघे संक्रान्तिमा मात्र नभएर फागुन महिनाभर नै विशेष महत्व रहने गरेको छ । काठमाडौं ।'}]
```

Here is how we can use the model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilgpt2-nepali')
model = AutoModelForCausalLM.from_pretrained('Sakonii/distilgpt2-nepali')

# prepare input
text = "चाहिएको text यता राख्नु होला।"
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
```
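
The `output` above holds next-token logits of shape `(batch, sequence_length, vocab_size)`. If hidden-state features are what is needed, they can be requested explicitly; a minimal sketch continuing the example above:

```python
import torch

# Ask the model to also return the hidden states of every layer
with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

last_hidden_state = output.hidden_states[-1]  # (batch, sequence_length, hidden_size)
```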

## Training data

This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines [OSCAR](https://huggingface.co/datasets/oscar), [cc100](https://huggingface.co/datasets/cc100), and a set of scraped Nepali articles from Wikipedia.
For training the language model, the texts are tokenized with a SentencePiece Model (SPM) using a vocabulary size of 24,576, and the tokenized texts are grouped into blocks of 512 tokens, as sketched below.
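
The exact preprocessing script is not included in this card, but the standard Hugging Face `datasets` pattern for tokenizing and grouping text into 512-token blocks looks roughly like this (the `text` column name is an assumption):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 512
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilgpt2-nepali')
raw_datasets = load_dataset('Sakonii/nepalitext-language-model-dataset')

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized texts, then split them into blocks of `block_size` tokens.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result['labels'] = result['input_ids'].copy()
    return result

lm_datasets = (raw_datasets
               .map(tokenize, batched=True, remove_columns=['text'])
               .map(group_texts, batched=True))
```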

## Training procedure

The model is trained with the same configuration as the original [distilgpt2](https://huggingface.co/distilgpt2), but with 512 tokens per instance, 12 instances per batch, and around 188.8K training steps.


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 12
- eval_batch_size: 12
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
- mixed_precision_training: Native AMP
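
Expressed as Hugging Face `TrainingArguments`, these settings correspond roughly to the sketch below (the output directory is a placeholder; the Adam betas and epsilon listed above are the `Trainer` defaults):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='distilgpt2-nepali',   # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    seed=42,
    lr_scheduler_type='linear',
    num_train_epochs=5,
    fp16=True,                        # Native AMP mixed precision
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
)
```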

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Perplexity |
|:-------------:|:-----:|:------:|:---------------:|:----------:|
| 3.7645        | 1.0   | 94395  | 3.6291          | 37.6789    |
| 3.5857        | 2.0   | 188790 | 3.4442          | 31.3182    |
| 3.505         | 3.0   | 283185 | 3.3749          | 29.2214    |
| 3.4688        | 4.0   | 377580 | 3.3439          | 28.3294    |
| 3.3968        | 5.0   | 471975 | 3.2705          | 26.3245    |


### Framework versions

- Transformers 4.17.0
- Pytorch 1.9.1
- Datasets 2.0.0
- Tokenizers 0.11.6