---
license: mit
datasets:
- Shushant/nepali
language:
- ne
metrics:
- perplexity
library_name: transformers
pipeline_tag: fill-mask
---
# NEPALI BERT
## Masked language model for Nepali, trained on Nepali news scraped from different Nepali news websites. The dataset contains about 10 million Nepali sentences, mainly related to news.

This model is a fine-tuned version of [BERT Base Uncased](https://huggingface.co/bert-base-uncased) on a dataset composed of news scraped from Nepali news portals, comprising 4.6 GB of textual data.
It achieves the following results on the evaluation set:
- Loss: 1.0495

## Model description

Pretraining was done on the BERT Base architecture.

## Intended uses & limitations
This transformer model can be used for NLP tasks on Devanagari-script Nepali text. At the time of training, it was the state-of-the-art model for Devanagari data. Intrinsic evaluation achieved a state-of-the-art perplexity of 8.56, while extrinsic evaluation on sentiment analysis of Nepali tweets outperformed other existing masked language models for Nepali.
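
As a concrete example of the extrinsic use mentioned above, the checkpoint can serve as the encoder for a downstream Nepali sentiment classifier. The snippet below is only a minimal sketch, assuming a tiny hypothetical two-class dataset; the tweet corpus, label scheme, and hyperparameters used in the paper are not reproduced here.

```python
# Minimal sketch: fine-tuning NepaliBERT for sentiment classification.
# The two example sentences and labels are hypothetical placeholders,
# not the Nepali tweet dataset used in the paper.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "Shushant/nepaliBERT", num_labels=2  # assumed positive / negative labels
)

# Placeholder data; replace with a real labelled Nepali sentiment dataset.
raw = Dataset.from_dict({
    "text": ["यो फिल्म एकदम राम्रो छ", "सेवा धेरै खराब थियो"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_dataset = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nepali-sentiment", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
```

The classification head is initialized from scratch on top of the pretrained encoder, so a reasonable amount of labelled data is needed for it to converge.
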
## Training and evaluation data
The training corpus was developed from 85,467 news articles scraped from different Nepali news portals. This is a preliminary dataset
for the experimentation. The corpus size is about 4.3 GB of textual data. Similarly, the evaluation data contains a few news articles, about 12 MB of textual data.

## Training procedure
For pretraining the masked language model, the Trainer API from Hugging Face was used. Pretraining took about 3 days, 8 hours, and 57 minutes on a Tesla V100 GPU.
With 640 Tensor Cores, the Tesla V100 was the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance. The GPU was provided by the Kathmandu University (KU) supercomputer.
Thanks to the KU administration.
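
The outline below is a minimal sketch of how such a masked-language-model pretraining run can be set up with the Trainer API, starting from the `bert-base-uncased` checkpoint this card describes. The corpus path, sequence length, batch size, and number of epochs are illustrative assumptions, not the exact configuration used for this model.

```python
# Minimal sketch of masked-language-model pretraining with the Trainer API.
# "nepali_corpus.txt" is an assumed one-sentence-per-line text file, not the original corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "nepali_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="nepaliBERT",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```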

## Usage
```python
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")

# Build a fill-mask pipeline on top of the pretrained masked language model.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict the masked token in a Nepali sentence.
pprint(fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}."))
```
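
Each call to `fill_mask` returns the top candidate tokens for the masked position, together with their scores and the completed sentence.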

## Data Description
The model was trained on about 4.6 GB of Nepali text collected from various sources, including Nepali news sites and the OSCAR Nepali corpus.


## Paper and Citation Details
If you are interested in the implementation details of this language model, you can read the full paper here:
https://www.researchgate.net/publication/375019515_NepaliBERT_Pre-training_of_Masked_Language_Model_in_Nepali_Corpus 

## Plain Text
S. Pudasaini, S. Shakya, A. Tamang, S. Adhikari, S. Thapa and S. Lamichhane, "NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus," 2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal, 2023, pp. 325-330, doi: 10.1109/I-SMAC58438.2023.10290690.

## BibTeX

```bibtex
@INPROCEEDINGS{10290690,
  author={Pudasaini, Shushanta and Shakya, Subarna and Tamang, Aakash and Adhikari, Sajjan and Thapa, Sunil and Lamichhane, Sagar},
  booktitle={2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)},
  title={NepaliBERT: Pre-training of Masked Language Model in Nepali Corpus},
  year={2023},
  volume={},
  number={},
  pages={325-330},
  doi={10.1109/I-SMAC58438.2023.10290690}}
```