Model Card for Deepakvictor/tamil_bs_bert
BERT base model
Pretrained model on the Tamil language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository.
Model description
BERT is a transformers model originally pretrained on a large corpus of English data in a self-supervised fashion. This model is trained in the same way on Tamil text, with the objective of predicting masked words ([MASK]). Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
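The masking step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code used to train this model: the function name `mask_tokens` and the placeholder tokens are hypothetical, and every selected token is replaced by [MASK], whereas the original BERT recipe replaces some selected tokens with a random token or leaves them unchanged. The card does not say which variant was used here.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol used in this card's examples


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace about mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


# Placeholder tokens stand in for a tokenized Tamil sentence.
tokens = [f"tok{i}" for i in range(20)]
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

With 20 input tokens and a 15% rate, three positions are masked, and `labels` holds the original tokens that serve as prediction targets.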
Training of this model
This model was trained on the dataset AnanthZeke/tamil_sentences_master_raw. The first 10.6M sentences were used for training, with a batch size of 64. The model reached a loss of 0.687 over training and a loss of 0.80 in evaluation. The evaluation set is the last 120000 rows of the same dataset.
Model variations
BERT was originally released in base and large variations, for cased and uncased input text. Casing is not an issue for this model, since the Tamil script has no letter case. This model is a base model with 110M parameters.
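As a rough sanity check on the 110M figure, the encoder's parameter count can be estimated from its dimensions. The sketch below assumes the standard BERT-base layout (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types), which this card does not state explicitly, and uses the tokenizer vocabulary size of 29677 reported under Preprocessing. The helper name is hypothetical, and the masked-language-modeling head is not counted.

```python
def bert_encoder_param_count(vocab_size, hidden=768, layers=12,
                             intermediate=3072, max_positions=512,
                             type_vocab=2):
    """Estimate the parameter count of a BERT encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed vocabulary size, taken from the Preprocessing section of this card.
print(f"{bert_encoder_param_count(29677):,}")
```

Under these assumptions the estimate is about 109M, in line with the roughly 110M quoted for BERT base.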
Model | #params | Language |
---|---|---|
bert-base-uncased | 110M | Tamil |
Intended uses & limitations
You can use this raw model for masked language modeling, or fine-tune it on a downstream task. Because the model does not use WordPiece tokenization but a different subword tokenization, there is a higher chance that the predicted masked token is a subword rather than a whole word.
How to use
from transformers import pipeline
unmasker = pipeline('fill-mask', model='Deepakvictor/tamil_bs_bert')
unmasker("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்")
[{'score': 0.14111991226673126,
'token': 12540,
'token_str': 'மொழியை',
'sequence': 'தமிழ் மொழியை வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
{'score': 0.0806930884718895,
'token': 2461,
'token_str': 'மக்களுக்கு',
'sequence': 'தமிழ் மக்களுக்கு வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
{'score': 0.016404788941144943,
'token': 3461,
'token_str': 'எழுத',
'sequence': 'தமிழ் எழுத வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
{'score': 0.015853099524974823,
'token': 5849,
'token_str': 'எழுதி',
'sequence': 'தமிழ் எழுதி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'},
{'score': 0.015091801062226295,
'token': 1107,
'token_str': 'எப்படி',
'sequence': 'தமிழ் எப்படி வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்'}]
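The pipeline returns a plain list of dictionaries, so it can be post-processed with ordinary Python. The sketch below copies the scores and tokens from the output above (the 'sequence' field is omitted for brevity) and keeps only candidates above a score threshold. The helper name `confident_tokens` and the 0.05 threshold are illustrative choices, not part of the library.

```python
# Scores and tokens copied from the pipeline output above ('sequence' omitted).
predictions = [
    {"score": 0.14111991226673126, "token": 12540, "token_str": "மொழியை"},
    {"score": 0.0806930884718895, "token": 2461, "token_str": "மக்களுக்கு"},
    {"score": 0.016404788941144943, "token": 3461, "token_str": "எழுத"},
    {"score": 0.015853099524974823, "token": 5849, "token_str": "எழுதி"},
    {"score": 0.015091801062226295, "token": 1107, "token_str": "எப்படி"},
]


def confident_tokens(preds, min_score=0.05):
    """Return (token_str, score) pairs whose score is at least min_score."""
    return [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]


for token, score in confident_tokens(predictions):
    print(f"{token}\t{score:.3f}")

# Probability mass covered by the five candidates shown.
print(f"top-5 mass: {sum(p['score'] for p in predictions):.3f}")
```

Here only the first two candidates pass the threshold, and the five candidates together cover roughly 27% of the probability mass, so the model is far from certain about this mask.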
To use the model in PyTorch
# Load the model and tokenizer
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Deepakvictor/tamil_bs_bert")
model = AutoModelForMaskedLM.from_pretrained("Deepakvictor/tamil_bs_bert")
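# Tokenize the input
inp = tokenizer("தமிழ் [MASK] வாழ்த்துவதற்கு உதவும் பொதுவான மற்றும் அடிப்படையான தமிழ் சொற்றொடர்களை ஒவ்வொரு பயணியும் கண்டிப்பாக அறிந்திருக்க வேண்டும்", return_tensors="pt")
# Forward pass; out.logits has shape (batch, sequence_length, vocab_size)
out = model(**inp)
# Decode the most likely token at every position.
# argmax over the vocabulary gives the same ids with or without softmax, so softmax is omitted.
tokenizer.decode(out.logits.argmax(-1).view(-1).tolist(), skip_special_tokens=True)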
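The decode call above returns the model's best guess at every position. To inspect only the masked position, one option is to take the logits at the [MASK] index and rank them. The sketch below does this on plain Python lists with a toy vocabulary so that it runs without downloading the model. The function name `top_k_at_mask` is hypothetical.

```python
import math


def top_k_at_mask(input_ids, logits, mask_token_id, k=5):
    """Return the k most probable (token_id, probability) pairs at the first mask.

    input_ids: list of token ids for one sentence.
    logits:    list of per-position score lists, shape (seq_len, vocab_size).
    """
    mask_pos = input_ids.index(mask_token_id)  # raises ValueError if no mask
    scores = logits[mask_pos]
    # Numerically stable softmax over the vocabulary at the mask position.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy example: 3 positions, a vocabulary of 4 ids, with id 3 standing in for [MASK].
toy_ids = [0, 3, 1]
toy_logits = [[0.0, 0.0, 0.0, 0.0],
              [2.0, 1.0, 0.0, -1.0],
              [0.0, 0.0, 0.0, 0.0]]
print(top_k_at_mask(toy_ids, toy_logits, mask_token_id=3, k=2))
```

With the objects from the snippet above, the same function could be called with `inp["input_ids"][0].tolist()`, `out.logits[0].detach().tolist()` and `tokenizer.mask_token_id`, and the returned ids mapped back to strings with `tokenizer.convert_ids_to_tokens`.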
Limitations and bias
As mentioned, the model may predict a subword for the masked token. Because the model is trained in a self-supervised fashion on raw text, its predictions may reflect biases present in the training data.
Training data
This BERT model was pretrained on the tamil-sentence dataset.
Training procedure
Preprocessing
A tokenizer was trained on the same dataset (tamil-sentence) with a vocabulary size of 29677. The details of the masking procedure for each sentence are the following: 15% of the tokens are masked.
Pretraining
The model was trained on a P100 GPU on approximately ten million sentences with a batch size of 64. The optimizer used is AdamW with a learning rate of 1e-5.
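To give a sense of scale, the number of optimizer steps in one pass over the training data follows directly from the dataset size and the batch size. The sketch below uses the 10.6M training sentences quoted in the "Training of this model" section and assumes one optimizer step per batch (no gradient accumulation). The card does not state the number of epochs, so only a single pass is computed.

```python
import math


def steps_per_epoch(num_sentences, batch_size):
    """Number of optimizer steps needed to see every sentence once."""
    return math.ceil(num_sentences / batch_size)


# 10.6M training sentences and batch size 64, as reported in this card.
print(steps_per_epoch(10_600_000, 64))  # 165625
```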
Evaluation results
This BERT base model reaches an evaluation loss of 0.8 on 120,200 sentences.
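A loss value is often easier to interpret as a perplexity. The sketch below assumes that the reported losses are mean cross-entropy values in nats over the masked tokens, which is a common convention but is not confirmed by this card. Under that assumption, perplexity is simply the exponential of the loss.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Loss values as reported in this card.
for name, loss in [("training", 0.687), ("evaluation", 0.8)]:
    print(f"{name}: loss={loss} -> perplexity={perplexity(loss):.2f}")
```

If the assumption holds, an evaluation loss of 0.8 corresponds to a perplexity of about 2.2 on the masked tokens.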
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-1810-04805,
author = {Jacob Devlin and
Ming{-}Wei Chang and
Kenton Lee and
Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
Understanding},
journal = {CoRR},
volume = {abs/1810.04805},
year = {2018},
url = {http://arxiv.org/abs/1810.04805},
archivePrefix = {arXiv},
eprint = {1810.04805},
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}