---
language: dv
---

# dv-muril

This is an experiment in transfer learning: inserting Dhivehi word and word-piece tokens into Google's MuRIL model.

This BERT-based model currently outperforms the dv-wave ELECTRA model on the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets).
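
As a rough illustration of applying the model to that task with the Hugging Face `transformers` library, a minimal sketch follows; the hub model ID and the number of labels are assumptions for illustration, not details taken from this README.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The hub ID and num_labels are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "monsoon-nlp/dv-muril"  # assumed hub ID; adjust to the published model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=8,  # assumed number of news categories
)

# Classify one (placeholder) Dhivehi headline.
inputs = tokenizer("ދިވެހި ޚަބަރު", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
```

Note that the classification head loaded this way is randomly initialized; it would need fine-tuning on the labeled news data before the scores below could be reproduced.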

## Training

- Start with MuRIL (similar to mBERT), which has no Thaana vocabulary
- Using PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Malayalam or English equivalents (see the first sketch below)
- Add the remaining words and word-pieces from a BertWordPieceTokenizer vocab.txt
- Continue BERT pretraining (see the second sketch below)
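
The vocabulary-insertion step might look roughly like the following minimal sketch, assuming the Hugging Face `transformers` library and MuRIL's public `google/muril-base-cased` checkpoint; the word pair shown is an illustrative placeholder, not an entry from the actual PanLex dictionaries.

```python
# A minimal sketch of attaching a new Dhivehi token to an existing embedding.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForMaskedLM.from_pretrained("google/muril-base-cased")

# Hypothetical PanLex-style pair: Dhivehi word -> anchor word already
# representable in MuRIL's vocabulary (Malayalam or English).
aligned_pairs = {"ރާއްޖެ": "Maldives"}  # illustrative only

tokenizer.add_tokens(list(aligned_pairs))
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for dv_word, anchor in aligned_pairs.items():
        new_id = tokenizer.convert_tokens_to_ids(dv_word)
        # Seed the new row with the mean of the anchor's subword embeddings.
        anchor_ids = tokenizer(anchor, add_special_tokens=False)["input_ids"]
        embeddings[new_id] = embeddings[anchor_ids].mean(dim=0)
```

Seeding each inserted token from an aligned embedding should give continued pretraining a better starting point than random initialization.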
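
The final step, continued BERT pretraining, could then be run as masked-language-model training over a Dhivehi corpus. The sketch below continues from the `tokenizer` and `model` variables above; `dv_corpus.txt` is a hypothetical plain-text corpus file (one document per line), not a file from this repo.

```python
# Continued masked-LM pretraining over the extended vocabulary.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("text", data_files="dv_corpus.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,  # the resized model from the previous sketch
    args=TrainingArguments(output_dir="dv-muril-pretrain", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer),
)
trainer.train()
```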

## Performance

On the Maldivian News Classification task:

- mBERT: 52%
- dv-wave (ELECTRA, 30k vocab): 89%
- dv-muril (10k vocab), before the BERT pretraining step: 89.8%
- previous dv-muril (30k vocab): 90.7%
- dv-muril (10k vocab): 91.6%

Colab notebook: https://colab.research.google.com/drive/113o6vkLZRkm6OwhTHrvE0x6QPpavj0fn