dv-muril

This is an experiment in transfer learning: inserting Dhivehi word and word-piece tokens into Google's MuRIL model.

This BERT-based model currently outperforms the dv-wave ELECTRA model on the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets).

Training

  • Start with MuRIL (similar to mBERT), which has no Thaana vocabulary
  • Using PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Malayalam or English translations (first sketch below)
  • Add the remaining words and word-pieces from a BertWordPieceTokenizer vocab.txt (second sketch)
  • Continue BERT pretraining on Dhivehi text (third sketch)
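
A minimal sketch of the dictionary-attachment step, using the Hugging Face transformers library. The dhivehi_to_pivot mapping, the model id, and the mean-pooling initialization are illustrative assumptions, not necessarily the exact procedure used here:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForMaskedLM.from_pretrained("google/muril-base-cased")

# Hypothetical PanLex-derived mapping: Dhivehi word -> English/Malayalam pivot
dhivehi_to_pivot = {"ފޮތް": "book", "ރަށް": "island"}

# Register the new Dhivehi tokens and grow the embedding matrix to match
tokenizer.add_tokens(list(dhivehi_to_pivot))
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for dv_word, pivot in dhivehi_to_pivot.items():
        dv_id = tokenizer.convert_tokens_to_ids(dv_word)
        # Initialize the new token from the mean of the pivot word's pieces
        pivot_ids = tokenizer(pivot, add_special_tokens=False)["input_ids"]
        emb[dv_id] = emb[pivot_ids].mean(dim=0)
```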
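
The remaining vocabulary comes from a word-piece tokenizer trained on Dhivehi text. A sketch with the tokenizers library: the corpus file name is a placeholder, the 10k size matches the figures below, and filtering against MuRIL's existing vocabulary is an assumed simplification:

```python
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

# Train a 10k-piece Dhivehi vocabulary (corpus path is illustrative)
wp = BertWordPieceTokenizer(lowercase=False)
wp.train(files=["dv_corpus.txt"], vocab_size=10000)
wp.save_model(".")  # writes ./vocab.txt

# Keep only the pieces MuRIL does not already map to a known id
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
with open("vocab.txt", encoding="utf-8") as f:
    pieces = [line.strip() for line in f if line.strip()]
new_pieces = [p for p in pieces
              if tokenizer.convert_tokens_to_ids(p) == tokenizer.unk_token_id]
# Continuation pieces ("##...") are added as plain tokens in this sketch
tokenizer.add_tokens(new_pieces)
```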
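
Tokens added this way start from dictionary-pivot or freshly initialized vectors, so the final step is continued masked-language-model pretraining on Dhivehi text. A sketch using the transformers Trainer, with placeholder paths and hyperparameters:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumes the extended tokenizer and model were saved to ./dv-muril
tokenizer = AutoTokenizer.from_pretrained("./dv-muril")
model = AutoModelForMaskedLM.from_pretrained("./dv-muril")

ds = load_dataset("text", data_files={"train": "dv_corpus.txt"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dv-muril-mlm",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```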

Performance

Accuracy on the Maldivian news classification task:

  • mBERT: 52%
  • dv-wave (ELECTRA, 30k vocab): 89%
  • dv-muril (10k vocab) before the BERT pretraining step: 89.8%
  • a previous dv-muril (30k vocab): 90.7%
  • dv-muril (10k vocab): 91.6%
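
These scores come from fine-tuning each model on the news dataset. A hedged sketch of such a fine-tune; the file names, column names, and label count are assumptions about the dataset layout:

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumes CSV splits with "text" and "label" columns and 8 classes
ds = load_dataset("csv", data_files={"train": "dv_news_train.csv",
                                     "test": "dv_news_test.csv"})
tokenizer = AutoTokenizer.from_pretrained("./dv-muril")
model = AutoModelForSequenceClassification.from_pretrained("./dv-muril",
                                                           num_labels=8)

ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="dv-muril-news",
                                         per_device_train_batch_size=16,
                                         num_train_epochs=3),
                  train_dataset=ds["train"],
                  eval_dataset=ds["test"],
                  compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate())
```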

Colab notebook: https://colab.research.google.com/drive/113o6vkLZRkm6OwhTHrvE0x6QPpavj0fn
