---
language: dv
---

# dv-muril

This is a transfer-learning experiment that inserts Dhivehi word and
word-piece tokens into Google's MuRIL model.

This BERT-based model currently performs better than dv-wave (ELECTRA) on
the Maldivian News Classification task: https://github.com/Sofwath/DhivehiDatasets

## Training

- Start with MuRIL (similar to mBERT), which has no Thaana vocabulary
- Using PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of matching Malayalam or English words
- Add the remaining words and word-pieces from the BertWordPieceTokenizer vocabulary (vocab.txt)
- Continue BERT pretraining
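
The vocabulary-extension steps above can be sketched in plain Python. This is a minimal illustration, not the code from the notebook: the function name `extend_embeddings`, the toy vocabulary, the two example word pairs, and the small random initialization for tokens without a dictionary match are all assumptions made here for clarity.

```python
import random


def extend_embeddings(vocab, embeddings, donors, new_tokens, seed=0, scale=0.02):
    """Return extended copies of (vocab, embeddings).

    vocab:      dict mapping token -> row index
    embeddings: list of equal-length float lists, one row per token
    donors:     dict mapping a new token -> existing token whose vector it copies
    new_tokens: ordered list of every token to add
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    vocab = dict(vocab)                              # do not mutate the inputs
    embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token in vocab:
            continue                                 # already present, keep its vector
        donor = donors.get(token)
        if donor in vocab:
            # Dictionary match: start from the donor word's vector.
            row = list(embeddings[vocab[donor]])
        else:
            # No match: small random init (an assumption of this sketch).
            row = [rng.gauss(0.0, scale) for _ in range(dim)]
        vocab[token] = len(embeddings)
        embeddings.append(row)
    return vocab, embeddings


# Toy data: a 3-token vocabulary with 2-dimensional embeddings.
vocab = {"[PAD]": 0, "water": 1, "book": 2}
embeddings = [[0.0, 0.0], [0.5, -0.1], [0.2, 0.9]]
# Illustrative Dhivehi -> English pairs (not taken from PanLex).
donors = {"ފެން": "water", "ފޮތް": "book"}
# The last entry stands in for a word-piece with no dictionary match.
new_tokens = ["ފެން", "ފޮތް", "##ން"]

new_vocab, new_embeddings = extend_embeddings(vocab, embeddings, donors, new_tokens)
```

In a real model the same idea would apply to the input embedding matrix after resizing it for the new vocabulary, and the continued pretraining step then adjusts both the copied and the randomly initialized rows.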

## Performance

- mBERT: 52%
- dv-wave (ELECTRA, 30k vocab): 89%
- dv-muril (10k vocab), before the continued BERT pretraining step: 89.8%
- previous dv-muril (30k vocab): 90.7%
- dv-muril (10k vocab): 91.6%

Colab notebook: https://colab.research.google.com/drive/113o6vkLZRkm6OwhTHrvE0x6QPpavj0fn