mBERT Swedish distilled base model (cased)

This model is a distilled version of mBERT. It was distilled on Swedish data: the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my Master's thesis: Task-agnostic knowledge distillation of mBERT to Swedish.

Model description

This is a 6-layer version of mBERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.

Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
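As a minimal sketch of using the raw model for masked language modeling, the snippet below wraps the Hugging Face `transformers` fill-mask pipeline. It assumes the `transformers` library (with a backend such as PyTorch) is installed and that the model is published on the Hub under the repository name `Addedk/mbert-swedish-distilled-cased`. The helper name `top_predictions` and the example sentence are illustrative, not part of the original model card.

```python
# Repository name on the Hugging Face Hub (assumed from the model page).
MODEL_ID = "Addedk/mbert-swedish-distilled-cased"


def top_predictions(text, top_k=5):
    """Return (token, score) pairs for the [MASK] position in `text`.

    Hypothetical helper: the import is done lazily so that defining the
    function does not require the library or a network connection.
    """
    from transformers import pipeline

    # Downloads the model and tokenizer on first use.
    unmasker = pipeline("fill-mask", model=MODEL_ID)
    results = unmasker(text, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `top_predictions("Huvudstaden i Sverige är [MASK].")` should return the most likely fillers for the masked token together with their scores. For a downstream task, the same repository name can be passed to a task-specific class such as `AutoModelForTokenClassification.from_pretrained`, which adds a randomly initialized head that then needs fine-tuning.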

Training data

The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 9 GB.

Evaluation results

When evaluated on the SUCX 3.0 dataset, the model achieved an average F1 score of 0.859, which is competitive with mBERT's score of 0.866.

When evaluated on the English WikiANN dataset, the model achieved an average F1 score of 0.826, which is competitive with mBERT's score of 0.849.
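To put "competitive" in concrete terms, the following sketch computes the absolute F1 gap and the fraction of mBERT's score that the distilled model retains on each dataset. The numbers are the ones reported above. The `retention` helper and the dictionary layout are illustrative only.

```python
# Average F1 scores as reported in this model card.
SCORES = {
    "SUCX 3.0": {"distilled": 0.859, "mBERT": 0.866},
    "English WikiANN": {"distilled": 0.826, "mBERT": 0.849},
}


def retention(distilled, teacher):
    """Fraction of the teacher's score kept by the distilled model."""
    return distilled / teacher


for dataset, s in SCORES.items():
    gap = s["mBERT"] - s["distilled"]
    kept = retention(s["distilled"], s["mBERT"])
    print(f"{dataset}: gap {gap:.3f} F1, retains {kept:.1%} of mBERT")
# SUCX 3.0: gap 0.007 F1, retains 99.2% of mBERT
# English WikiANN: gap 0.023 F1, retains 97.3% of mBERT
```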

Additional results and comparisons are presented in my Master's thesis.
