MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model from the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".

For details on preprocessing, training, and the model itself, see the original MiniLM repository.

Please note: This checkpoint uses BertModel with XLMRobertaTokenizer, so AutoTokenizer won't work with it!
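As a minimal sketch of what the note above implies, the checkpoint can be loaded by naming both classes explicitly instead of relying on the Auto classes. The helper names `load_multilingual_minilm` and `embed` are hypothetical and only for illustration; the sketch assumes `transformers`, `torch` and `sentencepiece` are installed and that the Hub is reachable.

```python
MODEL_ID = "microsoft/Multilingual-MiniLM-L12-H384"


def load_multilingual_minilm(model_id: str = MODEL_ID):
    """Return (tokenizer, model) using the explicit classes named in the note."""
    # Imported here so that nothing is downloaded until the function is called.
    from transformers import BertModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained(model_id)  # XLM-R tokenizer
    model = BertModel.from_pretrained(model_id)                # BERT architecture
    return tokenizer, model


def embed(sentences, tokenizer, model):
    """Encode a batch of sentences and return the last hidden states."""
    import torch

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Shape: (batch_size, sequence_length, hidden_size)
    return outputs.last_hidden_state
```

A call such as `embed(["Hello world", "Bonjour le monde"], *load_multilingual_minilm())` would then return one hidden-state tensor for the batch.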

Multilingual Pretrained Model

Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
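The two parameter counts above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is not taken from the MiniLM repository; it assumes a BERT-style layer with a feed-forward width of 4x the hidden size, counts only token embeddings, and uses an approximate XLM-R vocabulary size of 250,002 entries.

```python
HIDDEN = 384
LAYERS = 12
INTERMEDIATE = 4 * HIDDEN  # assumption: BERT-style 4x feed-forward width
VOCAB_SIZE = 250_002       # assumption: approximate XLM-R vocabulary size


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def transformer_layer_params(hidden: int, intermediate: int) -> int:
    attention = 4 * linear(hidden, hidden)  # query, key, value, output projections
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden)
    norms = 2 * layer_norm(hidden)          # one after attention, one after feed-forward
    return attention + feed_forward + norms


transformer_params = LAYERS * transformer_layer_params(HIDDEN, INTERMEDIATE)
embedding_params = VOCAB_SIZE * HIDDEN      # token embeddings only

print(f"Transformer: {transformer_params / 1e6:.1f}M")
print(f"Embeddings:  {embedding_params / 1e6:.1f}M")
```

Under these assumptions the totals come out at about 21.3M and 96.0M, which is consistent with the 21M and 96M figures quoted above. Most of the model's size therefore sits in the multilingual vocabulary rather than in the Transformer layers.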

Multilingual MiniLM uses the same tokenizer as XLM-R, but its Transformer architecture is the same as BERT's. We provide fine-tuning code for XNLI based on huggingface/transformers. To fine-tune multilingual MiniLM, replace run_xnli.py in transformers with our version.

We evaluate the multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).

Cross-Lingual Natural Language Inference - XNLI

We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all the languages.
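The selection rule above can be sketched as follows. This is an illustration, not the authors' script: the checkpoint names and accuracies are made-up placeholders, and "joint dev set" is interpreted here as the unweighted mean of per-language dev accuracies, which matches pooled accuracy only if every language has the same number of dev examples.

```python
from statistics import mean

# Hypothetical dev accuracies, for illustration only (not results from the paper).
dev_accuracy = {
    "checkpoint-1500": {"en": 78.0, "fr": 71.0, "sw": 60.0},
    "checkpoint-3000": {"en": 80.0, "fr": 73.0, "sw": 61.0},
    "checkpoint-4500": {"en": 81.0, "fr": 72.0, "sw": 59.0},
}


def select_best_checkpoint(scores: dict) -> str:
    """Pick the single checkpoint with the highest mean accuracy over all languages."""
    return max(scores, key=lambda name: mean(scores[name].values()))


best = select_best_checkpoint(dev_accuracy)
print(best)
```

In this toy data the last checkpoint is the strongest on English alone, but the middle one wins on the joint score, so one model is reported for every language instead of a separate best checkpoint per language.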

Model	#Layers	#Hidden	#Transformer Parameters	Average	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
mBERT	12	768	85M	66.3	82.1	73.8	74.3	71.1	66.4	68.9	69.0	61.6	64.9	69.5	55.8	69.3	60.0	50.4	58.0
XLM-100	16	1280	315M	70.7	83.2	76.7	77.7	74.0	72.7	74.1	72.7	68.7	68.6	72.9	68.9	72.5	65.6	58.2	62.4
XLM-R Base	12	768	85M	74.5	84.6	78.4	78.9	76.8	75.9	77.3	75.4	73.2	71.5	75.4	72.5	74.9	71.1	65.2	66.5
mMiniLM-L12xH384	12	384	21M	71.1	81.5	74.8	75.7	72.9	73.0	74.5	71.3	69.7	68.8	72.1	67.8	70.0	66.2	63.3	64.2
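The Average column appears to be the unweighted mean over the 15 languages. A quick check on the mMiniLM row, assuming that definition:

```python
# Per-language XNLI accuracies for mMiniLM-L12xH384, copied from the table above.
mminilm_xnli = {
    "en": 81.5, "fr": 74.8, "es": 75.7, "de": 72.9, "el": 73.0,
    "bg": 74.5, "ru": 71.3, "tr": 69.7, "ar": 68.8, "vi": 72.1,
    "th": 67.8, "zh": 70.0, "hi": 66.2, "sw": 63.3, "ur": 64.2,
}

# Assumption: "Average" is the plain mean across languages.
average = sum(mminilm_xnli.values()) / len(mminilm_xnli)
print(round(average, 1))
```

The rounded result matches the 71.1 reported in the table.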

This example code fine-tunes the 12-layer multilingual MiniLM on XNLI.

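# Run fine-tuning on XNLI (requires the replacement run_xnli.py described above).
# Replace the {...} placeholders with real paths before running.
DATA_DIR=/{path_of_data}/                 # XNLI data
OUTPUT_DIR=/{path_of_fine-tuned_model}/   # where checkpoints are written
MODEL_PATH=/{path_of_pre-trained_model}/  # directory holding the MiniLM config file

python ./examples/run_xnli.py --model_type minilm \
 --output_dir "${OUTPUT_DIR}" --data_dir "${DATA_DIR}" \
 --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
 --tokenizer_name xlm-roberta-base \
 --config_name "${MODEL_PATH}/multilingual-minilm-l12-h384-config.json" \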
 --do_train \
 --do_eval \
 --max_seq_length 128 \
 --per_gpu_train_batch_size 128 \
 --learning_rate 5e-5 \
 --num_train_epochs 5 \
 --per_gpu_eval_batch_size 32 \
 --weight_decay 0.001 \
 --warmup_steps 500 \
 --save_steps 1500 \
 --logging_steps 1500 \
 --eval_all_checkpoints \
 --language en \
 --fp16 \
 --fp16_opt_level O2

Cross-Lingual Question Answering - MLQA

Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping. The table below reports F1 scores.

Model	#Layers	#Hidden	#Transformer Parameters	Average	en	es	de	ar	hi	vi	zh
mBERT	12	768	85M	57.7	77.7	64.3	57.9	45.7	43.8	57.1	57.5
XLM-15	12	1024	151M	61.6	74.9	68.0	62.2	54.8	48.8	61.4	61.1
XLM-R Base (Reported)	12	768	85M	62.9	77.8	67.2	60.8	53.0	57.9	63.1	60.2
XLM-R Base (Our fine-tuned)	12	768	85M	64.9	80.3	67.0	62.7	55.0	60.4	66.5	62.3
mMiniLM-L12xH384	12	384	21M	63.2	79.4	66.1	61.2	54.9	58.5	63.1	59.0

Citation

If you find MiniLM useful in your research, please cite the following paper:

@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
