Paper

SEAD: SIMPLE ENSEMBLE AND KNOWLEDGE DISTILLATION FRAMEWORK FOR NATURAL LANGUAGE UNDERSTANDING

Authors: Moyan Mei, Rohit Sroch

Abstract

With the widespread use of pre-trained language models (PLMs), there has been increased research on how to make them applicable, especially in limited-resource or low-latency, high-throughput scenarios. One of the dominant approaches is knowledge distillation (KD), where a smaller model is trained by receiving guidance from a large PLM. While there are many successful designs for learning knowledge from teachers, it remains unclear how students can learn better. Inspired by real university teaching processes, in this work we further explore knowledge distillation and propose a very simple yet effective framework, SEAD, to further improve task-specific generalization by utilizing multiple teachers. Our experiments show that SEAD leads to better performance compared to other popular KD methods [1] [2] [3] and achieves comparable or superior performance to its teacher model, such as BERT [4], on a total of 13 tasks from the GLUE [5] and SuperGLUE [6] benchmarks.

Moyan Mei and Rohit Sroch. 2022. SEAD: Simple ensemble and knowledge distillation framework for natural language understanding. Lattice, THE MACHINE LEARNING JOURNAL by Association of Data Scientists, 3(1).
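The abstract describes distilling a student from multiple teachers but this card does not spell out the loss. As a rough illustration only, the sketch below shows a generic temperature-scaled distillation loss in which the soft targets are the average of several teachers' distributions. It is not the SEAD objective from the paper; the function names, `alpha`, and `temperature` are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature=2.0):
    """Average the softened class distributions of several teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * T^2 * KL(teachers || student) + (1 - alpha) * cross-entropy.

    Hypothetical weighting; a generic KD loss, not the paper's exact formula.
    """
    target = ensemble_soft_targets(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student_soft) if t > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


# Two made-up teachers and one student on a binary task such as WNLI.
teachers = [[1.2, -0.3], [0.4, 0.1]]
loss = distillation_loss([0.2, 0.0], teachers, label=0)
```

Averaging probabilities rather than raw logits keeps each teacher's contribution bounded; other ensembling choices (weighted averages, logit averaging) would fit the same skeleton.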

SEAD-L-6_H-256_A-8-wnli

This is a student model distilled from BERT base as the teacher, using the SEAD framework on the WNLI task. For weight initialization, we used microsoft/xtremedistil-l6-h256-uncased.
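A minimal sketch of how such a checkpoint might be used for inference with the Transformers library, assuming the repository id `C5i/SEAD-L-6_H-256_A-8-wnli` (taken from this page) hosts PyTorch weights and a tokenizer. The helper name `classify_pair` is hypothetical, and the label names depend on what the checkpoint's config defines.

```python
MODEL_ID = "C5i/SEAD-L-6_H-256_A-8-wnli"  # repository id as it appears on this page


def classify_pair(sentence1, sentence2, model_id=MODEL_ID):
    """Return the predicted label name for a WNLI-style sentence pair."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Encode both sentences as a single pair input.
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_id = int(logits.argmax(dim=-1))
    # id2label comes from the checkpoint's config; it may be generic (e.g. "LABEL_1").
    return model.config.id2label[predicted_id]
```

Calling `classify_pair("The trophy doesn't fit in the suitcase because it is too big.", "The trophy is too big.")` downloads the checkpoint on first use and returns one label string.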

All SEAD Checkpoints

Other Community Checkpoints: here

Intended uses & limitations

More information needed

Training hyperparameters

Please take a look at the training_args.bin file

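>>> import torch
>>> # Needs transformers installed (the file is a pickled TrainingArguments); on PyTorch >= 2.6 pass weights_only=False.
>>> hyperparameters = torch.load('training_args.bin')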

Evaluation results

| eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | eval_loss | eval_samples |
|---|---|---|---|---|---|
| 0.5634 | 1.2474 | 56.919 | 2.405 | 0.6859 | 71 |
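The reported figures are related by simple arithmetic, which the short check below works through. It assumes `eval_runtime` is in seconds, as is usual for Transformers evaluation logs; the variable names simply mirror the column headers.

```python
eval_accuracy = 0.5634
eval_runtime = 1.2474            # assumed to be seconds
eval_samples = 71
eval_samples_per_second = 56.919
eval_steps_per_second = 2.405

# Number of correctly classified examples implied by the accuracy.
correct = round(eval_accuracy * eval_samples)

# Throughput and step count recomputed from the other columns.
throughput = eval_samples / eval_runtime
steps = round(eval_steps_per_second * eval_runtime)

print(correct, round(correct / eval_samples, 4))  # 40 0.5634
print(round(throughput, 2))                       # 56.92
print(steps)                                      # 3
```

In other words, the model classifies 40 of the 71 evaluation examples correctly, and the evaluation ran in 3 steps. The recomputed throughput differs from the reported 56.919 only in the third decimal, which is consistent with the runtime having been rounded.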

Framework versions

Transformers >=4.8.0
PyTorch >=1.6.0
TensorFlow >=2.5.0
Flax >=0.3.5
Datasets >=1.10.2
Tokenizers >=0.11.6
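One way to check a local environment against these minimums is sketched below using only the standard library. The PyPI distribution names in `MINIMUMS` (for example `torch` for PyTorch) are an assumption, the helper names are hypothetical, and the comparison only looks at the first three numeric components of each version.

```python
import re
from importlib import metadata

# Minimum versions from this card, keyed by assumed PyPI distribution name.
MINIMUMS = {
    "transformers": "4.8.0",
    "torch": "1.6.0",
    "tensorflow": "2.5.0",
    "flax": "0.3.5",
    "datasets": "1.10.2",
    "tokenizers": "0.11.6",
}


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0), ignoring local and pre-release suffixes."""
    parts = [int(p) for p in re.findall(r"\d+", version.split("+")[0])[:3]]
    parts += [0] * (3 - len(parts))  # pad '2.0' to (2, 0, 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so '4.10.0' counts as newer than '4.8.0'."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Only one of PyTorch, TensorFlow, or Flax is typically needed to load the model, so a `None` entry for the other backends is not necessarily a problem.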

If you use these models, please cite the following paper:

    ```
    @article{article, 
        author={Mei, Moyan and Sroch, Rohit}, 
        title={SEAD: Simple Ensemble and Knowledge Distillation Framework for Natural Language Understanding}, 
        volume={3}, 
        number={1}, 
        journal={Lattice, The Machine Learning Journal by Association of Data Scientists},
        day={26},
        year={2022}, 
        month={Feb},
        url={www.adasci.org/journals/lattice-35309407/?volumes=true&open=621a3b18edc4364e8a96cb63}
    } 
    ```

Papers for C5i/SEAD-L-6_H-256_A-8-wnli

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv:2002.10957, published Feb 25, 2020.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108, published Oct 2, 2019.

TinyBERT: Distilling BERT for Natural Language Understanding. arXiv:1909.10351, published Sep 23, 2019.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv:1905.00537, published May 2, 2019.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, published Oct 11, 2018.