---
license: apache-2.0
---
# QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
The paper addresses the challenge of making transformer-based models efficient enough for practical use,
given their size and computational requirements. The authors propose QuaLA-MiniLM, an approach that
combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization.
It trains a single model that can adapt to any inference scenario with a given computational budget,
achieving a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. Compared with other efficient
methods, the approach achieves a speedup of up to 8.8x with less than 1% loss in accuracy.
The authors provide their code publicly on GitHub. The paper also discusses related work in the field,
including dynamic transformers and other knowledge distillation approaches.
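
As a rough illustration of the low-bit quantization component only (not the authors' released pipeline, which also covers distillation and the length-adaptive transformer), the sketch below applies PyTorch dynamic INT8 quantization to a MiniLM question-answering checkpoint and runs a single SQuAD-style inference. The model id is an assumed placeholder; the actual QuaLA-MiniLM checkpoints and LAT-aware inference code live in the authors' GitHub repository.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Placeholder assumption: a MiniLM checkpoint fine-tuned for extractive QA.
# Swap in the checkpoint from the authors' GitHub repository for the real model.
model_id = "deepset/minilm-uncased-squad2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
model.eval()

# Dynamic INT8 quantization: Linear-layer weights are stored as 8-bit integers
# and activations are quantized on the fly, shrinking the model and reducing
# CPU inference latency.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

question = "What techniques does QuaLA-MiniLM combine?"
context = (
    "QuaLA-MiniLM combines knowledge distillation, the length-adaptive "
    "transformer technique, and low-bit quantization in a single model."
)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Decode the highest-scoring answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```

Dynamic quantization only touches the Linear layers at inference time; the length-adaptive part of QuaLA-MiniLM additionally drops tokens layer by layer according to the chosen computational budget, which this sketch does not reproduce.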