---
pipeline_tag: text-generation
tags:
- openvino
- mpt
- sparse
- quantization
library_name: "OpenVINO"
---

This repo compares the performance of a dense quantized MPT-7B against a 70% sparse-quantized MPT-7B on OpenVINO. Both models are quantized to 8 bits on weights and activations (W8A8). The benchmark metric is decoding (next-token) latency at a context length of 512.
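
To make "70% sparse, 8-bit" concrete, the sketch below prunes a toy weight matrix and quantizes it to int8. It assumes unstructured magnitude pruning and a single symmetric per-tensor scale; the actual pruning and quantization recipe behind this model is not described here and may differ (for example, per-channel scales). `prune_and_quantize` is a hypothetical helper, and activation quantization is not shown.

```python
import random


def prune_and_quantize(weights, sparsity=0.7):
    """Hypothetical helper: magnitude-prune, then symmetric int8 quantize.

    Zeroes the `sparsity` fraction of weights with the smallest magnitude,
    then maps the survivors to int8 with one per-tensor scale.
    """
    rows, cols = len(weights), len(weights[0])
    flat = [w for row in weights for w in row]
    k = round(len(flat) * sparsity)  # how many weights to prune
    # Indices of the k smallest-magnitude weights are set to zero.
    pruned = set(sorted(range(len(flat)), key=lambda i: abs(flat[i]))[:k])
    kept = [0.0 if i in pruned else w for i, w in enumerate(flat)]
    # Symmetric scale: the largest surviving |w| maps to 127.
    scale = (max(abs(w) for w in kept) or 1.0) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in kept]
    return [q[r * cols:(r + 1) * cols] for r in range(rows)], scale


random.seed(0)
w = [[random.gauss(0.0, 0.02) for _ in range(10)] for _ in range(10)]
q, scale = prune_and_quantize(w, sparsity=0.7)
zeros = sum(v == 0 for row in q for v in row)
print(f"zero weights: {zeros}/100, scale: {scale:.6f}")
```

Only about 30% of the int8 values remain non-zero, so a runtime can avoid storing the zeros; the sparse weight decompression work described further down relies on storing only the non-zero values in a packed format.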

Target HW: 4th Gen Intel Xeon Scalable processors (Sapphire Rapids)
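
SW setup:

```bash
# Model files on the Hub are stored with Git LFS, so enable it before cloning
git lfs install
git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
cd ov-mpt-7b-gsm8k-sparse70
pip install openvino==2024.2.0
```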
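
OpenVINO's CPU documentation lists Intel AMX support as a requirement for sparse weights decompression, which is why the target is Sapphire Rapids. The sketch below is a quick Linux-only check that reads `/proc/cpuinfo` and looks for the kernel's `amx_tile` and `amx_int8` flags. The helper names are hypothetical; if the flags are missing, the benchmark most likely will not exercise the AMX-based sparse path this repo targets.

```python
from pathlib import Path

# AMX-related CPU flags as reported by the Linux kernel in /proc/cpuinfo.
REQUIRED_FLAGS = {"amx_tile", "amx_int8"}


def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_amx_int8(cpuinfo_text):
    """True if the CPU advertises both the AMX tile and AMX INT8 extensions."""
    return REQUIRED_FLAGS <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        found = has_amx_int8(cpuinfo.read_text())
        print("AMX INT8 available" if found else "AMX INT8 not found")
    else:
        print("/proc/cpuinfo not found (not a Linux host?)")
```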

## Benchmarking with OpenVINO

1. Dense W8A8 model: `./benchmarkapp_w8a8.bash`
2. 70% sparse W8A8 model: `./benchmarkapp_w8a8_sparse70.bash`

Note: remove the `numactl` invocation from the scripts if your node does not support it.
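
To turn the two runs into a single speedup number, one option is to capture each script's output (for example `./benchmarkapp_w8a8.bash > dense.log 2>&1`, where `dense.log`, `sparse70.log` and `compare.py` are hypothetical names) and compare the median latencies. The sketch assumes `benchmark_app` prints a latency summary line of the form `Median: <value> ms`, which is what recent releases appear to print; check this against your actual log before relying on it.

```python
import re
import sys

# Assumed shape of benchmark_app's latency line, e.g.
# "[ INFO ]    Median:        22.17 ms"
MEDIAN_RE = re.compile(r"Median:\s*([0-9]+(?:\.[0-9]+)?)\s*ms")


def median_latency_ms(log_text):
    """Return the first 'Median: <x> ms' value found in a benchmark log."""
    match = MEDIAN_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Median: <x> ms' line found in log")
    return float(match.group(1))


def speedup(dense_log, sparse_log):
    """Dense median latency divided by sparse median latency (>1 means sparse is faster)."""
    return median_latency_ms(dense_log) / median_latency_ms(sparse_log)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare.py dense.log sparse70.log  (hypothetical names)
    with open(sys.argv[1]) as dense, open(sys.argv[2]) as sparse:
        print(f"sparse70 vs dense speedup: {speedup(dense.read(), sparse.read()):.2f}x")
```

Using the median rather than the average keeps a few slow outlier iterations from skewing the comparison.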

## Implementation of Sparse Weight Decompression in OpenVINO

* Sparse Weight Decompression was first implemented in OpenVINO's fork of oneDNN in this PR:
  https://github.com/openvinotoolkit/oneDNN/pull/158/files

* You can browse the changed files via the file tree in the left pane of the PR page.

* Initialization: `src/cpu/reorder/simple_sparse_reorder.hpp` ([line 113](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-f1445f832cd9979d9756873e3d8c30716976f51b6ce4640eae12762a417284e3R113))

* Decompression: `src/cpu/x64/jit_brgemm_decompress_kernel.cpp` ([line 41](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-98844e424b6687de78d47737e62f206dc9befcec6887dac8b2c52d0303dd3576R41))

* To build the OpenVINO runtime from source for debugging, see the [build documentation](https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md). `benchmark_app` is built as part of the same build.
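
Conceptually, the initialization step packs each sparse int8 weight matrix once into a compressed form, and the decompression kernel unpacks it at run time for the matrix-multiply kernel. The sketch below illustrates that idea with a generic bitmask-plus-packed-values scheme in plain Python; the 64-weight block size, the helper names (`compress`, `decompress`, `compressed_bytes`) and the layout are assumptions for illustration, not the exact format used in the PR.

```python
import random

BLOCK = 64  # hypothetical: one 64-bit mask per 64 consecutive int8 weights


def compress(weights):
    """Initialization analogue: pack int8 weights into per-block bitmasks
    plus a stream of the non-zero values (hypothetical layout)."""
    masks, values = [], []
    for start in range(0, len(weights), BLOCK):
        mask = 0
        for bit, w in enumerate(weights[start:start + BLOCK]):
            if w != 0:
                mask |= 1 << bit  # mark this position as non-zero
                values.append(w)
        masks.append(mask)
    return masks, values


def decompress(masks, values, length):
    """Decompression analogue: scatter packed values back to dense order."""
    dense = [0] * length
    pos = 0  # read cursor into the packed value stream
    for block, mask in enumerate(masks):
        for bit in range(BLOCK):
            if (mask >> bit) & 1:
                dense[block * BLOCK + bit] = values[pos]
                pos += 1
    return dense


def compressed_bytes(masks, values):
    """Storage cost: 8 bytes per 64-bit mask plus 1 byte per int8 value."""
    return 8 * len(masks) + len(values)


random.seed(1)
# About 70% zeros; the remaining entries are random int8 values.
w = [random.randint(-127, 127) if random.random() < 0.3 else 0 for _ in range(4096)]
masks, vals = compress(w)
assert decompress(masks, vals, len(w)) == w
print(f"dense: {len(w)} B, compressed: {compressed_bytes(masks, vals)} B")
```

Under this hypothetical layout, 70% sparsity costs roughly 0.3 + 1/8 = 0.425 bytes per weight (one byte per non-zero value plus one mask bit per weight) instead of 1 byte, i.e. about 2.35x less weight data to read for each decoded token.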

## Related materials

[OpenVINO blog on Sparse-Quantized BERT](https://blog.openvino.ai/blog-posts/accelerate-inference-of-sparse-transformer-models-with-openvino-tm-and-4th-gen-intel-r-xeon-r-scalable-processors) ([corresponding notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/116-sparsity-optimization/116-sparsity-optimization.ipynb))