INT8 GPT-J 6B
GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.
This repository contains a GPT-J 6B ONNX model suitable for building TensorRT INT8+FP32 engines. The model was quantized with the ENOT-AutoDL framework. Code for building TensorRT engines, along with usage examples, is published on GitHub.
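
The quantization procedure used by ENOT-AutoDL is not described in this card. As a rough illustration of what mapping FP32 values to INT8 involves, here is a minimal sketch of generic symmetric per-tensor quantization. The function names and sample weights are hypothetical, and this is not the ENOT-AutoDL method.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The round-trip error of each value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as one signed byte plus a shared scale, which is where the size reduction comes from. Layers kept in FP32 (the "+FP32" part of the engine) skip this step.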
Metrics:
Metric | TensorRT INT8+FP32 | torch FP16 | torch FP32 |
---|---|---|---|
LAMBADA accuracy | 78.46% | 79.53% | - |
Model size (GB) | 8.5 | 12.1 | 24.2 |
Test environment
- GPU RTX 4090
- CPU 11th Gen Intel(R) Core(TM) i7-11700K
- TensorRT 8.5.3.1
- pytorch 1.13.1+cu116
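
The numbers in the metrics table can be turned into relative figures with a few lines of arithmetic. The sketch below assumes that 1 GB means 10^9 bytes and that FP16 and FP32 use 2 and 4 bytes per parameter; the variable names are illustrative.

```python
# Values copied from the metrics table.
size_gb = {"TensorRT INT8+FP32": 8.5, "torch FP16": 12.1, "torch FP32": 24.2}
lambada_acc = {"TensorRT INT8+FP32": 78.46, "torch FP16": 79.53}

# Accuracy difference between the quantized engine and the FP16 baseline,
# in percentage points.
acc_drop = lambada_acc["torch FP16"] - lambada_acc["TensorRT INT8+FP32"]

# Size of the quantized model relative to each baseline.
ratio_vs_fp16 = size_gb["TensorRT INT8+FP32"] / size_gb["torch FP16"]
ratio_vs_fp32 = size_gb["TensorRT INT8+FP32"] / size_gb["torch FP32"]

# Parameter count implied by the FP32 size (assumption: 1 GB = 1e9 bytes).
implied_params = size_gb["torch FP32"] * 1e9 / 4

print(f"accuracy drop: {acc_drop:.2f} percentage points")
print(f"size vs FP16:  {ratio_vs_fp16:.2f}x")
print(f"size vs FP32:  {ratio_vs_fp32:.2f}x")
print(f"implied parameters: {implied_params / 1e9:.2f}B")
```

The quantized model is larger than a quarter of the FP32 size because part of the network stays in FP32.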
Latency:
Input sequence length | Number of generated tokens | TensorRT INT8+FP32 (ms) | torch FP16 (ms) | Acceleration |
---|---|---|---|---|
64 | 64 | 1040 | 1610 | 1.55 |
64 | 128 | 2089 | 3224 | 1.54 |
64 | 256 | 4236 | 6479 | 1.53 |
128 | 64 | 1060 | 1619 | 1.53 |
128 | 128 | 2120 | 3241 | 1.53 |
128 | 256 | 4296 | 6510 | 1.52 |
256 | 64 | 1109 | 1640 | 1.49 |
256 | 128 | 2204 | 3276 | 1.49 |
256 | 256 | 4443 | 6571 | 1.49 |
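
To compare configurations more easily, the latency table can be reduced to a per-token cost and a speedup factor. This is a minimal sketch that assumes each latency value covers generation of all requested tokens. Recomputed ratios may differ from the Acceleration column in the last digit, presumably because that column was derived from unrounded timings.

```python
# (input length, generated tokens, TensorRT INT8+FP32 ms, torch FP16 ms)
rows = [
    (64, 64, 1040, 1610),
    (64, 128, 2089, 3224),
    (64, 256, 4236, 6479),
    (128, 64, 1060, 1619),
    (128, 128, 2120, 3241),
    (128, 256, 4296, 6510),
    (256, 64, 1109, 1640),
    (256, 128, 2204, 3276),
    (256, 256, 4443, 6571),
]

results = []
for input_len, generated, trt_ms, torch_ms in rows:
    results.append({
        "input_len": input_len,
        "generated": generated,
        # Average time spent per generated token, in milliseconds.
        "trt_ms_per_token": trt_ms / generated,
        "torch_ms_per_token": torch_ms / generated,
        # How many times faster the TensorRT engine is than torch FP16.
        "speedup": torch_ms / trt_ms,
    })

for r in results:
    print(f"{r['input_len']:>4} {r['generated']:>4} "
          f"{r['trt_ms_per_token']:.2f} ms/token (TensorRT) "
          f"{r['torch_ms_per_token']:.2f} ms/token (torch) "
          f"speedup {r['speedup']:.2f}")
```

Per-token latency stays close to constant as the number of generated tokens grows, and the speedup shrinks slightly for longer inputs.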
How to use
Examples of inference and an accuracy test are published on GitHub:
git clone https://github.com/ENOT-AutoDL/ENOT-transformers