---
base_model:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
---
|
# Model Card for TinyLlama-1.1B-Chat-v1.0 (Quantized) |
|
|
|
This is a quantized version of **TinyLlama-1.1B-Chat-v1.0**. |
|
|
|
### Performance Evaluation |
|
|
|
The quantized model was tested on the `hellaswag` dataset with the following results: |
|
|
|
| Metric | Base Model | Quantized Model | Change |
|-------------------------------|--------|--------|----------------|
| hellaswag accuracy            | 0.456  | 0.462  | +0.006         |
| hellaswag normalized accuracy | 0.64   | 0.64   | unchanged      |
| eval time (GPU, seconds)      | 219.67 | 209.34 | 4.70% decrease |
|
|
|
The quantized version of TinyLlama-1.1B-Chat-v1.0 maintains accuracy comparable to the base model while reducing evaluation time by 4.7%. The evaluation was run on a GPU over a subset of 100 `hellaswag` samples for expediency; a full evaluation is recommended before production use.
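
This card does not state which evaluation harness produced the numbers above. As a hedged sketch, a comparable 100-sample `hellaswag` run could be reproduced with EleutherAI's lm-evaluation-harness for the base model and with llama.cpp's `llama-perplexity` HellaSwag mode for the quantized GGUF; the commands, paths, and data file name below are illustrative assumptions rather than the exact setup used here.

```bash
# Base model: 100 hellaswag samples via lm-evaluation-harness (harness choice is an assumption)
lm_eval --model hf \
  --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --tasks hellaswag --device cuda:0 --batch_size 8 --limit 100

# Quantized GGUF: llama.cpp HellaSwag scorer (data file name assumed; see llama.cpp's perplexity docs)
./llama.cpp/build/bin/llama-perplexity \
  -m ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf \
  -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 100
```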
|
|
|
### Quantization Approach
|
The model was quantized to 4-bit precision using `llama.cpp`'s Q4_K_M method, a k-quant format that balances model size and output quality. The following steps were used:
|
|
|
1. Convert the original model to GGUF format: |
|
|
|
```bash
# Writes a GGUF file into the model directory; note the output filename for the next step
python ./llama.cpp/convert_hf_to_gguf.py ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/
```
|
2. Quantize the GGUF model to 4-bit Q4_K_M: |
|
|
|
```bash
# Input is the GGUF produced in step 1 (adjust the filename to match that step's output); the second path is the quantized output
./llama.cpp/build/bin/llama-quantize ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-f16.gguf ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf q4_k_m
```
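
As an optional sanity check (not part of the original steps), the quantized file can be loaded with llama.cpp's CLI to confirm it generates text; the model path assumes the output location from step 2.

```bash
# Minimal smoke test: load the Q4_K_M model and generate a short completion
./llama.cpp/build/bin/llama-cli \
  -m ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf \
  -p "Explain quantization in one sentence." -n 64
```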