---
base_model:
- TinyLlama/TinyLlama-1.1B-Chat-v1.0
---
# Model Card for TinyLlama-1.1B-Chat-v1.0 (Quantized)
This is a quantized version of **TinyLlama-1.1B-Chat-v1.0**.
### Performance Evaluation
Both the base and quantized models were evaluated on the `hellaswag` benchmark, with the following results:
| Metric | Base Model | Quantized Model | Change |
|-------------------------------|------------|-----------------|----------------|
| hellaswag accuracy | 0.456 | 0.462 | +0.006 |
| hellaswag normalized accuracy | 0.64 | 0.64 | unchanged |
| eval time on GPU (seconds) | 219.67 | 209.34 | 4.70% decrease |
The quantized version of TinyLlama-1.1B-Chat-v1.0 maintains accuracy comparable to the base model while reducing evaluation time by 4.7%. The evaluation was run on a GPU over a subset of 100 `hellaswag` samples for expediency; results at this sample size are noisy, so a full evaluation on the complete dataset is recommended before relying on these figures in production.
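This card does not state which evaluation harness produced the numbers above. As an illustrative sketch only, a comparable run can be set up with EleutherAI's `lm-evaluation-harness` (`pip install lm-eval`), which reports both plain and length-normalized accuracy for `hellaswag`; the 100-sample cap mirrors the subset used here:
```bash
# Hypothetical reproduction sketch (the harness choice is an assumption,
# not documented in this card): evaluate the base model on 100 hellaswag samples.
lm_eval --model hf \
  --model_args pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --tasks hellaswag \
  --limit 100 \
  --device cuda:0
```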
### Quantization Approach
The model was quantized to 4-bit precision using the Q4_K_M (k-quant, medium) method from `llama.cpp`, which trades a small quality loss for a substantially smaller model size and memory footprint. The following steps were used:
1. Convert the original Hugging Face model to GGUF format (by default, `convert_hf_to_gguf.py` writes an FP16 GGUF into the model directory):
```bash
python ./llama.cpp/convert_hf_to_gguf.py ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/
```
2. Quantize the GGUF model to 4-bit Q4_K_M. `llama-quantize` takes an input file, an output file, and a quantization type; the FP16 input filename below assumes the converter's default naming:
```bash
# input: FP16 GGUF from step 1 (filename assumes the converter default)
./llama.cpp/build/bin/llama-quantize ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-f16.gguf ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf Q4_K_M
```
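Once both steps complete, the quantized model can be sanity-checked locally. Below is a minimal sketch using `llama-cli` from the same `llama.cpp` build; the prompt and token count are illustrative, not part of the original workflow:
```bash
# Load the Q4_K_M model and generate up to 128 tokens for a test prompt.
./llama.cpp/build/bin/llama-cli \
  -m ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf \
  -p "What is quantization in machine learning?" \
  -n 128
```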