SmolLM-135M-Instruct-quantized.w4a16
Model Overview
- Model Architecture: SmolLM-135M-Instruct
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Intended Use Cases: Intended for commercial and research use in English. Similar to SmolLM-135M-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/23/2024
- Version: 1.0
- License(s): Apache-2.0
- Model Developers: Neural Magic
Quantized version of SmolLM-135M-Instruct. It achieves an average score of 31.91 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 31.55.
Model Optimizations
This model was obtained by quantizing the weights of SmolLM-135M-Instruct to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the weights of the linear operators within transformer blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scale per group maps between the INT4 and floating-point representations of the quantized weights. The GPTQ algorithm is used for quantization, as implemented in the llm-compressor library. Quantization is performed with a 10% damping factor, a group size of 64, and 512 sequences sampled from the LLM Compression Calibration dataset.
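To make the scheme concrete, the sketch below illustrates symmetric group-wise quantization on a single weight matrix: each contiguous group of 64 values shares one scale, chosen so that the INT4 range [-8, 7] covers the group's largest magnitude. This is a minimal illustration of the scheme described above, not GPTQ itself (GPTQ additionally compensates for quantization error using calibration data). The function names and the 576x576 example shape are hypothetical.
```python
import torch

def quantize_w4_groupwise(weight: torch.Tensor, group_size: int = 64):
    """Illustrative symmetric group-wise INT4 quantization (not GPTQ itself)."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group: map the group's largest magnitude to the INT4 extreme 7.
    scales = w.abs().amax(dim=-1, keepdim=True) / 7.0
    scales = scales.clamp(min=1e-8)  # avoid division by zero for all-zero groups
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales

def dequantize_w4_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 64):
    """Recover the floating-point approximation from INT4 values and group scales."""
    out_features, in_features = q.shape
    w = q.reshape(out_features, in_features // group_size, group_size).float()
    return (w * scales).reshape(out_features, in_features)

w = torch.randn(576, 576)  # e.g., one linear-layer weight
q, s = quantize_w4_groupwise(w)
w_hat = dequantize_w4_groupwise(q, s)
print((w - w_hat).abs().max())  # worst-case per-element quantization error
```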
Creation
This model was created with the llm-compressor library, as shown in the code snippet below.
```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from datasets import load_dataset

model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
num_samples = 512
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap each calibration example in a simple instruction-style prompt
preprocess_fn = lambda example: {"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)}

# Sample calibration sequences from the calibration dataset
dataset_name = "neuralmagic/LLM_compression_calibration"
dataset = load_dataset(dataset_name, split="train")
ds = dataset.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# GPTQ recipe: quantize all Linear layers to W4A16, skipping the lm_head
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Apply one-shot GPTQ quantization using the calibration set
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save the quantized model and tokenizer
model_name = model_id.split("/")[-1]
model.save_pretrained(f"{model_name}-quantized.w4a16")
tokenizer.save_pretrained(f"{model_name}-quantized.w4a16")
```
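Once saved, the quantized checkpoint can be used for inference. Below is a minimal usage sketch, assuming a vLLM version with compressed-tensors support; the local path matches the save calls above, and the prompt and sampling parameters are placeholders.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "SmolLM-135M-Instruct-quantized.w4a16"  # directory written above

# Build a chat-formatted prompt, since the model is tuned for assistant-like chat
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_path)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=128))
print(outputs[0].outputs[0].text)
```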
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the SparseML engine, using the following command:
```
lm_eval \
  --model sparseml \
  --model_args pretrained=nm-testing/SmolLM-135M-Instruct-quantized.w4a16,dtype=bfloat16,max_length=2048,add_bos_token=True,parallelize=True \
  --tasks openllm \
  --batch_size auto
```
Accuracy
Open LLM Leaderboard evaluation scores
| Benchmark | SmolLM-135M-Instruct | SmolLM-135M-Instruct-quantized.w4a16 (this model) | Recovery |
| --- | --- | --- | --- |
| MMLU (5-shot) | 26.220 | 25.202 | 96.12% |
| ARC Challenge (25-shot) | 29.948 | 30.034 | 100.29% |
| GSM-8K (5-shot, strict-match) | 1.289 | 1.971 | 152.91% |
| HellaSwag (10-shot) | 41.41 | 40.81 | 98.55% |
| Winogrande (5-shot) | 50.039 | 53.591 | 107.10% |
| TruthfulQA (0-shot) | 40.38 | 39.87 | 98.74% |
| Average | 31.55 | 31.91 | 101.16% |
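For reference, the recovery column is simply the quantized score divided by the baseline score, expressed as a percentage. A one-line check using the MMLU row from the table:
```python
# Recovery = quantized score / baseline score, as a percentage
baseline, quantized = 26.220, 25.202  # MMLU (5-shot) scores from the table
print(f"{100 * quantized / baseline:.2f}%")  # 96.12%
```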