---
base_model: meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
license: llama3
pipeline_tag: text-generation
tags:
- int8
- w8a8
- text-generation
---

# Meta-Llama-3-8B-Instruct-quantized.w8a8

## Model Overview

- **Model Architecture:** Meta-Llama-3
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight and Activation Quantization:** INT8 (W8A8)
- **Intended Use Cases:** Intended for commercial and research use in English, designed to function as an assistant-like chat model.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 9/2024
- **Version:** 1.0
- **License(s):** Llama3
- **Model Developers:** Mahesh Yaddanapudi

Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Quantizing both weights and activations to INT8 roughly halves disk size and GPU memory requirements relative to the FP16 original, enabling deployment in more resource-constrained environments.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to the INT8 (W8A8) data type. This optimization reduces the number of bits per parameter and activation from 16 to 8, significantly reducing disk size and memory requirements.

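
As a rough back-of-the-envelope illustration (ignoring quantization scales, any layers kept at higher precision, and runtime activation/KV-cache memory), halving the bits per weight roughly halves the weight footprint of an ~8B-parameter model:

```python
# Approximate weight-memory footprint of an ~8B-parameter model at different precisions.
# Ignores quantization scales, higher-precision layers, and runtime activation/KV-cache memory.
NUM_PARAMS = 8_000_000_000  # approximate parameter count of Meta-Llama-3-8B

def weight_memory_gib(bits_per_param: int) -> float:
    return NUM_PARAMS * bits_per_param / 8 / 2**30

print(f"FP16 weights: ~{weight_memory_gib(16):.1f} GiB")  # ~14.9 GiB
print(f"INT8 weights: ~{weight_memory_gib(8):.1f} GiB")   # ~7.5 GiB
```
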
The weights and activations of the linear operators within transformer blocks are quantized with the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm, using symmetric per-channel quantization, a 1% damping factor, and a calibration set of 256 sequences of 8,192 random tokens each.

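
For intuition, the sketch below shows what symmetric per-channel INT8 quantization of a weight matrix means: one scale per output channel, chosen so the largest magnitude in that channel maps to 127. It illustrates only the numeric format; GPTQ itself additionally compensates for quantization error using second-order information from the calibration data.

```python
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix."""
    # One positive scale per row (output channel); the largest |value| in the row maps to 127.
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

weight = torch.randn(4096, 4096)  # illustrative weight matrix
q, scales = quantize_per_channel_int8(weight)
reconstructed = q.float() * scales
print("max abs quantization error:", (weight - reconstructed).abs().max().item())
```
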
## Deployment

This model can be deployed efficiently with backends that support INT8 GPTQ checkpoints. The example below loads it with Hugging Face Transformers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8"

# Load the quantized checkpoint; device_map="auto" places the layers on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# Tokenize a prompt and generate a response.
prompt = "What are the benefits of model quantization in AI?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

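
The checkpoint can also be served with a dedicated inference engine. The snippet below is a minimal sketch, assuming a vLLM version whose GPTQ support covers this 8-bit configuration:

```python
from vllm import LLM, SamplingParams

# Assumes vLLM can load this GPTQ INT8 checkpoint directly from the Hub.
llm = LLM(model="zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What are the benefits of model quantization in AI?"], sampling_params)
print(outputs[0].outputs[0].text)
```
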
## Creation

This model was created with the GPTQ quantization algorithm as implemented in the AutoGPTQ library, as demonstrated in the code snippet below.

```python
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Create random calibration examples for quantization.
num_samples = 256
max_seq_len = 8192
tokenizer = AutoTokenizer.from_pretrained(model_id)
max_token_id = len(tokenizer.get_vocab()) - 1
examples = [
    {
        "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
        "attention_mask": max_seq_len * [1],
    }
    for _ in range(num_samples)
]

# Quantization configuration: 8-bit weights, per-channel scales (group_size=-1),
# activation ordering (desc_act=True), and a 1% damping factor.
quantize_config = BaseQuantizeConfig(
    bits=8,
    group_size=-1,
    desc_act=True,
    model_file_base_name="model",
    damp_percent=0.01,
)

# Load the model, run GPTQ calibration, and save the quantized checkpoint.
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config, device_map="auto")
model.quantize(examples)
model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a8")
```

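
As a quick sanity check on the size reduction described above, the on-disk footprint of the saved directory can be compared against a local copy of the FP16 original. The path below is the local output directory from the snippet above; adjust as needed:

```python
from pathlib import Path

def dir_size_gib(path: str) -> float:
    """Total size of all files under `path`, in GiB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 2**30

print(f"INT8 checkpoint: ~{dir_size_gib('Meta-Llama-3-8B-Instruct-quantized.w8a8'):.1f} GiB")
```
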
## Future Work

Further evaluations are planned to compare this quantized model against its unquantized counterpart and other quantized variants, with particular focus on benchmarks for code generation and logical reasoning.
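
Such comparisons could be run, for example, with EleutherAI's lm-evaluation-harness. The sketch below is illustrative only; the benchmark selection is an assumption rather than a committed evaluation plan, and task names or API details may differ between harness versions:

```python
import lm_eval

# Evaluate the quantized checkpoint on a couple of reasoning-oriented benchmarks;
# swapping in the unquantized model id produces the baseline numbers for comparison.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=zzzmahesh/Meta-Llama-3-8B-Instruct-quantized.w8a8",
    tasks=["arc_challenge", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```
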