How to use smpanaro/Llama-2-7b-NuGPTQ with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
How to use smpanaro/Llama-2-7b-NuGPTQ with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "smpanaro/Llama-2-7b-NuGPTQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "smpanaro/Llama-2-7b-NuGPTQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
docker model run hf.co/smpanaro/Llama-2-7b-NuGPTQ
How to use smpanaro/Llama-2-7b-NuGPTQ with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "smpanaro/Llama-2-7b-NuGPTQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "smpanaro/Llama-2-7b-NuGPTQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "smpanaro/Llama-2-7b-NuGPTQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "smpanaro/Llama-2-7b-NuGPTQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use smpanaro/Llama-2-7b-NuGPTQ with Docker Model Runner:
docker model run hf.co/smpanaro/Llama-2-7b-NuGPTQ
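The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the response shape (`choices[0].text`) assumes the server follows the OpenAI completions format, as the vLLM and SGLang examples state.

```python
import json
import urllib.request

MODEL_ID = "smpanaro/Llama-2-7b-NuGPTQ"


def build_completion_request(base_url, prompt, max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes an OpenAI-style completions response.
    return body["choices"][0]["text"]


# Example (requires a running server):
#   complete("http://localhost:8000", "Once upon a time,")   # vLLM port from above
#   complete("http://localhost:30000", "Once upon a time,")  # SGLang port from above
```

Splitting request construction from sending keeps the payload easy to inspect before any network call is made.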
Non-uniform GPTQ (NuGPTQ) combines GPTQ, SqueezeLLM, and output scaling into a competitive whole-tensor (no grouping) LLM compression method.
Results for Llama-2-7b-hf:
| Method | Wikitext word PPL (↓) | Delta vs. float16 |
|---|---|---|
| float16 | 8.7071 | 0 |
| AWQ | 8.9760 | 0.2689 |
| NuGPTQ (This) | 9.2754 | 0.5683 |
| GPTQ† | 9.4686 | 0.7615 |

† g128, desc_act=True
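
The Delta column is the difference between each method's word perplexity and the float16 baseline. A short sketch that recomputes it from the table values (the relative increase is an extra derived figure, not part of the original table):

```python
# Word perplexities copied from the results table.
word_ppl = {
    "float16": 8.7071,
    "AWQ": 8.9760,
    "NuGPTQ": 9.2754,
    "GPTQ (g128, desc_act=True)": 9.4686,
}

baseline = word_ppl["float16"]

# Absolute increase in perplexity over the float16 baseline.
deltas = {method: round(ppl - baseline, 4) for method, ppl in word_ppl.items()}

# Relative increase in percent (derived here for illustration only).
relative = {method: 100.0 * (ppl - baseline) / baseline for method, ppl in word_ppl.items()}

for method in word_ppl:
    print(f"{method:28s} delta={deltas[method]:.4f} ({relative[method]:.2f}%)")
```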
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install optimum
huggingface-cli login
# Set batch size based on your GPU.
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
--tasks wikitext \
--batch_size 1
# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.7071|± |N/A |
# | | |none | 0|byte_perplexity|1.4989|± |N/A |
# | | |none | 0|bits_per_byte |0.5839|± |N/A |
lm_eval --model hf \
--model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
--tasks wikitext \
--batch_size 1
# hf (pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.2754|± |N/A |
# | | |none | 0|byte_perplexity|1.5167|± |N/A |
# | | |none | 0|bits_per_byte |0.6009|± |N/A |
pip install auto-gptq
lm_eval --model hf \
--model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
--tasks wikitext \
--batch_size 1
# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.4686|± |N/A |
# | | |none | 0|byte_perplexity|1.5225|± |N/A |
# | | |none | 0|bits_per_byte |0.6065|± |N/A |
lm_eval --model hf \
--model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
--tasks wikitext \
--batch_size 1
# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.3801|± |N/A |
# | | |none | 0|byte_perplexity|1.5199|± |N/A |
# | | |none | 0|bits_per_byte |0.6040|± |N/A |
pip install autoawq
lm_eval --model hf \
--model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
--tasks wikitext \
--batch_size 1
# hf (pretrained=TheBloke/Llama-2-7B-AWQ,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.9760|± |N/A |
# | | |none | 0|byte_perplexity|1.5074|± |N/A |
# | | |none | 0|bits_per_byte |0.5921|± |N/A |
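The `byte_perplexity` and `bits_per_byte` rows in the outputs above are two views of the same quantity: bits per byte is the base-2 logarithm of byte perplexity. A small sketch that checks this against the reported numbers (the short labels are just dictionary keys chosen here; small mismatches come from the four-decimal rounding in the logs):

```python
import math

# (byte_perplexity, bits_per_byte) pairs copied from the lm_eval outputs above.
reported = {
    "float16": (1.4989, 0.5839),
    "NuGPTQ": (1.5167, 0.6009),
    "GPTQ g128": (1.5225, 0.6065),
    "GPTQ g32": (1.5199, 0.6040),
    "AWQ": (1.5074, 0.5921),
}


def bits_per_byte(byte_perplexity):
    """Convert byte-level perplexity to bits per byte."""
    return math.log2(byte_perplexity)


# Largest gap between the recomputed and the reported bits_per_byte.
max_abs_error = max(abs(bits_per_byte(ppl) - bpb) for ppl, bpb in reported.values())

for name, (ppl, bpb) in reported.items():
    print(f"{name:10s} log2({ppl}) = {bits_per_byte(ppl):.4f} (reported {bpb})")
```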
The model is fake quantized, which means each weight tensor has at most 16 (2^4) unique values, but they are stored in float16. The uniqueness can be checked as follows:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
count = 0
for key, tensor in model.state_dict().items():
    if "weight" not in key:
        continue
    if any([l in key for l in linear_layers]):
        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
        print("✓", end="", flush=True)
        count += 1
print()
# 32 model layers * 7 linear layers
print(f"{count} out of 224 linear layers have 16 unique values.")
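As a toy illustration of the fake-quantization property checked above, the sketch below snaps every weight in a list to the nearest entry of a 16-value codebook. It is not the NuGPTQ algorithm: the quantile-based codebook, the helper names `build_codebook` and `fake_quantize`, and the packed-storage estimate (4-bit indices plus a float16 codebook) are all assumptions made for illustration.

```python
import random


def build_codebook(weights, levels=16):
    """Pick `levels` representative values at evenly spaced quantiles (a simple stand-in)."""
    ordered = sorted(weights)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]


def fake_quantize(weights, codebook):
    """Map each weight to its nearest codebook entry; return indices and snapped values."""
    indices = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w))
        for w in weights
    ]
    dequantized = [codebook[j] for j in indices]
    return indices, dequantized


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic stand-in weights

codebook = build_codebook(weights)
indices, dequantized = fake_quantize(weights, codebook)

# Fake quantized: values are still full-width floats, but only a few distinct ones occur.
unique_values = len(set(dequantized))

# Hypothetical storage comparison, in bits.
bits_fake = 16 * len(weights)                        # every weight kept as float16
bits_packed = 4 * len(weights) + 16 * len(codebook)  # 4-bit indices + float16 codebook

print(f"{unique_values} unique values; {bits_fake} bits as stored vs {bits_packed} bits if packed")
```

Because the values are non-uniformly spaced, a packed representation would need the per-tensor codebook in addition to the 4-bit indices, which is why the estimate includes both terms.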
Base model
meta-llama/Llama-2-7b-hf