Non-uniform GPTQ (NuGPTQ) combines GPTQ, SqueezeLLM, and output scaling into a competitive whole-tensor (no grouping) LLM compression method.
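To make "non-uniform" concrete, here is a minimal sketch of the general codebook idea, assuming plain 1-D k-means over every value in a tensor. It is not the NuGPTQ algorithm: SqueezeLLM's sensitivity weighting, GPTQ's error compensation, and output scaling are all left out, and the function names and the random stand-in matrix are hypothetical.

```python
import numpy as np


def uniform_quantize(w, n_levels=16):
    """Round each value to the nearest of n_levels evenly spaced levels."""
    levels = np.linspace(w.min(), w.max(), n_levels)
    idx = np.abs(w.reshape(-1, 1) - levels).argmin(axis=1)
    return levels[idx].reshape(w.shape), levels


def nonuniform_quantize(w, n_levels=16, n_iters=20):
    """Plain 1-D k-means (Lloyd's algorithm) over all values of the tensor.

    The levels start on the uniform grid, so the squared error can only
    stay equal or shrink relative to uniform_quantize.
    """
    flat = w.reshape(-1)
    levels = np.linspace(flat.min(), flat.max(), n_levels)
    for _ in range(n_iters):
        idx = np.abs(flat[:, None] - levels).argmin(axis=1)
        for k in range(n_levels):
            members = flat[idx == k]
            if members.size:  # an empty cluster keeps its previous level
                levels[k] = members.mean()
    idx = np.abs(flat[:, None] - levels).argmin(axis=1)
    return levels[idx].reshape(w.shape), levels


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))  # hypothetical stand-in for a weight matrix

w_uniform, _ = uniform_quantize(w)
w_nonuniform, codebook = nonuniform_quantize(w)

mse_uniform = float(np.mean((w - w_uniform) ** 2))
mse_nonuniform = float(np.mean((w - w_nonuniform) ** 2))
print(f"uniform MSE:     {mse_uniform:.5f}")
print(f"non-uniform MSE: {mse_nonuniform:.5f}")
print(f"unique values:   {np.unique(w_nonuniform).size}")
```

One codebook is shared by the whole tensor here, which is one reading of "whole-tensor (no grouping)": no per-group scales or zero points are stored alongside the weights.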

Results for Llama-2-7b-hf:

| Method | Wikitext PPL (↓) | Delta |
|--------|-----------------:|------:|
| float16 | 8.7071 | 0 |
| AWQ | 8.9760 | 0.2689 |
| NuGPTQ (This) | 9.2754 | 0.5683 |
| GPTQ† | 9.4686 | 0.7615 |

† g128, desc_act=True
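The Delta column is each method's Wikitext word perplexity minus the float16 baseline. A short sketch of that arithmetic, using only the values reported here (the variable names are illustrative, and the relative increase is an extra derived figure, not one the table reports):

```python
# Word perplexities copied from the results table.
BASELINE_PPL = 8.7071  # float16

word_ppl = {
    "AWQ": 8.9760,
    "NuGPTQ": 9.2754,
    "GPTQ (g128, desc_act=True)": 9.4686,
}

# Absolute gap to the float16 baseline, rounded to the table's precision.
deltas = {name: round(ppl - BASELINE_PPL, 4) for name, ppl in word_ppl.items()}

for name, ppl in word_ppl.items():
    relative = 100 * (ppl - BASELINE_PPL) / BASELINE_PPL
    print(f"{name:<28} delta={deltas[name]:.4f}  relative=+{relative:.2f}%")
```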

Perplexity reproduction steps:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install optimum

huggingface-cli login

# Set batch size based on your GPU.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|8.7071|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.4989|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.5839|±  |N/A   |

lm_eval --model hf \
    --model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=smpanaro/llama-2-7b-nugptq,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|9.2754|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.5167|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.6009|±  |N/A   |

pip install auto-gptq
lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|9.4686|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.5225|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.6065|±  |N/A   |

lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|9.3801|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.5199|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.6040|±  |N/A   |

pip install autoawq
lm_eval --model hf \
    --model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
    --tasks wikitext \
    --batch_size 1

# hf (pretrained=thebloke/llama-2-7b-awq,dtype=float16), gen_kwargs: (none), limit: none, num_fewshot: none, batch_size: 1
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|8.9760|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.5074|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.5921|±  |N/A   |
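When comparing several runs, it can help to pull the metrics out of the printed tables programmatically. The following is a minimal sketch that assumes the column layout shown in the logs above (metric name in the fifth column, value in the sixth); `parse_lm_eval_table` is a hypothetical helper, not part of lm-evaluation-harness, and other harness versions may print a different layout.

```python
def parse_lm_eval_table(text):
    """Return {metric: value} from a printed lm_eval results table."""
    metrics = {}
    for line in text.splitlines():
        # Drop the leading "# " used in the logs, then split the row into cells.
        cells = [cell.strip() for cell in line.lstrip("# ").split("|")]
        if len(cells) < 8:
            continue
        metric, value = cells[5], cells[6]
        try:
            metrics[metric] = float(value)
        except ValueError:
            continue  # header or separator row
    return metrics


# The float16 baseline table, copied from the log above.
float16_log = """
# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext|      2|none  |     0|word_perplexity|8.7071|±  |N/A   |
# |        |       |none  |     0|byte_perplexity|1.4989|±  |N/A   |
# |        |       |none  |     0|bits_per_byte  |0.5839|±  |N/A   |
"""

metrics = parse_lm_eval_table(float16_log)
print(metrics["word_perplexity"])
```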

The model is fake-quantized: each linear layer's weight tensor contains at most 16 (2⁴) unique values, but those values are stored in float16. The uniqueness can be checked as follows:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ", trust_remote_code=True)
linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
count = 0
for key, tensor in model.state_dict().items():
    if "weight" not in key:
        continue
    # Only check the attention and MLP projection weights.
    if any(name in key for name in linear_layers):
        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
        print("✓", end="", flush=True)
        count += 1

print()
# 32 model layers * 7 linear layers = 224
print(f"{count} out of 224 linear layers have at most 16 unique values.")
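Because a fake-quantized tensor holds at most 16 distinct values, it could in principle be stored as a 16-entry lookup table plus one 4-bit index per weight. The sketch below shows that conversion on a synthetic NumPy tensor. It is an assumption-laden illustration, not the storage format this model uses: `pack_4bit` and `unpack_4bit` are hypothetical helpers, the codebook is random, and output scaling is not handled.

```python
import numpy as np


def pack_4bit(weights):
    """Split a tensor with <= 16 unique values into a codebook and packed indices."""
    flat = weights.reshape(-1)
    codebook, idx = np.unique(flat, return_inverse=True)
    assert codebook.size <= 16, "tensor is not 4-bit fake-quantized"
    assert flat.size % 2 == 0, "this sketch assumes an even number of elements"
    idx = idx.reshape(-1).astype(np.uint8)
    packed = (idx[0::2] << 4) | idx[1::2]  # two 4-bit indices per byte
    return codebook, packed


def unpack_4bit(codebook, packed, shape):
    """Rebuild the original tensor from the codebook and packed indices."""
    idx = np.empty(packed.size * 2, dtype=np.uint8)
    idx[0::2] = packed >> 4    # high nibble
    idx[1::2] = packed & 0x0F  # low nibble
    return codebook[idx].reshape(shape)


rng = np.random.default_rng(0)
levels = rng.normal(size=16).astype(np.float16)  # hypothetical codebook
fake_quantized = levels[rng.integers(0, 16, size=(128, 128))]

codebook, packed = pack_4bit(fake_quantized)
restored = unpack_4bit(codebook, packed, fake_quantized.shape)

print(f"round trip exact: {np.array_equal(restored, fake_quantized)}")
print(f"float16 storage:  {fake_quantized.nbytes} bytes")
print(f"4-bit storage:    {packed.nbytes + codebook.nbytes} bytes")
```

The packed indices take a quarter of the float16 footprint, plus a few bytes for the codebook.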

Papers for smpanaro/Llama-2-7b-NuGPTQ

SqueezeLLM: Dense-and-Sparse Quantization. arXiv:2306.07629, published Jun 13, 2023.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323, published Oct 31, 2022.