**This model has been quantized using [GPTQModel](https://github.com/ModelCloud/GPTQModel)** with the settings listed below; a sketch of the corresponding quantization call follows the list.

- bits: 4
- group_size: 128
- desc_act: true
- static_groups: false
- sym: true
- lm_head: false
- damp_percent: 0.01
- true_sequential: true
- model_name_or_path: ""
- model_file_base_name: "model"
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: "gptqmodel:0.9.9-dev0"
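
For reference, producing a checkpoint with the configuration above would look roughly like the sketch below. The source model id, the calibration text, and the exact `GPTQModel` / `QuantizeConfig` call signatures are assumptions (the library's API has changed across releases), so treat this as an illustration rather than the exact script used for this checkpoint.

```python
# Sketch only: quantize gemma-2-27b-it to 4-bit GPTQ with settings matching the
# config listed above. API names follow the GPTQModel project but may differ by
# version; the calibration data here is a placeholder, not a real corpus.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,             # 4-bit weights
    group_size=128,     # per-group quantization scales
    desc_act=True,      # activation-order ("act-order") quantization
    sym=True,           # symmetric quantization
    damp_percent=0.01,  # Hessian dampening factor
)

# A real run uses several hundred calibration samples (e.g. from C4);
# this single string is only a placeholder.
calibration_data = [
    "GPTQ quantizes each layer's weights using a small calibration set."
]

model = GPTQModel.load("google/gemma-2-27b-it", quant_config)  # assumed source model id
model.quantize(calibration_data)
model.save("gemma-2-27b-it-gptq-4bit")
```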
 

**Currently, only vLLM can load this quantized gemma-2-27b model for proper inference. Here is an example:**
```python
import os
# Gemma-2 requires the FlashInfer attention backend because it uses logits_soft_cap; other backends may produce wrong output.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "ModelCloud/gemma-2-27b-it-gptq-4bit"

prompt = [{"role": "user", "content": "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"}]

tokenizer = AutoTokenizer.from_pretrained(model_name)

llm = LLM(
    model=model_name,
)
sampling_params = SamplingParams(temperature=0.95, max_tokens=128)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```