---
license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
- name: Octopus-V2-2B
  results: []
tags:
- function calling
- on-device language model
- android
inference: false
space: false
spaces: false
language:
- en
---

# Quantized Octopus V2: On-device language model for super agent

This repo provides two types of quantized models, **GGUF** and **AWQ**, of our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2).

<p align="center" width="100%">
  <a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
</p>
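
If you want to fetch a single quantized file programmatically, a minimal sketch using `huggingface_hub` is shown below; the `repo_id` and `filename` are placeholders, so substitute this repo's id and the quant file you want.

```python
# Minimal download sketch, assuming huggingface_hub is installed.
# repo_id and filename are placeholders; use this repo's id and the
# quant file you actually want.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="<this-repo-id>",
    filename="octopus-v2-Q4_K_M.gguf",
)
print(gguf_path)
```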


# GGUF Quantization
## (Recommended) Run with [llama.cpp](https://github.com/ggerganov/llama.cpp)

1. **Clone and compile:**

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
```

2. **Prepare the Input Prompt File:**
   
Navigate to the `prompts` folder inside the `llama.cpp` repository and create a new file named `chat-with-octopus.txt`.

   `chat-with-octopus.txt`:

```bash
User: 
```
  
3. **Execute the Model:**

Run the following command in the terminal:

```bash
./main -m ./path/to/octopus-v2-Q4_K_M.gguf -c 512 -b 2048 -n 256 -t 1 --repeat_penalty 1.0 --top_k 0 --top_p 1.0 --color -i -r "User:" -f prompts/chat-with-octopus.txt
```

Example prompt to interact with the model:
```bash
<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>
```
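
If you prefer to drive the GGUF file from Python rather than the interactive `./main` session, a minimal sketch with the `llama-cpp-python` bindings (an assumption; not part of this repo) reuses the same prompt and the `<nexa_end>` stop token from the Ollama section below:

```python
# Minimal llama-cpp-python sketch (pip install llama-cpp-python).
# The model path is a placeholder; point it at your downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(model_path="./path/to/octopus-v2-Q4_K_M.gguf", n_ctx=1024)

prompt = (
    "<|system|>You are a router. Below is the query from the users, please call "
    "the correct function and generate the parameters to call the function.<|end|>"
    "<|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
)

# Greedy decoding, stopping at the <nexa_end> token used by Octopus V2.
out = llm(prompt, max_tokens=256, temperature=0.0, stop=["<nexa_end>"])
print(out["choices"][0]["text"])
```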

## Run with [Ollama](https://github.com/ollama/ollama)
1. Create a `Modelfile` in your directory and include a `FROM` statement with the path to your local model:

```bash
FROM ./path/to/octopus-v2-Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER num_ctx 1024
PARAMETER stop <nexa_end>
```

2. Use the following command to add the model to Ollama:
```bash
ollama create octopus-v2-Q4_K_M -f Modelfile
```

3. Verify that the model has been successfully imported:
```bash
ollama ls
```

### Run the model
```bash
ollama run octopus-v2-Q4_K_M "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
```
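
Once the model is imported, you can also call Ollama's local HTTP API instead of the CLI. A minimal sketch using `requests`, assuming Ollama is serving on its default port 11434:

```python
# Minimal sketch against Ollama's /api/generate endpoint; assumes the
# octopus-v2-Q4_K_M model was created as shown above and Ollama is running.
import requests

prompt = (
    "<|system|>You are a router. Below is the query from the users, please call "
    "the correct function and generate the parameters to call the function.<|end|>"
    "<|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "octopus-v2-Q4_K_M", "prompt": prompt, "stream": False},
)
print(resp.json()["response"])
```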

# AWQ Quantization
Python example:

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np

def inference(input_text):
    start_time = time.time()
    input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
    input_length = input_ids["input_ids"].shape[1]
    generation_output = model.generate(
        input_ids["input_ids"],
        do_sample=False,
        max_length=1024
    )
    end_time = time.time()

    # Decode only the generated part
    generated_sequence = generation_output[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])

    latency = end_time - start_time
    num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency

    return {"output": res, "latency": latency, "throughput": throughput}

# Initialize tokenizer and model
model_id = "path/to/Octopus-v2-AWQ"  # replace with your local path or Hugging Face repo id for the AWQ model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])

print("avg throughput:", np.mean(avg_throughput))
```
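
To watch tokens as they are produced instead of waiting for the full generation, you can pass a streamer to `generate`. This is a sketch, not part of the original example, and it assumes the `tokenizer` and `model` objects loaded above (AutoAWQForCausalLM forwards `generate` kwargs to the underlying Hugging Face model):

```python
# Streaming sketch; reuses `tokenizer` and `model` from the example above.
# TextStreamer prints tokens to stdout as they are generated, and
# skip_prompt=True hides the echoed input prompt.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer(
    "Below is the query from the users, please call the correct function and "
    "generate the parameters to call the function.\n\nQuery: Take a selfie for me "
    "with front camera\n\nResponse:",
    return_tensors="pt",
).to("cuda")
model.generate(inputs["input_ids"], do_sample=False, max_length=1024, streamer=streamer)
```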

# Quantized GGUF & AWQ Models Benchmark

| Name                   | Quant method | Bits | Size     | Throughput (tokens/s) | Use Cases                           |
| ---------------------- | ------------ | ---- | -------- | --------------------- | ----------------------------------- |
| Octopus-v2-AWQ         | AWQ          | 4    | 3.00 GB  | 63.83          | fast, high quality, recommended     |
| Octopus-v2-Q2_K.gguf   | Q2_K         | 2    | 1.16 GB  | 57.81          | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf   | Q3_K         | 3    | 1.38 GB  | 57.81          | strongly not recommended            |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S       | 3    | 1.19 GB  | 52.13          | strongly not recommended            |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M       | 3    | 1.38 GB  | 58.67          | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L       | 3    | 1.47 GB  | 56.92          | not very recommended                |
| Octopus-v2-Q4_0.gguf   | Q4_0         | 4    | 1.55 GB  | 68.80          | moderate speed, recommended         |
| Octopus-v2-Q4_1.gguf   | Q4_1         | 4    | 1.68 GB  | 68.09          | moderate speed, recommended         |
| Octopus-v2-Q4_K.gguf   | Q4_K         | 4    | 1.63 GB  | 64.70          | moderate speed, recommended         |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S       | 4    | 1.56 GB  | 62.16          | fast and accurate, highly recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M       | 4    | 1.63 GB  | 64.74          | fast, recommended                   |
| Octopus-v2-Q5_0.gguf   | Q5_0         | 5    | 1.80 GB  | 64.80          | fast, recommended                   |
| Octopus-v2-Q5_1.gguf   | Q5_1         | 5    | 1.92 GB  | 63.42          | very big, prefer Q4                 |
| Octopus-v2-Q5_K.gguf   | Q5_K         | 5    | 1.84 GB  | 61.28          | big, recommended                    |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S       | 5    | 1.80 GB  | 62.16          | big, recommended                    |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M       | 5    | 1.71 GB  | 61.54          | big, recommended                    |
| Octopus-v2-Q6_K.gguf   | Q6_K         | 6    | 2.06 GB  | 55.94          | very big, not very recommended      |
| Octopus-v2-Q8_0.gguf   | Q8_0         | 8    | 2.67 GB  | 56.35          | very big, not very recommended      |
| Octopus-v2-f16.gguf    | f16          | 16   | 5.02 GB  | 36.27          | extremely big                       |
| Octopus-v2.gguf        |              |      | 10.00 GB |                |                                     |

_Quantized with llama.cpp_


**Acknowledgement**:  
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.