Shu

Update README.md

175ea31 verified 5 months ago

6.99 kB

	---
	license: cc-by-nc-4.0
	base_model: google/gemma-2b
	model-index:
	- name: Octopus-V2-2B
	results: []
	tags:
	- function calling
	- on-device language model
	- android
	inference: false
	space: false
	spaces: false
	language:
	- en
	---

	# Quantized Octopus V2: On-device language model for super agent

	This repo includes two types of quantized models: GGUF and AWQ, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2)

	<p align="center" width="100%">
	<a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
	</p>


	# GGUF Qauntization
	## (Recommended) Run with [llama.cpp](https://github.com/ggerganov/llama.cpp)

	1. Clone and compile:

	```bash
	git clone https://github.com/ggerganov/llama.cpp
	cd llama.cpp
	# Compile the source code:
	make
	```

	2. Prepare the Input Prompt File:

	Navigate to the `prompt` folder inside the `llama.cpp`, and create a new file named `chat-with-octopus.txt`.

	`chat-with-octopus.txt`:

	```bash
	User:
	```

	3. Execute the Model:

	Run the following command in the terminal:

	```bash
	./main -m ./path/to/octopus-v2-Q4_K_M.gguf -c 512 -b 2048 -n 256 -t 1 --repeat_penalty 1.0 --top_k 0 --top_p 1.0 --color -i -r "User:" -f prompts/chat-with-octopus.txt
	```

	Example prompt to interact
	```bash
	<\|system\|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<\|end\|><\|user\|>Query: Take a selfie for me with front camera<\|end\|><\|assistant\|>
	```

	## Run with [Ollama](https://github.com/ollama/ollama)
	1. Create a `Modelfile` in your directory and include a `FROM` statement with the path to your local model:

	```bash
	FROM ./path/to/octopus-v2-Q4_K_M.gguf
	PARAMETER temperature 0
	PARAMETER num_ctx 1024
	PARAMETER stop <nexa_end>
	```

	2. Use the following command to add the model to Ollama:
	```bash
	ollama create octopus-v2-Q4_K_M -f Modelfile
	```

	3. Verify that the model has been successfully imported:
	```bash
	ollama ls
	```

	### Run the model
	```bash
	ollama run octopus-v2-Q4_K_M "<\|system\|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<\|end\|><\|user\|>Query: Take a selfie for me with front camera<\|end\|><\|assistant\|>"
	```

	# AWQ Quantization
	Python example:

	```python
	from transformers import AutoTokenizer
	from awq import AutoAWQForCausalLM
	import torch
	import time
	import numpy as np

	def inference(input_text):
	start_time = time.time()
	input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
	input_length = input_ids["input_ids"].shape[1]
	generation_output = model.generate(
	input_ids["input_ids"],
	do_sample=False,
	max_length=1024
	)
	end_time = time.time()

	# Decode only the generated part
	generated_sequence = generation_output[:, input_length:].tolist()
	res = tokenizer.decode(generated_sequence[0])

	latency = end_time - start_time
	num_output_tokens = len(generated_sequence[0])
	throughput = num_output_tokens / latency

	return {"output": res, "latency": latency, "throughput": throughput}

	# Initialize tokenizer and model
	model_id = "/home/mingyuanma/Octopus-v2-AWQ-NexaAIDev"
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
	model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
	trust_remote_code=False, safetensors=True)

	prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

	avg_throughput = []
	for prompt in prompts:
	out = inference(prompt)
	avg_throughput.append(out["throughput"])
	print("nexa model result:\n", out["output"])

	print("avg throughput:", np.mean(avg_throughput))
	```

	# Quantized GGUF & AWQ Models Benchmark

	\| Name \| Quant method \| Bits \| Size \| Response (t/s) \| Use Cases \|
	\| ---------------------- \| ------------ \| ---- \| -------- \| -------------- \| ----------------------------------- \|
	\| Octopus-v2-AWQ \| AWQ \| 4 \| 3.00 GB \| 63.83 \| fast, high quality, recommended \|
	\| Octopus-v2-Q2_K.gguf \| Q2_K \| 2 \| 1.16 GB \| 57.81 \| fast but high loss, not recommended \|
	\| Octopus-v2-Q3_K.gguf \| Q3_K \| 3 \| 1.38 GB \| 57.81 \| extremely not recommended \|
	\| Octopus-v2-Q3_K_S.gguf \| Q3_K_S \| 3 \| 1.19 GB \| 52.13 \| extremely not recommended \|
	\| Octopus-v2-Q3_K_M.gguf \| Q3_K_M \| 3 \| 1.38 GB \| 58.67 \| moderate loss, not very recommended \|
	\| Octopus-v2-Q3_K_L.gguf \| Q3_K_L \| 3 \| 1.47 GB \| 56.92 \| not very recommended \|
	\| Octopus-v2-Q4_0.gguf \| Q4_0 \| 4 \| 1.55 GB \| 68.80 \| moderate speed, recommended \|
	\| Octopus-v2-Q4_1.gguf \| Q4_1 \| 4 \| 1.68 GB \| 68.09 \| moderate speed, recommended \|
	\| Octopus-v2-Q4_K.gguf \| Q4_K \| 4 \| 1.63 GB \| 64.70 \| moderate speed, recommended \|
	\| Octopus-v2-Q4_K_S.gguf \| Q4_K_S \| 4 \| 1.56 GB \| 62.16 \| fast and accurate, very recommended \|
	\| Octopus-v2-Q4_K_M.gguf \| Q4_K_M \| 4 \| 1.63 GB \| 64.74 \| fast, recommended \|
	\| Octopus-v2-Q5_0.gguf \| Q5_0 \| 5 \| 1.80 GB \| 64.80 \| fast, recommended \|
	\| Octopus-v2-Q5_1.gguf \| Q5_1 \| 5 \| 1.92 GB \| 63.42 \| very big, prefer Q4 \|
	\| Octopus-v2-Q5_K.gguf \| Q5_K \| 5 \| 1.84 GB \| 61.28 \| big, recommended \|
	\| Octopus-v2-Q5_K_S.gguf \| Q5_K_S \| 5 \| 1.80 GB \| 62.16 \| big, recommended \|
	\| Octopus-v2-Q5_K_M.gguf \| Q5_K_M \| 5 \| 1.71 GB \| 61.54 \| big, recommended \|
	\| Octopus-v2-Q6_K.gguf \| Q6_K \| 6 \| 2.06 GB \| 55.94 \| very big, not very recommended \|
	\| Octopus-v2-Q8_0.gguf \| Q8_0 \| 8 \| 2.67 GB \| 56.35 \| very big, not very recommended \|
	\| Octopus-v2-f16.gguf \| f16 \| 16 \| 5.02 GB \| 36.27 \| extremely big \|
	\| Octopus-v2.gguf \| \| \| 10.00 GB \| \| \|

	_Quantized with llama.cpp_


	Acknowledgement:
	We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.