license: apache-2.0
language:
- en
tags:
- gemma
- function calling
- on-device language model
- android
- conversational
Octopus V1: On-device language model for function calling of software APIs
- Nexa AI Product - ArXiv
Introducing Octopus-V2-2B
Octopus-V2-2B, an advanced open-source language model with 2 billion parameters, represents Nexa AI's research breakthrough in the application of large language models (LLMs) for function calling, specifically tailored for Android APIs. Unlike Retrieval-Augmented Generation (RAG) methods, which require detailed descriptions of potential function arguments—sometimes needing up to tens of thousands of input tokens—Octopus-V2-2B introduces a unique functional token strategy for both its training and inference stages. This approach not only allows it to achieve performance levels comparable to GPT-4 but also significantly enhances its inference speed beyond that of RAG-based methods, making it especially beneficial for edge computing devices.
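The functional token idea can be illustrated with a toy dispatcher. This is only a sketch under stated assumptions: this card does not specify the token format, so the token names (`<fn_0>`, `<fn_end>`), the output layout, and the `take_a_photo` function are all hypothetical. The point it shows is that the model selects a function by emitting a single dedicated token, so the prompt does not need to carry long function descriptions.

```python
import ast
import re

# Hypothetical device function; a real deployment would call an Android API here.
def take_a_photo(camera: str) -> str:
    return f"photo taken with {camera} camera"

# Assumed mapping from functional tokens to callables (token names are illustrative).
FUNCTIONAL_TOKENS = {"<fn_0>": take_a_photo}

# Assumed output layout: <fn_N>(arg1, arg2, ...)<fn_end>
CALL_PATTERN = re.compile(r"(<fn_\d+>)\((.*?)\)<fn_end>")

def dispatch(model_output: str):
    """Find the functional token in the model output and call the mapped function."""
    match = CALL_PATTERN.search(model_output)
    if match is None:
        raise ValueError("no functional token found in model output")
    token, raw_args = match.groups()
    # Parse arguments as Python literals only; nothing is executed.
    args = ast.literal_eval(f"({raw_args},)") if raw_args.strip() else ()
    return FUNCTIONAL_TOKENS[token](*args)

result = dispatch("<fn_0>('front')<fn_end>")
print(result)
```

Because the function is identified by one token, the decoding step stays short regardless of how many functions are registered in the mapping.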
📱 On-device Applications: Octopus-V2-2B is engineered to operate seamlessly on Android devices, extending its utility across a wide range of applications, from Android system management to the orchestration of multiple devices. Further demonstrations of its capabilities are available on the Nexa AI Research Page, showcasing its adaptability and potential for on-device integration.
🚀 Inference Speed: In benchmarks, Octopus-V2-2B runs 36X faster than the "Llama7B + RAG" solution on a single A100 GPU. Compared to GPT-4-turbo (gpt-4-0125-preview), which relies on clusters of A100/H100 GPUs, Octopus-V2-2B is 168% faster. This efficiency is attributed to our functional token design.
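The two speed figures use different units (a multiplier and a percentage), so it can help to put them on the same scale. The sketch below assumes that "168% faster" means the speed is 1 + 1.68 times the baseline; the card does not define the term precisely, and the 1.0 s baseline latency is a hypothetical value chosen only to make the ratios concrete.

```python
def multiplier_from_percent_faster(percent_faster: float) -> float:
    """Convert 'N% faster' into a speed multiplier, assuming it means (1 + N/100)x."""
    return 1.0 + percent_faster / 100.0

def latency_after_speedup(baseline_latency_s: float, multiplier: float) -> float:
    """Latency implied by applying a speed multiplier to a baseline latency."""
    return baseline_latency_s / multiplier

gpt4_turbo_multiplier = multiplier_from_percent_faster(168)  # percentage quoted above
llama_rag_multiplier = 36.0                                  # multiplier quoted above

# Hypothetical 1.0 s baseline for each comparison system.
print(round(latency_after_speedup(1.0, gpt4_turbo_multiplier), 3))
print(round(latency_after_speedup(1.0, llama_rag_multiplier), 3))
```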
🐙 Accuracy: Octopus-V2-2B not only excels in speed but also in accuracy, surpassing the "Llama7B + RAG solution" in function call accuracy by 31%. It achieves a function call accuracy comparable to GPT-4 and RAG + GPT-3.5, with scores ranging between 98% and 100% across benchmark datasets.
💪 Function Calling Capabilities: Octopus-V2-2B is capable of generating individual, nested, and parallel function calls across a variety of complex scenarios.
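The three call shapes can be pictured with ordinary Python calls. The functions below are hypothetical stand-ins, not the model's actual Android function set; the sketch only shows what "individual", "nested", and "parallel" mean structurally, assuming "parallel" refers to several independent calls produced for one query.

```python
# Hypothetical stand-ins for device APIs; names and signatures are illustrative only.
def take_a_photo(camera: str) -> str:
    return f"photo({camera})"

def get_trending_news(category: str) -> str:
    return f"news({category})"

def send_text_message(contact: str, message: str) -> str:
    return f"sent to {contact}: {message}"

# Individual: one function call answers the query.
individual = take_a_photo("front")

# Nested: the result of one call is an argument of another.
nested = send_text_message("Alice", get_trending_news("tech"))

# Parallel: several independent calls are produced for a single query.
parallel = [take_a_photo("back"), get_trending_news("sports")]

print(individual, nested, parallel, sep="\n")
```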
Example Use Cases
You can run the model on a GPU using the following code.
from transformers import AutoTokenizer, GemmaForCausalLM
import torch
import time
def inference(input_text):
    # Time tokenization, generation, and decoding together
    start_time = time.time()
    input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
    input_length = input_ids["input_ids"].shape[1]
    outputs = model.generate(
        input_ids=input_ids["input_ids"],
        max_length=1024,
        do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt
    generated_sequence = outputs[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])
    end_time = time.time()
    return {"output": res, "latency": end_time - start_time}

model_id = "NexaAIDev/android_API_10k_data"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GemmaForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

input_text = "Take a selfie for me with front camera"
nexa_query = f"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: {input_text} \n\nResponse:"
start_time = time.time()
print("nexa model result:\n", inference(nexa_query))
print("latency:", time.time() - start_time, "s")
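When sending several queries, the prompt format used in the example can be wrapped in a small helper. This is a minimal sketch: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here, not part of the model's API, and the second query is an illustrative example rather than one taken from the benchmark. Each resulting prompt could then be passed to the `inference` function from the example.

```python
# Hypothetical helper that reproduces the prompt format used in the example.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query} \n\nResponse:"
)

def build_prompt(query: str) -> str:
    """Return the full prompt string for a single user query."""
    return PROMPT_TEMPLATE.format(query=query.strip())

queries = [
    "Take a selfie for me with front camera",
    "Set an alarm for 7 am tomorrow",  # illustrative query, not from the benchmark
]
prompts = [build_prompt(q) for q in queries]
print(prompts[0])
```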
Evaluation
License
This model was trained on commercially viable data and is under the Nexa AI community disclaimer.
References
We thank the Google Gemma team for their amazing models!
@misc{gemma-2023-open-models,
author = {{Gemma Team, Google DeepMind}},
title = {Gemma: Open Models Based on Gemini Research and Technology},
url = {https://goo.gle/GemmaReport},
year = {2023},
}
Citation
@misc{TODO}
Contact
Please contact us with any issues or comments!