---
license: apache-2.0
language:
- en
tags:
- audio-text-to-text
- chat
- audio
- GGUF
---
# OmniAudio-2.6B

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/d7Rzpm0cgCToXjtE7_U2u.png" alt="Example" style="width:100px;" />
</p>

OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment - a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components: Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing directly on edge devices.

Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
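To make the architecture concrete, here is a minimal sketch of the encoder → projector → LM pattern described above. It is illustrative only: the two-layer MLP projector, the layer widths (Whisper-turbo encoder states of size 1280, Gemma-2-2b hidden states of size 2304), and the splicing of audio embeddings into the token stream are assumptions about the general design, not the released implementation.

```python
# Illustrative sketch of the encoder -> projector -> LM pattern.
# Dimensions and the MLP projector are assumptions, not the shipped weights.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the language model's embedding space."""
    def __init__(self, audio_dim: int = 1280, lm_dim: int = 2304):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        return self.net(audio_features)

projector = AudioProjector()
audio_features = torch.randn(1, 1500, 1280)   # Whisper-style output for ~30 s of audio
audio_embeds = projector(audio_features)      # (1, 1500, 2304)

# The projected audio embeddings are concatenated with the embedded text
# prompt and decoded by the LM in a single pass - no separate ASR stage.
text_embeds = torch.randn(1, 12, 2304)        # stand-in for embedded prompt tokens
lm_input = torch.cat([audio_embeds, text_embeds], dim=1)
print(lm_input.shape)                         # torch.Size([1, 1512, 2304])
```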

## Quick Links
1. Interactive Demo in our [HuggingFace Space](https://huggingface.co/spaces/NexaAIDev/omni-audio-demo)
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [blog post](https://nexa.ai/blogs/OmniAudio-2.6B)

**Feedback:** Send questions or suggestions about the model in our [Discord](https://discord.gg/nexa-ai)

## Demo

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/538_aQ2hRexTlXFL-cYhW.mp4"></video>

## Performance Benchmarks on Consumer Hardware
On a 2024 Mac Mini M4 Pro, **Qwen2-Audio-7B-Instruct** running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while **OmniAudio-2.6B** through Nexa SDK reaches 35.23 tokens/second with the FP16 GGUF build and 66 tokens/second with the Q4_K_M quantized GGUF build - a **5.5x to 10.3x speedup** (35.23 / 6.38 ≈ 5.5; 66 / 6.38 ≈ 10.3) on consumer hardware.

## Use Cases
* **Voice QA without Internet**: Process voice queries offline, like "I am camping - how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
* **Voice-in Conversation**: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
* **Creative Content Generation**: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
* **Recording Summary**: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
* **Voice Tone Modification**: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.


## How to Use On Device
Step 1: Install Nexa-SDK (local on-device inference framework)

[🚀 Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> ***Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.***

Step 2: Run the following command in your terminal
```bash
nexa run omniaudio -st
```

💻 The OmniAudio-2.6B Q4_K_M build requires 1.30GB of RAM and 1.60GB of storage.
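If you would rather script against the model than use the Streamlit UI, Nexa-SDK also installs as a Python package. The snippet below is only a sketch: the import path, the class name `NexaAudioLMInference`, the method name, and the audio file name are assumptions about the SDK's Python surface, so verify them against the Nexa-SDK README before relying on it.

```python
# Sketch of programmatic use via the Nexa-SDK Python package.
# NOTE: import path, class name, and method are assumptions modeled on the
# SDK's other inference wrappers - check the Nexa-SDK docs for the exact API.
from nexa.gguf import NexaAudioLMInference  # hypothetical import

# "omniaudio" is the same model identifier used by `nexa run omniaudio`.
model = NexaAudioLMInference(model_path="omniaudio")

# Ask a question about a local recording (hypothetical file name).
response = model.inference(
    audio_path="meeting_note.wav",
    prompt="Can you summarize this meeting note?",
)
print(response)
```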

## Training
We developed OmniAudio through a three-stage training pipeline:
* **Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token to let the model distinguish transcription tasks from completion tasks, ensuring consistent performance across use cases.
* **Supervised Fine-tuning (SFT):** We enhance the model's conversational capabilities using synthetic datasets derived from the MLS English 10k transcriptions. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
* **Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma-2's text responses as a gold standard to ensure consistent quality across both audio and text inputs; the underlying preference objective is sketched after this list.
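For readers unfamiliar with DPO, the stage above builds on the standard preference objective of Rafailov et al. (2023), shown below. The card does not publish the exact setup, so read the symbols generically: $x$ is the (audio, text) prompt, $y_w$ the preferred response (GPT-4o-checked, or Gemma-2's text answer as the gold standard), $y_l$ the dispreferred one, $\pi_{\text{ref}}$ the SFT checkpoint, $\sigma$ the logistic function, and $\beta$ a KL-strength hyperparameter whose value is not published for this model.

$$
\mathcal{L}_{\text{DPO}} \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$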

## What's Next for OmniAudio?
OmniAudio is in active development, and we are working to advance its capabilities:
* Building direct audio generation for two-way voice communication
* Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration
  
In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.

## Join Community
[Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)