---
license: apache-2.0
language:
- en
tags:
- audio-text-to-text
- chat
- audio
- GGUF
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/d7Rzpm0cgCToXjtE7_U2u.png" alt="Example" style="width:400px;"/>

# OmniAudio-2.6B

OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment: a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components, **Gemma-2-2b**, **Whisper turbo**, and a custom projector module, to enable secure, responsive audio-text processing directly on edge devices.

Unlike traditional approaches that chain separate ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture, keeping latency and resource overhead to a minimum.

On a 2024 Mac Mini M4 Pro, using Q4_K_M quantized GGUF models, **Qwen2-Audio-7B** processes 1.69 tokens/second while OmniAudio-2.6B achieves 4.97 tokens/second, making it nearly **3x faster** on consumer hardware.

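The "nearly 3x" figure follows directly from the two throughput numbers above. A minimal sketch of the arithmetic, where `speedup` is a hypothetical helper name used only for illustration and is not part of any tool mentioned here:

```bash
# speedup FAST SLOW: ratio of two throughput figures, to two decimals.
# Hypothetical helper for illustration; not part of Nexa-SDK.
speedup() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# Throughput figures (tokens/second) quoted in the benchmark above.
speedup 4.97 1.69   # prints 2.94
```
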
## Quick Links

1. Interactive Demo in our [HuggingFace Space]().
2. [Quickstart for local setup]()
3. Learn more in our [Blogs]()

## Use Cases

* **Voice QA without Internet**: Process offline voice queries like "I am camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
* **Voice-in Conversation**: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive conversation and active listening.
* **Creative Content Generation**: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
* **Recording Summary**: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
* **Voice Tone Modification**: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.

## Run OmniAudio-2.6B on Your Device

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> ***Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via Python package or executable installer.***

**Step 2: Run the following command in your terminal**

```bash
nexa run omniaudio -st
```

💻 The q4_K_M version of OmniAudio-2.6B requires 1.30GB of RAM and 1.60GB of storage space.

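Before the first run, it can help to confirm that the `nexa` executable is on your `PATH` and that there is room for the 1.60GB model. The sketch below assumes a POSIX shell with `df` and `awk`. The function names are hypothetical, and checking the filesystem of a directory you pass in is an assumption, since this page does not say where Nexa-SDK stores downloaded models.

```bash
# Storage needed by the q4_K_M model per the note above: 1.60 GB in KB
# (1 GB taken as 1024 * 1024 KB, rounded up). Illustrative constant.
REQUIRED_KB=1677722

# has_space AVAILABLE_KB [REQUIRED_KB]: succeed if enough space is free.
has_space() {
  [ "$1" -ge "${2:-$REQUIRED_KB}" ]
}

# free_kb DIR: print the free space (KB) on the filesystem holding DIR.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# preflight DIR: check that nexa is installed and DIR has enough room.
preflight() {
  if ! command -v nexa >/dev/null 2>&1; then
    echo "nexa not found on PATH; install Nexa-SDK first" >&2
    return 1
  fi
  if ! has_space "$(free_kb "$1")"; then
    echo "less than 1.60 GB free under $1" >&2
    return 1
  fi
  echo "ready: run 'nexa run omniaudio -st'"
}
```

Calling `preflight .` checks the filesystem of the current directory. Point it at wherever models are stored on your machine.
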
## Training

We developed OmniAudio through a three-stage training pipeline:

**Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token that lets the model distinguish between transcription and completion tasks, ensuring consistent performance across use cases.

**Supervised Fine-tuning (SFT):** We enhance the model's conversation capabilities using synthetic datasets derived from MLS English 10k transcriptions. This stage uses a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.

**Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.

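For readers unfamiliar with DPO, the objective in its commonly stated general form is shown below. This is background only: this page does not say which DPO variant, reference policy, or value of $\beta$ was used for OmniAudio, so every symbol here is an assumption about the general method rather than a description of the actual training setup.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $x$ is the input, $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_\theta$ is the model being trained, $\sigma$ is the logistic function, and $\beta$ controls how far $\pi_\theta$ may drift from $\pi_{\mathrm{ref}}$. Note that $\pi_{\mathrm{ref}}$ denotes the frozen starting policy, which is a different sense of "reference" from the use of GPT-4o described above.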
## What's Next for OmniAudio?

OmniAudio is in active development, and we are working to advance its capabilities:

* Building direct audio generation for two-way voice communication
* Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration

In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.