metadata

license: apache-2.0
language:
  - en
tags:
  - audio-text-to-text
  - chat
  - audio
  - GGUF

OmniAudio-2.6B


OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment: a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components (Gemma-2-2b, Whisper turbo, and a custom projector module) to enable secure, responsive audio-text processing directly on edge devices.

Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
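
The source does not describe the projector's internals. As a rough illustration of what a projector between an audio encoder and a language model typically does, the sketch below assumes a single linear layer, toy dimensions, and audio embeddings placed before the text embeddings. All three are assumptions, and every function name is hypothetical rather than taken from OmniAudio.

```python
import random


def make_projector(in_dim, out_dim, seed=0):
    """Return a random (out_dim x in_dim) weight matrix and a zero bias.

    Assumption: the projector is one linear layer. The real module may differ.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(frames, weights, bias):
    """Map each audio frame embedding into the language model's embedding width."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weights, bias)]
        for frame in frames
    ]


def build_input_sequence(audio_frames, text_embeddings, weights, bias):
    """Concatenate projected audio embeddings with text token embeddings.

    Assumption: audio comes first. The source does not state the ordering.
    """
    return project(audio_frames, weights, bias) + text_embeddings


# Toy sizes for illustration only; real encoder and decoder widths are far larger.
AUDIO_DIM, TEXT_DIM = 4, 6
weights, bias = make_projector(AUDIO_DIM, TEXT_DIM)
audio_frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.0, 0.0, 0.0, 0.0]]
text_embeddings = [[0.0] * TEXT_DIM for _ in range(2)]

sequence = build_input_sequence(audio_frames, text_embeddings, weights, bias)
print(len(sequence), len(sequence[0]))  # 5 6
```

The point of the sketch is the shape change: three audio frames of width 4 become three vectors of width 6, so the decoder can consume them in the same sequence as the two text embeddings without a separate ASR step producing text in between.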

Quick Links

Interactive Demo in our HuggingFace Space
Quickstart for local setup
Learn more in our Blogs

Demo

Performance Benchmarks on Consumer Hardware

On a 2024 Mac Mini M4 Pro, Qwen2-Audio-7B-Instruct running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second. OmniAudio-2.6B running through Nexa SDK reaches 35.23 tokens/second in its FP16 GGUF version and 66 tokens/second in its Q4_K_M quantized GGUF version, which is 5.5x to 10.3x faster on the same hardware.
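
The speedup range follows directly from the throughput figures quoted above. The short check below reproduces the arithmetic; the variable and function names are illustrative only.

```python
# Throughput figures quoted above, in tokens per second.
baseline_tps = 6.38    # Qwen2-Audio-7B-Instruct on Transformers
omni_fp16_tps = 35.23  # OmniAudio-2.6B, FP16 GGUF, via Nexa SDK
omni_q4_tps = 66.0     # OmniAudio-2.6B, Q4_K_M GGUF, via Nexa SDK


def speedup(candidate_tps, reference_tps):
    """Ratio of decoding speeds; values above 1 mean the candidate is faster."""
    return candidate_tps / reference_tps


fp16_speedup = speedup(omni_fp16_tps, baseline_tps)
q4_speedup = speedup(omni_q4_tps, baseline_tps)

print(f"FP16 GGUF:   {fp16_speedup:.1f}x")  # FP16 GGUF:   5.5x
print(f"Q4_K_M GGUF: {q4_speedup:.1f}x")    # Q4_K_M GGUF: 10.3x
```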

Use Cases

Voice QA without Internet: Process offline voice queries like "I am camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
Voice-in Conversation: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
Creative Content Generation: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
Recording Summary: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
Voice Tone Modification: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.

How to Use On Device

Step 1: Install Nexa-SDK (local on-device inference framework)

🚀 Install Nexa-SDK

Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed as a Python package or through an executable installer.

Step 2: Run the following command in your terminal

nexa run omniaudio -st

💻 The Q4_K_M version of OmniAudio-2.6B requires 1.30GB RAM and 1.60GB storage space.
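
Before running the command, it can help to confirm that the `nexa` executable is on your PATH and that enough disk space is free. The sketch below is a minimal preflight check using only the Python standard library. It reads the 1.60GB figure as decimal gigabytes and checks the current directory's volume, because the source does not say where the model file is stored; both choices are assumptions, and the helper names are hypothetical.

```python
import shutil

# 1.60GB storage requirement from the note above, read as decimal gigabytes (assumption).
REQUIRED_STORAGE_BYTES = 1_600_000_000
# The command from Step 2, split into arguments.
NEXA_COMMAND = ["nexa", "run", "omniaudio", "-st"]


def has_enough_storage(path, required_bytes=REQUIRED_STORAGE_BYTES):
    """Return True if the volume containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def preflight(path="."):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    if shutil.which(NEXA_COMMAND[0]) is None:
        problems.append("nexa executable not found on PATH; install Nexa-SDK first")
    if not has_enough_storage(path):
        problems.append("less than 1.60GB free on the volume that was checked")
    return problems


if __name__ == "__main__":
    issues = preflight()
    if issues:
        for issue in issues:
            print("Preflight issue:", issue)
    else:
        print("Checks passed. Run:", " ".join(NEXA_COMMAND))
```

The script only reports; it never launches the model itself, so it is safe to run on a machine where Nexa-SDK is not yet installed.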

Training

We developed OmniAudio through a three-stage training pipeline:

Pretraining: The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special <|transcribe|> token so the model can distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
Supervised Fine-tuning (SFT): We enhance the model's conversation capabilities using synthetic datasets derived from MLS English 10k transcription. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
Direct Preference Optimization (DPO): The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
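
For readers unfamiliar with the third stage, the sketch below shows the per-example loss in the standard DPO formulation. The source does not state the exact objective, the value of beta, or how preference pairs were built, so the numbers here are purely illustrative. Note that "reference" in the code means the frozen pre-DPO model used inside the loss, which is a different sense from the GPT-4o reference mentioned above.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response.
    beta=0.1 is a common default, not a value taken from the source.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


# Policy identical to the reference model: the loss equals log(2).
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(f"{untrained:.4f}")  # 0.6931
print(improved < untrained)  # True
```

The loss falls as the policy assigns relatively more probability to the preferred response than the frozen reference model does, which is how a preference signal (here, corrected versus inaccurate responses) shapes the final model.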

What's Next for OmniAudio?

OmniAudio is in active development and we are working to advance its capabilities:

Building direct audio generation for two-way voice communication
Implementing function calling support via Octopus_v2 integration

In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.

Join Community

Discord | X(Twitter)