Update README.md

README.md
@@ -20,6 +20,7 @@ On a 2024 Mac Mini M4 Pro using Q4_K_M quantized GGUF model, **Qwen2-Audio-7B**
@@ -40,12 +41,16 @@ nexa run omniaudio -st

1. Interactive Demo in our [HuggingFace Space]().
2. [Quickstart for local setup]()
3. Learn more in our [Blogs]()
4. **Feedback**: Send questions or suggestions about the model in our [Discord](https://discord.gg/nexa-ai)

## Use Cases

* **Voice QA without Internet**: Process offline voice queries like "I am out camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
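The second hunk's context line shows the local quickstart command, `nexa run omniaudio -st`. As a small illustrative wrapper (only the `nexa run omniaudio` command and the `-st` flag come from this page; the helper names and everything else are assumptions), the invocation could be built and launched from Python:

```python
import subprocess

def omniaudio_cmd(streamlit_ui: bool = True) -> list[str]:
    # "nexa run omniaudio" is taken verbatim from the README quickstart;
    # -st is likewise copied from it (per the Nexa SDK CLI this appears to
    # launch the Streamlit demo UI; check the Nexa docs for its exact meaning).
    cmd = ["nexa", "run", "omniaudio"]
    if streamlit_ui:
        cmd.append("-st")
    return cmd

def run_omniaudio(streamlit_ui: bool = True) -> None:
    # Requires the Nexa SDK to be installed and on PATH.
    subprocess.run(omniaudio_cmd(streamlit_ui), check=True)
```

Calling `run_omniaudio()` launches the same session you would get by typing the command into a shell.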
## Training

We developed OmniAudio through a three-stage training pipeline:

* **Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token to let the model distinguish transcription from completion tasks, ensuring consistent performance across use cases.
* **Supervised Fine-tuning (SFT):** We enhance the model's conversational capabilities using synthetic datasets derived from MLS English 10k transcriptions. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
* **Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally use Gemma-2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
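As a rough sketch of what the DPO stage optimizes (this is the standard DPO objective in its simplest per-pair form, not the authors' training code; the function name, inputs, and the beta value are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response under
    the trainable policy or the frozen reference model; beta scales the
    implicit reward derived from the policy/reference log-ratio.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(margin)): the loss shrinks as the policy widens the gap
    # between preferred and dispreferred responses relative to the reference.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In OmniAudio's setup, the "chosen" side would correspond to the reference-quality answers (GPT-4o-checked, or Gemma-2 gold-standard text responses) and the "rejected" side to the inaccurate responses being corrected.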

## What's Next for OmniAudio?

OmniAudio is in active development, and we are working to advance its capabilities:

* Building direct audio generation for two-way voice communication
* Implementing function-calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration

In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.

## Join Community

[Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/nexa_ai)