alanzhuly committed
Commit df04ba4
1 Parent(s): 2d794ab

Update README.md

Files changed (1): README.md (+9 −4)
README.md CHANGED
@@ -20,6 +20,7 @@ On a 2024 Mac Mini M4 Pro using a Q4_K_M quantized GGUF model, **Qwen2-Audio-7B**
 1. Interactive Demo in our [HuggingFace Space]().
 2. [Quickstart for local setup]()
 3. Learn more in our [Blogs]()
+4. **Feedback**: Send questions or suggestions about the model in our [Discord](https://discord.gg/nexa-ai)
 
 ## Use Cases
 * **Voice QA without Internet**: Process offline voice queries like "I am camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
@@ -40,12 +41,16 @@ nexa run omniaudio -st
 
 ## Training
 We developed OmniAudio through a three-stage training pipeline:
-**Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token to enable the model to distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
-**Supervised Fine-tuning (SFT):** We enhance the model's conversational capabilities using synthetic datasets derived from the MLS English 10k transcriptions. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
-**Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally leverage Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
+
+* **Pretraining:** The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special `<|transcribe|>` token to enable the model to distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
+* **Supervised Fine-tuning (SFT):** We enhance the model's conversational capabilities using synthetic datasets derived from the MLS English 10k transcriptions. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
+* **Direct Preference Optimization (DPO):** The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally leverage Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
 
 ## What's Next for OmniAudio?
 OmniAudio is in active development, and we are working to advance its capabilities:
 * Building direct audio generation for two-way voice communication
 * Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration
-In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.
+In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.
+
+## Join Community
+[Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/nexa_ai)
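A note on the `<|transcribe|>` control token introduced in the Training section above: the commit does not show OmniAudio's actual prompt template, but the mode-switching idea it describes can be sketched as follows (all names below are hypothetical illustrations, not repository code).

```python
# Minimal illustration of a mode-switching control token. The real
# OmniAudio template and audio encoding are not part of this commit;
# AUDIO_PLACEHOLDER and build_prompt are hypothetical names.

AUDIO_PLACEHOLDER = "<|audio|>"  # stand-in for the encoded audio clip

def build_prompt(task: str) -> str:
    if task == "transcribe":
        # With the control token, the model should emit a verbatim transcript.
        return AUDIO_PLACEHOLDER + "<|transcribe|>"
    # Without it, the model treats the audio as a query to answer.
    return AUDIO_PLACEHOLDER

print(build_prompt("transcribe"))  # <|audio|><|transcribe|>
print(build_prompt("qa"))          # <|audio|>
```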
 
 
 
 
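The SFT stage's data synthesis can be sketched the same way: keep each MLS clip as the audio input, but train against a teacher model's conversational reply rather than the raw transcript. The `teacher_generate` stub below is a placeholder, since the commit does not name the proprietary model or its API.

```python
# Sketch of the SFT data synthesis described above: the audio stays the
# input, and a teacher model's reply becomes the training target instead
# of the verbatim transcript. teacher_generate is a stub, not a real API.

def teacher_generate(transcript: str) -> str:
    # Placeholder for a call to the unnamed proprietary teacher model.
    return f"[contextually appropriate reply to: {transcript}]"

def make_sft_example(audio_path: str, transcript: str) -> dict:
    return {"audio": audio_path, "target_text": teacher_generate(transcript)}

example = make_sft_example("mls_english_10k/clip_0001.flac",
                           "how do I start a fire without a fire starter")
print(example)
```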
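Finally, the DPO stage optimizes the standard direct-preference objective (Rafailov et al., 2023) over chosen/rejected response pairs, with the GPT-4o-referenced corrections supplying the preferences. A generic sketch of that loss, not OmniAudio's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the 'chosen'
    response over the 'rejected' one, relative to a frozen reference."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# Toy batch of two preference pairs with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-4.1, -3.7]), torch.tensor([-5.0, -4.9]),
                torch.tensor([-4.3, -3.9]), torch.tensor([-4.8, -4.6]))
print(f"{loss.item():.4f}")
```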