Update README.md #1
by ivanfioravanti - opened

README.md CHANGED
@@ -34,12 +34,23 @@ Currently optimized for English, French, and German.
 
 ## Using MLX
 
+Real audio streaming:
+
 ```bash
 pip install -U mlx-audio
-
+mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
 --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
 ```
 
+Voice cloning:
+
+```bash
+mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
+--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." --ref_audio ./conversational_a.wav
+```
+
+You can pass any audio to clone the voice from, or download a sample audio file from [here](https://huggingface.co/mlx-community/csm-1b/tree/main/prompts).
+
 # Model Description
 
 Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach:
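The two commands added in this diff differ only in the trailing `--ref_audio` flag, so they are easy to fold into one helper. Below is a minimal bash sketch using only the CLI and flags shown above; the `marvis_say` function name is made up here, and it assumes the `mlx_audio.tts.generate` console script from the install step is on PATH:

```bash
#!/usr/bin/env bash
# Wrapper around the mlx_audio CLI invocations shown in this diff.
# Usage: marvis_say "some text" [reference.wav]
marvis_say() {
  local text="$1"
  local ref_audio="$2"   # optional: reference audio for voice cloning

  local cmd=(mlx_audio.tts.generate
             --model Marvis-AI/marvis-tts-250m-v0.2
             --stream
             --text "$text")

  if [[ -n "$ref_audio" ]]; then
    cmd+=(--ref_audio "$ref_audio")
  fi

  "${cmd[@]}"
}

# Real audio streaming with the default voice
marvis_say "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."

# Voice cloning from a local reference sample
marvis_say "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." ./conversational_a.wav
```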