Update README.md #1
by ivanfioravanti - opened

README.md CHANGED
@@ -34,12 +34,23 @@ Currently optimized for English, French, and German.
 
 ## Using MLX
 
+Real audio streaming:
+
 ```bash
 pip install -U mlx-audio
-
+mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
 --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
 ```
 
+Voice cloning:
+
+```bash
+mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.2 --stream \
+--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." --ref_audio ./conversational_a.wav
+```
+
+You can pass any audio to clone the voice from, or download a sample audio file from [here](https://huggingface.co/mlx-community/csm-1b/tree/main/prompts).
+
 # Model Description
 
 Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach:
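The two commands added in this diff differ only in the trailing `--ref_audio` flag, so they are easy to fold into one helper. Below is a minimal bash sketch using only the CLI and flags shown above; the `marvis_say` function name is made up here, and it assumes the `mlx_audio.tts.generate` console script from the install step is on PATH:

```bash
#!/usr/bin/env bash
# Wrapper around the mlx_audio CLI invocations shown in this diff.
# Usage: marvis_say "some text" [reference.wav]
marvis_say() {
  local text="$1"
  local ref_audio="$2"   # optional: reference audio for voice cloning

  local cmd=(mlx_audio.tts.generate
             --model Marvis-AI/marvis-tts-250m-v0.2
             --stream
             --text "$text")

  if [[ -n "$ref_audio" ]]; then
    cmd+=(--ref_audio "$ref_audio")
  fi

  "${cmd[@]}"
}

# Real audio streaming with the default voice
marvis_say "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."

# Voice cloning from a local reference sample
marvis_say "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." ./conversational_a.wav
```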