ylacombe committed
Commit b9686d5 • 1 Parent(s): 9a64251

Update README.md

Files changed (1):
  1. README.md +38 -24
README.md CHANGED

@@ -25,25 +25,6 @@ datasets:
  - ylacombe/cml-tts-filtered-annotated
  - PHBJT/cml-tts-filtered
  ---
-
- {
- "2450": "Mark",
- "496": "Jessica",
- "3060": "Daniel",
- "12709": "Christine",
- "1897": "Christopher",
- "10148": "Nicole",
- "4998": "Richard",
- "4649": "Julia",
- "6892": "Alex",
- "7014": "Natalie",
- "4367": "Nicholas",
- "2961": "Sophia",
- "3946": "Steven",
- "10246": "Olivia",
- "11772": "Megan",
- "4174": "Michelle"
- }
 
 <img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
 
@@ -56,6 +37,8 @@ datasets:
 **Parler-TTS Mini Multilingual v1.1** is a multilingual extension of [Parler-TTS Mini](https://huggingface.co/parler-tts/parler-tts-mini-v1.1).
 
+ 🚨 Compared to [Mini Multilingual v1](https://huggingface.co/parler-tts/parler-tts-mini-multilingual), this version was trained with consistent speaker names and a better format for descriptions. 🚨
+
 It is a fine-tuned version, trained on a [cleaned version](https://huggingface.co/datasets/PHBJT/cml-tts-filtered) of [CML-TTS](https://huggingface.co/datasets/ylacombe/cml-tts) and on the non-English subset of [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech).
 In all, this represents some 9,200 hours of non-English data. To retain English capabilities, we also added back the [LibriTTS-R English dataset](https://huggingface.co/datasets/parler-tts/libritts_r_filtered), some 580 hours of high-quality English data.
 
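(A minimal sketch, should you want to inspect the training mix named in the hunk above with the `datasets` library; the `"french"` config name and the column layout are assumptions, not something this commit specifies.)

```py
# Illustrative only: stream a few rows of the filtered CML-TTS data to see
# what the model was trained on. The "french" config name is an assumption;
# use whichever configs the hub repo actually exposes.
from datasets import load_dataset

cml = load_dataset("PHBJT/cml-tts-filtered", "french", split="train", streaming=True)
first_row = next(iter(cml))
print(first_row.keys())
```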
@@ -68,7 +51,8 @@ Thanks to its **better prompt tokenizer**, it can easily be extended to other la
 
 ## 📖 Quick Index
 * [👨‍💻 Installation](#👨‍💻-installation)
- * [🎯 Inference](#inference)
+ * [🎲 Using a random voice](#🎲-random-voice)
+ * [🎯 Using a specific speaker](#🎯-using-a-specific-speaker)
 * [Motivation](#motivation)
 * [Optimizing inference](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
 
@@ -84,10 +68,9 @@ Using Parler-TTS is as simple as "bonjour". Simply install the library once:
 pip install git+https://github.com/huggingface/parler-tts.git
 ```
 
- ### Inference
+ ### 🎲 Random voice
 
-
- **Parler-TTS** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
+ **Parler-TTS Mini Multilingual** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
 
 ```py
 import torch
@@ -101,7 +84,7 @@ model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts
 tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
 description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
 
- prompt = "Hey, how are you doing today?"
+ prompt = "Salut toi, comment vas-tu aujourd'hui?"
 description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
 
 input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
@@ -112,6 +95,37 @@ audio_arr = generation.cpu().numpy().squeeze()
 sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
 ```
 
+ ### 🎯 Using a specific speaker
+
+ To ensure speaker consistency across generations, this checkpoint was also trained on 16 speakers, characterized by name (e.g. Daniel, Christine, Richard, Nicole, ...).
+
+ To take advantage of this, simply adapt your text description to specify which speaker to use: `Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`
+
+ ```py
+ import torch
+ from parler_tts import ParlerTTSForConditionalGeneration
+ from transformers import AutoTokenizer
+ import soundfile as sf
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+ model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
+ tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
+ description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
+
+ prompt = "Salut toi, comment vas-tu aujourd'hui?"
+ description = "Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
+
+ input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
+ prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+
+ generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+ audio_arr = generation.cpu().numpy().squeeze()
+ sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
+ ```
+
+ You can choose a speaker from this list: Mark, Jessica, Daniel, Christine, Christopher, Nicole, Richard, Julia, Alex, Natalie, Nicholas, Sophia, Steven, Olivia, Megan and Michelle.
+
 **Tips**:
 * We've set up an [inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) to make generation faster. Think SDPA, torch.compile, batching and streaming!
 * Include the term "very clear audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise
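The inference guide linked in the first tip covers SDPA, torch.compile, batching and streaming. As a minimal sketch of the SDPA part, assuming the checkpoint accepts the standard transformers `attn_implementation` and `torch_dtype` loading arguments:

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

# Minimal sketch: ask for PyTorch's scaled-dot-product attention and a lighter
# dtype at load time -- two of the optimizations the inference guide discusses.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-multilingual-v1.1",
    attn_implementation="sdpa",  # assumption: checkpoint supports this argument
    torch_dtype=dtype,
).to(device)
```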
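And a minimal sketch of the second tip: the same named-speaker description with the two quality phrases swapped in, so their effect can be compared directly. The wording around the keywords is illustrative, not from this commit.

```py
# Illustrative only: contrast the two audio-quality keywords from the tips.
base = "Daniel's voice is monotone yet slightly fast in delivery."

clean_description = f"{base} The recording features very clear audio."
noisy_description = f"{base} The recording features very noisy audio."
# Pass either string as `description` in the snippets above to hear the difference.
```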