ylacombe committed
Commit b9686d5 • 1 Parent(s): 9a64251

Update README.md

Files changed (1):
  1. README.md +38 -24
README.md CHANGED

@@ -25,25 +25,6 @@ datasets:
  - ylacombe/cml-tts-filtered-annotated
  - PHBJT/cml-tts-filtered
  ---
-
- {
- "2450": "Mark",
- "496": "Jessica",
- "3060": "Daniel",
- "12709": "Christine",
- "1897": "Christopher",
- "10148": "Nicole",
- "4998": "Richard",
- "4649": "Julia",
- "6892": "Alex",
- "7014": "Natalie",
- "4367": "Nicholas",
- "2961": "Sophia",
- "3946": "Steven",
- "10246": "Olivia",
- "11772": "Megan",
- "4174": "Michelle"
- }
 
 <img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
 
@@ -56,6 +37,8 @@ datasets:
 **Parler-TTS Mini Multilingual v1.1** is a multilingual extension of [Parler-TTS Mini](https://huggingface.co/parler-tts/parler-tts-mini-v1.1).
 
+ 🚨 Compared to [Mini Multilingual v1](https://huggingface.co/parler-tts/parler-tts-mini-multilingual), this version was trained with consistent speaker names and a better format for descriptions. 🚨
+
 It is a fine-tuned version, trained on a [cleaned version](https://huggingface.co/datasets/PHBJT/cml-tts-filtered) of [CML-TTS](https://huggingface.co/datasets/ylacombe/cml-tts) and on the non-English subset of [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech).
 In all, this represents some 9,200 hours of non-English data. To retain English capabilities, we also added back the [LibriTTS-R English dataset](https://huggingface.co/datasets/parler-tts/libritts_r_filtered), some 580 hours of high-quality English data.
 
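(A minimal sketch, should you want to inspect the training mix named in the hunk above with the `datasets` library; the `"french"` config name and the column layout are assumptions, not something this commit specifies.)

```py
# Illustrative only: stream a few rows of the filtered CML-TTS data to see
# what the model was trained on. The "french" config name is an assumption;
# use whichever configs the hub repo actually exposes.
from datasets import load_dataset

cml = load_dataset("PHBJT/cml-tts-filtered", "french", split="train", streaming=True)
first_row = next(iter(cml))
print(first_row.keys())
```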
@@ -68,7 +51,8 @@ Thanks to its **better prompt tokenizer**, it can easily be extended to other la
 
 ## 📖 Quick Index
 * [👨‍💻 Installation](#👨‍💻-installation)
- * [🎯 Inference](#inference)
+ * [🎲 Using a random voice](#🎲-random-voice)
+ * [🎯 Using a specific speaker](#🎯-using-a-specific-speaker)
 * [Motivation](#motivation)
 * [Optimizing inference](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
 
@@ -84,10 +68,9 @@ Using Parler-TTS is as simple as "bonjour". Simply install the library once:
 pip install git+https://github.com/huggingface/parler-tts.git
 ```
 
- ### Inference
+ ### 🎲 Random voice
 
-
- **Parler-TTS** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
+ **Parler-TTS Mini Multilingual** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:
 
 ```py
 import torch
@@ -101,7 +84,7 @@ model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts
 tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
 description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
 
- prompt = "Hey, how are you doing today?"
+ prompt = "Salut toi, comment vas-tu aujourd'hui?"
 description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
 
 input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
@@ -112,6 +95,37 @@ audio_arr = generation.cpu().numpy().squeeze()
 sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
 ```
 
+ ### 🎯 Using a specific speaker
+
+ To ensure speaker consistency across generations, this checkpoint was also trained on 16 speakers, characterized by name (e.g. Daniel, Christine, Richard, Nicole, ...).
+
+ To take advantage of this, simply adapt your text description to specify which speaker to use: `Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`
+
+ ```py
+ import torch
+ from parler_tts import ParlerTTSForConditionalGeneration
+ from transformers import AutoTokenizer
+ import soundfile as sf
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+ model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
+ tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
+ description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
+
+ prompt = "Salut toi, comment vas-tu aujourd'hui?"
+ description = "Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
+
+ input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
+ prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+
+ generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+ audio_arr = generation.cpu().numpy().squeeze()
+ sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
+ ```
+
+ You can choose a speaker from this list: Mark, Jessica, Daniel, Christine, Christopher, Nicole, Richard, Julia, Alex, Natalie, Nicholas, Sophia, Steven, Olivia, Megan and Michelle.
+
 **Tips**:
 * We've set up an [inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) to make generation faster. Think SDPA, torch.compile, batching and streaming!
 * Include the term "very clear audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise
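The inference guide linked in the first tip covers SDPA, torch.compile, batching and streaming. As a minimal sketch of the SDPA part, assuming the checkpoint accepts the standard transformers `attn_implementation` and `torch_dtype` loading arguments:

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

# Minimal sketch: ask for PyTorch's scaled-dot-product attention and a lighter
# dtype at load time -- two of the optimizations the inference guide discusses.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-multilingual-v1.1",
    attn_implementation="sdpa",  # assumption: checkpoint supports this argument
    torch_dtype=dtype,
).to(device)
```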
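And a minimal sketch of the second tip: the same named-speaker description with the two quality phrases swapped in, so their effect can be compared directly. The wording around the keywords is illustrative, not from this commit.

```py
# Illustrative only: contrast the two audio-quality keywords from the tips.
base = "Daniel's voice is monotone yet slightly fast in delivery."

clean_description = f"{base} The recording features very clear audio."
noisy_description = f"{base} The recording features very noisy audio."
# Pass either string as `description` in the snippets above to hear the difference.
```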