- ylacombe/cml-tts-filtered-annotated
- PHBJT/cml-tts-filtered
---

<img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

**Parler-TTS Mini Multilingual v1.1** is a multilingual extension of [Parler-TTS Mini](https://huggingface.co/parler-tts/parler-tts-mini-v1.1).

🚨 As compared to [Mini Multilingual v1](https://huggingface.co/parler-tts/parler-tts-mini-multilingual), this version was trained with a set of consistent speaker names and a better format for descriptions. 🚨

It is a fine-tuned version, trained on a [cleaned version](https://huggingface.co/datasets/PHBJT/cml-tts-filtered) of [CML-TTS](https://huggingface.co/datasets/ylacombe/cml-tts) and on the non-English version of [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech).
In all, this represents some 9,200 hours of non-English data. To retain English capabilities, we also added back the [LibriTTS-R English dataset](https://huggingface.co/datasets/parler-tts/libritts_r_filtered), some 580 hours of high-quality English data.

Thanks to its **better prompt tokenizer**, it can easily be extended to other languages.

## 📖 Quick Index
* [👨‍💻 Installation](#👨‍💻-installation)
* [🎲 Using a random voice](#🎲-random-voice)
* [🎯 Using a specific speaker](#🎯-using-a-specific-speaker)
* [Motivation](#motivation)
* [Optimizing inference](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)

## 👨‍💻 Installation

Using Parler-TTS is as simple as "bonjour". Simply install the library once:

```
pip install git+https://github.com/huggingface/parler-tts.git
```

### 🎲 Random voice

**Parler-TTS Mini Multilingual** has been trained to generate speech with features that can be controlled with a simple text prompt, for example:

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Salut toi, comment vas-tu aujourd'hui?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

### 🎯 Using a specific speaker

To ensure speaker consistency across generations, this checkpoint was also trained on 16 speakers, characterized by name (e.g. Daniel, Christine, Richard, Nicole, ...).

To take advantage of this, simply adapt your text description to specify which speaker to use: `Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.`

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Salut toi, comment vas-tu aujourd'hui?"
description = "Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

You can choose a speaker from this list: Mark, Jessica, Daniel, Christine, Christopher, Nicole, Richard, Julia, Alex, Natalie, Nicholas, Sophia, Steven, Olivia, Megan and Michelle.
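Since the speaker is selected purely through the free-text description, switching voices is a one-line change. A minimal sketch of this pattern (the `SPEAKERS` constant and `build_description` helper below are ours, not part of the library) that validates a name against the list above before building a description:

```python
# The 16 named speakers this checkpoint was trained on (hypothetical helper, not library code).
SPEAKERS = [
    "Mark", "Jessica", "Daniel", "Christine", "Christopher", "Nicole",
    "Richard", "Julia", "Alex", "Natalie", "Nicholas", "Sophia",
    "Steven", "Olivia", "Megan", "Michelle",
]

def build_description(speaker: str, style: str) -> str:
    """Prefix a free-text style description with a known speaker name."""
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker {speaker!r}; choose one of {SPEAKERS}")
    return f"{speaker}'s voice {style}"

description = build_description(
    "Nicole",
    "is monotone yet slightly fast in delivery, with a very close recording "
    "that almost has no background noise.",
)
```

The resulting string is then passed to `description_tokenizer` exactly as in the snippet above.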

**Tips**:
* We've set up an [inference guide](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) to make generation faster. Think SDPA, torch.compile, batching and streaming!
* Include the term "very clear audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise.
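Following the second tip, the quality cue can be appended to any description programmatically. A minimal sketch (the `with_quality` helper is ours, and appending the cue as a trailing sentence is our assumption, not a documented requirement):

```python
def with_quality(description: str, clear: bool = True) -> str:
    """Append the quality cue from the tips above:
    'very clear audio' for clean output, 'very noisy audio' for heavy background noise."""
    cue = "very clear audio" if clear else "very noisy audio"
    return f"{description} The recording is {cue}."

clean_desc = with_quality("A female speaker delivers an animated speech.")
noisy_desc = with_quality("A female speaker delivers an animated speech.", clear=False)
```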