Update README.md

README.md
```diff
@@ -3,18 +3,17 @@ language:
 - ca
 licence:
 - apache-2.0
+base_model: BSC-LT/matcha-tts-cat-multispeaker
 tags:
 - matcha-tts
 - acoustic modelling
 - speech
 - multispeaker
 pipeline_tag: text-to-speech
-datasets:
-- projecte-aina/festcat_trimmed_denoised
-- projecte-aina/openslr-slr69-ca-trimmed-denoised
+
 ---
 
-# Matcha-TTS Catalan
+# Matcha-TTS Catalan Multiaccent
 
 ## Table of Contents
 <details>
```
```diff
@@ -53,7 +52,7 @@ This may be due to the sensitivity of the model in learning specific frequencies
 
 ### Installation
 
-
+Models have been trained using the espeak-ng open source text-to-speech software.
 The espeak-ng containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng)
 
 Create a virtual environment:
```
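As a quick sanity check that the patched espeak-ng build is being picked up, a minimal sketch using the `phonemizer` package is shown below; the package choice and the snippet itself are assumptions, not part of the card:

```python
# Hedged sketch: verify Catalan phonemization through espeak-ng.
# Assumes the projecte-aina espeak-ng build is installed and visible on
# the library path, and that the `phonemizer` package is available.
from phonemizer import phonemize

text = "Bon dia, com estàs?"
phones = phonemize(text, language="ca", backend="espeak", strip=True)
print(phones)  # IPA string produced by the Catalan phonemizer
```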
```diff
@@ -123,11 +122,10 @@ python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon d
 
 #### ONNX
 
-We also release
+We also release ONNXs version of the models
 
 ### For Training
-
-The entire checkpoint is also released to continue training or finetuning.
+
 See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-cat)
 
 
```
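The card only states that ONNX versions are released, so the following is a hedged sketch of loading one with `onnxruntime`; the file name and the input names (`x`, `x_lengths`, `scales`, `spks`) follow the upstream Matcha-TTS ONNX export and should be confirmed against the actual graph:

```python
# Hedged sketch: run the exported acoustic model with onnxruntime.
# Input names are assumptions based on the upstream Matcha-TTS export;
# inspect the graph first and adjust accordingly.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("matcha.onnx", providers=["CPUExecutionProvider"])
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)  # confirm the expected inputs

x = np.zeros((1, 10), dtype=np.int64)              # placeholder phoneme ids
x_lengths = np.array([10], dtype=np.int64)         # valid length of each row
scales = np.array([0.667, 1.0], dtype=np.float32)  # temperature, length_scale
spks = np.array([0], dtype=np.int64)               # speaker index
mel = session.run(None, {"x": x, "x_lengths": x_lengths,
                         "scales": scales, "spks": spks})[0]
print(mel.shape)  # mel spectrogram to pass to a vocoder such as Vocos
```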
```diff
@@ -135,25 +133,23 @@ See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-
 
 ### Training data
 
-The model was trained on
+The model was trained on a **Multiaccent Catalan** speech dataset
 
 | Dataset | Language | Hours | Num. Speakers |
 |---------------------|----------|---------|-----------------|
-| [
-| [OpenSLR69](https://huggingface.co/datasets/projecte-aina/openslr-slr69-ca-trimmed-denoised) | ca | 5 | 36 |
+| [Lafrescat comming soon]---() | ca | 3.5 | 8 |
 
 ### Training procedure
 
-***Catalan Matcha-TTS*** was finetuned from
-which was trained with the [VCTK dataset](https://huggingface.co/datasets/vctk) and provided by the model authors.
+***Multiaccent Catalan Matcha-TTS*** was finetuned from a catalan central [multispeaker checkpoint](https://huggingface.co/BSC-LT/matcha-tts-cat-multispeaker),
 
-The embedding layer was initialized with the number of catalan speakers (
+The embedding layer was initialized with the number of catalan speakers per accent (2) and the original hyperparameters were kept.
 
 ### Training Hyperparameters
 
 * batch size: 32 (x2 GPUs)
 * learning rate: 1e-4
-* number of speakers:
+* number of speakers: 2
 * n_fft: 1024
 * n_feats: 80
 * sample_rate: 22050
```
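The embedding re-initialization described in the new training-procedure text can be pictured with a short PyTorch sketch; the checkpoint path, state-dict key pattern, and embedding width below are illustrative assumptions, not the actual Matcha-TTS module paths:

```python
# Hedged sketch of the finetuning setup: keep the pretrained weights but
# re-initialize the speaker embedding for the new speaker count.
import torch
from torch import nn

ckpt = torch.load("matcha_multispeaker.ckpt", map_location="cpu")
state = ckpt["state_dict"]

# Drop the old speaker-embedding weights so they are not loaded into the
# new model, which only has 2 speakers per accent.
state = {k: v for k, v in state.items() if "spk_emb" not in k}

spk_emb = nn.Embedding(num_embeddings=2, embedding_dim=64)  # fresh init
# model.load_state_dict(state, strict=False)
# model.spk_emb = spk_emb  # attach before resuming training
```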
```diff
@@ -174,7 +170,7 @@ Validation values obtained from tensorboard from epoch 2399*:
 * val_prior_loss_epoch: 0.97
 * val_diff_loss_epoch: 2.195
 
-
+
 
 ## Citation
 
```