wetdog committed on
Commit
63571fb
1 Parent(s): 96c711d

Update README.md

Files changed (1)
  1. README.md +12 -16
README.md CHANGED
@@ -3,18 +3,17 @@ language:
  - ca
  licence:
  - apache-2.0
  tags:
  - matcha-tts
  - acoustic modelling
  - speech
  - multispeaker
  pipeline_tag: text-to-speech
- datasets:
- - projecte-aina/festcat_trimmed_denoised
- - projecte-aina/openslr-slr69-ca-trimmed-denoised
  ---

- # Matcha-TTS Catalan Multispeaker

  ## Table of Contents
  <details>
@@ -53,7 +52,7 @@ This may be due to the sensitivity of the model in learning specific frequencies

  ### Installation

- This model has been trained using the espeak-ng open source text-to-speech software.
  The espeak-ng fork containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng).

  Create a virtual environment:
@@ -123,11 +122,10 @@ python3 matcha_vocos_inference.py --output_path=/output/path --text_input="Bon d

  #### ONNX

- We also release a ONNX version of the model

  ### For Training
-
- The entire checkpoint is also released to continue training or finetuning.
  See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-cat)

@@ -135,25 +133,23 @@ See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-

  ### Training data

- The model was trained on 2 **Catalan** speech datasets

  | Dataset | Language | Hours | Num. Speakers |
  |---------------------|----------|---------|-----------------|
- | [Festcat](https://huggingface.co/datasets/projecte-aina/festcat_trimmed_denoised) | ca | 22 | 11 |
- | [OpenSLR69](https://huggingface.co/datasets/projecte-aina/openslr-slr69-ca-trimmed-denoised) | ca | 5 | 36 |

  ### Training procedure

- ***Catalan Matcha-TTS*** was finetuned from the English multispeaker checkpoint,
- which was trained with the [VCTK dataset](https://huggingface.co/datasets/vctk) and provided by the model authors.

- The embedding layer was initialized with the number of catalan speakers (47) and the original hyperparameters were kept.

  ### Training Hyperparameters

  * batch size: 32 (x2 GPUs)
  * learning rate: 1e-4
- * number of speakers: 47
  * n_fft: 1024
  * n_feats: 80
  * sample_rate: 22050
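As a quick cross-check of the figures above, the per-dataset speaker counts in the removed table sum exactly to the 47 speakers the old README used to size the embedding layer; a minimal sketch:

```python
# Per-dataset stats from the (now removed) Catalan training-data table.
datasets = {
    "Festcat": {"hours": 22, "speakers": 11},
    "OpenSLR69": {"hours": 5, "speakers": 36},
}

total_speakers = sum(d["speakers"] for d in datasets.values())  # 47
total_hours = sum(d["hours"] for d in datasets.values())        # 27
print(total_speakers, total_hours)  # prints: 47 27
```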
@@ -174,7 +170,7 @@ Validation values obtained from tensorboard from epoch 2399*:

  * val_prior_loss_epoch: 0.97
  * val_diff_loss_epoch: 2.195

- (Note that the finetuning started from epoch 1864, as previous ones were trained with VCTK dataset)

  ## Citation

  - ca
  licence:
  - apache-2.0
+ base_model: BSC-LT/matcha-tts-cat-multispeaker
  tags:
  - matcha-tts
  - acoustic modelling
  - speech
  - multispeaker
  pipeline_tag: text-to-speech
+
  ---

+ # Matcha-TTS Catalan Multiaccent

  ## Table of Contents
  <details>
 

  ### Installation

+ The models were trained using the espeak-ng open source text-to-speech software.
  The espeak-ng fork containing the Catalan phonemizer can be found [here](https://github.com/projecte-aina/espeak-ng).

  Create a virtual environment:
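The virtual-environment step referenced here might look like the sketch below; the environment name `matcha-venv` is illustrative, and the actual dependency-install commands live in the repo instructions linked above:

```shell
# Create and activate a virtual environment (name is illustrative).
python3 -m venv matcha-venv
source matcha-venv/bin/activate
python -V  # the venv's interpreter is now first on PATH
```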
 

  #### ONNX

+ We also release ONNX versions of the models.

  ### For Training
+

  See the [repo instructions](https://github.com/langtech-bsc/Matcha-TTS/tree/dev-cat)

 

  ### Training data

+ The model was trained on a **Multiaccent Catalan** speech dataset:

  | Dataset | Language | Hours | Num. Speakers |
  |---------------------|----------|---------|-----------------|
+ | Lafrescat (coming soon) | ca | 3.5 | 8 |

  ### Training procedure

+ ***Multiaccent Catalan Matcha-TTS*** was finetuned from the Catalan Central [multispeaker checkpoint](https://huggingface.co/BSC-LT/matcha-tts-cat-multispeaker).

+ The embedding layer was initialized with the number of Catalan speakers per accent (2), and the original hyperparameters were kept.

  ### Training Hyperparameters

  * batch size: 32 (x2 GPUs)
  * learning rate: 1e-4
+ * number of speakers: 2
  * n_fft: 1024
  * n_feats: 80
  * sample_rate: 22050
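The embedding-layer re-initialization described above can be pictured with a pure-Python sketch. Only the speaker count (2 per accent) comes from the README; `init_speaker_embedding`, the embedding dimension (64), and the Gaussian init are illustrative assumptions, not values from the released checkpoint:

```python
import random

# Hypothetical sketch: size a fresh speaker-embedding table for finetuning.
# n_speakers=2 matches the per-accent speaker count; emb_dim and the
# init scheme are assumptions.
def init_speaker_embedding(n_speakers: int, emb_dim: int = 64, seed: int = 0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)]
            for _ in range(n_speakers)]

table = init_speaker_embedding(n_speakers=2)
# table[i] is the embedding vector for speaker id i.
```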
 
  * val_prior_loss_epoch: 0.97
  * val_diff_loss_epoch: 2.195

+

  ## Citation