
StyleTTS2-lite

Base Model

If you have a better checkpoint, your contribution would be greatly appreciated!

English:

Vietnamese:

Model Component Parameter Summary

Component               Parameters   Used in Inference
Decoder                 54,289,492   ✅ Yes
Predictor               16,194,612   ✅ Yes
Style Encoder           13,845,440   ✅ Yes
Text Encoder             5,606,400   ✅ Yes
Text Aligner (ASR)       7,865,252   ❌ No
Pitch Extractor (JDC)    5,248,067   ❌ No
MPD (Discriminator)     41,105,770   ❌ No
MSD (Discriminator)        280,902   ❌ No
Total                  144,435,935
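
Per-component counts like these can be reproduced with a generic PyTorch helper. This is a minimal sketch; the dictionary layout in the commented usage is an assumption for illustration, not this repo's actual API:

import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    # Sum the element counts of every parameter tensor in the module.
    return sum(p.numel() for p in module.parameters())

# Hypothetical usage, assuming `model` maps component names to nn.Module objects:
# for name, net in model.items():
#     print(f"{name}: {count_parameters(net):,}")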

How To Start Finetuning

1. Install The Requirements

pip install .
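
This assumes you are in the repository root. If you have not cloned the repo yet, the full sequence looks something like this (the repository URL is a placeholder):

git clone <repo-url>
cd StyleTTS2-lite
pip install .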

2. Download Model And Config File

Download and place the base model in /Models/Finetune and the corresponding config file in /Configs.
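
For example, the resulting layout might look like this (the checkpoint and config filenames are placeholders; use whatever names the download provides):

Models/
└── Finetune/
    └── base_model.pth   # downloaded base checkpoint
Configs/
└── config.yml           # matching config file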

3. Format Your Dataset

Format: filename.wav | transcription

For reference, see val.txt in the LibriTTS dataset.
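
A hypothetical example of a training list in this format (filenames and text are placeholders):

0001.wav | A transcription of the first audio file.
0002.wav | A transcription of the second audio file.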

4. (Optional) Extend The Token Set To Support Additional Languages

If you plan to train on a new language whose symbols are not included in the original token set (see the config file that comes with the pretrained model you downloaded), follow these steps after completing step 2.

  • Open extend.ipynb, set the "extend_to" variable to the total number of symbols you want to support (including the new ones), then run the notebook (a sketch of the underlying operation follows this list). You may over-extend (i.e., reserve extra slots beyond your current needs), but it is strongly recommended to extend only up to the number of new symbols you actually plan to use, to avoid unnecessary memory usage and complexity.
  • Find the extended weights in /Extend/New_Weights/ and replace the original weights with them.
  • Add the new symbols to the "_extend_list" in the config file. You may also want to set:
    load_only_params: true # prevent loading the old optimizer state

  • ⚠️ Important: Do not add any new symbols to the config file before running extend.ipynb; doing so will lead to misalignment between the model and the symbol set.
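
For intuition, extending the token set essentially grows the text-embedding matrix with newly initialized rows while leaving the trained rows untouched, which is why the config's symbol list must not change until the weights have been extended. The snippet below is a minimal sketch of that idea, not the notebook's actual code; the key names and checkpoint structure are assumptions:

import torch

def extend_embedding(weight: torch.Tensor, extend_to: int) -> torch.Tensor:
    # Grow an [n_symbols, dim] embedding to [extend_to, dim], keeping the old rows.
    n_old, dim = weight.shape
    assert extend_to >= n_old, "extend_to must not shrink the token set"
    new_rows = torch.randn(extend_to - n_old, dim) * weight.std()
    return torch.cat([weight, new_rows], dim=0)

# Hypothetical usage on a text-encoder embedding (structure is an assumption):
# ckpt = torch.load("Models/Finetune/base_model.pth", map_location="cpu")
# w = ckpt["net"]["text_encoder"]["embedding.weight"]
# ckpt["net"]["text_encoder"]["embedding.weight"] = extend_embedding(w, extend_to=200)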

5. Adjust Your Config File

For a single GPU with 24 GB of VRAM, I find the following works well.

batch_size: 2 
max_len: 310 # maximum number of frames

Update the dataset paths in the config to point at your data. For example:

data_params:
  train_data: ../Data_Speech/LibriTTS/train.txt
  val_data: ../Data_Speech/LibriTTS/val.txt
  root_path: ../Data_Speech/

6. Start Training

python train.py

Disclaimer

Before using these pre-trained models, you agree to inform listeners that the speech samples are synthesized, unless you have permission to use the voice being synthesized. In other words, before making synthesized voices public, either use only voices whose speakers have granted permission to have their voice cloned (directly or by license), or publicly announce that the voices are synthesized if you do not have that permission.

License

Code: MIT License
