NaturalSpeech2 Recipe
In this recipe, we will show how to train NaturalSpeech2 using Amphion's infrastructure. NaturalSpeech2 is a zero-shot TTS architecture that predicts latent representations of a neural audio codec.
There are three stages in total:
- Data processing
- Training
- Inference
NOTE: You need to run every command of this recipe in the
Amphion
root path:
cd Amphion
1. Data processing
You can use the commonly used TTS dataset to train NaturalSpeech2 model, e.g., LibriTTS, etc. We strongly recommend you use LibriTTS to train NaturalSpeech2 model for the first time. How to download dataset is detailed here.
You can follow other Amphion TTS recipes for the data processing.
3. Training
sh egs/tts/NaturalSpeech2/run_train.sh
4. Inference
bash egs/tts/NaturalSpeech2/run_inference.sh --text "[The text you want to generate]"
We released a pre-trained Amphion NatrualSpeech2 model. So you can download the pre-trained model here and generate speech following the above inference instruction.
We also provided an online demo, feel free to try it!
@article{shen2023naturalspeech,
title={Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers},
author={Shen, Kai and Ju, Zeqian and Tan, Xu and Liu, Yanqing and Leng, Yichong and He, Lei and Qin, Tao and Zhao, Sheng and Bian, Jiang},
journal={arXiv preprint arXiv:2304.09116},
year={2023}
}