NaturalSpeech2 Recipe

In this recipe, we will show how to train NaturalSpeech2 using Amphion's infrastructure. NaturalSpeech2 is a zero-shot TTS architecture that predicts latent representations of a neural audio codec.

There are three stages in total:

Data processing
Training
Inference

NOTE: Run every command in this recipe from the Amphion root directory:
cd Amphion
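All later commands use paths relative to the repository root, so a quick check can catch a wrong working directory early. The sketch below is only an illustration: `is_amphion_root` is a hypothetical helper, not part of Amphion, and it assumes that finding the two scripts named in this recipe is enough to identify the root.

```shell
# Hypothetical helper (not part of Amphion): succeeds only if DIR contains
# the two NaturalSpeech2 recipe scripts referenced in this recipe.
is_amphion_root() {
    dir="${1:-.}"
    [ -f "$dir/egs/tts/NaturalSpeech2/run_train.sh" ] &&
    [ -f "$dir/egs/tts/NaturalSpeech2/run_inference.sh" ]
}

# Check the current directory before running any recipe command.
if is_amphion_root .; then
    echo "OK: running from the Amphion root"
else
    echo "Not in the Amphion root; run 'cd Amphion' first" >&2
fi
```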

1. Data processing

You can train NaturalSpeech2 on any commonly used TTS dataset, such as LibriTTS. We strongly recommend LibriTTS for your first training run. Instructions for downloading the dataset are detailed here.

For the data processing steps, follow the other Amphion TTS recipes.

2. Training

bash egs/tts/NaturalSpeech2/run_train.sh

3. Inference

bash egs/tts/NaturalSpeech2/run_inference.sh --text "[The text you want to generate]"
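To synthesize several sentences, you can call the inference command above once per sentence. The following is a minimal sketch under stated assumptions: `synthesize_lines` and `INFER_CMD` are hypothetical names introduced here, and only the `--text` option shown above is used. This recipe does not say where outputs are written, so check whether successive runs overwrite each other before relying on it.

```shell
# Hypothetical wrapper (not part of Amphion): run the recipe's inference
# command once per non-empty line of a text file.
# INFER_CMD defaults to the command from this recipe; override it to dry-run.
INFER_CMD="${INFER_CMD:-bash egs/tts/NaturalSpeech2/run_inference.sh}"

synthesize_lines() {
    # $1: path to a text file with one sentence per line
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue      # skip blank lines
        # INFER_CMD is intentionally unquoted so it splits into command + args;
        # stdin is redirected so the command cannot consume the input file.
        $INFER_CMD --text "$line" </dev/null || return 1
    done < "$1"
}

# Usage (from the Amphion root): synthesize_lines sentences.txt
```

The function stops at the first failing call and returns a non-zero status, so a broken setup does not silently skip sentences.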

We have released a pre-trained Amphion NaturalSpeech2 model. You can download the pre-trained model here and generate speech by following the inference instructions above.

We also provide an online demo. Feel free to try it!

@article{shen2023naturalspeech,
  title={NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers},
  author={Shen, Kai and Ju, Zeqian and Tan, Xu and Liu, Yanqing and Leng, Yichong and He, Lei and Qin, Tao and Zhao, Sheng and Bian, Jiang},
  journal={arXiv preprint arXiv:2304.09116},
  year={2023}
}