VALL-E

Introduction

This is an unofficial PyTorch implementation of VALL-E, a zero-shot voice cloning model based on neural codec language modeling (paper link). If trained properly, this model could match the performance reported in the original paper.

Change notes

This is a refined version of the first VALL-E implementation in Amphion. We have changed the underlying implementation to Llama to provide better model performance, faster training, and more readable code. It can be a great tool if you want to learn about speech language models and their implementation.

Installation requirement

Set up your environment as described in the Amphion README (you'll need a conda environment, and we recommend using Linux). A GPU is recommended if you want to train this model yourself. For inference with our pretrained models, you can generate samples even without a GPU. To make sure your transformers library can run the code, we recommend additionally running:

pip install -U transformers==4.41.2

Inferencing pretrained VALL-E models

Download pretrained weights

You need to download our pretrained weights from Hugging Face.

Script to download the AR and NAR model checkpoints:

huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts

Script to download codec model (SpeechTokenizer) checkpoint:

mkdir -p ckpts/speechtokenizer_hubert_avg && huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg

If you cannot access Hugging Face, consider downloading through the Hugging Face mirror:

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
mkdir -p ckpts/speechtokenizer_hubert_avg && HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
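
After downloading, it can help to confirm that all four files landed where the commands above put them. The following is a minimal sketch, not part of the repository: `missing_checkpoints` is a hypothetical helper, and the file list is simply copied from the download commands.

```python
from pathlib import Path

# File names copied from the download commands above; paths are relative to ckpts/.
EXPECTED_FILES = [
    "valle_ar_mls_196000.bin",
    "valle_nar_mls_164000.bin",
    "speechtokenizer_hubert_avg/SpeechTokenizer.pt",
    "speechtokenizer_hubert_avg/config.json",
]


def missing_checkpoints(ckpt_dir="ckpts"):
    """Return the expected files that are absent or empty under ckpt_dir."""
    root = Path(ckpt_dir)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # An interrupted download can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_checkpoints()
    print("all files present" if not gaps else f"missing: {gaps}")
```

This only checks presence and non-zero size; it does not validate the contents of the checkpoints.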

Inference in IPython notebook

We provide a pretrained VALL-E model trained on the 45k-hour MLS dataset, which contains English speech in 10-20s utterances. The "demo.ipynb" file provides a working example of running inference with our pretrained VALL-E model. Give it a try!

Examining the model files

Examining the model files of VALL-E is a great way to learn how it works. We provide examples that allow you to overfit a single batch (so no dataset download is required).

The AR model is essentially a causal language model that "continues" a speech. The NAR model is a modification from the AR model that allows for bidirectional attention.
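
To make the difference concrete, here is a small illustration of the two attention patterns in plain Python. It is a sketch of the general idea only, not the code in valle_ar.py or valle_nar.py, and the function names are hypothetical; the real models may combine these patterns with additional masking that is not shown here.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as text: 'x' = attention allowed, '.' = blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("causal (AR-style):")
    print(show(causal_mask(4)))
    print("bidirectional (NAR-style):")
    print(show(bidirectional_mask(4)))
```

In the causal pattern each row only sees itself and earlier positions, which is what lets an AR model generate one token at a time; the bidirectional pattern lets every position see the whole sequence.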

The files valle_ar.py and valle_nar.py in the "models/tts/valle_v2" folder are the model files. They can be run directly via python -m models.tts.valle_v2.valle_ar (or python -m models.tts.valle_v2.valle_nar). This invokes a test that overfits the model to a single example.

Training VALL-E from scratch

Preparing LibriTTS or LibriTTS-R dataset files

We have tested our training script on LibriTTS and LibriTTS-R. You can download LibriTTS-R at this link and LibriTTS at this link. The "train-clean-360" split is currently used by our configuration. You can test the dataset code by running python -m models.tts.valle_v2.libritts_dataset.

For your reference, our unzipped dataset files have a file structure like this:

/path/to/LibriTTS_R
├── BOOKS.txt
├── CHAPTERS.txt
├── dev-clean
│   ├── 2412
│   │   ├── 153947
│   │   │   ├── 2412_153947_000014_000000.normalized.txt
│   │   │   ├── 2412_153947_000014_000000.original.txt
│   │   │   ├── 2412_153947_000014_000000.wav
│   │   │   ├── 2412_153947_000017_000001.normalized.txt
│   │   │   ├── 2412_153947_000017_000001.original.txt
│   │   │   ├── 2412_153947_000017_000001.wav
│   │   │   ├── 2412_153947_000017_000005.normalized.txt
├── train-clean-360
│   ├── 422
│   │   └── 122949
│   │       ├── 422_122949_000009_000007.normalized.txt
│   │       ├── 422_122949_000009_000007.original.txt
│   │       ├── 422_122949_000009_000007.wav
│   │       ├── 422_122949_000013_000010.normalized.txt
│   │       ├── 422_122949_000013_000010.original.txt
│   │       ├── 422_122949_000013_000010.wav
│   │       ├── 422_122949.book.tsv
│   │       └── 422_122949.trans.tsv
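
To check that your copy matches this layout, you can walk one split and pair each .wav file with its .normalized.txt transcript. This is a minimal sketch and not the repository's libritts_dataset module; `list_utterances` is a hypothetical helper, and it assumes the split/speaker/chapter nesting shown in the tree above.

```python
from pathlib import Path


def list_utterances(data_dir, split="train-clean-360"):
    """Yield (wav_path, normalized_text) pairs for one split.

    Assumes <data_dir>/<split>/<speaker>/<chapter>/<utt>.wav with a
    sibling <utt>.normalized.txt, as in the tree above.
    """
    split_dir = Path(data_dir) / split
    for wav_path in sorted(split_dir.glob("*/*/*.wav")):
        text_path = wav_path.with_name(wav_path.stem + ".normalized.txt")
        if text_path.is_file():  # skip audio that has no transcript
            yield wav_path, text_path.read_text(encoding="utf-8").strip()


if __name__ == "__main__":
    # "/path/to/LibriTTS_R" is a placeholder; point it at your own copy.
    pairs = list(list_utterances("/path/to/LibriTTS_R"))
    print(f"found {len(pairs)} utterances with transcripts")
```

If the count is zero, the most likely cause is that data_dir points one level too high or too low in the unzipped archive.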

Alternatively, you could write your own dataloader for your dataset. You can refer to the __getitem__ method in models/tts/VALLE_V2/mls_dataset.py. It should return a dict containing a 1-dimensional tensor 'speech', which is the 16kHz speech waveform, and a 1-dimensional tensor 'phone', which is the phoneme sequence of the speech. As long as your dataset returns this from __getitem__, it should work.
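
Before starting a long training run with a custom dataset, a quick check of this contract can catch shape mistakes early. The sketch below uses a hypothetical helper, `check_item`, that is not part of the repository. It is duck-typed on the `ndim` attribute (which PyTorch tensors expose), so it runs without importing PyTorch. It cannot verify the 16kHz sampling rate, since that is not recoverable from the tensor alone.

```python
def check_item(item):
    """Raise ValueError unless item looks like {'speech': 1-D, 'phone': 1-D}."""
    for key in ("speech", "phone"):
        if key not in item:
            raise ValueError(f"missing key {key!r}")
        # torch.Tensor exposes .ndim; anything without it is rejected.
        ndim = getattr(item[key], "ndim", None)
        if ndim != 1:
            raise ValueError(f"{key!r} must be 1-dimensional, got ndim={ndim}")
    return True
```

Typical usage would be `check_item(my_dataset[0])`, where `my_dataset` is your own dataset object.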

Changing batch size and dataset path in configuration file

The configuration file for training the VALL-E AR model is at "egs/tts/VALLE_V2/exp_ar_libritts.json", and the one for the NAR model is at "egs/tts/VALLE_V2/exp_nar_libritts.json".

To train your model, you need to modify the dataset entry in the JSON configuration (currently at line 40): set "data_dir" to your dataset's root directory.

    "dataset": {
      "dataset_list":["train-clean-360"], // You can also change to other splits like "dev-clean"
      "data_dir": "/path/to/your/LibriTTS_R",
    },

You should also select a reasonable batch size at the "batch_size" entry (currently it's set at 5).

You can change other experiment settings in egs/tts/VALLE_V2/exp_ar_libritts.json, such as the learning rate, the optimizer, and the dataset.
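
If you switch between machines often, the dataset edit described above can also be scripted. This is a minimal sketch with a hypothetical helper, `set_dataset`. It assumes the configuration file is strict JSON: if the real file contains // comments or trailing commas like the excerpt above, Python's `json` module will refuse to parse it and you should edit the file by hand. The batch size is left alone because the text above does not say where "batch_size" sits in the file.

```python
import json
from pathlib import Path


def set_dataset(config_path, data_dir, splits=("train-clean-360",)):
    """Rewrite the "dataset" entry of a strict-JSON config file in place."""
    path = Path(config_path)
    cfg = json.loads(path.read_text(encoding="utf-8"))
    dataset = cfg.setdefault("dataset", {})
    dataset["data_dir"] = str(data_dir)
    dataset["dataset_list"] = list(splits)
    # Note: rewriting the file normalizes its indentation.
    path.write_text(json.dumps(cfg, indent=2) + "\n", encoding="utf-8")
    return cfg
```

For example, `set_dataset("egs/tts/VALLE_V2/exp_ar_libritts.json", "/path/to/your/LibriTTS_R")` would point the AR configuration at your dataset, under the assumptions stated above.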

Run the command to train the AR model

Make sure your current directory is the Amphion root directory, then run:

sh egs/tts/VALLE_V2/train_ar_libritts.sh

Your initial model checkpoint can be found at a path such as ckpt/VALLE_V2/ar_libritts/checkpoint/epoch-0000_step-0000000_loss-7.397293/pytorch_model.bin
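
Since the directory name in the path above encodes the epoch, step, and loss, you can pick the most recent checkpoint programmatically. The sketch below is not part of the framework, and its helper names are hypothetical. It assumes every checkpoint directory follows the naming pattern of that single example; directories whose names differ (for instance a loss printed in another format) are ignored.

```python
import re
from pathlib import Path

# Pattern inferred from the single example path above (an assumption).
CKPT_NAME = re.compile(r"epoch-(\d+)_step-(\d+)_loss-(\d+(?:\.\d+)?)$")


def parse_checkpoint_name(name):
    """Return {'epoch', 'step', 'loss'} for a matching directory name, else None."""
    m = CKPT_NAME.match(name)
    if m is None:
        return None
    return {"epoch": int(m.group(1)), "step": int(m.group(2)), "loss": float(m.group(3))}


def latest_checkpoint(checkpoint_dir):
    """Return the sub-directory with the highest step, or None if none match."""
    best_step, best_path = -1, None
    for child in Path(checkpoint_dir).iterdir():
        info = parse_checkpoint_name(child.name) if child.is_dir() else None
        if info is not None and info["step"] > best_step:
            best_step, best_path = info["step"], child
    return best_path
```

For example, `latest_checkpoint("ckpt/VALLE_V2/ar_libritts/checkpoint")` would return the directory whose pytorch_model.bin is the newest, under the naming assumption above.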

Resume from existing checkpoint

Our framework supports resuming from an existing checkpoint.

Run:

sh egs/tts/VALLE_V2/train_ar_libritts.sh --resume

Finetuning based on our AR model

We provide the optimizer and random_states checkpoints of our AR model to support finetuning (there is no need to download these files if you only run inference with the pretrained model). First rename the files to "pytorch_model.bin", "optimizer.bin", and "random_states_0.pkl"; you can then resume from these checkpoints. Link to AR optimizer checkpoint and Link to random_states.pkl.

Run the command to train the NAR model

Make sure your current directory is the Amphion root directory, then run:

sh egs/tts/VALLE_V2/train_nar_libritts.sh

Inference with your models

Since our inference script is already provided, you can change the paths from our pretrained models to your newly trained models and run inference.

Future plans

  • Support more languages
  • More are coming...