FAcodec

PyTorch implementation for training FAcodec, which was proposed in the paper NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

A dedicated repository for the FAcodec model can also be found here.

This implementation makes key improvements to the training pipeline that eliminate the need for annotations of any form, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a checkpoint pretrained on 50k hours of speech data with over 1 million speakers.

Model storage

We provide checkpoints pretrained on 50k hours of speech data.

Model type	Link
FAcodec

Demo

Try our model on !

Training

Prepare your data and put it under one folder; the internal file structure does not matter.
Then, change the dataset in ./egs/codec/FAcodec/exp_custom_data.json to the path of your data folder.
Finally, run the following command:

sh ./egs/codec/FAcodec/train.sh
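
Before launching training, it can help to confirm that the data folder actually contains audio at whatever depth it is nested. The following is a minimal sketch, not part of this repository: `find_audio_files` is a hypothetical helper, and the extension list is an assumption that should be adjusted to your data.

```python
from pathlib import Path

# ASSUMPTION: these extensions are illustrative; the formats accepted by the
# training pipeline are not stated in this README.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def find_audio_files(data_root):
    """Recursively collect audio files under data_root, at any nesting depth."""
    root = Path(data_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Example usage (replace the path with your own data folder):
# files = find_audio_files("/path/to/your/data")
# print(f"Found {len(files)} audio files")
```

An empty result usually means the path in the config points at the wrong folder or the files use an extension not listed above.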

Inference

To reconstruct a speech file, run:

python ./bins/codec/inference.py --source <source_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>

To use zero-shot voice conversion, run:

python ./bins/codec/inference.py --source <source_wav> --reference <reference_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
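
To process many files, the two commands above can be wrapped in a small script. This is a sketch under stated assumptions: it uses only the flags documented above (`--source`, `--reference`, `--output_dir`, `--checkpoint_path`), the helper names `build_inference_command` and `convert_folder` are hypothetical, and it assumes the script is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(source, output_dir, checkpoint_path, reference=None):
    """Build the argument list for ./bins/codec/inference.py.

    Without `reference` this is the reconstruction command; with it,
    the zero-shot voice conversion command.
    """
    cmd = [
        sys.executable, "./bins/codec/inference.py",
        "--source", str(source),
        "--output_dir", str(output_dir),
        "--checkpoint_path", str(checkpoint_path),
    ]
    if reference is not None:
        cmd += ["--reference", str(reference)]
    return cmd


def convert_folder(wav_dir, output_dir, checkpoint_path, reference=None):
    """Run inference once per .wav file directly inside wav_dir (hypothetical helper)."""
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        cmd = build_inference_command(wav, output_dir, checkpoint_path, reference)
        subprocess.run(cmd, check=True)  # raises CalledProcessError if a run fails
```

Launching one process per file reloads the checkpoint each time, which is simple but slow; for large batches, calling the inference class directly from Python is likely the better choice.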

Feature extraction

When running ./bins/codec/inference.py, check the results returned by the FAcodecInference class, which are a tuple of (quantized, codes):

quantized is the quantized representation of the input speech file.
quantized[0] is the quantized representation of prosody.
quantized[1] is the quantized representation of content.
codes is the discrete code representation of the input speech file.
codes[0] is the discrete code representation of prosody.
codes[1] is the discrete code representation of content.

For the cleanest content representation without any timbre, we suggest using codes[1][:, 0, :], which is the first layer of the content codebooks.
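
The slice codes[1][:, 0, :] can be illustrated on a stand-in array. This sketch uses NumPy rather than the real model output, and it assumes a (batch, codebook layer, frame) layout with made-up sizes; only the position of the layer axis follows from the indexing described above. The same indexing syntax applies to PyTorch tensors.

```python
import numpy as np

# Stand-in for codes[1] (content codes).
# ASSUMED layout: (batch, num_codebook_layers, num_frames); sizes are made up.
batch, num_layers, num_frames = 1, 2, 5
content_codes = np.arange(batch * num_layers * num_frames).reshape(
    batch, num_layers, num_frames
)

# Same slice as codes[1][:, 0, :]: keep every batch item and every frame,
# but select only the first codebook layer.
first_layer = content_codes[:, 0, :]

print(first_layer.shape)  # (1, 5)
```

Indexing with an integer on the layer axis drops that axis, so the result has one code per frame for each item in the batch.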