---
language: en
tags:
- speech quantization
license: mit
datasets:
- LibriTTS
---

# Highlights
This model performs speech coding (quantization) on English utterances.
- Low frame rate: 25 tokens/s per quantizer
- Higher codec quality at low bandwidths
- Trained with structured dropout, so a single model supports multiple bandwidths at inference
- Quantizes a raw speech waveform into a sequence of discrete tokens

# FunCodec model
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),
an open-source toolkit for speech quantization (codec) from DAMO Academy, Alibaba Group.
This repository provides a model pre-trained on the LibriTTS corpus.
It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis
and other academic research topics.
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),
the following techniques are used to train the model, resulting in higher codec quality and
higher [ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
- A magnitude spectrum loss is employed to enhance the mid- and high-frequency components of the signal
- Structured dropout is employed to smooth the code space and to enable multiple bandwidths in a single model
- Codes are initialized with k-means clusters rather than random values
- Codebooks are maintained with an exponential moving average and a dead-code elimination mechanism, resulting in high codebook utilization
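
The structured dropout item can be illustrated with a toy example. This is a minimal sketch, assuming the scheme known from SoundStream-style quantizer dropout, where each training step keeps only a randomly chosen number of leading quantizers. The function names and the uniform sampling are illustrative assumptions, not taken from the FunCodec source.

```python
import random

def sample_active_quantizers(num_quantizers, rng):
    """Pick how many leading quantizers stay active for one training step.

    Assumption: the count is drawn uniformly from 1..num_quantizers.
    """
    return rng.randint(1, num_quantizers)

def apply_structured_dropout(per_quantizer_outputs, num_active):
    """Sum the outputs of the first `num_active` quantizers and drop the rest."""
    kept = per_quantizer_outputs[:num_active]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) for i in range(dim)]

rng = random.Random(0)
# Toy per-quantizer outputs for a single frame (3 quantizers, dimension 2).
outputs = [[1.0, 0.0], [0.5, 0.5], [0.25, -0.25]]
n_active = sample_active_quantizers(len(outputs), rng)
print(n_active, apply_structured_dropout(outputs, n_active))
```

Because the decoder sees every prefix length during training, the same weights can later be run with fewer quantizers to trade quality for bandwidth.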

## Model description
This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain
several parallel sequences of discrete latent representations. The figure below gives an overview of FunCodec models.
<p align="center">
<img src="fig/framework.png" alt="FunCodec architecture"/>
</p>

In general, FunCodec models consist of five modules: a domain transformation module,
an encoder, an RVQ module, a decoder and a domain inversion module.
- Domain Transformation: transform signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
- Encoder: encode signals into compact representations with stacked convolutional and LSTM layers.
- Semantic tokens (optional): augment encoder outputs with semantic tokens to enrich the content information; not used in this model.
- RVQ: quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
- Decoder: decode quantized embeddings back into the same signal domain as the input.
- Domain Inversion: re-synthesize perceptible waveforms from the chosen signal domain.
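
The RVQ step in the list above can be sketched for a single frame. This is a toy illustration in plain Python, not FunCodec's implementation: the codebooks, their sizes and the function names are made up for the example, and a nearest-neighbour search under squared L2 distance is assumed.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage codes the residual left by the previous one."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices gives a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two hypothetical codebooks with three 2-dimensional entries each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.25, 1.0]
codes = rvq_encode(frame, codebooks)
print(codes, rvq_decode(codes, codebooks), rvq_decode(codes[:1], codebooks))
```

Each frame yields one index per quantizer, so a whole utterance produces as many parallel token sequences as there are quantizers in use.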

More details can be found at:
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)

## Intended uses & scenarios
### Inference with FunCodec

You can extract codecs and reconstruct waveforms from them with the FunCodec repository.

#### FunCodec installation
```sh
# Install PyTorch with GPU support (version >= 1.12.0):
conda install pytorch==1.12.0
# For other versions, please refer to https://pytorch.org/get-started/locally

# Download codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git

# Install FunCodec codebase:
cd FunCodec
pip install --editable ./
```

#### Codec extraction
```sh
# Enter the example directory 
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir -p exp
mv "${model_name}" "exp/${model_name}"
# Extract codecs for the utterances listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```
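
The `input_wav.scp` file described in the comments above can be generated with a short script. This is a minimal sketch: taking the utterance ID from the file stem is an assumption (the format only requires an ID and a path per line), and the helper name `write_wav_scp` is hypothetical. The demo runs on a temporary directory with empty placeholder files rather than real audio.

```python
import tempfile
from pathlib import Path

def write_wav_scp(wav_dir, scp_path):
    """Write one "<uttid> <path>" line per .wav file found under `wav_dir`.

    Assumption: the file stem is used as the utterance ID, so stems must be unique.
    """
    wav_files = sorted(Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for wav in wav_files:
            f.write(f"{wav.stem} {wav}\n")
    return len(wav_files)

# Demo on a throwaway directory with empty placeholder files (not real audio).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("utt_b.wav", "utt_a.wav", "notes.txt"):
        Path(tmp, name).touch()
    scp_file = Path(tmp, "input_wav.scp")
    num_entries = write_wav_scp(tmp, scp_file)
    scp_lines = scp_file.read_text(encoding="utf-8").splitlines()

print(num_entries, [line.split(" ", 1)[0] for line in scp_lines])
```

For real data, point `wav_dir` at the directory holding the recordings and pass the resulting file to `--wav_scp`.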

#### Reconstruct waveforms from codecs
```shell
# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
```
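
To inspect `codecs.txt` outside of FunCodec, each line can be split into an utterance ID and a nested list. This is a minimal sketch, assuming the payload is a JSON-compatible nested list of integers as the format comment suggests. The helper names are hypothetical, and the sample line is made up to mirror that format. The meaning of each nesting level is not stated here, so the sketch only reports the nesting lengths.

```python
import json

def parse_codec_line(line):
    """Split "<uttid> <nested list>" into the ID and a nested Python list."""
    utt_id, payload = line.strip().split(maxsplit=1)
    return utt_id, json.loads(payload)

def nested_shape(value):
    """Return the length at each nesting level, following the first element."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)

# Made-up sample line in the documented format (real lines hold full sequences).
sample = "uttid1 [[[1, 2, 3],[2, 3, 4]]]"
utt_id, codes = parse_codec_line(sample)
print(utt_id, nested_shape(codes))
```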

### Inference with Hugging Face Transformers
Inference with the Hugging Face Transformers package is under development.

### Application scenarios
Running environment
- Currently, the model has only been tested on Linux x86_64. macOS and Windows have not been tested.

Intended usage scenarios
- This model is suitable for academic use
- Speech quantization, coding and tokenization for English utterances

## Evaluation results

### Training configuration
- Feature info: raw waveform input
- Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
- Loss info: L1, L2, discriminative loss
- Model info: SEANet, Conv, LSTM
- Train config: config.yaml
- Model size: 57.83 M parameters

### Experimental Results

Test set: LibriTTS-test. Metric: ViSQOL score at each token rate (tk/s = tokens per second).

| Test set | 50 tk/s  | 100 tk/s | 200 tk/s | 400 tk/s |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| LibriTTS |   3.64   |   3.94   |   4.16   |   4.29   |
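
The token rates in the table relate to the number of quantizers in use through the 25 tokens/s per-quantizer frame rate. The sketch below works through that arithmetic. The sampling rate and downsampling factor are read off the model name ("16k", "ds640"), which is an interpretation rather than a documented fact, and the codebook size used for the bitrate estimate is a purely hypothetical value, since it is not stated in this card.

```python
import math

SAMPLE_RATE = 16000  # assumed from "16k" in the model name
DOWNSAMPLE = 640     # assumed from "ds640" in the model name
FRAME_RATE = SAMPLE_RATE // DOWNSAMPLE  # tokens per second for each quantizer

def quantizers_for(token_rate):
    """Number of quantizers needed to emit `token_rate` tokens per second."""
    return token_rate // FRAME_RATE

def bitrate_bps(token_rate, codebook_size):
    """Bits per second for a given codebook size (an ASSUMED, hypothetical value)."""
    return token_rate * math.log2(codebook_size)

for rate in (50, 100, 200, 400):
    # codebook_size=1024 is only an example value, not taken from this model.
    print(rate, quantizers_for(rate), bitrate_bps(rate, codebook_size=1024))
```

Under these assumptions the four columns correspond to 2, 4, 8 and 16 quantizers respectively.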

### Limitations and bias
- Not very robust to background noise or reverberation

### BibTeX entry and citation info
```BibTeX
@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```