Spaces:
Runtime error
Runtime error
File size: 5,025 Bytes
5548515 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 |
# Text-to-Audio with Latent Diffusion Model
This is the quicktour for training a text-to-audio model with the popular and powerful generative model: [Latent Diffusion Model](https://arxiv.org/abs/2112.10752). Specially, this recipe is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper "[AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models](https://arxiv.org/abs/2304.00830)". You can check the last part of [AUDIT demos](https://audit-demo.github.io/) to see same text-to-audio examples.
<br>
<div align="center">
<img src="../../imgs/tta/DiffusionTTA.png" width="65%">
</div>
<br>
We train this latent diffusion model in two stages:
1. In the first stage, we aims to obtain a high-quality VAE (called `AutoencoderKL` in Amphion), in order that we can project
the input mel-spectrograms to an efficient, low-dimensional latent space. Specially, we train the VAE with GAN loss to improve the reconstruction quality.
1. In the second stage, we aims to obtain a text-controllable diffusion model (called `AudioLDM` in Amphion). We use U-Net architecture diffusion model, and use T5 encoder as text encoder.
There are four stages in total for training the text-to-audio model:
1. Data preparation and processing
2. Train the VAE model
3. Train the latent diffusion model
4. Inference
> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
> ```bash
> cd Amphion
> ```
## Overview
```sh
# Train the VAE model
sh egs/tta/autoencoderkl/run_train.sh
# Train the latent diffusion model
sh egs/tta/audioldm/run_train.sh
# Inference
sh egs/tta/audioldm/run_inference.sh
```
## 1. Data preparation and processing
### Dataset Download
We take [AudioCaps](https://audiocaps.github.io/) as an example, AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information. You can download the dataset [here](https://github.com/cdjkim/audiocaps).
<!-- How to download AudioCaps is detailed [here](../datasets/README.md) -->
<!-- You can downlaod the dataset [here](https://github.com/cdjkim/audiocaps). -->
### Data Processing
- Download AudioCaps dataset to `[Your path to save tta dataset]` and modify `preprocess.processed_dir` in `egs/tta/.../exp_config.json`.
```json
{
"dataset": [
"AudioCaps"
],
"preprocess": {
// Specify the output root path to save the processed data
"processed_dir": "[Your path to save tta dataset]",
...
}
}
```
The folder structure of your downloaded data should be similar to:
```plaintext
.../[Your path to save tta dataset]
β£ AudioCpas
βΒ Β β£ wav
β β β£ ---1_cCGK4M_0_10000.wav
β β β£ ---lTs1dxhU_30000_40000.wav
β β β£ ...
```
- Then you may process the data to mel-specgram and save it as `.npy` format. If you use the data we provide, we have processed all the wav data.
- Generate a json file to save the metadata, the json file is like:
```json
[
{
"Dataset": "AudioCaps",
"Uid": "---1_cCGK4M_0_10000",
"Caption": "Idling car, train blows horn and passes"
},
{
"Dataset": "AudioCaps",
"Uid": "---lTs1dxhU_30000_40000",
"Caption": "A racing vehicle engine is heard passing by"
},
...
]
```
- Finally, the folder structure is like:
```plaintext
.../[Your path to save tta dataset]
β£ AudioCpas
βΒ Β β£ wav
β β β£ ---1_cCGK4M_0_10000.wav
β β β£ ---lTs1dxhU_30000_40000.wav
β β β£ ...
βΒ Β β£ mel
β β β£ ---1_cCGK4M_0_10000.npy
β β β£ ---lTs1dxhU_30000_40000.npy
β β β£ ...
βΒ Β β£ train.json
βΒ Β β£ valid.json
βΒ Β β£ ...
```
## 2. Training the VAE Model
The first stage model is a VAE model trained with GAN loss (called `AutoencoderKL` in Amphion), run the follow commands:
```sh
sh egs/tta/autoencoderkl/run_train.sh
```
## 3. Training the Latent Diffusion Model
The second stage model is a condition diffusion model with a T5 text encoder (called `AudioLDM` in Amphion), run the following commands:
```sh
sh egs/tta/audioldm/run_train.sh
```
## 4. Inference
Now you can generate audio with your pre-trained latent diffusion model, run the following commands and modify the `text` argument.
```sh
sh egs/tta/audioldm/run_inference.sh \
--text "A man is whistling"
```
## Citations
```bibtex
@article{wang2023audit,
title={AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models},
author={Wang, Yuancheng and Ju, Zeqian and Tan, Xu and He, Lei and Wu, Zhizheng and Bian, Jiang and Zhao, Sheng},
journal={NeurIPS 2023},
year={2023}
}
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023}
}
``` |