# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract :**
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.

## Pre-requisites

1. Python >= 3.6
2. Clone this repository.
3. Install Python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/),
   then move all wav files to `LJSpeech-1.1/wavs` (see the sketch below).
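
If the extracted archive does not already place the audio under `LJSpeech-1.1/wavs`, here is a minimal sketch for gathering the files there; the source path below is a placeholder for wherever your wav files currently sit.

```python
# Minimal sketch: gather wav files into the layout the training script expects.
# "path/to/extracted/wavs" is a placeholder; adjust it to your extraction location.
import shutil
from pathlib import Path

src = Path("path/to/extracted/wavs")
dst = Path("LJSpeech-1.1/wavs")
dst.mkdir(parents=True, exist_ok=True)

for wav in src.glob("*.wav"):
    shutil.move(str(wav), str(dst / wav.name))

print(f"{len(list(dst.glob('*.wav')))} wav files in {dst}")
```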

## Training

```
python train.py --config config_v1.json
```

To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path with the `--checkpoint_path` option.
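
The config files are plain JSON. As a rough sketch, assuming the repository's `env.AttrDict` and `models.Generator` (the same objects `train.py` constructs), you can inspect how a config maps onto the generator it trains:

```python
# Minimal sketch: inspect a training config and build the corresponding generator.
# Assumes the repository's env.AttrDict and models.Generator, as used by train.py.
import json
from env import AttrDict
from models import Generator

with open("config_v1.json") as f:
    h = AttrDict(json.load(f))  # hyperparameters become attribute-style fields

generator = Generator(h)
n_params = sum(p.numel() for p in generator.parameters())
print(f"upsample_rates={h.upsample_rates}, parameters={n_params / 1e6:.1f}M")
```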

Validation loss during training with the V1 generator.<br>
![validation loss](./validation_loss.png)

## Pretrained Model

You can also use pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

| Folder Name  | Generator | Dataset   | Fine-Tuned                                             |
| ------------ | --------- | --------- | ------------------------------------------------------ |
| LJ_V1        | V1        | LJSpeech  | No                                                     |
| LJ_V2        | V2        | LJSpeech  | No                                                     |
| LJ_V3        | V3        | LJSpeech  | No                                                     |
| LJ_FT_T2_V1  | V1        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V2  | V2        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V3  | V3        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| VCTK_V1      | V1        | VCTK      | No                                                     |
| VCTK_V2      | V2        | VCTK      | No                                                     |
| VCTK_V3      | V3        | VCTK      | No                                                     |
| UNIVERSAL_V1 | V1        | Universal | No                                                     |

The universal model is provided together with its discriminator weights, so it can be used as a base for transfer learning to other datasets.
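
As a minimal sketch of restoring one of the downloaded generators, assuming the folder contains a config alongside the generator checkpoint (the file names below are placeholders for whatever your downloaded folder contains):

```python
# Minimal sketch: restore a downloaded generator for inference.
# "LJ_V1/config.json" and "LJ_V1/generator_checkpoint" are placeholders for the files
# inside the pretrained folder you downloaded; assumes the repository's env.AttrDict
# and models.Generator.
import json
import torch
from env import AttrDict
from models import Generator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with open("LJ_V1/config.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h).to(device)
state = torch.load("LJ_V1/generator_checkpoint", map_location=device)
generator.load_state_dict(state["generator"])  # the checkpoints keep weights under this key
generator.eval()
generator.remove_weight_norm()  # as done by the repository's inference scripts
```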

## Fine-Tuning

1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing.<br/>
   The file name of each generated mel-spectrogram should match its audio file, with the extension `.npy` (see the sketch after this list).<br/>
   Example: the mel-spectrogram for `LJ001-0001.wav` should be saved as `LJ001-0001.npy`.
2. Create an `ft_dataset` folder and copy the generated mel-spectrogram files into it.<br/>
3. Run the following command.
   ```
   python train.py --fine_tuning True --config config_v1.json
   ```
   For other command line options, please refer to the training section.
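
A minimal sketch of the naming convention from step 1, assuming you already have a teacher-forced mel-spectrogram from Tacotron2 (the `mel` array below is a placeholder):

```python
# Minimal sketch: save a teacher-forced mel-spectrogram under the name fine-tuning expects.
# `mel` is a placeholder for the [num_mels, frames] array obtained from Tacotron2 with
# teacher forcing; only the file naming and format are shown here.
import numpy as np
from pathlib import Path

out_dir = Path("ft_dataset")
out_dir.mkdir(exist_ok=True)

wav_path = Path("LJSpeech-1.1/wavs/LJ001-0001.wav")
mel = np.zeros((80, 100), dtype=np.float32)  # placeholder; replace with the real mel

np.save(out_dir / (wav_path.stem + ".npy"), mel)  # -> ft_dataset/LJ001-0001.npy
```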

## Inference from wav file

1. Make a `test_files` directory and copy wav files into it.
2. Run the following command.
   ```
   python inference.py --checkpoint_file [generator checkpoint file path]
   ```
   Generated wav files are saved in `generated_files` by default.<br>
   You can change the path with the `--output_dir` option. A programmatic sketch follows this list.
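
For programmatic use, here is a rough sketch of what `inference.py` does for a single file, continuing from the generator-loading sketch in the Pretrained Model section and assuming the repository's `meldataset` helpers (verify the `mel_spectrogram` argument order against `meldataset.py`):

```python
# Minimal sketch of wav -> mel -> wav resynthesis, mirroring inference.py.
# Continues from the loading sketch above: `generator`, `h`, and `device` are assumed
# to exist; the input/output paths are placeholders.
import os
import torch
from scipy.io.wavfile import write
from meldataset import mel_spectrogram, load_wav, MAX_WAV_VALUE

wav, sr = load_wav("test_files/LJ001-0001.wav")
wav = torch.FloatTensor(wav / MAX_WAV_VALUE).to(device)

with torch.no_grad():
    mel = mel_spectrogram(wav.unsqueeze(0), h.n_fft, h.num_mels, h.sampling_rate,
                          h.hop_size, h.win_size, h.fmin, h.fmax)
    audio = generator(mel).squeeze() * MAX_WAV_VALUE

os.makedirs("generated_files", exist_ok=True)
write("generated_files/LJ001-0001_generated.wav", h.sampling_rate,
      audio.cpu().numpy().astype("int16"))
```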

## Inference for end-to-end speech synthesis

1. Make a `test_mel_files` directory and copy the generated mel-spectrogram files into it.<br>
   You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
   [Glow-TTS](https://github.com/jaywalnut310/glow-tts), and so on.
2. Run the following command.
   ```
   python inference_e2e.py --checkpoint_file [generator checkpoint file path]
   ```
   Generated wav files are saved in `generated_files_from_mel` by default.<br>
   You can change the path with the `--output_dir` option. A programmatic sketch follows this list.
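
Similarly, a rough sketch mirroring `inference_e2e.py`, again continuing from the generator-loading sketch above (the `.npy` path is a placeholder):

```python
# Minimal sketch: feed a saved mel-spectrogram to the generator, as inference_e2e.py does.
# Assumes `generator`, `h`, and `device` from the loading sketch above.
import os
import numpy as np
import torch
from scipy.io.wavfile import write
from meldataset import MAX_WAV_VALUE

mel = torch.FloatTensor(np.load("test_mel_files/LJ001-0001.npy")).to(device)
if mel.dim() == 2:
    mel = mel.unsqueeze(0)  # add a batch dimension: [1, num_mels, frames]

with torch.no_grad():
    audio = generator(mel).squeeze() * MAX_WAV_VALUE

os.makedirs("generated_files_from_mel", exist_ok=True)
write("generated_files_from_mel/LJ001-0001_generated_e2e.wav", h.sampling_rate,
      audio.cpu().numpy().astype("int16"))
```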

## Acknowledgements

We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.