subhankarg committed
Commit
c9c3f2f
1 Parent(s): cd71f63

Update README/model card.

Files changed (1):
  1. README.md +109 -0
README.md CHANGED
@@ -1,3 +1,112 @@
 ---
+ language:
+ - en
+ library_name: nemo
+ datasets:
+ - ljspeech
+ thumbnail: null
+ tags:
+ - text-to-speech
+ - speech
+ - audio
+ - Vocoder
+ - GAN
+ - pytorch
+ - NeMo
+ - Riva
 license: cc-by-4.0
 ---
+ # NVIDIA HiFi-GAN Vocoder (en-US)
+ <style>
+ img {
+  display: inline;
+ }
+ </style>
+ | [![Model architecture](https://img.shields.io/badge/Model_Arch-HiFiGAN--GAN-lightgrey#model-badge)](#model-architecture)
+ | [![Model size](https://img.shields.io/badge/Params-85M-lightgrey#model-badge)](#model-architecture)
+ | [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)
+ | [![Riva Compatible](https://img.shields.io/badge/NVIDIA%20Riva-compatible-brightgreen#model-badge)](#deployment-with-nvidia-riva) |
+
+ HiFi-GAN [1] is a generative adversarial network (GAN) model that generates audio from mel spectrograms. The generator uses transposed convolutions to upsample mel spectrograms to audio.
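+
+ As a rough illustration of that upsampling idea (a minimal sketch, not the actual NeMo implementation), a single transposed 1-D convolution can stretch an 80-band mel spectrogram along the time axis:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ # Illustrative single upsampling stage: kernel 16 with stride 8 roughly matches
+ # the first stage of the HiFi-GAN V1 generator described in the paper [1].
+ upsample = nn.ConvTranspose1d(in_channels=80, out_channels=40,
+                               kernel_size=16, stride=8, padding=4)
+ mel = torch.randn(1, 80, 200)   # [batch, n_mels, frames]
+ out = upsample(mel)             # [1, 40, 1600]: 8x more time steps
+ print(out.shape)
+ ```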
+
+ ## Usage
+
+ The model is available for use in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+ To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
+
+ ```
+ pip install nemo_toolkit['all']
+ ```
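+
+ If you want to verify the install before downloading any checkpoints (a small optional check, not part of the original instructions), the TTS collection should import cleanly:
+
+ ```python
+ # Confirm that NeMo and its TTS collection are importable.
+ import nemo
+ import nemo.collections.tts
+ print(nemo.__version__)
+ ```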
+
+ ### Automatically instantiate the model
+
+ NOTE: In order to generate audio, you also need a spectrogram generator from NeMo. This example uses the FastPitch model.
+
+ ```python
+ # Load FastPitch
+ from nemo.collections.tts.models import FastPitchModel
+ spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
+
+ # Load vocoder
+ from nemo.collections.tts.models import HifiGanModel
+ model = HifiGanModel.from_pretrained(model_name="tts_hifigan")
+ ```
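+
+ Both objects are regular PyTorch modules, so the usual device and evaluation-mode handling applies (an optional step, assumed here rather than required by this card):
+
+ ```python
+ import torch
+
+ # Move both models to GPU if one is available and disable training-time behavior.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ spec_generator = spec_generator.to(device).eval()
+ model = model.to(device).eval()
+ ```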
+
+ ### Generate audio
+
+ ```python
+ import soundfile as sf
+ parsed = spec_generator.parse("You can type your sentence here to get NeMo to produce speech.")
+ spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
+ audio = model.convert_spectrogram_to_audio(spec=spectrogram)
+ ```
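+
+ Since this is pure inference, the same calls can be wrapped in `torch.no_grad()` to avoid building autograd graphs (a general PyTorch practice, not something the card mandates):
+
+ ```python
+ import torch
+
+ with torch.no_grad():
+     parsed = spec_generator.parse("Hello world.")
+     spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
+     audio = model.convert_spectrogram_to_audio(spec=spectrogram)
+ ```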
+
+ ### Save the generated audio file
+
+ ```python
+ # Save the audio to disk in a file called speech.wav
+ # (take the first item of the batch so soundfile receives a 1-D array)
+ sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
+ ```
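+
+ To synthesize several utterances, the same three calls can be looped; the sentence list and file naming below are arbitrary examples:
+
+ ```python
+ sentences = ["The first test sentence.", "The second test sentence."]
+ for i, text in enumerate(sentences):
+     parsed = spec_generator.parse(text)
+     spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
+     audio = model.convert_spectrogram_to_audio(spec=spectrogram)
+     # Write each utterance to its own file at the model's 22050 Hz rate.
+     sf.write(f"speech_{i}.wav", audio.to('cpu').detach().numpy()[0], 22050)
+ ```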
+
+ ### Input
+
+ This model accepts batches of mel spectrograms.
+
+ ### Output
+
+ This model outputs audio at a 22050 Hz sampling rate.
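+
+ Concretely, the shapes look like this (a sketch based on the calls above; the 80 mel bands are an assumption matching NeMo's LJSpeech configuration):
+
+ ```python
+ import torch
+
+ # Hypothetical input: a batch of one 80-band mel spectrogram with 100 frames.
+ dummy_spec = torch.randn(1, 80, 100, device=model.device)
+ with torch.no_grad():
+     dummy_audio = model.convert_spectrogram_to_audio(spec=dummy_spec)
+ print(dummy_audio.shape)  # [1, num_samples] of 22050 Hz audio
+ ```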
+
+ ## Model Architecture
+
+ HiFi-GAN [1] consists of one generator and two discriminators: a multi-scale discriminator and a multi-period discriminator. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance.
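+
+ Per the paper [1], those two additional losses are an L1 mel-spectrogram reconstruction loss and a feature-matching loss over discriminator activations. Below is a schematic of the paper's generator objective (not NeMo's exact training code; the weights 2 and 45 are the paper's values):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def generator_loss(fake_scores, real_feats, fake_feats, mel_real, mel_fake):
+     # Least-squares adversarial term: push discriminator scores toward 1.
+     adv = sum(torch.mean((s - 1.0) ** 2) for s in fake_scores)
+     # Feature matching: L1 between discriminator activations on real vs. fake audio.
+     fm = sum(F.l1_loss(f, r) for r, f in zip(real_feats, fake_feats))
+     # Mel reconstruction: L1 between mel spectrograms of real and generated audio.
+     mel = F.l1_loss(mel_fake, mel_real)
+     return adv + 2.0 * fm + 45.0 * mel
+ ```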
83
+
84
+ ## Training
85
+
86
+ The NeMo toolkit [3] was used for training the models for several epochs. These model are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml).
87
+
88
+ ### Datasets
89
+
90
+ This model is trained on LJSpeech sampled at 22050Hz, and has been tested on generating female English voices with an American accent.
91
+
92
+ ## Performance
93
+
94
+ No performance information is available at this time.
+
+ ## Limitations
+
+ There are no known limitations at this time.
+
+ ## Deployment with NVIDIA Riva
+
+ For the best real-time accuracy, latency, and throughput, deploy the model with [NVIDIA Riva](https://developer.nvidia.com/riva), an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
+ Additionally, Riva provides:
+ * World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
+ * Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
+ * Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support
+
+ Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).
+
+ ## References
+
+ - [1] [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
+ - [2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)