File size: 4,397 Bytes
731ad3d
 
ef659d6
b81081a
 
 
 
 
731ad3d
ef659d6
 
 
 
 
 
5bb2e7d
 
 
 
 
 
8e9d2c5
 
 
5bb2e7d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ef659d6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
license: bsd-3-clause
language: en
tags:
- pytorch
- audio
- text-to-speech
- maui
---

# Maui TTS base model (Tacotron2 + Hifi-gan)

We use a recurrent sequence-to-sequence Mel-spectrogram prediction network based on Google's Tacotron2 as a baseline. It achieves good performance, has been tested in multiple different contexts and gives a very realistic rendering of a speaker's characteristics. On the vocoder side we replaced Google's original Wavenet synthesizer with the recent Hifi-Gan vocoder for improved realistic speech prosody and faster training.

### How to use
Usage based on Maui python library

```python

# IMPORTS
import json
import os
from omegaconf import OmegaConf

from maui.utils.tacotron2 import text2seq
from maui.utils.hifigan import AttrDict
from maui.models.hifigan import Generator
from maui.models.tacotron2 import Tacotron2

# PATHS
MODEL_DIR = "models"
TACO_CONF = os.path.join(MODEL_DIR, "maui-tacotron2.yaml")
OBAMA_CKPT = os.path.join(MODEL_DIR, "obama", "checkpoint_9000")
HIFIGAN_CKPT = os.path.join(MODEL_DIR, "hifigan", "UNIVERSAL_V1", "g_02500000")
HIFIGAN_CONF = os.path.join(MODEL_DIR,"hifigan", "UNIVERSAL_V1","config.json")

# DEVICE FOR INFERENCE
device = torch.device('cuda')
#device = torch.device('cpu')

# MODEL SETUP
taco_cfg = OmegaConf.load(TACO_CONF)
taco = Tacotron2(taco_cfg.model).to(device).share_memory()
taco = taco.setup_inference(OBAMA_CKPT, device=device)
with open(HIFIGAN_CONF, 'r') as f:
    data = f.read()
hifigan_cfg = AttrDict(json.loads(data))
hifigan = Generator(hifigan_cfg).to(device).share_memory()
hifigan = hifigan.setup_inference(HIFIGAN_CKPT, device=device)

# TEXT NORMALIZATION
text = "This is the sentence that will be spoken in the voice of Pr. Obama"
sequence = text2seq(text, device=device)

# INFERENCE
_, mel_outputs_postnet, _, _ = taco.inference(sequence)
audio = hifigan.inference(mel_outputs_postnet, device=device)

```

## Training data

To train these models, we need parallel corpora of speech and corresponding transcripts of the speaker we are modeling. The training data can be extracted from audiobooks. In this case ["A promised land"](https://www.amazon.com/A-Promised-Land-Obama-Audiobook/dp/B08HGH9JMF) purchased on Audible.

## Training procedure

### Ground Truth: 
The ground truth for the transcripts is obtained by sending the full Webex meetings to a third-party transcription service.
Text Normalization: We are modeling a speaker's voice so we need to normalize the text transcript in a way that makes the pronunciation of every word explicit. We need audio and its corresponding text representation to match as much as possible. For instance written numbers like '2020' should normalize to 'twenty twenty', 'Mr.' to 'Mister', etc. 
### Speech Segmentation:
The first step is to segment the speech into sentence-like snippets of audio of no more than 10 s. so the data can fit on our GPUs and segments represent a rather natural input length. For this task, we used the energy-based WebRTC Voice Activity Detection (VAD) because of its speed, flexibility through the 'aggressiveness' parameter and robustness to noise.
### Speech and Text Alignments
Once we have sentence-like audio segments of speech extracted by the VAD, we need to map these to the corresponding text transcript. The challenge at stake here is that we have a global transcript for the whole meeting giving by the third-party but need a segment-based transcript. One way we solved this is by using Voicea's transcription service for each audio segment. Since the transcription is not 100% accurate we further string-matched it with the global third-party ground-truth to correct potential mistakes in the segment-based automatic transcription.
### Mel spectrogram
Our model operates at an intermediary representation of audio, the Mel-spectrogram, which can be easily computed offline using Fast Fourier Transform-based algorithms.
For each sentence-like utterance we now have an audio waveform, its Mel-spectrogram representation and an accurate normalized text transcription. We are ready to train!

### BibTeX entry and citation info

```bibtex
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
```