slegroux committed
Commit d9407e4
Parents (2): 99e4e51, ef659d6

Merge branch 'main' of https://huggingface.co/cisco-ai/tts-maui-obama into main

Files changed (1): README.md +37 -0
README.md CHANGED
@@ -1,3 +1,40 @@
  ---
  license: bsd-3-clause
+ language: en
  ---
+
+ # Maui TTS base model (Tacotron2 + HiFi-GAN)
+
+ We use a recurrent sequence-to-sequence Mel-spectrogram prediction network based on Google's Tacotron2 as a baseline. It achieves good performance, has been tested in many different contexts, and gives a very realistic rendering of a speaker's characteristics. On the vocoder side, we replaced Google's original WaveNet synthesizer with the more recent HiFi-GAN vocoder for more realistic speech prosody and faster training.
+
+ ### How to use
+
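+ As a starting point, here is a minimal sketch of the two-stage inference pipeline (text -> Tacotron2 Mel-spectrogram -> HiFi-GAN waveform). It assumes SpeechBrain-style `Tacotron2`/`HIFIGAN` wrappers and uses their public LJSpeech checkpoints as placeholders; the checkpoints and loading code for this repository may differ.
+
+ ```python
+ # Minimal sketch (assumption): SpeechBrain-style Tacotron2 + HiFi-GAN inference.
+ import torchaudio
+ from speechbrain.pretrained import Tacotron2, HIFIGAN
+
+ # Placeholder checkpoints; swap in this repository's weights if they are compatible.
+ tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmp_tts")
+ hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmp_vocoder")
+
+ # Text -> Mel-spectrogram -> waveform
+ mel_output, mel_length, alignment = tacotron2.encode_text("Hello, this is a test of the text-to-speech pipeline.")
+ waveforms = hifi_gan.decode_batch(mel_output)
+
+ torchaudio.save("example_tts.wav", waveforms.squeeze(1), 22050)
+ ```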
+
+ ## Training data
+
+ To train these models, we need parallel corpora of speech and the corresponding transcripts of the speaker we are modeling. The training data can be extracted from audiobooks; in this case, ["A Promised Land"](https://www.amazon.com/A-Promised-Land-Obama-Audiobook/dp/B08HGH9JMF), purchased on Audible.
+
+ ## Training procedure
+
+ ### Ground Truth
+ The ground truth for the transcripts is obtained by sending the full Webex meetings to a third-party transcription service.
+
+ ### Text Normalization
+ We are modeling a speaker's voice, so we need to normalize the text transcript in a way that makes the pronunciation of every word explicit: the audio and its corresponding text representation should match as closely as possible. For instance, written numbers like '2020' should normalize to 'twenty twenty', 'Mr.' to 'Mister', etc.
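+
+ A minimal sketch of this kind of rule-based normalization (the `num2words` year conversion and the abbreviation table below are illustrative assumptions, not the exact normalizer used for training):
+
+ ```python
+ # Illustrative sketch only; the normalizer actually used for training is not published.
+ import re
+ from num2words import num2words  # third-party package: pip install num2words
+
+ # Hypothetical abbreviation table for illustration.
+ ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Dr.": "Doctor"}
+
+ def normalize(text: str) -> str:
+     for abbr, expansion in ABBREVIATIONS.items():
+         text = text.replace(abbr, expansion)
+     # Spell out four-digit numbers as years, e.g. '2020' -> 'twenty twenty'.
+     return re.sub(r"\b(\d{4})\b", lambda m: num2words(int(m.group(1)), to="year"), text)
+
+ print(normalize("Mr. Obama published the book in 2020."))
+ ```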
+
+ ### Speech Segmentation
+ The first step is to segment the speech into sentence-like snippets of audio of no more than 10 s, so that the data fits on our GPUs and each segment represents a fairly natural input length. For this task, we used the energy-based WebRTC Voice Activity Detection (VAD) because of its speed, its flexibility through the 'aggressiveness' parameter, and its robustness to noise.
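+
+ A minimal sketch of the WebRTC VAD framing step, assuming 16 kHz 16-bit mono PCM input (the frame length and the downstream logic that merges speech frames into <= 10 s utterances are illustrative, not the exact segmentation code used here):
+
+ ```python
+ # Illustrative sketch of WebRTC VAD framing; merging frames into utterances is omitted.
+ import webrtcvad  # pip install webrtcvad
+
+ SAMPLE_RATE = 16000   # webrtcvad supports 8/16/32/48 kHz, 16-bit mono PCM
+ FRAME_MS = 30         # frames must be 10, 20 or 30 ms long
+ FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 2 bytes per 16-bit sample
+
+ vad = webrtcvad.Vad(2)  # 'aggressiveness' in [0, 3]; higher values filter more non-speech
+
+ def speech_flags(pcm: bytes):
+     """Yield (frame_start_in_seconds, is_speech) for each complete frame."""
+     for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
+         frame = pcm[offset:offset + FRAME_BYTES]
+         yield offset / (2 * SAMPLE_RATE), vad.is_speech(frame, SAMPLE_RATE)
+ ```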
+
+ ### Speech and Text Alignments
+ Once we have sentence-like audio segments of speech extracted by the VAD, we need to map each segment to its corresponding text transcript. The challenge is that the third party provides a single global transcript for the whole meeting, while we need a per-segment transcript. One way we solved this is by running Voicea's transcription service on each audio segment. Since that transcription is not 100% accurate, we further string-matched it against the global third-party ground truth to correct potential mistakes in the segment-level automatic transcription.
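+
+ A minimal sketch of the string-matching idea using Python's standard `difflib` (the sliding-window search and function names are illustrative assumptions):
+
+ ```python
+ # Illustrative sketch: find the span of the global ground-truth transcript that best
+ # matches a (possibly noisy) per-segment ASR transcript.
+ from difflib import SequenceMatcher
+
+ def best_match(segment_words, global_words):
+     """Return (start, end) indices into global_words and the similarity score."""
+     n = len(segment_words)
+     best_span, best_score = (0, n), 0.0
+     for start in range(max(1, len(global_words) - n + 1)):
+         candidate = global_words[start:start + n]
+         score = SequenceMatcher(None, segment_words, candidate).ratio()
+         if score > best_score:
+             best_span, best_score = (start, start + n), score
+     return best_span, best_score
+
+ span, score = best_match("mister obama rights the book".split(),
+                          "and then mister obama writes the book about hope".split())
+ # The matched ground-truth words replace the noisy ASR output for that segment.
+ ```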
+
+ ### Mel Spectrogram
+ Our model operates on an intermediate representation of audio, the Mel-spectrogram, which can easily be computed offline with Fast Fourier Transform-based algorithms.
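+
+ A minimal sketch of the offline Mel-spectrogram computation with torchaudio, using typical Tacotron2-style parameters (the exact settings and file names here are assumptions):
+
+ ```python
+ # Offline Mel-spectrogram extraction with typical Tacotron2-style settings.
+ import torch
+ import torchaudio
+
+ mel_transform = torchaudio.transforms.MelSpectrogram(
+     sample_rate=22050,  # assumed sampling rate
+     n_fft=1024,
+     win_length=1024,
+     hop_length=256,
+     f_min=0.0,
+     f_max=8000.0,
+     n_mels=80,
+ )
+
+ waveform, sr = torchaudio.load("segment_0001.wav")  # hypothetical VAD segment
+ assert sr == 22050, "resample first if the audio is at a different rate"
+ log_mel = torch.log(mel_transform(waveform).clamp(min=1e-5))  # log-compressed Mel
+ print(log_mel.shape)  # (channels, n_mels, frames)
+ ```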
+
+ For each sentence-like utterance we now have an audio waveform, its Mel-spectrogram representation, and an accurate normalized text transcription. We are ready to train!
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{shen2018natural,
+   title={Natural {TTS} Synthesis by Conditioning {WaveNet} on Mel Spectrogram Predictions},
+   author={Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and RJ Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
+   booktitle={ICASSP},
+   year={2018}
+ }
+
+ @inproceedings{kong2020hifigan,
+   title={{HiFi-GAN}: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis},
+   author={Jungil Kong and Jaehyeon Kim and Jaekyoung Bae},
+   booktitle={Advances in Neural Information Processing Systems},
+   year={2020}
+ }
+ ```