Files changed (1)
  1. README.md +136 -140
README.md CHANGED
@@ -23,9 +23,10 @@ datasets:
  thumbnail: null
  tags:
  - automatic-speech-recognition
  - speech
  - audio
- - Transducer
  - FastConformer
  - Conformer
  - pytorch
@@ -37,62 +38,8 @@ widget:
  - example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
  model-index:
- - name: parakeet_rnnt_1.1b
  results:
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: AMI (Meetings test)
- type: edinburghcstr/ami
- config: ihm
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 17.10
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: Earnings-22
- type: revdotcom/earnings22
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 14.11
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: GigaSpeech
- type: speechcolab/gigaspeech
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 9.96
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: LibriSpeech (clean)
- type: librispeech_asr
- config: other
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 1.46
  - task:
  name: Automatic Speech Recognition
  type: automatic-speech-recognition
@@ -106,7 +53,7 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 2.47
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
@@ -120,27 +67,13 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 3.11
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
  dataset:
- name: tedlium-v3
- type: LIUM/tedlium
- config: release1
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 3.92
- - task:
- name: Automatic Speech Recognition
- type: automatic-speech-recognition
- dataset:
- name: Vox Populi
- type: facebook/voxpopuli
  config: en
  split: test
  args:
@@ -148,24 +81,10 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 5.39
- - task:
- type: Automatic Speech Recognition
- name: automatic-speech-recognition
- dataset:
- name: Mozilla Common Voice 9.0
- type: mozilla-foundation/common_voice_9_0
- config: en
- split: test
- args:
- language: en
- metrics:
- - name: Test WER
- type: wer
- value: 5.79
-
  metrics:
  - wer
  pipeline_tag: automatic-speech-recognition
  ---
@@ -185,94 +104,172 @@ img {
  | [![Language](https://img.shields.io/badge/Language-es-lightgrey#model-badge)](#datasets)
  | [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

- CHANGE FROM HERE
- `parakeet-rnnt-1.1b` is an ASR model that transcribes speech in lower case English alphabet. This model is jointly developed by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Suno.ai](https://www.suno.ai/) teams.
- It is an XXL version of FastConformer Transducer [1] (around 1.1B parameters) model.
- See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.

- ## NVIDIA NeMo: Training

  To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version.
  ```
  pip install nemo_toolkit['all']
- ```

  ## How to Use this Model

- The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

- ### Automatically instantiate the model

  ```python
- import nemo.collections.asr as nemo_asr
- asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/parakeet-rnnt-1.1b")
  ```

- ### Transcribing using Python
- First, let's get a sample
  ```
- wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
  ```
- Then simply do:
  ```
- asr_model.transcribe(['2086-149220-0033.wav'])
  ```

- ### Transcribing many audio files

- ```shell
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
- pretrained_name="nvidia/parakeet-rnnt-1.1b"
- audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
  ```

  ### Input

- This model accepts 16000 Hz mono-channel audio (wav files) as input.

  ### Output

- This model provides transcribed speech as a string for a given audio sample.

- ## Model Architecture

- FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a Transducer decoder (RNNT) loss. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

  ## Training

- The NeMo toolkit [3] was used for training the models for over several hundred epochs. These model are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).

  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

  ### Datasets

- The model was trained on 64K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.

- The training dataset consists of private subset with 40K hours of English speech plus 24K hours from the following public datasets:

- - Librispeech 960 hours of English speech
- - Fisher Corpus
- - Switchboard-1 Dataset
- - WSJ-0 and WSJ-1
- - National Speech Corpus (Part 1, Part 6)
- - VCTK
- - VoxPopuli (EN)
- - Europarl-ASR (EN)
- - Multilingual Librispeech (MLS EN) - 2,000 hour subset
- - Mozilla Common Voice (v7.0)
- - People's Speech - 12,000 hour subset

  ## Performance

- The performance of Automatic Speech Recognition models is measuring using Word Error Rate. Since this dataset is trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio in general.

- The following tables summarizes the performance of the available models in this collection with the Transducer decoder. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.

- |**Version**|**Tokenizer**|**Vocabulary Size**|**AMI**|**Earnings-22**|**Giga Speech**|**LS test-clean**|**SPGI Speech**|**TEDLIUM-v3**|**Vox Populi**|**Common Voice**|
- |---------|-----------------------|-----------------|---------------|---------------|------------|-----------|-----|-------|------|------|
- | 1.22.0 | SentencePiece Unigram | 1024 | 17.10 | 14.11 | 9.96 | 1.46 | 2.47 | 3.11 | 3.92 | 5.39 | 5.79 |

- These are greedy WER numbers without external LM. More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

  ## NVIDIA Riva: Deployment

@@ -283,21 +280,20 @@ Additionally, Riva provides:
  * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
  * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support

- Although this model isn’t supported yet by Riva, the [list of supported models is here](https://huggingface.co/models?other=Riva).
  Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

  ## References
  [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

- [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
-
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

- [4] [Suno.ai](https://suno.ai/)

- [5] [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

  ## Licence

- License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
  thumbnail: null
  tags:
  - automatic-speech-recognition
+ - automatic-speech-translation
  - speech
  - audio
+ - Transformer
  - FastConformer
  - Conformer
  - pytorch
  - example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
  model-index:
+ - name: canary-1b
  results:
  - task:
  name: Automatic Speech Recognition
  type: automatic-speech-recognition
  metrics:
  - name: Test WER
  type: wer
+ value: 2.89
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
  metrics:
  - name: Test WER
  type: wer
+ value: 4.79
  - task:
  type: Automatic Speech Recognition
  name: automatic-speech-recognition
  dataset:
+ name: Mozilla Common Voice 16.1
+ type: mozilla-foundation/common_voice_16_1
  config: en
  split: test
  args:
  metrics:
  - name: Test WER
  type: wer
+ value: 3.99
  metrics:
  - wer
+ - bleu
  pipeline_tag: automatic-speech-recognition
  ---
  | [![Language](https://img.shields.io/badge/Language-es-lightgrey#model-badge)](#datasets)
  | [![Language](https://img.shields.io/badge/Language-fr-lightgrey#model-badge)](#datasets)

+ NVIDIA NeMo Canary is a family of multilingual, multi-tasking models that achieves state-of-the-art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
+
+ ## Model Architecture
+ Canary is an encoder-decoder model with a FastConformer [1] encoder and a Transformer decoder [2].
+ With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
+ are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer built from the individual
+ SentencePiece [3] tokenizers of each language, which makes it easy to scale up to more languages.
+ The Canary-1B model has 24 encoder layers and 24 decoder layers in total.
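+
+ As a toy illustration of the concatenated-tokenizer idea (made-up vocabularies, not the NeMo implementation), each language keeps its own vocabulary and the combined tokenizer maps every sub-tokenizer into a disjoint ID range:
+
+ ```python
+ # Per-language vocabularies; in the real model these come from
+ # separately trained SentencePiece tokenizers (values here are invented).
+ per_lang_vocab = {
+     "en": ["<unk>", "▁the", "▁speech", "▁model"],
+     "de": ["<unk>", "▁das", "▁sprach", "▁modell"],
+ }
+
+ # Give each language a contiguous block of IDs in the shared space.
+ offsets, total = {}, 0
+ for lang, vocab in per_lang_vocab.items():
+     offsets[lang] = total
+     total += len(vocab)
+
+ def token_to_id(lang: str, token: str) -> int:
+     """Map a language-specific token into the concatenated ID space."""
+     vocab = per_lang_vocab[lang]
+     local_id = vocab.index(token) if token in vocab else 0  # 0 = <unk>
+     return offsets[lang] + local_id
+
+ print(token_to_id("en", "▁speech"))  # falls in the English ID block
+ print(token_to_id("de", "▁das"))     # falls in the German ID block
+ ```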
+
+ ## NVIDIA NeMo

  To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version.
  ```
  pip install nemo_toolkit['all']
+ ```
+
  ## How to Use this Model

+ The model is available for use in the NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

+ ### Loading the Model

  ```python
+ from nemo.collections.asr.models import EncDecMultiTaskModel
+
+ # load model
+ canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
+
+ # update decode params
+ decode_cfg = canary_model.cfg.decoding
+ decode_cfg.beam.beam_size = 5  # default is greedy with beam_size=1
+ canary_model.change_decoding_strategy(decode_cfg)
  ```
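+
+ To switch back to greedy decoding later, the same pattern applies; this short follow-up sketch simply reuses the `canary_model` and API calls shown above:
+
+ ```python
+ # Reset the decoding strategy to greedy search.
+ decode_cfg = canary_model.cfg.decoding
+ decode_cfg.beam.beam_size = 1  # beam_size=1 corresponds to greedy decoding
+ canary_model.change_decoding_strategy(decode_cfg)
+ ```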
+
+ ### Input Format
+ The input to the model can be a directory containing audio files, in which case the model will perform English ASR and produce text with punctuation and capitalization:
+
+ ```python
+ predicted_text = canary_model.transcribe(
+     audio_dir="<path to directory containing audios>",
+     batch_size=16,  # batch size to run the inference with
+ )
  ```
+
+ or use:
+
+ ```bash
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+ pretrained_name="nvidia/canary-1b"
+ audio_dir="<path to audio directory>"
  ```
+
+ Another recommended option is to use a JSON manifest as input, where each line in the file is a dictionary containing the following fields:
+ ```yaml
+ # Example of a line in input_manifest.json
+ {
+     "audio_filepath": "/path/to/audio.wav",  # path to the audio file
+     "duration": 10000.0,  # duration of the audio in seconds
+     "taskname": "asr",  # use "s2t_translation" for AST
+     "source_lang": "en",  # set `source_lang`=`target_lang` for ASR, choices=['en','de','es','fr']
+     "target_lang": "de",  # choices=['en','de','es','fr']
+     "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
+ }
  ```
+
+ and then use:
+ ```python
+ predicted_text = canary_model.transcribe(
+     paths2audio_files="<path to input manifest file>",
+     batch_size=16,  # batch size to run the inference with
+ )
  ```

+ or use:

+ ```bash
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
+ pretrained_name="nvidia/canary-1b"
+ dataset_manifest="<path to manifest file>"
+ ```
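+
+ As an illustration, such a manifest can also be generated programmatically; the sketch below uses only the Python standard library, and the audio path and duration are made-up placeholders:
+
+ ```python
+ import json
+
+ # One entry per utterance; the values here are hypothetical examples.
+ entries = [
+     {
+         "audio_filepath": "/data/audio/sample_0001.wav",
+         "duration": 4.2,  # seconds
+         "taskname": "asr",
+         "source_lang": "en",
+         "target_lang": "en",
+         "pnc": "yes",
+     },
+ ]
+
+ # NeMo manifests are JSON Lines: one JSON object per line.
+ with open("input_manifest.json", "w", encoding="utf-8") as f:
+     for entry in entries:
+         f.write(json.dumps(entry) + "\n")
+ ```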
+
+ ### Automatic Speech-to-text Recognition (ASR)
+
+ An example manifest for transcribing English audio can be:
+
+ ```yaml
+ # Example of a line in input_manifest.json
+ {
+     "audio_filepath": "/path/to/audio.wav",  # path to the audio file
+     "duration": 10000.0,  # duration of the audio in seconds
+     "taskname": "asr",
+     "source_lang": "en",
+     "target_lang": "en",
+     "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
+ }
+ ```
+
+ ### Automatic Speech-to-text Translation (AST)
+
+ An example manifest for translating English audio into German text can be:
+
+ ```yaml
+ # Example of a line in input_manifest.json
+ {
+     "audio_filepath": "/path/to/audio.wav",  # path to the audio file
+     "duration": 10000.0,  # duration of the audio in seconds
+     "taskname": "s2t_translation",
+     "source_lang": "en",
+     "target_lang": "de",
+     "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
+ }
  ```

+
  ### Input

+ This model accepts single-channel (mono) audio sampled at 16000 Hz, along with the task/language/PnC tags, as input.
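+
+ If your recordings are not already 16 kHz mono, they can be converted beforehand. The sketch below assumes `librosa` and `soundfile` are installed; the file names are placeholders:
+
+ ```python
+ import librosa
+ import soundfile as sf
+
+ # Load any audio file, downmix to mono and resample to 16 kHz.
+ audio, sr = librosa.load("input_recording.flac", sr=16000, mono=True)
+
+ # Write a 16-bit PCM WAV file suitable for the model.
+ sf.write("input_recording_16k.wav", audio, sr, subtype="PCM_16")
+ ```
+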
  ### Output

+ The model outputs the transcribed/translated text corresponding to the input audio, in the specified target language and with or without punctuation and capitalization.

  ## Training

+ Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs, in 24 hrs. The model can be trained using this example script and base config.

  The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

  ### Datasets

+ The Canary-1B model is trained on 70K hours of speech audio with transcriptions in their original languages for ASR, and with machine-generated translations into each supported language for speech translation.

+ The training data contains 43K hours of English speech collected and prepared by the NVIDIA NeMo and [Suno](https://suno.ai/) teams, and an in-house subset with 27K hours of English/German/Spanish/French speech.

  ## Performance

+ ASR performance is measured with word error rate (WER) on different datasets, whereas AST performance is measured with BLEU score. Predictions were generated using beam search with width 5 and length penalty 1.0.
+
+ ### ASR Performance (WER %, w/o PnC)
+
+ We use the [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets in four languages, and process the ground-truth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).

+ | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
+ |:---------:|:-----------:|:------:|:------:|:------:|:------:|
+ | 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |

+ More details on evaluation can be found at the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
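+
+ As a rough illustration of this normalize-then-score protocol (not the exact evaluation code), the sketch below assumes `whisper-normalizer` and `jiwer` are installed; the reference and hypothesis strings are made up:
+
+ ```python
+ import jiwer
+ from whisper_normalizer.english import EnglishTextNormalizer
+
+ normalizer = EnglishTextNormalizer()
+
+ # Hypothetical ground-truth and model output for one English utterance.
+ reference = "Mr. Brown paid twenty dollars for the tickets."
+ hypothesis = "mister brown paid twenty dollars for the ticket"
+
+ # Normalize both sides before scoring, then compute word error rate.
+ wer = jiwer.wer(normalizer(reference), normalizer(hypothesis))
+ print(f"WER: {wer:.2%}")
+ ```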
+
+ ### AST Performance (BLEU)
+
+ We evaluate on the FLEURS test sets and use the native annotations with punctuation and capitalization.
+
+ | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
+ |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+
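+ BLEU for the translation directions can be computed in the same spirit; the sketch below uses `sacrebleu` (assumed to be installed) with made-up sentences:
+
+ ```python
+ import sacrebleu
+
+ # Hypothetical model output and reference for a single En->De segment.
+ hypotheses = ["Das ist ein Beispielsatz."]
+ references = [["Dies ist ein Beispielsatz."]]  # one list per reference set
+
+ # corpus_bleu expects all hypotheses plus one or more reference streams.
+ bleu = sacrebleu.corpus_bleu(hypotheses, references)
+ print(f"BLEU: {bleu.score:.2f}")
+ ```
+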
  ## NVIDIA Riva: Deployment

  * Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
  * Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support

+ Although this model isn’t supported yet by Riva, you can check the [list of supported models](https://huggingface.co/models?other=Riva).
  Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

+
  ## References
  [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

+ [2] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

+ [3] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

+ [4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

  ## Licence

+ The license to use this model is covered by [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.