Commit ec98b37 • Parent(s): e2ec446

Update README.md (#6)

- Update README.md (cb8e9db32f0d568ec167b39d8283a7d790de2f7d)

Co-authored-by: He Huang <steveheh@users.noreply.huggingface.co>

README.md CHANGED
````diff
@@ -304,7 +304,7 @@ canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
 
 # update decode params
 decode_cfg = canary_model.cfg.decoding
-decode_cfg.beam.beam_size =
+decode_cfg.beam.beam_size = 1
 canary_model.change_decoding_strategy(decode_cfg)
 ```
 
````
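Taken together with the loading line shown in the hunk header, the updated snippet can be exercised end to end. A minimal sketch, with a hypothetical audio path; note that the `transcribe` keyword is `paths2audio_files` in NeMo releases of this era but may be `audio` in newer ones:

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load the pretrained model, as in the README
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# update decode params: beam search with beam_size=1 (greedy-equivalent)
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# run inference on a hypothetical audio file
predicted_text = canary_model.transcribe(
    paths2audio_files=['/path/to/audio.wav'],
    batch_size=16,
)
```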
````diff
@@ -332,10 +332,10 @@ Another recommended option is to use a json manifest as input, where each line i
 {
     "audio_filepath": "/path/to/audio.wav", # path to the audio file
     "duration": 10000.0, # duration of the audio
-    "taskname": "asr", # use "
-    "source_lang": "en", # Set `source_lang
-    "target_lang": "
-    "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+    "taskname": "asr", # use "ast" for speech-to-text translation
+    "source_lang": "en", # Set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
+    "target_lang": "en", # Language of the text output, choices=['en','de','es','fr']
+    "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
 
````
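Since the manifest is JSON Lines (one record per line), it is easy to build programmatically. A minimal sketch using only the standard library; the output filename, audio path, and duration are hypothetical, and the field comments mirror the updated README:

```python
import json

# one entry per utterance; path and duration here are placeholders
records = [
    {
        "audio_filepath": "/path/to/audio.wav",
        "duration": 10000.0,
        "taskname": "asr",    # "ast" for speech-to-text translation
        "source_lang": "en",  # audio language: 'en', 'de', 'es', or 'fr'
        "target_lang": "en",  # equals source_lang for ASR
        "pnc": "yes",         # punctuation & capitalization: 'yes' or 'no'
    },
]

# write one JSON object per line, as the manifest format expects
with open("input_manifest.json", "w") as fout:
    for rec in records:
        fout.write(json.dumps(rec) + "\n")
```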
````diff
@@ -367,7 +367,7 @@ An example manifest for transcribing English audios can be:
     "taskname": "asr",
     "source_lang": "en",
     "target_lang": "en",
-    "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+    "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
 
````
````diff
@@ -381,10 +381,10 @@ An example manifest for transcribing English audios into German text can be:
 {
     "audio_filepath": "/path/to/audio.wav", # path to the audio file
     "duration": 10000.0, # duration of the audio
-    "taskname": "
+    "taskname": "ast",
     "source_lang": "en",
     "target_lang": "de",
-    "pnc": yes, # whether to have PnC output, choices=['yes', 'no']
+    "pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
 }
 ```
 
````
````diff
@@ -401,7 +401,8 @@ The model outputs the transcribed/translated text corresponding to the input aud
 
 ## Training
 
-Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs
+Canary-1B is trained using the NVIDIA NeMo toolkit [4] for 150k steps with dynamic bucketing and a batch duration of 360s per GPU on 128 NVIDIA A100 80GB GPUs.
+The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/speech_multitask/speech_to_text_aed.py) and [base config](https://github.com/NVIDIA/NeMo/blob/canary-2/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml).
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
````
````diff
@@ -410,6 +411,38 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 The Canary-1B model is trained on a total of 85k hrs of speech data. It consists of 31k hrs of public data, 20k hrs collected by [Suno](https://suno.ai/), and 34k hrs of in-house data.
 
+The constituents of public data are as follows.
+
+#### English (25.5k hours)
+- Librispeech 960 hours
+- Fisher Corpus
+- Switchboard-1 Dataset
+- WSJ-0 and WSJ-1
+- National Speech Corpus (Part 1, Part 6)
+- VCTK
+- VoxPopuli (EN)
+- Europarl-ASR (EN)
+- Multilingual Librispeech (MLS EN) - 2,000 hour subset
+- Mozilla Common Voice (v7.0)
+- People's Speech - 12,000 hour subset
+- Mozilla Common Voice (v11.0) - 1,474 hour subset
+
+#### German (2.5k hours)
+- Mozilla Common Voice (v12.0) - 800 hour subset
+- Multilingual Librispeech (MLS DE) - 1,500 hour subset
+- VoxPopuli (DE) - 200 hour subset
+
+#### Spanish (1.4k hours)
+- Mozilla Common Voice (v12.0) - 395 hour subset
+- Multilingual Librispeech (MLS ES) - 780 hour subset
+- VoxPopuli (ES) - 108 hour subset
+- Fisher - 141 hour subset
+
+#### French (1.8k hours)
+- Mozilla Common Voice (v12.0) - 708 hour subset
+- Multilingual Librispeech (MLS FR) - 926 hour subset
+- VoxPopuli (FR) - 165 hour subset
+
 
 ## Performance
 
````
````diff
@@ -417,23 +450,47 @@ In both ASR and AST experiments, predictions were generated using beam search wi
 
 ### ASR Performance (w/o PnC)
 
-The ASR performance is measured with word error rate (WER)
+The ASR performance is measured with word error rate (WER), and we process the ground-truth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
 
+WER on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test set:
 
 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|
 | 1.23.0 | canary-1b | 7.97 | 4.61 | 3.99 | 6.53 |
 
 
+WER on [MLS](https://huggingface.co/datasets/facebook/multilingual_librispeech) test set:
+
+| **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
+|:---------:|:-----------:|:------:|:------:|:------:|:------:|
+| 1.23.0 | canary-1b | 3.06 | 4.19 | 3.15 | 4.12 |
+
+
 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
````
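As a sketch of this scoring protocol, assuming the `whisper-normalizer` and `jiwer` packages (the reference and hypothesis strings below are made up), the same normalizer is applied to both sides before computing WER; the package also ships a `BasicTextNormalizer` for the non-English languages:

```python
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

# hypothetical ground-truth and model output
refs = ["Mr. Smith went to Washington."]
hyps = ["mister smith went to washington"]

# normalize both sides, then score
refs = [normalizer(t) for t in refs]
hyps = [normalizer(t) for t in hyps]
print(jiwer.wer(refs, hyps))
```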
````diff
 
 ### AST Performance
 
-We evaluate AST performance with BLEU score
+We evaluate AST performance with BLEU score, using the test sets' native annotations with punctuation and capitalization.
+
+BLEU score on [FLEURS](https://huggingface.co/datasets/google/fleurs) test set:
 
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
-| 1.23.0 | canary-1b | 22.66
+| 1.23.0 | canary-1b | 22.66 | 41.11 | 40.76 | 32.64 | 32.15 | 23.57 |
+
+
+BLEU score on [COVOST-v2](https://github.com/facebookresearch/covost) test set:
+
+| **Version** | **Model** | **De->En** | **Es->En** | **Fr->En** |
+|:-----------:|:---------:|:----------:|:----------:|:----------:|
+| 1.23.0 | canary-1b | 37.67 | 40.7 | 40.42 |
+
+BLEU score on [mExpresso](https://huggingface.co/facebook/seamless-expressive#mexpresso-multilingual-expresso) test set:
+
+| **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** |
+|:-----------:|:---------:|:----------:|:----------:|:----------:|
+| 1.23.0 | canary-1b | 23.84 | 35.74 | 28.29 |
+
 
 
 ## NVIDIA Riva: Deployment
````
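The README does not name a BLEU implementation; a minimal sketch with `sacrebleu` (one common choice, an assumption here), keeping the native punctuation and capitalization rather than normalizing it away; the En->De strings are hypothetical:

```python
import sacrebleu

# hypothetical system outputs and references, case and punctuation intact
hypotheses = ["Der Hund rennt über das Feld."]
references = [["Der Hund läuft über das Feld."]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))
```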