nvidia/canary-1b

@@ -107,6 +107,7 @@ img {
 NVIDIA NeMo Canary is a family of multi-lingual multi-tasking models that achieves state-of-the-art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC).
 ## Model Architecture
 Canary is an encoder-decoder model with FastConformer [1] encoder and Transformer Decoder [2].
 With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
 are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer from individual
@@ -248,11 +249,11 @@ The training data contains 43K hours of English speech collected and prepared by
 ## Performance
-The ASR performance is measured with word error rate (WER) on different datasets, whereas the AST performance is measured with BLEU score. Predictions were generated using beam search with width 5 and length penalty 1.0.
 ### ASR Performance (w/o PnC)
-We use [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).
 | **Version** | **Model**     | **En**   | **De**   | **Es**   | **Fr**   |
@@ -264,7 +265,7 @@ More details on evaluation can be found at [HuggingFace ASR Leaderboard](https:/
 ### AST Performance
-We evaluate on the FLEURS test sets and use the native annotations with punctuation and capitalization.
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|

 ## Performance
+In both ASR and AST experiments, predictions were generated using beam search with width 5 and length penalty 1.0.
 ### ASR Performance (w/o PnC)
+ASR performance is measured with word error rate (WER) on the [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets in four languages. Both the ground-truth and predicted text are processed with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/) before scoring.
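
As an illustration of the metric described above, the following is a minimal sketch of corpus-level WER computed on normalized text. It is not the evaluation script behind the reported numbers: the `normalize` function is a simplified, hypothetical stand-in for whisper-normalizer (it only lowercases, strips punctuation, and collapses whitespace), and the helper names `edit_distance` and `corpus_wer` are assumptions introduced here.

```python
import re
from typing import List, Sequence


def normalize(text: str) -> str:
    """Simplified stand-in for whisper-normalizer (an assumption, not its real rules):
    lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors divided by total reference words, after normalization."""
    errors = 0
    ref_words_total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref).split()
        hyp_words = normalize(hyp).split()
        errors += edit_distance(ref_words, hyp_words)
        ref_words_total += len(ref_words)
    return errors / ref_words_total if ref_words_total else 0.0
```

The function returns a fraction; multiply by 100 to compare with a percentage-style WER table. Because both sides are normalized first, differences in punctuation and capitalization do not count as errors, which matches the "w/o PnC" setting.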
 | **Version** | **Model**     | **En**   | **De**   | **Es**   | **Fr**   |
 ### AST Performance
+AST performance is measured with BLEU score on the [FLEURS](https://huggingface.co/datasets/google/fleurs) test sets in four languages, using their native annotations with punctuation and capitalization.
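
To make the metric concrete, here is a minimal sketch of corpus-level BLEU with a single reference per segment. The source does not say which BLEU implementation or tokenization was used, so this is only an illustration: whitespace tokenization and the helper names `ngrams` and `corpus_bleu` are assumptions, and scores from established tools may differ. Text is compared as-is, so punctuation and capitalization affect the score.

```python
import math
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references: Sequence[str], hypotheses: Sequence[str],
                max_n: int = 4) -> float:
    """Corpus BLEU on a 0-100 scale: geometric mean of clipped n-gram
    precisions (orders 1..max_n) times a brevity penalty. No smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = 0
    hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok = ref.split()  # whitespace tokenization is an assumption
        hyp_tok = hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

For example, scoring `"hello world , how are you ?"` against the reference `"Hello World , how are you ?"` yields less than 100, since the casing mismatch removes n-gram matches.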
 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|