Update README.md #3
by steveheh - opened
README.md CHANGED
@@ -107,6 +107,7 @@ img {
 NVIDIA NeMo Canary is a family of multi-lingual multi-tasking models that achieves state-of-the art performance on multiple benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English with or without punctuation and capitalization (PnC).

 ## Model Architecture
+
 Canary is an encoder-decoder model with FastConformer [1] encoder and Transformer Decoder [2].
 With audio features extracted from the encoder, task tokens such as `<source language>`, `<target language>`, `<task>` and `<toggle PnC>`
 are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer from individual
@@ -248,11 +249,11 @@ The training data contains 43K hours of English speech collected and prepared by

 ## Performance

-
+In both ASR and AST experiments, predictions were generated using beam search with width 5 and length penalty 1.0.

 ### ASR Performance (w/o PnC)

-
+The ASR performance is measured with word error rate (WER) on [MCV-16.1](https://commonvoice.mozilla.org/en/datasets) test sets on four languages, and we process the groundtruth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/).


 | **Version** | **Model** | **En** | **De** | **Es** | **Fr** |
@@ -264,7 +265,7 @@ More details on evaluation can be found at [HuggingFace ASR Leaderboard](https:/

 ### AST Performance

-We evaluate on the FLEURS test sets and use
+We evaluate AST performance with BLEU score on the [FLEURS](https://huggingface.co/datasets/google/fleurs) test sets on four languages and use their native annotations with punctuation and capitalization.

 | **Version** | **Model** | **En->De** | **En->Es** | **En->Fr** | **De->En** | **Es->En** | **Fr->En** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
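
The text added to the ASR section above measures performance with word error rate (WER) after normalizing both the ground truth and the prediction. The sketch below illustrates that kind of computation. It is not the evaluation script behind the reported numbers. `simple_normalize` is a hypothetical stand-in for whisper-normalizer (it only lowercases and strips ASCII punctuation). Aggregating at corpus level, as total word edits divided by total reference words, is an assumption, since the diff does not say how scores were pooled.

```python
import string


def simple_normalize(text: str) -> list[str]:
    """Hypothetical stand-in for a real normalizer: lowercase, drop ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER (assumed aggregation): total edits / total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref_words = simple_normalize(ref_text)
        hyp_words = simple_normalize(hyp_text)
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words after normalization")
    return total_edits / total_words
```

For example, `word_error_rate(["Hello, world!"], ["hello world"])` treats the two strings as identical because casing and punctuation are removed before scoring, which matches the "w/o PnC" framing of the ASR table.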