---
license: cc-by-nc-4.0
language:
- en
- de
- es
- fr
library_name: nemo
datasets:
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
thumbnail: null
tags:
- automatic-speech-recognition
- automatic-speech-translation
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: canary-1b
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 2.89
- task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
dataset:
name: SPGI Speech
type: kensho/spgispeech
config: test
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 4.79
- task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: en
split: test
args:
language: en
metrics:
- name: Test WER (En)
type: wer
value: 7.97
- task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: de
split: test
args:
language: de
metrics:
- name: Test WER (De)
type: wer
value: 4.61
- task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: es
split: test
args:
language: es
metrics:
- name: Test WER (ES)
type: wer
value: 3.99
- task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
dataset:
name: Mozilla Common Voice 16.1
type: mozilla-foundation/common_voice_16_1
config: fr
split: test
args:
language: fr
metrics:
- name: Test WER (Fr)
type: wer
value: 6.53
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
language: en-de
metrics:
- name: Test BLEU (En->De)
type: bleu
value: 22.66
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
        language: en-es
metrics:
- name: Test BLEU (En->Es)
type: bleu
value: 41.11
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: en_us
split: test
args:
        language: en-fr
metrics:
- name: Test BLEU (En->Fr)
type: bleu
value: 40.76
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: de_de
split: test
args:
language: de-en
metrics:
- name: Test BLEU (De->En)
type: bleu
value: 32.64
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: es_419
split: test
args:
language: es-en
metrics:
- name: Test BLEU (Es->En)
type: bleu
value: 32.15
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: FLEURS
type: google/fleurs
config: fr_fr
split: test
args:
language: fr-en
metrics:
- name: Test BLEU (Fr->En)
type: bleu
value: 23.57
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
      config: de_en
split: test
args:
language: de-en
metrics:
- name: Test BLEU (De->En)
type: bleu
value: 37.67
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
      config: es_en
split: test
args:
language: es-en
metrics:
- name: Test BLEU (Es->En)
type: bleu
value: 40.7
- task:
      name: Automatic Speech Translation
      type: automatic-speech-translation
dataset:
name: COVOST
type: covost2
      config: fr_en
split: test
args:
language: fr-en
metrics:
- name: Test BLEU (Fr->En)
type: bleu
value: 40.42
metrics:
- wer
- bleu
pipeline_tag: automatic-speech-recognition
---
# Canary 1B
[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-1B-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

NVIDIA [NeMo Canary](https://nvidia.github.io/NeMo/blogs/2024/2024-02-canary/) is a family of multilingual, multi-tasking models that achieve state-of-the-art performance on multiple speech benchmarks. With 1 billion parameters, Canary-1B supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
## Model Architecture
Canary is an encoder-decoder model with a FastConformer [1] encoder and a Transformer decoder [2].
With audio features extracted from the encoder, task tokens such as `