metadata
language:
  - multilingual
  - fr
  - de
  - es
  - ca
  - it
  - ru
  - zh
  - pt
  - fa
  - et
  - mn
  - nl
  - tr
  - ar
  - sv
  - lv
  - sl
  - ta
  - ja
  - id
  - cy
  - en
datasets:
  - common_voice
  - multilingual_librispeech
  - covost2
tags:
  - speech
  - xls_r
  - automatic-speech-recognition
  - xls_r_translation
pipeline_tag: automatic-speech-recognition
license: apache-2.0
widget:
  - example_title: Swedish
    src: https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3
  - example_title: Arabic
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3
  - example_title: Russian
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
  - example_title: German
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3
  - example_title: French
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3
  - example_title: Indonesian
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3
  - example_title: Italian
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3
  - example_title: Japanese
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3
  - example_title: Mongolian
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3
  - example_title: Dutch
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3
  - example_title: Turkish
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3
  - example_title: Catalan
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3
  - example_title: English
    src: >-
      https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3

Wav2Vec2-XLS-R-2B-22-16 (XLS-R-Any-to-Any)

Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation.


This is a SpeechEncoderDecoderModel. The encoder was warm-started from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. Subsequently, the encoder-decoder model was fine-tuned on {input_lang} -> {output_lang} translation pairs of the Covost2 dataset.

The model can translate from the following spoken languages {input_lang} to the following written languages {output_lang}:

{input_lang} -> {output_lang}

with {input_lang} one of:

{en, fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy}

and {output_lang}:

{en, de, tr, fa, sv-SE, mn, zh-CN, cy, ca, sl, et, id, ar, ta, lv, ja}
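To make the two lists above easier to check programmatically, here is a minimal sketch. The names `INPUT_LANGS`, `OUTPUT_LANGS`, and `is_listed_pair` are hypothetical helpers introduced for illustration and are not part of the checkpoint or of any library. The check only confirms that both codes appear in the lists; it says nothing about translation quality for a given pair.

```python
# Language codes copied from the two lists above.
INPUT_LANGS = {
    "en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
    "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
}
OUTPUT_LANGS = {
    "en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy", "ca", "sl", "et",
    "id", "ar", "ta", "lv", "ja",
}


def is_listed_pair(input_lang: str, output_lang: str) -> bool:
    """Return True if both codes appear in the lists above (hypothetical helper)."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS
```

For instance, `is_listed_pair("fr", "de")` holds, while `is_listed_pair("en", "fr")` does not, because `fr` is not among the written target languages.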

Usage

Demo

The model can be tested on this space. You can select the target language, record some audio in any of the input languages listed above, and then see how well the checkpoint translates the input.

Example

As this is a standard sequence-to-sequence transformer model, you can use the generate method to generate the translations by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint will translate spoken English to written German. To change the written target language, you need to pass the correct forced_bos_token_id to generate(...) to condition the decoder on the correct target language.

To select the correct forced_bos_token_id for your chosen language id, please use the following mapping:

MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
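The language lists earlier on this card use region-tagged codes such as `sv-SE` and `zh-CN`, while the mapping keys are the bare codes `sv` and `zh`. The sketch below shows one possible way to bridge the two. `forced_bos_token_id_for` is a hypothetical helper, not a library function, and `EXAMPLE_MAPPING` is a two-entry excerpt of the mapping above used only to keep the example short; in practice you would pass the full `MAPPING`.

```python
def forced_bos_token_id_for(lang: str, mapping: dict) -> int:
    """Look up the token id for a code such as "sv" or "sv-SE" (hypothetical helper)."""
    # Drop an optional region suffix: "sv-SE" -> "sv", "zh-CN" -> "zh".
    key = lang.split("-")[0].lower()
    if key not in mapping:
        raise ValueError(f"No forced_bos_token_id known for target language {lang!r}")
    return mapping[key]


# Two entries copied from the mapping above, for illustration only.
EXAMPLE_MAPPING = {"sv": 250042, "zh": 250025}

swedish_id = forced_bos_token_id_for("sv-SE", EXAMPLE_MAPPING)
```

Raising an error for an unknown code, rather than silently falling back to the default target language, makes a mistyped language id visible early.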

As an example, if you would like to translate to Swedish, you can do the following:

from datasets import load_dataset
from transformers import pipeline

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["sv"]

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)

or step-by-step as follows:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["sv"]

# the processor wraps a Wav2Vec2 feature extractor, which returns `input_values`
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
translation = processor.batch_decode(generated_ids)
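Because the target language is chosen only through forced_bos_token_id, the same audio can be translated into several written languages by repeating the generation call with different ids. The following is a minimal sketch of that loop. `translate_to_many` and its `translate` argument are hypothetical names introduced here; `translate` stands for any callable that accepts a token id and returns text, for example a thin wrapper around the pipeline call or the generate call shown above.

```python
from typing import Callable, Dict, Iterable


def translate_to_many(
    translate: Callable[[int], str],
    target_langs: Iterable[str],
    mapping: Dict[str, int],
) -> Dict[str, str]:
    """Run `translate` once per target language (hypothetical helper)."""
    results = {}
    for lang in target_langs:
        # Each call conditions the decoder on a different target language.
        results[lang] = translate(mapping[lang])
    return results


# Stand-in for a real model call, used only to show the control flow.
def fake_translate(token_id: int) -> str:
    return f"<translation for token {token_id}>"


outputs = translate_to_many(fake_translate, ["sv", "de"], {"sv": 250042, "de": 250003})
```

With the real checkpoint, `fake_translate` would be replaced by a function that forwards the token id as forced_bos_token_id, as in the examples above.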

More XLS-R models for {lang} -> en Speech Translation