Update README.md (#1)

286d047 over 2 years ago

5.42 kB

	---
	language:
	- multilingual
	- fr
	- de
	- es
	- ca
	- it
	- ru
	- zh
	- pt
	- fa
	- et
	- mn
	- nl
	- tr
	- ar
	- sv
	- lv
	- sl
	- ta
	- ja
	- id
	- cy
	- en
	datasets:
	- common_voice
	- multilingual_librispeech
	- covost2
	tags:
	- speech
	- xls_r
	- automatic-speech-recognition
	- xls_r_translation
	pipeline_tag: automatic-speech-recognition
	license: apache-2.0
	widget:
	- example_title: Swedish
	src: https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3
	- example_title: Arabic
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3
	- example_title: Russian
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
	- example_title: German
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3
	- example_title: French
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3
	- example_title: Indonesian
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3
	- example_title: Italian
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3
	- example_title: Japanese
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3
	- example_title: Mongolian
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3
	- example_title: Dutch
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3
	- example_title: Russian
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
	- example_title: Turkish
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3
	- example_title: Catalan
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3
	- example_title: English
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3
	- example_title: Dutch
	src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3
	---

	# Wav2Vec2-XLS-R-2b-21-EN

	Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation.

	![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)

	This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
	The encoder was warm-started from the [`facebook/wav2vec2-xls-r-1b`](https://huggingface.co/facebook/wav2vec2-xls-r-1b) checkpoint and
	the decoder from the [`facebook/mbart-large-50`](https://huggingface.co/facebook/mbart-large-50) checkpoint.
	Consequently, the encoder-decoder model was fine-tuned on 21 `{lang}` -> `en` translation pairs of the [Covost2 dataset](https://huggingface.co/datasets/covost2).

	The model can translate from the following spoken languages `{lang}` -> `en` (English):

	{`fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`} -> `en`

	For more information, please refer to Section 5.1.2 of the [official XLS-R paper](https://arxiv.org/abs/2111.09296).

	## Usage

	### Demo

	The model can be tested directly on the speech recognition widget on this model card!
	Simple record some audio in one of the possible spoken languages or pick an example audio file to see how well the checkpoint can translate the input.

	### Example

	As this a standard sequence to sequence transformer model, you can use the `generate` method to generate the
	transcripts by passing the speech features to the model.

	You can use the model directly via the ASR pipeline

	```python
	from datasets import load_dataset
	from transformers import pipeline

	# replace following lines to load an audio file of your choice
	librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
	audio_file = librispeech_en[0]["file"]

	asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-1b-21-to-en")

	translation = asr(audio_file)
	```

	or step-by-step as follows:

	```python
	import torch
	from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
	from datasets import load_dataset

	model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
	processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")

	ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

	inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["array"]["sampling_rate"], return_tensors="pt")
	generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
	transcription = processor.batch_decode(generated_ids)
	```

	## Results `{lang}` -> `en`

	See the row of XLS-R (1B) for the performance on [Covost2](https://huggingface.co/datasets/covost2) for this model.

	![results image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/X-%3EEnglish.png)

	## More XLS-R models for `{lang}` -> `en` Speech Translation

	- [Wav2Vec2-XLS-R-300M-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-300m-21-to-en)
	- [Wav2Vec2-XLS-R-1B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-1b-21-to-en)
	- [Wav2Vec2-XLS-R-2B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-2b-21-to-en)
	- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)