patrickvonplaten
committed on
Commit
•
5b59153
1
Parent(s):
9f16d8d
Update README.md
README.md
CHANGED
@@ -12,4 +12,76 @@ pipeline_tag: automatic-speech-recognition
|
|
12 |
license: apache-2.0
|
13 |
---
|
14 |
|
15 |
-
# Wav2Vec2-XLS-R-
12 |
license: apache-2.0
|
13 |
---
|
14 |
|
15 |
+
# Wav2Vec2-XLS-R-1B-21-EN
|
16 |
+
|
17 |
+
Facebook's Wav2Vec2 XLS-R fine-tuned for **Speech Translation.**
|
18 |
+
|
19 |
+
![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)
|
20 |
+
|
21 |
+
This is a [SpeechEncoderDecoderModel](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html) model.
|
22 |
+
The encoder was warm-started from the [**`facebook/wav2vec2-xls-r-1b`**](https://huggingface.co/facebook/wav2vec2-xls-r-1b) checkpoint and
|
23 |
+
the decoder from the [**`facebook/mbart-large-50`**](https://huggingface.co/facebook/mbart-large-50) checkpoint.
|
24 |
+
Subsequently, the encoder-decoder model was fine-tuned on 21 `{lang}` -> `en` translation pairs of the [Covost2 dataset](https://huggingface.co/datasets/covost2).
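The warm-start described above can be sketched roughly as follows. This is only an illustration of how such an encoder-decoder could be assembled with `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`; it is not confirmed to be the exact procedure used for this checkpoint, the function name `build_warm_started_model` is hypothetical, and no fine-tuning settings are implied. Calling it downloads both large checkpoints.

```python
def build_warm_started_model(
    encoder_id="facebook/wav2vec2-xls-r-1b",
    decoder_id="facebook/mbart-large-50",
):
    """Hypothetical helper: combine a pretrained speech encoder and text decoder.

    The returned model still has to be fine-tuned on speech translation data
    before it produces useful output.
    """
    # Imported lazily so that defining this function does not require
    # transformers to be installed.
    from transformers import SpeechEncoderDecoderModel

    # Loads the encoder weights and the decoder weights from their respective
    # checkpoints and wires them together; the cross-attention layers that
    # connect them are newly initialized.
    return SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )


# Usage (downloads several GB of weights):
# model = build_warm_started_model()
```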
|
25 |
+
|
26 |
+
The model can translate from the following spoken languages `{lang}` -> `en` (English):
|
27 |
+
|
28 |
+
{`fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`} -> `en`
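As a small convenience, the language list above can be checked in code before audio is sent to the model. The sketch below copies the 21 codes exactly as written on this card; the helper names `is_supported` and `translation_pair` are hypothetical and not part of any library.

```python
# The 21 source-language codes listed above (as written on this model card).
SUPPORTED_SOURCE_LANGS = (
    "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et", "mn",
    "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
)


def is_supported(lang):
    """Hypothetical helper: True if `lang` is one of the listed source languages."""
    return lang in SUPPORTED_SOURCE_LANGS


def translation_pair(lang):
    """Hypothetical helper: return the `{lang} -> en` label for a supported language."""
    if not is_supported(lang):
        raise ValueError(f"Unsupported source language: {lang!r}")
    return f"{lang} -> en"
```

Note that the codes are case-sensitive here, so `zh-CN` and `sv-SE` must be written with their region suffix.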
|
29 |
+
|
30 |
+
For more information, please refer to Section *5.1.2* of the [official XLS-R paper](https://arxiv.org/abs/2111.09296).
|
31 |
+
|
32 |
+
## Usage
|
33 |
+
|
34 |
+
### Demo
|
35 |
+
|
36 |
+
The model can be tested directly on the speech recognition widget on this model card!
|
37 |
+
Simply record some audio in one of the supported spoken languages or pick an example audio file to see how well the checkpoint can translate the input.
|
38 |
+
|
39 |
+
### Example
|
40 |
+
|
41 |
+
As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the
|
42 |
+
translations by passing the speech features to the model.
|
43 |
+
|
44 |
+
You can use the model directly via the ASR pipeline:
|
45 |
+
|
46 |
+
```python
|
47 |
+
from datasets import load_dataset
|
48 |
+
from transformers import pipeline
|
49 |
+
|
50 |
+
# replace following lines to load an audio file of your choice
|
51 |
+
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
|
52 |
+
audio_file = librispeech_en[0]["file"]
|
53 |
+
|
54 |
+
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-1b-21-to-en")
|
55 |
+
|
56 |
+
translation = asr(audio_file)
|
57 |
+
```
|
58 |
+
|
59 |
+
or step-by-step as follows:
|
60 |
+
|
61 |
+
```python
|
62 |
+
import torch
|
63 |
+
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
|
64 |
+
from datasets import load_dataset
|
65 |
+
|
66 |
+
model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
|
67 |
+
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
|
68 |
+
|
69 |
+
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
|
70 |
+
|
71 |
+
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
|
72 |
+
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
|
73 |
+
transcription = processor.batch_decode(generated_ids)
|
74 |
+
```
|
75 |
+
|
76 |
+
## Results `{lang}` -> `en`
|
77 |
+
|
78 |
+
See the row of **XLS-R (1B)** for the performance on [Covost2](https://huggingface.co/datasets/covost2) for this model.
|
79 |
+
|
80 |
+
![results image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/X-%3EEnglish.png)
|
81 |
+
|
82 |
+
## More XLS-R models for `{lang}` -> `en` Speech Translation
|
83 |
+
|
84 |
+
- [Wav2Vec2-XLS-R-300M-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-300m-21-to-en)
|
85 |
+
- [Wav2Vec2-XLS-R-1B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-1b-21-to-en)
|
86 |
+
- [Wav2Vec2-XLS-R-2B-21-EN](https://huggingface.co/facebook/wav2vec2-xls-r-2b-21-to-en)
|
87 |
+
- [Wav2Vec2-XLS-R-2B-22-16](https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16)
|