File size: 5,673 Bytes
486046c
 
de900e6
486046c
 
5148621
 
f077bc0
 
 
 
 
cce9b62
 
 
 
 
 
 
486046c
 
 
08f0cc9
486046c
cce9b62
486046c
 
 
 
 
cce9b62
08f0cc9
cce9b62
 
08f0cc9
cce9b62
 
08f0cc9
cce9b62
 
08f0cc9
cce9b62
08f0cc9
4e0e3d3
cce9b62
4e0e3d3
 
 
 
 
cce9b62
08f0cc9
cce9b62
 
08f0cc9
cce9b62
 
08f0cc9
cce9b62
 
08f0cc9
cce9b62
486046c
 
dce7bbc
486046c
d966e9f
486046c
 
8dc7928
 
486046c
 
 
 
5d135e8
 
7897043
5d135e8
 
7897043
5d135e8
7897043
5d135e8
7897043
 
5d135e8
 
 
486046c
 
 
 
 
 
 
 
 
e66a3e4
486046c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e66a3e4
 
 
 
 
486046c
 
 
4e0e3d3
486046c
4e0e3d3
 
486046c
 
4e0e3d3
9a2efcb
4e0e3d3
 
30221ee
 
 
 
 
 
dce7bbc
 
30221ee
 
 
 
96d7e9b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
language: es
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- es
- hf-asr-leaderboard
- mozilla-foundation/common_voice_6_0
- robust-speech-event
- speech
- xlsr-fine-tuning-week
datasets:
- common_voice
- mozilla-foundation/common_voice_6_0
metrics:
- wer
- cer
base_model: facebook/wav2vec2-large-xlsr-53
model-index:
- name: XLSR Wav2Vec2 Spanish by Jonatas Grosman
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice es
      type: common_voice
      args: es
    metrics:
    - type: wer
      value: 8.82
      name: Test WER
    - type: cer
      value: 2.58
      name: Test CER
    - type: wer
      value: 6.27
      name: Test WER (+LM)
    - type: cer
      value: 2.06
      name: Test CER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: es
    metrics:
    - type: wer
      value: 30.19
      name: Dev WER
    - type: cer
      value: 13.56
      name: Dev CER
    - type: wer
      value: 24.71
      name: Dev WER (+LM)
    - type: cer
      value: 12.61
      name: Dev CER (+LM)
---

# Fine-tuned XLSR-53 large model for speech recognition in Spanish

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Spanish using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16kHz.

This model has been fine-tuned thanks to the GPU credits generously given by the [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows...

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-spanish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "es"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference  | Prediction |
| ------------- | ------------- |
| HABITA EN AGUAS POCO PROFUNDAS Y ROCOSAS. | HABITAN AGUAS POCO PROFUNDAS Y ROCOSAS |
| OPERA PRINCIPALMENTE VUELOS DE CABOTAJE Y REGIONALES DE CARGA. | OPERA PRINCIPALMENTE VUELO DE CARBOTAJES Y REGIONALES DE CARGAN |
| PARA VISITAR CONTACTAR PRIMERO CON LA DIRECCIÓN. | PARA VISITAR CONTACTAR PRIMERO CON LA DIRECCIÓN |
| TRES | TRES |
| REALIZÓ LOS ESTUDIOS PRIMARIOS EN FRANCIA, PARA CONTINUAR LUEGO EN ESPAÑA. | REALIZÓ LOS ESTUDIOS PRIMARIOS EN FRANCIA PARA CONTINUAR LUEGO EN ESPAÑA |
| EN LOS AÑOS QUE SIGUIERON, ESTE TRABAJO ESPARTA PRODUJO DOCENAS DE BUENOS JUGADORES. | EN LOS AÑOS QUE SIGUIERON ESTE TRABAJO ESPARTA PRODUJO DOCENA DE BUENOS JUGADORES |
| SE ESTÁ TRATANDO DE RECUPERAR SU CULTIVO EN LAS ISLAS CANARIAS. | SE ESTÓ TRATANDO DE RECUPERAR SU CULTIVO EN LAS ISLAS CANARIAS |
| SÍ | SÍ |
| "FUE ""SACADA"" DE LA SERIE EN EL EPISODIO ""LEAD"", EN QUE ALEXANDRA CABOT REGRESÓ." | FUE SACADA DE LA SERIE EN EL EPISODIO LEED EN QUE ALEXANDRA KAOT REGRESÓ |
| SE UBICAN ESPECÍFICAMENTE EN EL VALLE DE MOKA, EN LA PROVINCIA DE BIOKO SUR. | SE UBICAN ESPECÍFICAMENTE EN EL VALLE DE MOCA EN LA PROVINCIA DE PÍOCOSUR |

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-spanish --dataset mozilla-foundation/common_voice_6_0 --config es --split test
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-spanish --dataset speech-recognition-community-v2/dev_data --config es --split validation --chunk_length_s 5.0 --stride_length_s 1.0
```

## Citation
If you want to cite this model you can use this:

```bibtex
@misc{grosman2021xlsr53-large-spanish,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {S}panish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish}},
  year={2021}
}
```