Update README.md

README.md CHANGED
@@ -10,8 +10,11 @@ tags:
 - xlsr-fine-tuning-week
 datasets:
 - common_voice
+- ovm
+- pscr
+- vystadial2016
 model-index:
-- name: Czech comodoro Wav2Vec2 XLSR 300M
+- name: Czech comodoro Wav2Vec2 XLSR 300M 250h data
   results:
   - task:
       name: Automatic Speech Recognition
@@ -23,25 +26,29 @@ model-index:
     metrics:
     - name: Test WER
      type: wer
-      value: 10.
+      value: 10.0
    - name: Test CER
      type: cer
      value: 2.6
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 
-# wav2vec2-xls-r-300m-cs-
+# Czech wav2vec2-xls-r-300m-cs-250
 
-This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice 8.0 dataset.
-
-
--
--
+This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice 8.0 dataset as well as other datasets listed below.
+
+It achieves the following results on the evaluation set:
+- eval_loss: 0.1304
+- eval_wer: 0.1517
+- eval_cer: 0.0326
+- eval_runtime: 358.9895
+- eval_samples_per_second: 20.243
+- eval_steps_per_second: 2.532
+- epoch: 3.13
+- step: 31200
 
 The `eval.py` script results using an LM are:
-WER: 0.
-CER: 0.
+WER: 0.10053685691079459
+CER: 0.025859623842234124
 
 ## Model description
 
@@ -59,8 +66,8 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
 
 test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "cs", split="test[:2%]")
 
-processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-
-model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-
+processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")
+model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")
 
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 
@@ -87,83 +94,35 @@ print("Reference:", test_dataset[:2]["sentence"])
 
 The model can be evaluated using the attached `eval.py` script:
 ```
-python eval.py --model_id comodoro/wav2vec2-xls-r-300m-cs-
+python eval.py --model_id comodoro/wav2vec2-xls-r-300m-cs-250 --dataset mozilla-foundation/common_voice_8_0 --split test --config cs
 ```
 
 ## Training and evaluation data
 
-The Common Voice 8.0 `train` and `validation` datasets were used for training
+The Common Voice 8.0 `train` and `validation` datasets were used for training, as well as the following datasets:
 
-
+- Šmídl, Luboš and Pražák, Aleš, 2013, OVM – Otázky Václava Moravce, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3.
 
-
+- Pražák, Aleš and Šmídl, Luboš, 2012, Czech Parliament Meetings, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4.
 
-
+- Plátek, Ondřej; Dušek, Ondřej and Jurčíček, Filip, 2016, Vystadial 2016 – Czech data, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1740.
 
-
-- train_batch_size: 32
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 20
-- total_train_batch_size: 640
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_steps: 500
-- num_epochs: 150
-- mixed_precision_training: Native AMP
-
-The following hyperparameters were used during second stage of training:
+### Training hyperparameters
 
-
--
+The following hyperparameters were used during training:
+- learning_rate: 1e-05
+- train_batch_size: 16
 - eval_batch_size: 8
 - seed: 42
-- gradient_accumulation_steps: 20
-- total_train_batch_size: 640
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
-- lr_scheduler_warmup_steps:
+- lr_scheduler_warmup_steps: 600
 - num_epochs: 50
 - mixed_precision_training: Native AMP
 
-### Training results
-
-| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
-|:-------------:|:------:|:----:|:---------------:|:------:|:------:|
-| 7.2926 | 8.06 | 250 | 3.8497 | 1.0 | 1.0 |
-| 3.417 | 16.13 | 500 | 3.2852 | 1.0 | 0.9857 |
-| 2.0264 | 24.19 | 750 | 0.7099 | 0.7342 | 0.1768 |
-| 0.4018 | 32.25 | 1000 | 0.6188 | 0.6415 | 0.1551 |
-| 0.2444 | 40.32 | 1250 | 0.6632 | 0.6362 | 0.1600 |
-| 0.1882 | 48.38 | 1500 | 0.6070 | 0.5783 | 0.1388 |
-| 0.153 | 56.44 | 1750 | 0.6425 | 0.5720 | 0.1377 |
-| 0.1214 | 64.51 | 2000 | 0.6363 | 0.5546 | 0.1337 |
-| 0.1011 | 72.57 | 2250 | 0.6310 | 0.5222 | 0.1224 |
-| 0.0879 | 80.63 | 2500 | 0.6353 | 0.5258 | 0.1253 |
-| 0.0782 | 88.7 | 2750 | 0.6078 | 0.4904 | 0.1127 |
-| 0.0709 | 96.76 | 3000 | 0.6465 | 0.4960 | 0.1154 |
-| 0.0661 | 104.82 | 3250 | 0.6622 | 0.4945 | 0.1166 |
-| 0.0616 | 112.89 | 3500 | 0.6440 | 0.4786 | 0.1104 |
-| 0.0579 | 120.95 | 3750 | 0.6815 | 0.4887 | 0.1144 |
-| 0.0549 | 129.03 | 4000 | 0.6603 | 0.4780 | 0.1105 |
-| 0.0527 | 137.09 | 4250 | 0.6652 | 0.4749 | 0.1090 |
-| 0.0506 | 145.16 | 4500 | 0.6958 | 0.4846 | 0.1133 |
-
-Further fine-tuning with slightly different architecture and higher learning rate:
-
-| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
-|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|
-| 0.576 | 8.06 | 250 | 0.2411 | 0.2340 | 0.0502 |
-| 0.2564 | 16.13 | 500 | 0.2305 | 0.2097 | 0.0492 |
-| 0.2018 | 24.19 | 750 | 0.2371 | 0.2059 | 0.0494 |
-| 0.1549 | 32.25 | 1000 | 0.2298 | 0.1844 | 0.0435 |
-| 0.1224 | 40.32 | 1250 | 0.2288 | 0.1725 | 0.0407 |
-| 0.1004 | 48.38 | 1500 | 0.2327 | 0.1608 | 0.0376 |
-
-
 ### Framework versions
 
-- Transformers 4.16.
+- Transformers 4.16.2
 - Pytorch 1.10.1+cu102
-- Datasets 1.
+- Datasets 1.18.3
 - Tokenizers 0.11.0
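For readers who want to try the updated checkpoint end to end, the snippet below assembles the card's usage example into one runnable script. It is a minimal sketch: the diff only shows the changed lines of the card's snippet, so the `speech_file_to_array` helper and the decoding block follow the standard XLSR model-card template rather than being verbatim card content, and running it assumes `torch`, `torchaudio`, `datasets`, and `transformers` are installed and that the gated Common Voice 8.0 terms have been accepted on the Hub.

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# 2% of the Czech Common Voice 8.0 test split, as in the card
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "cs", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")
model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")

# Common Voice clips are 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array(batch):
    # hypothetical helper name; decodes one clip and resamples it
    speech_array, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array)

inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# greedy CTC decoding, no language model
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
```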
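The card quotes `eval.py` results "using an LM", but the diff does not show how the language model is wired in. A hedged sketch of one common setup for this era of the Transformers stack is `Wav2Vec2ProcessorWithLM`, which runs pyctcdecode beam search over the CTC logits with an n-gram model. The sketch assumes the model repository actually ships the alphabet and kenlm files that class expects, and that `pyctcdecode` and `kenlm` are installed; if either assumption fails, consult `eval.py` itself for the decoder it constructs.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "comodoro/wav2vec2-xls-r-300m-cs-250"
# assumption: the repo bundles pyctcdecode-compatible LM files
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# "sample.wav" is a placeholder path for any Czech speech recording
waveform, sr = torchaudio.load("sample.wav")
speech = torchaudio.transforms.Resample(sr, 16_000)(waveform).squeeze().numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# unlike the plain processor, batch_decode here takes raw numpy logits
# (no argmax) and runs beam search with the n-gram LM
print(processor.batch_decode(logits.numpy()).text[0])
```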