metadata

language:
  - ro
license: apache-2.0
tags:
  - automatic-speech-recognition
  - robust-speech-event
datasets:
  - mozilla-foundation/common_voice_8_0
  - gigant/romanian_speech_synthesis_0_8_1
model-index:
  - name: wav2vec2-ro-300m_01
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Robust Speech Event
          type: speech-recognition-community-v2/dev_data
          args: ro
        metrics:
          - name: Dev WER (without LM)
            type: wer
            value: 46.99
          - name: Dev CER (without LM)
            type: cer
            value: 16.04
          - name: Dev WER (with LM)
            type: wer
            value: 38.63
          - name: Dev CER (with LM)
            type: cer
            value: 14.52
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice
          type: mozilla-foundation/common_voice_8_0
          args: ro
        metrics:
          - name: Test WER (without LM)
            type: wer
            value: 11.73
          - name: Test CER (without LM)
            type: cer
            value: 2.93
          - name: Test WER (with LM)
            type: wer
            value: 7.31
          - name: Test CER (with LM)
            type: cer
            value: 2.17

Romanian Wav2Vec2

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Romanian subset of Common Voice 8.0, with extra training data from the Romanian Speech Synthesis dataset.

Without the 5-gram language model, it achieves the following results on the evaluation set (the test split of the Common Voice 8.0 Romanian subset):

Loss: 0.1553
Wer: 0.1174
Cer: 0.0294
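
For reference, WER and CER are edit-distance ratios: the number of word-level (or character-level) substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The values above are fractions, so a Wer of 0.1174 is roughly 11.7%. A minimal sketch in plain Python follows; the helper names and the example sentences are illustrative assumptions, not the evaluation script used for this model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one wrong character in the second of three words.
example_wer = wer("buna ziua tuturor", "buna zina tuturor")
example_cer = cer("buna ziua tuturor", "buna zina tuturor")
```

In this example a single character error costs one full word out of three for WER but only one character out of seventeen for CER, which is why CER is typically much lower than WER.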

Model description

The architecture is based on facebook/wav2vec2-xls-r-300m with a CTC head for speech recognition and an added 5-gram language model trained on the Romanian Corpora Parliament dataset. The language model is applied through pyctcdecode and kenlm, so both libraries must be installed for the language-model-boosted decoder to work.
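
To illustrate what the CTC head's output looks like before any language model is applied, the sketch below performs greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The blank symbol and the frame sequence are made-up assumptions for illustration; the LM-boosted path instead runs a beam search through pyctcdecode, which is not shown here.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; the real tokenizer defines its own


def ctc_greedy_collapse(frame_tokens, blank=BLANK):
    """Merge consecutive duplicate tokens, then remove blanks."""
    merged = [token for token, _ in groupby(frame_tokens)]
    return "".join(token for token in merged if token != blank)


# Per-frame argmax tokens for a short made-up utterance.
frames = ["_", "d", "d", "_", "a", "a", "a", "_", "_", "d", "a"]
decoded = ctc_greedy_collapse(frames)
```

The blank token is what lets CTC emit a genuinely doubled letter: repeats separated by a blank survive the merge step, while repeats in adjacent frames collapse into one.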

Intended uses & limitations

More information needed

Training and evaluation data

Training data:

Common Voice 8.0, Romanian subset: train, validation, and other splits
Romanian Speech Synthesis: train and test splits

Evaluation data:

Common Voice 8.0, Romanian subset: test split

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.003
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 3
total_train_batch_size: 48
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 50.0
mixed_precision_training: Native AMP
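
The total_train_batch_size follows from the per-device batch size and gradient accumulation (16 × 3 = 48, which implies a single device). A linear scheduler with warmup ramps the learning rate from 0 to its peak over the warmup steps, then decays it linearly to 0. A minimal sketch of that schedule follows; the total step count is an assumption (about 32,200, estimated from the step and epoch columns of the training log), and the function is an illustration, not the Trainer's internal code.

```python
PEAK_LR = 0.003          # learning_rate
WARMUP_STEPS = 500       # lr_scheduler_warmup_steps
TOTAL_STEPS = 32_200     # assumption: ~644 optimizer steps per epoch x 50 epochs

PER_DEVICE_BATCH = 16    # train_batch_size
GRAD_ACCUM_STEPS = 3     # gradient_accumulation_steps
NUM_DEVICES = 1          # assumption, implied by 16 * 3 = 48

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_lr(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to `peak`, then linear decay to 0 at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```

Under these assumptions the learning rate reaches 0.003 at step 500 and falls back to 0 at the end of training.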

Training results

Training Loss	Epoch	Step	Validation Loss	Wer	Cer
2.9272	0.78	500	0.7603	0.7734	0.2355
0.6157	1.55	1000	0.4003	0.4866	0.1247
0.4452	2.33	1500	0.2960	0.3689	0.0910
0.3631	3.11	2000	0.2580	0.3205	0.0796
0.3153	3.88	2500	0.2465	0.2977	0.0747
0.2795	4.66	3000	0.2274	0.2789	0.0694
0.2615	5.43	3500	0.2277	0.2685	0.0675
0.2389	6.21	4000	0.2135	0.2518	0.0627
0.2229	6.99	4500	0.2054	0.2449	0.0614
0.2067	7.76	5000	0.2096	0.2378	0.0597
0.1977	8.54	5500	0.2042	0.2387	0.0600
0.1896	9.32	6000	0.2110	0.2383	0.0595
0.1801	10.09	6500	0.1909	0.2165	0.0548
0.174	10.87	7000	0.1883	0.2206	0.0559
0.1685	11.65	7500	0.1848	0.2097	0.0528
0.1591	12.42	8000	0.1851	0.2039	0.0514
0.1537	13.2	8500	0.1881	0.2065	0.0518
0.1504	13.97	9000	0.1840	0.1972	0.0499
0.145	14.75	9500	0.1845	0.2029	0.0517
0.1417	15.53	10000	0.1884	0.2003	0.0507
0.1364	16.3	10500	0.2010	0.2037	0.0517
0.1331	17.08	11000	0.1838	0.1923	0.0483
0.129	17.86	11500	0.1818	0.1922	0.0489
0.1198	18.63	12000	0.1760	0.1861	0.0465
0.1203	19.41	12500	0.1686	0.1839	0.0465
0.1225	20.19	13000	0.1828	0.1920	0.0479
0.1145	20.96	13500	0.1673	0.1784	0.0446
0.1053	21.74	14000	0.1802	0.1810	0.0456
0.1071	22.51	14500	0.1769	0.1775	0.0444
0.1053	23.29	15000	0.1920	0.1783	0.0457
0.1024	24.07	15500	0.1904	0.1775	0.0446
0.0987	24.84	16000	0.1793	0.1762	0.0446
0.0949	25.62	16500	0.1801	0.1766	0.0443
0.0942	26.4	17000	0.1731	0.1659	0.0423
0.0906	27.17	17500	0.1776	0.1698	0.0424
0.0861	27.95	18000	0.1716	0.1600	0.0406
0.0851	28.73	18500	0.1662	0.1630	0.0410
0.0844	29.5	19000	0.1671	0.1572	0.0393
0.0792	30.28	19500	0.1768	0.1599	0.0407
0.0798	31.06	20000	0.1732	0.1558	0.0394
0.0779	31.83	20500	0.1694	0.1544	0.0388
0.0718	32.61	21000	0.1709	0.1578	0.0399
0.0732	33.38	21500	0.1697	0.1523	0.0391
0.0708	34.16	22000	0.1616	0.1474	0.0375
0.0678	34.94	22500	0.1698	0.1474	0.0375
0.0642	35.71	23000	0.1681	0.1459	0.0369
0.0661	36.49	23500	0.1612	0.1411	0.0357
0.0629	37.27	24000	0.1662	0.1414	0.0355
0.0587	38.04	24500	0.1659	0.1408	0.0351
0.0581	38.82	25000	0.1612	0.1382	0.0352
0.0556	39.6	25500	0.1647	0.1376	0.0345
0.0543	40.37	26000	0.1658	0.1335	0.0337
0.052	41.15	26500	0.1716	0.1369	0.0343
0.0513	41.92	27000	0.1600	0.1317	0.0330
0.0491	42.7	27500	0.1671	0.1311	0.0328
0.0463	43.48	28000	0.1613	0.1289	0.0324
0.0468	44.25	28500	0.1599	0.1260	0.0315
0.0435	45.03	29000	0.1556	0.1232	0.0308
0.043	45.81	29500	0.1588	0.1240	0.0309
0.0421	46.58	30000	0.1567	0.1217	0.0308
0.04	47.36	30500	0.1533	0.1198	0.0302
0.0389	48.14	31000	0.1582	0.1185	0.0297
0.0387	48.91	31500	0.1576	0.1187	0.0297
0.0376	49.69	32000	0.1560	0.1182	0.0295
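
The log can also be read programmatically to compare checkpoints. As a small illustration, the sketch below parses the last five rows of the table and picks the step with the lowest validation loss and the step with the lowest WER; the column keys are names chosen here, not part of any library.

```python
# Last five rows of the training log, copied from the table above.
LOG_TAIL = """
0.0421 46.58 30000 0.1567 0.1217 0.0308
0.04   47.36 30500 0.1533 0.1198 0.0302
0.0389 48.14 31000 0.1582 0.1185 0.0297
0.0387 48.91 31500 0.1576 0.1187 0.0297
0.0376 49.69 32000 0.1560 0.1182 0.0295
"""

# Hypothetical column keys, in the same order as the table header.
COLUMNS = ("train_loss", "epoch", "step", "val_loss", "wer", "cer")

rows = [dict(zip(COLUMNS, map(float, line.split())))
        for line in LOG_TAIL.strip().splitlines()]

best_by_loss = min(rows, key=lambda r: r["val_loss"])
best_by_wer = min(rows, key=lambda r: r["wer"])
```

In these rows the two criteria disagree: validation loss is lowest at step 30500, while WER is lowest at step 32000.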

Framework versions

Transformers 4.16.2
PyTorch 1.10.0+cu111
Tokenizers 0.11.0
pyctcdecode 0.3.0
kenlm