XLS-R-based CTC model with 5-gram language model from Open Subtitles
This model is a version of facebook/wav2vec2-xls-r-2b-22-to-16 fine-tuned mainly on the CGN dataset, as well as on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - NL dataset (see details below), combined with a large 5-gram language model trained on the Open Subtitles Dutch corpus. This model achieves the following results on the evaluation set (of Common Voice 8.0):
- Wer: 0.04057
- Cer: 0.01222
Model description
The model takes 16kHz sound input, and uses a Wav2Vec2ForCTC decoder with 48 letters to output the letter-transcription probabilities per frame.
To improve accuracy, a beam-search decoder based on pyctcdecode is then used; it reranks the most promising alignments based on a 5-gram language model trained on the Open Subtitles Dutch corpus.
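To illustrate how per-frame letter probabilities turn into text, here is a minimal sketch of plain greedy CTC decoding on a made-up 3-symbol vocabulary. The actual model emits 48 letters and is decoded with pyctcdecode's beam search plus the 5-gram language model, not this greedy pass:

```python
# Greedy CTC decoding sketch: pick the most likely symbol per frame,
# merge consecutive repeats, then drop the CTC blank token.

BLANK = "_"  # stand-in for the CTC blank token (illustrative only)

def ctc_greedy_decode(frame_probs, vocab):
    """Collapse per-frame probability rows into a transcription."""
    # 1. Argmax symbol in each frame.
    best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_probs]
    # 2. Merge consecutive duplicates, then remove blanks.
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Frames argmax to: j, j, _, a, a  ->  collapses to "ja"
probs = [
    [0.1, 0.8, 0.1],    # j
    [0.2, 0.7, 0.1],    # j
    [0.9, 0.05, 0.05],  # _
    [0.1, 0.1, 0.8],    # a
    [0.1, 0.2, 0.7],    # a
]
print(ctc_greedy_decode(probs, ["_", "j", "a"]))  # -> ja
```

The blank token is what lets CTC distinguish a genuine double letter (e.g. "aa") from one letter held across several frames; the beam-search decoder applies the same collapsing rules while scoring candidate texts with the language model.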
Intended uses & limitations
This model can be used to transcribe spoken Dutch (from the Netherlands or Flanders) to text (without punctuation).
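Since the model expects 16kHz input, recordings at other sample rates must be resampled first. A minimal stdlib-only sketch of linear-interpolation resampling (a real pipeline would use a proper resampler such as torchaudio or librosa; the function name here is illustrative):

```python
# Linear-interpolation resampling sketch: stretch a list of samples
# from src_rate to dst_rate. Good enough to show the idea; production
# code should use a bandlimited resampler to avoid aliasing artifacts.

def resample_linear(samples, src_rate, dst_rate):
    """Resample `samples` from src_rate Hz to dst_rate Hz."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Upsample a tiny 8 kHz fragment to 16 kHz: twice as many samples.
audio_8k = [0.0, 1.0, 0.0, -1.0]
audio_16k = resample_linear(audio_8k, 8000, 16000)
print(len(audio_16k))  # -> 8
```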
Training and evaluation data
The model was:
- initialized with the 2B parameter model from Facebook.
- trained 5 epochs (6000 iterations of batch size 32) on the cv8/nl dataset.
- trained 1 epoch (36000 iterations of batch size 32) on the cgn dataset.
- trained 5 epochs (6000 iterations of batch size 32) on the cv8/nl dataset.
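The arithmetic implied by this schedule: iterations times batch size gives the number of training examples processed in each stage.

```python
# Examples processed per stage, from the iteration counts and batch
# sizes stated above.
stages = [
    ("cv8/nl, first pass", 6000, 32),
    ("cgn", 36000, 32),
    ("cv8/nl, second pass", 6000, 32),
]
for name, iterations, batch_size in stages:
    print(f"{name}: {iterations * batch_size} examples processed")
# cv8/nl, first pass: 192000 examples processed
# cgn: 1152000 examples processed
# cv8/nl, second pass: 192000 examples processed
```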
Framework versions
- Transformers 4.16.0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0
Evaluation results
- Test WER on Common Voice 8 (self-reported): 4.060
- Test CER on Common Voice 8 (self-reported): 1.220
- Test WER on Robust Speech Event - Dev Data (self-reported): 17.770
- Test CER on Robust Speech Event - Dev Data (self-reported): 9.770
- Test WER on Robust Speech Event - Test Data (self-reported): 16.320