metadata

license: apache-2.0
language:
  - lv
pipeline_tag: automatic-speech-recognition
base_model:
  - openai/whisper-large-v3

General-purpose Latvian ASR model

This is a fine-tuned whisper-large-v3 model for Latvian, trained by AiLab.lv on two general-purpose speech datasets: the Latvian part of Common Voice 19.0 and version 2.0 of the Latvian broadcast dataset LATE-Media.

This version of the model supersedes the previous whisper-large-v3-lv-late-cv17 model.

We also provide 4-bit, 5-bit and 8-bit quantized versions of the model in the GGML format for use with whisper.cpp, as well as an 8-bit quantized version for use with CTranslate2.
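
As a rough illustration of how the unquantized model could be used, here is a minimal sketch based on the Hugging Face Transformers ASR pipeline. The model identifier is a placeholder, because the Hub ID is not stated in this text. The helper names (latvian_generate_kwargs, transcribe) and the 30-second chunk length are assumptions made for the example, not settings documented for this model.

```python
from typing import Any, Dict

# Placeholder (assumption): replace with the actual Hub ID or a local path.
MODEL_ID = "<hub-id-or-local-path-of-this-model>"


def latvian_generate_kwargs() -> Dict[str, Any]:
    """Decoding options that pin Whisper to Latvian transcription."""
    return {"language": "latvian", "task": "transcribe"}


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the Transformers ASR pipeline."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumed value: split long recordings into 30 s chunks
    )
    result = asr(audio_path, generate_kwargs=latvian_generate_kwargs())
    return result["text"]
```

Calling transcribe("recording.wav") with a real model ID would download the weights on first use. Decoding audio from a file path additionally requires ffmpeg to be installed.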

Training

Fine-tuning was done using the Hugging Face Transformers library with a modified seq2seq script.

Training data	Hours
Latvian Common Voice 19.0 train set (the VW split)	212.6
LATE-Media 2.0 train set	69.8
Total	282.4

Evaluation

Testing data	WER (%)	CER (%)
Latvian Common Voice 19.0 test set (VW) - formatted	4.8	1.6
Latvian Common Voice 19.0 test set (VW) - normalized	3.2	1.0
LATE-Media 1.0 test set - formatted	19.2	7.6
LATE-Media 1.0 test set - normalized	12.8	5.3

The Latvian CV 19.0 test set is available here. The LATE-Media 1.0 test set is available here.
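
To make the distinction between the formatted and normalized scores concrete, the sketch below computes WER and CER as edit distance divided by reference length, with and without text normalization. The normalize function is an assumption (lowercasing and removing punctuation). The exact normalization used for the reported results is not described here, so this is an illustration, not the evaluation script.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent; the reference must contain at least one word."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent; the reference must not be empty."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


ref = "Labdien, kā jums klājas?"
hyp = "labdien kā jums klājas"
print(wer(ref, hyp))                        # formatted: 50.0
print(wer(normalize(ref), normalize(hyp)))  # normalized: 0.0
```

In this example the hypothesis differs from the reference only in casing and punctuation, so every error disappears after normalization. This is why the normalized scores in the table are lower than the formatted ones.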

Citation

Please cite this paper if you use this model in your research:

@inproceedings{dargis-etal-2024-balsutalka-lv,
  author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
  title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
  booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
  publisher = {ELRA and ICCL},
  year = {2024},
  pages = {2080--2085},
  url = {https://aclanthology.org/2024.lrec-main.187}
}

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project LATE (VPP-LETONIKA-2021/1-0006).