metadata
license: cc-by-4.0
tags:
  - audio
  - automatic-speech-recognition
  - hf-asr-leaderboard
language: et
model-index:
  - name: TalTechNLP/whisper-large-et
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 11
          type: mozilla-foundation/common_voice_11_0
          config: et
          split: test
        metrics:
          - name: Test WER
            type: wer
            value: 12.03
          - name: Test CER
            type: cer
            value: 3.18
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 8
          type: mozilla-foundation/common_voice_8_0
          config: et
          split: test
        metrics:
          - name: Test WER
            type: wer
            value: 11.35
          - name: Test CER
            type: cer
            value: 2.75

Whisper-large-et

This is a Whisper large-v2 model (openai/whisper-large-v2) finetuned on around 1200 hours of diverse Estonian data.

Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

How to use

Recommended: use faster-whisper.

For example:

  • Convert the HF model to CT2 format:

    ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16

  • Decode:

    whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
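The converted model can also be decoded from Python through the faster-whisper library rather than the whisper-ctranslate2 command line. The following is a minimal sketch, assuming the faster-whisper package is installed, the converted model is in the whisper-large-et.ct2 directory produced above, and a CUDA GPU is available. The helper names `format_segment` and `transcribe_file` are illustrative and not part of any library.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one decoded segment as '[start -> end] text' (times in seconds)."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_file(audio_path: str, model_dir: str = "whisper-large-et.ct2") -> list:
    """Transcribe an audio file with the converted CTranslate2 model."""
    # Imported lazily so the formatting helper works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU is present. On CPU, device="cpu" with
    # compute_type="int8" is a common alternative.
    model = WhisperModel(model_dir, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(
        audio_path, language="et", task="transcribe", beam_size=5
    )
    # `segments` is a generator; decoding happens while it is consumed.
    return [format_segment(s.start, s.end, s.text) for s in segments]


# Usage (requires the converted model and an audio file):
#     for line in transcribe_file("some_file.mp3"):
#         print(line)
```

The language, task and beam size mirror the command-line flags shown above, so both routes should decode with the same settings.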

Limitations and bias

Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:

  • Speech containing technical and other domain-specific terms
  • Children's speech
  • Non-native speech
  • Speech recorded under very noisy conditions or with a microphone far from the speaker
  • Very spontaneous and overlapping speech

Training data

Acoustic training data:

| Type | Amount (h) |
|---|---|
| Broadcast speech | 991 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| Total | 1161 |

Training procedure

Finetuned using ESPnet and then converted to transformers format using this script. The finetuning procedure is similar to that of this model. Finetuning ran for 3 epochs, with model averaging at the end of training.
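The card does not say which checkpoints were averaged or with what weights. One common form of model averaging is a uniform average of each parameter across several saved checkpoints. The sketch below illustrates only that idea, in pure Python with parameters stored as flat lists of floats; the function name and data layout are hypothetical and are not taken from ESPnet.

```python
def average_checkpoints(checkpoints: list) -> dict:
    """Uniformly average same-named parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name, first in checkpoints[0].items():
        # Average element by element over all checkpoints.
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / n
            for i in range(len(first))
        ]
    return averaged
```

Real toolkits do the same arithmetic on tensors; which checkpoints go into the average (for example the last few or the best few) is a separate choice.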

Update: the 2023-10-03 version of the model was trained on long segments (like the original Whisper model) and is therefore especially well suited for transcribing long speech recordings "end-to-end" (i.e., without any prior segmentation), e.g. with faster-whisper.

Evaluation results

WER

WER results below are obtained using greedy decoding (i.e., beam size 1).

| Dataset | WER (%) |
|---|---|
| Common Voice 8.0 | 11.3 |
| Common Voice 11.0 | 12.0 |
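For reference, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Character error rate is the same ratio over characters. The sketch below computes both for a single utterance, as percentages. It is only an illustration: the text normalization and scoring tool behind the numbers above are not stated here, and corpus-level scores are normally obtained by summing errors and reference lengths over all utterances rather than averaging per-utterance rates.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one non-empty reference."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent for one non-empty reference."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```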