---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-large-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 12.03
    - name: Test CER
      type: cer
      value: 3.18
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 11.35
    - name: Test CER
      type: cer
      value: 2.75
---
# Whisper-large-et
This is the [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) model finetuned on around 1200 hours of diverse Estonian data.
## Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
## Intended uses & limitations
This model is intended for general-purpose speech recognition of material such as broadcast conversations, interviews, and talks.
## How to use
Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper).
For example:
* Convert the HF model to CT2 format:
`ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16`
* Decode with the `whisper-ctranslate2` command-line tool (a separate package built on faster-whisper):
`whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3`
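
The same decoding step can also be run from Python. The following is a minimal sketch, assuming the `faster-whisper` package is installed and the model has already been converted to `whisper-large-et.ct2` as shown above. The helper names `format_timestamp` and `transcribe`, and the `device`/`compute_type` values, are illustrative choices and are not part of this model card.

```python
from typing import Iterator, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def transcribe(
    audio_path: str, model_dir: str = "whisper-large-et.ct2"
) -> Iterator[Tuple[float, float, str]]:
    """Yield (start, end, text) for each segment of an Estonian recording."""
    # Imported lazily so that format_timestamp works without faster-whisper.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU with float16 support is available.
    # On CPU, device="cpu" with compute_type="int8" is a common alternative.
    model = WhisperModel(model_dir, device="cuda", compute_type="float16")

    # Same settings as the command-line example: Estonian, transcribe, beam size 5.
    segments, _info = model.transcribe(
        audio_path, language="et", task="transcribe", beam_size=5
    )
    for segment in segments:
        yield segment.start, segment.end, segment.text.strip()


# Usage (needs the converted model directory and an audio file):
# for start, end, text in transcribe("some_file.mp3"):
#     print(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
```

`transcribe` is a generator because faster-whisper decodes segments lazily, so results for a long recording can be printed as they become available.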
#### Limitations and bias
Since this model was trained mostly on broadcast speech and texts from the web, it may have problems correctly decoding the following:
* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech
## Training data
Acoustic training data:
| Type | Amount (h) |
|-----------------------|:------:|
| Broadcast speech | 991 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| *Total* | *1161* |
## Training procedure
Finetuned using ESPnet, and then converted to transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
Finetuning was done for 3 epochs, with model averaging at the end of training.
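
As a rough illustration of what model averaging means, the sketch below takes the parameter-wise mean of several checkpoints. It is a toy version using plain Python lists and does not reproduce the ESPnet implementation. Which checkpoints were averaged is not stated here, and the name `average_checkpoints` and the example values are hypothetical.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise mean of checkpoints with identical keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Params = {}
    for key in checkpoints[0]:
        # Group the i-th weight of this parameter across all checkpoints.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        averaged[key] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Hypothetical weights from two checkpoints of the same model.
ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
```

In a real setup the values would be tensors loaded from saved checkpoints, but the averaging itself is the same per-parameter mean.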
*Update*: The 2023-10-03 version of the model was trained on long segments (like the original Whisper model) and
is therefore especially well suited for transcribing long speech recordings "end-to-end" (i.e., without any prior segmentation),
e.g. with [faster-whisper](https://github.com/guillaumekln/faster-whisper).
## Evaluation results
### WER
WER results below are obtained using greedy decoding (i.e., beam size 1).
|Dataset | WER |
|---|---|
| Common Voice 8.0 | 11.3 |
| Common Voice 11.0 | 12.0 |
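
For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below computes a corpus-level WER in percent. It assumes plain whitespace tokenization, and the text normalization used for the figures above is not specified here, so this is not expected to reproduce them exactly. The function names and the example sentences are illustrative.

```python
from typing import List, Sequence


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level word error rate in percent."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


# Illustrative example: one deleted word out of three reference words.
score = wer(["tere hommikust maailm"], ["tere hommikust"])
```

With faster-whisper, the greedy decoding used for the table corresponds to `beam_size=1`.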