---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-large-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 12.03
    - name: Test CER
      type: cer
      value: 3.18
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 11.35
    - name: Test CER
      type: cer
      value: 2.75
---

# Whisper-large-et

This is a [Whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) model finetuned on around 1200 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.

## How to use

Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper). For example:

* Convert the HF model to CT2 format:

  `ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16`

* Decode:

  `whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3`
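
The converted model can also be used from Python through the faster-whisper API. Below is a minimal sketch, assuming the `whisper-large-et.ct2` directory produced by the conversion step above; the audio file name is a placeholder:

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 model converted above (float16 needs a GPU;
# use compute_type="int8" or "float32" for CPU decoding).
model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# "some_file.mp3" is a placeholder for your own recording.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```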

#### Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                   | Amount (h) |
|------------------------|:----------:|
| Broadcast speech       | 991        |
| Spontaneous speech     | 53         |
| Elderly speech corpus  | 53         |
| Talks, lectures        | 49         |
| Parliament speeches    | 31         |
| *Total*                | *1161*     |

## Training procedure

The model was finetuned using ESPnet and then converted to the transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
Finetuning was done for 3 epochs, with model averaging at the end of training.
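
Model averaging means that the weights of several training checkpoints are averaged into a single model. A minimal sketch of the idea in PyTorch; the checkpoint file names and the number of checkpoints are illustrative assumptions, not the exact recipe used here:

```python
import torch

# Illustrative checkpoint files, assumed to contain plain state dicts;
# the actual names and how many were averaged depend on the training run.
checkpoint_paths = ["epoch1.pth", "epoch2.pth", "epoch3.pth"]

avg_state = None
for path in checkpoint_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg_state:
            avg_state[k] += state[k].float()

# Turn the accumulated sums into means.
for k in avg_state:
    avg_state[k] /= len(checkpoint_paths)

torch.save(avg_state, "averaged.pth")
```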

*Update*: the 2023-10-03 version of the model was trained on long segments (like the original Whisper model) and is therefore especially well suited for use with e.g. [faster-whisper](https://github.com/guillaumekln/faster-whisper) to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
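
Since the weights are also available in transformers format, the model can be loaded directly from the Hub with the transformers ASR pipeline. Note that the pipeline transcribes long recordings by chunking, which is a different long-form strategy from faster-whisper's sequential decoding. A minimal sketch, assuming a recent transformers version and a placeholder file name:

```python
from transformers import pipeline

# Loads TalTechNLP/whisper-large-et straight from the Hugging Face Hub.
asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    chunk_length_s=30,  # chunked decoding for long inputs
)

# "long_recording.mp3" is a placeholder file name.
result = asr("long_recording.mp3",
             generate_kwargs={"language": "et", "task": "transcribe"})
print(result["text"])
```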

## Evaluation results

### WER

The WER results below were obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER  |
|-------------------|------|
| Common Voice 8.0  | 11.3 |
| Common Voice 11.0 | 12.0 |
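
For reference, WER and CER numbers like those above can be computed with the `jiwer` package. A minimal sketch, with placeholder strings standing in for the Common Voice reference transcripts and the model's hypotheses:

```python
import jiwer

# Placeholders: in practice these would be the Common Voice test-set
# references and the model's outputs, in matching order.
references = ["tere tulemast Eestisse", "ilm on täna ilus"]
hypotheses = ["tere tulemast eestisse", "ilm on täna ilus"]

print("WER:", jiwer.wer(references, hypotheses))  # word error rate
print("CER:", jiwer.cer(references, hypotheses))  # character error rate
```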
|
|