---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-large-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 12.03
    - name: Test CER
      type: cer
      value: 3.18
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 11.35
    - name: Test CER
      type: cer
      value: 2.75
---

# Whisper-large-et

This is a [Whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) model finetuned on around 1200 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.

## How to use

Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper). For example:

* Convert the HF model to CT2 format:

  `ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16`

* Decode:

  `whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3`
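
The converted model can also be used from Python through the faster-whisper API. Below is a minimal sketch, assuming the `whisper-large-et.ct2` directory produced by the conversion step above; the audio file name is a placeholder:

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 model converted above (float16 needs a GPU;
# use compute_type="int8" or "float32" for CPU decoding).
model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# "some_file.mp3" is a placeholder for your own recording.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```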

#### Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                   | Amount (h) |
|------------------------|:----------:|
| Broadcast speech       | 991        |
| Spontaneous speech     | 53         |
| Elderly speech corpus  | 53         |
| Talks, lectures        | 49         |
| Parliament speeches    | 31         |
| *Total*                | *1161*     |

## Training procedure

The model was finetuned using ESPnet and then converted to the transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
The finetuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
Finetuning was done for 3 epochs, with model averaging at the end of training.
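
Model averaging means that the weights of several training checkpoints are averaged into a single model. A minimal sketch of the idea in PyTorch; the checkpoint file names and the number of checkpoints are illustrative assumptions, not the exact recipe used here:

```python
import torch

# Illustrative checkpoint files, assumed to contain plain state dicts;
# the actual names and how many were averaged depend on the training run.
checkpoint_paths = ["epoch1.pth", "epoch2.pth", "epoch3.pth"]

avg_state = None
for path in checkpoint_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg_state:
            avg_state[k] += state[k].float()

# Turn the accumulated sums into means.
for k in avg_state:
    avg_state[k] /= len(checkpoint_paths)

torch.save(avg_state, "averaged.pth")
```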

*Update*: the 2023-10-03 version of the model was trained on long segments (like the original Whisper model) and is therefore especially well suited for use with e.g. [faster-whisper](https://github.com/guillaumekln/faster-whisper) to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
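
Since the weights are also available in transformers format, the model can be loaded directly from the Hub with the transformers ASR pipeline. Note that the pipeline transcribes long recordings by chunking, which is a different long-form strategy from faster-whisper's sequential decoding. A minimal sketch, assuming a recent transformers version and a placeholder file name:

```python
from transformers import pipeline

# Loads TalTechNLP/whisper-large-et straight from the Hugging Face Hub.
asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    chunk_length_s=30,  # chunked decoding for long inputs
)

# "long_recording.mp3" is a placeholder file name.
result = asr("long_recording.mp3",
             generate_kwargs={"language": "et", "task": "transcribe"})
print(result["text"])
```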

## Evaluation results

### WER

The WER results below were obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER  |
|-------------------|------|
| Common Voice 8.0  | 11.3 |
| Common Voice 11.0 | 12.0 |
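
For reference, WER and CER numbers like those above can be computed with the `jiwer` package. A minimal sketch, with placeholder strings standing in for the Common Voice reference transcripts and the model's hypotheses:

```python
import jiwer

# Placeholders: in practice these would be the Common Voice test-set
# references and the model's outputs, in matching order.
references = ["tere tulemast Eestisse", "ilm on täna ilus"]
hypotheses = ["tere tulemast eestisse", "ilm on täna ilus"]

print("WER:", jiwer.wer(references, hypotheses))  # word error rate
print("CER:", jiwer.cer(references, hypotheses))  # character error rate
```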
|
|