	---
	language:
	- en
	license: mit
	base_model: openai/whisper-small
	tags:
	- generated_from_trainer
	metrics:
	- wer
	model-index:
	- name: whisper-small-singlish-122k
	  results:
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: NSC
	      type: NSC
	    metrics:
	    - name: WER
	      type: WER
	      value: 9.69
	---

	# Whisper-small-singlish-122k

	This model is [openai/whisper-small](https://huggingface.co/openai/whisper-small) fine-tuned on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).

	The following results are reported on the evaluation set (43,788 samples):

	- Loss: 0.171377
	- WER: 9.69

	## Model Details

	### Model Description

	- Developed by: [jensenlwt](https://huggingface.co/jensenlwt)
	- Model type: automatic-speech-recognition
	- License: MIT
	- Finetuned from model: [openai/whisper-small](https://huggingface.co/openai/whisper-small)

	## Uses

	The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).

	Audio used for testing should ideally:

	1. Involve local Singapore slang, dialect, names, and terms.
	2. Feature a Singaporean accent.

	### Direct Use

	To use the model in an application, you can make use of `transformers`:

	```python
	# Use a pipeline as a high-level helper
	from transformers import pipeline

	pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
	```
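As a rough usage sketch, the pipeline object can be called on an audio file path and returns a dictionary whose `text` field holds the transcript. The helper names `load_asr` and `transcribe` and the path `sample.wav` are illustrative assumptions, not part of this model card; decoding a file path also assumes `ffmpeg` is installed.

```python
def load_asr(model_id="jensenlwt/whisper-small-singlish-122k"):
    """Build the ASR pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so that transcribe() can be exercised without transformers.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(asr, audio_path):
    """Run the pipeline on one audio file and return the stripped transcript."""
    result = asr(audio_path)  # ASR pipelines return a dict with a "text" key
    return result["text"].strip()


# Example (hypothetical file name; substitute a short Singlish clip):
#   asr = load_asr()
#   print(transcribe(asr, "sample.wav"))
```

Keeping the pipeline construction separate from the call makes it easy to load the model once and reuse it across many clips.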

	### Out-of-Scope Use

	- Long-form audio
	- Broken Singlish (typically from the older generation)
	- Poor-quality audio (the training samples were recorded in a controlled environment)
	- Conversational speech (the model was not trained on conversations)
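One way to screen out long-form audio before transcription is to check the clip duration. The sketch below reads a WAV header with the standard library; the 30-second threshold is an assumption based on Whisper's 30-second input window (this card does not define "long form"), and the function names are hypothetical.

```python
import wave

MAX_SECONDS = 30.0  # assumption: Whisper processes audio in 30-second windows


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_short_form(path, max_seconds=MAX_SECONDS):
    """True if the clip fits within the assumed short-form limit."""
    return wav_duration_seconds(path) <= max_seconds
```

This only handles uncompressed WAV input; other formats would need a decoder such as `ffmpeg`.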

	## Training Details

	### Training Data

	I used the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.
	Specifically, I used Part 2, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.

	For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.

	### Training Procedure

	Fine-tuning was interrupted occasionally to adjust the batch size and maximise GPU utilisation.
	In addition, based on previous training experience, I ended training early if `eval_loss` did not decrease over two consecutive evaluation steps.

	#### Training Hyperparameters

	The following hyperparameters are used:

	- batch_size: 128
	- gradient_accumulation_steps: 1
	- learning_rate: 1e-5
	- warmup_steps: 500
	- max_steps: 5000
	- fp16: true
	- eval_batch_size: 32
	- eval_step: 500
	- max_grad_norm: 1.0
	- generation_max_length: 225
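For readers who want to reproduce a similar setup, the values above could be mapped onto `Seq2SeqTrainingArguments` from `transformers` roughly as follows. This is a sketch, not the original training script: treating `batch_size` and `eval_batch_size` as per-device values is an assumption, and `build_training_args` and its `output_dir` parameter are hypothetical names.

```python
# Hyperparameters from the list above, keyed by Seq2SeqTrainingArguments names.
HPARAMS = {
    "per_device_train_batch_size": 128,  # card: batch_size (assumed per device)
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 32,  # card: eval_batch_size (assumed per device)
    "eval_steps": 500,  # card: eval_step
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}


def build_training_args(output_dir):
    """Create training arguments from HPARAMS (requires transformers)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    # Step-based evaluation must be enabled separately for eval_steps to apply;
    # the keyword for that setting differs between transformers versions.
    return Seq2SeqTrainingArguments(output_dir=output_dir, **HPARAMS)
```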

	#### Training Results

	| Steps | Epoch    | Train Loss | Eval Loss | WER       |
	|:-----:|:--------:|:----------:|:---------:|:---------:|
	| 500   | 0.654450 | 0.7418     | 0.3889    | 17.968250 |
	| 1000  | 1.308901 | 0.2831     | 0.2519    | 11.880948 |
	| 1500  | 1.963351 | 0.1960     | 0.2038    | 9.948440  |
	| 2000  | 2.617801 | 0.1236     | 0.1872    | 9.420248  |
	| 2500  | 3.272251 | 0.0970     | 0.1791    | 8.539280  |
	| 3000  | 3.926702 | 0.0728     | 0.1714    | 8.207827  |
	| 3500  | 4.581152 | 0.0484     | 0.1741    | 8.145801  |
	| 4000  | 5.235602 | 0.0401     | 0.1773    | 8.138047  |

	The model with the lowest evaluation loss is used as the final checkpoint.
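The early-stopping rule from the training procedure and the checkpoint choice above can be replayed on the evaluation losses in the table. The sketch below is a plain-Python illustration, not the training code itself; it assumes "two evaluation steps" means two consecutive evaluations without a new best loss, and `run_with_patience` is a hypothetical name.

```python
# (step, eval_loss) pairs copied from the training results table.
EVAL_LOG = [
    (500, 0.3889), (1000, 0.2519), (1500, 0.2038), (2000, 0.1872),
    (2500, 0.1791), (3000, 0.1714), (3500, 0.1741), (4000, 0.1773),
]


def run_with_patience(eval_log, patience=2):
    """Return (stop_step, best_step, best_loss) under patience-based stopping."""
    best_step, best_loss = None, float("inf")
    last_step = None
    evals_without_improvement = 0
    for step, loss in eval_log:
        last_step = step
        if loss < best_loss:
            # New best checkpoint: remember it and reset the counter.
            best_step, best_loss = step, loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                break  # stop training at this evaluation
    return last_step, best_step, best_loss


stop_step, best_step, best_loss = run_with_patience(EVAL_LOG)
print(stop_step, best_step, best_loss)  # 4000 3000 0.1714
```

Under these assumptions, training stops at step 4000 and the step-3000 checkpoint is kept, which is consistent with the table ending at step 4000.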

	### Testing Data, Factors & Metrics

	#### Testing Data

	To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which amounts to around 43k samples.

	### Results

	| Model                         | WER   |
	|:-----------------------------:|:-----:|
	| fine-tuned-122k-whisper-small | 9.69% |
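For context, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal single-utterance implementation for illustration; it is not necessarily the evaluation code behind the figure above, it applies no text normalisation, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("to") and one substitution ("rice" -> "rise") over 6 words.
wer = word_error_rate("i want to eat chicken rice", "i want eat chicken rise")
print(f"{wer:.2%}")  # 33.33%
```

A corpus-level WER is usually computed by summing the errors and reference word counts over all utterances rather than by averaging per-utterance rates.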

	#### Summary


	## Technical Specifications

	### Model Architecture and Objective

	### Compute Infrastructure

	[More Information Needed]

	#### Hardware

	## More Information [optional]

	[More Information Needed]

	## Model Card Authors [optional]

	[More Information Needed]

	## Model Card Contact

	[More Information Needed]