language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
Whisper-small-singlish-122k
This model is openai/whisper-small fine-tuned on a subset (122k samples) of the National Speech Corpus.
It achieves the following results on the evaluation set (43,788 samples):
- Loss: 0.171377
- WER: 9.69
Model Details
Model Description
- Developed by: jensenlwt
- Model type: automatic-speech-recognition
- License: MIT
- Finetuned from model: openai/whisper-small
Uses
The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).
It is best tested on audio that:
- Involves local Singapore slang, dialect, names, and terms.
- Features a Singaporean accent.
Direct Use
To use the model in an application, you can use the transformers library:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
Out-of-Scope Use
- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (the training samples were recorded in a controlled environment)
- Conversational speech (the model was not trained on conversations)
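Because long-form audio is out of scope, it can help to check a clip's duration before transcribing it. The sketch below is an illustration rather than part of the original workflow: it uses Python's standard `wave` module, the helper names `wav_duration_seconds` and `is_short_form` are hypothetical, and the 30-second limit reflects the input window of the base Whisper architecture, not a threshold stated for this model.

```python
import wave

# Whisper models process audio in 30-second windows; treating this as the
# short-form cut-off is an assumption, not a limit documented for this model.
MAX_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_short_form(path, max_seconds=MAX_SECONDS):
    """Return True if the clip fits within a single assumed window."""
    return wav_duration_seconds(path) <= max_seconds
```

A clip that fails this check would need to be trimmed or split before being passed to the pipeline.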
Training Details
Training Data
I used the National Speech Corpus for training, specifically Part 2, which consists of prompted read-speech recordings that involve local named entities, slang, and dialect.
For training, I used the first 300 transcripts in the corpus, which is around 122k samples from ~161 speakers.
Training Procedure
Fine-tuning was occasionally interrupted to adjust the batch size and maximise GPU utilisation. Based on previous training experience, I also ended training early if eval_loss did not decrease for two consecutive evaluations.
Training Hyperparameters
The following hyperparameters are used:
- batch_size: 128
- gradient_accumulation_steps: 1
- learning_rate: 1e-5
- warmup_steps: 500
- max_steps: 5000
- fp16: true
- eval_batch_size: 32
- eval_step: 500
- max_grad_norm: 1.0
- generation_max_length: 225
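As a rough illustration of how these values fit together, the sketch below stores them in a plain dictionary and derives two quantities from them. The key names are borrowed from the `Seq2SeqTrainingArguments` class in `transformers` as an assumed mapping (the training script is not part of this card), `batch_size` is assumed to be the per-device value, and the linear warmup followed by linear decay is the `Trainer` default rather than a schedule confirmed here.

```python
# Hyperparameters from the list above, keyed by assumed
# Seq2SeqTrainingArguments names (the mapping is not confirmed by the card).
training_config = {
    "per_device_train_batch_size": 128,  # assumed to be per-device
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 32,
    "eval_steps": 500,
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}


def effective_batch_size(config, num_devices=1):
    """Samples consumed per optimizer step; num_devices is an assumption."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def learning_rate_at(step, config):
    """Linear warmup, then linear decay to zero (assumed schedule)."""
    peak = config["learning_rate"]
    warmup = config["warmup_steps"]
    total = config["max_steps"]
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```

Under these assumptions, each optimizer step on a single device consumes 128 samples, and the learning rate reaches its peak of 1e-5 at step 500.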
Training Results
Steps | Epoch | Train Loss | Eval Loss | WER (%) |
---|---|---|---|---|
500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 |
1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 |
1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 |
2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 |
2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 |
3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 |
3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
The checkpoint with the lowest evaluation loss (step 3000, eval loss 0.1714) is used as the final model.
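The stopping rule and checkpoint choice can be sketched in a few lines. This is an illustrative reconstruction, not the training code: the function name `run_early_stopping` is hypothetical, and it interprets "does not decrease in two evaluation steps" as no new best eval loss in two consecutive evaluations. The eval losses are copied from the table above.

```python
# (step, eval_loss) pairs copied from the Training Results table.
EVAL_HISTORY = [
    (500, 0.3889), (1000, 0.2519), (1500, 0.2038), (2000, 0.1872),
    (2500, 0.1791), (3000, 0.1714), (3500, 0.1741), (4000, 0.1773),
]


def run_early_stopping(history, patience=2):
    """Return (best_step, stop_step) under a patience-based stopping rule.

    stop_step is None if the patience limit is never reached.
    """
    best_step, best_loss = None, float("inf")
    evals_without_improvement = 0
    for step, loss in history:
        if loss < best_loss:
            best_step, best_loss = step, loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_step, step
    return best_step, None


best_step, stop_step = run_early_stopping(EVAL_HISTORY)
print(best_step, stop_step)  # prints: 3000 4000
```

Under this interpretation, training stops at step 4000 and the step-3000 checkpoint is kept, which is consistent with the table ending at step 4000 and the reported loss of 0.171377.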
Testing Data, Factors & Metrics
Testing Data
To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which is around 43k samples.
Results
Model | WER |
---|---|
fine-tuned-122k-whisper-small | 9.69% |
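For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal standard-library implementation for illustration only. It is not the evaluation script behind the number above, the function name `word_error_rate` is hypothetical, and it applies only lowercasing and whitespace splitting, whereas the text normalisation actually used is not stated.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a percentage using word-level Levenshtein distance."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return 100.0 * previous_row[-1] / len(ref_words)
```

For example, one substituted word in a four-word reference gives a WER of 25%.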