
Model Card for WhisperLiveSubs

This model is a fine-tuned version of OpenAI's Whisper model, trained on the Urdu subset of the Common Voice dataset. It is optimized for transcribing Urdu-language audio.

Model Description

This model is a small variant of the Whisper model fine-tuned on the Common Voice dataset for the Urdu language. It is intended for automatic speech recognition (ASR) tasks and performs well in transcribing Urdu speech.

  • Developed by: codewithdark
  • Model type: Whisper-based model for ASR
  • Language(s) (NLP): Urdu (ur)
  • License: Apache 2.0
  • Fine-tuned from model: openai/whisper-small

Uses

Direct Use

This model can be used directly for transcribing Urdu audio into text. It is suitable for applications such as:

  • Voice-to-text transcription services
  • Captioning Urdu language videos
  • Speech analytics in Urdu

Out-of-Scope Use

The model may not perform well for:

  • Non-Urdu languages
  • Extremely noisy environments
  • Very long audio sequences without segmentation

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("codewithdark/WhisperLiveSubs")
model = WhisperForConditionalGeneration.from_pretrained("codewithdark/WhisperLiveSubs")

# `waveform` should be a 1-D float array of 16 kHz mono audio,
# e.g. loaded with librosa.load("audio.wav", sr=16000)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Training Data

The model was fine-tuned on the Mozilla Common Voice dataset, specifically the Urdu subset, which consists of approximately 141 hours of transcribed Urdu speech.

Preprocessing

The audio was resampled to 16 kHz, and the text was tokenized using the Whisper tokenizer configured for Urdu.
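As an illustration of the resampling step, here is a dependency-free sketch using linear interpolation. This is illustrative only: real pipelines use properly filtered resamplers from librosa or torchaudio to avoid aliasing.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono waveform via linear interpolation (illustrative only)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    ratio = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * ratio              # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Upsample one second of 8 kHz audio to the 16 kHz rate Whisper expects.
audio_8k = [0.0] * 8000
audio_16k = resample_linear(audio_8k, 8000, 16000)
```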

Training Hyperparameters

  • Training regime: Mixed precision (fp16)
  • Batch size: 8
  • Gradient accumulation steps: 2
  • Learning rate: 1e-5
  • Max steps: 4000
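These settings map onto the Transformers trainer roughly as follows. This is a sketch, not the exact training script: `output_dir` is a placeholder, and any options not listed above are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ur",   # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size of 16
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,                         # mixed-precision training
)
```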

Metrics

Word Error Rate (WER) was the primary metric used to evaluate the model's performance.
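WER counts word-level substitutions, insertions, and deletions against the reference transcript, divided by the number of reference words. Below is a minimal, dependency-free sketch of the metric; in practice, libraries such as jiwer or Hugging Face's evaluate are used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of three reference words -> WER of 1/3
print(word_error_rate("the cat sat", "the bat sat"))
```

By this definition, a WER around 51% means roughly half of the reference words were transcribed incorrectly.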

Results

  • Training Loss: 0.2005
  • Validation Loss: 0.5342
  • WER: 51.06%

This was my first fine-tuning run for this model; further training and tuning can improve accuracy and reduce the WER.

Environmental Impact
  • Hardware Type: P100 GPU
  • Hours used: 10
  • Cloud Provider: Kaggle
  • Compute Region: PK

Model Architecture and Objective

The WhisperLiveSubs model is based on the Whisper architecture, designed for automatic speech recognition.

Software

  • Framework: PyTorch
  • Transformers Version:

Summary

The model demonstrates acceptable performance for Urdu transcription, but there is room for improvement in terms of WER, especially in noisy conditions or with diverse accents.

Model Card Contact

For inquiries, please contact codewithdark90@gmail.com.

Citation

@Codewithdark. (2024). WhisperLiveSubs: An Urdu Automatic Speech Recognition Model. Retrieved from https://huggingface.co/codewithdark/WhisperLiveSubs