Model Improvement
This model card highlights the improvement over the base model: Word Error Rate (WER) falls from 72% to 54%, which reflects the effect of fine-tuning on Hindi speech data.
Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker
This model is a fine-tuned version of theainerd/Wav2Vec2-large-xlsr-hindi, trained on the Common Voice 13 and 17 datasets. It is optimized for Hindi speech recognition and achieves a Word Error Rate (WER) of 54%, compared with the base model's WER of 72% on the same dataset.
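To put the two WER figures in proportion, the short sketch below computes the absolute and relative reduction. The helper name `relative_wer_reduction` is illustrative and not part of any library; the two numbers are the ones reported in this card.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Figures reported in this card: base model 72% WER, fine-tuned model 54% WER.
baseline_wer = 0.72
finetuned_wer = 0.54

absolute_drop = baseline_wer - finetuned_wer
relative_drop = relative_wer_reduction(baseline_wer, finetuned_wer)

print(f"Absolute reduction: {absolute_drop * 100:.0f} percentage points")
print(f"Relative reduction: {relative_drop * 100:.0f}%")
```

In other words, the 18-point drop corresponds to removing a quarter of the base model's word errors.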
Model description
This model is based on Wav2Vec2, originally developed by Facebook AI, which is pretrained with self-supervised learning on large unlabeled speech datasets and then fine-tuned on labeled data. Pretraining lets the model learn acoustic and linguistic features before it sees any transcripts. Fine-tuning on Common Voice Hindi data then helps the model capture the language's nuances and improves transcription quality.
Intended uses & limitations
This model is intended for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.
Usage
The model can be used directly (without a language model) as follows:
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
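The last two steps above, taking the argmax over the logits and decoding the resulting ids, amount to greedy CTC decoding: consecutive repeated ids are collapsed and the blank token is dropped. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id, and the `ctc_greedy_decode` helper are all hypothetical and stand in for the real tokenizer, which handles more details than shown here.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop the CTC blank token."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "न", 2: "म", 3: "स", 4: "्", 5: "त", 6: "े"}

# Per-frame argmax ids, as they might come out of torch.argmax(logits, dim=-1).
frame_ids = [1, 1, 0, 2, 2, 2, 0, 3, 4, 4, 5, 0, 6, 6]

print(ctc_greedy_decode(frame_ids, toy_vocab))
```

A blank between two identical ids keeps both characters, which is how CTC represents genuinely doubled letters.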
Evaluation
The model can be evaluated as follows on the Hindi test data of Common Voice.
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and metrics
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
# The hyphen is escaped so it matches a literal "-" instead of forming
# the character range "!" to ";" (which would also strip digits)
chars_to_ignore_regex = r'[,?.!\-;:"“]'

# Function to preprocess the data
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
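For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal sketch of that definition for a single sentence pair. `word_error_rate` is an illustrative helper, not the implementation behind the metric loaded above, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: one word is dropped and one is misspelled.
reference = "मैं घर जा रहा हूँ"
hypothesis = "मैं घर रहा हू"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}")
```

Here two of the five reference words are wrong, so the sentence-level WER is 40%. A corpus-level score is usually computed from total edits over total reference words rather than by averaging per-sentence values.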
Limitations:
- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.
Training and evaluation data
The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.
Training procedure
Hyperparameters and setup:
The following hyperparameters were used during training:
- Learning rate: 1e-4
- Batch size: 16 (per device)
- Gradient accumulation steps: 2
- Evaluation strategy: steps
- Max steps: 2500
- Mixed precision: FP16
- Save steps: 500
- Evaluation steps: 500
- Logging steps: 500
- Warmup steps: 500
- Save total limit: 1
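The list above can be written down as a plain dictionary, as in the sketch below. The keys follow `transformers.TrainingArguments` field names, which assumes the run used the Hugging Face `Trainer` and that the reported batch size is the training batch size; neither is confirmed by this card. The derived figures are per device, since the number of devices is not stated.

```python
# Values are taken from the hyperparameter list in this card.
# Keys mirror transformers.TrainingArguments field names (an assumption
# about how the run was configured).
hparams = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "eval_strategy": "steps",  # called `evaluation_strategy` in older releases
    "max_steps": 2500,
    "fp16": True,
    "save_steps": 500,
    "eval_steps": 500,
    "logging_steps": 500,
    "warmup_steps": 500,
    "save_total_limit": 1,
}

# Gradient accumulation multiplies the number of samples per optimizer step.
effective_batch = (
    hparams["per_device_train_batch_size"] * hparams["gradient_accumulation_steps"]
)
samples_seen = effective_batch * hparams["max_steps"]
warmup_fraction = hparams["warmup_steps"] / hparams["max_steps"]

print(f"Effective batch size per device: {effective_batch}")
print(f"Samples processed per device over the run: {samples_seen}")
print(f"Share of steps spent in warmup: {warmup_fraction:.0%}")
```

With these settings each optimizer step covers 32 samples per device, and warmup takes up the first fifth of the run.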
Training output
- Global step: 2500
- Training runtime: Approximately 1 hour 21 minutes
- Epochs: 5-6
Training results
| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 500 | 5.603000 | 0.987691 | 0.7556 |
| 1000 | 0.720300 | 0.667561 | 0.6196 |
| 1500 | 0.507000 | 0.592814 | 0.5844 |
| 2000 | 0.431100 | 0.549786 | 0.5439 |
| 2500 | 0.395600 | 0.537703 | 0.5428 |
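One way to read the table is to look at how much WER improves between consecutive checkpoints. The sketch below does that with the values copied from the table; the variable names are illustrative only.

```python
# (step, training_loss, validation_loss, wer), copied from the table above.
results = [
    (500, 5.6030, 0.987691, 0.7556),
    (1000, 0.7203, 0.667561, 0.6196),
    (1500, 0.5070, 0.592814, 0.5844),
    (2000, 0.4311, 0.549786, 0.5439),
    (2500, 0.3956, 0.537703, 0.5428),
]

# Checkpoint with the lowest WER.
best_step, _, _, best_wer = min(results, key=lambda row: row[3])
print(f"Best checkpoint: step {best_step} with WER {best_wer:.4f}")

# WER improvement between consecutive evaluations, in percentage points.
gains = []
for previous, current in zip(results, results[1:]):
    gain = previous[3] - current[3]
    gains.append((current[0], gain))
    print(f"step {previous[0]} -> {current[0]}: {gain * 100:.2f} points lower WER")
```

The gain over the final 500 steps is much smaller than in earlier intervals, which suggests that WER had largely plateaued by step 2500.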
Framework versions
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1