metadata

license: mit
tags:
  - audio
  - automatic-speech-recognition
widget:
  - example_title: sample 1
    src: >-
      https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
  - example_title: sample 2
    src: >-
      https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
  - example_title: sample 3
    src: >-
      https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition

This is a Bangla ASR model, obtained by fine-tuning the Whisper small variant (244 M parameters) on the Bangla Mozilla Common Voice dataset. Training used 40k training samples and 7k validation samples, around 400 hours of data. The model was trained for 12,000 steps and reaches a word error rate of 4.58%.


import librosa
import numpy as np
import torch
import torchaudio

from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration
from transformers import WhisperProcessor

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loading straight from a URL depends on the torchaudio backend in use;
# if it fails, download the file and point mp3_path at the local copy.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"

model_path = "bangla-speech-processing/BanglaASR"

# Load the feature extractor, the processor (used for decoding) and the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample it to 16 kHz.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform into the input features the model consumes.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids for the single input, then decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
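
The snippet above feeds one waveform to the feature extractor, which by default pads or truncates its input to Whisper's 30-second window, so a longer recording would only be partly transcribed. The following is a minimal sketch of one way to split a 16 kHz waveform into consecutive windows before transcribing each one. The helper name `split_into_windows` and its parameters are hypothetical, not part of this repository or of `transformers`, and the hard cuts can split a word at a window boundary.

```python
from typing import Iterator, Sequence

WHISPER_SAMPLE_RATE = 16_000  # Hz, the rate the usage snippet resamples to
WHISPER_WINDOW_SECONDS = 30   # length of Whisper's fixed input window


def split_into_windows(
    samples: Sequence[float],
    sample_rate: int = WHISPER_SAMPLE_RATE,
    window_seconds: int = WHISPER_WINDOW_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping windows of at most `window_seconds`.

    Hypothetical helper: works on any sequence that supports len() and
    slicing, such as a list or a 1-D NumPy array.
    """
    if sample_rate <= 0 or window_seconds <= 0:
        raise ValueError("sample_rate and window_seconds must be positive")
    window = sample_rate * window_seconds
    for start in range(0, len(samples), window):
        yield samples[start:start + window]


if __name__ == "__main__":
    # 70 seconds of silence stands in for a real recording.
    fake_audio = [0.0] * (70 * WHISPER_SAMPLE_RATE)
    lengths = [len(chunk) / WHISPER_SAMPLE_RATE
               for chunk in split_into_windows(fake_audio)]
    print(lengths)  # [30.0, 30.0, 10.0]
```

Each yielded window can be passed to `feature_extractor` and `model.generate` exactly as the single waveform is in the usage snippet, and the decoded pieces joined with spaces.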

Dataset

The model was trained on the Mozilla Common Voice dataset: around 400 hours of data in mp3 format, split into training (40k samples) and validation (7k samples). For more information, see the Mozilla Common Voice dataset documentation.

Training Model Information

Size	Layers	Width	Heads	Parameters	Bangla-only	Training Status
tiny	4	384	6	39 M	X	X
base	6	512	8	74 M	X	X
small	12	768	12	244 M	✓	✓
medium	24	1024	16	769 M	X	X
large	32	1280	20	1550 M	X	X

Evaluation

Word error rate: 4.58%
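
For readers unfamiliar with the metric, here is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function `word_error_rate` is an illustrative helper, not the evaluation script used for this model. It splits on whitespace only and applies no text normalization, so it should not be expected to reproduce the reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    wer = word_error_rate("আমি ভাত খাই", "আমি ভাত খাব")
    print(f"{wer:.2%}")  # 33.33%
```

Over a whole test set, the usual convention is to sum the edit distances and the reference lengths across all utterances before dividing, rather than averaging per-utterance rates.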

For more details, please check the GitHub repository.

@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}