---
license: cc-by-nc-4.0
language:
- bn
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Automatic Speech Recognition
- Bangla ASR
- Bengali ASR
- bn asr
- Bangla fastconformer
- https://arxiv.org/abs/2311.03196
---

## Summary

__titu_stt_bn_fastconformer__ is a [FastConformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#fast-conformer)-based model trained on ~18K hours of the MegaBNSpeech corpus.

Details are in the paper: [https://aclanthology.org/2023.banglalp-1.16/](https://aclanthology.org/2023.banglalp-1.16/)

## Usage

This model can be used to transcribe Bangla audio, and it can also serve as a pre-trained model for fine-tuning on custom datasets with the [NeMo](https://github.com/NVIDIA/NeMo) framework.

### Installation

To install [NeMo](https://github.com/NVIDIA/NeMo), run the command below; see the NeMo documentation for other installation options.

```
pip install -q 'nemo_toolkit[asr]'
```

### Inference

[Download test_bn_fastconformer.wav](https://huggingface.co/hishab/hishab_bn_fastconformer/blob/main/test_bn_fastconformer.wav)

```py
# pip install -q 'nemo_toolkit[asr]'

import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and restore the model.
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/titu_stt_bn_fastconformer")

# transcribe() takes a list of audio file paths.
audio_file = "test_bn_fastconformer.wav"
transcriptions = asr_model.transcribe([audio_file])
print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধা সরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
```

Colab notebook for inference: [Bangla FastConformer Infer.ipynb](https://colab.research.google.com/drive/1J3bxXlLBgSf1zOKVKbRYu1VrbEJFLlUc?usp=sharing)
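
For the fine-tuning use case mentioned above, NeMo ASR training reads JSON-lines manifests in which each line holds `audio_filepath`, `duration`, and `text`. The following is a minimal sketch of building such a manifest with the Python standard library only. The helper names `wav_duration_seconds` and `write_manifest` are hypothetical (they are not part of NeMo), and the sketch assumes the audio is stored as uncompressed PCM WAV files.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write a JSON-lines manifest (hypothetical helper, not a NeMo API).

    `samples` is an iterable of (wav_path, transcript) pairs. Each output
    line is one JSON object with audio_filepath, duration, and text.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, transcript in samples:
            entry = {
                "audio_filepath": str(Path(wav_path)),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "text": transcript,
            }
            # ensure_ascii=False keeps the Bangla text readable in the file.
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Calling `write_manifest([("clip_0001.wav", "...")], "train_manifest.json")` with your own file paths and transcripts would produce one manifest line per clip; the file names here are placeholders.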

## Training Datasets

| Channel Category | Hours     |
| ---------------- | --------- |
| News             | 17,640.00 |
| Talkshow         | 688.82    |
| Vlog             | 0.02      |
| Crime Show       | 4.08      |
| Total            | 18,332.92 |

## Training Details

The training dataset comprises 17.64k hours of news channel content, 688.82 hours of talk shows, 0.02 hours of vlogs, and 4.08 hours of crime shows.
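
As a quick sanity check on the figures above, the per-category hours can be summed and converted to shares of the total. This is only an illustrative sketch: the variable names are made up for the example, and the numbers are copied from the table in this card.

```python
# Hours per channel category, copied from the Training Datasets table.
hours_by_category = {
    "News": 17640.00,
    "Talkshow": 688.82,
    "Vlog": 0.02,
    "Crime Show": 4.08,
}

# Round to two decimals to match the precision used in the table.
total_hours = round(sum(hours_by_category.values()), 2)

# Fraction of the corpus contributed by each category.
shares = {name: hours / total_hours for name, hours in hours_by_category.items()}

for name, share in shares.items():
    print(f"{name:<12}{hours_by_category[name]:>12,.2f} h {share:>8.2%}")
print(f"{'Total':<12}{total_hours:>12,.2f} h")
```

The sum matches the 18,332.92 hours reported in the table, and news content accounts for roughly 96% of it, so the training data is heavily weighted toward broadcast news.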

## Evaluation

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/WvMlp95z2-GXT6AYfwW8Y.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64df9253cccd823564c3303b/O2RA9TAedIv1OTqgdIap5.png)

## Citation

```
@inproceedings{nandi-etal-2023-pseudo,
    title = "Pseudo-Labeling for Domain-Agnostic {B}angla Automatic Speech Recognition",
    author = "Nandi, Rabindra Nath and
      Menon, Mehadi and
      Muntasir, Tareq and
      Sarker, Sagor and
      Muhtaseem, Quazi Sarwar and
      Islam, Md. Tariqul and
      Chowdhury, Shammur and
      Alam, Firoj",
    editor = "Alam, Firoj and
      Kar, Sudipta and
      Chowdhury, Shammur Absar and
      Sadeque, Farig and
      Amin, Ruhul",
    booktitle = "Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.banglalp-1.16",
    doi = "10.18653/v1/2023.banglalp-1.16",
    pages = "152--162",
    abstract = "One of the major challenges for developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available. https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR",
}
```