---
license: mit
language:
- bn
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
---
## Model
A **Conformer-CTC** model trained on the *OOD-Speech dataset* to transcribe Bangla speech. This is the large variant of the architecture, with ~121M parameters. To learn more about the model architecture, see the NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

## Dataset
The training split contains `1100+ hours` of audio data crowdsourced from native Bangla speakers. We trained on this split for `164 epochs`; the model was then evaluated on `23+ hours` of audio spanning 17 diverse domains.

## Usage
The model can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset through the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo). We recommend installing the toolkit after installing the PyTorch package.
```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg sox
pip install Cython
pip install nemo_toolkit['all']  # tested with NeMo 1.19.0
pip uninstall -y torchmetrics
pip install torchmetrics==0.9.2
```
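Optionally, you can sanity-check that the pinned versions from the commands above were picked up:
```python
import nemo
import torchmetrics

print(nemo.__version__)          # expected: 1.19.0
print(torchmetrics.__version__)  # expected: 0.9.2
```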
After installing the required dependencies, download the `.nemo` checkpoint to your local directory, or instantiate the pretrained model directly as follows:
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("bengaliAI/BanglaConformer")
```
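If you downloaded the `.nemo` checkpoint instead, it can be restored from a local path; a minimal sketch, with the file path as a placeholder:
```python
import nemo.collections.asr as nemo_asr

# Restore the model from a locally downloaded .nemo checkpoint file.
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("<PATH TO .nemo FILE>")
```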
## Data Preprocessing
Prior to feeding input audio to the pretrained model for training or inference, it must be resampled to **16 kHz**. This can be done with the `sox` library:
```python
import os
from sox import Transformer

if not os.path.exists("<RESAMPLED AUDIO FILE PATH>"):
    tfm = Transformer()
    tfm.rate(samplerate=16000)   # resample to 16 kHz
    tfm.channels(n_channels=1)   # downmix to mono
    tfm.build(input_filepath="<AUDIO FILE PATH>", output_filepath="<RESAMPLED AUDIO FILE PATH>")
```
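For example, to resample every `.wav` file in a directory (a sketch; the directory names are placeholders):
```python
import os
from sox import Transformer

in_dir, out_dir = "<RAW AUDIO DIR>", "<RESAMPLED AUDIO DIR>"
os.makedirs(out_dir, exist_ok=True)
for name in os.listdir(in_dir):
    if name.endswith(".wav") and not os.path.exists(os.path.join(out_dir, name)):
        tfm = Transformer()
        tfm.rate(samplerate=16000)
        tfm.channels(n_channels=1)
        tfm.build(input_filepath=os.path.join(in_dir, name),
                  output_filepath=os.path.join(out_dir, name))
```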
## Training 
We followed the official [NeMo documentation on training an ASR model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/examples/kinyarwanda_asr.html)
to prepare our transcript manifest (see the sketch below) and train the model. However, we did not train a custom tokenizer; instead, we downloaded the tokenizer
from [banglaBERT-large](https://huggingface.co/csebuetnlp/banglabert_large/) for better vocabulary coverage.
For validation, we used `29,589` samples held out from the training data and processed the same way.
**The final validation score was `22.4% WER` at epoch `164`.**
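For reference, each line of a NeMo training manifest is a standalone JSON object; a minimal sketch of writing one entry (the path, duration, and transcript are illustrative):
```python
import json

entry = {
    "audio_filepath": "/data/train/utt_0001.wav",  # resampled 16 kHz mono audio
    "duration": 3.42,                              # clip length in seconds
    "text": "<BANGLA TRANSCRIPT>",                 # ground-truth transcript
}
with open("train_manifest.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```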

Training script : [training.sh](training.sh)
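When fine-tuning from this checkpoint with the same wordpiece tokenizer, something like the following may be used (a sketch; the tokenizer directory is a placeholder):
```python
# Point the model at a local directory containing the downloaded
# banglabert_large tokenizer files before fine-tuning.
asr_model.change_vocabulary(
    new_tokenizer_dir="<TOKENIZER DIRECTORY>",
    new_tokenizer_type="wordpiece",
)
```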
## Evaluation 
`14,016` test samples were used to evaluate the model. The generated output file contains both the ground-truth and predicted strings, from which the Word Error Rate (WER) and Character Error Rate (CER) are computed.

Evaluation script: [evaluation.sh](evaluation.sh)

**Test dataset WER/CER: 69.25% / 42.13%**
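As an illustration, WER and CER between ground-truth and predicted strings can be computed with the pinned `torchmetrics` version (the strings below are placeholders):
```python
from torchmetrics import CharErrorRate, WordErrorRate

preds = ["<PREDICTED STRING>"]
refs = ["<GROUND TRUTH STRING>"]
print("WER:", WordErrorRate()(preds, refs).item())
print("CER:", CharErrorRate()(preds, refs).item())
```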
## Inference
### Inference using the library 
```python
# `paths2audio_files` expects a list of paths to 16 kHz resampled audio files.
asr_model.transcribe(paths2audio_files=["<RESAMPLED AUDIO FILE PATH>"])
```
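For example, to transcribe a couple of files and print the results (hypothetical file names; the audio must already be 16 kHz mono):
```python
files = ["<RESAMPLED AUDIO 1>.wav", "<RESAMPLED AUDIO 2>.wav"]
transcripts = asr_model.transcribe(paths2audio_files=files, batch_size=2)
for path, text in zip(files, transcripts):
    print(path, "->", text)
```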
### Infer directly using the repo
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 model_path="<PRETRAINED MODEL PATH>" \
 audio_dir="<DIRECTORY CONTAINING RESAMPLED AUDIO FILES>" \
 output_filename="<PATH TO OUTPUT JSON>" \
 batch_size=1
```
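The output file is itself a JSON-lines manifest; a minimal sketch for reading the predictions back, assuming the default `pred_text` field:
```python
import json

with open("<PATH TO OUTPUT JSON>", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["audio_filepath"], "->", entry["pred_text"])
```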

## Domain-wise Performance
The model has also been evaluated separately across the different domain distributions of the test data.
For more details, please refer to the paper: [OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking](https://arxiv.org/pdf/2305.09688.pdf)
![Domain-wise WER/CER on the OOD-Speech test set](https://i.imgur.com/rzWzEVj.png)

If you use this model in your work, please cite the following paper:
```bibtex
@misc{rakib2023oodspeech,
      title={OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking}, 
      author={Fazle Rabbi Rakib and Souhardya Saha Dip and Samiul Alam and Nazia Tasnim and Md. Istiak Hossain Shihab and Md. Nazmuddoha Ansary and Syed Mobassir Hossen and Marsia Haque Meghla and Mamunur Mamun and Farig Sadeque and Sayma Sultana Chowdhury and Tahsin Reasat and Asif Sushmit and Ahmed Imtiaz Humayun},
      year={2023},
      eprint={2305.09688},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```