---
license: mit
language:
- bn
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
---
#### Model
**Conformer-CTC** model trained on the *OOD-Speech dataset* to transcribe speech from Bangla audio. This is the large variant of the model, with ~121M parameters. To learn more about the model architecture, see the NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

#### Dataset
The training split contains `1100+ hours` of audio data crowdsourced from native Bangla speakers. We trained on this split for `164 epochs`; the model was then evaluated on `23+ hours` of audio across 17 diverse domains, achieving a validation score of `22.4% WER`.

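For reference, the word error rate (WER) reported above is the word-level edit distance between a hypothesis transcript and the reference, divided by the number of reference words. A minimal sketch of the computation (not the evaluation code used for this model):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

The character error rate (CER) is the same computation over characters instead of words.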
#### Usage
The model can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset through the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo). We recommend installing the toolkit after installing the PyTorch package.
```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg sox
pip install Cython
pip install nemo_toolkit['all']
pip uninstall -y torchmetrics
pip install torchmetrics==0.9.2
```
After installing the required dependencies, download the .nemo file or the pretrained model to your local directory. You can instantiate the pretrained model as follows:
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<MODEL PATH>")
```
##### Data Preprocessing
Before feeding input audio to the pretrained model for inference, we need to resample it to **16 kHz**. We can do that with the `sox` library:
```python
import os

from sox import Transformer

if not os.path.exists("<RESAMPLED AUDIO FILE PATH>"):
    tfm = Transformer()
    tfm.rate(samplerate=16000)
    tfm.channels(n_channels=1)
    tfm.build(input_filepath="<AUDIO FILE PATH>", output_filepath="<RESAMPLED AUDIO FILE PATH>")
```
##### Inference using the library
```python
asr_model.transcribe(paths2audio_files="<LIST OF RESAMPLED AUDIO FILES>")
```
##### Infer an entire batch of audio files
```bash
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 model_path="<PRETRAINED MODEL PATH>" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 output_filename="<PATH TO OUTPUT JSON>" \
 batch_size=1
```

#### Performance
The model has been evaluated separately across the different domain distributions of the test data.
For more details, please refer to this paper: [OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking](https://arxiv.org/pdf/2305.09688.pdf)
![](https://i.imgur.com/rzWzEVj.png)

If you use this model in your work, please cite the following paper:
```
@misc{rakib2023oodspeech,
      title={OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking},
      author={Fazle Rabbi Rakib and Souhardya Saha Dip and Samiul Alam and Nazia Tasnim and Md. Istiak Hossain Shihab and Md. Nazmuddoha Ansary and Syed Mobassir Hossen and Marsia Haque Meghla and Mamunur Mamun and Farig Sadeque and Sayma Sultana Chowdhury and Tahsin Reasat and Asif Sushmit and Ahmed Imtiaz Humayun},
      year={2023},
      eprint={2305.09688},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```