---
license: mit
language:
- bn
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
---

#### Model

**Conformer-CTC** model trained on the *OOD-Speech dataset* to transcribe speech from Bangla audio. This is the large variant of the model, with ~121M parameters. To learn more about the model architecture, see the NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

#### Dataset

The training split contains `1100+ hours` of audio data crowdsourced from native Bangla speakers. We trained on this split for `164 epochs`; the model was then evaluated on `23+ hours` of audio spanning 17 diverse domains, with a validation score of `22.4% WER`.

#### Usage

The model can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset through the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo). It is recommended to install the toolkit after installing the PyTorch package.

```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg sox
pip install Cython
pip install nemo_toolkit['all']
pip uninstall -y torchmetrics
pip install torchmetrics==0.9.2
```

After installing the required dependencies, download the `.nemo` file of the pretrained model to your local directory. You can then instantiate the pretrained model as follows:

```python
import nemo.collections.asr as nemo_asr

# Point restore_path at the downloaded .nemo checkpoint
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="")
```

##### Data Preprocessing

Prior to feeding the input audio to the pretrained model for inference, we need to resample it to **16 kHz**. We can achieve that using the `sox` Python library:

```python
import os

from sox import Transformer

# Fill in the input and output paths; skip conversion if the
# resampled file already exists
if not os.path.exists(""):
    # Convert the input audio to 16 kHz, single-channel
    tfm = Transformer()
    tfm.rate(samplerate=16000)
    tfm.channels(n_channels=1)
    tfm.build(input_filepath="", output_filepath="")
```
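Once the audio has been resampled, the instantiated model can transcribe it directly. A minimal sketch, assuming the preprocessing step above wrote its output to a hypothetical `resampled.wav` (the placeholder paths in the snippets above must be filled in with your own locations):

```python
# Minimal inference sketch. "resampled.wav" is a hypothetical output
# path for the 16 kHz audio produced by the preprocessing step above.
transcriptions = asr_model.transcribe(["resampled.wav"])
print(transcriptions[0])
```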