appledora committed
Commit 1a2302f
1 Parent(s): a910682

Update README.md

Files changed (1)
  1. README.md +52 -13
README.md CHANGED
@@ -9,18 +9,18 @@ metrics:
  library_name: nemo
  pipeline_tag: automatic-speech-recognition
  ---
- #### Model
  **Conformer-CTC** model trained on the *OOD-Speech dataset* to transcribe speech from Bangla audio. This is a large variant of the model, with ~121M parameters. To learn more about the model architecture, see the NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

- #### Dataset
- The training split contains `1100+ hours` of audio data crowdsourced from native Bangla speakers. We trained on this split for `164 epochs`; the model was then evaluated on `23+ hours` of audio across 17 diverse domains, with a validation score of `22.4% WER`.

- #### Usage
  The model can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset through the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo). We recommend installing the toolkit after installing the PyTorch package.
  ```bash
  apt-get update && apt-get install -y libsndfile1 ffmpeg sox
  pip install Cython
- pip install nemo_toolkit['all']
  pip uninstall -y torchmetrics
  pip install torchmetrics==0.9.2
  ```
@@ -29,8 +29,8 @@ After installing the required dependencies, download the .nemo file or the pretr
  import nemo.collections.asr as nemo_asr
  asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<MODEL PATH>")
  ```
- ##### Data Preprocessing
- Prior to feeding the input audio to the pretrained model for inference, we need to resample the audio to **16 kHz**. We can achieve that using the `sox` library:
  ```python
  from sox import Transformer
  if not os.path.exists("<RESAMPLED AUDIO FILE PATH>"):
@@ -39,21 +39,60 @@ if not os.path.exists("<RESAMPLED AUDIO FILE PATH>"):
  tfm.channels(n_channels=1)
  tfm.build(input_filepath="<AUDIO FILE PATH>", output_filepath="<RESAMPLED AUDIO FILE PATH>")
  ```
- ##### Inference using the library
  ```python
  asr_model.transcribe(paths2audio_files="<LIST OF RESAMPLED AUDIO FILES>")
  ```
- ##### Infer an entire batch of audio files
  ```bash
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  model_path="<PRETRAINED MODEL PATH>" \
- audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
  output_filename="<PATH TO OUTPUT JSON>" \
  batch_size=1
  ```

- #### Performance
- The model has been separately evaluated across the different domain distributions of the test data.
  For more details, please refer to this paper: [OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking](https://arxiv.org/pdf/2305.09688.pdf)
  ![](https://i.imgur.com/rzWzEVj.png)

@@ -67,4 +106,4 @@ If you use this model in your work, please cite the following paper:
  archivePrefix={arXiv},
  primaryClass={eess.AS}
  }
- ```
 
  library_name: nemo
  pipeline_tag: automatic-speech-recognition
  ---
+ ## Model
  **Conformer-CTC** model trained on the *OOD-Speech dataset* to transcribe speech from Bangla audio. This is a large variant of the model, with ~121M parameters. To learn more about the model architecture, see the NeMo documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).

+ ## Dataset
+ The training split contains `1100+ hours` of audio data crowdsourced from native Bangla speakers. We trained on this split for `164 epochs`; the model was then evaluated on `23+ hours` of audio across 17 diverse domains.

+ ## Usage
  The model can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset through the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo). We recommend installing the toolkit after installing the PyTorch package.
  ```bash
  apt-get update && apt-get install -y libsndfile1 ffmpeg sox
  pip install Cython
+ pip install nemo_toolkit['all']  # 1.19.0
  pip uninstall -y torchmetrics
  pip install torchmetrics==0.9.2
  ```
 
  import nemo.collections.asr as nemo_asr
  asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<MODEL PATH>")
  ```
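+ Since the checkpoint is distributed through the Hugging Face Hub, it can also be fetched programmatically (a minimal sketch; the repo id and filename are placeholders to fill in):
+ ```python
+ # Hypothetical example: download the .nemo file from the Hub, then restore it as above.
+ from huggingface_hub import hf_hub_download
+ import nemo.collections.asr as nemo_asr
+ 
+ nemo_path = hf_hub_download(repo_id="<HF REPO ID>", filename="<CHECKPOINT NAME>.nemo")
+ asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path=nemo_path)
+ ```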
+ ## Data Preprocessing
+ Prior to feeding the input audio to the pretrained model for training or inference, we need to resample it to **16 kHz**. We can achieve that using the `sox` library:
  ```python
  import os
  from sox import Transformer
  
  if not os.path.exists("<RESAMPLED AUDIO FILE PATH>"):
      tfm = Transformer()
      tfm.rate(samplerate=16000)  # resample to 16 kHz
      tfm.channels(n_channels=1)  # downmix to mono
      tfm.build(input_filepath="<AUDIO FILE PATH>", output_filepath="<RESAMPLED AUDIO FILE PATH>")
  ```
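+ To resample a whole folder at once, the same transformation can be applied in a loop (a minimal sketch; the directory paths and the `.wav` extension are assumptions for illustration):
+ ```python
+ # Hypothetical batch version of the resampling step above.
+ import os
+ from sox import Transformer
+ 
+ in_dir = "<DIRECTORY CONTAINING AUDIO FILES>"
+ out_dir = "<DIRECTORY FOR RESAMPLED AUDIO FILES>"
+ os.makedirs(out_dir, exist_ok=True)
+ for name in os.listdir(in_dir):
+     if name.endswith(".wav"):
+         tfm = Transformer()
+         tfm.rate(samplerate=16000)
+         tfm.channels(n_channels=1)
+         tfm.build(input_filepath=os.path.join(in_dir, name), output_filepath=os.path.join(out_dir, name))
+ ```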
+ ## Training
+ We followed the official [NeMo documentation on training an ASR model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/examples/kinyarwanda_asr.html) to prepare our transcript manifests and train the model. However, we did not train a custom tokenizer; instead, we downloaded the tokenizer from [banglaBERT-large](https://huggingface.co/csebuetnlp/banglabert_large/) for better vocabulary coverage. For validation, we used `29589` samples held out from the training data and processed in the same way (manifest format sketched below). The final validation score was `22.4% WER`, at epoch `164`.
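+ Each manifest follows NeMo's JSON-lines convention, one utterance per line (a minimal sketch; the duration value is a placeholder):
+ ```python
+ # Hypothetical example of appending one entry to a transcript manifest.
+ import json
+ 
+ entry = {
+     "audio_filepath": "<RESAMPLED AUDIO FILE PATH>",
+     "duration": 3.42,  # clip length in seconds (placeholder)
+     "text": "<TRANSCRIPT>",
+ }
+ with open("<TRAINING MANIFEST JSON>", "a", encoding="utf-8") as f:
+     f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+ ```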
+ Final training script:
+ ```bash
+ export TRAIN_MANIFEST_PATH="<TRAINING MANIFEST JSON>"
+ export DEV_MANIFEST_PATH="<VALIDATION MANIFEST JSON>"
+ export TOKENIZER_PATH="<TOKENIZER FOLDER>"
+ export HYDRA_FULL_ERROR=1
+ python [NEMO_GIT_FOLDER]/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py --config-path=[NEMO_GIT_FOLDER]/examples/asr/conf/conformer/ --config-name=conformer_ctc_bpe \
+ model.train_ds.manifest_filepath=${TRAIN_MANIFEST_PATH} \
+ model.validation_ds.manifest_filepath=${DEV_MANIFEST_PATH} \
+ model.tokenizer.dir=${TOKENIZER_PATH} \
+ model.tokenizer.type=wpe \
+ trainer.devices=4 \
+ trainer.accelerator="gpu" \
+ trainer.strategy="ddp" \
+ trainer.max_epochs=1000 \
+ model.optim.name="adamw" \
+ model.optim.lr=0.001 \
+ model.optim.betas=[0.9,0.999] \
+ model.optim.weight_decay=0.0001 \
+ model.optim.sched.warmup_steps=2000 \
+ exp_manager.exp_dir=results/ \
+ exp_manager.create_wandb_logger=False \
+ exp_manager.resume_if_exists=true
+ ```
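+ As noted above, the tokenizer comes from banglaBERT-large rather than being trained from scratch. One way to materialize the tokenizer folder is via `transformers` (a sketch under the assumption that NeMo's `wpe` tokenizer can consume the WordPiece `vocab.txt` shipped with that model):
+ ```python
+ # Hypothetical helper: save banglaBERT-large's WordPiece vocabulary locally so the
+ # folder can be passed as model.tokenizer.dir with model.tokenizer.type=wpe.
+ from transformers import AutoTokenizer
+ 
+ tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert_large")
+ tokenizer.save_pretrained("<TOKENIZER FOLDER>")  # writes vocab.txt among other files
+ ```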
+ ## Evaluation
+ The model has been evaluated on `14,016` test samples. The generated output file contains both the ground-truth and the predicted strings, from which the model's Word Error Rate (WER) and Character Error Rate (CER) are computed (see the scoring sketch below).
+ ```bash
+ export HYDRA_FULL_ERROR=1
+ python3 [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
+ model_path="<PRETRAINED MODEL PATH>" \
+ dataset_manifest="<TEST MANIFEST JSON>" \
+ output_filename=test_with_predictions.json \
+ batch_size=1
+ ```
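+ One way to compute the scores from the output manifest with the pinned `torchmetrics` version (a sketch; it assumes each line of `test_with_predictions.json` is a JSON object with the reference under `"text"` and the prediction under `"pred_text"`):
+ ```python
+ # Hypothetical scoring script for the transcription output above.
+ import json
+ from torchmetrics import CharErrorRate, WordErrorRate
+ 
+ refs, hyps = [], []
+ with open("test_with_predictions.json", encoding="utf-8") as f:
+     for line in f:
+         sample = json.loads(line)
+         refs.append(sample["text"])
+         hyps.append(sample["pred_text"])
+ 
+ print("WER:", float(WordErrorRate()(hyps, refs)))
+ print("CER:", float(CharErrorRate()(hyps, refs)))
+ ```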
+ 
+ **Test dataset WER/CER: 69.25% / 42.13%**
+ ## Inference
+ ### Inference using the library
  ```python
  asr_model.transcribe(paths2audio_files="<LIST OF RESAMPLED AUDIO FILES>")
  ```
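+ `transcribe` returns the transcriptions in the same order as the input paths (a short usage sketch; the file names are placeholders):
+ ```python
+ # Hypothetical usage: transcribe two resampled clips and print the first result.
+ transcriptions = asr_model.transcribe(paths2audio_files=["clip1_16k.wav", "clip2_16k.wav"], batch_size=2)
+ print(transcriptions[0])
+ ```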
+ ### Infer directly using the repo
  ```bash
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  model_path="<PRETRAINED MODEL PATH>" \
+ audio_dir="<DIRECTORY CONTAINING RESAMPLED AUDIO FILES>" \
  output_filename="<PATH TO OUTPUT JSON>" \
  batch_size=1
  ```

+ ## Domain-wise Performance
+ The model has also been evaluated separately across the different domain distributions of the test data.
  For more details, please refer to this paper: [OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking](https://arxiv.org/pdf/2305.09688.pdf)
  ![](https://i.imgur.com/rzWzEVj.png)

  archivePrefix={arXiv},
  primaryClass={eess.AS}
  }
+ ```