Deovrat committed on
Commit
8ffe7b1
1 Parent(s): e9e1f81

Push model using huggingface_hub.

README.md CHANGED
@@ -1,196 +1,111 @@
  ---
- license: cc-by-4.0
- library_name: nemo
- tags:
-   - pytorch
-   - NeMo
  ---

- # Indicconformer Stt Ur Hybrid Rnnt Large
-
- <style>
- img {
-   display: inline;
- }
- </style>
-
- [![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
- | [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
- | [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)
-
- **Put a short model description here.**
-
- See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.
-
- ## NVIDIA NeMo: Training
-
- To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
- ```
- pip install nemo_toolkit['all']
- ```
-
- ## How to Use this Model

- The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

- ### Automatically instantiate the model

- **NOTE**: Please update the model class below to match the class of the model being uploaded.
-
- ```python
- import nemo.core import ModelPT
- model = ModelPT.from_pretrained("ai4bharat/indicconformer_stt_ur_hybrid_rnnt_large")
- ```
-
- ### NOTE
-
- Add some information about how to use the model here. An example is provided for ASR inference below.
-
- ### Transcribing using Python
- First, let's get a sample
- ```
- wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
- ```
- Then simply do:
  ```
- asr_model.transcribe(['2086-149220-0033.wav'])
  ```

- ### Transcribing many audio files
-
- ```shell
- python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="ai4bharat/indicconformer_stt_ur_hybrid_rnnt_large" audio_dir=""
  ```

- ### Input
-
- **Add some information about what are the inputs to this model**
-
- ### Output
-
- **Add some information about what are the outputs of this model**
-
- ## Model Architecture
-
- **Add information here discussing architectural details of the model or any comments to users about the model.**
-
- ## Training
-
- **Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**
-
- ### NOTE
-
- An example is provided below for ASR
-
- The NeMo toolkit [3] was used for training the models for over several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).
-
- The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
-
- ### Datasets
-
- **Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding it to the manifest section at the top of the README (marked by ---).**
-
- ### NOTE
-
- An example for the manifest section is provided below for ASR datasets
-
- datasets:
- - librispeech_asr
- - fisher_corpus
- - Switchboard-1
- - WSJ-0
- - WSJ-1
- - National-Singapore-Corpus-Part-1
- - National-Singapore-Corpus-Part-6
- - vctk
- - voxpopuli
- - europarl
- - multilingual_librispeech
- - mozilla-foundation/common_voice_8_0
- - MLCommons/peoples_speech
-
- The corresponding text in this section for those datasets is stated below -

- The model was trained on 64K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.

- The training dataset consists of private subset with 40K hours of English speech plus 24K hours from the following public datasets:

- - Librispeech 960 hours of English speech
- - Fisher Corpus
- - Switchboard-1 Dataset
- - WSJ-0 and WSJ-1
- - National Speech Corpus (Part 1, Part 6)
- - VCTK
- - VoxPopuli (EN)
- - Europarl-ASR (EN)
- - Multilingual Librispeech (MLS EN) - 2,000 hour subset
- - Mozilla Common Voice (v7.0)
- - People's Speech - 12,000 hour subset

- ## Performance

- **Add information here about the performance of the model. Discuss what is the metric that is being used to evaluate the model and if there are external links explaining the custom metric, please link to it.

- ### NOTE

- An example is provided below for ASR metrics list that can be added to the top of the README
-
- model-index:
- - name: PUT_MODEL_NAME
-   results:
-   - task:
-       name: Automatic Speech Recognition
-       type: automatic-speech-recognition
-     dataset:
-       name: AMI (Meetings test)
-       type: edinburghcstr/ami
-       config: ihm
-       split: test
-       args:
-         language: en
-     metrics:
-     - name: Test WER
-       type: wer
-       value: 17.10
-   - task:
-       name: Automatic Speech Recognition
-       type: automatic-speech-recognition
-     dataset:
-       name: Earnings-22
-       type: revdotcom/earnings22
-       split: test
-       args:
-         language: en
-     metrics:
-     - name: Test WER
-       type: wer
-       value: 14.11

- Provide any caveats about the results presented in the top of the discussion so that nuance is not lost.

- It should ideally be in a tabular format (you can use the following website to make your tables in markdown format - https://www.tablesgenerator.com/markdown_tables)**

- ## Limitations

- **Discuss any practical limitations to the model when being used in real world cases. They can also be legal disclaimers, or discussion regarding the safety of the model (particularly in the case of LLMs).**

- ### Note

- An example is provided below

  Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

- ## License

- License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

- ## References

- **Provide appropriate references in the markdown link format below. Please order them numerically.**
-
- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
  ---
+ {}
  ---

+ ## IndicConformer

+ IndicConformer is a Hybrid RNNT Conformer model built for Urdu.

+ ## AI4Bharat NeMo

+ To load, train, fine-tune or play with the model, you will need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend you install it using the command shown below.
  ```
+ git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
  ```

+ ## Usage
+
+ ```bash
+ $ python inference.py --help
+ usage: inference.py [-h] -c CHECKPOINT -f AUDIO_FILEPATH -d (cpu,cuda) -l LANGUAGE_CODE
+
+ options:
+   -h, --help            show this help message and exit
+   -c CHECKPOINT, --checkpoint CHECKPOINT
+                         Path to .nemo file
+   -f AUDIO_FILEPATH, --audio_filepath AUDIO_FILEPATH
+                         Audio filepath
+   -d (cpu,cuda), --device (cpu,cuda)
+                         Device (cpu/gpu)
+   -l LANGUAGE_CODE, --language_code LANGUAGE_CODE
+                         Language Code (eg. hi)
  ```
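
For reference, a command-line interface like the one shown above can be reproduced with a small argparse parser. This is an illustrative sketch, not the actual source of inference.py in the AI4Bharat/NeMo repo; the sample argument values are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the options listed in the --help output above.
    parser = argparse.ArgumentParser(prog="inference.py")
    parser.add_argument("-c", "--checkpoint", required=True,
                        help="Path to .nemo file")
    parser.add_argument("-f", "--audio_filepath", required=True,
                        help="Audio filepath")
    parser.add_argument("-d", "--device", choices=("cpu", "cuda"), required=True,
                        help="Device (cpu/cuda)")
    parser.add_argument("-l", "--language_code", required=True,
                        help="Language code (e.g. hi)")
    return parser

# Hypothetical invocation matching the usage string:
args = build_parser().parse_args(
    ["-c", "model.nemo", "-f", "sample.wav", "-d", "cuda", "-l", "ur"]
)
print(args.checkpoint, args.device, args.language_code)  # model.nemo cuda ur
```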

+ ## Example command
+ ```
+ python inference.py -c ai4b_indicConformer_hi.nemo -f hindi-16khz.wav -d cuda -l hi
+ ```
+ Expected output -

+ ```
+ Loading model..
+ ...
+ Transcibing..
+ ----------
+ Transcript:
+ Took ** seconds.
+ ----------
+ ```

+ ### Input

+ This model accepts 16 kHz mono-channel audio (wav files) as input.
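
A quick way to confirm a wav file meets this requirement is Python's standard-library wave module. A minimal sketch (the file name is hypothetical; the demo writes a short silent clip just to have something to check):

```python
import struct
import wave

def is_compatible(path: str) -> bool:
    """Return True if the wav file is 16 kHz and mono, as the model expects."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1

# Demo: write a 0.1 s silent 16 kHz mono clip, then check it.
with wave.open("check_16k_mono.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 1600)

print(is_compatible("check_16k_mono.wav"))  # True
```

Files at other sample rates can be resampled to 16 kHz mono with any standard audio tool before transcription.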
 
 
 
 
 
 
 
 
 
 

+ ### Output

+ This model provides transcribed speech as a string for a given audio sample.

+ ## Model Architecture

+ This model is a Conformer-Large model, consisting of 120M parameters, with a Conformer encoder and a hybrid CTC-RNNT decoder. The encoder has 17 conformer blocks with a model dimension of 512.

+ ## Training

+ <ADD INFORMATION ABOUT HOW THE MODEL WAS TRAINED - HOW MANY EPOCHS, AMOUNT OF COMPUTE ETC>

+ ### Datasets

+ <LIST THE NAME AND SPLITS OF DATASETS USED TO TRAIN THIS MODEL (ALONG WITH LANGUAGE AND ANY ADDITIONAL INFORMATION)>

+ ## Performance

+ <LIST THE SCORES OF THE MODEL -
+ OR
+ USE THE HUGGING FACE EVALUATE LIBRARY TO UPLOAD METRICS>

+ ## Limitations

+ <DECLARE ANY POTENTIAL LIMITATIONS OF THE MODEL>

+ E.g.:
  Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

+ ## References

+ <ADD ANY REFERENCES HERE AS NEEDED>

+ [1] [AI4Bharat NeMo Toolkit](https://github.com/AI4Bharat/NeMo)

+ language:
+ - Urdu
+ license: mit
+ library_name: nemo
+ datasets:
+ - IndicVoices
+ - Vistaar
+ - Mahadhwani
+ thumbnail: null
+ tags:
+ - automatic-speech-recognition
+ - speech
+ - audio
+ - RNNT
+ - HybridConformer
+ - Transformer
+ - NeMo
+ - pytorch
+ model-index:
+ - name: indicconformer_stt_ur_hybrid_rnnt_large
+   results: []
indicconformer_stt_ur_hybrid_rnnt_large.nemo CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8cdc40409f0c49a9106b66e60759b12dc0b30270dc2837d0b24e0baa15cedf6
+ oid sha256:e919d3b10d44aea15d39e47885bda69a4a6ad4764d367767cdd6b0aaa1d79898
  size 523192320