dave-rtzr committed
Commit
92233a4
Parent: 11f0f98

Add model weights and hyperparams

Files changed (9)
  1. README.md +108 -0
  2. asr.ckpt +3 -0
  3. hyperparams.yaml +151 -0
  4. lm.ckpt +3 -0
  5. normalizer.ckpt +3 -0
  6. record_0_16k.wav +0 -0
  7. record_1_16k.wav +0 -0
  8. record_2_16k.wav +0 -0
  9. tokenizer.ckpt +3 -0
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ language: "ko"
+ thumbnail:
+ tags:
+ - ASR
+ - CTC
+ - Attention
+ - Conformer
+ - pytorch
+ - speechbrain
+ license: "apache-2.0"
+ datasets:
+ - ksponspeech
+ metrics:
+ - wer
+ - cer
+ ---
+
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
+ <br/><br/>
+
+ # Conformer for KsponSpeech (with Transformer LM)
+
+ This repository provides all the necessary tools to perform automatic speech
+ recognition from an end-to-end system pretrained on KsponSpeech (Korean) within
+ SpeechBrain. For a better experience, we encourage you to learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ The performance of the model is the following:
+
+ | Release | eval clean CER | eval other CER | GPUs |
+ |:-------------:|:--------------:|:--------------:|:-----------:|
+ | 09-05-21 | 7.86 | 8.93 | 6xA100 80GB |
+
+ ## Pipeline description
+
+ This ASR system is composed of 3 different but linked blocks (a short sketch follows the list):
+ - Tokenizer (unigram) that transforms words into subword units, trained on
+ the training transcriptions of KsponSpeech.
+ - Neural language model (Transformer LM) trained on the training transcriptions of KsponSpeech.
+ - Acoustic model made of a Conformer encoder and a joint decoder with CTC +
+ transformer. Hence, the decoding also incorporates the CTC probabilities.
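+
+ For example, the `tokenizer.ckpt` file in this repository is a plain SentencePiece model (see `tokenizer: !new:sentencepiece.SentencePieceProcessor` in `hyperparams.yaml`), so the first block can be inspected on its own. A minimal sketch, assuming the model has already been downloaded to the `savedir` used in the transcription example below (the Korean sentence is an arbitrary sample):
+
+ ```python
+ import sentencepiece as spm
+
+ # Load the unigram tokenizer shipped as tokenizer.ckpt.
+ sp = spm.SentencePieceProcessor()
+ sp.load("pretrained_models/ksponspeech-conformer-medium/tokenizer.ckpt")
+
+ text = "안녕하세요 반갑습니다"  # arbitrary sample sentence
+ print(sp.encode_as_pieces(text))  # words -> subword units
+ print(sp.encode_as_ids(text))     # subword units -> ids (what the models predict)
+ print(sp.decode_ids(sp.encode_as_ids(text)))  # ids -> text, as done after decoding
+ ```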
+
+ ## Install SpeechBrain
+
+ First of all, please install SpeechBrain with the following command:
+ ```bash
+ pip install git+https://github.com/speechbrain/speechbrain.git@develop
+ ```
+ Please note that we encourage you to read our tutorials and learn more about
+ [SpeechBrain](https://speechbrain.github.io).
+
+ ### Transcribing your own audio files (in Korean)
+ ```python
+ from speechbrain.pretrained import EncoderDecoderASR
+
+ asr_model = EncoderDecoderASR.from_hparams(source="dave-rtzr/ksponspeech-conformer-medium", savedir="pretrained_models/ksponspeech-conformer-medium")
+ # record_0_16k.wav is one of the 16 kHz sample recordings shipped with this repository.
+ asr_model.transcribe_file("dave-rtzr/ksponspeech-conformer-medium/record_0_16k.wav")
+ ```
+
+ ### Inference on GPU
+
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method:
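+
+ ```python
+ from speechbrain.pretrained import EncoderDecoderASR
+
+ # Same call as above, but the model and inference run on a CUDA device.
+ asr_model = EncoderDecoderASR.from_hparams(
+     source="dave-rtzr/ksponspeech-conformer-medium",
+     savedir="pretrained_models/ksponspeech-conformer-medium",
+     run_opts={"device": "cuda"},
+ )
+ ```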
59
+
60
+ ## Parallel Inference on a Batch
61
+
62
+ Please, [see this Colab notebook](https://colab.research.google.com/drive/10N98aGoeLGfh6Hu6xOCH5BbjVTVYgCyB?usp=sharing) on using the pretrained model
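+
+ The sketch pads the sample recordings from this repository into one batch and calls the `transcribe_batch` method of `EncoderDecoderASR`, which expects a padded batch plus relative lengths:
+
+ ```python
+ import torch
+ from speechbrain.pretrained import EncoderDecoderASR
+
+ asr_model = EncoderDecoderASR.from_hparams(
+     source="dave-rtzr/ksponspeech-conformer-medium",
+     savedir="pretrained_models/ksponspeech-conformer-medium",
+ )
+
+ # The three 16 kHz sample recordings shipped with this repository.
+ files = ["record_0_16k.wav", "record_1_16k.wav", "record_2_16k.wav"]
+ sigs = [
+     asr_model.load_audio(f"dave-rtzr/ksponspeech-conformer-medium/{f}")
+     for f in files
+ ]
+
+ # Zero-pad to the longest waveform and compute relative lengths in [0, 1].
+ batch = torch.nn.utils.rnn.pad_sequence(sigs, batch_first=True)
+ lens = torch.tensor([s.shape[0] for s in sigs], dtype=torch.float)
+ rel_lens = lens / lens.max()
+
+ predictions, _ = asr_model.transcribe_batch(batch, rel_lens)
+ print(predictions)
+ ```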
63
+
64
+ ### Training
65
+
66
+ The model was trained with SpeechBrain (Commit hash: 'fd9826c').
67
+ To train it from scratch follow these steps:
68
+ 1. Clone SpeechBrain:
69
+ ```bash
70
+ git clone https://github.com/speechbrain/speechbrain/
71
+ ```
72
+ 2. Install it:
73
+ ```bash
74
+ cd speechbrain
75
+ pip install -r requirements.txt
76
+ pip install .
77
+ ```
78
+ 3. Run Training:
79
+ ```bash
80
+ cd recipes/KsponSpeech/ASR/transformer
81
+ python train.py hparams/conformer_medium.yaml --data_folder=your_data_folder
82
+ ```
83
+ You can find our training results (models, logs, etc) at the subdirectories.
84
+
85
+ ### Limitations
86
+
87
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
88
+
89
+ # **About SpeechBrain**
90
+
91
+ - Website: https://speechbrain.github.io/
92
+ - Code: https://github.com/speechbrain/speechbrain/
93
+ - HuggingFace: https://huggingface.co/speechbrain/
94
+
95
+ # **Citing SpeechBrain**
96
+
97
+ Please, cite SpeechBrain if you use it for your research or business.
98
+ ```bibtex
99
+ @misc{speechbrain,
100
+ title={{SpeechBrain}: A General-Purpose Speech Toolkit},
101
+ author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
102
+ year={2021},
103
+ eprint={2106.04624},
104
+ archivePrefix={arXiv},
105
+ primaryClass={eess.AS},
106
+ note={arXiv:2106.04624}
107
+ }
108
+ ```
asr.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:893a5fb84a67315a954d7645fd3b5f96cee806531f538e0073f6dcdf17dcf7c3
+ size 183510489
hyperparams.yaml ADDED
@@ -0,0 +1,151 @@
+ # ############################################################################
+ # Model: E2E ASR with Transformer
+ # Encoder: Conformer Encoder
+ # Decoder: Transformer Decoder + (CTC/ATT joint) beamsearch + TransformerLM
+ # Tokens: unigram
+ # losses: CTC + KLdiv (Label Smoothing loss)
+ # Training: KsponSpeech 965.2h
+ # Authors: Dongwon Kim, Dongwoo Kim
+ # ############################################################################
+ # Seed needs to be set at top of yaml, before objects with parameters are made
+
+ # Feature parameters
+ sample_rate: 16000
+ n_fft: 400
+ n_mels: 80
+
+ ####################### Model parameters ###########################
+ # Transformer
+ d_model: 256
+ nhead: 4
+ num_encoder_layers: 12
+ num_decoder_layers: 6
+ d_ffn: 2048
+ transformer_dropout: 0.0
+ activation: !name:torch.nn.GELU
+ output_neurons: 5000
+ vocab_size: 5000
+
+ # Outputs
+ blank_index: 0
+ label_smoothing: 0.1
+ pad_index: 0
+ bos_index: 1
+ eos_index: 2
+ unk_index: 0
+
+ # Decoding parameters
+ min_decode_ratio: 0.0
+ max_decode_ratio: 1.0
+ valid_search_interval: 10
+ valid_beam_size: 10
+ test_beam_size: 60
+ lm_weight: 0.60
+ ctc_weight_decode: 0.40
+
+ ############################## models ################################
+
+ normalizer: !new:speechbrain.processing.features.InputNormalization
+     norm_type: global
+
+ CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
+     input_shape: (8, 10, 80)
+     num_blocks: 2
+     num_layers_per_block: 1
+     out_channels: (64, 32)
+     kernel_sizes: (3, 3)
+     strides: (2, 2)
+     residuals: (False, False)
+
+ Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
+     input_size: 640
+     tgt_vocab: !ref <output_neurons>
+     d_model: !ref <d_model>
+     nhead: !ref <nhead>
+     num_encoder_layers: !ref <num_encoder_layers>
+     num_decoder_layers: !ref <num_decoder_layers>
+     d_ffn: !ref <d_ffn>
+     dropout: !ref <transformer_dropout>
+     activation: !ref <activation>
+     encoder_module: conformer
+     attention_type: RelPosMHAXL
+     normalize_before: True
+     causal: False
+
+ # NB: It has to match the pre-trained TransformerLM!!
+ lm_model: !new:speechbrain.lobes.models.transformer.TransformerLM.TransformerLM # yamllint disable-line rule:line-length
+     vocab: !ref <output_neurons>
+     d_model: 768
+     nhead: 12
+     num_encoder_layers: 12
+     num_decoder_layers: 0
+     d_ffn: 3072
+     dropout: 0.0
+     activation: !name:torch.nn.GELU
+     normalize_before: False
+
+ tokenizer: !new:sentencepiece.SentencePieceProcessor
+
+ ctc_lin: !new:speechbrain.nnet.linear.Linear
+     input_size: !ref <d_model>
+     n_neurons: !ref <output_neurons>
+
+ seq_lin: !new:speechbrain.nnet.linear.Linear
+     input_size: !ref <d_model>
+     n_neurons: !ref <output_neurons>
+
+ decoder: !new:speechbrain.decoders.S2STransformerBeamSearch
+     modules: [!ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
+     bos_index: !ref <bos_index>
+     eos_index: !ref <eos_index>
+     blank_index: !ref <blank_index>
+     min_decode_ratio: !ref <min_decode_ratio>
+     max_decode_ratio: !ref <max_decode_ratio>
+     beam_size: !ref <test_beam_size>
+     ctc_weight: !ref <ctc_weight_decode>
+     lm_weight: !ref <lm_weight>
+     lm_modules: !ref <lm_model>
+     temperature: 1.15
+     temperature_lm: 1.15
+     using_eos_threshold: False
+     length_normalization: True
+
+ Tencoder: !new:speechbrain.lobes.models.transformer.TransformerASR.EncoderWrapper
+     transformer: !ref <Transformer>
+
+ encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
+     input_shape: [null, null, !ref <n_mels>]
+     compute_features: !ref <compute_features>
+     normalize: !ref <normalizer>
+     cnn: !ref <CNN>
+     transformer_encoder: !ref <Tencoder>
+
+ asr_model: !new:torch.nn.ModuleList
+     - [!ref <normalizer>, !ref <CNN>, !ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
+
+ log_softmax: !new:torch.nn.LogSoftmax
+     dim: -1
+
+
+ compute_features: !new:speechbrain.lobes.features.Fbank
+     sample_rate: !ref <sample_rate>
+     n_fft: !ref <n_fft>
+     n_mels: !ref <n_mels>
+
+ modules:
+     compute_features: !ref <compute_features>
+     normalizer: !ref <normalizer>
+     pre_transformer: !ref <CNN>
+     transformer: !ref <Transformer>
+     asr_model: !ref <asr_model>
+     lm_model: !ref <lm_model>
+     encoder: !ref <encoder>
+     decoder: !ref <decoder>
+ # The pretrainer allows a mapping between pretrained files and instances that
+ # are declared in the yaml.
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+     loadables:
+         normalizer: !ref <normalizer>
+         asr: !ref <asr_model>
+         lm: !ref <lm_model>
+         tokenizer: !ref <tokenizer>
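
The `pretrainer` block at the end of this file maps the checkpoint files added in this commit (`normalizer.ckpt`, `asr.ckpt`, `lm.ckpt`, `tokenizer.ckpt`) onto the instances declared above. The high-level `EncoderDecoderASR.from_hparams` call shown in the README drives this for you; as a rough sketch of the underlying mechanics, assuming the YAML and the checkpoints sit in the current working directory:

```python
from hyperpyyaml import load_hyperpyyaml

# Build every object declared in the YAML (feature extractor, CNN front end,
# Conformer, Transformer LM, beam-search decoder, pretrainer, ...).
with open("hyperparams.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Fetch the checkpoint files and copy their parameters into the
# instances listed under `loadables`.
pretrainer = hparams["pretrainer"]
pretrainer.collect_files(default_source=".")
pretrainer.load_collected(device="cpu")

encoder = hparams["encoder"]  # Fbank -> normalization -> CNN -> Conformer
decoder = hparams["decoder"]  # joint CTC/attention beam search with the LM
```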
lm.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee4a5a5d9ce11e24dcea93f24a241528b9b376798be6478c70fb279736515110
+ size 381074814
normalizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4866d96b29f5c97526c7469aa6f58cd50aeb9865b457daf599f0f42e5827be9
+ size 1783
record_0_16k.wav ADDED
Binary file (115 kB).
 
record_1_16k.wav ADDED
Binary file (170 kB).
 
record_2_16k.wav ADDED
Binary file (133 kB).
 
tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e095c023a42b6bd25352512597a245db9bf9126ce6bf64082bd41d0a196b220
+ size 313899