Titouan committed on
Commit
c3213c0
1 Parent(s): 9c70d1a

push model

Files changed (6)
  1. README.md +108 -0
  2. asr.ckpt +3 -0
  3. config.json +73 -0
  4. preprocessor_config.json +9 -0
  5. tokenizer.ckpt +3 -0
  6. wav2vec2.ckpt +3 -0
README.md ADDED
@@ -0,0 +1,108 @@
---
language: "rw"
thumbnail:
tags:
- automatic-speech-recognition
- CTC
- Attention
- pytorch
- speechbrain
- Transformer
license: "apache-2.0"
datasets:
- commonvoice
metrics:
- wer
- cer
---

<iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
<br/><br/>

# wav2vec 2.0 with CTC/Attention trained on CommonVoice Kinyarwanda (No LM)

This repository provides all the necessary tools to perform automatic speech
recognition with an end-to-end system pretrained on CommonVoice (Kinyarwanda) within
SpeechBrain. For a better experience, we encourage you to learn more about
[SpeechBrain](https://speechbrain.github.io).

The performance of the model is as follows:

| Release | Test WER | GPUs |
|:--------------:|:--------------:| :--------:|
| 03-06-21 | 15.69 | 2xV100 32GB |

## Pipeline description

This ASR system is composed of two linked blocks:
- A unigram tokenizer that transforms words into subword units, trained on the
train transcriptions (train.tsv) of CommonVoice (RW); a training sketch follows this list.
- An acoustic model (wav2vec 2.0 + CTC/Attention). A pretrained wav2vec 2.0 model ([wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60)) is combined with two DNN layers and fine-tuned on CommonVoice Kinyarwanda.
The obtained final acoustic representation is given to the CTC and attention decoders.
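
The tokenizer training itself is not shown in this repository. As a rough illustration only (the input file name and vocabulary size below are assumptions, not the recipe's values), a unigram subword tokenizer of this kind can be trained with SentencePiece:

```python
# Illustrative sketch: training a unigram subword tokenizer with SentencePiece.
# "train_transcriptions.txt" and vocab_size are hypothetical placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcriptions.txt",  # one transcript per line, extracted from train.tsv
    model_prefix="tokenizer",
    model_type="unigram",
    vocab_size=1000,                   # assumed; the real value comes from the recipe hparams
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("muraho neza", out_type=str))  # words -> subword units
```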

## Install SpeechBrain

First, please install transformers and SpeechBrain with the following command:

```
pip install speechbrain transformers
```

We also encourage you to read our tutorials and learn more about
[SpeechBrain](https://speechbrain.github.io).

### Transcribing your own audio files (in Kinyarwanda)

```python
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-rw", savedir="pretrained_models/asr-wav2vec2-commonvoice-rw")
asr_model.transcribe_file("example.wav")
```
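
`transcribe_file` loads the audio file for you. For waveforms already in memory, the model also exposes `transcribe_batch`; a minimal sketch, assuming 16 kHz mono input (the file name is hypothetical):

```python
# Sketch: batch transcription from an in-memory waveform with transcribe_batch.
import torch
import torchaudio
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-rw",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-rw",
)

signal, _sr = torchaudio.load("example.wav")   # [1, time] for a mono file
rel_lens = torch.tensor([1.0])                 # relative lengths, 1.0 = full length
predicted_words, predicted_tokens = asr_model.transcribe_batch(signal, rel_lens)
print(predicted_words[0])
```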

### Inference on GPU

To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
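
For example (assuming a CUDA-capable GPU is available):

```python
# Load the model on GPU; run_opts is documented above.
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-rw",
    savedir="pretrained_models/asr-wav2vec2-commonvoice-rw",
    run_opts={"device": "cuda"},
)
asr_model.transcribe_file("example.wav")
```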

### Training

The model was trained with SpeechBrain.
To train it from scratch, follow these steps:
1. Clone SpeechBrain:
```bash
git clone https://github.com/speechbrain/speechbrain/
```
2. Install it:
```bash
cd speechbrain
pip install -r requirements.txt
pip install -e .
```

3. Run Training:
```bash
cd recipes/CommonVoice/ASR/seq2seq
python train.py hparams/train_rw_with_wav2vec.yaml --data_folder=your_data_folder
```

You can find our training results (models, logs, etc.) [here](https://drive.google.com/drive/folders/1tjz6IZmVRkuRE97E7h1cXFoGTer7pT73?usp=sharing).

### Limitations

The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

#### Referencing SpeechBrain

```
@misc{SB2021,
  author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua},
  title = {SpeechBrain},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/speechbrain/speechbrain}},
}
```

#### About SpeechBrain

SpeechBrain is an open-source, all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly, and it obtains competitive or state-of-the-art performance in various domains.

Website: https://speechbrain.github.io/
GitHub: https://github.com/speechbrain/speechbrain
asr.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:226ceac9961434e257b4527ce949dafd857df06480c08b49733aeb55bb62f871
size 64926032
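
The `.ckpt` entries in this commit are Git LFS pointer files (note the spec URL above): only a hash and a size are stored in Git, while the actual binaries live in LFS storage. As a sketch (the `huggingface_hub` client is not part of this commit), one way to fetch a checkpoint without a full LFS clone:

```python
# Sketch: resolve this LFS pointer to the real ~65 MB checkpoint file.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="speechbrain/asr-wav2vec2-commonvoice-rw",
    filename="asr.ckpt",
)
print(ckpt_path)  # local cache path; size should match the pointer (64926032 bytes)
```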
config.json ADDED
@@ -0,0 +1,73 @@
{
  "activation_dropout": 0.0,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2Model"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.1,
  "final_dropout": 0.0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "mask_channel_length": 10,
  "mask_channel_min_space": 1,
  "mask_channel_other": 0.0,
  "mask_channel_prob": 0.0,
  "mask_channel_selection": "static",
  "mask_time_length": 10,
  "mask_time_min_space": 1,
  "mask_time_other": 0.0,
  "mask_time_prob": 0.075,
  "mask_time_selection": "static",
  "model_type": "wav2vec2",
  "num_attention_heads": 16,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "transformers_version": "4.4.0.dev0",
  "vocab_size": 32
}
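
For reference, this config can be loaded and inspected with Hugging Face transformers; a minimal sketch, assuming transformers is installed:

```python
# Sketch: inspect the wav2vec 2.0 encoder configuration defined above.
from transformers import Wav2Vec2Config

config = Wav2Vec2Config.from_pretrained("speechbrain/asr-wav2vec2-commonvoice-rw")
print(config.hidden_size)        # 1024 (large architecture)
print(config.num_hidden_layers)  # 24 transformer layers
```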
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
{
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}
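
The matching feature extractor normalizes raw mono audio and expects 16 kHz input, per the values above; a minimal usage sketch:

```python
# Sketch: run the feature extractor on one second of dummy 16 kHz audio.
from transformers import Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("speechbrain/asr-wav2vec2-commonvoice-rw")
inputs = extractor([0.0] * 16000, sampling_rate=16000, return_tensors="pt")
print(inputs.input_values.shape)  # torch.Size([1, 16000])
```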
tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42678d102fa4fa8661b2ddc2b02127246666390fa5bafd0aa62cb3a83470fd70
size 252573
wav2vec2.ckpt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c7fb070f3048c48973e388a0171d50c561458006d272d73d3388c3e308d6ec38
size 1261930757