jsnfly committed
Commit 9da7e2a
1 Parent(s): 660038c

initial model files
Dockerfile ADDED
@@ -0,0 +1,27 @@
+ # Adapted from https://github.com/baaastijn/Dockerimages/tree/main/Hugginface_challenge_speech to use custom
+ # Transformers branch.
+
+ # Base image. Here we take one from OVHcloud with Jupyter and PyTorch inside.
+ FROM ovhcom/ai-training-pytorch:latest
+
+ # Install git, audio libraries and git-lfs.
+ RUN apt-get update && \
+     apt-get install -y git && \
+     apt-get install -y libsndfile1-dev sox && \
+     apt-get install -y git-lfs && \
+     git lfs install
+
+ # Install required Python libraries. Transformers is installed from source to get the latest version.
+ RUN pip install --upgrade pip && \
+     pip install git+https://github.com/jsnfly/transformers.git@speech-challenge-experiments && \
+     pip install git+https://github.com/huggingface/datasets && \
+     pip install torchaudio librosa jiwer && \
+     pip install pandas numpy nano gradio
+
+ # Create a HOME dedicated to the OVHcloud user (42420:42420).
+ RUN chown -R 42420:42420 /workspace
+ ENV HOME /workspace
+ WORKDIR /workspace
+
+ # Copy a folder of example notebooks into another folder in the remote workspace.
+ # COPY notebooks /workspace/
README.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ language:
+ - de
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - mozilla-foundation/common_voice_7_0
+ - de
+ datasets:
+ - mozilla-foundation/common_voice_7_0
+ model-index:
+ - name: Wav2Vec2-Large-XLSR-53-German-GPT2
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 7
+       type: mozilla-foundation/common_voice_7_0
+       args: de
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 11.49
+     - name: Test CER
+       type: cer
+       value: 5.6
+ ---
+
+ # Wav2Vec2-Large-XLSR-53-German-GPT2
+
+ This is an encoder-decoder model for automatic speech recognition trained on the
+ mozilla-foundation/common_voice_7_0 (de) dataset. The encoder was initialized from
+ [jonatasgrosman/wav2vec2-large-xlsr-53-german](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german) and
+ the decoder from [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2).
+
+ It was trained using a two-step process:
+ * fine-tuning only the cross-attention weights and the decoder, using the pre-computed outputs of the Wav2Vec2 model
+ * fine-tuning the model end-to-end
+
+ There is also one trick that seemed to improve performance significantly: adding position embeddings to the
+ encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (see `eval.py`).
added_tokens.json ADDED
@@ -0,0 +1 @@
+ {"<|endoftext|>": 50265}
config.json ADDED
@@ -0,0 +1,266 @@
+ {
+   "architectures": [
+     "SpeechEncoderDecoderModel"
+   ],
+   "decoder": {
+     "_name_or_path": "dbmdz/german-gpt2",
+     "activation_function": "gelu_new",
+     "add_cross_attention": true,
+     "architectures": [
+       "GPT2LMHeadModel"
+     ],
+     "attn_pdrop": 0.0,
+     "bad_words_ids": null,
+     "bos_token_id": 50256,
+     "chunk_size_feed_forward": 0,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "early_stopping": false,
+     "embd_pdrop": 0.0,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": 50256,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "gradient_checkpointing": false,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "initializer_range": 0.02,
+     "is_decoder": true,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_epsilon": 1e-05,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "min_length": 0,
+     "model_type": "gpt2",
+     "n_ctx": 1024,
+     "n_embd": 768,
+     "n_head": 12,
+     "n_inner": null,
+     "n_layer": 12,
+     "n_positions": 1024,
+     "no_repeat_ngram_size": 0,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": null,
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "reorder_and_upcast_attn": false,
+     "repetition_penalty": 1.0,
+     "resid_pdrop": 0.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "scale_attn_by_inverse_layer_idx": false,
+     "scale_attn_weights": true,
+     "sep_token_id": null,
+     "summary_activation": null,
+     "summary_first_dropout": 0.1,
+     "summary_proj_to_labels": true,
+     "summary_type": "cls_index",
+     "summary_use_proj": true,
+     "task_specific_params": {
+       "text-generation": {
+         "do_sample": true,
+         "max_length": 50
+       }
+     },
+     "temperature": 1.0,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": "float32",
+     "torchscript": false,
+     "transformers_version": "4.17.0.dev0",
+     "use_bfloat16": false,
+     "use_cache": true,
+     "vocab_size": 50265
+   },
+   "decoder_start_token_id": 98,
+   "encoder": {
+     "_name_or_path": "jonatasgrosman/wav2vec2-large-xlsr-53-german",
+     "activation_dropout": 0.05,
+     "adapter_kernel_size": 3,
+     "adapter_stride": 2,
+     "add_adapter": false,
+     "add_cross_attention": false,
+     "apply_spec_augment": true,
+     "architectures": [
+       "Wav2Vec2ForCTC"
+     ],
+     "attention_dropout": 0.1,
+     "bad_words_ids": null,
+     "bos_token_id": 1,
+     "chunk_size_feed_forward": 0,
+     "classifier_proj_size": 256,
+     "codevector_dim": 768,
+     "contrastive_logits_temperature": 0.1,
+     "conv_bias": true,
+     "conv_dim": [
+       512,
+       512,
+       512,
+       512,
+       512,
+       512,
+       512
+     ],
+     "conv_kernel": [
+       10,
+       3,
+       3,
+       3,
+       3,
+       2,
+       2
+     ],
+     "conv_stride": [
+       5,
+       2,
+       2,
+       2,
+       2,
+       2,
+       2
+     ],
+     "cross_attention_hidden_size": null,
+     "ctc_loss_reduction": "mean",
+     "ctc_zero_infinity": true,
+     "decoder_start_token_id": null,
+     "diversity_loss_weight": 0.1,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "do_stable_layer_norm": true,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": 2,
+     "feat_extract_activation": "gelu",
+     "feat_extract_dropout": 0.0,
+     "feat_extract_norm": "layer",
+     "feat_proj_dropout": 0.05,
+     "feat_quantizer_dropout": 0.0,
+     "final_dropout": 0.0,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "hidden_act": "gelu",
+     "hidden_dropout": 0.05,
+     "hidden_size": 1024,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 4096,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_eps": 1e-05,
+     "layerdrop": 0.05,
+     "length_penalty": 1.0,
+     "mask_channel_length": 10,
+     "mask_channel_min_space": 1,
+     "mask_channel_other": 0.0,
+     "mask_channel_prob": 0.0,
+     "mask_channel_selection": "static",
+     "mask_feature_length": 10,
+     "mask_feature_min_masks": 0,
+     "mask_feature_prob": 0.0,
+     "mask_time_length": 10,
+     "mask_time_min_masks": 2,
+     "mask_time_min_space": 1,
+     "mask_time_other": 0.0,
+     "mask_time_prob": 0.05,
+     "mask_time_selection": "static",
+     "max_length": 20,
+     "min_length": 0,
+     "model_type": "wav2vec2",
+     "no_repeat_ngram_size": 0,
+     "num_adapter_layers": 3,
+     "num_attention_heads": 16,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_codevector_groups": 2,
+     "num_codevectors_per_group": 320,
+     "num_conv_pos_embedding_groups": 16,
+     "num_conv_pos_embeddings": 128,
+     "num_feat_extract_layers": 7,
+     "num_hidden_layers": 24,
+     "num_negatives": 100,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_size": 1024,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 0,
+     "prefix": null,
+     "problem_type": null,
+     "proj_codevector_dim": 768,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "task_specific_params": null,
+     "tdnn_dilation": [
+       1,
+       2,
+       3,
+       1,
+       1
+     ],
+     "tdnn_dim": [
+       512,
+       512,
+       512,
+       512,
+       1500
+     ],
+     "tdnn_kernel": [
+       5,
+       3,
+       3,
+       1,
+       1
+     ],
+     "temperature": 1.0,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": null,
+     "torchscript": false,
+     "transformers_version": "4.17.0.dev0",
+     "use_bfloat16": false,
+     "use_weighted_layer_sum": false,
+     "vocab_size": 38,
+     "xvector_output_dim": 512
+   },
+   "is_encoder_decoder": true,
+   "max_length": 35,
+   "model_type": "speech-encoder-decoder",
+   "pad_token_id": 67,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": null
+ }
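A short sketch (assuming a local checkout of this repository) of how the generation-related fields above can be inspected; the token-id annotations are presumptions based on `special_tokens_map.json`:

```python
# Sketch: load the config above and print the fields that drive generation.
from transformers import SpeechEncoderDecoderConfig

config = SpeechEncoderDecoderConfig.from_pretrained(".")  # local repo dir (or the Hub id)

print(config.decoder_start_token_id)  # 98, presumably the GPT2 id of the "~" bos token
print(config.pad_token_id)            # 67, presumably the GPT2 id of the "_" pad token
print(config.max_length)              # 35, default number of generated tokens per utterance
print(config.encoder.hidden_size, config.decoder.n_embd)  # 1024 vs. 768, hence enc_to_dec_proj
```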
eval.py ADDED
@@ -0,0 +1,277 @@
+ import argparse
+ import re
+ from typing import Dict
+
+ import torch
+ from datasets import Audio, Dataset, load_dataset, load_metric
+
+ from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel, pipeline
+
+ from torch import nn
+ from torch.nn import CrossEntropyLoss
+ from transformers.models.encoder_decoder.modeling_encoder_decoder import shift_tokens_right
+ from transformers.modeling_outputs import Seq2SeqLMOutput
+
+
+ def log_results(result: Dataset, args: Dict[str, str]):
+     """DO NOT CHANGE. This function computes and logs the result metrics."""
+
+     log_outputs = args.log_outputs
+     dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+
+     # load metric
+     wer = load_metric("wer")
+     cer = load_metric("cer")
+
+     # compute metrics
+     wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
+     cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
+
+     # print & log results
+     result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
+     print(result_str)
+
+     with open(f"{dataset_id}_eval_results.txt", "w") as f:
+         f.write(result_str)
+
+     # log all results in text file. Possibly interesting for analysis
+     if log_outputs is not None:
+         pred_file = f"log_{dataset_id}_predictions.txt"
+         target_file = f"log_{dataset_id}_targets.txt"
+
+         with open(pred_file, "w") as p, open(target_file, "w") as t:
+
+             # mapping function to write output
+             def write_to_file(batch, i):
+                 p.write(f"{i}" + "\n")
+                 p.write(batch["prediction"] + "\n")
+                 t.write(f"{i}" + "\n")
+                 t.write(batch["target"] + "\n")
+
+             result.map(write_to_file, with_indices=True)
+
+
+ def normalize_text(text: str) -> str:
+     """DO ADAPT FOR YOUR USE CASE. This function normalizes the target text."""
+
+     # From https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german.
+     CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
+                        "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
+                        "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
+                        "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
+                        "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
+     chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
+     text = re.sub(chars_to_ignore_regex, "", text.lower())
+
+     return text
+
+
+ def main(args):
+     # load dataset
+     dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
+
+     # # for testing: only process the first few examples
+     # dataset = dataset.select(range(10))
+
+     # load processor
+     feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
+     sampling_rate = feature_extractor.sampling_rate
+
+     # resample audio
+     dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+
+     # load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+
+     # load model
+     model = Wav2VecGPT2Model.from_pretrained(args.model_id)
+
+     # load eval pipeline
+     if args.device is None:
+         args.device = 0 if torch.cuda.is_available() else -1
+     asr = pipeline("automatic-speech-recognition", model=model, device=args.device,
+                    feature_extractor=feature_extractor, tokenizer=tokenizer)
+
+     # map function to decode audio
+     def map_to_pred(batch):
+         prediction = asr(
+             batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s
+         )
+
+         batch["prediction"] = normalize_text(prediction["text"])
+         batch["target"] = normalize_text(batch["sentence"])
+         return batch
+
+     # run inference on all examples
+     result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+
+     # compute and log_results
+     # do not change function below
+     log_results(result, args)
+
+
+ class Wav2VecGPT2Model(SpeechEncoderDecoderModel):
+     """
+     Basically the same as `SpeechEncoderDecoderModel` but position embeddings (initialized with GPT2's position
+     embeddings) are added to the encoder output.
+     """
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self.encoder_outputs_pos_emb = nn.Embedding(1024, self.decoder.config.hidden_size)
+         with torch.no_grad():
+             self.encoder_outputs_pos_emb.weight.copy_(self.decoder.transformer.wpe.weight)
+         self.enc_to_dec_proj_ln = nn.LayerNorm(self.decoder.config.hidden_size,
+                                                eps=self.decoder.config.layer_norm_epsilon)
+
+     def __getattribute__(self, name):
+         # Fake class so it is recognized as seq2seq model.
+         if name == '__class__':
+             return SpeechEncoderDecoderModel
+         return SpeechEncoderDecoderModel.__getattribute__(self, name)
+
+     def forward(
+         self,
+         inputs=None,
+         attention_mask=None,
+         decoder_input_ids=None,
+         decoder_attention_mask=None,
+         encoder_outputs=None,
+         past_key_values=None,
+         decoder_inputs_embeds=None,
+         labels=None,
+         use_cache=None,
+         output_attentions=None,
+         output_hidden_states=None,
+         input_values=None,
+         input_features=None,
+         return_dict=None,
+         **kwargs,
+     ):
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith("decoder_")}
+
+         kwargs_decoder = {
+             argument[len("decoder_") :]: value for argument, value in kwargs.items() if argument.startswith("decoder_")
+         }
+
+         if encoder_outputs is None and inputs is None:
+             if input_values is not None and input_features is not None:
+                 raise ValueError("You cannot specify both input_values and input_features at the same time")
+             elif input_values is not None:
+                 inputs = input_values
+             elif input_features is not None:
+                 inputs = input_features
+             else:
+                 raise ValueError("You have to specify either input_values or input_features")
+
+             encoder_outputs = self.encoder(
+                 inputs,
+                 attention_mask=attention_mask,
+                 output_attentions=output_attentions,
+                 output_hidden_states=output_hidden_states,
+                 return_dict=return_dict,
+                 **kwargs_encoder,
+             )
+
+         encoder_hidden_states = encoder_outputs[0]
+
+         # optionally project encoder_hidden_states
+         if (
+             self.encoder_output_dim != self.decoder.config.hidden_size
+             and self.decoder.config.cross_attention_hidden_size is None
+         ):
+             encoder_hidden_states = self.enc_to_dec_proj(encoder_hidden_states)
+             encoder_hidden_states += self.encoder_outputs_pos_emb(
+                 torch.arange(0, encoder_hidden_states.shape[1], device=encoder_hidden_states.device)
+             )
+             encoder_hidden_states = self.enc_to_dec_proj_ln(encoder_hidden_states)
+
+         # compute correct encoder attention mask
+         if attention_mask is not None:
+             encoder_attention_mask = self.encoder._get_feature_vector_attention_mask(
+                 encoder_hidden_states.shape[1], attention_mask
+             )
+         else:
+             encoder_attention_mask = None
+
+         if (labels is not None) and (decoder_input_ids is None and decoder_inputs_embeds is None):
+             decoder_input_ids = shift_tokens_right(
+                 labels, self.config.pad_token_id, self.config.decoder_start_token_id
+             )
+
+         # Decode
+         decoder_outputs = self.decoder(
+             input_ids=decoder_input_ids,
+             attention_mask=decoder_attention_mask,
+             encoder_hidden_states=encoder_hidden_states,
+             encoder_attention_mask=encoder_attention_mask,
+             inputs_embeds=decoder_inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             use_cache=use_cache,
+             past_key_values=past_key_values,
+             return_dict=return_dict,
+             **kwargs_decoder,
+         )
+
+         # Compute loss independent from decoder (as some shift the logits inside them)
+         loss = None
+         if labels is not None:
+             logits = decoder_outputs.logits if return_dict else decoder_outputs[0]
+             loss_fct = CrossEntropyLoss()
+             loss = loss_fct(logits.reshape(-1, self.decoder.config.vocab_size), labels.view(-1))
+
+         if not return_dict:
+             if loss is not None:
+                 return (loss,) + decoder_outputs + encoder_outputs
+             else:
+                 return decoder_outputs + encoder_outputs
+
+         return Seq2SeqLMOutput(
+             loss=loss,
+             logits=decoder_outputs.logits,
+             past_key_values=decoder_outputs.past_key_values,
+             decoder_hidden_states=decoder_outputs.hidden_states,
+             decoder_attentions=decoder_outputs.attentions,
+             cross_attentions=decoder_outputs.cross_attentions,
+             encoder_last_hidden_state=encoder_outputs.last_hidden_state,
+             encoder_hidden_states=encoder_outputs.hidden_states,
+             encoder_attentions=encoder_outputs.attentions,
+         )
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument(
+         "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
+     )
+     parser.add_argument(
+         "--dataset",
+         type=str,
+         required=True,
+         help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
+     )
+     parser.add_argument(
+         "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
+     )
+     parser.add_argument("--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`")
+     parser.add_argument(
+         "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to 5 seconds."
+     )
+     parser.add_argument(
+         "--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to 1 second."
+     )
+     parser.add_argument(
+         "--log_outputs", action="store_true", help="If defined, write outputs to log file for analysis."
+     )
+     parser.add_argument(
+         "--device",
+         type=int,
+         default=None,
+         help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
+     )
+     args = parser.parse_args()
+
+     main(args)
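The log files below were presumably produced by this script on the Common Voice 7 German test split. A programmatic sketch of an equivalent run (the repository id is an assumption, and a logged-in Hugging Face environment is assumed because the script passes `use_auth_token=True` for the gated dataset):

```python
# Hypothetical driver for eval.py, equivalent to:
#   python eval.py --model_id <repo-id> --dataset mozilla-foundation/common_voice_7_0 \
#       --config de --split test --log_outputs
from argparse import Namespace

import eval  # this repository's eval.py

args = Namespace(
    model_id="jsnfly/wav2vec2-large-xlsr-53-german-gpt2",  # assumed Hub id of this repo
    dataset="mozilla-foundation/common_voice_7_0",
    config="de",
    split="test",
    chunk_length_s=None,   # decode whole utterances, no chunking
    stride_length_s=None,
    log_outputs=True,      # also write per-example prediction/target log files
    device=None,           # eval.main picks GPU 0 if available, else CPU
)
eval.main(args)
```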
log_mozilla-foundation_common_voice_7_0_de_test_predictions.txt ADDED
The diff for this file is too large to render. See raw diff
 
log_mozilla-foundation_common_voice_7_0_de_test_targets.txt ADDED
The diff for this file is too large to render. See raw diff
 
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
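A small sketch of how this preprocessor config is consumed: it is materialized as a `Wav2Vec2FeatureExtractor` that normalizes 16 kHz mono audio and returns an attention mask (the local path and dummy audio below are placeholders):

```python
# Sketch: build the feature extractor from preprocessor_config.json and run it on dummy audio.
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

fe = Wav2Vec2FeatureExtractor.from_pretrained(".")  # local repo dir (or the Hub id)
audio = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz
features = fe(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
print(features.input_values.shape, features.attention_mask.shape)  # torch.Size([1, 16000]) each
```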
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "~", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>", "pad_token": "_"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "<|endoftext|>", "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "add_prefix_space": false, "special_tokens_map_file": null, "name_or_path": "dbmdz/german-gpt2", "tokenizer_class": "GPT2Tokenizer"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff