comodoro
/

wav2vec2-xls-r-300m-cs-250

+language: cs
+datasets:
+- mozilla-foundation/common_voice_8_0
+metrics:
+- wer
+tags:
+- generated_from_trainer
+- audio
+- automatic-speech-recognition
+- speech
+- xlsr-fine-tuning-week
+license: apache-2.0
+model-index:
+- name: Czech comodoro Wav2Vec2 XLSR 300M CV8
+  results:
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 8.0 cs
+      type: mozilla-foundation/common_voice_8_0
+      args: cs
+    metrics:
+       - name: Test WER
+         type: wer
+         value: 0.47455377483706096
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# wav2vec2-xls-r-300m-cs-cv8
+This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Czech (`cs`) subset of the Common Voice 8.0 dataset.
+It achieves the following results on the evaluation set:
+- WER: 0.47455377483706096
+- CER: 0.10877155235645618
+## Model description
+Fine-tuned [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Czech using the [Common Voice 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) dataset.
+When using this model, make sure that your speech input is sampled at 16 kHz.
+The model can be used directly (without a language model) as follows:
+```python
+import torch
+import torchaudio
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "cs", split="test[:2%]")
+processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-cv8")
+model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-cv8")
+resampler = torchaudio.transforms.Resample(48_000, 16_000)
+# Preprocessing the datasets.
+# We need to read the audio files as arrays
+def speech_file_to_array_fn(batch):
+	speech_array, sampling_rate = torchaudio.load(batch["path"])
+	batch["speech"] = resampler(speech_array).squeeze().numpy()
+	return batch
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+with torch.no_grad():
+	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+predicted_ids = torch.argmax(logits, dim=-1)
+print("Prediction:", processor.batch_decode(predicted_ids))
+print("Reference:", test_dataset[:2]["sentence"])
+```
+## Evaluation
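+The model can be evaluated using the attached `eval.py` script, for example `python eval.py --model_id comodoro/wav2vec2-xls-r-300m-cs-cv8 --dataset mozilla-foundation/common_voice_8_0 --config cs --split test`.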
+## Training and evaluation data
+The Common Voice 8.0 `train` and `validation` splits were used for training.
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 7e-05
+- train_batch_size: 32
+- eval_batch_size: 8
+- seed: 42
+- gradient_accumulation_steps: 20
+- total_train_batch_size: 640
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 500
+- num_epochs: 150
+- mixed_precision_training: Native AMP
+### Training results
+| Training Loss | Epoch  | Step | Validation Loss | Wer    | Cer    |
+|:-------------:|:------:|:----:|:---------------:|:------:|:------:|
+| 7.2926        | 8.06   | 250  | 3.8497          | 1.0    | 1.0    |
+| 3.417         | 16.13  | 500  | 3.2852          | 1.0    | 0.9857 |
+| 2.0264        | 24.19  | 750  | 0.7099          | 0.7342 | 0.1768 |
+| 0.4018        | 32.25  | 1000 | 0.6188          | 0.6415 | 0.1551 |
+| 0.2444        | 40.32  | 1250 | 0.6632          | 0.6362 | 0.1600 |
+| 0.1882        | 48.38  | 1500 | 0.6070          | 0.5783 | 0.1388 |
+| 0.153         | 56.44  | 1750 | 0.6425          | 0.5720 | 0.1377 |
+| 0.1214        | 64.51  | 2000 | 0.6363          | 0.5546 | 0.1337 |
+| 0.1011        | 72.57  | 2250 | 0.6310          | 0.5222 | 0.1224 |
+| 0.0879        | 80.63  | 2500 | 0.6353          | 0.5258 | 0.1253 |
+| 0.0782        | 88.7   | 2750 | 0.6078          | 0.4904 | 0.1127 |
+| 0.0709        | 96.76  | 3000 | 0.6465          | 0.4960 | 0.1154 |
+| 0.0661        | 104.82 | 3250 | 0.6622          | 0.4945 | 0.1166 |
+| 0.0616        | 112.89 | 3500 | 0.6440          | 0.4786 | 0.1104 |
+| 0.0579        | 120.95 | 3750 | 0.6815          | 0.4887 | 0.1144 |
+| 0.0549        | 129.03 | 4000 | 0.6603          | 0.4780 | 0.1105 |
+| 0.0527        | 137.09 | 4250 | 0.6652          | 0.4749 | 0.1090 |
+| 0.0506        | 145.16 | 4500 | 0.6958          | 0.4846 | 0.1133 |
+### Framework versions
+- Transformers 4.16.0.dev0
+- Pytorch 1.10.1+cu102
+- Datasets 1.17.1.dev0
+- Tokenizers 0.11.0
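The WER and CER figures reported in the card are edit-distance ratios: the number of word-level (or character-level) substitutions, insertions, and deletions divided by the reference length. The following is a minimal sketch of that idea, not the `load_metric` implementation used by `eval.py`; the function names `edit_distance`, `wer`, and `cer` are hypothetical, references are assumed to be non-empty, and spaces are counted as characters in the CER, which may differ from the library's handling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, prediction):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def cer(reference, prediction):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(list(reference), list(prediction)) / len(reference)


reference = "dobrý den pane"
prediction = "dobry den pane"  # one missing diacritic
print(f"WER: {wer(reference, prediction):.3f}")  # WER: 0.333
print(f"CER: {cer(reference, prediction):.3f}")  # CER: 0.071
```

A single wrong diacritic counts as a whole wrong word, which is one reason the WER in the results table is several times higher than the CER.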

eval.py ADDED

	@@ -0,0 +1,163 @@

+#!/usr/bin/env python3
+from datasets import load_dataset, load_metric, Audio, Dataset
+from transformers import pipeline, AutoFeatureExtractor
+import re
+import argparse
+import unicodedata
+def log_results(result: Dataset, args: argparse.Namespace):
+    """ DO NOT CHANGE. This function computes and logs the result metrics. """
+    log_outputs = args.log_outputs
+    dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
+    # load metric
+    wer = load_metric("wer")
+    cer = load_metric("cer")
+    # compute metrics
+    wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
+    cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
+    # print & log results
+    result_str = (
+        f"WER: {wer_result}\n"
+        f"CER: {cer_result}"
+    )
+    print(result_str)
+    with open(f"{dataset_id}_eval_results.txt", "w") as f:
+        f.write(result_str)
+    # log all results in text file. Possibly interesting for analysis
+    if log_outputs is not None:
+        pred_file = f"log_{dataset_id}_predictions.txt"
+        target_file = f"log_{dataset_id}_targets.txt"
+        with open(pred_file, "w") as p, open(target_file, "w") as t:
+            # mapping function to write output
+            def write_to_file(batch, i):
+                p.write(f"{i}" + "\n")
+                p.write(batch["prediction"] + "\n")
+                t.write(f"{i}" + "\n")
+                t.write(batch["target"] + "\n")
+            result.map(write_to_file, with_indices=True)
+def normalize_text(text: str) -> str:
+    """ DO ADAPT FOR YOUR USE CASE. this function normalizes the target text. """
+    # map characters that do not occur in Czech orthography to close equivalents
+    CHARS = {
+        'ü': 'ue',
+        'ö': 'oe',
+        'ï': 'i',
+        'ë': 'e',
+        'ä': 'ae',
+        'ã': 'a',
+        'à': 'á',
+        'ø': 'o',
+        'è': 'é',
+        'ê': 'é',
+        'å': 'ó',
+        'î': 'i',
+        'ñ': 'ň',
+        'ç': 's',
+        'ľ': 'l',
+        'ż': 'ž',
+        'ł': 'w',
+        'ć': 'č',
+        'þ': 't',
+        'ß': 'ss',
+        'ę': 'en',
+        'ą': 'an',
+        'æ': 'ae',
+    }
+    def replace_chars(sentence):
+        # substitute each mapped character and leave all others unchanged
+        return ''.join(CHARS.get(ch, ch) for ch in sentence)
+    chars_to_remove_regex = '[\,\?\.\!\-\;\:\/\"\“\„\%\”\�\–\'\`\«\»\—\’\…]'
+    text = text.lower()
+    # normalize non-standard (stylized) unicode characters
+    text = unicodedata.normalize('NFKC', text)
+    # remove punctuation
+    text = re.sub(chars_to_remove_regex, "", text)
+    # replace characters foreign to Czech with close equivalents
+    text = replace_chars(text)
+    # Let's also make sure we split on all kinds of newlines, spaces, etc...
+    text = " ".join(text.split())
+    return text
+def main(args):
+    # load dataset
+    dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
+    # for testing: only process the first ten examples
+    # dataset = dataset.select(range(10))
+    # load processor
+    feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
+    sampling_rate = feature_extractor.sampling_rate
+    # resample audio
+    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+    # load eval pipeline
+    asr = pipeline("automatic-speech-recognition", model=args.model_id)
+    # map function to decode audio
+    def map_to_pred(batch):
+        prediction = asr(batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s)
+        batch["prediction"] = prediction["text"]
+        batch["target"] = normalize_text(batch["sentence"])
+        return batch
+    # run inference on all examples
+    result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
+    # compute and log_results
+    # do not change function below
+    log_results(result, args)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
+    )
+    parser.add_argument(
+        "--dataset", type=str, required=True, help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets"
+    )
+    parser.add_argument(
+        "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'`  for Common Voice"
+    )
+    parser.add_argument(
+        "--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`"
+    )
+    parser.add_argument(
+        "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to None. For long audio files a good value would be 5.0 seconds."
+    )
+    parser.add_argument(
+        "--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to None. For long audio files a good value would be 1.0 seconds."
+    )
+    parser.add_argument(
+        "--log_outputs", action='store_true', help="If defined, write outputs to log file for analysis."
+    )
+    args = parser.parse_args()
+    main(args)
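
The target normalization in `normalize_text` can be tried in isolation, without loading a dataset or a model. The sketch below is a reduced stand-in, not the function from `eval.py`: the name `normalize_text_sketch`, the three-entry `CHARS_SUBSET` mapping, and the shortened punctuation pattern are assumptions made for illustration, and the steps are applied in the same order as in the script (lowercase, NFKC, punctuation removal, character replacement, whitespace collapsing).

```python
import re
import unicodedata

# Hypothetical subset of the CHARS mapping used in eval.py
CHARS_SUBSET = {'ü': 'ue', 'ö': 'oe', 'ß': 'ss'}

# Shortened punctuation class; eval.py removes a longer list of symbols
PUNCTUATION_REGEX = r'[,?.!\-;:"]'


def normalize_text_sketch(text: str) -> str:
    text = text.lower()
    # fold stylized Unicode variants into their canonical composed form
    text = unicodedata.normalize('NFKC', text)
    # remove punctuation
    text = re.sub(PUNCTUATION_REGEX, "", text)
    # replace characters that do not occur in Czech orthography
    text = ''.join(CHARS_SUBSET.get(ch, ch) for ch in text)
    # collapse all runs of whitespace into single spaces
    return " ".join(text.split())


print(normalize_text_sketch("Dobrý den,  Müller!"))  # dobrý den mueller
```

Because only the reference text passes through this function, any character the model can emit but the normalization removes from the target would be counted as an error.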