
Whisper Medium Amharic FLEURS

This model is a fine-tuned version of openai/whisper-medium on the google/fleurs am_et dataset. It achieves the following results on the evaluation set:

  • Loss: 7.8670
  • Wer: 154.4118
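
A WER above 100 can look like a reporting error, but it is possible because inserted words count as errors while the denominator is only the number of reference words. The sketch below is a minimal, self-contained word error rate computation to illustrate this. The function name `word_error_rate` is illustrative, and unlike a real evaluation script it applies no text normalization, so it is not expected to reproduce the numbers above exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Two reference words, four unrelated hypothesis words:
# 2 substitutions + 2 insertions = 4 errors over 2 reference words.
print(word_error_rate("a b", "x y z w"))  # 200.0
```

A hypothesis that is longer than the reference and shares few words with it therefore scores above 100, which is the regime this model is in.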

Model description

Intended uses & limitations

  • For experimentation and curiosity.
  • Based on the Whisper paper (arXiv:2212.04356) and Benchmarking OpenAI Whisper for non-English ASR (Dan Shafer), Whisper shows a performance bias towards certain languages and curated datasets.
  • In the Whisper paper, am_et is a low-resource language (Table E), with WER ranging from 120 to 229 depending on model size. Whisper small reaches WER = 120.2, which suggests that more training time may improve the fine-tuning.

Training and evaluation data

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 3000
  • mixed_precision_training: Native AMP
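
To make the schedule concrete, the sketch below computes the learning rate implied by the values listed above (peak 1e-05, 500 warmup steps, 3000 training steps) under a linear schedule with warmup. It assumes the usual shape of such a schedule (linear ramp from zero to the peak, then linear decay to zero); the function name `linear_lr` is illustrative and not taken from the training script.

```python
def linear_lr(step: int,
              peak_lr: float = 1e-5,
              warmup_steps: int = 500,
              total_steps: int = 3000) -> float:
    """Learning rate at a given step for linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 1750, 3000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at step 500 and has fallen to half the peak by step 1750.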

Training results

Training Loss   Epoch    Step   Validation Loss   Wer
0.0194          100.0    100    3.8540            147.9947
0.0001          200.0    200    4.1479            148.1283
0.0001          300.0    300    4.1840            150.5348
0.0001          400.0    400    4.3339            177.9412
0.0             500.0    500    4.5831            151.0695
0.0             600.0    600    4.9317            164.0374
0.0             700.0    700    5.3031            141.0428
0.0             800.0    800    5.6584            122.3262
0.0             900.0    900    5.9711            157.4866
0.0             1000.0   1000   6.2465            141.1765
0.0             1100.0   1100   6.4832            169.6524
0.0             1200.0   1200   6.6890            155.0802
0.0             1300.0   1300   6.8679            159.7594
0.0             1400.0   1400   7.0250            155.0802
0.0             1500.0   1500   7.1615            146.2567
0.0             1600.0   1600   7.2877            143.0481
0.0             1700.0   1700   7.3987            148.5294
0.0             1800.0   1800   7.5010            142.5134
0.0             1900.0   1900   7.5849            136.7647
0.0             2000.0   2000   7.6689            148.2620
0.0             2100.0   2100   7.6955            165.3743
0.0             2200.0   2200   7.7247            162.9679
0.0             2300.0   2300   7.7557            161.6310
0.0             2400.0   2400   7.7842            162.2995
0.0             2500.0   2500   7.8074            150.9358
0.0             2600.0   2600   7.8287            154.8128
0.0             2700.0   2700   7.8434            155.4813
0.0             2800.0   2800   7.8567            154.4118
0.0             2900.0   2900   7.8635            154.4118
0.0             3000.0   3000   7.8670            154.4118
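
In the table, validation loss rises at every evaluation while training loss is effectively zero, a pattern consistent with overfitting, and the final checkpoint is not the one with the lowest WER. The sketch below copies the (step, validation loss, WER) values from the table and picks the best row under each metric. The names `RESULTS`, `best_by_wer`, and `best_by_loss` are illustrative.

```python
# (step, validation_loss, wer) copied from the Training results table.
RESULTS = [
    (100, 3.8540, 147.9947), (200, 4.1479, 148.1283), (300, 4.1840, 150.5348),
    (400, 4.3339, 177.9412), (500, 4.5831, 151.0695), (600, 4.9317, 164.0374),
    (700, 5.3031, 141.0428), (800, 5.6584, 122.3262), (900, 5.9711, 157.4866),
    (1000, 6.2465, 141.1765), (1100, 6.4832, 169.6524), (1200, 6.6890, 155.0802),
    (1300, 6.8679, 159.7594), (1400, 7.0250, 155.0802), (1500, 7.1615, 146.2567),
    (1600, 7.2877, 143.0481), (1700, 7.3987, 148.5294), (1800, 7.5010, 142.5134),
    (1900, 7.5849, 136.7647), (2000, 7.6689, 148.2620), (2100, 7.6955, 165.3743),
    (2200, 7.7247, 162.9679), (2300, 7.7557, 161.6310), (2400, 7.7842, 162.2995),
    (2500, 7.8074, 150.9358), (2600, 7.8287, 154.8128), (2700, 7.8434, 155.4813),
    (2800, 7.8567, 154.4118), (2900, 7.8635, 154.4118), (3000, 7.8670, 154.4118),
]

# Lower is better for both metrics.
best_by_wer = min(RESULTS, key=lambda row: row[2])
best_by_loss = min(RESULTS, key=lambda row: row[1])

print("lowest WER:", best_by_wer)               # (800, 5.6584, 122.3262)
print("lowest validation loss:", best_by_loss)  # (100, 3.854, 147.9947)
```

When training with the Hugging Face Trainer, setting `load_best_model_at_end=True` together with `metric_for_best_model="wer"` and `greater_is_better=False` is one way to keep the lowest-WER checkpoint rather than the last one; whether that was done for this run is not stated here.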

Recommendations

  • Limit training to roughly 2000 to 3000 steps on smaller datasets to avoid overfitting. For reference, 5000 steps using the HuggingFace - Whisper Small take about 5 hours on A100 GPUs (about 1 hour per 1000 steps).
  • Encountered RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1. This is related to Trainer RuntimeError: some language datasets contain inputs with non-standard lengths. The fix suggested in that link did not resolve the issue, and the same kind of error is reported elsewhere too, in Training languagemodel – RuntimeError the expanded size of the tensor (100) must match the existing size (64) at non singleton dimension 1.
  • To circumvent this issue, adjust the run.sh parameters, then run python run_eval_whisper_streaming.py --model_id="openai/whisper-small" --dataset="google/fleurs" --config="am_et" --batch_size=32 --max_eval_samples=64 --device=0 --language="am" to compute the WER score manually. Otherwise, the error raised during evaluation prevents the trained model from being uploaded to Hugging Face.
  • Based on the Whisper paper (arXiv:2212.04356) and Benchmarking OpenAI Whisper for non-English ASR (Dan Shafer), there is a performance bias towards certain languages and curated datasets.
  • The OpenAI fine-tuning community event provided ample free GPU time to help develop the model further and improve WER scores.
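
One possible way to avoid the 504 vs. 448 size mismatch is to drop examples whose tokenized transcript is longer than the decoder can accept. The sketch below assumes that 448 is the maximum label length of the Whisper decoder and that each preprocessed example stores its token ids under a `labels` key; both are assumptions about the training setup, not something confirmed above, and the helper names are hypothetical.

```python
MAX_LABEL_LENGTH = 448  # assumption: maximum decoder sequence length


def within_label_limit(example: dict, max_len: int = MAX_LABEL_LENGTH) -> bool:
    """True if the tokenized transcript fits within the decoder limit."""
    return len(example["labels"]) <= max_len


def drop_long_examples(examples: list, max_len: int = MAX_LABEL_LENGTH) -> list:
    """Keep only the examples whose labels fit within max_len tokens."""
    return [ex for ex in examples if within_label_limit(ex, max_len)]


# Toy examples standing in for preprocessed dataset rows.
toy = [
    {"labels": [0] * 10},
    {"labels": [0] * 448},
    {"labels": [0] * 504},  # would trigger the size mismatch
]
kept = drop_long_examples(toy)
print(len(kept))  # 2
```

With a `datasets` dataset, the same predicate could be passed to its `filter` method instead of using a list comprehension. This removes the offending samples rather than truncating them, so it slightly shrinks the training and evaluation sets.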

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). In total, roughly 100 hours of compute were used, primarily in US East and Asia Pacific (80%/20%), with AWS as the reference provider. Additional resources are available at Our World in Data - CO2 Emissions.

  • Hardware Type: AMD EPYC 7J13 64-Core Processor (30 core VM) 197GB RAM, with NVIDIA A100-SXM 40GB
  • Hours Used: 100 hrs
  • Cloud Provider: Lambda Cloud GPU
  • Compute Region: US East/Asia Pacific
  • Carbon Emitted: 12 kg (GPU) + 13 kg (CPU) = 25 kg (roughly the weight of 6.6 US gallons of water)
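
As a rough cross-check of the GPU figure, emissions can be approximated as power draw × hours × grid carbon intensity. The sketch below works backwards from the 12 kg and 100 hours reported above to the carbon intensity they would imply. The 0.4 kW GPU power draw is an assumption (the nominal board power of one A100-SXM), not a value from the calculator, and the function names are illustrative.

```python
GPU_POWER_KW = 0.4     # assumption: 400 W draw for a single A100-SXM
HOURS_USED = 100       # from the list above
GPU_EMISSIONS_KG = 12  # from the list above
CPU_EMISSIONS_KG = 13  # from the list above


def energy_kwh(power_kw: float, hours: float) -> float:
    """Energy consumed at constant power over the given number of hours."""
    return power_kw * hours


def implied_carbon_intensity(emissions_kg: float, power_kw: float,
                             hours: float) -> float:
    """Carbon intensity (kg CO2eq per kWh) that would yield emissions_kg."""
    return emissions_kg / energy_kwh(power_kw, hours)


gpu_energy = energy_kwh(GPU_POWER_KW, HOURS_USED)
gpu_intensity = implied_carbon_intensity(GPU_EMISSIONS_KG, GPU_POWER_KW,
                                         HOURS_USED)
total_emissions = GPU_EMISSIONS_KG + CPU_EMISSIONS_KG

print(f"GPU energy: {gpu_energy:.0f} kWh")
print(f"Implied intensity: {gpu_intensity:.2f} kg CO2eq/kWh")
print(f"Total: {total_emissions} kg CO2eq")
```

Under the 0.4 kW assumption the GPU share corresponds to about 40 kWh, so the reported 12 kg implies an intensity of about 0.3 kg CO2eq per kWh. The actual calculator inputs may differ.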

Framework versions

  • Transformers 4.26.0.dev0
  • Pytorch 1.13.1+cu117
  • Datasets 2.8.1.dev0
  • Tokenizers 0.13.2

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{owidco2andothergreenhousegasemissions,
    author = {Hannah Ritchie and Max Roser and Pablo Rosado},
    title = {CO₂ and Greenhouse Gas Emissions},
    journal = {Our World in Data},
    year = {2020},
    note = {https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions}
}

Dataset used to train drmeeseeks/whisper-medium-v2-amet
