metadata

language:
  - el
license: apache-2.0
tags:
  - whisper-event
  - generated_from_trainer
datasets:
  - mozilla-foundation/common_voice_11_0
metrics:
  - wer
model-index:
  - name: Whisper Small - Greek (el)
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: mozilla-foundation/common_voice_11_0 el
          type: mozilla-foundation/common_voice_11_0
          config: el
          split: test
          args: el
        metrics:
          - name: Wer
            type: wer
            value: 25.696508172362552

Whisper Small - Greek (el)

This model is a fine-tuned version of openai/whisper-small on the mozilla-foundation/common_voice_11_0 el dataset for translation from Greek to English. It achieves the following results on the evaluation set:

Loss: 0.4642
Wer: 25.6965
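For readers unfamiliar with the metric, the following is a minimal sketch of word error rate (WER) for a single reference/hypothesis pair, expressed in percent like the figure above. It assumes plain whitespace tokenization and no text normalization. The function name and the example sentences are made up for illustration, and the reported score was presumably aggregated over the whole test set rather than computed per sentence.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Read this way, a WER of 25.6965 means roughly one word-level error for every four reference words.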

Model description

This model was fine-tuned with the encoder frozen, so only the decoder weights were changed by this training run.
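The effect of freezing can be illustrated with a toy PyTorch sketch. This is not the Whisper model or the training script: `ToySeq2Seq`, its layer sizes, and the learning rate are hypothetical stand-ins chosen only to show that parameters with `requires_grad` disabled receive no gradient and are left untouched by the optimizer.

```python
import torch
from torch import nn


class ToySeq2Seq(nn.Module):
    """Hypothetical stand-in with an encoder and a decoder submodule."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def freeze_encoder(model: ToySeq2Seq) -> None:
    # Disable gradient computation for every encoder parameter.
    for param in model.encoder.parameters():
        param.requires_grad = False


torch.manual_seed(42)
model = ToySeq2Seq()
freeze_encoder(model)

# Hand only the still-trainable (decoder) parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

encoder_before = model.encoder.weight.detach().clone()
decoder_before = model.decoder.weight.detach().clone()

# One dummy training step on random data.
loss = model(torch.randn(2, 8)).pow(2).mean()
loss.backward()
optimizer.step()

encoder_unchanged = torch.equal(model.encoder.weight, encoder_before)
decoder_changed = not torch.equal(model.decoder.weight, decoder_before)
```

After the step, `encoder_unchanged` and `decoder_changed` should both be true, which mirrors the statement that only the decoder weights were updated.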

Intended uses & limitations

This model was trained to understand how freezing part of the model affects learning, as a step toward assessing the feasibility of enabling adapters.

Training and evaluation data

Training was performed by streaming the interleaved train and validation splits of the Greek (el) subset of mozilla-foundation/common_voice_11_0. The test split was streamed in the same way and used for validation.
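The idea of interleaving two streamed splits can be sketched with plain Python iterators. This is an illustration only, not the code path used by the training script. It assumes round-robin alternation that stops as soon as one source has nothing left, and both the helper name `interleave_round_robin` and the sample items are hypothetical.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def interleave_round_robin(*sources: Iterable[T]) -> Iterator[T]:
    """Alternate between sources, stopping when any one is exhausted."""
    iterators = [iter(source) for source in sources]
    if not iterators:
        return
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                # The first exhausted source ends the whole stream.
                return


# Hypothetical stand-ins for the streamed train and validation splits.
train_split = ["t0", "t1", "t2"]
validation_split = ["v0", "v1"]
mixed = list(interleave_round_robin(train_split, validation_split))
```

Because items are pulled lazily, neither split has to be loaded into memory before training starts, which is the point of streaming.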

Training procedure

Fine-tuning was performed on a Lambda Labs laptop equipped with an NVIDIA GeForce RTX 3080 Laptop GPU (16 GB).

The training script, run_speech_recognition_seq2seq_streaming.py, is included in the files of this space. It was run with the following arguments:

                --model_name_or_path           "openai/whisper-small"
                --model_revision               "main"
                --do_train                     True
                --do_eval                      True
                --use_auth_token               False
                --freeze_encoder               True
                --model_index_name             "Whisper Small - Greek (el)"
                --dataset_name                 "mozilla-foundation/common_voice_11_0"
                --dataset_config_name          "el"
                --audio_column_name            "audio"
                --text_column_name             "sentence"
                --max_duration_in_seconds      30
                --train_split_name             "train+validation"
                --eval_split_name              "test"
                --do_lower_case                False
                --do_remove_punctuation        False
                --do_normalize_eval            True
                --language                     "greek"
                --task                         "translate"
                --shuffle_buffer_size          500
                --output_dir                   "./data/finetuningRuns/whisper-sm-el-frzEnc-xlate"
                --per_device_train_batch_size  16
                --gradient_accumulation_steps  4
                --learning_rate                1e-5
                --warmup_steps                 500
                --max_steps                    5000
                --gradient_checkpointing       True
                --fp16                         True
                --evaluation_strategy          "steps"
                --per_device_eval_batch_size   8
                --predict_with_generate        True
                --generation_max_length        225
                --save_steps                   1000
                --eval_steps                   1000
                --logging_steps                25
                --report_to                    "tensorboard"
                --load_best_model_at_end       True
                --metric_for_best_model        "wer"
                --greater_is_better            False
                --push_to_hub                  False
                --overwrite_output_dir         True

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 5000
mixed_precision_training: Native AMP
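Two of the values above follow from the others, and a short sketch makes the arithmetic explicit. It assumes a single GPU and a linear schedule that warms up from zero to the peak learning rate and then decays linearly to zero at the final step. The function name `linear_schedule_lr` is hypothetical and is not taken from the training script.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Effective batch size: per-device batch size times accumulation steps
# (times one device, which is an assumption based on the laptop setup).
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
total_train_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Learning rate at a few points of the 5000-step run.
schedule_samples = {step: linear_schedule_lr(step) for step in (0, 250, 500, 2750, 5000)}
```

Under these assumptions the learning rate reaches its peak of 1e-05 at step 500, is back at half the peak by step 2750, and each optimizer step consumes 64 examples.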

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
0.0032	18.01	1000	0.4642	25.6965
0.0006	37.01	2000	0.5369	26.4395
0.0003	56.01	3000	0.5703	26.3187
0.0002	75.0	4000	0.5913	26.4302
0.0001	94.0	5000	0.5996	26.4952

Upon completion of training, the best model was reloaded and evaluated. The following results were extracted from the stdout log:

***** eval metrics *****
  epoch                   =       94.0
  eval_loss               =     0.4642
  eval_runtime            = 0:19:54.59
  eval_samples_per_second =       1.42
  eval_steps_per_second   =      0.177
  eval_wer                =    25.6965
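The reloaded metrics match the step-1000 row of the results table, which is consistent with selecting the checkpoint by lowest WER (the arguments set the best-model metric to "wer" with greater_is_better False). A minimal sketch of that selection, using only the numbers from the table, with a hypothetical variable layout:

```python
# (step, validation_loss, wer) rows copied from the training results table.
checkpoints = [
    (1000, 0.4642, 25.6965),
    (2000, 0.5369, 26.4395),
    (3000, 0.5703, 26.3187),
    (4000, 0.5913, 26.4302),
    (5000, 0.5996, 26.4952),
]

# Lower WER is better, so the best checkpoint is the row with the minimum WER.
best_step, best_loss, best_wer = min(checkpoints, key=lambda row: row[2])
```

The later checkpoints have a lower training loss but a higher validation loss and WER, so the earliest evaluated checkpoint is the one that gets kept.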

Framework versions

Transformers 4.26.0.dev0
PyTorch 1.13.0
Datasets 2.7.1.dev0
Tokenizers 0.12.1