# Background

The [original Whisper models](https://cdn.openai.com/papers/whisper.pdf) were trained on 680,000 hours of audio using large-scale weak supervision, on a dataset built from the Internet.
Human-curated datasets like Common Voice or FLEURS were not used during training. The hypothesis is that fine-tuning Whisper
on human-curated datasets can improve quality for a language or a particular domain.

These fine-tuned Whisper models for the Catalan language were created in early 2023 as part of the Hugging Face Whisper Sprint.

The models were fine-tuned using [these scripts](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event).

For fine-tuning we used version 11 of the Common Voice dataset.

# Learnings on fine-tuning Whisper models

The rest of this document shares what we learned, in the hope that it benefits others. The findings are shared as is.

**1. Model improves when benchmarked against Common Voice**

The model improves in the WER evaluation metric when it is evaluated against the Common Voice test dataset. Taking the [small](https://huggingface.co/softcatala/whisper-small-ca) fine-tuned model as an example, the WER starts at 13 and ends at 8.5.
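
For readers unfamiliar with the metric, here is a minimal sketch of how WER can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is illustrative only and is not the implementation used by the fine-tuning scripts. The function names and the sample sentences are assumptions made for this example, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative WER reduction in percent (positive means improvement)."""
    return 100.0 * (before - after) / before


# Hypothetical sentences: one substitution plus one deletion over 4 words.
print(word_error_rate("el gat menja peix", "el gos menja"))
# WER going from 13 to 8.5, as reported for the small model.
print(round(relative_reduction(13.0, 8.5), 1))
```

With the figures above, going from WER 13 to WER 8.5 is a relative reduction of roughly 35% on the Common Voice test set.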

**2. Model degrades according to human evaluation**

When doing human evaluation, the results for the fine-tuned Catalan models were disappointing. The fine-tuned models clearly perform worse than the original OpenAI models, as reported by all the users (half a dozen) who tested them.

Our hypothesis is that the evaluation on Common Voice gives better results because the model is overfitted and has lost generalization capabilities.

**3. Model degrades according to evaluation with other datasets**

Doing a more extensive evaluation shows:
 
| Dataset | base | sc-base | small | sc-small | medium | sc-medium |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 15GdH9-curt | 55.60 | 70.40 | 36.62 | 78.75 | 22.39 | 49.53 |
| Ona_catalan-balear | 71.28 | 71.01 | 44.68 | 49.20 | 28.72 | 56.65 |
| Son_Goku_catalan_valencian_voice | 51.90 | 85.44 | 39.87 | 65.19 | 18.99 | 71.52 |
| Universal_Declaration_of_Human_Rights | 47.12 | 36.45 | 39.14 | 75.59 | 44.37 | 27.79 |

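As you can see, the fine-tuned models obtain a higher (worse) WER than the original models in most of the comparisons.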
 
Legend:
* "sc-" indicates a Softcatalà fine-tuned model
* The scores are WER values (lower is better)


**4. Inference clients for fine-tuned Whisper models provide different quality**

Summary as of March 2023:

**a**. The OpenAI Whisper implementation does not support out-of-the-box inference on fine-tuned models, only on OpenAI models.

**b**. The HuggingFace Whisper implementation performs poorly. This can be really misleading when doing evaluations, since HuggingFace is the stack used for fine-tuning.

In our experiments we obtained the following results:

| Whisper Client | WER |
| ----------- | ----------- |
| OpenAI      | 27.32       |
| Whisper.cpp 1.2.1   | 38.89 |
| HuggingFace   | 93.54  |
| CTranslate2 3.10.3  | 43.68  |

We strongly recommend using CTranslate2 as the inference client.
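
As a rough sketch of what CTranslate2 inference can look like, the snippet below builds the `ct2-transformers-converter` command that converts a Hugging Face checkpoint, and then transcribes with the `faster-whisper` package, which runs Whisper on top of CTranslate2. Using `faster-whisper` is an assumption of this example, not necessarily the setup behind the numbers above. The output directory, audio file name and quantization settings are also hypothetical.

```python
from typing import List


def build_convert_command(model_id: str, output_dir: str,
                          quantization: str = "float16") -> List[str]:
    """Build the ct2-transformers-converter command for a Whisper checkpoint."""
    return [
        "ct2-transformers-converter",
        "--model", model_id,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


def transcribe(model_dir: str, audio_path: str, language: str = "ca") -> str:
    """Transcribe one audio file with a converted model and return the text."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: CPU inference with int8 weights; adjust to your hardware.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language=language)
    return " ".join(segment.text.strip() for segment in segments)
```

A possible way to use it: run `subprocess.run(build_convert_command("softcatala/whisper-small-ca", "whisper-small-ca-ct2"), check=True)` once to convert the model, then call `transcribe("whisper-small-ca-ct2", "sample.wav")`.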


**5. Fine-tuning degrades timestamp prediction**

Whisper uses timestamp tokens to indicate the timestamps of the transcribed text.

The training scripts available for fine-tuning do not generate timestamp tokens, so timestamp prediction degrades.

This matters because many people use Whisper models to create video subtitles, where timestamp prediction is important.
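
To illustrate why subtitles depend on these tokens, here is a minimal sketch that turns decoded Whisper output containing timestamp tokens of the form `<|0.00|>` into subtitle segments. The sample text and the helper names are assumptions made for this example. Note that the token values are relative to the audio window being decoded.

```python
import re
from typing import List, Tuple

# Timestamp tokens look like <|0.00|> or <|12.48|> in decoded output.
TIMESTAMP = re.compile(r"<\|(\d+\.\d{2})\|>")


def parse_segments(decoded: str) -> List[Tuple[float, float, str]]:
    """Split decoded text into (start, end, text) tuples."""
    # re.split with one capture group yields [text, time, text, time, ...].
    parts = TIMESTAMP.split(decoded)
    times = [float(value) for value in parts[1::2]]
    texts = parts[2::2]
    segments = []
    for index in range(len(times) - 1):
        text = texts[index].strip()
        if text:  # nothing sits between an end token and the next start token
            segments.append((times[index], times[index + 1], text))
    return segments


def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


# Hypothetical decoded output with timestamp tokens.
sample = ("<|0.00|> Bon dia a tothom.<|2.40|>"
          "<|2.40|> Avui parlarem de Whisper.<|5.12|>")
for start, end, text in parse_segments(sample):
    print(f"{to_srt_time(start)} --> {to_srt_time(end)}  {text}")
```

When a model no longer emits timestamp tokens, `parse_segments` returns an empty list for its output, which is the practical problem for subtitle generation.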


**Next steps**

It is possible that fine-tuning cannot improve Whisper, or can do so only in certain scenarios (domain adaptation). See [Nickolay Shmyrev](https://alphacephei.com/nsh/2023/01/15/whisper-finetuning.html)'s views and tests.

Potential next steps:

* The training scripts have to be improved to include timestamp tokens
* New training runs should include more corpora than just Common Voice