from textwrap import dedent

from iso639 import Lang

BANNER_TEXT = """
<div style="text-align: center;">
    <h1><a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit Benchmarks</a></h1>
</div>
"""


INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit, our on-device ASR solution, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""


INTRO_TEXT = """
<h3 style="display: flex;
  justify-content: center;
  align-items: center;
"></h3>
\n📈 Key Metrics:  
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.  
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit performs no worse than the reference model. Higher is better.  
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.  
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.

🎯 WhisperKit is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:  
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)  
On-device: <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> (various versions and optimizations)  

ℹ️ Reference Implementation:  
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:  
<a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips  
<a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls  

🌐 Multilingual Benchmarks:  
These benchmarks aim to demonstrate WhisperKit's capabilities across diverse languages, helping developers assess its suitability for multilingual applications.  
\nDataset:  
<a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>: Short-form audio files (<30s/clip), up to 400 samples per language, from Common Voice 17.0. The test set covers a wide range of languages to test the model's versatility.  

\nMetrics:  
Average WER: Provides an overall measure of model performance across all languages.    
Language-specific WER: Allows for detailed analysis of model performance for each supported language.    
Language Detection Accuracy: Measured using a confusion matrix, showing the model's ability to identify the correct language.  
Results are shown for both forced (correct language given as input) and unforced (model detects language) scenarios.  

🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> to reproduce these results or run evaluations on their own custom datasets.

🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""


METHODOLOGY_TEXT = dedent(
    """
    # Methodology

    ## Overview
    WhisperKit Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit models across supported devices, OS versions and audio datasets.

    ## Metrics

    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
        - This metric varies with the input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
    - **Multilingual results**: Separated into "language hinted" and "language predicted" categories to evaluate performance with and without prior knowledge of the input language.
    
    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
    - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
    - **Multilingual**: Max 400 samples per language with <30s/clip from [Common Voice 17.0 Test Set](https://huggingface.co/datasets/argmaxinc/common_voice_17_0-argmax_subset-400). Common Voice covers 77 of the 99 languages supported by Whisper.

    ## Performance Measurement

    1. On-device testing is conducted with [WhisperKit Regression Test Automations](https://github.com/argmaxinc/WhisperKit/blob/main/BENCHMARKS.md) on iPhones, iPads, and Macs, across different iOS and macOS versions.
    2. Performance is recorded on the 10-minute datasets described above for short- and long-form audio.
    3. Quality metrics are recorded on full datasets on Apple M2 Ultra Mac Studios, which allows fast processing of many configurations and provides a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
    4. Quality is also sanity-checked on 10-minute datasets in order to catch potential correctness regressions across different device and OS combinations despite running the same version of WhisperKit.
    5. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.

    ## Dashboard Features

    - Performance: Interactive filtering by model, device, OS, and performance metrics
    - Timeline: Visualizations of performance trends
    - English Quality: English transcription quality on short- and long-form audio
    - Multilingual Quality: Multilingual (77) transcription quality on short-form audio with and without language prediction
    - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.

    This methodology is designed to provide a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
"""
)

PERFORMANCE_TEXT = dedent(
    """
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.

    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
"""
)

QUALITY_TEXT = dedent(
    """
    ## Metrics
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
"""
)

# Maps raw result column keys to the column headers shown in the dashboard.
COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "english_wer": "English WER",
    "multilingual_wer": "Multilingual WER",
}


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""@misc{whisperkit-argmax,
   title = {WhisperKit},
   author = {Argmax, Inc.},
   year = {2024},
   URL = {https://github.com/argmaxinc/WhisperKit}
}"""


HEADER = """<div align="center">
        <div style="position: relative">
        <img
            src=""
            style="display:block;width:7%;height:auto;"
        />
        </div>
</div>"""


EARNINGS22_URL = (
    "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)

AUDIO_URL = (
    "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)

WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"

BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"

AVAILABLE_LANGUAGES = [
    "af",
    "am",
    "ar",
    "as",
    "az",
    "ba",
    "be",
    "bg",
    "bn",
    "br",
    "ca",
    "cs",
    "cy",
    "da",
    "de",
    "el",
    "en",
    "es",
    "et",
    "eu",
    "fa",
    "fi",
    "fr",
    "gl",
    "ha",
    "he",
    "hi",
    "hu",
    "hy",
    "id",
    "it",
    "ja",
    "ka",
    "kk",
    "ko",
    "lo",
    "lt",
    "lv",
    "mk",
    "ml",
    "mn",
    "mr",
    "mt",
    "ne",
    "nl",
    "nn",
    "oc",
    "pa",
    "pl",
    "ps",
    "pt",
    "ro",
    "ru",
    "sk",
    "sl",
    "sq",
    "sr",
    "sv",
    "sw",
    "ta",
    "te",
    "th",
    "tk",
    "tr",
    "tt",
    "uk",
    "ur",
    "uz",
    "vi",
    "yi",
    "yo",
    "yue",
    "zh",
]
# Map each language code to its English name via iso639-lang (e.g. "af" -> "Afrikaans").
LANGUAGE_MAP = {lang: Lang(lang).name for lang in AVAILABLE_LANGUAGES}