from textwrap import dedent
from iso639 import Lang
BANNER_TEXT = """
"""
INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit, our on-device ASR solution, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""
INTRO_TEXT = """
\n📈 Key Metrics:
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit performs no worse than the reference model. Higher is better.
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.
Speed (⬆️): Seconds of input audio transcribed per second of processing time. Higher is better.
🎯 WhisperKit is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:
Reference: WhisperOpenAIAPI (OpenAI's Whisper API)
On-device: WhisperKit (various versions and optimizations)
ℹ️ Reference Implementation:
WhisperOpenAIAPI sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:
LibriSpeech: ~5 hours of short English audio clips
Earnings22: ~120 hours of English audio from earnings calls
🌐 Multilingual Benchmarks:
These benchmarks aim to demonstrate WhisperKit's capabilities across diverse languages, helping developers assess its suitability for multilingual applications.
\nDataset:
Common Voice 17.0: Short-form audio clips (<30s/clip), up to 400 samples per language, drawn from Common Voice 17.0. The test set covers a wide range of languages to evaluate the model's versatility.
\nMetrics:
Average WER: Provides an overall measure of model performance across all languages.
Language-specific WER: Allows for detailed analysis of model performance for each supported language.
Language Detection Accuracy: Measured using a confusion matrix, showing the model's ability to identify the correct language.
Results are shown for both forced (correct language given as input) and unforced (model detects language) scenarios.
🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use WhisperKit to reproduce these results or run evaluations on their own custom datasets.
🔗 Links:
- WhisperKit
- whisperkittools
- LibriSpeech
- Earnings22
- Common Voice 17.0
- WhisperOpenAIAPI
"""
METHODOLOGY_TEXT = dedent(
"""
# Methodology
## Overview
WhisperKit Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit models across supported devices, OS versions and audio datasets.
## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
- This metric varies with input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder's per-forward-pass latency, which is constant across input files.
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
- This metric does not capture improvements to the reference. It only measures potential regressions.
- **Parity %**: The percentage difference between a model's Average WER on a given device and its Average WER on the Apple M2 Ultra, where a negative value indicates worse performance compared to the M2 Ultra.
- **Multilingual results**: Separated into "language hinted" and "language predicted" categories to evaluate performance with and without prior knowledge of the input language.
## Data
- **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
- **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
- Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
- **Multilingual**: Max 400 samples per language with <30s/clip from [Common Voice 17.0 Test Set](https://huggingface.co/datasets/argmaxinc/common_voice_17_0-argmax_subset-400). Common Voice covers 77 of the 99 languages supported by Whisper.
## Performance Measurement
1. On-device testing is conducted with [WhisperKit Regression Test Automations](https://github.com/argmaxinc/WhisperKit/blob/main/BENCHMARKS.md) on iPhones, iPads, and Macs, across different iOS and macOS versions.
2. Performance is recorded on the 10-minute short- and long-form subsets described above.
3. Quality metrics are recorded on the full datasets on Apple M2 Ultra Mac Studios, allowing fast processing of many configurations and providing a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
4. Quality is also sanity-checked on the 10-minute subsets in order to catch potential correctness regressions across different device and OS combinations despite running the same version of WhisperKit.
5. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.
## Dashboard Features
- Performance: Interactive filtering by model, device, OS, and performance metrics
- Timeline: Visualizations of performance trends
- English Quality: English transcription quality on short- and long-form audio
- Multilingual Quality: Transcription quality across 77 languages on short-form audio, with and without language prediction
- Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.
This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
"""
)
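# Illustrative sketch of the metric definitions in METHODOLOGY_TEXT above. The
# argument names (audio_seconds, latency_seconds, decoder_forward_passes,
# device_wer, m2_ultra_wer) are hypothetical and not part of WhisperKit's
# benchmark schema; the parity formula is one plausible reading of the sign
# convention stated in the text (negative = worse than the M2 Ultra baseline).
def speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Input audio length divided by end-to-end transcription latency."""
    return audio_seconds / latency_seconds


def tokens_per_second(decoder_forward_passes: int, latency_seconds: float) -> float:
    """Text decoder forward passes divided by end-to-end processing time."""
    return decoder_forward_passes / latency_seconds


def parity_percent(device_wer: float, m2_ultra_wer: float) -> float:
    """Relative WER difference vs. the Apple M2 Ultra baseline, in percent."""
    return (m2_ultra_wer - device_wer) / m2_ultra_wer * 100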
PERFORMANCE_TEXT = dedent(
"""
## Metrics
- **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
- **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
- **Parity %**: The percentage difference between a model's Average WER on a given device and its Average WER on the Apple M2 Ultra, where a negative value indicates worse performance compared to the M2 Ultra.
## Data
- **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
- **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
"""
)
QUALITY_TEXT = dedent(
"""
## Metrics
- **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
- **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
- This metric does not capture improvements to the reference. It only measures potential regressions.
"""
)
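# Minimal sketch of the WER and QoI definitions in QUALITY_TEXT above, assuming
# whitespace tokenization and no transcript normalization; the actual evaluation
# pipeline may normalize text differently before scoring.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    dist = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            cur = min(dist[j] + 1, dist[j - 1] + 1, prev + (ref_word != hyp_word))
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(len(ref), 1)


def quality_of_inference(whisperkit_wers, reference_wers) -> float:
    """Fraction of examples where WhisperKit does no worse than the reference."""
    pairs = list(zip(whisperkit_wers, reference_wers))
    return sum(wk <= ref for wk, ref in pairs) / len(pairs)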
COL_NAMES = {
"model.model_version": "Model",
"device.product_name": "Device",
"device.os": "OS",
"average_wer": "Average WER",
"qoi": "QoI",
"speed": "Speed",
"tokens_per_second": "Tok / s",
"model": "Model",
"device": "Device",
"os": "OS",
"parity": "Parity %",
"english_wer": "English WER",
"multilingual_wer": "Multilingual WER",
}
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{whisperkit-argmax,
title = {WhisperKit},
author = {Argmax, Inc.},
year = {2024},
URL = {https://github.com/argmaxinc/WhisperKit}
}"""
HEADER = """"""
EARNINGS22_URL = (
"https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
"https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)
AUDIO_URL = (
"https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)
WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"
BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"
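# Usage note: the "{0}" / "{}" placeholders in the URL templates above are
# presumably filled with an audio file name or result path at lookup time, e.g.
# LIBRISPEECH_URL.format("sample.mp3"); "sample.mp3" is a hypothetical example,
# not an actual dataset entry.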
AVAILABLE_LANGUAGES = [
"af",
"am",
"ar",
"as",
"az",
"ba",
"be",
"bg",
"bn",
"br",
"ca",
"cs",
"cy",
"da",
"de",
"el",
"en",
"es",
"et",
"eu",
"fa",
"fi",
"fr",
"gl",
"ha",
"he",
"hi",
"hu",
"hy",
"id",
"it",
"ja",
"ka",
"kk",
"ko",
"lo",
"lt",
"lv",
"mk",
"ml",
"mn",
"mr",
"mt",
"ne",
"nl",
"nn",
"oc",
"pa",
"pl",
"ps",
"pt",
"ro",
"ru",
"sk",
"sl",
"sq",
"sr",
"sv",
"sw",
"ta",
"te",
"th",
"tk",
"tr",
"tt",
"uk",
"ur",
"uz",
"vi",
"yi",
"yo",
"yue",
"zh",
]
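# Map ISO 639 language codes to display names via the iso639 package,
# e.g. Lang("en").name == "English".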
LANGUAGE_MAP = {lang: Lang(lang).name for lang in AVAILABLE_LANGUAGES}