{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "63b2fd12-a1a3-45a7-949e-64b903c5d2d5", "metadata": {}, "source": [ "# Automatic speech recognition using Distil-Whisper and OpenVINO\n", "\n", "[Distil-Whisper](https://huggingface.co/distil-whisper/distil-large-v2) is a distilled variant of the [Whisper](https://huggingface.co/openai/whisper-large-v2) model by OpenAI. The Distil-Whisper is proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430). According to authors, compared to Whisper, Distil-Whisper runs in several times faster with 50% fewer parameters, while performing to within 1% word error rate (WER) on out-of-distribution evaluation data.\n", "\n", "Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It maps a sequence of audio spectrogram features to a sequence of text tokens. First, the raw audio inputs are converted to a log-Mel spectrogram by action of the feature extractor. Then, the Transformer encoder encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditional on both the previous tokens and the encoder hidden states.\n", "\n", "You can see the model architecture in the diagram below:\n", "\n", "![whisper_architecture.svg](https://user-images.githubusercontent.com/29454499/204536571-8f6d8d77-5fbd-4c6d-8e29-14e734837860.svg)\n", "\n", "In this tutorial, we consider how to run Distil-Whisper using OpenVINO. We will use the pre-trained model from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library. To simplify the user experience, the [Hugging Face Optimum](https://huggingface.co/docs/optimum) library is used to convert the model to OpenVINO™ IR format. 
To further improve the performance of the OpenVINO Distil-Whisper model, `INT8` post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) is applied.\n", "\n", "#### Table of contents:\n", "\n", "- [Prerequisites](#Prerequisites)\n", "- [Load PyTorch model](#Load-PyTorch-model)\n", " - [Prepare input sample](#Prepare-input-sample)\n", " - [Run model inference](#Run-model-inference)\n", "- [Load OpenVINO model using Optimum library](#Load-OpenVINO-model-using-Optimum-library)\n", " - [Select Inference device](#Select-Inference-device)\n", " - [Compile OpenVINO model](#Compile-OpenVINO-model)\n", " - [Run OpenVINO model inference](#Run-OpenVINO-model-inference)\n", "- [Compare performance PyTorch vs OpenVINO](#Compare-performance-PyTorch-vs-OpenVINO)\n", "- [Usage OpenVINO model with HuggingFace pipelines](#Usage-OpenVINO-model-with-HuggingFace-pipelines)\n", "- [Quantization](#Quantization)\n", " - [Prepare calibration datasets](#Prepare-calibration-datasets)\n", " - [Quantize Distil-Whisper encoder and decoder models](#Quantize-Distil-Whisper-encoder-and-decoder-models)\n", " - [Run quantized model inference](#Run-quantized-model-inference)\n", " - [Compare performance and accuracy of the original and quantized models](#Compare-performance-and-accuracy-of-the-original-and-quantized-models)\n", "- [Interactive demo](#Interactive-demo)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "22bf06fc-5988-4e3d-9d81-7fe23ff18131", "metadata": {}, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bb9fc7f3-cea0-4adf-9ee6-4a3d15931db7", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:38.239262600Z", "start_time": "2023-11-08T15:05:38.138403800Z" } }, "outputs": [], "source": [ "%pip install -q \"transformers>=4.35\" \"torch>=2.1\" onnx \"git+https://github.com/huggingface/optimum-intel.git\" \"peft==0.6.2\" --extra-index-url https://download.pytorch.org/whl/cpu\n", "%pip install -q \"openvino>=2023.2.0\" datasets \"gradio>=4.0\" \"librosa\" \"soundfile\"\n", "%pip install -q \"nncf>=2.6.0\" \"jiwer\"" ] }, { "attachments": {}, "cell_type": "markdown", "id": "34bbdf5e-0e4c-482c-a08a-395972c8b56f", "metadata": {}, "source": [ "## Load PyTorch model\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The `AutoModelForSpeechSeq2Seq.from_pretrained` method is used to initialize the PyTorch Whisper model using the transformers library. By default, we will use the `distil-whisper/distil-large-v2` model as an example in this tutorial. The model will be downloaded once during the first run, and this process may take some time.\n", "\n", "You may also choose other models from the [Distil-Whisper Hugging Face collection](https://huggingface.co/collections/distil-whisper/distil-whisper-models-65411987e6727569748d2eb6), such as `distil-whisper/distil-medium.en` or `distil-whisper/distil-small.en`. Models of the original Whisper architecture are also available; more on them [here](https://huggingface.co/openai).\n", "\n", "Preprocessing and post-processing are important when using this model. The `AutoProcessor` class, used here to initialize a `WhisperProcessor`, is responsible for preparing the audio input data for the model (converting it to a log-Mel spectrogram) and for decoding the predicted output token_ids into a string using the tokenizer."
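, "\n", "\n", "As a quick illustration of the processor's role, below is a minimal sketch assuming a 16 kHz mono waveform stored in a numpy array; the exact calls used in this tutorial appear in the following cells:\n", "\n", "```python\n", "import numpy as np\n", "from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq\n", "\n", "processor = AutoProcessor.from_pretrained(\"distil-whisper/distil-large-v2\")\n", "model = AutoModelForSpeechSeq2Seq.from_pretrained(\"distil-whisper/distil-large-v2\")\n", "\n", "audio_array = np.zeros(16000, dtype=np.float32)  # placeholder: one second of silence at 16 kHz\n", "\n", "# The processor converts raw audio into a log-Mel spectrogram tensor,\n", "inputs = processor(audio_array, sampling_rate=16000, return_tensors=\"pt\")\n", "# the model generates token ids,\n", "predicted_ids = model.generate(inputs.input_features)\n", "# and the tokenizer decodes them back into text.\n", "print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])\n", "```\n"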
] }, { "cell_type": "code", "execution_count": null, "id": "756cd923", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import ipywidgets as widgets\n", "\n", "model_ids = {\n", " \"Distil-Whisper\": [\n", " \"distil-whisper/distil-large-v2\",\n", " \"distil-whisper/distil-medium.en\",\n", " \"distil-whisper/distil-small.en\",\n", " ],\n", " \"Whisper\": [\n", " \"openai/whisper-large-v3\",\n", " \"openai/whisper-large-v2\",\n", " \"openai/whisper-large\",\n", " \"openai/whisper-medium\",\n", " \"openai/whisper-small\",\n", " \"openai/whisper-base\",\n", " \"openai/whisper-tiny\",\n", " \"openai/whisper-medium.en\",\n", " \"openai/whisper-small.en\",\n", " \"openai/whisper-base.en\",\n", " \"openai/whisper-tiny.en\",\n", " ],\n", "}\n", "\n", "model_type = widgets.Dropdown(\n", " options=model_ids.keys(),\n", " value=\"Distil-Whisper\",\n", " description=\"Model type:\",\n", " disabled=False,\n", ")\n", "\n", "model_type" ] }, { "cell_type": "code", "execution_count": null, "id": "de1107e4", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "model_id = widgets.Dropdown(\n", " options=model_ids[model_type.value],\n", " value=model_ids[model_type.value][0],\n", " description=\"Model:\",\n", " disabled=False,\n", ")\n", "\n", "model_id" ] }, { "cell_type": "code", "execution_count": 2, "id": "e5382431-497e-4688-b4ec-8958a92163e7", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:45.226409400Z", "start_time": "2023-11-08T15:05:38.138403800Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ], "source": [ "from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq\n", "\n", "processor = AutoProcessor.from_pretrained(model_id.value)\n", "\n", "pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id.value)\n", "pt_model.eval();" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bbe82d01-ea1e-433f-92c1-570f9c51c456", "metadata": {}, "source": [ "### Prepare input sample\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The processor expects audio data in numpy array format and information about the audio sampling rate and returns the `input_features` tensor for making predictions. Conversion of audio to numpy format is handled by Hugging Face datasets implementation." ] }, { "cell_type": "code", "execution_count": 3, "id": "df5a5952-0457-4f1e-9dfe-0446c4cb0111", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:48.289894900Z", "start_time": "2023-11-08T15:05:45.226409400Z" } }, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "\n", "def extract_input_features(sample):\n", " input_features = processor(\n", " sample[\"audio\"][\"array\"],\n", " sampling_rate=sample[\"audio\"][\"sampling_rate\"],\n", " return_tensors=\"pt\",\n", " ).input_features\n", " return input_features\n", "\n", "\n", "dataset = load_dataset(\"hf-internal-testing/librispeech_asr_dummy\", \"clean\", split=\"validation\")\n", "sample = dataset[0]\n", "input_features = extract_input_features(sample)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5ff96530-4d3c-4a20-8ac0-b475794b54b5", "metadata": {}, "source": [ "### Run model inference\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "To perform speech recognition, one can use `generate` interface of the model. 
After generation is finished, `processor.batch_decode` can be used to decode the predicted token_ids into a text transcription." ] }, { "cell_type": "code", "execution_count": 4, "id": "c9618867-beae-4875-a5be-0e0a3b453414", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:51.638930400Z", "start_time": "2023-11-08T15:05:48.289894900Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Reference: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL\n", "Result: Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n" ] } ], "source": [ "import IPython.display as ipd\n", "\n", "predicted_ids = pt_model.generate(input_features)\n", "transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n", "\n", "display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n", "print(f\"Reference: {sample['text']}\")\n", "print(f\"Result: {transcription[0]}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "219ed303-d323-4a07-8a92-66a2e96e1ec5", "metadata": {}, "source": [ "## Load OpenVINO model using Optimum library\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The Hugging Face Optimum API is a high-level API that enables us to convert and quantize models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the [Hugging Face Optimum documentation](https://huggingface.co/docs/optimum/intel/inference).\n", "\n", "Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/docs/optimum/intel/hf.co/models) and create pipelines to run inference with OpenVINO Runtime using Hugging Face APIs. The Optimum Inference models are API compatible with Hugging Face Transformers models. This means we just need to replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.\n", "\n", "Below is an example of loading the distil-whisper model:\n", "\n", "```diff\n", "-from transformers import AutoModelForSpeechSeq2Seq\n", "+from optimum.intel.openvino import OVModelForSpeechSeq2Seq\n", "from transformers import AutoTokenizer, pipeline\n", "\n", "model_id = \"distil-whisper/distil-large-v2\"\n", "-model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)\n", "+model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)\n", "```\n", "\n", "Model class initialization starts with calling the `from_pretrained` method. When downloading and converting the Transformers model, the `export=True` parameter should be added. The converted model can be saved for later reuse with the `save_pretrained` method.\n", "Tokenizers and processors are distributed together with the model and are also compatible with the OpenVINO model, which means we can reuse the processor initialized earlier." ] }, { "cell_type": "code", "execution_count": 5, "id": "7ef523e8-b70f-4d86-a7d1-81f761c3eac0", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:52.840159900Z", "start_time": "2023-11-08T15:05:51.638930400Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:NNCF initialized successfully. 
Supported frameworks detected: torch, onnx, openvino\n" ] } ], "source": [ "from pathlib import Path\n", "from optimum.intel.openvino import OVModelForSpeechSeq2Seq\n", "\n", "model_path = Path(model_id.value.replace(\"/\", \"_\"))\n", "ov_config = {\"CACHE_DIR\": \"\"}\n", "\n", "if not model_path.exists():\n", " ov_model = OVModelForSpeechSeq2Seq.from_pretrained(\n", " model_id.value,\n", " ov_config=ov_config,\n", " export=True,\n", " compile=False,\n", " load_in_8bit=False,\n", " )\n", " ov_model.half()\n", " ov_model.save_pretrained(model_path)\n", "else:\n", " ov_model = OVModelForSpeechSeq2Seq.from_pretrained(model_path, ov_config=ov_config, compile=False)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "99a3dffb-5476-4a5d-843f-c7a7cbbf2154", "metadata": {}, "source": [ "### Select Inference device\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "4b1bd73b-bcc8-4f72-b896-63e11f33f607", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:53.210751600Z", "start_time": "2023-11-08T15:05:53.179009600Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2de8aed5af2d4f2f9527a7f384ec62a7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import openvino as ov\n", "import ipywidgets as widgets\n", "\n", "core = ov.Core()\n", "\n", "device = widgets.Dropdown(\n", " options=core.available_devices + [\"AUTO\"],\n", " value=\"AUTO\",\n", " description=\"Device:\",\n", " disabled=False,\n", ")\n", "\n", "device" ] }, { "attachments": {}, "cell_type": "markdown", "id": "45cd85e8-63e4-402c-86bc-2023ed5775a8", "metadata": {}, "source": [ "### Compile OpenVINO model\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "057328a7-dc25-4c54-a438-85467e0076de", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:55.485452300Z", "start_time": "2023-11-08T15:05:53.211466100Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Compiling the encoder to AUTO ...\n", "Compiling the decoder to AUTO ...\n", "Compiling the decoder to AUTO ...\n" ] } ], "source": [ "ov_model.to(device.value)\n", "ov_model.compile()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3590030d-5149-4f83-9e78-f6a582e1511a", "metadata": {}, "source": [ "### Run OpenVINO model inference\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "68a94f38-09e9-48fc-9df0-6c954a82f2fb", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:57.755227300Z", "start_time": "2023-11-08T15:05:55.486017200Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/optimum/intel/openvino/modeling_seq2seq.py:457: FutureWarning: `shared_memory` is deprecated and will be removed in 2024.0. Value of `shared_memory` is going to override `share_inputs` value. Please use only `share_inputs` explicitly.\n", " last_hidden_state = torch.from_numpy(self.request(inputs, shared_memory=True)[\"last_hidden_state\"]).to(\n", "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/optimum/intel/openvino/modeling_seq2seq.py:538: FutureWarning: `shared_memory` is deprecated and will be removed in 2024.0. 
Value of `shared_memory` is going to override `share_inputs` value. Please use only `share_inputs` explicitly.\n", " self.request.start_async(inputs, shared_memory=True)\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Reference: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL\n", "Result: Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n" ] } ], "source": [ "predicted_ids = ov_model.generate(input_features)\n", "transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n", "\n", "display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n", "print(f\"Reference: {sample['text']}\")\n", "print(f\"Result: {transcription[0]}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8d69046e-707f-4c07-af24-389b125b3abd", "metadata": {}, "source": [ "## Compare performance PyTorch vs OpenVINO\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "6ddafe5c-3238-40d3-b8ed-9d50c73f0d8a", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:05:57.756685200Z", "start_time": "2023-11-08T15:05:57.755227300Z" } }, "outputs": [], "source": [ "import time\n", "import numpy as np\n", "from tqdm.notebook import tqdm\n", "\n", "\n", "def measure_perf(model, sample, n=10):\n", " timers = []\n", " input_features = extract_input_features(sample)\n", " for _ in tqdm(range(n), desc=\"Measuring performance\"):\n", " start = time.perf_counter()\n", " model.generate(input_features)\n", " end = time.perf_counter()\n", " timers.append(end - start)\n", " return np.median(timers)" ] }, { "cell_type": "code", "execution_count": 10, "id": "94025726-5c09-42b8-9046-9fbbe73afc47", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:06:45.775679100Z", "start_time": "2023-11-08T15:05:57.755227300Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dcfd5f827f7e4cfdb82fa0bd573bcdc3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Measuring performance: 0%| | 0/10 [00:00= 0, \"non-negative timestamp expected\"\n", " milliseconds = round(seconds * 1000.0)\n", "\n", " hours = milliseconds // 3_600_000\n", " milliseconds -= hours * 3_600_000\n", "\n", " minutes = milliseconds // 60_000\n", " milliseconds -= minutes * 60_000\n", "\n", " seconds = milliseconds // 1_000\n", " milliseconds -= seconds * 1_000\n", "\n", " return (f\"{hours}:\" if hours > 0 else \"00:\") + f\"{minutes:02d}:{seconds:02d},{milliseconds:03d}\"\n", "\n", "\n", "def prepare_srt(transcription):\n", " \"\"\"\n", " Format transcription into srt file format\n", " \"\"\"\n", " segment_lines = []\n", " for idx, segment in enumerate(transcription[\"chunks\"]):\n", " segment_lines.append(str(idx + 1) + \"\\n\")\n", " timestamps = segment[\"timestamp\"]\n", " time_start = format_timestamp(timestamps[0])\n", " time_end = format_timestamp(timestamps[1])\n", " time_str = f\"{time_start} --> {time_end}\\n\"\n", " segment_lines.append(time_str)\n", " segment_lines.append(segment[\"text\"] + \"\\n\\n\")\n", " return segment_lines" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4fdb0ad8-e083-4e63-aeb1-c15566d945a7", "metadata": {}, "source": [ "`return_timestamps` argument allows getting timestamps of start and end of speech associated with each processed chunk. 
It could be useful in tasks like speech separation or generation of video subtitles. In this example, we provide output formatting in SRT format, one of the popular subtitles format. " ] }, { "cell_type": "code", "execution_count": 17, "id": "3219bb35-955a-4032-acf1-d83e5dab09bd", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:08:40.300007100Z", "start_time": "2023-11-08T15:08:28.543311300Z" } }, "outputs": [], "source": [ "result = pipe(sample_long[\"audio\"].copy(), return_timestamps=True)" ] }, { "cell_type": "code", "execution_count": 18, "id": "bd7ef03b-3c71-4f3a-9a9c-40549256b447", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:08:42.981577100Z", "start_time": "2023-11-08T15:08:40.300007100Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "1\n", "00:00:00,000 --> 00:00:06,560\n", " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n", "\n", "2\n", "00:00:06,560 --> 00:00:11,280\n", " Nor is Mr. Quilter's manner less interesting than his matter.\n", "\n", "3\n", "00:00:11,280 --> 00:00:16,840\n", " He tells us that at this festive season of the year, with Christmas and roast beef looming\n", "\n", "4\n", "00:00:16,840 --> 00:00:23,760\n", " before us, similes drawn from eating and its results occur most readily to the mind.\n", "\n", "5\n", "00:00:23,760 --> 00:00:29,360\n", " He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and\n", "\n", "6\n", "00:00:29,360 --> 00:00:33,640\n", " can discover in it but little of Rocky Ithaca.\n", "\n", "7\n", "00:00:33,640 --> 00:00:39,760\n", " Lennel's pictures are a sort of upgards and Adam paintings, and Mason's exquisite\n", "\n", "8\n", "00:00:39,760 --> 00:00:44,720\n", " idles are as national as a jingo poem.\n", "\n", "9\n", "00:00:44,720 --> 00:00:50,320\n", " Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used\n", "\n", "10\n", "00:00:50,320 --> 00:00:52,920\n", " to flash his teeth.\n", "\n", "11\n", "00:00:52,920 --> 00:00:58,680\n", " And Mr. John Collier gives his sitter a cheerful slap on the back, before he says, like\n", "\n", "12\n", "00:00:58,680 --> 00:01:01,120\n", " a shampooer and a Turkish bath,\n", "\n", "13\n", "00:01:01,120 --> 00:01:02,000\n", " Next man!\n", "\n", "\n" ] } ], "source": [ "srt_lines = prepare_srt(result)\n", "\n", "display(ipd.Audio(sample_long[\"audio\"][\"array\"], rate=sample_long[\"audio\"][\"sampling_rate\"]))\n", "print(\"\".join(srt_lines))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b36d31bc", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Quantization\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor.\n", "\n", "The optimization process contains the following steps:\n", "\n", "1. Create a calibration dataset for quantization.\n", "2. Run `nncf.quantize` to obtain quantized encoder and decoder models.\n", "3. 
Serialize the `INT8` model using the `openvino.save_model` function.\n", "\n", ">**Note**: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.\n", "\n", "Please select below whether you would like to run Distil-Whisper quantization." ] }, { "cell_type": "code", "execution_count": 19, "id": "58d361c3", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:08:42.981577100Z", "start_time": "2023-11-08T15:08:42.981577100Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5aa6a12230054911b0f409ebad748ff0", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Checkbox(value=True, description='Quantization')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to_quantize = widgets.Checkbox(\n", " value=True,\n", " description=\"Quantization\",\n", " disabled=False,\n", ")\n", "\n", "to_quantize" ] }, { "cell_type": "code", "execution_count": 20, "id": "46cc97d3", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T15:08:42.981577100Z", "start_time": "2023-11-08T15:08:42.981577100Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Fetch `skip_kernel_extension` module\n", "import requests\n", "\n", "r = requests.get(\n", " url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py\",\n", ")\n", "open(\"skip_kernel_extension.py\", \"w\").write(r.text)\n", "\n", "%load_ext skip_kernel_extension" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cfb2c2a7", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "### Prepare calibration datasets\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The first step is to prepare calibration datasets for quantization. Since we quantize the Whisper encoder and decoder separately, we need to prepare a calibration dataset for each of the models. We import an `InferRequestWrapper` class that intercepts model inputs and collects them in a list. Then we run model inference on a small number of audio samples. Generally, increasing the calibration dataset size improves quantization quality."
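, "\n", "\n", "To connect this with steps 2 and 3 of the optimization process above, here is a minimal sketch of how the collected calibration data can be passed to `nncf.quantize` and the result serialized. The parameter values and file name are illustrative assumptions; the settings actually used in this tutorial are in the quantization cell further below:\n", "\n", "```python\n", "import nncf\n", "import openvino as ov\n", "\n", "# Collect calibration inputs with the helper defined in the next cell (size is illustrative).\n", "encoder_calibration_data, decoder_calibration_data = collect_calibration_dataset(ov_model, 50)\n", "\n", "# Quantize the encoder IR, using the intercepted inputs as the calibration dataset.\n", "quantized_encoder = nncf.quantize(\n", "    ov_model.encoder.model,\n", "    nncf.Dataset(encoder_calibration_data),\n", "    subset_size=len(encoder_calibration_data),\n", "    model_type=nncf.ModelType.TRANSFORMER,\n", ")\n", "\n", "# Serialize the INT8 model (file name is an assumption).\n", "ov.save_model(quantized_encoder, model_path / \"openvino_encoder_model_int8.xml\")\n", "```\n"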
] }, { "cell_type": "code", "execution_count": 21, "id": "96d6b01e", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T16:08:47.608131500Z", "start_time": "2023-11-08T16:08:47.567321700Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "from itertools import islice\n", "from optimum.intel.openvino.quantization import InferRequestWrapper\n", "\n", "\n", "def collect_calibration_dataset(ov_model: OVModelForSpeechSeq2Seq, calibration_dataset_size: int):\n", " # Overwrite model request properties, saving the original ones for restoring later\n", " encoder_calibration_data = []\n", " decoder_calibration_data = []\n", " ov_model.encoder.request = InferRequestWrapper(ov_model.encoder.request, encoder_calibration_data, apply_caching=True)\n", " ov_model.decoder_with_past.request = InferRequestWrapper(ov_model.decoder_with_past.request,\n", " decoder_calibration_data,\n", " apply_caching=True)\n", "\n", " try:\n", " calibration_dataset = load_dataset(\"librispeech_asr\", \"clean\", split=\"validation\", streaming=True)\n", " for sample in tqdm(islice(calibration_dataset, calibration_dataset_size), desc=\"Collecting calibration data\",\n", " total=calibration_dataset_size):\n", " input_features = extract_input_features(sample)\n", " ov_model.generate(input_features)\n", " finally:\n", " ov_model.encoder.request = ov_model.encoder.request.request\n", " ov_model.decoder_with_past.request = ov_model.decoder_with_past.request.request\n", "\n", " return encoder_calibration_data, decoder_calibration_data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "023f2eff", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "### Quantize Distil-Whisper encoder and decoder models\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Below we run the `quantize` function which calls `nncf.quantize` on Distil-Whisper encoder and decoder-with-past models. We don't quantize first-step-decoder because its share in whole inference time is negligible." ] }, { "cell_type": "code", "execution_count": 22, "id": "0de8bd26", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T16:20:21.666837100Z", "start_time": "2023-11-08T16:20:19.667042200Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false }, "test_replace": { "CALIBRATION_DATASET_SIZE = 50": "CALIBRATION_DATASET_SIZE = 1" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bf08dd57a86d47fc8100fe234f394d1b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Collecting calibration data: 0%| | 0/10 [00:00\n", " \n", " Your browser does not support the audio element.\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Original : Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n", "Quantized: Mr. 
Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "dataset = load_dataset(\n", " \"hf-internal-testing/librispeech_asr_dummy\", \"clean\", split=\"validation\"\n", ")\n", "sample = dataset[0]\n", "input_features = extract_input_features(sample)\n", "\n", "predicted_ids = ov_model.generate(input_features)\n", "transcription_original = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n", "\n", "predicted_ids = ov_quantized_model.generate(input_features)\n", "transcription_quantized = processor.batch_decode(predicted_ids, skip_special_tokens=True)\n", "\n", "display(ipd.Audio(sample[\"audio\"][\"array\"], rate=sample[\"audio\"][\"sampling_rate\"]))\n", "print(f\"Original : {transcription_original[0]}\")\n", "print(f\"Quantized: {transcription_quantized[0]}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3228cf53", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Results are the same!" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c68cb960", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "### Compare performance and accuracy of the original and quantized models\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Finally, we compare original and quantized Distil-Whisper models from accuracy and performance stand-points.\n", "\n", "To measure accuracy, we use `1 - WER` as a metric, where WER stands for Word Error Rate.\n", "\n", "When measuring inference time, we do it separately for encoder and decoder-with-past model forwards, and for the whole model inference too." ] }, { "cell_type": "code", "execution_count": 24, "id": "7133f52f", "metadata": { "ExecuteTime": { "end_time": "2023-11-08T16:15:20.910568900Z", "start_time": "2023-11-08T16:12:18.721295800Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false }, "test_replace": { "TEST_DATASET_SIZE = 50": "TEST_DATASET_SIZE = 1" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Got disconnected from remote data host. Retrying in 5sec [1/20]\n", "Got disconnected from remote data host. Retrying in 5sec [2/20]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4324dc631e7242788ddeaea0e552f8a7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Measuring performance and accuracy: 0%| | 0/50 [00:00 MAX_AUDIO_MINS:\n", " raise gr.Error(\n", " f\"To ensure fair usage of the Space, the maximum audio length permitted is {MAX_AUDIO_MINS} minutes.\"\n", " f\"Got an audio of length {round(audio_length_mins, 3)} minutes.\"\n", " )\n", "\n", " inputs = {\"array\": inputs, \"sampling_rate\": pipe.feature_extractor.sampling_rate}\n", "\n", " def _forward_ov_time(*args, **kwargs):\n", " global ov_time\n", " start_time = time.time()\n", " result = pipe_forward(*args, **kwargs)\n", " ov_time = time.time() - start_time\n", " ov_time = round(ov_time, 2)\n", " return result\n", "\n", " pipe._forward = _forward_ov_time\n", " ov_text = pipe(inputs.copy(), batch_size=BATCH_SIZE)[\"text\"]\n", " return ov_text, ov_time\n", "\n", "\n", "with gr.Blocks() as demo:\n", " gr.HTML(\n", " \"\"\"\n", "
\n", " \n", "

\n", " OpenVINO Distil-Whisper demo\n", "

\n", "
\n", " \n", " \"\"\"\n", " )\n", " audio = gr.components.Audio(type=\"filepath\", label=\"Audio input\")\n", " with gr.Row():\n", " button = gr.Button(\"Transcribe\")\n", " if to_quantize.value:\n", " button_q = gr.Button(\"Transcribe quantized\")\n", " with gr.Row():\n", " infer_time = gr.components.Textbox(label=\"OpenVINO Distil-Whisper Transcription Time (s)\")\n", " if to_quantize.value:\n", " infer_time_q = gr.components.Textbox(label=\"OpenVINO Quantized Distil-Whisper Transcription Time (s)\")\n", " with gr.Row():\n", " transcription = gr.components.Textbox(label=\"OpenVINO Distil-Whisper Transcription\", show_copy_button=True)\n", " if to_quantize.value:\n", " transcription_q = gr.components.Textbox(\n", " label=\"OpenVINO Quantized Distil-Whisper Transcription\",\n", " show_copy_button=True,\n", " )\n", " button.click(\n", " fn=transcribe,\n", " inputs=audio,\n", " outputs=[transcription, infer_time],\n", " )\n", " if to_quantize.value:\n", " button_q.click(\n", " fn=transcribe,\n", " inputs=[audio, gr.Number(value=1, visible=False)],\n", " outputs=[transcription_q, infer_time_q],\n", " )\n", " gr.Markdown(\"## Examples\")\n", " gr.Examples(\n", " [[\"./example_1.wav\"]],\n", " audio,\n", " outputs=[transcription, infer_time],\n", " fn=transcribe,\n", " cache_examples=False,\n", " )\n", "# if you are launching remotely, specify server_name and server_port\n", "# demo.launch(server_name='your server name', server_port='server port in int')\n", "# Read more in the docs: https://gradio.app/docs/\n", "try:\n", " demo.launch(debug=True)\n", "except Exception:\n", " demo.launch(share=True, debug=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "openvino_notebooks": { "imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/52c58b58-7730-48d2-803d-4af0b6115499", "tags": { "categories": [ "Model Demos", "AI Trends" ], "libraries": [], "other": [], "tasks": [ "Speech Recognition" ] } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }