--- language: ja library_name: transformers license: apache-2.0 tags: - audio - automatic-speech-recognition - hf-asr-leaderboard widget: - example_title: CommonVoice 8.0 (Test Split) src: >- https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0/resolve/main/sample.flac - example_title: JSUT Basic 5000 src: >- https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac - example_title: ReazonSpeech (Test Split) src: >- https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test/resolve/main/sample.flac pipeline_tag: automatic-speech-recognition datasets: - japanese-asr/whisper_transcriptions.reazonspeech.large - japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0 - japanese-asr/whisper_transcriptions.reazonspeech.large.wer_10.0.vectorized --- # Kotoba-Whisper-v1.1 _Kotoba-Whisper-v1.1_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), with additional postprocessing stacks integrated as [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features includes adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main). These libraries are merged into Kotoba-Whisper-v1.1 via pipeline and will be applied seamlessly to the predicted transcription from [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0). The pipeline has been developed through the collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech) Following table presents the raw CER (unlike usual CER where the punctuations are removed before computing the metrics, see the evaluation script [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/blob/main/run_short_form_eval.py)) along with the. | model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) | |:--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------:| | [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 17.6 | 15.4 | 17.4 | | [kotoba-tech/kotoba-whisper-v2.1](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.1) | 17.7 | 15.4 | 17 | | [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 17.8 | 15.2 | 17.8 | | [kotoba-tech/kotoba-whisper-v1.1](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1) | 17.9 | 15 | 17.8 | | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 15.3 | 13.4 | 20.5 | | [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 15.9 | 10.6 | 34.6 | | [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 16.6 | 11.3 | 40.7 | | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 17.9 | 13.1 | 39.3 | | [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 34.5 | 26.4 | 76 | | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 21.5 | 18.9 | 48.1 | | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 58.8 | 38.3 | 153.3 | Regarding to the normalized CER, since those update from v1.1 will be removed by the normalization, kotoba-tech/kotoba-whisper-v1.1 marks the same CER values as [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0). ### Latency Kotoba-whisper-v1.1 improves the punctuation and the timestamp of the output from Kotoba-whisper-v1.0. However, since we apply the punctuator and stable-ts to each chunk, we need to obtain the timestamps, which decreases the latency of the original kotoba-whisper-v1.0. See the following table comparing the inference speed on transcribing **50min** Japanese speech audio, where we report the average over five independent runs. | model | return_timestamps | time (mean) | |:---------------------------------------------------------|:--------------------|--------------:| | kotoba-tech/kotoba-whisper-v1.0 | False | 10.8 | | kotoba-tech/kotoba-whisper-v1.0 | True | 15.7 | | kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts) | True | 17.9 | | kotoba-tech/kotoba-whisper-v1.1 (punctuator) | True | 17.7 | | kotoba-tech/kotoba-whisper-v1.1 (stable-ts) | True | 16.1 | | openai/whisper-large-v3 | False | 29.1 | | openai/whisper-large-v3 | True | 37.9 | See the full table [here](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1/raw/main/latency.csv). ## Transformers Usage Kotoba-Whisper-v1.1 is supported in the Hugging Face πŸ€— Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers. ```bash pip install --upgrade pip pip install --upgrade transformers accelerate torchaudio pip install stable-ts==2.16.0 pip install punctuators==0.0.5 ``` ### Transcription The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files as follows: ```python import torch from transformers import pipeline from datasets import load_dataset # config model_id = "kotoba-tech/kotoba-whisper-v1.1" torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 device = "cuda:0" if torch.cuda.is_available() else "cpu" model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} generate_kwargs = {"language": "ja", "task": "transcribe"} # load model pipe = pipeline( model=model_id, torch_dtype=torch_dtype, device=device, model_kwargs=model_kwargs, batch_size=16, trust_remote_code=True, punctuator=True ) # load sample audio dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test") sample = dataset[0]["audio"] # run inference result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs) print(result) ``` - To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline: ```diff - result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs) + result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs) ``` - To deactivate punctuator: ```diff - punctuator=True, + punctuator=False, ``` ### Transcription with Prompt Kotoba-whisper can generate transcription with prompting as below: ```python import re import torch from transformers import pipeline from datasets import load_dataset # config model_id = "kotoba-tech/kotoba-whisper-v1.1" torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 device = "cuda:0" if torch.cuda.is_available() else "cpu" model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} generate_kwargs = {"language": "japanese", "task": "transcribe"} # load model pipe = pipeline( model=model_id, torch_dtype=torch_dtype, device=device, model_kwargs=model_kwargs, batch_size=16, trust_remote_code=True ) # load sample audio dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test") # --- Without prompt --- text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text'] print(text) # 81ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚ # --- With prompt ---: Let's change `81` to `91`. prompt = "91ζ­³" generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device) text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text'] # currently the pipeline for ASR appends the prompt at the beginning of the transcription, so remove it text = re.sub(rf"\A\s*{prompt}\s*", "", text) print(text) # γ‚γ£γΆγ£γŸγ§γ‚‚γ‚Ήγƒ«γ‚¬γ•γ‚“γ€91ζ­³γ€εŠ›εΌ·γ„θ΅°γ‚Šγ«ε€‰γ‚γ£γ¦γγΎγ™γ€‚ ``` ### Flash Attention 2 We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention): ``` pip install flash-attn --no-build-isolation ``` Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`: ```diff - model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {} + model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {} ``` ## Acknowledgements * [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3). * Hugging Face πŸ€— [Transformers](https://github.com/huggingface/transformers) for the model integration. * Hugging Face πŸ€— for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper). * [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).