Instructions to use openai/whisper-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use openai/whisper-large with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large")

    # Load model directly
    from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

    processor = AutoProcessor.from_pretrained("openai/whisper-large")
    model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large")

- Notebooks
- Google Colab
- Kaggle
Commit e80f01d
Parent(s): 27b5bb3

Update README.md

README.md CHANGED
@@ -356,8 +356,8 @@ This code snippet shows how to evaluate Whisper Large on [LibriSpeech test-clean
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
 algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
 [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline.
-predict
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
 
 ```python
 >>> import torch
@@ -376,15 +376,17 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> sample = ds[0]["audio"]
 
->>> prediction = pipe(sample.copy())["text"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
 " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
 
 >>> # we can also return timestamps for the predictions
->>> prediction = pipe(sample, return_timestamps=True)["chunks"]
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
 [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
 'timestamp': (0.0, 5.44)}]
 ```
 
+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
+
 ## Fine-Tuning
 
 The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
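
The README text changed in this commit says that chunking lets a 30 s model handle longer audio and that chunks can then be batched. A rough sketch of that idea follows. It is an illustration only, not the Transformers implementation: `chunk_bounds` and `batches` are hypothetical helpers, and the 16 kHz sampling rate and 5 s overlap per side are assumed values chosen for the example.

```python
def chunk_bounds(n_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Hypothetical helper for illustration; the overlap handling in the real
    pipeline is more involved (it also merges the overlapping predictions).
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Each window advances by the chunk length minus the overlap on both sides.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be greater than 2 * stride_length_s")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        bounds.append((start, end))
        if end >= n_samples:
            break
        start += step
    return bounds


def batches(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # A 70 s clip at the assumed 16 kHz rate is split into overlapping 30 s windows,
    # which are then grouped so several windows can go through the model at once.
    windows = chunk_bounds(70 * 16_000)
    for group in batches(windows, batch_size=8):
        print(group)
```

A clip shorter than one window, such as the 5.44 s sample in the diff, yields a single chunk, so `batch_size` only changes throughput for longer inputs.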