	---
	language: en
	datasets:
	- LIUM/tedlium
	tags:
	- speech
	license: apache-2.0
	---
	# Wav2Vec2-Large-Tedlium
	The Wav2Vec2 large model fine-tuned on the TEDLIUM corpus.

	The model is initialised with Facebook's [Wav2Vec2 large LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint, pre-trained on 60,000 hours of audiobooks from the LibriVox project. It is fine-tuned on 452 hours of TED talks from the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (Release 3). When using the model, make sure that your speech input is sampled at 16 kHz.
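
If your audio was recorded at a different rate, it needs to be resampled before it reaches the model. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed; `to_16khz` is a hypothetical helper name and not part of this repository.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects


def to_16khz(waveform, source_rate):
    """Hypothetical helper: resample a mono waveform to 16 kHz."""
    if source_rate == TARGET_RATE:
        return waveform
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(TARGET_RATE, source_rate)
    up, down = TARGET_RATE // divisor, source_rate // divisor
    # Polyphase resampling applies a low-pass (anti-aliasing) filter internally.
    return resample_poly(waveform, up, down)


# One second of a 440 Hz tone recorded at 44.1 kHz (synthetic example data).
source_rate = 44_100
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440 * t)

resampled = to_16khz(tone, source_rate)
print(len(tone), len(resampled))
```

Audio libraries such as `librosa` or `torchaudio` offer their own resampling routines; the SciPy version is used here only to keep the sketch short.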

	The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/10c85yc4?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
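
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words: (substitutions + deletions + insertions) / reference length. The sketch below is an illustrative pure-Python implementation, not the scoring code used to produce the numbers above; the function name and the example sentences are made up for demonstration.

```python
def word_error_rate(reference, hypothesis):
    """Illustrative WER: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.1%}")
```

With two errors against six reference words, the example scores 2/6, or roughly 33%.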

	See [this notebook](https://colab.research.google.com/drive/1FjTsqbYKphl9kL-eILgUc-bl4zVThL8F?usp=sharing) for more information on how this model was fine-tuned.


	# Usage
	To transcribe audio files, the model can be used as a standalone acoustic model as follows:
	```python
	from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
	from datasets import load_dataset
	import torch

	# load model and processor
	processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
	model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

	# load dummy dataset
	ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

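	# process audio inputs (the model expects 16 kHz audio)
	sample = ds[0]["audio"]
	input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values  # batch size 1

	# retrieve logits without tracking gradients
	with torch.no_grad():
	    logits = model(input_values).logits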

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)
	print("Target: ", ds["text"][0])
	print("Transcription: ", transcription[0])
	```
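
The "take argmax and decode" step above amounts to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea on a toy example. The five-token vocabulary, the blank index, and the function name are assumptions made for illustration; the real vocabulary ships with the processor, and `processor.batch_decode` handles all of this for you.

```python
from itertools import groupby

# Hypothetical toy vocabulary; "|" is assumed to mark word boundaries.
VOCAB = ["<pad>", "|", "a", "c", "t"]
BLANK_ID = 0  # assumed: the padding token doubles as the CTC blank


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove blanks; a blank between two equal ids keeps both characters.
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Turn word-delimiter tokens into spaces.
    return "".join(chars).replace("|", " ").strip()


# Per-frame argmax ids for a made-up utterance: c c _ a a a _ t | | a _ _
frame_ids = [3, 3, 0, 2, 2, 2, 0, 4, 1, 1, 2, 0, 0]
print(greedy_ctc_decode(frame_ids))
```

Note that the blank is what allows doubled letters: `[4, 0, 4]` decodes to two characters, whereas `[4, 4]` collapses to one.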

	## Evaluation

	This code snippet shows how to evaluate Wav2Vec2-Large-Tedlium on the TEDLIUM test data.

	```python
	from datasets import load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
	import torch
	from jiwer import wer

	# load the TEDLIUM (Release 3) test split, plus the model and processor
	tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")
	model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
	processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

	def map_to_pred(batch):
	    # with batched=True each column is a list, so collect one array per example
	    audio_arrays = [audio["array"] for audio in batch["audio"]]
	    input_values = processor(audio_arrays, sampling_rate=16000, return_tensors="pt", padding="longest").input_values
	    with torch.no_grad():
	        logits = model(input_values.to("cuda")).logits
	    # greedy decoding: most likely token per frame, then CTC collapse
	    predicted_ids = torch.argmax(logits, dim=-1)
	    batch["transcription"] = processor.batch_decode(predicted_ids)
	    return batch

	# drop the raw audio column once the transcriptions are computed
	result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
	print("WER:", wer(result["text"], result["transcription"]))
	```