	---
	license: apache-2.0
	datasets:
	- mozilla-foundation/common_voice_10_0
	base_model:
	- facebook/wav2vec2-xls-r-300m
	tags:
	- pytorch
	- phoneme-recognition
	pipeline_tag: automatic-speech-recognition
	arxiv: arxiv.org/abs/2306.04306
	metrics:
	- per
	- aer
	library_name: allophant
	language:
	- bn
	- ca
	- cs
	- cv
	- da
	- de
	- el
	- en
	- es
	- et
	- eu
	- fi
	- fr
	- ga
	- hi
	- hu
	- id
	- it
	- ka
	- ky
	- lt
	- mt
	- nl
	- pl
	- pt
	- ro
	- ru
	- sk
	- sl
	- sv
	- sw
	- ta
	- tr
	- uk
	---

	Model Information
	=================

	Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

	The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

	| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
	| ---------------- | ---------: | ---------: | -------: | -------: |
	| [Multitask](https://huggingface.co/kgnlp/allophant) | 45.62% | 19.44% | 34.34% | 8.36% |
	| Hierarchical | 46.09% | 19.18% | 34.35% | 8.56% |
	| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
	| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
	| [Baseline](https://huggingface.co/kgnlp/allophant-baseline) | 57.01% | - | 46.95% | - |

	PER is the phoneme error rate and AER the attribute error rate. Note that our baseline models were trained without phonetic feature classifiers; they therefore only support phoneme recognition and have no AER.
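To make the PER columns concrete, here is a minimal sketch of one common way to compute a phoneme error rate: the total Levenshtein distance between reference and hypothesis phoneme sequences, divided by the total reference length. This convention is an assumption and the helper names `edit_distance` and `phoneme_error_rate` are hypothetical; the evaluation used for the table above may differ in detail.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis phonemes.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phoneme in enumerate(reference, start=1):
        current = [i]
        for j, hyp_phoneme in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phoneme != hyp_phoneme)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length (assumed convention)."""
    total_length = sum(len(reference) for reference in references)
    if total_length == 0:
        raise ValueError("references must contain at least one phoneme")
    total_errors = sum(
        edit_distance(reference, hypothesis)
        for reference, hypothesis in zip(references, hypotheses)
    )
    return total_errors / total_length


# Illustrative data only: one substitution (ɲ -> n) across 9 reference phonemes
references = [["o", "l", "a"], ["m", "a", "ɲ", "a", "n", "a"]]
hypotheses = [["o", "l", "a"], ["m", "a", "n", "a", "n", "a"]]
print(f"PER: {phoneme_error_rate(references, hypotheses):.2%}")
```

Because errors are summed over the whole corpus before dividing, long utterances weigh more than short ones; averaging per-utterance rates instead would give a different number.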

	Usage
	=====

	Install the [`allophant`](https://github.com/kgnlp/allophant) package:

	```bash
	pip install allophant
	```

	A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

	```python
	from allophant.estimator import Estimator

	device = "cpu"
	model, attribute_indexer = Estimator.restore("kgnlp/allophant-hierarchical", device=device)
	supported_features = attribute_indexer.feature_names
	# The phonetic feature categories supported by the model, including "phonemes"
	print(supported_features)
	```

	Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

	```python
	# 1. For a single language:
	inventory = attribute_indexer.phoneme_inventory("es")
	# 2. For multiple languages, e.g. in code-switching scenarios
	inventory = attribute_indexer.phoneme_inventory(["es", "it"])
	# 3. Any custom selection of phones for which features are available in the Allophoible database
	inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
	```

	Audio files can then be loaded, resampled and transcribed using the given
	inventory by first computing the log probabilities for each classifier:

	```python
	import torch
	import torchaudio
	from allophant.dataset_processing import Batch

	# Load an audio file and resample the first channel to the sample rate used by the model
	audio, sample_rate = torchaudio.load("utterance.wav")
	audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

	# Construct a batch of 0-padded single channel audio, lengths and language IDs
	# Language ID can be 0 for inference
	batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
	model_outputs = model.predict(
	    batch.to(device),
	    attribute_indexer.composition_feature_matrix(inventory).to(device)
	)
	```

	Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

	```python
	from allophant import predictions

	# Create a feature mapping for your inventory and CTC decoders for the desired feature set
	inventory_indexer = attribute_indexer.attributes.subset(inventory)
	ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

	for feature_name, decoder in ctc_decoders.items():
	    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
	    # Print the feature name and values for each utterance in the batch
	    for [hypothesis] in decoded:
	        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
	        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
	        print(feature_name, recognized)
	```
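The offset by one mentioned in the comment above can be illustrated with a small, library-independent sketch of greedy (best-path) CTC decoding. It assumes the blank token sits at index 0, so every real token index is one higher than its position in the inventory. The helpers `ctc_greedy_collapse` and `to_inventory_symbols` are hypothetical and are not part of `allophant`, whose decoders may work differently (for example with beam search).

```python
BLANK = 0  # assumption: the CTC blank token occupies index 0


def ctc_greedy_collapse(frame_indices, blank=BLANK):
    """Merge repeated frame labels, then drop blanks (CTC best-path rule)."""
    collapsed = []
    previous = None
    for index in frame_indices:
        # Keep a label only when it starts a new run and is not the blank
        if index != previous and index != blank:
            collapsed.append(index)
        previous = index
    return collapsed


def to_inventory_symbols(token_indices, inventory):
    """Map decoder token indices to inventory entries, undoing the blank offset."""
    return [inventory[index - 1] for index in token_indices]


# Illustrative data only: per-frame argmax labels for a three-phoneme inventory
inventory = ["a", "b", "e"]
frames = [0, 1, 1, 0, 1, 2, 2, 0, 3]

tokens = ctc_greedy_collapse(frames)
print(tokens)  # [1, 1, 2, 3]
print(to_inventory_symbols(tokens, inventory))  # ['a', 'a', 'b', 'e']
```

The blank between the two runs of `1` is what lets the same phoneme be emitted twice in a row; without it, the repeated labels would be merged into a single token.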

	Citation
	========

	```bibtex
	@inproceedings{glocker2023allophant,
	    title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
	    author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
	    year={2023},
	    booktitle={{Proc. Interspeech 2023}},
	    month={8}}
	```
	[arXiv:2306.04306](https://arxiv.org/abs/2306.04306)