Update README.md
README.md (CHANGED)

---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---

Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| **Baseline** | 57.01% | - | 46.95% | - |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
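
As a rough guide to how these numbers are computed: a phoneme error rate (PER) is the Levenshtein distance between the predicted and reference phoneme sequences divided by the reference length, and AER applies the same idea to the articulatory attribute classifiers. The snippet below is only an illustrative sketch of such an error rate; it is not part of the `allophant` package and the example sequences are made up:

```python
def edit_distance(hyp: list[str], ref: list[str]) -> int:
    # Standard Levenshtein distance over phoneme tokens
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (h != r),  # substitution
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(hyp: list[str], ref: list[str]) -> float:
    # Total edits divided by the number of reference phonemes
    return edit_distance(hyp, ref) / len(ref)


# Example: one deletion against a five-phoneme reference -> 0.2
print(phoneme_error_rate(["a", "l", "o", "f"], ["a", "l", "o", "f", "a"]))
```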

Usage
=====

Install the [`allophant`](https://github.com/kgnlp/allophant) package:

```bash
pip install allophant
```

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-baseline", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
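
Since checkpoints can also be restored from a local file, the same call should work with a path on disk, and a GPU can be selected when available. This is a sketch only: the checkpoint path below is a placeholder, and it assumes `Estimator.restore` accepts a local path in place of a Hub ID:

```python
import torch

# Use a GPU when available; fall back to the CPU otherwise
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Hypothetical path to a locally stored checkpoint
model, attribute_indexer = Estimator.restore("checkpoints/allophant-baseline.pt", device=device)
```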

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
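
To transcribe several recordings in one call, the resampled waveforms can be zero-padded into a single batch. This is a sketch under the assumption that `Batch` takes a `(batch, time)` audio tensor, per-utterance lengths, and one language ID per utterance, mirroring the single-file example above; the file names are placeholders:

```python
from torch.nn.utils.rnn import pad_sequence

# Hypothetical file names
paths = ["utterance_1.wav", "utterance_2.wav"]
waveforms = []
for path in paths:
    waveform, sample_rate = torchaudio.load(path)
    # Keep the first channel and resample to the model's sample rate
    waveforms.append(torchaudio.functional.resample(waveform[0], sample_rate, model.sample_rate))

# Record the original lengths, then zero-pad to the longest utterance -> (batch, time)
lengths = torch.tensor([len(waveform) for waveform in waveforms])
padded = pad_sequence(waveforms, batch_first=True)

batch = Batch(padded, lengths, torch.zeros(len(paths)))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```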

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
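
If only the phoneme transcription is needed, the `"phonemes"` entry of the decoder dictionary can be used on its own. This mirrors the loop above and assumes `"phonemes"` is among the feature names printed earlier and that `feature_values` returns the recognized phoneme symbols:

```python
# Decode only the phoneme classifier instead of iterating over all feature decoders
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Token indices are offset by one due to the <BLANK> token used during decoding
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(phonemes)
```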

Citation
========

```bibtex
@inproceedings{glocker2023allophant,
  title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
  author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
  year={2023},
  booktitle={{Proc. Interspeech 2023}},
  month={8}}
```

[Preprint](https://arxiv.org/abs/2306.04306)