Update README.md
README.md (CHANGED)

---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---

Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| **Baseline** | 57.01% | - | 46.95% | - |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
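
As a rough guide to how these numbers are computed: a phoneme error rate (PER) is the Levenshtein distance between the predicted and reference phoneme sequences divided by the reference length, and AER applies the same idea to the articulatory attribute classifiers. The snippet below is only an illustrative sketch of such an error rate; it is not part of the `allophant` package and the example sequences are made up:

```python
def edit_distance(hyp: list[str], ref: list[str]) -> int:
    # Standard Levenshtein distance over phoneme tokens
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (h != r),  # substitution
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(hyp: list[str], ref: list[str]) -> float:
    # Total edits divided by the number of reference phonemes
    return edit_distance(hyp, ref) / len(ref)


# Example: one deletion against a five-phoneme reference -> 0.2
print(phoneme_error_rate(["a", "l", "o", "f"], ["a", "l", "o", "f", "a"]))
```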

Usage
=====

Install the [`allophant`](https://github.com/kgnlp/allophant) package:

```bash
pip install allophant
```

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-baseline", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
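
Since checkpoints can also be restored from a local file, the same call should work with a path on disk, and a GPU can be selected when available. This is a sketch only: the checkpoint path below is a placeholder, and it assumes `Estimator.restore` accepts a local path in place of a Hub ID:

```python
import torch

# Use a GPU when available; fall back to the CPU otherwise
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Hypothetical path to a locally stored checkpoint
model, attribute_indexer = Estimator.restore("checkpoints/allophant-baseline.pt", device=device)
```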

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
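
To transcribe several recordings in one call, the resampled waveforms can be zero-padded into a single batch. This is a sketch under the assumption that `Batch` takes a `(batch, time)` audio tensor, per-utterance lengths, and one language ID per utterance, mirroring the single-file example above; the file names are placeholders:

```python
from torch.nn.utils.rnn import pad_sequence

# Hypothetical file names
paths = ["utterance_1.wav", "utterance_2.wav"]
waveforms = []
for path in paths:
    waveform, sample_rate = torchaudio.load(path)
    # Keep the first channel and resample to the model's sample rate
    waveforms.append(torchaudio.functional.resample(waveform[0], sample_rate, model.sample_rate))

# Record the original lengths, then zero-pad to the longest utterance -> (batch, time)
lengths = torch.tensor([len(waveform) for waveform in waveforms])
padded = pad_sequence(waveforms, batch_first=True)

batch = Batch(padded, lengths, torch.zeros(len(paths)))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```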

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
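
If only the phoneme transcription is needed, the `"phonemes"` entry of the decoder dictionary can be used on its own. This mirrors the loop above and assumes `"phonemes"` is among the feature names printed earlier and that `feature_values` returns the recognized phoneme symbols:

```python
# Decode only the phoneme classifier instead of iterating over all feature decoders
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Token indices are offset by one due to the <BLANK> token used during decoding
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(phonemes)
```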

Citation
========

```bibtex
@inproceedings{glocker2023allophant,
  title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
  author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
  year={2023},
  booktitle={{Proc. Interspeech 2023}},
  month={8}}
```

[Preprint](https://arxiv.org/abs/2306.04306)