kgnlp committed on
Commit
0bbf438
1 Parent(s): 90deeab

Update README.md

Files changed (1)
  1. README.md +151 -150
README.md CHANGED
 
---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---

Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| **Baseline** | 57.01% | - | 46.95% | - |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
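
For readers unfamiliar with the metrics: PER (phoneme error rate) is the usual edit-distance-based error rate over phoneme sequences, and AER (attribute error rate) is the analogous rate over articulatory attributes; see the linked paper for the exact attribute-level formulation. A minimal sketch of PER, with S, D, and I the substituted, deleted, and inserted phonemes against a reference of N phonemes:

```latex
\mathrm{PER} = \frac{S + D + I}{N}
```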
68
+
69
+ Usage
70
+ =====
71
+
72
+ Install the [`allophant`](https://github.com/kgnlp/allophant) package:
73
+
74
+ ```bash
75
+ pip install allophant
76
+ ```
77
+
78
+ A pre-trained model can be loaded from a huggingface checkpoint or local file:
79
+
80
+ ```python
81
+ from allophant.estimator import Estimator
82
+
83
+ device = "cpu"
84
+ model, attribute_indexer = Estimator.restore("kgnlp/allophant-baseline", device=device)
85
+ supported_features = attribute_indexer.feature_names
86
+ # The phonetic feature categories supported by the model, including "phonemes"
87
+ print(supported_features)
88
+ ```
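
Since checkpoints can also be loaded from local files, the same `Estimator.restore` call can be pointed at a path on disk; the file name below is only a placeholder:

```python
# Hypothetical local checkpoint path; replace it with wherever your checkpoint is stored
model, attribute_indexer = Estimator.restore("checkpoints/allophant-baseline.pt", device=device)
```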

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
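
As a quick sanity check, the inventory you just built can be printed before decoding. This only assumes the inventory behaves like a collection of IPA phoneme strings, as in option 3 above:

```python
# Inspect the phonemes that will be available to the decoders
print(len(inventory), "phonemes:", sorted(inventory))
```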

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
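
Before decoding, it can help to check which classifier outputs were produced. The decoding step below indexes `model_outputs.outputs` by feature name, so listing its keys (assuming a dictionary-like mapping, which is not documented here) shows the available feature heads:

```python
# Expected to include one entry per phonetic feature classifier, e.g. "phonemes"
print(list(model_outputs.outputs))
```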

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
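
If only a phoneme transcription is needed, the same loop can be reduced to the `"phonemes"` decoder alone. Joining the recognized symbols with spaces assumes `feature_values` returns an iterable of phoneme strings, which matches the inventories shown above but is not guaranteed by this card:

```python
# Decode only the phoneme classifier and print one space-separated transcription per utterance
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Token indices are offset by one because of the <BLANK> token, as in the loop above
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(" ".join(str(phoneme) for phoneme in phonemes))
```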

Citation
========

```bibtex
@inproceedings{glocker2023allophant,
  title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
  author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
  year={2023},
  booktitle={{Proc. Interspeech 2023}},
  month={8}}
```

[arXiv:2306.04306](https://arxiv.org/abs/2306.04306)