---
language: fr
pipeline_tag: "token-classification"
tags:
- MEDIA
---

# vpelloin/MEDIA_NLU-flaubert_oral_ft

This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input word to one of 76 output concept tags.

This model was trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft) as its initial checkpoint. It obtained 11.98% CER (*lower is better*) on the MEDIA test set, as reported in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352).

## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on the MEDIA test set.
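The CER scores above compare predicted and reference concept-tag sequences. As a rough illustration of how such a score behaves, a concept error rate can be sketched as a length-normalised edit distance between tag sequences. This is a hand-rolled approximation, not the official MEDIA evaluation script; the tag names are hypothetical placeholders.

```python
# Minimal sketch of a concept-error-rate computation: the Levenshtein
# distance between reference and predicted tag sequences, normalised
# by the reference length. Illustrative only, NOT the MEDIA scorer.

def edit_distance(ref, hyp):
    """Levenshtein distance between two tag sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def concept_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

# One substitution over three reference tags -> a CER of 1/3
print(concept_error_rate(["B-loc", "B-date", "O"], ["B-loc", "O", "O"]))
```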

## Usage with Pipeline
```python
from transformers import pipeline

generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_oral_ft",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
```
## Usage with AutoTokenizer/AutoModel
```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outputs = model(**inputs).logits
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
```
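The `argmax`/`id2label` decoding step above can be exercised without downloading the model. Below is a self-contained sketch using a hypothetical three-tag inventory (the real `model.config.id2label` maps ids to the 76 MEDIA concept tags):

```python
# Illustrative decoding of token-classification logits into tags.
# The tag names here are hypothetical placeholders, not the model's
# actual label inventory.
id2label = {0: "O", 1: "B-localisation", 2: "B-nombre"}

# Fake logits for one sentence of two tokens: shape (batch, seq, num_labels)
logits = [[[0.2, 3.1, 0.4],
           [2.5, 0.1, 0.3]]]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

labels = [[id2label[argmax(token)] for token in sentence]
          for sentence in logits]
print(labels)  # [['B-localisation', 'O']]
```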

## Reference
If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```bibtex
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}
```