---
language: fr
pipeline_tag: "token-classification"
tags:
- MEDIA
---

# vpelloin/MEDIA_NLU-flaubert_oral_ft

This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input word to one of 76 output concept tags.

This model was trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft) as its initial checkpoint. It obtained 11.98% CER (*lower is better*) on the MEDIA test set, as reported in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352).

## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on the MEDIA test set.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on the MEDIA test set.
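The CER scores above compare predicted and reference concept-tag sequences. As a rough illustration of how such a score behaves, a concept error rate can be sketched as a length-normalised edit distance between tag sequences. This is a hand-rolled approximation, not the official MEDIA evaluation script; the tag names are hypothetical placeholders.

```python
# Minimal sketch of a concept-error-rate computation: the Levenshtein
# distance between reference and predicted tag sequences, normalised
# by the reference length. Illustrative only, NOT the MEDIA scorer.

def edit_distance(ref, hyp):
    """Levenshtein distance between two tag sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def concept_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

# One substitution over three reference tags -> a CER of 1/3
print(concept_error_rate(["B-loc", "B-date", "O"], ["B-loc", "O", "O"]))
```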

## Usage with Pipeline
```python
from transformers import pipeline

generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_oral_ft",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
```
## Usage with AutoTokenizer/AutoModel
```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outputs = model(**inputs).logits
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
```
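The `argmax`/`id2label` decoding step above can be exercised without downloading the model. Below is a self-contained sketch using a hypothetical three-tag inventory (the real `model.config.id2label` maps ids to the 76 MEDIA concept tags):

```python
# Illustrative decoding of token-classification logits into tags.
# The tag names here are hypothetical placeholders, not the model's
# actual label inventory.
id2label = {0: "O", 1: "B-localisation", 2: "B-nombre"}

# Fake logits for one sentence of two tokens: shape (batch, seq, num_labels)
logits = [[[0.2, 3.1, 0.4],
           [2.5, 0.1, 0.3]]]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

labels = [[id2label[argmax(token)] for token in sentence]
          for sentence in logits]
print(labels)  # [['B-localisation', 'O']]
```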

## Reference
If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```bibtex
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}
```