|
--- |
|
license: mit |
|
language: |
|
- fr |
|
library_name: transformers |
|
tags: |
|
- linformer |
|
- legal |
|
- medical |
|
- RoBERTa |
|
- pytorch |
|
--- |
|
|
|
# Jargon-multidomain-base |
|
|
|
[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder language model for French, combining the [Linformer](https://arxiv.org/abs/2006.04768) attention mechanism with the RoBERTa model architecture.
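For intuition, here is a minimal single-head sketch of Linformer-style attention in PyTorch. The dimensions, initialisation, and single-head simplification are illustrative assumptions for exposition, not the exact configuration used in Jargon:

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head Linformer attention (illustrative sketch).

    Keys and values are projected from sequence length n down to a fixed
    size k, so attention costs O(n*k) instead of the usual O(n^2).
    """

    def __init__(self, d_model: int = 768, seq_len: int = 512, k: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * d_model)
        # Learned low-rank projections E, F of shape (k, n)
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len**0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len**0.5)
        self.scale = d_model**-0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model), with n equal to the fixed seq_len
        q = self.q_proj(x)                               # (b, n, d)
        key, val = self.kv_proj(x).chunk(2, dim=-1)      # (b, n, d) each
        key = torch.einsum("kn,bnd->bkd", self.E, key)   # (b, k, d)
        val = torch.einsum("kn,bnd->bkd", self.F, val)   # (b, k, d)
        attn = torch.softmax(q @ key.transpose(1, 2) * self.scale, dim=-1)
        return attn @ val                                # (b, n, d)
```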
|
|
|
Jargon is available in several versions with different context sizes and types of pre-training corpora. |
|
|
|
|
|
|
| **Model** | **Initialised from** | **Training Data** |
|
|-------------------------------------------------------------------------------------|:-----------------------:|:----------------:| |
|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch |8.5GB Web Corpus| |
|
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base |5.4GB Medical Corpus| |
|
| jargon-general-legal | jargon-general-base |18GB Legal Corpus|
|
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base |Medical+Legal Corpora| |
|
| jargon-legal | scratch |18GB Legal Corpus| |
|
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch |18GB Legal Corpus| |
|
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch |5.4GB Medical Corpus| |
|
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch |5.4GB Medical Corpus| |
|
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)| |
|
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)| |
|
|
|
|
|
## Evaluation |
|
|
|
The Jargon models were evaluated on a range of specialized downstream tasks.
|
|
|
For more details, please check out the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
|
|
|
|
|
## Using Jargon models with Hugging Face `transformers`
|
|
|
You can get started with this model using the code snippet below: |
|
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required: the Jargon checkpoints ship a custom
# (Linformer-based) model implementation alongside the weights
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-multidomain-base", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-multidomain-base", trust_remote_code=True)

# Predict the most likely tokens for the <mask> position
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
print(output)
```
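The fill-mask pipeline returns the most likely candidates for the masked token (five by default), each with its score.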
|
|
|
You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question. |
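As an illustration, a sequence-classification setup might be loaded as in the sketch below; `num_labels=2` and the example sentence are placeholder assumptions, and the classification head is randomly initialised until fine-tuned:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels=2 is a placeholder; set it to match your downstream task
model = AutoModelForSequenceClassification.from_pretrained(
    "PantagrueLLM/jargon-multidomain-base",
    trust_remote_code=True,
    num_labels=2,
)
tokenizer = AutoTokenizer.from_pretrained(
    "PantagrueLLM/jargon-multidomain-base", trust_remote_code=True
)

inputs = tokenizer("Le patient présente une fracture du tibia.", return_tensors="pt")
logits = model(**inputs).logits  # (1, num_labels); the head is untrained here
```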
|
|
|
- **Language(s):** French |
|
- **License:** MIT |
|
- **Developed by:** Vincent Segonne |
|
- **Funded by:**
|
- GENCI-IDRIS (Grant 2022 A0131013801) |
|
- French National Research Agency: Pantagruel grant ANR-23-IAS1-0001 |
|
- MIAI@Grenoble Alpes ANR-19-P3IA-0003 |
|
- PROPICTO ANR-20-CE93-0005 |
|
- Lawbot ANR-20-CE38-0013 |
|
- Swiss National Science Foundation (grant PROPICTO N°197864) |
|
- **Authors:**
|
- Vincent Segonne |
|
- Aidan Mannion |
|
- Laura Cristina Alonzo Canul |
|
- Alexandre Audibert |
|
- Xingyu Liu |
|
- Cécile Macaire |
|
- Adrien Pupier |
|
- Yongxin Zhou |
|
- Mathilde Aguiar |
|
- Felix Herron |
|
- Magali Norré |
|
- Massih-Reza Amini |
|
- Pierrette Bouillon |
|
- Iris Eshkol-Taravella |
|
- Emmanuelle Esperança-Rodier |
|
- Thomas François |
|
- Lorraine Goeuriot |
|
- Jérôme Goulian |
|
- Mathieu Lafourcade |
|
- Benjamin Lecouteux |
|
- François Portet |
|
- Fabien Ringeval |
|
- Vincent Vandeghinste |
|
- Maximin Coavoux |
|
- Marco Dinarelli |
|
- Didier Schwab |
|
|
|
|
|
|
|
## Citation |
|
|
|
If you use this model for your own research work, please cite as follows: |
|
|
|
```bibtex |
|
@inproceedings{segonne:hal-04535557, |
|
TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}}, |
|
AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier}, |
|
URL = {https://hal.science/hal-04535557}, |
|
BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}}, |
|
ADDRESS = {Turin, Italy}, |
|
YEAR = {2024}, |
|
MONTH = May, |
|
KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription}, |
|
PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf}, |
|
HAL_ID = {hal-04535557}, |
|
HAL_VERSION = {v1}, |
|
} |
|
``` |
|
|
|
|
|
|
|