---
license: apache-2.0
language:
- en
- es
- fr
- it
widget:
- text: "The best cough medicine is because "
- text: "El mejor medicamento para la tos es porque "
- text: "Le meilleur médicament contre la toux est car perché la
---

Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

# Model Card for MedMT5-large

We present Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Medical mT5 is an encoder-decoder model developed by continuing the training of publicly available mT5 checkpoints on medical domain data for English, Spanish, French, and Italian.

- 📖 Paper: [Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain]()
- 🌐 Project Website: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)

Pre-training settings for Medical mT5.

| | Medical mT5-Large (HiTZ/Medical-mT5-large) | Medical mT5-XL (HiTZ/Medical-mT5-xl) |
|---|---|---|
| Parameters | 738M | 3B |
| Sequence length | 1024 | 480 |
| Tokens per step | 65536 | 30720 |
| Epochs | 1 | 1 |
| Total tokens | 4.5B | 4.5B |
| Optimizer | Adafactor | Adafactor |
| Learning rate | 0.001 | 0.001 |
| Scheduler | Constant | Constant |
| Hardware | 4xA100 | 4xA100 |
| Time (h) | 10.5 | 20.5 |
| CO₂eq (kg) | 2.9 | 5.6 |
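The settings above imply a batch size and a step count that the table does not state directly. The sketch below derives them under two assumptions that are not confirmed by the source: that tokens per step equals sequence length times the number of sequences per step, and that "4.5B" is an exact token count. The names `SETTINGS` and `derived` are illustrative only.

```python
# Values copied from the pre-training settings table.
SETTINGS = {
    "Medical mT5-Large": {"sequence_length": 1024, "tokens_per_step": 65_536},
    "Medical mT5-XL": {"sequence_length": 480, "tokens_per_step": 30_720},
}

# "4.5B" in the table, treated here as an exact figure (assumption).
TOTAL_TOKENS = 4.5e9


def derived(cfg):
    """Return (sequences per step, approximate number of optimizer steps)."""
    # Assumption: every sequence in a step is filled to the full sequence length.
    sequences_per_step = cfg["tokens_per_step"] // cfg["sequence_length"]
    steps = TOTAL_TOKENS / cfg["tokens_per_step"]
    return sequences_per_step, steps


for name, cfg in SETTINGS.items():
    sequences_per_step, steps = derived(cfg)
    print(f"{name}: {sequences_per_step} sequences/step, ~{steps:,.0f} steps")
```

Under these assumptions both models process 64 sequences per step. The XL model's shorter sequences mean it needs roughly twice as many steps to see the same 4.5B tokens.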

# Model Description

- **Developed by**: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata and Andrea Zaninello
- **Contact**: [Iker García-Ferrero](https://ikergarcia1996.github.io/Iker-Garcia-Ferrero/) and [Rodrigo Agerri](https://ragerri.github.io/)
- **Website**: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
- **Funding**: CHIST-ERA XAI 2019 call. Antidote (PCI2020-120717-2) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR
- **Model type**: text2text-generation
- **Language(s) (NLP)**: English, Spanish, French, Italian
- **License**: apache-2.0
- **Finetuned from model**: mT5

## How to Get Started with the Model

You can load the model with the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint ID for this model card, as listed in the pre-training settings table.
tokenizer = AutoTokenizer.from_pretrained("HiTZ/Medical-mT5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("HiTZ/Medical-mT5-large")
```

The XL checkpoint, `HiTZ/Medical-mT5-xl`, loads the same way.

The model has been trained only on the T5 masked language modeling task, so it does not perform downstream tasks out of the box. You need to fine-tune it for your task.
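To make the T5 masked language modeling format mentioned above concrete, the following is a minimal sketch of how a span-corrupted input and target pair is laid out with T5-style sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...). The function `span_corrupt`, the example sentence, and the hand-picked spans are illustrative assumptions. This is not the authors' preprocessing code, which samples spans randomly and operates on subword tokens.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) span with a sentinel; return (input, target)."""
    source: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{sentinel_id}>"
        # The input keeps the unmasked text and a sentinel in place of the span.
        source.extend(tokens[cursor:start])
        source.append(sentinel)
        # The target lists each sentinel followed by the text it replaced.
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    source.extend(tokens[cursor:])
    # T5-style targets end with one additional sentinel.
    target.append(f"<extra_id_{len(spans)}>")
    return " ".join(source), " ".join(target)


# Hypothetical example sentence; whitespace tokens stand in for subwords.
tokens = "The patient was given ibuprofen for the fever".split()
model_input, model_target = span_corrupt(tokens, [(3, 5), (7, 8)])
print(model_input)
print(model_target)
```

A model pre-trained this way learns to fill in masked spans, which is why a task-specific fine-tuning step is needed before it can produce labels, answers, or other structured outputs.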

### Medical mT5 for Sequence Labelling

If you want to use Medical mT5 for sequence labelling, we recommend this code: https://github.com/ikergarcia1996/Sequence-Labeling-LLMs

## Training Data

Data sources and word counts by language.

| Language | Source | Words |
|---|---|---|
| English | ClinicalTrials | 127.4M |
| | EMEA | 12M |
| | PubMed | 968.4M |
| Spanish | EMEA | 13.6M |
| | PubMed | 8.4M |
| | Medical Crawler | 918M |
| | SPACC | 350K |
| | UFAL | 10.5M |
| | WikiMed | 5.2M |
| French | PubMed | 1.4M |
| | Science Direct | 15.2M |
| | Wikipedia - Médecine | 5M |
| | EDP | 48K |
| | Google Patents | 654M |
| Italian | Medical Commoncrawl - IT | 67M |
| | Drug instructions | 30.5M |
| | Wikipedia - Medicina | 13.3M |
| | E3C Corpus - IT | 11.6M |
| | Medicine descriptions | 6.3M |
| | Medical theses | 5.8M |
| | Medical websites | 4M |
| | PubMed | 2.3M |
| | Supplement description | 1.3M |
| | Medical notes | 975K |
| | Pathologies | 157K |
| | Medical test simulations | 26K |
| | Clinical cases | 20K |
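The per-source counts above can be rolled up into per-language totals to see how the corpus is balanced. The sketch below copies the rows from the table and sums them. The helper `parse_count` and the variable names are illustrative, and the totals are only as precise as the rounded figures in the table.

```python
from collections import defaultdict

# (language, source, words) rows copied from the training data table.
ROWS = [
    ("English", "ClinicalTrials", "127.4M"),
    ("English", "EMEA", "12M"),
    ("English", "PubMed", "968.4M"),
    ("Spanish", "EMEA", "13.6M"),
    ("Spanish", "PubMed", "8.4M"),
    ("Spanish", "Medical Crawler", "918M"),
    ("Spanish", "SPACC", "350K"),
    ("Spanish", "UFAL", "10.5M"),
    ("Spanish", "WikiMed", "5.2M"),
    ("French", "PubMed", "1.4M"),
    ("French", "Science Direct", "15.2M"),
    ("French", "Wikipedia - Médecine", "5M"),
    ("French", "EDP", "48K"),
    ("French", "Google Patents", "654M"),
    ("Italian", "Medical Commoncrawl - IT", "67M"),
    ("Italian", "Drug instructions", "30.5M"),
    ("Italian", "Wikipedia - Medicina", "13.3M"),
    ("Italian", "E3C Corpus - IT", "11.6M"),
    ("Italian", "Medicine descriptions", "6.3M"),
    ("Italian", "Medical theses", "5.8M"),
    ("Italian", "Medical websites", "4M"),
    ("Italian", "PubMed", "2.3M"),
    ("Italian", "Supplement description", "1.3M"),
    ("Italian", "Medical notes", "975K"),
    ("Italian", "Pathologies", "157K"),
    ("Italian", "Medical test simulations", "26K"),
    ("Italian", "Clinical cases", "20K"),
]

MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_count(text: str) -> float:
    """Convert a count such as '127.4M' or '350K' into a number of words."""
    return float(text[:-1]) * MULTIPLIERS[text[-1]]


totals = defaultdict(float)
for language, _source, words in ROWS:
    totals[language] += parse_count(words)

grand_total = sum(totals.values())
for language, words in totals.items():
    share = 100 * words / grand_total
    print(f"{language}: {words / 1e6:,.1f}M words ({share:.1f}%)")
print(f"Total: {grand_total / 1e9:.2f}B words")
```

Run as written, this shows English and Spanish contributing the most text and Italian the least, which is worth keeping in mind when comparing per-language results.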

## Evaluation

### Single-task supervised F1 scores for Sequence Labelling


### Multi-task supervised F1 scores for Sequence Labelling

### Zero-shot F1 scores for Argument Mining. Models have been trained in English and evaluated in Spanish, French and Italian.

## Ethical Statement

Our research in developing Medical mT5, a multilingual text-to-text model for the medical domain, has ethical implications that we acknowledge. Firstly, the broader impact of this work lies in its potential to improve medical communication and understanding across languages, which can enhance healthcare access and quality for diverse linguistic communities. However, it also raises ethical considerations related to privacy and data security. To create our multilingual corpus, we have taken measures to anonymize and protect sensitive patient information, adhering to data protection regulations in each language's jurisdiction or deriving our data from sources that explicitly address this issue in line with privacy and safety regulations and guidelines. Furthermore, we are committed to transparency and fairness in our model's development and evaluation. We have worked to ensure that our benchmarks are representative and unbiased, and we will continue to monitor and address any potential biases in the future. Finally, we emphasize our commitment to open source by making our data, code, and models publicly available, with the aim of promoting collaboration within the research community.

## Citation

```bibtex
@inproceedings{medMt5,
    title = "{{Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain}}",
    author = "{Iker García-Ferrero and Rodrigo Agerri and Aitziber Atutxa Salazar and Elena Cabrio and Iker de la Iglesia and Alberto Lavelli and Bernardo Magnini and Benjamin Molinet and Johana Ramirez-Romero and German Rigau and Jose Maria Villa-Gonzalez and Serena Villata and Andrea Zaninello}",
    publisher = "Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)",
    year = 2024
}
```