Iker committed on
Commit
ca0ee7e
1 Parent(s): 1696c86

Update README.md

Files changed (1)
  1. README.md +147 -0
README.md CHANGED
@@ -1,3 +1,150 @@
1
  ---
2
  license: apache-2.0
3
+ widget:
4
+ - text: >-
5
+ <Disease> Torsade de pointes ventricular tachycardia during low dose
6
+ intermittent dobutamine treatment in a patient with dilated cardiomyopathy
7
+ and congestive heart failure .
8
+ - text: >-
9
+ <ClinicalEntity> Ecográficamente se observan tres nódulos tumorales
10
+ independientes y bien delimitados : dos de ellos heterogéneos , sólidos , de
11
+ 20 y 33 mm de diámetros , con áreas quísticas y calcificaciones .
12
+ - text: >-
13
+ <ClinicalEntity> On notait une hyperlordose lombaire avec une contracture
14
+ permanente des muscles paravertébraux , de l abdomen et des deux membres
15
+ inférieurs .
16
+ - text: >-
17
+ <ClinicalEntity> Nell ’ anamnesi patologica era riferita ipertensione
18
+ arteriosa controllata con terapia medica
19
+ library_name: transformers
20
+ pipeline_tag: text2text-generation
21
+ tags:
22
+ - medical
23
+ - multilingual
24
+ - medic
25
+ datasets:
26
+ - HiTZ/Multilingual-Medical-Corpus
27
+ language:
28
+ - es
29
+ - en
30
+ - fr
31
+ - it
32
+ base_model: HiTZ/Medical-mT5-xl
33
  ---
34
+
35
+ <p align="center">
36
+ <br>
37
+ <img src="http://www.ixa.eus/sites/default/files/anitdote.png" style="width: 45%;">
38
+ <h2 align="center">Medical mT5: An Open-Source Multilingual Text-to-Text LLM
39
+ for the Medical Domain</h2>
40
+ <br>
41
+
42
+ # Model Card for Medical mT5-XL-multitask
43
+
44
+
45
+ <p align="justify">
46
+
47
+ Medical mT5-XL-multitask is a version of Medical mT5 finetuned for sequence labelling. It can label a wide range of medical entity types in unstructured text, such as `Disease`, `Disability`, `ClinicalEntity` and `Chemical`. It has been finetuned on English, Spanish, French and Italian data, although it may also work with other languages.
48
+
49
+ - 📖 Paper: [Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain]()
50
+ - 🌐 Project Website: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
51
+
52
+
53
+ <p align="center">
54
+ <br>
55
+ <img src="https://raw.githubusercontent.com/ikergarcia1996/Sequence-Labeling-LLMs/main/resources/MedT5-Ner-mtask.png" style="width: 60%;">
56
+ <br>
57
+
58
+ # Open Source Models
59
+ <table border="1" cellspacing="0" cellpadding="5">
60
+ <thead>
61
+ <tr>
62
+ <th></th>
63
+ <th>Medical mT5-Large (<a href="https://huggingface.co/HiTZ/Medical-mT5-large">HiTZ/Medical-mT5-large</a>)</th>
64
+ <th>Medical mT5-XL (<a href="https://huggingface.co/HiTZ/Medical-mT5-xl">HiTZ/Medical-mT5-xl</a>)</th>
65
+ <th>Medical mT5-Large-multitask (<a href="https://huggingface.co/HiTZ/Medical-mT5-large-multitask">HiTZ/Medical-mT5-large-multitask</a>)</th>
66
+ <th>Medical mT5-XL-multitask (<a href="https://huggingface.co/HiTZ/Medical-mT5-xl-multitask">HiTZ/Medical-mT5-xl-multitask</a>)</th>
67
+ </tr>
68
+ </thead>
69
+ <tbody>
70
+ <tr>
71
+ <td>Param. no.</td>
72
+ <td>738M</td>
73
+ <td>3B</td>
74
+ <td>738M</td>
75
+ <td>3B</td>
76
+ </tr>
77
+ <tr>
78
+ <td>Task</td>
79
+ <td>Language Modeling</td>
80
+ <td>Language Modeling</td>
81
+ <td>Multitask Sequence Labeling</td>
82
+ <td>Multitask Sequence Labeling</td>
83
+ </tr>
85
+ </tbody>
86
+ </table>
87
+
88
+
89
+ # Usage
90
+
91
+ Medical mT5-XL-multitask was trained using the *Sequence-Labeling-LLMs* library: https://github.com/ikergarcia1996/Sequence-Labeling-LLMs/
92
+ This library uses constrained decoding to ensure that the output contains the same words as the input and a valid HTML annotation. We recommend using Medical mT5-XL-multitask together with this library.
93
+ You can also use the model directly with 🤗 Transformers. To label a sentence, prepend the labels you want to use as tags. For example, to label *diseases*, format your input as follows: `<Disease> Torsade de pointes ventricular tachycardia during low dose intermittent dobutamine treatment in a patient with dilated cardiomyopathy and congestive heart failure .`
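
The input format shown above, a label tag placed at the start of the sentence, can be wrapped in a small helper. The following is a minimal sketch, not part of the model's tooling: `format_input` is a hypothetical name, and joining several label tags with spaces is an assumption, since this README only shows inputs with a single label. The example sentences are taken from the widget examples in this model card, which appear to be whitespace-tokenized.

```python
from typing import List, Sequence


def format_input(labels: Sequence[str], sentence: str) -> str:
    """Prefix a sentence with one <Label> tag per requested label.

    Assumption: multiple labels are separated by single spaces. The README
    only documents the single-label form, e.g. "<Disease> some sentence .".
    """
    if not labels:
        raise ValueError("at least one label is required")
    prefix = " ".join(f"<{label}>" for label in labels)
    return f"{prefix} {sentence.strip()}"


# (labels, sentence) pairs taken from the widget examples of this model card
examples = [
    (
        ["Disease"],
        "Torsade de pointes ventricular tachycardia during low dose "
        "intermittent dobutamine treatment in a patient with dilated "
        "cardiomyopathy and congestive heart failure .",
    ),
    (
        ["ClinicalEntity"],
        "Nell ’ anamnesi patologica era riferita ipertensione "
        "arteriosa controllata con terapia medica",
    ),
]

inputs: List[str] = [format_input(labels, text) for labels, text in examples]
for line in inputs:
    print(line)
```

Each string in `inputs` can be passed to the tokenizer in the same way as `input_example` in the snippet that follows.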
94
+
95
+ ```python
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ # Hub repository ID, as linked in the Open Source Models table
+ model_name = "HiTZ/Medical-mT5-xl-multitask"
+
+ # Load the model in bfloat16; device_map="auto" requires the accelerate package
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ # The label to annotate is given as a tag at the start of the sentence
+ input_example = "<Disease> Torsade de pointes ventricular tachycardia during low dose intermittent dobutamine treatment in a patient with dilated cardiomyopathy and congestive heart failure ."
+
+ model_input = tokenizer(input_example, return_tensors="pt")
+
+ # Greedy decoding: a single beam and no sampling
+ output = model.generate(**model_input.to(model.device), max_new_tokens=128, num_beams=1, num_return_sequences=1, do_sample=False)
+
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
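
When the model is called directly as in the snippet above, the constrained decoding of the *Sequence-Labeling-LLMs* library is not applied, so it can be worth checking the generated text before using it. The sketch below shows one way to do that under stated assumptions: it assumes the output wraps each entity in a matching `<Label> ... </Label>` pair without nesting, which this README does not spell out. The `tagged` string is a hand-written illustration of such an output, not real model output, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: entities appear as non-nested <Label> ... </Label> pairs.
TAG_PAIR = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)
ANY_TAG = re.compile(r"</?\w+>")


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Return (label, text) for every <Label> ... </Label> span."""
    return [(m.group(1), m.group(2)) for m in TAG_PAIR.finditer(tagged)]


def strip_tags(text: str) -> List[str]:
    """Remove every tag and split the remaining text on whitespace."""
    return ANY_TAG.sub(" ", text).split()


def words_preserved(model_input: str, tagged: str) -> bool:
    """Check that input and output contain the same words in the same order."""
    return strip_tags(model_input) == strip_tags(tagged)


source = (
    "<Disease> Torsade de pointes ventricular tachycardia during low dose "
    "intermittent dobutamine treatment in a patient with dilated "
    "cardiomyopathy and congestive heart failure ."
)
# Hypothetical output, written by hand to illustrate the assumed format
tagged = (
    "<Disease> Torsade de pointes ventricular tachycardia </Disease> during "
    "low dose intermittent dobutamine treatment in a patient with "
    "<Disease> dilated cardiomyopathy </Disease> and "
    "<Disease> congestive heart failure </Disease> ."
)

if words_preserved(source, tagged):
    for label, text in extract_entities(tagged):
        print(f"{label}: {text}")
```

If `words_preserved` returns `False`, the output dropped, added or altered words relative to the input, which is the situation the library's constrained decoding is designed to rule out.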
110
+
111
+ # Performance
112
+ <img src="https://raw.githubusercontent.com/ikergarcia1996/Sequence-Labeling-LLMs/main/resources/multitask_performance.png" style="width: 70%;">
113
+
114
+ # Model Description
115
+
116
+ - **Developed by**: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata and Andrea Zaninello
117
+ - **Contact**: [Iker García-Ferrero](https://ikergarcia1996.github.io/Iker-Garcia-Ferrero/) and [Rodrigo Agerri](https://ragerri.github.io/)
118
+ - **Website**: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
119
+ - **Funding**: CHIST-ERA XAI 2019 call. Antidote (PCI2020-120717-2) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR
120
+ - **Model type**: text2text-generation
121
+ - **Language(s) (NLP)**: English, Spanish, French, Italian
122
+ - **License**: apache-2.0
123
+ - **Finetuned from model**: HiTZ/Medical-mT5-xl
124
+
125
+
126
+ # Ethical Statement
127
+ <p align="justify">
128
+ Our research in developing Medical mT5, a multilingual text-to-text model for the medical domain, has ethical implications that we acknowledge.
129
+ Firstly, the broader impact of this work lies in its potential to improve medical communication and understanding across languages, which
130
+ can enhance healthcare access and quality for diverse linguistic communities. However, it also raises ethical considerations related to privacy and data security.
131
+ To create our multilingual corpus, we have taken measures to anonymize and protect sensitive patient information, adhering to
132
+ data protection regulations in each language's jurisdiction or deriving our data from sources that explicitly address this issue in line with
133
+ privacy and safety regulations and guidelines. Furthermore, we are committed to transparency and fairness in our model's development and evaluation.
134
+ We have worked to ensure that our benchmarks are representative and unbiased, and we will continue to monitor and address any potential biases in the future.
135
+ Finally, we emphasize our commitment to open source by making our data, code, and models publicly available, with the aim of promoting collaboration within
136
+ the research community.
137
+ </p>
138
+
139
+ # Citation
140
+
141
+ We will release a paper soon. For now, you can cite this work as:
142
+
143
+ ```bibtex
144
+ @inproceedings{medical-mt5,
145
+ title = "{{Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain}}",
146
+ author = "Iker García-Ferrero and Rodrigo Agerri and Aitziber Atutxa Salazar and Elena Cabrio and Iker de la Iglesia and Alberto Lavelli and Bernardo Magnini and Benjamin Molinet and Johana Ramirez-Romero and German Rigau and Jose Maria Villa-Gonzalez and Serena Villata and Andrea Zaninello",
147
+ booktitle = "Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)",
148
+ year = 2024 }
149
+
150
+ ```