leon93 committed
Commit 552fe10 (verified)
1 Parent(s): 63d41da

Update README.md

Files changed (1): README.md (+109 -50)
README.md CHANGED
@@ -1,63 +1,79 @@
---
library_name: transformers
- tags: []
---

# Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
-

## Model Details

### Model Description

- <!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

@@ -65,11 +81,26 @@ This is the model card of a 🤗 transformers model that has been pushed on the

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

- Use the code below to get started with the model.

[More Information Needed]

@@ -77,38 +108,71 @@ Use the code below to get started with the model.

### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
[More Information Needed]

### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

- [More Information Needed]

#### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

@@ -120,17 +184,15 @@ Use the code below to get started with the model.

#### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]

- ### Results
-
- [More Information Needed]

#### Summary

-

## Model Examination [optional]

@@ -144,35 +206,34 @@ Use the code below to get started with the model.

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

- [More Information Needed]

### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]

#### Software

- [More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

- **BibTeX:**

[More Information Needed]

@@ -182,9 +243,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]

## More Information [optional]

@@ -192,8 +251,8 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

## Model Card Authors [optional]

- [More Information Needed]

## Model Card Contact

- [More Information Needed]
---
library_name: transformers
+ tags:
+ - biology
+ - medical
+ license: cc-by-nc-nd-4.0
+ datasets:
+ - fundacionctic/DermatES
+ language:
+ - es
+ metrics:
+ - accuracy
+ - f1
+ pipeline_tag: text-classification
---

# Model Card for Model ID

+ This is a fine-tuned version of the Spanish pre-trained biomedical language model [bsc-bio-ehr-es](https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es), tailored for text classification. Training was performed on two NVIDIA GPUs.

## Model Details

### Model Description

+ This model has been fine-tuned for text classification on Spanish dermatological electronic health records (EHRs). It leverages the pre-trained biomedical language understanding of the [bsc-bio-ehr-es](https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es) model and adapts it to classify dermatology-related texts effectively.
+ The model predicts one of 25 skin diseases from a medical record, which may come from either a first visit or a follow-up visit.
+ It takes four features as input (see the sketch below):
+ - *textual medical record:* the EHR written by a doctor
+ - *disease type:* the type of disease associated with the EHR
+ - *disease location:* where on the body the disease appears
+ - *disease severity:* how severe or lethal the disease is
+ It is IMPORTANT to load and concatenate these features in this specific order.
+ The details for reproducing the cascade predictions are given in the Training section.
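+
+ As a minimal sketch (with hypothetical Spanish field values), the four features can be joined into a single input string; the " + " separator mirrors the example in the How to Get Started section:
+ ```
+ # Hypothetical example: build the model input by concatenating the four
+ # features in the required order (record + type + location + severity).
+ record = "paciente con lesiones cutaneas en el brazo"  # textual medical record
+ disease_type = "eccema"                                # disease type
+ location = "brazo"                                     # disease location
+ severity = "leve"                                      # disease severity
+ model_input = " + ".join([record, disease_type, location, severity])
+ ```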

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

+ - **Developed by:** [Fundación CTIC](https://www.fundacionctic.org)
+ - **Funded by [optional]:** [SATEC](https://www.satec.es)
+ - **Model type:** Fine-tuned LM encoder
+ - **Language(s) (NLP):** Spanish
+ - **License:** CC-BY-NC-ND 4.0
+ - **Finetuned from model [optional]:** [bsc-bio-ehr-es](https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es)

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

+ - **Repository:**
+ - **Paper [optional]:** Coming soon...
- **Demo [optional]:** [More Information Needed]

## Uses

+ The model is industry-friendly and the best-performing model of the **dermat** collection. The vanilla version, [vanilla-dermat](https://huggingface.co/fundacionctic/vanilla-dermat/), is meant to predict not only the disease but also the three features mentioned above.
+ We DO NOT recommend fine-tuning this model: it is already fine-tuned for a downstream task.

### Direct Use

+ This model can be used directly to classify dermatological text data in Spanish EHRs.

### Downstream Use [optional]
+ The model can be integrated into healthcare applications for the automatic classification of dermatological conditions from patient records.

[More Information Needed]

### Out-of-Scope Use

+ The model is not suitable for non-medical text classification tasks or for texts in languages other than Spanish.

[More Information Needed]

## Bias, Risks, and Limitations

+ This model is fine-tuned on a specific dataset and may not generalize well to other types of medical texts or conditions. Users should be cautious of biases in the training data that could affect the model's performance.

[More Information Needed]

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

+ Users should validate the model's performance on their own data and consider the ethical implications of deploying a machine learning model in a healthcare setting.

## How to Get Started with the Model
+ ```
+ from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
+
+ tokenizer = RobertaTokenizerFast.from_pretrained("fundacionctic/predict-dermat")
+ model = RobertaForSequenceClassification.from_pretrained("fundacionctic/predict-dermat")
+
+ # The input concatenates the four features in the required order:
+ # medical record + disease type + location + severity
+ inputs = tokenizer("Ejemplo de texto dermatológico + tipo + localizacion + gravedad",
+                    truncation=True,
+                    padding='max_length',
+                    max_length=512,  # replace with your desired maximum sequence length
+                    return_tensors='pt',
+                    return_attention_mask=True)
+ outputs = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])
+ ```
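+
+ To turn the raw logits into a disease prediction, a minimal follow-up sketch (assuming the fine-tuned checkpoint ships a populated id2label mapping):
+ ```
+ import torch
+
+ # Pick the highest-scoring class; id2label maps the index to a disease name.
+ predicted_id = torch.argmax(outputs.logits, dim=1).item()
+ print(model.config.id2label[predicted_id])
+ ```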
  [More Information Needed]

### Training Data

+ The model was fine-tuned on the [DermatES](https://huggingface.co/datasets/fundacionctic/DermatES) dataset from Fundación CTIC, which contains Spanish dermatological EHRs.

[More Information Needed]

### Training Procedure
+ In order to reproduce the experiment, it is ESSENTIAL to respect the order of prediction of the three ontology-based features. More details are given in the original *Dermat* paper.
+
+ ```
+ import torch
+ from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
+
+ tokenizer = RobertaTokenizerFast.from_pretrained("PlanTL-GOB-ES/bsc-bio-ehr-es")
+ model = RobertaForSequenceClassification.from_pretrained("PlanTL-GOB-ES/bsc-bio-ehr-es")
+
+ def reset_model():
+     # Each cascade stage restarts from the pre-trained checkpoint
+     return RobertaForSequenceClassification.from_pretrained("PlanTL-GOB-ES/bsc-bio-ehr-es")
+
+ def cascade(texts, information_list, model, tokenizer, predictions=None):
+     # Base case: all three features (type, location, severity) predicted
+     if not information_list:
+         return predictions
+     encoded = tokenizer(texts,
+                         truncation=True,
+                         padding='max_length',
+                         max_length=512,  # replace with your desired maximum sequence length
+                         return_tensors='pt',
+                         return_attention_mask=True)
+     labels = torch.tensor(information_list[0])
+     outputs = model(encoded['input_ids'],
+                     attention_mask=encoded['attention_mask'],
+                     labels=labels)
+     predictions = torch.argmax(outputs.logits, dim=1)
+     # Append the stage's prediction to each record before the next stage
+     texts = [text + ' ' + str(predictions[i].item()) for i, text in enumerate(texts)]
+     model = reset_model()
+     return cascade(texts, information_list[1:], model, tokenizer, predictions)
+
+ inputs = ["un informe", "otro informe"]
+ information_list = [[tipo1, tipo2], [sitio1, sitio2], [gravedad1, gravedad2]]  # per-stage label ids
+ predicted_diseases = cascade(inputs, information_list, model, tokenizer)
+ ```

#### Preprocessing [optional]

+ Texts were lowercased, anonymized, and stripped of accents.
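+
+ A minimal sketch of the lowercasing and accent-stripping steps (the anonymization procedure is dataset-specific and not reproduced here):
+ ```
+ import unicodedata
+
+ def normalize(text: str) -> str:
+     # Lowercase, then decompose accented characters and drop the accent marks
+     text = unicodedata.normalize('NFKD', text.lower())
+     return ''.join(c for c in text if not unicodedata.combining(c))
+
+ print(normalize("Lesión dermatológica"))  # -> "lesion dermatologica"
+ ```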

#### Training Hyperparameters

+ - **Training regime:** fp32

#### Speeds, Sizes, Times [optional]

+ - Epochs: 7
+ - Batch size: 64
+ - Learning rate: 0.0001
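+
+ For reference, a sketch of a 🤗 TrainingArguments configuration matching these values; every other argument is an assumption, not the authors' exact setup:
+ ```
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./predict-dermat",    # assumed output path
+     num_train_epochs=7,               # reported epochs
+     per_device_train_batch_size=64,   # reported batch size
+     learning_rate=1e-4,               # reported learning rate
+ )
+ ```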

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

+ The evaluation was performed on a 20% held-out split of the [DermatES](https://huggingface.co/datasets/fundacionctic/DermatES) dataset.
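+
+ Such a split can be obtained, for example, with the 🤗 datasets API; a sketch under an assumed split name and seed, not the authors' exact protocol:
+ ```
+ from datasets import load_dataset
+
+ # 80/20 split of DermatES; the split name and seed are assumptions.
+ dataset = load_dataset("fundacionctic/DermatES")
+ split = dataset["train"].train_test_split(test_size=0.2, seed=42)
+ train_set, test_set = split["train"], split["test"]
+ ```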

[More Information Needed]

#### Metrics

+ - *Accuracy:* 0.51
+ - *F1 score:* 0.42
+ - *Top-k (k=2) accuracy:* 0.67
+ - *Top-k (k=2) F1 score:* 0.61
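+
+ Top-k accuracy counts a prediction as correct when the gold label is among the k highest-scoring classes. A toy sketch with scikit-learn (an assumption about tooling, not the authors' evaluation script):
+ ```
+ import numpy as np
+ from sklearn.metrics import top_k_accuracy_score
+
+ # Toy example: 3 samples, 4 classes; y_score holds per-class scores.
+ y_true = np.array([0, 2, 1])
+ y_score = np.array([[0.6, 0.2, 0.1, 0.1],
+                     [0.1, 0.5, 0.3, 0.1],
+                     [0.2, 0.4, 0.3, 0.1]])
+ print(top_k_accuracy_score(y_true, y_score, k=2))  # -> 1.0
+ ```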

#### Summary

+ The model achieves poor accuracy and F1 scores on dermatological text classification, demonstrating the need for ontologies (see [oracle-dermat](https://huggingface.co/datasets/fundacionctic/oracle-dermat)) in this specific medical domain.

## Model Examination [optional]

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

+ - **Hardware Type:** GPU and CPU (with AVX)
+ - **Hours used:** >96
- **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** EU
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

+ The model is based on the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) architecture, fine-tuned with a text classification objective in the biomedical domain.

### Compute Infrastructure

+ #### Hardware

+ Two NVIDIA GPUs were used for the fine-tuning process.

  #### Software

+ The fine-tuning was performed using the 🤗 Transformers library.

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

+ **BibTeX:** Coming soon

[More Information Needed]

## Glossary [optional]

## More Information [optional]

## Model Card Authors [optional]

+ Leon-Paul Schaub Torre, Pelayo Quiros and Helena Garcia-Mieres

## Model Card Contact

+ leon.schaub@fundacionctic.org