alibert-quinten committed
Commit • 6850efc • Parent(s): e919f92
Update README.md
Description of corpus updated.
README.md
CHANGED
@@ -45,16 +45,9 @@ Here are the main contributions of our work:
 The Paper can be found here: https://aclanthology.org/2023.bionlp-1.19/

 # Data
-The pre-training corpus was gathered from different sub-corpora.It is composed of 7GB French biomedical textual documents.

-|Dataset name| Quantity| Size |
-|----|---|---|
-|Drug leaflets (Base de données publique des médicament)| 23K| 550Mb |
-|RCP (a French equivalent of Physician’s Desk Reference)| 35K| 2200Mb|
-|Articles (biomedical articles from ScienceDirect)| 500K| 4300Mb |
-|Thesis (Thesis manuscripts in French)| 300K| 300Mb |
-|Cochrane (articles from Cochrane database)| 7.6K| 27Mb|
-*Table 1: Pretraining dataset*

 # How to use alibert-quinten/Oncology-NER with HuggingFace

@@ -115,28 +108,10 @@ The model is evaluated on two (CAS and QUAERO) publically available Frech biomed
 </thead>
 <tbody>
 <tr>
-<td>Entities</td>
-<td>P<br></td>
-<td>R</td>
-<td>F1</td>
-<td>P<br></td>
-<td>R</td>
-<td>F1</td>
-<td>P<br></td>
-<td>R</td>
-<td>F1</td>
 </tr>
 <tr>
-<td>Substance</td>
-<td>0.96</td>
-<td>0.87</td>
-<td>0.91</td>
-<td>0.96</td>
-<td>0.91</td>
-<td>0.93</td>
-<td>0.83</td>
-<td>0.83</td>
-<td>0.82</td>
 </tr>
 <tr>
 <td>Symptom</td> <td>0.89</td> <td>0.91</td> <td>0.90</td> <td>0.96</td> <td>0.98</td> <td>0.97</td> <td>0.93</td> <td>0.90</td> <td>0.91</td>
@@ -155,7 +130,7 @@ The model is evaluated on two (CAS and QUAERO) publically available Frech biomed
 </tr>
 </tbody>
 </table>
-Table

 #### QUAERO dataset

@@ -192,6 +167,6 @@ Table 2: NER performances on CAS dataset
 </tr>
 </tbody>
 </table>
-Table

 ##AliBERT: A Pre-trained Language Model for French Biomedical Text
 The Paper can be found here: https://aclanthology.org/2023.bionlp-1.19/

 # Data
+The pre-training corpus was gathered from several sub-corpora and comprises 7GB of French biomedical text. Scientific articles were collected from ScienceDirect through a subscription API, selecting French-language articles in the biomedical domain. Summaries of thesis manuscripts were collected from the "Système universitaire de documentation (SuDoc)", the catalogue of the French university documentation system. Short texts and some complete sentences were collected from the public drug database ("Base de données publique des médicaments"), which lists the characteristics of tens of thousands of drugs. Finally, a similar drug database, the "Résumé des Caractéristiques du Produit (RCP)", was used; it provides descriptions of medications intended for biomedical professionals.
+

 # How to use alibert-quinten/Oncology-NER with HuggingFace

 </thead>
 <tbody>
 <tr>
+<td>Entities</td><td>P<br></td><td>R</td><td>F1</td><td>P<br></td><td>R</td><td>F1</td><td>P<br></td><td>R</td><td>F1</td>
 </tr>
 <tr>
+<td>Substance</td><td>0.96</td><td>0.87</td><td>0.91</td><td>0.96</td><td>0.91</td><td>0.93</td><td>0.83</td><td>0.83</td><td>0.82</td>
 </tr>
 <tr>
 <td>Symptom</td> <td>0.89</td> <td>0.91</td> <td>0.90</td> <td>0.96</td> <td>0.98</td> <td>0.97</td> <td>0.93</td> <td>0.90</td> <td>0.91</td>
 </tr>
 </tbody>
 </table>
+Table 1: NER performances on CAS dataset

 #### QUAERO dataset

 </tr>
 </tbody>
 </table>
+Table 2: NER performances on QUAERO dataset

 ##AliBERT: A Pre-trained Language Model for French Biomedical Text
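The P, R, and F1 columns in the tables above are related by the standard F1 formula, F1 = 2PR / (P + R). A minimal sketch to sanity-check a reported row (the helper name `f1_score` is illustrative, not from the repository):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 in the tables)."""
    return 2 * precision * recall / (precision + recall)

# Substance row of the CAS table: P=0.96, R=0.87, reported F1 = 0.91
print(round(f1_score(0.96, 0.87), 2))  # → 0.91
```

The Symptom row checks out the same way: `round(f1_score(0.89, 0.91), 2)` gives 0.90, matching the table.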