🤗 + 📚🩺🇮🇹 = BioBIT
From this repository you can download the BioBIT (Biomedical Bert for ITalian) checkpoint.
BioBIT stems from Italian XXL BERT, which was pretrained on a recent Wikipedia dump and various Italian texts from the OPUS and OSCAR corpora collections, for a total corpus size of 81 GB and 13B tokens.
To pretrain BioBIT, we followed the general approach outlined in the BioBERT paper, which builds on the BERT architecture. The pretraining objective combines MLM (Masked Language Modelling) and NSP (Next Sentence Prediction). For MLM, 15% of the tokens in the input sequence are randomly masked and the model is trained to predict them; for NSP, the model is given a pair of sentences and must predict whether the second follows the first in the original document.
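The two objectives can be illustrated with a minimal, library-free sketch. This is not the actual BioBIT pretraining code: the helper names `mask_tokens` and `make_nsp_pair` are hypothetical, tokens are plain whitespace-split words rather than WordPiece units, every selected position is replaced by `[MASK]`, and NSP negatives are drawn from the same sentence list for simplicity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the positions with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the tokens the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets

def make_nsp_pair(sentences, index, rng):
    """Build one NSP example from a list of at least 3 sentences.

    Requires index + 1 < len(sentences). With probability 0.5 the second
    sentence is the true successor (label 1); otherwise it is a different
    sentence picked at random from the list (label 0).
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], 1
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), 0

if __name__ == "__main__":
    # Made-up Italian sentences, used only for illustration.
    doc = [
        "Il paziente presenta febbre alta.",
        "La terapia antibiotica viene avviata subito.",
        "Dopo tre giorni i sintomi migliorano.",
        "Il paziente viene dimesso.",
    ]
    masked, targets = mask_tokens(" ".join(doc).split())
    print(masked)
    print(targets)
    print(make_nsp_pair(doc, 0, random.Random(0)))
```

In a real pipeline these steps would typically be handled by the tokenizer and data collator of the training framework rather than written by hand.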
No Italian equivalent exists for the millions of abstracts and full-text scientific papers used to train English BERT-based biomedical models. In this work, we therefore used machine translation to obtain an Italian biomedical corpus based on PubMed abstracts, and trained BioBIT on it. More details are available in the paper.
BioBIT has been evaluated on 3 downstream tasks: NER (Named Entity Recognition), extractive QA (Question Answering), RE (Relation Extraction). Here are the results, summarized:
- NER:
  - BC2GM = 82.14%
  - BC4CHEMD = 80.70%
  - BC5CDR(CDR) = 82.15%
  - BC5CDR(DNER) = 76.27%
  - NCBI_DISEASE = 65.06%
  - SPECIES-800 = 61.86%
- QA:
- RE:
Check the full paper for further details, and feel free to contact us if you have any questions!