---
tags:
- medical
---
# ClinicalBERT
This model card describes ClinicalBERT, a model pretrained on a large multicenter dataset that we constructed: a corpus of 1.2 billion words covering a diverse range of diseases.
The base language model was then fine-tuned on a large-scale corpus of electronic health records (EHRs) from over 3 million patient records.
## Pretraining Data
ClinicalBERT was pretrained on a large multicenter dataset that we constructed, a corpus of 1.2 billion words covering a diverse range of diseases.
## Model Pretraining
### Pretraining Procedures
ClinicalBERT was initialized from BERT. Training then followed the masked language modeling objective: given a piece of text, some tokens are randomly replaced with the special [MASK] token, and the model is trained to predict the original tokens from the surrounding context.
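As a quick illustration of this objective, the model can be asked to fill in a masked token with the `fill-mask` pipeline (a minimal sketch; it assumes the released checkpoint includes its masked-language-modeling head, and the example sentence is illustrative):
```python
from transformers import pipeline

# Let the model predict the token hidden behind the tokenizer's mask token.
fill_mask = pipeline("fill-mask", model="medicalai/ClinicalBERT")
masked = f"The patient was treated with {fill_mask.tokenizer.mask_token} for the bacterial infection."
for prediction in fill_mask(masked):
    print(prediction["token_str"], prediction["score"])
```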
### Pretraining Hyperparameters
We used a batch size of 32, a maximum sequence length of 256, and a learning rate of 5e-5 for pretraining the model.
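As a rough illustration, these hyperparameters map onto a `transformers` masked-language-modeling setup as follows (a minimal sketch, not the original training pipeline; the base checkpoint, toy corpus, and masking probability are assumptions):
```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative base checkpoint and toy corpus; the actual pretraining corpus is 1.2B words.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
dataset = Dataset.from_dict({"text": ["The patient was diagnosed with community-acquired pneumonia."]})

def tokenize(batch):
    # Truncate to the 256-token maximum sequence length used for pretraining.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks tokens on the fly for the masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="clinicalbert-pretraining",
    per_device_train_batch_size=32,  # batch size of 32
    learning_rate=5e-5,              # learning rate of 5e-5
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=tokenized)
# trainer.train()  # launches the pretraining loop
```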
## How to use the model
Load the model and tokenizer via the `transformers` library:
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("medicalai/ClinicalBERT")
model = AutoModel.from_pretrained("medicalai/ClinicalBERT")
```
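The loaded model can then encode clinical text into contextual embeddings, for example (a minimal sketch; the input sentence and the choice of the last hidden state are illustrative):
```python
import torch

text = "The patient presented with chest pain and shortness of breath."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings with shape (batch_size, sequence_length, hidden_size).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```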