---
license: cc-by-nc-sa-4.0
---

# DRAGON RoBERTa base domain-specific

Pretrained model on Dutch clinical reports using a masked language modeling (MLM) objective. It was introduced in [this](TODO: add link) paper. The model was pretrained from scratch using domain-specific data (i.e., clinical reports). The architecture is the same as [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) from HuggingFace. The tokenizer was fitted to the dataset of Dutch medical reports, using the same tokenizer settings as [`roberta-base`](https://huggingface.co/FacebookAI/roberta-base).

## Model description
RoBERTa is a transformers model that was pretrained on a large corpus of Dutch clinical reports in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way; an automatic process was used to generate inputs and labels from those texts. More precisely, it was pretrained with the MLM objective:

Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.

This way, the model learns an inner representation of the Dutch medical language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled reports, for instance, you can train a standard classifier using the features produced by this model as inputs.
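
For instance, the snippet below is a minimal sketch (an illustration, not the authors' fine-tuning setup) of loading this model with a randomly initialized classification head for report-level classification; the two-label setting is a hypothetical assumption.

```python
# Minimal sketch (assumption): fine-tuning setup for report-level classification.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-roberta-base-domain-specific")
model = AutoModelForSequenceClassification.from_pretrained(
    "joeranbosma/dragon-roberta-base-domain-specific",
    num_labels=2,  # hypothetical binary report-level label
)
# The classification head is randomly initialized and still needs to be trained
# on labeled reports, e.g. with the Hugging Face Trainer.
```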

## Model variations
Multiple architectures were pretrained for the DRAGON challenge.

| Model | #params | Language |
|------------------------|--------------------------------|-------|
| [`joeranbosma/dragon-bert-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-bert-base-mixed-domain) | 109M | Dutch → Dutch |
| [`joeranbosma/dragon-roberta-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-roberta-base-mixed-domain) | 278M | Multiple → Dutch |
| [`joeranbosma/dragon-roberta-large-mixed-domain`](https://huggingface.co/joeranbosma/dragon-roberta-large-mixed-domain) | 560M | Multiple → Dutch |
| [`joeranbosma/dragon-longformer-base-mixed-domain`](https://huggingface.co/joeranbosma/dragon-longformer-base-mixed-domain) | 149M | English → Dutch |
| [`joeranbosma/dragon-longformer-large-mixed-domain`](https://huggingface.co/joeranbosma/dragon-longformer-large-mixed-domain) | 435M | English → Dutch |
| [`joeranbosma/dragon-bert-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-bert-base-domain-specific) | 109M | Dutch |
| [`joeranbosma/dragon-roberta-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-roberta-base-domain-specific) | 278M | Dutch |
| [`joeranbosma/dragon-roberta-large-domain-specific`](https://huggingface.co/joeranbosma/dragon-roberta-large-domain-specific) | 560M | Dutch |
| [`joeranbosma/dragon-longformer-base-domain-specific`](https://huggingface.co/joeranbosma/dragon-longformer-base-domain-specific) | 149M | Dutch |
| [`joeranbosma/dragon-longformer-large-domain-specific`](https://huggingface.co/joeranbosma/dragon-longformer-large-domain-specific) | 435M | Dutch |

## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole text (e.g., a clinical report) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation you should look at a model like GPT-2.

## How to use
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="joeranbosma/dragon-roberta-base-domain-specific")
unmasker("Dit onderzoek geen aanwijzingen voor significant carcinoom. PIRADS <mask>.")
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-roberta-base-domain-specific")
model = AutoModel.from_pretrained("joeranbosma/dragon-roberta-base-domain-specific")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```

## Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data
For pretraining, 4,333,201 clinical reports (466,351 consecutive patients) were selected from Ziekenhuisgroep Twente from patients with a diagnostic or interventional visit between 13 July 2000 and 25 April 2023. 180,439 duplicate clinical reports (179,808 patients) were excluded, resulting in 4,152,762 included reports (463,692 patients). These reports were split into training (80%, 3,322,209 reports), validation (10%, 415,276 reports), and testing (10%, 415,277 reports). The testing reports were set aside for future analysis and are not used for pretraining.

## Training procedure

### Pretraining
The details of the masking procedure for each sentence are the following (a sketch reproducing this scheme is shown after the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
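
This masking scheme matches the dynamic masking applied by the Hugging Face `DataCollatorForLanguageModeling` with `mlm_probability=0.15`. The snippet below is a rough sketch (an assumption, not the authors' exact pretraining code) of how such masked batches can be produced; the example report text is a hypothetical placeholder.

```python
# Sketch (assumption): dynamic MLM masking with the 15% / 80-10-10 scheme.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("joeranbosma/dragon-roberta-base-domain-specific")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of tokens are selected for masking
)

# Collate a single tokenized example (hypothetical Dutch report text).
batch = data_collator([tokenizer("Dit is een voorbeeldverslag.")])
print(batch["input_ids"])  # masked input ids
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```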

### Pretraining hyperparameters

The following hyperparameters were used during pretraining (the sketch after the list shows how they could map onto Hugging Face `TrainingArguments`):
- `learning_rate`: 0.0006
- `train_batch_size`: 16
- `eval_batch_size`: 16
- `seed`: 42
- `gradient_accumulation_steps`: 16
- `total_train_batch_size`: 256
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `num_epochs`: 10.0
- `max_seq_length`: 512
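
As an illustration (an assumption, not the authors' released training script), these settings roughly correspond to the following `TrainingArguments`; `max_seq_length` is applied during tokenization rather than here, and the listed Adam settings are the optimizer defaults.

```python
# Sketch (assumption): the listed hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./dragon-roberta-base-domain-specific",  # hypothetical output path
    learning_rate=6e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,  # 16 x 16 = total train batch size of 256
    num_train_epochs=10.0,
    lr_scheduler_type="linear",
    seed=42,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 corresponds to the defaults.
)
```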

### Framework versions

- Transformers 4.29.0.dev0
- Pytorch 2.0.0+cu117
- Datasets 2.11.0
- Tokenizers 0.13.3

## Evaluation results

Pending evaluation on the DRAGON benchmark.

### BibTeX entry and citation info

```bibtex
@article{PENDING}
```