Commit 1885037 by joeranbosma · verified · 1 Parent(s): e9986d4

Update README.md

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -4,14 +4,12 @@ license: cc-by-nc-sa-4.0
 
  # DRAGON RoBERTa base domain-specific
 
- Pretrained model on Dutch clinical reports using a masked language modeling (MLM) objective. It was introduced in [this](TODO: add link) paper.The model was pretrained using domain-specific data (i.e., clinical reports) from scratch. The architecture is the same as [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) from HuggingFace. The tokenizer was fitted to the dataset of Dutch medical reports, using the same settings for the tokenizer as [`roberta-base`](https://huggingface.co/FacebookAI/roberta-base).
 
 
 
  ## Model description
- RoBERTa is a transformers model that was pretrained on a large corpus of Dutch clinical reports in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the MLM objective:
-
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the sentence.
 
  This way, the model learns an inner representation of the Dutch medical language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled reports, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
 
@@ -66,16 +64,20 @@ For pretraining, 4,333,201 clinical reports (466,351 consecutive patients) were
  ## Training procedure
 
  ### Pretraining
  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
  - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
  - In the 10% remaining cases, the masked tokens are left as is.
 
  ### Pretraining hyperparameters
 
  The following hyperparameters were used during pretraining:
- - `learning_rate`: 0.0006
  - `train_batch_size`: 16
  - `eval_batch_size`: 16
  - `seed`: 42
 
 
  # DRAGON RoBERTa base domain-specific
 
+ Pretrained model on Dutch clinical reports using a masked language modeling (MLM) objective. It was introduced in [this](#pending) paper. The model was pretrained using domain-specific data (i.e., clinical reports) from scratch. The architecture is the same as [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) from HuggingFace. The tokenizer was fitted to the dataset of Dutch medical reports, using the same settings for the tokenizer as [`roberta-base`](https://huggingface.co/FacebookAI/roberta-base).
 
 
 
  ## Model description
+ RoBERTa is a transformers model that was pretrained on a large corpus of Dutch clinical reports in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way with an automatic process to generate inputs and labels from those texts.
 
  This way, the model learns an inner representation of the Dutch medical language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled reports, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
 
 
  ## Training procedure
 
  ### Pretraining
+ The model was pretrained using masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the sentence.
+
  The details of the masking procedure for each sentence are the following:
  - 15% of the tokens are masked.
  - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
  - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
  - In the 10% remaining cases, the masked tokens are left as is.
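
The masking procedure listed above can be illustrated with a minimal, standard-library sketch. It is not the data collator used by `run_mlm.py`. The function name `mask_tokens`, the toy vocabulary, and the use of whole words as tokens are assumptions made for illustration only; the real pipeline works on subword token IDs.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol as written in the list above


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Hypothetical helper: corrupt `tokens` following the 15% / 80-10-10 scheme.

    Returns (corrupted, labels). labels[i] holds the original token at the
    positions selected for prediction and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected positions: replace with the mask symbol
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            candidates = [t for t in vocab if t != token]
            if candidates:
                corrupted[i] = rng.choice(candidates)
        # remaining 10%: leave the token as is
    return corrupted, labels


# Toy example; the vocabulary and sentence are made up for illustration.
vocab = ["geen", "afwijkingen", "in", "de", "longen", "hart", "normaal"]
sentence = ["geen", "afwijkingen", "in", "de", "longen"]
corrupted, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Only the positions with a non-`None` label contribute to the MLM loss, which is why leaving 10% of the selected tokens unchanged still provides a training signal.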
 
+ The HuggingFace implementation was used for pretraining: [`run_mlm.py`](https://github.com/huggingface/transformers/blob/7c6ec195adbfcd22cb6baeee64dd3c24a4b80c74/examples/pytorch/language-modeling/run_mlm.py).
+
  ### Pretraining hyperparameters
 
  The following hyperparameters were used during pretraining:
+ - `learning_rate`: 6e-4
  - `train_batch_size`: 16
  - `eval_batch_size`: 16
  - `seed`: 42
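
As a rough sketch, the hyperparameters shown above could be turned into command-line arguments for `run_mlm.py`. The mapping of `train_batch_size` and `eval_batch_size` to the per-device flags is an assumption, and the resulting command is deliberately incomplete: the model configuration, tokenizer, data files, and output directory are not given here and are left out rather than guessed.

```python
import shlex

# Values copied from the hyperparameter list above.
hyperparameters = {
    "learning_rate": 6e-4,
    "train_batch_size": 16,
    "eval_batch_size": 16,
    "seed": 42,
}

# Assumed mapping from the reported names to Trainer command-line flags.
FLAG_NAMES = {
    "learning_rate": "--learning_rate",
    "train_batch_size": "--per_device_train_batch_size",
    "eval_batch_size": "--per_device_eval_batch_size",
    "seed": "--seed",
}


def build_args(params):
    """Hypothetical helper: flatten a hyperparameter dict into CLI arguments."""
    args = []
    for name, value in params.items():
        args += [FLAG_NAMES[name], str(value)]
    return args


# Partial command only; required arguments such as the output directory
# and the training data are intentionally omitted.
command = ["python", "run_mlm.py", *build_args(hyperparameters)]
print(shlex.join(command))
```

Note that `6e-4` and `0.0006` are the same value, so the change to `learning_rate` in this commit is purely notational.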