tlemberger committed c615d3a (1 parent: 404b681)

update with new training on geneprod roles

Files changed (1): README.md (+81 -0)

---
language:
- english
thumbnail:
tags:
- token classification
license: agpl-3.0
datasets:
- EMBO/sd-panels
metrics:
-
---

# sd-roles

## Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling task on a compendium of English scientific text from the life sciences, the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `ROLES` task to perform purely context-dependent semantic role classification of bioentities.

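As a minimal sketch (not part of the original card), the token classification head and its label set can be inspected directly from the model configuration; the labels are expected to correspond to those listed under "Training procedure" below:

```python
from transformers import RobertaForTokenClassification

# Load the fine-tuned token classification model from the Hugging Face hub.
model = RobertaForTokenClassification.from_pretrained('EMBO/sd-roles')

# Inspect the classification head: number of labels and their names.
print(model.config.num_labels)
print(model.config.id2label)
```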

## Intended uses & limitations

#### How to use

The intended use of this model is to infer the semantic role of gene products (genes and proteins) with regard to the causal hypotheses tested in the experiments reported in scientific papers.

For a quick check of the model:

```python
from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification

# The <mask> tokens stand in for the gene products whose roles are inferred from the context.
example = """<s>The <mask> overexpression in cells caused an increase in <mask> expression.</s>"""

# The model must be paired with the roberta-base tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
model = RobertaForTokenClassification.from_pretrained('EMBO/sd-roles')

ner = pipeline('ner', model=model, tokenizer=tokenizer)
res = ner(example)
for r in res:
    print(r['word'], r['entity'])
```
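
Depending on the installed `transformers` version, the pipeline returns one prediction per sub-word token. Continuing the snippet above, a minimal sketch (not part of the original card) groups predictions into whole entities with the pipeline's aggregation option; older versions use `grouped_entities=True` instead:

```python
# Sketch only: requires a transformers version that supports aggregation_strategy.
ner_grouped = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for r in ner_grouped(example):
    print(r['word'], r['entity_group'], round(r['score'], 3))
```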

#### Limitations and bias

The model must be used with the `roberta-base` tokenizer.

## Training data

The model was trained for token classification using the [EMBO/sd-panels dataset](https://huggingface.co/datasets/EMBO/sd-panels), which includes manually annotated examples.

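The dataset can be pulled from the Hugging Face hub with the `datasets` library. The lines below are a minimal sketch: the configuration name, splits and column layout are not documented in this card and are assumptions for illustration only.

```python
from datasets import load_dataset

# Sketch: a task-specific configuration name may be required,
# e.g. load_dataset("EMBO/sd-panels", "ROLES") -- the name "ROLES" is an assumption.
panels = load_dataset("EMBO/sd-panels")

print(panels)              # inspect the available splits
print(panels["train"][0])  # inspect one manually annotated example (column names may differ)
```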

## Training procedure

The training was run on an NVIDIA DGX Station with 4× Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

- Tokenizer vocab size: 50265
- Training data: EMBO/biolang MLM
- Training with 48771 examples.
- Evaluating on 13801 examples.
- Training on 5 features: O, I-CONTROLLED_VAR, B-CONTROLLED_VAR, I-MEASURED_VAR, B-MEASURED_VAR
- Epochs: 0.9
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 0.0001
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0

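These values map directly onto the Hugging Face `TrainingArguments`. The sketch below is not the project's training script (that lives in the soda-roberta repository); it only shows how the listed hyperparameters would be wired into a `Trainer`, with dataset preparation and label alignment omitted. The starting checkpoint for fine-tuning (the BioLang-pretrained language model) is not named in this card, so plain `roberta-base` is used as a placeholder.

```python
from transformers import (RobertaForTokenClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

# Hyperparameters copied from the list above.
training_args = TrainingArguments(
    output_dir="sd-roles",
    num_train_epochs=0.9,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", model_max_length=512)
# Placeholder checkpoint; the actual run started from the BioLang-pretrained model.
model = RobertaForTokenClassification.from_pretrained("roberta-base", num_labels=5)

# trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```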

## Eval results

On 7178 examples of the test set, evaluated with `sklearn.metrics`:

```
                precision    recall  f1-score   support

CONTROLLED_VAR       0.81      0.86      0.83      7835
  MEASURED_VAR       0.82      0.85      0.84      9330

     micro avg       0.82      0.85      0.83     17165
     macro avg       0.82      0.85      0.83     17165
  weighted avg       0.82      0.85      0.83     17165

{'test_loss': 0.03846803680062294, 'test_accuracy_score': 0.9854472664459946, 'test_precision': 0.8156312625250501, 'test_recall': 0.8535974366443344, 'test_f1': 0.8341825841897008, 'test_runtime': 58.7369, 'test_samples_per_second': 122.206, 'test_steps_per_second': 1.924}
```
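
The evaluation script is not reproduced in this card. As a rough sketch, assuming the token-level reference and predicted tags have been flattened into two aligned lists (the variable names and toy values below are hypothetical), a report of this shape can be obtained from `sklearn.metrics`:

```python
from sklearn.metrics import classification_report

# Hypothetical flattened token-level tags; in practice these come from running
# the fine-tuned model over the held-out test panels.
y_true = ["O", "I-CONTROLLED_VAR", "I-MEASURED_VAR", "O"]
y_pred = ["O", "I-CONTROLLED_VAR", "O", "O"]

# Restrict the report to the entity labels so the dominant 'O' class is excluded.
labels = ["I-CONTROLLED_VAR", "B-CONTROLLED_VAR", "I-MEASURED_VAR", "B-MEASURED_VAR"]
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```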