Add auto-generated model card
README.md CHANGED

---
language: en
license: other
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- tner/bionlp2004
metrics:
- precision
- recall
- f1
widget:
- text: Coexpression of HMG I/Y and Oct-2 in cell lines lacking Oct-2 results in high
    levels of HLA-DRA gene expression , and in vitro DNA-binding studies reveal that
    HMG I/Y stimulates Oct-2A binding to the HLA-DRA promoter .
- text: In erythroid cells most of the transcription activity was contained in a 150
    bp promoter fragment with binding sites for transcription factors AP2 , Sp1 and
    the erythroid-specific GATA-1 .
- text: 'Synergy between signal transduction pathways is obligatory for expression
    of c-fos in B and T cell lines : implication for c-fos control via surface immunoglobulin
    and T cell antigen receptors .'
- text: CIITA mRNA is normally inducible by IFN-gamma in class II non-inducible ,
    RB-defective lines , and in one line , re-expression of RB has no effect on CIITA
    mRNA induction levels .
- text: As we reported previously , MNDA mRNA level in adherent monocytes is elevated
    by IFN-alpha ; in this study , we further assessed MNDA expression in in vitro
    monocyte-derived macrophages .
pipeline_tag: token-classification
co2_eq_emissions:
  emissions: 45.104
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  gpu_model: 1 x NVIDIA GeForce RTX 3090
  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
  ram_total_size: 31.777088165283203
  hours_used: 0.296
model-index:
- name: SpanMarker with bert-base-uncased on BioNLP2004
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: BioNLP2004
      type: tner/bionlp2004
      split: test
    metrics:
    - type: f1
      value: 0.7620637836032726
      name: F1
    - type: precision
      value: 0.7289958470876371
      name: Precision
    - type: recall
      value: 0.7982742537313433
      name: Recall
---

# SpanMarker with bert-base-uncased on BioNLP2004

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004)
- **Language:** en
- **License:** other

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label     | Examples                                                                                         |
|:----------|:-------------------------------------------------------------------------------------------------|
| DNA       | "immunoglobulin heavy-chain enhancer", "enhancer", "immunoglobulin heavy-chain ( IgH ) enhancer" |
| RNA       | "GATA-1 mRNA", "c-myb mRNA", "antisense myb RNA"                                                 |
| cell_line | "monocytic U937 cells", "TNF-treated HUVECs", "HUVECs"                                           |
| cell_type | "B cells", "non-B cells", "human red blood cells"                                                |
| protein   | "ICAM-1", "VCAM-1", "NADPH oxidase"                                                              |

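The label strings in this table are exactly what the model emits at inference time. As a quick illustration, here is a minimal sketch of filtering predictions by label; it assumes the prediction dictionaries documented for SpanMarker, with `span`, `label`, and `score` keys (verify against your installed version):

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")
entities = model.predict(
    "HMG I/Y stimulates Oct-2A binding to the HLA-DRA promoter ."
)
# Each entity is a dict along the lines of:
# {"span": "HLA-DRA promoter", "label": "DNA", "score": 0.98, ...}
dna_spans = [e["span"] for e in entities if e["label"] == "DNA" and e["score"] >= 0.5]
print(dna_spans)
```
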
## Evaluation

### Metrics
| Label     | Precision | Recall | F1     |
|:----------|:----------|:-------|:-------|
| **all**   | 0.7290    | 0.7983 | 0.7621 |
| DNA       | 0.7174    | 0.7505 | 0.7336 |
| RNA       | 0.6977    | 0.7692 | 0.7317 |
| cell_line | 0.5831    | 0.7020 | 0.6370 |
| cell_type | 0.8222    | 0.7381 | 0.7779 |
| protein   | 0.7196    | 0.8407 | 0.7755 |

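The micro-averaged F1 in the **all** row follows directly from the overall precision and recall, since F1 is their harmonic mean; a quick check against the exact values recorded in the metadata above:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.7289958470876371
recall = 0.7982742537313433
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7621, matching the "all" row
```
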
## Uses

### Direct Use

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")
# Run inference
entities = model.predict("In erythroid cells most of the transcription activity was contained in a 150 bp promoter fragment with binding sites for transcription factors AP2 , Sp1 and the erythroid-specific GATA-1 .")
```

### Downstream Use
You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer with the model and your dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-bert-base-uncased-bionlp-finetuned")
```
</details>

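The snippet above assumes a Hub dataset, but any 🤗 `datasets.Dataset` with the same column layout works. A minimal sketch of building one in memory; the sentences, label names, and tag ids below are invented for illustration:

```python
from datasets import ClassLabel, Dataset, Features, Sequence, Value

# Hypothetical IOB2 label set; real datasets define their own.
names = ["O", "B-protein", "I-protein"]
features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=names)),
})
train_dataset = Dataset.from_dict(
    {
        "tokens": [["ICAM-1", "is", "induced"], ["NADPH", "oxidase", "activity"]],
        "ner_tags": [[1, 0, 0], [1, 2, 0]],
    },
    features=features,
)
print(train_dataset)
```
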
## Training Details

### Training Set Metrics
| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 2   | 26.5790 | 166 |
| Entities per sentence | 0   | 2.7528  | 23  |

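These statistics can be recomputed from the training split. A sketch, with one loud assumption: it treats odd tag ids as B-* (entity-opening) labels, which matches the usual `tner` label layout but should be checked against the dataset's own `label2id` before trusting the entity counts:

```python
import statistics
from datasets import load_dataset

train = load_dataset("tner/bionlp2004", split="train")

# Sentence length statistics over the tokenized training sentences.
lengths = [len(tokens) for tokens in train["tokens"]]
print(min(lengths), statistics.median(lengths), max(lengths))

# Assumption: odd ids are B-* tags, so each one opens a new entity.
entity_counts = [sum(tag % 2 == 1 for tag in tags) for tags in train["tags"]]
print(min(entity_counts), statistics.median(entity_counts), max(entity_counts))
```
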
### Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3

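These hyperparameters map directly onto 🤗 Transformers `TrainingArguments`, which the SpanMarker `Trainer` accepts via its `args` parameter. A sketch of an equivalent configuration (the output directory name is an arbitrary choice):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/span-marker-bert-base-uncased-bionlp",  # arbitrary
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
)
# Then: Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
```
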
### Training Results
| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.4505 | 300  | 0.0210          | 0.7497               | 0.7659            | 0.7577        | 0.9254              |
| 0.9009 | 600  | 0.0162          | 0.8048               | 0.8217            | 0.8131        | 0.9432              |
| 1.3514 | 900  | 0.0154          | 0.8126               | 0.8249            | 0.8187        | 0.9434              |
| 1.8018 | 1200 | 0.0149          | 0.8148               | 0.8451            | 0.8296        | 0.9481              |
| 2.2523 | 1500 | 0.0150          | 0.8297               | 0.8438            | 0.8367        | 0.9501              |
| 2.7027 | 1800 | 0.0145          | 0.8280               | 0.8443            | 0.8361        | 0.9501              |

### Environmental Impact
Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
- **Carbon Emitted**: 0.045 kg of CO2
- **Hours Used**: 0.296 hours

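The `co2_eq_emissions` entry in the metadata records 45.104 grams, i.e. the ~0.045 kg reported here. A minimal sketch of how such a CodeCarbon measurement is typically taken (the workload below is a placeholder standing in for the training run):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes an emissions.csv by default
tracker.start()
sum(i * i for i in range(10_000_000))  # placeholder workload; training ran here
emissions_kg = tracker.stop()  # total emissions in kg CO2-eq
print(f"{emissions_kg:.6f} kg CO2-eq")
```
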
### Training Hardware
- **On Cloud**: No
- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
- **RAM Size**: 31.78 GB

### Framework Versions
- Python: 3.9.16