tomaarsen HF staff committed on
Commit
df37464
·
1 Parent(s): 0d63046

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +109 -11
README.md CHANGED
@@ -1,4 +1,7 @@
1
  ---
2
  library_name: span-marker
3
  tags:
4
  - span-marker
@@ -6,34 +9,92 @@ tags:
6
  - ner
7
  - named-entity-recognition
8
  - generated_from_span_marker_trainer
9
  metrics:
10
  - precision
11
  - recall
12
  - f1
13
- widget: []
14
  pipeline_tag: token-classification
15
  ---
16
 
17
- # SpanMarker
18
 
19
- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
20
 
21
  ## Model Details
22
 
23
  ### Model Description
24
  - **Model Type:** SpanMarker
25
- <!-- - **Encoder:** [Unknown](https://huggingface.co/unknown) -->
26
  - **Maximum Sequence Length:** 256 tokens
27
  - **Maximum Entity Length:** 8 words
28
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
29
- <!-- - **Language:** Unknown -->
30
- <!-- - **License:** Unknown -->
31
 
32
  ### Model Sources
33
 
34
  - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
35
  - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
36
 
37
  ## Uses
38
 
39
  ### Direct Use for Inference
@@ -42,9 +103,9 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
42
  from span_marker import SpanMarkerModel
43
 
44
  # Download from the 🤗 Hub
45
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
46
  # Run inference
47
- entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
48
  ```
49
 
50
  ### Downstream Use
@@ -56,7 +117,7 @@ You can finetune this model on your own dataset.
56
  from span_marker import SpanMarkerModel, Trainer
57
 
58
  # Download from the 🤗 Hub
59
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
60
 
61
  # Specify a Dataset with "tokens" and "ner_tag" columns
62
  dataset = load_dataset("conll2003") # For example CoNLL2003
@@ -68,7 +129,7 @@ trainer = Trainer(
68
  eval_dataset=dataset["validation"],
69
  )
70
  trainer.train()
71
- trainer.save_model("span_marker_model_id-finetuned")
72
  ```
73
  </details>
74
 
@@ -92,6 +153,43 @@ trainer.save_model("span_marker_model_id-finetuned")
92
 
93
  ## Training Details
94
 
95
  ### Framework Versions
96
  - Python: 3.9.16
97
  - SpanMarker: 1.3.1.dev
 
1
  ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
  library_name: span-marker
6
  tags:
7
  - span-marker
 
9
  - ner
10
  - named-entity-recognition
11
  - generated_from_span_marker_trainer
12
+ datasets:
13
+ - acronym_identification
14
  metrics:
15
  - precision
16
  - recall
17
  - f1
18
+ widget:
19
+ - text: "Here, DA = direct assessment, RR = relative ranking, DS = discrete scale and CS = continuous scale."
20
+   example_title: "Example 1"
21
+ - text: "Modifying or replacing the Erasable Programmable Read Only Memory (EPROM) in a phone would allow the configuration of any ESN and MIN via software for cellular devices."
22
+   example_title: "Example 2"
23
+ - text: "We propose a technique called Aggressive Stochastic Weight Averaging (ASWA) and an extension called Norm-filtered Aggressive Stochastic Weight Averaging (NASWA) which improves the stability of models over random seeds."
24
+   example_title: "Example 3"
25
+ - text: "The choice of the encoder and decoder modules of DNPG can be quite flexible, for instance long-short term memory networks (LSTM) or convolutional neural network (CNN)."
26
+   example_title: "Example 4"
27
  pipeline_tag: token-classification
28
+ co2_eq_emissions:
29
+   emissions: 30.818996419923273
30
+   source: codecarbon
31
+   training_type: fine-tuning
32
+   on_cloud: false
33
+   cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
34
+   ram_total_size: 31.777088165283203
35
+   hours_used: 0.204
36
+   hardware_used: 1 x NVIDIA GeForce RTX 3090
37
+ base_model: bert-base-cased
38
+ model-index:
39
+ - name: SpanMarker with bert-base-cased on Acronym Identification
40
+   results:
41
+   - task:
42
+       type: token-classification
43
+       name: Named Entity Recognition
44
+     dataset:
45
+       name: Acronym Identification
46
+       type: acronym_identification
47
+       split: validation
48
+     metrics:
49
+     - type: f1
50
+       value: 0.9336161187698834
51
+       name: F1
52
+     - type: precision
53
+       value: 0.942208904109589
54
+       name: Precision
55
+     - type: recall
56
+       value: 0.9251786464901219
57
+       name: Recall
58
  ---
59
 
60
+ # SpanMarker with bert-base-cased on Acronym Identification
61
 
62
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [Acronym Identification](https://huggingface.co/datasets/acronym_identification) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-cased](https://huggingface.co/bert-base-cased) as the underlying encoder. See [train.py](train.py) for the training script.
63
+
64
+ Is your data not (always) capitalized correctly? Then consider using the uncased variant of this model instead for better performance:
65
+ [tomaarsen/span-marker-bert-base-uncased-acronyms](https://huggingface.co/tomaarsen/span-marker-bert-base-uncased-acronyms).
66
 
67
  ## Model Details
68
 
69
  ### Model Description
70
  - **Model Type:** SpanMarker
71
+ - **Encoder:** [bert-base-cased](https://huggingface.co/bert-base-cased)
72
  - **Maximum Sequence Length:** 256 tokens
73
  - **Maximum Entity Length:** 8 words
74
+ - **Training Dataset:** [Acronym Identification](https://huggingface.co/datasets/acronym_identification)
75
+ - **Language:** en
76
+ - **License:** apache-2.0
77
 
78
  ### Model Sources
79
 
80
  - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
81
  - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
82
 
83
+ ### Model Labels
84
+ | Label | Examples |
85
+ |:------|:------------------------------------------------------------------------------------------------------|
86
+ | long | "Conversational Question Answering", "controlled natural language", "successive convex approximation" |
87
+ | short | "SODA", "CNL", "CoQA" |
88
+
89
+ ## Evaluation
90
+
91
+ ### Metrics
92
+ | Label | Precision | Recall | F1 |
93
+ |:--------|:----------|:-------|:-------|
94
+ | **all** | 0.9422 | 0.9252 | 0.9336 |
95
+ | long | 0.9308 | 0.9013 | 0.9158 |
96
+ | short | 0.9479 | 0.9374 | 0.9426 |
97
+
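Editorial note, not part of the committed README: the F1 column in the metrics table is consistent with the usual harmonic mean of precision and recall. The sketch below recomputes it from the values reported in the diff. The helper name `f1_score` is hypothetical and not part of SpanMarker.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded overall values from the model-index front matter of this commit.
overall = f1_score(0.942208904109589, 0.9251786464901219)

# Rounded per-label values from the Metrics table.
long_f1 = f1_score(0.9308, 0.9013)
short_f1 = f1_score(0.9479, 0.9374)

for name, value in [("all", overall), ("long", long_f1), ("short", short_f1)]:
    print(f"{name}: {value:.4f}")
```

Because the per-label inputs are already rounded to four decimals, the recomputed per-label scores can differ from the table in the last digit.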
98
  ## Uses
99
 
100
  ### Direct Use for Inference
 
103
  from span_marker import SpanMarkerModel
104
 
105
  # Download from the 🤗 Hub
106
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-acronyms")
107
  # Run inference
108
+ entities = model.predict("Compression algorithms like Principal Component Analysis (PCA) can reduce noise and complexity.")
109
  ```
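Editorial note, not part of the committed README: a minimal sketch of post-processing the inference result, assuming `model.predict` returns a list of dictionaries with at least `"span"`, `"label"`, and `"score"` keys for a single input string. The helper `group_spans_by_label` is hypothetical, and the sample entities and scores are hand-written stand-ins, not real model output.

```python
from collections import defaultdict


def group_spans_by_label(entities, min_score=0.0):
    """Collect predicted span texts per label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Stand-in for the output of model.predict(...); the scores are made up.
example_entities = [
    {"span": "Principal Component Analysis", "label": "long", "score": 0.99},
    {"span": "PCA", "label": "short", "score": 0.98},
]

print(group_spans_by_label(example_entities, min_score=0.5))
```

With the real model, pass the list returned by `model.predict(...)` in place of `example_entities`.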
110
 
111
  ### Downstream Use
 
117
  from span_marker import SpanMarkerModel, Trainer
118
 
119
  # Download from the 🤗 Hub
120
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-acronyms")
121
 
122
  # Specify a Dataset with "tokens" and "ner_tag" columns
123
  dataset = load_dataset("conll2003") # For example CoNLL2003
 
129
  eval_dataset=dataset["validation"],
130
  )
131
  trainer.train()
132
+ trainer.save_model("tomaarsen/span-marker-bert-base-acronyms-finetuned")
133
  ```
134
  </details>
135
 
 
153
 
154
  ## Training Details
155
 
156
+ ### Training Set Metrics
157
+ | Training set | Min | Median | Max |
158
+ |:----------------------|:----|:--------|:----|
159
+ | Sentence length | 4 | 32.3372 | 170 |
160
+ | Entities per sentence | 0 | 2.6775 | 24 |
161
+
162
+ ### Training Hyperparameters
163
+ - learning_rate: 5e-05
164
+ - train_batch_size: 32
165
+ - eval_batch_size: 32
166
+ - seed: 42
167
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
168
+ - lr_scheduler_type: linear
169
+ - lr_scheduler_warmup_ratio: 0.1
170
+ - num_epochs: 2
171
+
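Editorial note, not part of the committed README: the hyperparameters describe a linear learning-rate schedule with a warmup ratio of 0.1. The sketch below shows the common shape of such a schedule (linear ramp to the peak rate, then linear decay to zero). The total of 1290 steps is an assumption inferred from the Training Results table (step 1200 at epoch 1.8605, two epochs), and `linear_schedule_lr` is a hypothetical helper, not the trainer's actual implementation.

```python
def linear_schedule_lr(step, base_lr=5e-5, total_steps=1290, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumed rounding; a real trainer may round the warmup length differently.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 64, 129, 645, 1290):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 5e-05 is reached after 129 steps, roughly a fifth of the first epoch.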
172
+ ### Training Results
173
+ | Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
174
+ |:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
175
+ | 0.3101 | 200 | 0.0083 | 0.9170 | 0.8894 | 0.9030 | 0.9766 |
176
+ | 0.6202 | 400 | 0.0063 | 0.9329 | 0.9149 | 0.9238 | 0.9807 |
177
+ | 0.9302 | 600 | 0.0060 | 0.9279 | 0.9338 | 0.9309 | 0.9819 |
178
+ | 1.2403 | 800 | 0.0058 | 0.9406 | 0.9092 | 0.9247 | 0.9812 |
179
+ | 1.5504 | 1000 | 0.0056 | 0.9453 | 0.9155 | 0.9302 | 0.9825 |
180
+ | 1.8605 | 1200 | 0.0054 | 0.9411 | 0.9271 | 0.9340 | 0.9831 |
181
+
182
+ ### Environmental Impact
183
+ Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
184
+ - **Carbon Emitted**: 0.031 kg of CO2
185
+ - **Hours Used**: 0.204 hours
186
+
187
+ ### Training Hardware
188
+ - **On Cloud**: No
189
+ - **GPU Model**: 1 x NVIDIA GeForce RTX 3090
190
+ - **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
191
+ - **RAM Size**: 31.78 GB
192
+
193
  ### Framework Versions
194
  - Python: 3.9.16
195
  - SpanMarker: 1.3.1.dev