team-data-ktzh committed on
Commit 6d91486
1 Parent(s): 4be246d

Update README.md

Files changed (1)
  1. README.md +136 -70
README.md CHANGED
@@ -1,108 +1,127 @@
  ---
  library_name: span-marker
  tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
- - generated_from_span_marker_trainer
- metrics:
- - precision
- - recall
- - f1
- widget: []
  pipeline_tag: token-classification
  ---

- # SpanMarker

- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.

- ## Model Details
-
- ### Model Description
- - **Model Type:** SpanMarker
- <!-- - **Encoder:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 256 tokens
- - **Maximum Entity Length:** 8 words
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources

- - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

- ## Uses
-
- ### Direct Use for Inference

  ```python
  from span_marker import SpanMarkerModel

  # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
  # Run inference
- entities = model.predict("None")
  ```

- ### Downstream Use
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- ```python
- from span_marker import SpanMarkerModel, Trainer
-
- # Download from the 🤗 Hub
- model = SpanMarkerModel.from_pretrained("span_marker_model_id")
-
- # Specify a Dataset with "tokens" and "ner_tag" columns
- dataset = load_dataset("conll2003") # For example CoNLL2003
-
- # Initialize a Trainer using the pretrained model & dataset
- trainer = Trainer(
-     model=model,
-     train_dataset=dataset["train"],
-     eval_dataset=dataset["validation"],
- )
- trainer.train()
- trainer.save_model("span_marker_model_id-finetuned")
- ```
- </details>

- <!--
- ### Out-of-Scope Use

- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- <!--
  ## Bias, Risks and Limitations

- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
  ### Recommendations

- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

  ## Training Details

- ### Framework Versions
- - Python: 3.11.7
- - SpanMarker: 1.5.0
- - Transformers: 4.36.2
- - PyTorch: 2.2.1
- - Datasets: 2.16.1
- - Tokenizers: 0.15.0

- ## Citation

- ### BibTeX
  ```
  @software{Aarsen_SpanMarker,
      author = {Aarsen, Tom},
@@ -110,6 +129,53 @@ trainer.save_model("span_marker_model_id-finetuned")
      title = {{SpanMarker for Named Entity Recognition}},
      url = {https://github.com/tomaarsen/SpanMarkerNER}
  }
  ```

  <!--

  ---
+ license: mit
  library_name: span-marker
+ base_model: stefan-it/span-marker-gelectra-large-germeval14
  tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
  pipeline_tag: token-classification
+ widget:
+ - text: "Konstruiertes Beispiel: Hans Meier besitzt eine Firma im zürcherischen Wil. Die Meier AG war Thema einer kantonsrätlichen Sitzung. Im Meierschen Besitz ist auch ein Anwesen, das sich in unmittelbarer Nachbarschaft zu Liegenschaften des Kantons befindet und unweit der Grenze zum Kanton Aargau gelegen ist."
+   example_title: "Test sentence with all labels"
+ - text: "Böckli - Zürich erklärt, daß die Sozialdemokratische Fraktion bei allem Verständnis für die Kritik Winigers der Vorlage mit großer Mehrheit zustimmt. Das ungeschickte Vorgehen der «Swissair» bei der Kapitalerhöhung ist zu bedauern. Es war unglücklich, daß keine öffentliche Auflage der neuen Aktien durchgeführt wurde. Für die Stellungnahme der Zürcher Kantonalbank war ausschlaggebend, daß das Bankgesetz solche Beteiligungen ablehnt."
+   example_title: "Cantonal council meeting notes"
+ language:
+ - de
  ---
 
+ # SpanMarker KtZH StaZH

+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model based on the [GELECTRA Large](https://huggingface.co/stefan-it/span-marker-gelectra-large-germeval14) variant of the **SpanMarker for GermEval 2014 NER**, further fine-tuned on meeting notes of the cantonal council, resolutions of the governing council and legal texts from the corpus juris of the Canton of Zurich. The documents span the 19th and 20th centuries, covering both historical language with varying degrees of standardization and contemporary language. The model distinguishes `PER`son, `LOC`ation and `ORG`anisation entities, as well as derivations of named entities (tag suffix `-deriv`).

+ The `ORG`anisation class has been extended to cover institutions that were deemed reasonably unambiguous, either in isolation or through their usage in the training data. Purely abstract or prototypical mentions of institutions are generally out of scope (the model does not perform concept classification), but may occasionally be tagged.

+ ## Usage

+ The fine-tuned model can be used as follows:

  ```python
  from span_marker import SpanMarkerModel

  # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("team-data-ktzh/span-marker-ktzh-stazh")
+
  # Run inference
+ entities = model.predict("Hans Meier aus Dielsdorf vertritt im Kantonsrat die FDP.")
  ```
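+
+ `model.predict` returns one dictionary per detected entity. For the sentence above the output has roughly the following shape (the scores, offsets and exact set of detected spans are illustrative, not actual model output):
+
+ ```python
+ # Each entry holds the matched text, its label, a confidence score and character offsets.
+ [
+     {"span": "Hans Meier", "label": "PER", "score": 0.99, "char_start_index": 0, "char_end_index": 10},
+     {"span": "Dielsdorf", "label": "LOC", "score": 0.98, "char_start_index": 15, "char_end_index": 24},
+     {"span": "Kantonsrat", "label": "ORG", "score": 0.97, "char_start_index": 37, "char_end_index": 47},
+     {"span": "FDP", "label": "ORG", "score": 0.97, "char_start_index": 52, "char_end_index": 55},
+ ]
+ ```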

+ ## Model Details

+ ### Model Description
+ - **Model Type:** SpanMarker
+ - **Encoder:** [deepset/gelectra-large](https://huggingface.co/deepset/gelectra-large) (ELECTRA Large)
+ - **Maximum Sequence Length:** 256 tokens
+ - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** see https:// TODO
+ - **Language:** de
+ - **License:** MIT

+ ### Model Sources
+ - **Training repository (TODO):** []()
+ - **SpanMarker:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
+
+ ### Model Labels
+ | Label    | Examples                                 |
+ |:---------|:-----------------------------------------|
+ | PER      | Hans Müller                              |
+ | LOC      | Zürich                                   |
+ | ORG      | SBB, Swissair, Kantonsrat, Bundesgericht |
+ | PERderiv | Müllersche                               |
+ | LOCderiv | zürcherische                             |
+ | ORGderiv | bundesgerichtlicher                      |
+
+ ## Cross-validation evaluation
+
+ Evaluation relies on SpanMarker's internal evaluation code, which is based on `seqeval`.
+
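+ For reference, the figures below are entity-level metrics of the kind `seqeval` computes; a minimal, self-contained sketch (with made-up IOB2 tag sequences, not data from the validation folds):
+
+ ```python
+ from seqeval.metrics import classification_report
+
+ # seqeval expects one list of IOB2 tags per sentence.
+ y_true = [["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]]
+ y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O", "O"]]
+
+ # Entity-level precision, recall and F1 per label.
+ print(classification_report(y_true, y_pred))
+ ```
+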
+ ### Average per-label metrics
+ | Label    | Precision | Recall | F1   |
+ |:---------|----------:|-------:|-----:|
+ | PER      |      0.97 |   0.97 | 0.97 |
+ | LOC      |      0.95 |   0.96 | 0.96 |
+ | ORG      |      0.92 |   0.95 | 0.93 |
+ | PERderiv |      0.40 |   0.30 | 0.33 |
+ | LOCderiv |      0.86 |   0.85 | 0.85 |
+ | ORGderiv |      0.73 |   0.76 | 0.74 |
+
+ ### Overall per-fold validation metrics
+ | Fold | Precision | Recall | F1    | Accuracy |
+ |:----:|:---------:|:------:|:-----:|:--------:|
+ | 0    | 0.927     | 0.952  | 0.939 | 0.992    |
+ | 1    | 0.942     | 0.957  | 0.949 | 0.993    |
+ | 2    | 0.938     | 0.946  | 0.942 | 0.992    |
+ | 3    | 0.921     | 0.951  | 0.936 | 0.992    |
+ | 4    | 0.945     | 0.949  | 0.947 | 0.993    |
+
+ ### Confusion matrix
+ ![Confusion matrix](confusion_matrix.png)
+
+ (Note that the confusion matrix also lists other labels from the GermEval 2014 dataset, which are ignored in the context of this model.)

  ## Bias, Risks and Limitations

+ Please note that this model is released strictly as a task-bound model for annotating historical and future documents from the collections it was trained on, as well as the official gazette of the Canton of Zurich. No claims of generalization are made outside of the specific use case it was developed for. The training data was annotated according to a specific but informal annotation scheme, and the bias of the original model has been retained where it was found not to interfere with this use case. Be mindful of these idiosyncrasies when applying the model to other documents.

  ### Recommendations

+ The original XML documents of the training set can be found here: TODO. The annotations may be freely modified to tailor the model to an alternative use case. Note that the modified TEI Publisher version in TODO and the notebook at TODO are required to generate a Hugging Face Dataset.

  ## Training Details

+ ### Training Hyperparameters
+ - learning_rate: Decay from 1e-05 to 5e-07
+ - train_batch_size: 4
+ - seed: 42
+ - optimizer: AdamW with betas=(0.9,0.999), epsilon=1e-08, weight_decay=0.01
+ - lr_scheduler_type: Polynomial (cubic)
+ - lr_scheduler_warmup_ratio: 0.05
+ - num_epochs: 10
+ - gradient_accumulation_steps: 2
+ - steps: 16000
+
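+ As a rough sketch, the settings above map onto `transformers.TrainingArguments` (as used with the SpanMarker `Trainer`) approximately as follows; the cubic decay power and the 5e-07 end learning rate are not plain `TrainingArguments` fields and would require a custom scheduler or `lr_scheduler_kwargs`, depending on the `transformers` version:
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Hypothetical reconstruction of the listed hyperparameters; pass this as `args`
+ # to span_marker.Trainer together with the base model and the prepared datasets.
+ args = TrainingArguments(
+     output_dir="span-marker-ktzh-stazh",
+     learning_rate=1e-5,                  # initial LR, decaying towards 5e-07
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=2,
+     num_train_epochs=10,
+     warmup_ratio=0.05,
+     weight_decay=0.01,
+     seed=42,
+     lr_scheduler_type="polynomial",      # cubic power configured via the scheduler
+ )
+ ```
+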
+ ## Training data sources
+
+ The training data was sampled from the following collections in the [data catalog of the Canton of Zurich](https://www.zh.ch/de/politik-staat/statistik-daten/datenkatalog.html#/), curated by the [Staatsarchiv des Kantons Zürich (state archives of the Canton of Zurich)](https://www.zh.ch/de/direktion-der-justiz-und-des-innern/staatsarchiv.html):

+ * **Meeting notes of the cantonal council**: [Zürcher Kantonsratsprotokolle des 19. und 20. Jahrhunderts](https://www.zh.ch/de/politik-staat/statistik-daten/datenkatalog.html#/datasets/732@staatsarchiv-kanton-zuerich)
+ * **Resolutions of the governing council**: [Zürcher Regierungsratsbeschlüsse des 19. und 20. Jahrhunderts](https://www.zh.ch/de/politik-staat/statistik-daten/datenkatalog.html#/datasets/466@staatsarchiv-kanton-zuerich)
+ * **Corpus juris**: [Erlasse der Zürcher Gesetzessammlung ab 1803](https://www.zh.ch/de/politik-staat/statistik-daten/datenkatalog.html#/datasets/712@staatsarchiv-kanton-zuerich)

+ ## Bibliography
+
+ This work builds upon:
  ```
  @software{Aarsen_SpanMarker,
      author = {Aarsen, Tom},

      title = {{SpanMarker for Named Entity Recognition}},
      url = {https://github.com/tomaarsen/SpanMarkerNER}
  }
+
+ @article{aarsenspanmarker,
+     title = {SpanMarker for Named Entity Recognition},
+     author = {Aarsen, Tom and del Prado Martin, Fermin Moscoso and Suero, Daniel Vila and Oosterhuis, Harrie}
+ }
+
+ @inproceedings{ye-etal-2022-packed,
+     title = "Packed Levitated Marker for Entity and Relation Extraction",
+     author = "Ye, Deming and
+       Lin, Yankai and
+       Li, Peng and
+       Sun, Maosong",
+     editor = "Muresan, Smaranda and
+       Nakov, Preslav and
+       Villavicencio, Aline",
+     booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+     month = may,
+     year = "2022",
+     address = "Dublin, Ireland",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2022.acl-long.337",
+     doi = "10.18653/v1/2022.acl-long.337",
+     pages = "4904--4917",
+ }
+
+ @misc{chan2020germans,
+     author = {Chan, Branden and Schweter, Stefan and Möller, Timo},
+     description = {German's Next Language Model},
+     keywords = {bert gbert languagemodel lm},
+     title = {German's Next Language Model},
+     url = {http://arxiv.org/abs/2010.10906},
+     year = 2020
+ }
+
+ @inproceedings{benikova-etal-2014-nosta,
+     title = {NoSta-D Named Entity Annotation for German: Guidelines and Dataset},
+     author = {Benikova, Darina and
+       Biemann, Chris and
+       Reznicek, Marc},
+     booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)},
+     month = {may},
+     year = {2014},
+     address = {Reykjavik, Iceland},
+     publisher = {European Language Resources Association (ELRA)},
+     url = {http://www.lrec-conf.org/proceedings/lrec2014/pdf/276_Paper.pdf},
+     pages = {2524--2531},
+ }
  ```

  <!--