stefan-it committed on
Commit
dc9624d
1 Parent(s): 6b6ea8b

readme: add initial version

---
license: cc-by-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
pipeline_tag: token-classification
widget:
- text: "Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München ."
  example_title: "Wikipedia"
datasets:
- gwlms/germeval2014
language:
- de
model-index:
- name: SpanMarker with GWLMS BERT on GermEval 2014 NER Dataset by Stefan Schweter (@stefan-it)
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: gwlms/germeval2014
      name: GermEval 2014
      split: test
      revision: f3647c56803ce67c08ee8d15f4611054c377b226
    metrics:
    - type: f1
      value: 0.8744
      name: F1
metrics:
- f1
---

# SpanMarker for GermEval 2014 NER

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that
was fine-tuned on the [GermEval 2014 NER Dataset](https://sites.google.com/site/germeval2014ner/home).

The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following
properties: the data was sampled from German Wikipedia and news corpora as a collection of citations. The dataset
covers over 31,000 sentences corresponding to over 590,000 tokens. The NER annotation uses the NoSta-D guidelines,
which extend the Tübingen Treebank guidelines, use four main NER categories with sub-structure, and annotate
embeddings among NEs (nested entities) such as `[ORG FC Kickers [LOC Darmstadt]]`.
12 classes of Named Entities are annotated and must be recognized: the four main classes `PER`son, `LOC`ation, `ORG`anisation,
and `OTH`er, plus their subclasses formed by two fine-grained suffixes: `-deriv` marks derivations from NEs such as
"englisch" ("English"), and `-part` marks compounds that include an NE as a subsequence, such as "deutschlandweit" ("Germany-wide").
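The 12-class scheme described above can be enumerated programmatically. A minimal sketch, assuming the commonly used label spellings for GermEval 2014 (e.g. `PERderiv`); the exact strings in a given dataset copy may differ:

```python
# Build the 12 GermEval 2014 entity classes: 4 main classes,
# each also appearing in a -deriv and a -part fine-grained variant.
main_classes = ["PER", "LOC", "ORG", "OTH"]
labels = [c + suffix for c in main_classes for suffix in ("", "deriv", "part")]

print(labels)
print(len(labels))  # 12 classes in total
```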

# Fine-Tuning

We use the same hyper-parameters as in the
["German's Next Language Model"](https://aclanthology.org/2020.coling-main.598/) paper, with the
[GWLMS Token Dropping BERT](https://huggingface.co/gwlms/bert-base-token-dropping-dewiki-v1) model as backbone.

Evaluation is performed with SpanMarker's internal evaluation code, which uses `seqeval`.

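`seqeval` scores at the entity level rather than the token level: a prediction counts only if span boundaries and label both match. A minimal pure-Python sketch of that idea (not seqeval itself, and it ignores some BIO edge cases that seqeval handles):

```python
def extract_spans(tags):
    """Extract (start, end, label) entity spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes a trailing span
        boundary = not tag.startswith("I-") or tag[2:] != label
        if boundary and start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_f1(true_tags, pred_tags):
    """Entity-level F1: a span counts only if boundaries and label both match."""
    gold, pred = set(extract_spans(true_tags)), set(extract_spans(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(span_f1(["B-PER", "I-PER", "O", "B-ORG"],
              ["B-PER", "I-PER", "O", "O"]))  # 1 of 2 gold spans found
```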
We fine-tuned 5 models and uploaded the model with the best F1-score on the development set. Development set results are
shown in parentheses:

| Model | Run 1 | Run 2 (This) | Run 3 | Run 4 | Run 5 | Avg.
| ------------------------------------- | --------------- | --------------- | --------------- | --------------- | --------------- | ---------------
| GWLMS Token Dropping BERT (5e-05, 3e) | (87.85) / 87.28 | (88.09) / 87.44 | (87.59) / 87.26 | (87.71) / 87.43 | (87.83) / 87.24 | (87.81) / 87.33

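The `Avg.` column can be reproduced directly from the per-run scores, e.g. as a quick sanity check in plain Python:

```python
# Per-run scores from the table: (development, test) F1 in percent.
runs = [(87.85, 87.28), (88.09, 87.44), (87.59, 87.26),
        (87.71, 87.43), (87.83, 87.24)]

dev_avg = round(sum(d for d, _ in runs) / len(runs), 2)
test_avg = round(sum(t for _, t in runs) / len(runs), 2)

print(f"({dev_avg}) / {test_avg}")  # matches the Avg. column: (87.81) / 87.33
```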
The best model achieves a final test score of 87.44%.

Scripts for [training](trainer.py) and [evaluation](evaluator.py) are also available.

# Usage

The fine-tuned model can be used as follows:

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("gwlms/span-marker-token-dropping-bert-germeval14")

# Run inference
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
```
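`predict` returns the recognized spans as dictionaries. A sketch of post-processing them into `(text, label)` pairs, using a hypothetical output for the example sentence above; the dictionary keys (`span`, `label`, `score`) and the predicted labels shown here are assumptions, so check them against your installed SpanMarker version:

```python
# Hypothetical predict() output for the example sentence (keys assumed).
entities = [
    {"span": "Jürgen Schmidhuber", "label": "PER", "score": 0.99},
    {"span": "TU München", "label": "ORG", "score": 0.98},
]

pairs = [(e["span"], e["label"]) for e in entities]
print(pairs)  # [('Jürgen Schmidhuber', 'PER'), ('TU München', 'ORG')]
```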