Commit b2c4ef2 by KennethEnevoldsen (1 parent: ad1f1f2)
Update spaCy pipeline

Files changed:
- README.md +193 -1
- config.cfg +1 -2
- da_dacy_small_trf-any-py3-none-any.whl +2 -2
- meta.json +218 -192
- morphologizer/model +1 -1
- ner/model +1 -1
- parser/model +1 -1
- transformer/model/pytorch_model.bin +1 -1
- transformer/model/tokenizer_config.json +1 -1
- vocab/strings.json +2 -2
README.md
CHANGED
@@ -1 +1,193 @@
|
|
1 |
-
|
1 |
+
---
|
2 |
+
tags:
|
3 |
+
- spacy
|
4 |
+
- token-classification
|
5 |
+
language:
|
6 |
+
- da
|
7 |
+
license: Apache-2.0-License
|
8 |
+
model-index:
|
9 |
+
- name: da_dacy_small_trf
|
10 |
+
results:
|
11 |
+
- tasks:
|
12 |
+
name: NER
|
13 |
+
type: token-classification
|
14 |
+
metrics:
|
15 |
+
- name: Precision
|
16 |
+
type: precision
|
17 |
+
value: 0.81724846
|
18 |
+
- name: Recall
|
19 |
+
type: recall
|
20 |
+
value: 0.8291666667
|
21 |
+
- name: F Score
|
22 |
+
type: f_score
|
23 |
+
value: 0.8231644261
|
24 |
+
- tasks:
|
25 |
+
name: SENTER
|
26 |
+
type: token-classification
|
27 |
+
metrics:
|
28 |
+
- name: Precision
|
29 |
+
type: precision
|
30 |
+
value: 0.8603839442
|
31 |
+
- name: Recall
|
32 |
+
type: recall
|
33 |
+
value: 0.8741134752
|
34 |
+
- name: F Score
|
35 |
+
type: f_score
|
36 |
+
value: 0.8671943712
|
37 |
+
- tasks:
|
38 |
+
name: UNLABELED_DEPENDENCIES
|
39 |
+
type: token-classification
|
40 |
+
metrics:
|
41 |
+
- name: Accuracy
|
42 |
+
type: accuracy
|
43 |
+
value: 0.8492442546
|
44 |
+
- tasks:
|
45 |
+
name: LABELED_DEPENDENCIES
|
46 |
+
type: token-classification
|
47 |
+
metrics:
|
48 |
+
- name: Accuracy
|
49 |
+
type: accuracy
|
50 |
+
value: 0.8176199573
|
51 |
+
---
|
52 |
+
|
53 |
+
<a href="https://github.com/centre-for-humanities-computing/Dacy"><img src="https://centre-for-humanities-computing.github.io/DaCy/_static/icon.png" width="175" height="175" align="right" /></a>
|
54 |
+
|
55 |
+
# DaCy small transformer
|
56 |
+
|
57 |
+
DaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines.
|
58 |
+
DaCy's largest pipeline has achieved state-of-the-art performance on named entity recognition, part-of-speech tagging and dependency
|
59 |
+
parsing for Danish on the DaNE dataset. Check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results.
|
60 |
+
DaCy also contains guides on usage of the package as well as behavioural tests for biases and robustness of Danish NLP pipelines.
|
61 |
+
|
62 |
+
|
63 |
+
| Feature | Description |
|
64 |
+
| --- | --- |
|
65 |
+
| **Name** | `da_dacy_small_trf` |
|
66 |
+
| **Version** | `0.1.0` |
|
67 |
+
| **spaCy** | `>=3.1.1,<3.2.0` |
|
68 |
+
| **Default Pipeline** | `transformer`, `morphologizer`, `parser`, `attribute_ruler`, `lemmatizer`, `ner` |
|
69 |
+
| **Components** | `transformer`, `morphologizer`, `parser`, `attribute_ruler`, `lemmatizer`, `ner` |
|
70 |
+
| **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
|
71 |
+
| **Sources** | [UD Danish DDT v2.5](https://github.com/UniversalDependencies/UD_Danish-DDT) (Johannsen, Anders; Martínez Alonso, Héctor; Plank, Barbara)<br />[DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane) (Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard, Anders Søgaard)<br />[Maltehb/-l-ctra-danish-electra-small-cased](https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased) (Malte Højmark-Bertelsen) |
|
72 |
+
| **License** | `Apache-2.0 License` |
|
73 |
+
| **Author** | [Centre for Humanities Computing Aarhus](https://chcaa.io/#/) |
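
The `spaCy` row above pins the pipeline to `>=3.1.1,<3.2.0`. As a rough illustration of what that range admits, the sketch below checks a plain `X.Y.Z` version string against it. The helper names `parse_version` and `satisfies` are hypothetical and not part of spaCy, and the sketch deliberately ignores pre-release or development suffixes.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.1.4' into (3, 1, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, spec: str = ">=3.1.1,<3.2.0") -> bool:
    """Check a plain X.Y.Z version against a comma-separated list of bounds."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = current >= parse_version(clause[2:])
        elif clause.startswith("<"):
            ok = current < parse_version(clause[1:])
        else:
            # Only the two operators used in the table are supported here.
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False
    return True


for candidate in ("3.0.6", "3.1.1", "3.1.4", "3.2.0"):
    print(candidate, satisfies(candidate))
```

In practice a packaging-aware tool would do this comparison; the point here is only that any 3.1.x release from 3.1.1 onwards falls inside the range, while 3.2.0 does not.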
|
74 |
+
|
75 |
+
### Label Scheme
|
76 |
+
|
77 |
+
<details>
|
78 |
+
|
79 |
+
<summary>View label scheme (192 labels for 3 components)</summary>
|
80 |
+
|
81 |
+
| Component | Labels |
|
82 |
+
| --- | --- |
|
83 |
+
| **`morphologizer`** | `AdpType=Prep\|POS=ADP`, `Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=AUX\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=PROPN`, `Definite=Ind\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `POS=SCONJ`, `Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=ADV`, `Number=Plur\|POS=DET\|PronType=Dem`, `Degree=Pos\|Number=Plur\|POS=ADJ`, `Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `POS=PUNCT`, `POS=CCONJ`, `Definite=Ind\|Degree=Cmp\|Number=Sing\|POS=ADJ`, `Degree=Cmp\|POS=ADJ`, `POS=PRON\|PartType=Inf`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Definite=Ind\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Case=Acc\|Gender=Neut\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Dem`, `Degree=Pos\|POS=ADV`, `Definite=Def\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `POS=PRON\|PronType=Dem`, `NumType=Card\|POS=NUM`, `Definite=Ind\|Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `NumType=Ord\|POS=ADJ`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Mood=Ind\|POS=AUX\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `POS=VERB\|VerbForm=Inf\|Voice=Act`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `POS=NOUN`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Pass`, `POS=ADP\|PartType=Inf`, `Degree=Pos\|POS=ADJ`, `Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, 
`Case=Gen\|Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `POS=AUX\|VerbForm=Inf\|Voice=Act`, `Definite=Ind\|Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Dem`, `Number=Plur\|POS=DET\|PronType=Ind`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Ind`, `Case=Acc\|POS=PRON\|Person=3\|PronType=Prs\|Reflex=Yes`, `POS=PART\|PartType=Inf`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Ind`, `Case=Acc\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Case=Nom\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Nom\|Gender=Com\|POS=PRON\|PronType=Ind`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Ind`, `Mood=Imp\|POS=VERB`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Definite=Ind\|Number=Sing\|POS=AUX\|Tense=Past\|VerbForm=Part`, `POS=X`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `POS=VERB\|Tense=Pres\|VerbForm=Part`, `Number=Plur\|POS=PRON\|PronType=Int,Rel`, `POS=VERB\|VerbForm=Inf\|Voice=Pass`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Degree=Cmp\|POS=ADV`, `POS=ADV\|PartType=Inf`, `Degree=Sup\|POS=ADV`, `Number=Plur\|POS=PRON\|PronType=Dem`, `Number=Plur\|POS=PRON\|PronType=Ind`, `Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|POS=PROPN`, `POS=ADP`, `Degree=Cmp\|Number=Plur\|POS=ADJ`, `Definite=Def\|Degree=Sup\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Degree=Pos\|Number=Sing\|POS=ADJ`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Gender=Com\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, 
`Number=Plur\|POS=PRON\|PronType=Rcp`, `Case=Gen\|Degree=Cmp\|POS=ADJ`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Number[psor]=Plur\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, `POS=INTJ`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Number=Plur\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Definite=Def\|Degree=Sup\|Number=Plur\|POS=ADJ`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Definite=Ind\|Number=Sing\|POS=NOUN`, `Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Number=Plur\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `POS=SYM`, `Case=Nom\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `Degree=Sup\|POS=ADJ`, `Number=Plur\|POS=DET\|PronType=Ind\|Style=Arch`, `Case=Gen\|Gender=Com\|Number=Sing\|POS=DET\|PronType=Dem`, `Foreign=Yes\|POS=X`, `POS=DET\|Person=2\|Polite=Form\|Poss=Yes\|PronType=Prs`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Dem`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Case=Gen\|POS=PRON\|PronType=Int,Rel`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Dem`, `Abbr=Yes\|POS=X`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `Definite=Def\|Degree=Abs\|POS=ADJ`, `Definite=Ind\|Degree=Sup\|Number=Sing\|POS=ADJ`, 
`Definite=Ind\|POS=NOUN`, `Gender=Com\|Number=Plur\|POS=NOUN`, `Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Gender=Com\|POS=PRON\|PronType=Int,Rel`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Degree=Abs\|POS=ADV`, `POS=VERB\|VerbForm=Ger`, `POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Degree=Sup\|Number=Sing\|POS=ADJ`, `Number=Plur\|Number[psor]=Plur\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Gen\|Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Case=Gen\|Degree=Pos\|Number=Plur\|POS=ADJ`, `Case=Acc\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `POS=VERB\|Tense=Pres`, `Case=Gen\|Number=Plur\|POS=DET\|PronType=Ind`, `Number[psor]=Plur\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `POS=PRON\|Person=2\|Polite=Form\|Poss=Yes\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `POS=AUX\|Tense=Pres\|VerbForm=Part`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Pass`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Degree=Sup\|Number=Plur\|POS=ADJ`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Definite=Ind\|Number=Plur\|POS=NOUN`, `Case=Gen\|Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Mood=Imp\|POS=AUX`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs`, `Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `Definite=Def\|Gender=Com\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Case=Gen\|POS=NOUN`, `Number[psor]=Plur\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `POS=DET\|PronType=Dem`, 
`Definite=Def\|Number=Plur\|POS=NOUN` |
|
84 |
+
| **`parser`** | `ROOT`, `acl:relcl`, `advcl`, `advmod`, `amod`, `appos`, `aux`, `case`, `cc`, `ccomp`, `compound:prt`, `conj`, `cop`, `dep`, `det`, `expl`, `fixed`, `flat`, `iobj`, `list`, `mark`, `nmod`, `nmod:poss`, `nsubj`, `nummod`, `obj`, `obl`, `obl:loc`, `obl:tmod`, `punct`, `xcomp` |
|
85 |
+
| **`ner`** | `LOC`, `MISC`, `ORG`, `PER` |
|
86 |
+
|
87 |
+
</details>
|
88 |
+
|
89 |
+
### Accuracy
|
90 |
+
|
91 |
+
| Type | Score |
|
92 |
+
| --- | --- |
|
93 |
+
| `POS_ACC` | 95.83 |
|
94 |
+
| `MORPH_ACC` | 95.70 |
|
95 |
+
| `DEP_UAS` | 84.92 |
|
96 |
+
| `DEP_LAS` | 81.76 |
|
97 |
+
| `SENTS_P` | 86.04 |
|
98 |
+
| `SENTS_R` | 87.41 |
|
99 |
+
| `SENTS_F` | 86.72 |
|
100 |
+
| `LEMMA_ACC` | 84.91 |
|
101 |
+
| `ENTS_F` | 82.32 |
|
102 |
+
| `ENTS_P` | 81.72 |
|
103 |
+
| `ENTS_R` | 82.92 |
|
104 |
+
| `TRANSFORMER_LOSS` | 41746686.63 |
|
105 |
+
| `MORPHOLOGIZER_LOSS` | 3458966.49 |
|
106 |
+
| `PARSER_LOSS` | 15104898.38 |
|
107 |
+
| `NER_LOSS` | 546098.45 |
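
The F-scores in the table follow from the precision and recall values as their harmonic mean. The short sketch below recomputes `ENTS_F` and `SENTS_F` from the precision and recall figures given in the README front matter and rounds them the way the table does. The function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed in the model-index metadata.
ner = f_score(0.81724846, 0.8291666667)
sents = f_score(0.8603839442, 0.8741134752)

# The table reports percentages rounded to two decimals.
print(f"ENTS_F  {100 * ner:.2f}")
print(f"SENTS_F {100 * sents:.2f}")
```

Both values agree with the `ENTS_F` and `SENTS_F` rows above after rounding.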
|
108 |
+
|
109 |
+
|
110 |
+
## Bias and Robustness
|
111 |
+
|
112 |
+
Besides the validation done by spaCy on the DaNE test set, DaCy also provides a series of augmentations of the DaNE test set to see how well the models handle them.
|
113 |
+
These can be seen as behavioural probes akin to the NLP checklist.
|
114 |
+
|
115 |
+
### Deterministic Augmentations
|
116 |
+
Deterministic augmentations are augmentations which always yield the same result.
|
117 |
+
|
118 |
+
| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |
|
119 |
+
| --- | --- | --- | --- | --- | --- | --- | --- |
|
120 |
+
| No augmentation | 0.98 | 0.974 | 0.868 | 0.836 | 0.936 | 0.844 | 0.765 |
|
121 |
+
| Æøå Augmentation | 0.955 | 0.948 | 0.823 | 0.783 | 0.922 | 0.754 | 0.718 |
|
122 |
+
| Lowercase | 0.974 | 0.97 | 0.862 | 0.828 | 0.905 | 0.848 | 0.681 |
|
123 |
+
| No Spacing | 0.229 | 0.229 | 0.004 | 0.003 | 0.824 | 0.225 | 0.048 |
|
124 |
+
| Abbreviated first names | 0.979 | 0.973 | 0.864 | 0.832 | 0.94 | 0.845 | 0.699 |
|
125 |
+
| Input size augmentation 5 sentences | 0.956 | 0.956 | 0.851 | 0.818 | 0.883 | 0.844 | 0.743 |
|
126 |
+
| Input size augmentation 10 sentences | 0.959 | 0.958 | 0.853 | 0.821 | 0.897 | 0.844 | 0.755 |
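
One way to read the table is as a relative drop from the unaugmented baseline. The sketch below does this for the named entity recognition column, using the F1 values copied from the rows above. The helper `relative_drop` is an illustrative name, not a DaCy function.

```python
# NER F1 values taken from the deterministic augmentation table.
baseline = 0.765
ner_f1 = {
    "Æøå Augmentation": 0.718,
    "Lowercase": 0.681,
    "No Spacing": 0.048,
    "Abbreviated first names": 0.699,
}


def relative_drop(reference: float, score: float) -> float:
    """Share of the reference score that is lost under an augmentation."""
    return (reference - score) / reference


drops = {name: relative_drop(baseline, score) for name, score in ner_f1.items()}

# Print the augmentations from most to least harmful.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26}{drop:6.1%}")
```

Read this way, removing all spacing is by far the most damaging augmentation for NER, while lowercasing costs roughly a tenth of the baseline F1.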
|
127 |
+
|
128 |
+
|
129 |
+
|
130 |
+
### Stochastic Augmentations
|
131 |
+
Stochastic augmentations are augmentations which are repeated multiple times to estimate the effect of the augmentation.
|
132 |
+
|
133 |
+
| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |
|
134 |
+
| --- | --- | --- | --- | --- | --- | --- | --- |
|
135 |
+
| Keystroke errors 2% | 0.931 (0.003) | 0.929 (0.003) | 0.797 (0.003) | 0.753 (0.003) | 0.884 (0.003) | 0.772 (0.003) | 0.657 (0.003) |
|
136 |
+
| Keystroke errors 5% | 0.859 (0.003) | 0.863 (0.003) | 0.699 (0.003) | 0.641 (0.003) | 0.824 (0.003) | 0.681 (0.003) | 0.53 (0.003) |
|
137 |
+
| Keystroke errors 15% | 0.633 (0.006) | 0.662 (0.006) | 0.439 (0.006) | 0.358 (0.006) | 0.688 (0.006) | 0.459 (0.006) | 0.293 (0.006) |
|
138 |
+
| Danish names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |
|
139 |
+
| Muslim names | 0.979 (0.0) | 0.974 (0.0) | 0.865 (0.0) | 0.833 (0.0) | 0.94 (0.0) | 0.847 (0.0) | 0.732 (0.0) |
|
140 |
+
| Female names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.946 (0.0) | 0.847 (0.0) | 0.754 (0.0) |
|
141 |
+
| Male names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |
|
142 |
+
| Spacing Augmentation 5% | 0.941 (0.002) | 0.936 (0.002) | 0.755 (0.002) | 0.725 (0.002) | 0.907 (0.002) | 0.811 (0.002) | 0.699 (0.002) |
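
Each cell in this table is a mean over repeated runs with the standard deviation in parentheses. The sketch below shows that repeat-and-summarise pattern for a keystroke-error augmentation. It is not DaCy's implementation: the `NEIGHBOURS` map is a hypothetical, heavily abbreviated stand-in for a Danish QWERTY layout, and `char_agreement` is a placeholder metric used instead of running a pipeline.

```python
import random
import statistics

# Hypothetical neighbour map; a real one would cover the whole keyboard.
NEIGHBOURS = {"a": "sq", "e": "wr", "n": "bm", "o": "ip", "r": "et", "s": "ad", "t": "ry"}


def keystroke_errors(text: str, rate: float, rng: random.Random) -> str:
    """Swap each mapped character for a neighbouring key with probability `rate`."""
    out = []
    for char in text:
        options = NEIGHBOURS.get(char)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(char)
    return "".join(out)


def char_agreement(original: str, augmented: str) -> float:
    """Placeholder metric: share of characters left unchanged."""
    same = sum(a == b for a, b in zip(original, augmented))
    return same / len(original)


def repeated_estimate(text: str, rate: float, repeats: int = 20, seed: int = 0):
    """Apply the augmentation `repeats` times and return (mean, standard deviation)."""
    rng = random.Random(seed)
    scores = [char_agreement(text, keystroke_errors(text, rate, rng)) for _ in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)


text = "Kenneth Enevoldsen arbejder ved Aarhus Universitet."
mean, std = repeated_estimate(text, rate=0.05)
print(f"{mean:.3f} ({std:.3f})")
```

Seeding the random generator keeps the estimate reproducible; in a real evaluation the placeholder metric would be replaced by the pipeline's score on the augmented test set.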
|
143 |
+
|
144 |
+
<details>
|
145 |
+
|
146 |
+
<summary> Description of Augmenters </summary>
|
147 |
+
|
148 |
+
|
149 |
+
|
150 |
+
**No augmentation:**
|
151 |
+
Applies no augmentation to the DaNE test set.
|
152 |
+
|
153 |
+
**Æøå Augmentation:**
|
154 |
+
This augmentation replaces æ, ø, and å with their spelling variations ae, oe and aa respectively.
|
155 |
+
|
156 |
+
**Lowercase:**
|
157 |
+
This augmentation lowercases all text.
|
158 |
+
|
159 |
+
**No Spacing:**
|
160 |
+
This augmentation removes all spacing from the text.
|
161 |
+
|
162 |
+
**Abbreviated first names:**
|
163 |
+
This augmentation abbreviates the first names of entities. For instance, 'Kenneth Enevoldsen' would turn into 'K. Enevoldsen'.
|
164 |
+
|
165 |
+
**Keystroke errors 2%:**
|
166 |
+
This augmentation simulates keystroke errors by replacing 2% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
167 |
+
|
168 |
+
**Keystroke errors 5%:**
|
169 |
+
This augmentation simulates keystroke errors by replacing 5% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
170 |
+
|
171 |
+
**Keystroke errors 15%:**
|
172 |
+
This augmentation simulates keystroke errors by replacing 15% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
173 |
+
|
174 |
+
**Danish names:**
|
175 |
+
This augmentation replaces all names with Danish names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
176 |
+
|
177 |
+
**Muslim names:**
|
178 |
+
This augmentation replaces all names with Muslim names derived from Meldgaard (2005). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
179 |
+
|
180 |
+
**Female names:**
|
181 |
+
This augmentation replaces all names with Danish female names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
182 |
+
|
183 |
+
**Male names:**
|
184 |
+
This augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
185 |
+
|
186 |
+
**Spacing Augmentation 5%:**
|
187 |
+
This augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.
|
188 |
+
</details>
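
The deterministic augmenters described above are simple string transformations. The sketch below gives minimal plain-Python versions of four of them. These are illustrative stand-ins rather than DaCy's own augmenters: the function names are hypothetical, the handling of upper-case Æ, Ø and Å is an assumption, and the name abbreviation works on a bare name string instead of on annotated entities.

```python
def aeoeaa(text: str) -> str:
    """Replace æ, ø and å with ae, oe and aa (upper-case handling is assumed)."""
    table = str.maketrans(
        {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}
    )
    return text.translate(table)


def lowercase(text: str) -> str:
    """Lowercase all text."""
    return text.lower()


def no_spacing(text: str) -> str:
    """Remove every space character from the text."""
    return text.replace(" ", "")


def abbreviate_first_name(full_name: str) -> str:
    """Shorten the first token of a name to its initial, keeping the rest."""
    first, _, rest = full_name.partition(" ")
    if not rest:
        # A single-token name has nothing to abbreviate against.
        return full_name
    return f"{first[0]}. {rest}"


print(aeoeaa("Smørrebrød på Ærø"))
print(lowercase("Aarhus Universitet"))
print(no_spacing("en lille test"))
print(abbreviate_first_name("Kenneth Enevoldsen"))
```

Because none of these functions involves randomness, each input always maps to the same output, which is why the deterministic table reports single values without a standard deviation.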
|
189 |
+
<br />
|
190 |
+
|
191 |
+
|
192 |
+
### Hardware
|
193 |
+
This was run and trained on a Quadro RTX 8000 GPU.
|
config.cfg
CHANGED
@@ -104,7 +104,6 @@ stride = 96
|
|
104 |
|
105 |
[components.transformer.model.tokenizer_config]
|
106 |
use_fast = true
|
107 |
-
strip_accents = false
|
108 |
|
109 |
[corpora]
|
110 |
|
@@ -136,7 +135,7 @@ dropout = 0.1
|
|
136 |
accumulate_gradient = 3
|
137 |
patience = 5000
|
138 |
max_epochs = 0
|
139 |
-
max_steps =
|
140 |
eval_frequency = 1000
|
141 |
frozen_components = []
|
142 |
before_to_disk = null
|
|
|
104 |
|
105 |
[components.transformer.model.tokenizer_config]
|
106 |
use_fast = true
|
|
|
107 |
|
108 |
[corpora]
|
109 |
|
|
|
135 |
accumulate_gradient = 3
|
136 |
patience = 5000
|
137 |
max_epochs = 0
|
138 |
+
max_steps = 40000
|
139 |
eval_frequency = 1000
|
140 |
frozen_components = []
|
141 |
before_to_disk = null
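
With this change the `[training]` block sets both `patience = 5000` and `max_steps = 40000`, so training can end either through early stopping or at the step limit. The sketch below shows how two such limits interact. It is a simplified illustration and not spaCy's training loop; it assumes that a value of 0 disables the corresponding check, and `should_stop` is a hypothetical helper.

```python
def should_stop(step: int, best_step: int, patience: int = 5000, max_steps: int = 40000) -> bool:
    """Return True once either limit is reached; a limit of 0 disables that check."""
    # Early stopping: no improvement for `patience` steps since the best score.
    if patience and step - best_step >= patience:
        return True
    # Hard cap on the number of update steps.
    if max_steps and step >= max_steps:
        return True
    return False


# Evaluation happens every 1000 steps (eval_frequency = 1000).
# Assume, purely for illustration, that the best score was seen at step 7000.
best_step = 7000
for step in range(1000, 41000, 1000):
    if should_stop(step, best_step):
        print(f"stopping at step {step}")
        break
```

In this illustration training stops at step 12000 because of patience; with a model that keeps improving, the step cap would end the run at 40000 instead.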
|
da_dacy_small_trf-any-py3-none-any.whl
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:9a76f9af63a196fccfc13b6dab46ef46ac1ba1202c15ad38b7189b07ee6e62be
|
3 |
+
size 57514565
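
The wheel is stored through Git LFS, so the repository only holds a three-line pointer with the spec version, the object's SHA-256 digest and its size in bytes. As a small sketch, the pointer shown in this diff can be parsed like this; `parse_lfs_pointer` is an illustrative helper, not part of any Git LFS tooling.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:9a76f9af63a196fccfc13b6dab46ef46ac1ba1202c15ad38b7189b07ee6e62be
size 57514565
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its version, hash algorithm, digest and size."""
    # Each non-empty line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algo"], pointer["digest"][:8])
print(f"{pointer['size'] / 1e6:.1f} MB")
```

The size field shows that the packaged pipeline is roughly 57.5 MB.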
|
meta.json
CHANGED
@@ -2,13 +2,13 @@
|
|
2 |
"lang":"da",
|
3 |
"name":"dacy_small_trf",
|
4 |
"version":"0.1.0",
|
5 |
-
"description":"DaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines
|
6 |
-
"author":"
|
7 |
-
"email":"
|
8 |
-
"url":"https://
|
9 |
"license":"Apache-2.0 License",
|
10 |
-
"spacy_version":">=3.1.
|
11 |
-
"spacy_git_version":"
|
12 |
"vectors":{
|
13 |
"width":0,
|
14 |
"vectors":0,
|
@@ -243,248 +243,251 @@
|
|
243 |
"disabled":[
|
244 |
|
245 |
],
|
246 |
"performance":{
|
247 |
-
"pos_acc":0.
|
248 |
-
"morph_acc":0.
|
249 |
"morph_per_feat":{
|
250 |
"Mood":{
|
251 |
-
"p":0.
|
252 |
-
"r":0.
|
253 |
-
"f":0.
|
254 |
},
|
255 |
"Tense":{
|
256 |
-
"p":0.
|
257 |
-
"r":0.
|
258 |
-
"f":0.
|
259 |
},
|
260 |
"VerbForm":{
|
261 |
-
"p":0.
|
262 |
-
"r":0.
|
263 |
-
"f":0.
|
264 |
},
|
265 |
"Voice":{
|
266 |
-
"p":0.
|
267 |
-
"r":0.
|
268 |
-
"f":0.
|
269 |
},
|
270 |
"Definite":{
|
271 |
-
"p":0.
|
272 |
-
"r":0.
|
273 |
-
"f":0.
|
274 |
},
|
275 |
"Gender":{
|
276 |
-
"p":0.
|
277 |
-
"r":0.
|
278 |
-
"f":0.
|
279 |
},
|
280 |
"Number":{
|
281 |
-
"p":0.
|
282 |
-
"r":0.
|
283 |
-
"f":0.
|
284 |
},
|
285 |
"AdpType":{
|
286 |
-
"p":0
|
287 |
-
"r":
|
288 |
-
"f":0.
|
289 |
},
|
290 |
"PartType":{
|
291 |
-
"p":
|
292 |
-
"r":0.
|
293 |
-
"f":0.
|
294 |
},
|
295 |
"Case":{
|
296 |
-
"p":0.
|
297 |
-
"r":0.
|
298 |
-
"f":0.
|
299 |
},
|
300 |
"Person":{
|
301 |
-
"p":0.
|
302 |
-
"r":0.
|
303 |
-
"f":0.
|
304 |
},
|
305 |
"PronType":{
|
306 |
-
"p":0.
|
307 |
-
"r":0.
|
308 |
-
"f":0.
|
309 |
},
|
310 |
"NumType":{
|
311 |
-
"p":0.
|
312 |
-
"r":0.
|
313 |
-
"f":0.
|
314 |
},
|
315 |
"Degree":{
|
316 |
-
"p":0.
|
317 |
-
"r":0.
|
318 |
-
"f":0.
|
319 |
},
|
320 |
"Reflex":{
|
321 |
-
"p":
|
322 |
-
"r":
|
323 |
-
"f":
|
324 |
},
|
325 |
"Number[psor]":{
|
326 |
-
"p":
|
327 |
-
"r":0.
|
328 |
-
"f":0.
|
329 |
},
|
330 |
"Poss":{
|
331 |
-
"p":
|
332 |
-
"r":0.
|
333 |
-
"f":0.
|
334 |
},
|
335 |
"Foreign":{
|
336 |
-
"p":0.
|
337 |
-
"r":0.
|
338 |
-
"f":0.
|
339 |
},
|
340 |
"Abbr":{
|
341 |
-
"p":
|
342 |
-
"r":0.
|
343 |
-
"f":0.
|
344 |
},
|
345 |
"Style":{
|
346 |
-
"p":
|
347 |
-
"r":
|
348 |
-
"f":
|
349 |
},
|
350 |
"Polite":{
|
351 |
-
"p":0.
|
352 |
-
"r":0.
|
353 |
-
"f":0.
|
354 |
}
|
355 |
},
|
356 |
-
"dep_uas":0.
|
357 |
-
"dep_las":0.
|
358 |
"dep_las_per_type":{
|
359 |
"advmod":{
|
360 |
-
"p":0.
|
361 |
-
"r":0.
|
362 |
-
"f":0.
|
363 |
},
|
364 |
"root":{
|
365 |
-
"p":0.
|
366 |
-
"r":0.
|
367 |
-
"f":0.
|
368 |
},
|
369 |
"nsubj":{
|
370 |
-
"p":0.
|
371 |
-
"r":0.
|
372 |
-
"f":0.
|
373 |
},
|
374 |
"case":{
|
375 |
-
"p":0.
|
376 |
-
"r":0.
|
377 |
-
"f":0.
|
378 |
},
|
379 |
"obl":{
|
380 |
-
"p":0.
|
381 |
-
"r":0.
|
382 |
-
"f":0.
|
383 |
},
|
384 |
"cc":{
|
385 |
-
"p":0.
|
386 |
-
"r":0.
|
387 |
-
"f":0.
|
388 |
},
|
389 |
"conj":{
|
390 |
-
"p":0.
|
391 |
-
"r":0.
|
392 |
-
"f":0.
|
393 |
},
|
394 |
"obj":{
|
395 |
-
"p":0.
|
396 |
-
"r":0.
|
397 |
-
"f":0.
|
398 |
},
|
399 |
"aux":{
|
400 |
-
"p":0.
|
401 |
-
"r":0.
|
402 |
-
"f":0.
|
403 |
},
|
404 |
"acl:relcl":{
|
405 |
-
"p":0.
|
406 |
-
"r":0.
|
407 |
-
"f":0.
|
408 |
},
|
409 |
"obl:loc":{
|
410 |
-
"p":0.
|
411 |
-
"r":0.
|
412 |
-
"f":0.
|
413 |
},
|
414 |
"det":{
|
415 |
-
"p":0.
|
416 |
-
"r":0.
|
417 |
-
"f":0.
|
418 |
},
|
419 |
"amod":{
|
420 |
-
"p":0.
|
421 |
-
"r":0.
|
422 |
-
"f":0.
|
423 |
},
|
424 |
"nmod:poss":{
|
425 |
-
"p":0.
|
426 |
-
"r":0.
|
427 |
-
"f":0.
|
428 |
},
|
429 |
"ccomp":{
|
430 |
-
"p":0.
|
431 |
-
"r":0.
|
432 |
-
"f":0.
|
433 |
},
|
434 |
"nummod":{
|
435 |
-
"p":0.
|
436 |
-
"r":0.
|
437 |
-
"f":0.
|
438 |
},
|
439 |
"flat":{
|
440 |
-
"p":0.
|
441 |
-
"r":0.
|
442 |
-
"f":0.
|
443 |
},
|
444 |
"compound:prt":{
|
445 |
-
"p":0.
|
446 |
-
"r":0.
|
447 |
-
"f":0.
|
448 |
},
|
449 |
"advcl":{
|
450 |
-
"p":0.
|
451 |
-
"r":0.
|
452 |
-
"f":0.
|
453 |
},
|
454 |
"mark":{
|
455 |
-
"p":0.
|
456 |
-
"r":0.
|
457 |
-
"f":0.
|
458 |
},
|
459 |
"cop":{
|
460 |
-
"p":0.
|
461 |
-
"r":0.
|
462 |
-
"f":0.
|
463 |
},
|
464 |
"dep":{
|
465 |
-
"p":0.
|
466 |
-
"r":0.
|
467 |
-
"f":0.
|
468 |
},
|
469 |
"nmod":{
|
470 |
-
"p":0.
|
471 |
-
"r":0.
|
472 |
-
"f":0.
|
473 |
},
|
474 |
"iobj":{
|
475 |
-
"p":0.
|
476 |
-
"r":0.
|
477 |
-
"f":0.
|
478 |
-
},
|
479 |
-
"list":{
|
480 |
-
"p":0.0,
|
481 |
-
"r":0.0,
|
482 |
-
"f":0.0
|
483 |
},
|
484 |
"xcomp":{
|
485 |
-
"p":0.
|
486 |
-
"r":0.
|
487 |
-
"f":0.
|
|
488 |
},
|
489 |
"vocative":{
|
490 |
"p":0.0,
|
@@ -492,24 +495,24 @@
|
|
492 |
"f":0.0
|
493 |
},
|
494 |
"fixed":{
|
495 |
-
"p":0.
|
496 |
-
"r":0.
|
497 |
-
"f":0.
|
498 |
-
},
|
499 |
-
"appos":{
|
500 |
-
"p":0.0,
|
501 |
-
"r":0.0,
|
502 |
-
"f":0.0
|
503 |
},
|
504 |
"expl":{
|
505 |
-
"p":0.
|
506 |
-
"r":0.
|
507 |
-
"f":0.
|
|
508 |
},
|
509 |
"obl:tmod":{
|
510 |
-
"p":0.
|
511 |
-
"r":0.
|
512 |
-
"f":0.
|
513 |
},
|
514 |
"discourse":{
|
515 |
"p":0.0,
|
@@ -517,39 +520,62 @@
|
|
517 |
"f":0.0
|
518 |
}
|
519 |
},
|
520 |
-
"sents_p":0.
|
521 |
-
"sents_r":0.
|
522 |
-
"sents_f":0.
|
523 |
"lemma_acc":0.8491041162,
|
524 |
-
"ents_f":0.
|
525 |
-
"ents_p":0.
|
526 |
-
"ents_r":0.
|
527 |
"ents_per_type":{
|
528 |
-
"ORG":{
|
529 |
-
"p":0.0040957782,
|
530 |
-
"r":0.2888888889,
|
531 |
-
"f":0.0080770426
|
532 |
-
},
|
533 |
"PER":{
|
534 |
-
"p":0.
|
535 |
-
"r":0.
|
536 |
-
"f":0.
|
|
537 |
},
|
538 |
"MISC":{
|
539 |
-
"p":0.
|
540 |
-
"r":0.
|
541 |
-
"f":0.
|
542 |
},
|
543 |
"LOC":{
|
544 |
-
"p":0.
|
545 |
-
"r":0.
|
546 |
-
"f":0.
|
547 |
}
|
548 |
},
|
549 |
-
"transformer_loss":
|
550 |
-
"morphologizer_loss":
|
551 |
-
"parser_loss":
|
552 |
-
"ner_loss":
|
553 |
},
|
554 |
-
"
|
555 |
}
|
|
|
2 |
"lang":"da",
|
3 |
"name":"dacy_small_trf",
|
4 |
"version":"0.1.0",
|
5 |
+
"description":"\n<a href=\"https://github.com/centre-for-humanities-computing/Dacy\"><img src=\"https://centre-for-humanities-computing.github.io/DaCy/_static/icon.png\" width=\"175\" height=\"175\" align=\"right\" /></a>\n\n# DaCy small transformer\n\nDaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines.\nDaCy's largest pipeline has achieved State-of-the-Art performance on Named entity recognition, part-of-speech tagging and dependency \nparsing for Danish on the DaNE dataset. Check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results. \nDaCy also contains guides on usage of the package as well as behavioural test for biases and robustness of Danish NLP pipelines.\n ",
|
6 |
+
"author":"Centre for Humanities Computing Aarhus",
|
7 |
+
"email":"Kenneth.enevoldsen@cas.au.dk",
|
8 |
+
"url":"https://chcaa.io/#/",
|
9 |
"license":"Apache-2.0 License",
|
10 |
+
"spacy_version":">=3.1.1,<3.2.0",
|
11 |
+
"spacy_git_version":"ffaead8fe",
|
12 |
"vectors":{
|
13 |
"width":0,
|
14 |
"vectors":0,
|
|
|
243 |
"disabled":[
|
244 |
|
245 |
],
|
246 |
+
"_sourced_vectors_hashes":{
|
247 |
+
|
248 |
+
},
|
249 |
"performance":{
|
250 |
+
"pos_acc":0.9583030655,
|
251 |
+
"morph_acc":0.9570439246,
|
252 |
"morph_per_feat":{
|
253 |
"Mood":{
|
254 |
+
"p":0.9950690335,
|
255 |
+
"r":0.9618684461,
|
256 |
+
"f":0.9781871062
|
257 |
},
|
258 |
"Tense":{
|
259 |
+
"p":0.9859922179,
|
260 |
+
"r":0.9540662651,
|
261 |
+
"f":0.9697665519
|
262 |
},
|
263 |
"VerbForm":{
|
264 |
+
"p":0.9823343849,
|
265 |
+
"r":0.952876377,
|
266 |
+
"f":0.9673811743
|
267 |
},
|
268 |
"Voice":{
|
269 |
+
"p":0.9938414165,
|
270 |
+
"r":0.9648729447,
|
271 |
+
"f":0.9791429655
|
272 |
},
|
273 |
"Definite":{
|
274 |
+
"p":0.9872480461,
|
275 |
+
"r":0.9482418017,
|
276 |
+
"f":0.9673518742
|
277 |
},
|
278 |
"Gender":{
|
279 |
+
"p":0.9793956044,
|
280 |
+
"r":0.9478231971,
|
281 |
+
"f":0.9633507853
|
282 |
},
|
283 |
"Number":{
|
284 |
+
"p":0.985179197,
|
285 |
+
"r":0.9535732916,
|
286 |
+
"f":0.9691186216
|
287 |
},
|
288 |
"AdpType":{
|
289 |
+
"p":1.0,
|
290 |
+
"r":0.9752431477,
|
291 |
+
"f":0.9874664279
|
292 |
},
|
293 |
"PartType":{
|
294 |
+
"p":1.0,
|
295 |
+
"r":0.9675324675,
|
296 |
+
"f":0.9834983498
|
297 |
},
|
298 |
"Case":{
|
299 |
+
"p":0.9934640523,
|
300 |
+
"r":0.9605055292,
|
301 |
+
"f":0.9767068273
|
302 |
},
|
303 |
"Person":{
|
304 |
+
"p":0.9908925319,
|
305 |
+
"r":0.9662522202,
|
306 |
+
"f":0.9784172662
|
307 |
},
|
308 |
"PronType":{
|
309 |
+
"p":0.9941077441,
|
310 |
+
"r":0.9712171053,
|
311 |
+
"f":0.9825291181
|
312 |
},
|
313 |
"NumType":{
|
314 |
+
"p":0.9791666667,
|
315 |
+
"r":0.9337748344,
|
316 |
+
"f":0.9559322034
|
317 |
},
|
318 |
"Degree":{
|
319 |
+
"p":0.9726708075,
|
320 |
+
"r":0.943373494,
|
321 |
+
"f":0.9577981651
|
322 |
},
|
323 |
"Reflex":{
|
324 |
+
"p":1.0,
|
325 |
+
"r":1.0,
|
326 |
+
"f":1.0
|
327 |
},
|
328 |
"Number[psor]":{
|
329 |
+
"p":1.0,
|
330 |
+
"r":0.988372093,
|
331 |
+
"f":0.9941520468
|
332 |
},
|
333 |
"Poss":{
|
334 |
+
"p":1.0,
|
335 |
+
"r":0.9772727273,
|
336 |
+
"f":0.9885057471
|
337 |
},
|
338 |
"Foreign":{
|
339 |
+
"p":0.8888888889,
|
340 |
+
"r":0.8,
|
341 |
+
"f":0.8421052632
|
342 |
},
|
343 |
"Abbr":{
|
344 |
+
"p":1.0,
|
345 |
+
"r":0.4,
|
346 |
+
"f":0.5714285714
|
347 |
},
|
348 |
"Style":{
|
349 |
+
"p":1.0,
|
350 |
+
"r":1.0,
|
351 |
+
"f":1.0
|
352 |
},
|
353 |
"Polite":{
|
354 |
+
"p":0.3333333333,
|
355 |
+
"r":0.25,
|
356 |
+
"f":0.2857142857
|
357 |
}
|
358 |
},
|
359 |
+
"dep_uas":0.8492442546,
|
360 |
+
"dep_las":0.8176199573,
|
361 |
"dep_las_per_type":{
|
362 |
"advmod":{
|
363 |
+
"p":0.7724637681,
|
364 |
+
"r":0.7528248588,
|
365 |
+
"f":0.7625178827
|
366 |
},
|
367 |
"root":{
|
368 |
+
"p":0.8561403509,
|
369 |
+
"r":0.865248227,
|
370 |
+
"f":0.860670194
|
371 |
},
|
372 |
"nsubj":{
|
373 |
+
"p":0.8939393939,
|
374 |
+
"r":0.8713080169,
|
375 |
+
"f":0.8824786325
|
376 |
},
|
377 |
"case":{
|
378 |
+
"p":0.9141414141,
|
379 |
+
"r":0.8942687747,
|
380 |
+
"f":0.9040959041
|
381 |
},
|
382 |
"obl":{
|
383 |
+
"p":0.7286585366,
|
384 |
+
"r":0.7433903577,
|
385 |
+
"f":0.7359507313
|
386 |
},
|
387 |
"cc":{
|
388 |
+
"p":0.8486646884,
|
389 |
+
"r":0.8313953488,
|
390 |
+
"f":0.8399412628
|
391 |
},
|
392 |
"conj":{
|
393 |
+
"p":0.671957672,
|
394 |
+
"r":0.6773333333,
|
395 |
+
"f":0.6746347942
|
396 |
},
|
397 |
"obj":{
|
398 |
+
"p":0.8560747664,
|
399 |
+
"r":0.8893203883,
|
400 |
+
"f":0.8723809524
|
401 |
},
|
402 |
"aux":{
|
403 |
+
"p":0.8885542169,
|
404 |
+
"r":0.860058309,
|
405 |
+
"f":0.8740740741
|
406 |
},
|
407 |
"acl:relcl":{
|
408 |
+
"p":0.6936416185,
|
409 |
+
"r":0.6486486486,
|
410 |
+
"f":0.6703910615
|
411 |
},
|
412 |
"obl:loc":{
|
413 |
+
"p":0.7222222222,
|
414 |
+
"r":0.7428571429,
|
415 |
+
"f":0.7323943662
|
416 |
},
|
417 |
"det":{
|
418 |
+
"p":0.9346733668,
|
419 |
+
"r":0.9192751236,
|
420 |
+
"f":0.926910299
|
421 |
},
|
422 |
"amod":{
|
423 |
+
"p":0.8549488055,
|
424 |
+
"r":0.8549488055,
|
425 |
+
"f":0.8549488055
|
426 |
},
|
427 |
"nmod:poss":{
|
428 |
+
"p":0.75,
|
429 |
+
"r":0.7128712871,
|
430 |
+
"f":0.730964467
|
431 |
},
|
432 |
"ccomp":{
|
433 |
+
"p":0.6885245902,
|
434 |
+
"r":0.6774193548,
|
435 |
+
"f":0.6829268293
|
436 |
},
|
437 |
"nummod":{
|
438 |
+
"p":0.8181818182,
|
439 |
+
"r":0.825,
|
440 |
+
"f":0.8215767635
|
441 |
},
|
442 |
"flat":{
|
443 |
+
"p":0.8636363636,
|
444 |
+
"r":0.880794702,
|
445 |
+
"f":0.8721311475
|
446 |
},
|
447 |
"compound:prt":{
|
448 |
+
"p":0.6551724138,
|
449 |
+
"r":0.4634146341,
|
450 |
+
"f":0.5428571429
|
451 |
},
|
452 |
"advcl":{
|
453 |
+
"p":0.6967213115,
|
454 |
+
"r":0.7327586207,
|
455 |
+
"f":0.7142857143
|
456 |
},
|
457 |
"mark":{
|
458 |
+
"p":0.9018789144,
|
459 |
+
"r":0.887063655,
|
460 |
+
"f":0.8944099379
|
461 |
},
|
462 |
"cop":{
|
463 |
+
"p":0.8514285714,
|
464 |
+
"r":0.8514285714,
|
465 |
+
"f":0.8514285714
|
466 |
},
|
467 |
"dep":{
|
468 |
+
"p":0.1960784314,
|
469 |
+
"r":0.3773584906,
|
470 |
+
"f":0.2580645161
|
471 |
},
|
472 |
"nmod":{
|
473 |
+
"p":0.7197452229,
|
474 |
+
"r":0.662109375,
|
475 |
+
"f":0.6897253306
|
476 |
},
|
477 |
"iobj":{
|
478 |
+
"p":0.7333333333,
|
479 |
+
"r":0.5,
|
480 |
+
"f":0.5945945946
|
481 |
},
|
482 |
"xcomp":{
|
483 |
+
"p":0.6315789474,
|
484 |
+
"r":0.406779661,
|
485 |
+
"f":0.4948453608
|
486 |
+
},
|
487 |
+
"list":{
|
488 |
+
"p":0.3636363636,
|
489 |
+
"r":0.2222222222,
|
490 |
+
"f":0.275862069
|
491 |
},
|
492 |
"vocative":{
|
493 |
"p":0.0,
|
|
|
495 |
"f":0.0
|
496 |
},
|
497 |
"fixed":{
|
498 |
+
"p":0.8947368421,
|
499 |
+
"r":0.8095238095,
|
500 |
+
"f":0.85
|
501 |
},
|
502 |
"expl":{
|
503 |
+
"p":0.9090909091,
|
504 |
+
"r":0.8823529412,
|
505 |
+
"f":0.8955223881
|
506 |
+
},
|
507 |
+
"appos":{
|
508 |
+
"p":0.6097560976,
|
509 |
+
"r":0.7575757576,
|
510 |
+
"f":0.6756756757
|
511 |
},
|
512 |
"obl:tmod":{
|
513 |
+
"p":0.8,
|
514 |
+
"r":0.2222222222,
|
515 |
+
"f":0.347826087
|
516 |
},
|
517 |
"discourse":{
|
518 |
"p":0.0,
|
|
|
520 |
"f":0.0
|
521 |
}
|
522 |
},
|
523 |
+
"sents_p":0.8603839442,
|
524 |
+
"sents_r":0.8741134752,
|
525 |
+
"sents_f":0.8671943712,
|
526 |
"lemma_acc":0.8491041162,
|
527 |
+
"ents_f":0.8231644261,
|
528 |
+
"ents_p":0.81724846,
|
529 |
+
"ents_r":0.8291666667,
|
530 |
"ents_per_type":{
|
531 |
"PER":{
|
532 |
+
"p":0.9290322581,
|
533 |
+
"r":0.8674698795,
|
534 |
+
"f":0.8971962617
|
535 |
+
},
|
536 |
+
"ORG":{
|
537 |
+
"p":0.7619047619,
|
538 |
+
"r":0.7111111111,
|
539 |
+
"f":0.7356321839
|
540 |
},
|
541 |
"MISC":{
|
542 |
+
"p":0.6739130435,
|
543 |
+
"r":0.8230088496,
|
544 |
+
"f":0.7410358566
|
545 |
},
|
546 |
"LOC":{
|
547 |
+
"p":0.8818181818,
|
548 |
+
"r":0.8738738739,
|
549 |
+
"f":0.8778280543
|
550 |
}
|
551 |
},
|
552 |
+
"transformer_loss":417466.8663170633,
|
553 |
+
"morphologizer_loss":34589.6649030063,
|
554 |
+
"parser_loss":151048.9837691551,
|
555 |
+
"ner_loss":5460.9844742843
|
556 |
},
|
557 |
+
"sources":[
|
558 |
+
{
|
559 |
+
"name":"UD Danish DDT v2.5",
|
560 |
+
"url":"https://github.com/UniversalDependencies/UD_Danish-DDT",
|
561 |
+
"license":"CC BY-SA 4.0",
|
562 |
+
"author":"Johannsen, Anders; Mart\u00ednez Alonso, H\u00e9ctor; Plank, Barbara"
|
563 |
+
},
|
564 |
+
{
|
565 |
+
"name":"DaNE",
|
566 |
+
"url":"https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane",
|
567 |
+
"license":"CC BY-SA 4.0",
|
568 |
+
"author":"Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard, Anders S\u00f8gaard"
|
569 |
+
},
|
570 |
+
{
|
571 |
+
"name":"Maltehb/-l-ctra-danish-electra-small-cased",
|
572 |
+
"author":"Malte H\u00f8jmark-Bertelsen",
|
573 |
+
"url":"https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased",
|
574 |
+
"license":"CC BY 4.0"
|
575 |
+
}
|
576 |
+
],
|
577 |
+
"requirements":[
|
578 |
+
"spacy-transformers>=1.0.3,<1.1.0"
|
579 |
+
],
|
580 |
+
"notes":"\n## Bias and Robustness\n\nBesides the validation done by spaCy on the DaNE test set, DaCy also provides a series of augmentations to the DaNE test set to see how well the models deal with these types of augmentations.\nThese can be seen as behavioural probes akin to the NLP checklist.\n\n### Deterministic Augmentations\nDeterministic augmentations are augmentations which always yield the same result.\n\n| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| No augmentation | 0.98 | 0.974 | 0.868 | 0.836 | 0.936 | 0.844 | 0.765 |\n| \u00c6\u00f8\u00e5 Augmentation | 0.955 | 0.948 | 0.823 | 0.783 | 0.922 | 0.754 | 0.718 |\n| Lowercase | 0.974 | 0.97 | 0.862 | 0.828 | 0.905 | 0.848 | 0.681 |\n| No Spacing | 0.229 | 0.229 | 0.004 | 0.003 | 0.824 | 0.225 | 0.048 |\n| Abbreviated first names | 0.979 | 0.973 | 0.864 | 0.832 | 0.94 | 0.845 | 0.699 |\n| Input size augmentation 5 sentences | 0.956 | 0.956 | 0.851 | 0.818 | 0.883 | 0.844 | 0.743 |\n| Input size augmentation 10 sentences | 0.959 | 0.958 | 0.853 | 0.821 | 0.897 | 0.844 | 0.755 |\n\n\n\n### Stochastic Augmentations\nStochastic augmentations are augmentations which are repeated multiple times to estimate the effect of the augmentation.\n\n| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Keystroke errors 2% | 0.931 (0.003) | 0.929 (0.003) | 0.797 (0.003) | 0.753 (0.003) | 0.884 (0.003) | 0.772 (0.003) | 0.657 (0.003) |\n| Keystroke errors 5% | 0.859 (0.003) | 0.863 (0.003) | 0.699 (0.003) | 0.641 (0.003) | 0.824 (0.003) | 0.681 (0.003) | 0.53 (0.003) |\n| Keystroke errors 15% | 0.633 (0.006) | 0.662 (0.006) | 0.439 (0.006) | 0.358 (0.006) | 0.688 (0.006) | 0.459 (0.006) | 0.293 (0.006) |\n| Danish names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |\n| Muslim names | 0.979 (0.0) | 0.974 (0.0) | 0.865 (0.0) | 0.833 (0.0) | 0.94 (0.0) | 0.847 (0.0) | 0.732 (0.0) |\n| Female names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.946 (0.0) | 0.847 (0.0) | 0.754 (0.0) |\n| Male names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |\n| Spacing Augmentation 5% | 0.941 (0.002) | 0.936 (0.002) | 0.755 (0.002) | 0.725 (0.002) | 0.907 (0.002) | 0.811 (0.002) | 0.699 (0.002) |\n\n<details>\n\n<summary> Description of Augmenters </summary>\n\n \n\n**No augmentation:**\nApplies no augmentation to the DaNE test set.\n\n**\u00c6\u00f8\u00e5 Augmentation:**\nThis augmentation replaces \u00e6, \u00f8, and \u00e5 with their spelling variations ae, oe and aa respectively.\n\n**Lowercase:**\nThis augmentation lowercases all text.\n\n**No Spacing:**\nThis augmentation removes all spacing from the text.\n\n**Abbreviated first names:**\nThis augmentation abbreviates the first names of entities. For instance 'Kenneth Enevoldsen' would turn to 'K. Enevoldsen'.\n\n**Keystroke errors 2%:**\nThis augmentation simulates keystroke errors by replacing 2% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Keystroke errors 5%:**\nThis augmentation simulates keystroke errors by replacing 5% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Keystroke errors 15%:**\nThis augmentation simulates keystroke errors by replacing 15% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Danish names:**\nThis augmentation replaces all names with Danish names derived from Danmarks Statistik (2021). As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Muslim names:**\nThis augmentation replaces all names with Muslim names derived from Meldgaard (2005). As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Female names:**\nThis augmentation replaces all names with Danish female names derived from Danmarks Statistik (2021). As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Male names:**\nThis augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n\n**Spacing Augmentation 5%:**\nThis augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic it is repeated 20 times to obtain a consistent estimate and the mean is provided with its standard deviation in parentheses.\n </details> \n <br /> \n\n\n### Hardware\nThis was run and trained on a Quadro RTX 8000 GPU."
|
581 |
}
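Reviewer note: the `p`/`r`/`f` triples in the `performance` block of `meta.json` look consistent with the usual harmonic-mean F-score, F = 2pr / (p + r). The sketch below spot-checks a few triples copied from this diff. The helper name `f_score` and the choice of entries are illustrative assumptions, not part of spaCy's or DaCy's API.

```python
import math


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# (precision, recall, reported F) copied from the meta.json values in this diff.
reported = {
    "morph Degree": (0.9726708075, 0.943373494, 0.9577981651),
    "morph Abbr": (1.0, 0.4, 0.5714285714),
    "morph Polite": (0.3333333333, 0.25, 0.2857142857),
    "sents": (0.8603839442, 0.8741134752, 0.8671943712),
    "ents": (0.81724846, 0.8291666667, 0.8231644261),
}

# Compare recomputed and reported F with a tolerance that absorbs rounding.
mismatches = [
    name
    for name, (p, r, f) in reported.items()
    if not math.isclose(f_score(p, r), f, abs_tol=1e-6)
]
print("mismatches:", mismatches)
```

A low F with perfect precision, as for `Abbr` (p = 1.0, r = 0.4), means the feature was always correct when predicted but was predicted for fewer than half of the gold cases.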
|
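Reviewer note: the deterministic augmenters described in the `notes` field are essentially string transforms. The following is a minimal plain-string sketch of four of them. The function names are hypothetical, the handling of uppercase Æ/Ø/Å is an assumption, and DaCy's actual augmenters operate on annotated spaCy examples rather than bare strings, so this only illustrates the idea.

```python
# Hypothetical helpers for illustration; not DaCy's API.
AEOEAA = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}


def aeoeaa_augment(text: str) -> str:
    """Replace æ, ø, å with the spelling variants ae, oe, aa."""
    return "".join(AEOEAA.get(ch, ch) for ch in text)


def lowercase_augment(text: str) -> str:
    """Lowercase all text."""
    return text.lower()


def no_spacing_augment(text: str) -> str:
    """Remove all spaces from the text."""
    return text.replace(" ", "")


def abbreviate_first_name(full_name: str) -> str:
    """Abbreviate the first name: 'Kenneth Enevoldsen' -> 'K. Enevoldsen'."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name  # nothing to abbreviate for single-token names
    return f"{parts[0][0]}. " + " ".join(parts[1:])


print(aeoeaa_augment("Rødgrød på Ærø"))
print(abbreviate_first_name("Kenneth Enevoldsen"))
```

Removing every space leaves the tokenizer with no word boundaries, which fits the near-zero UAS/LAS reported for "No Spacing" in the table.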
morphologizer/model
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 161992
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:601cec06d7bb6f1e2025cf6878f5c8fb02d89b5fc71ba82c80e718a28c63f87f
|
3 |
size 161992
|
ner/model
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 94890
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:6c7bd95a31a59f7cb632de4a99c12643602828d312d04a7ba233f3bdb7f15778
|
3 |
size 94890
|
parser/model
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 325085
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:db9711e97c156d5c9892a65b87d6a185289f74b92dcec527cf6906dfb6e821a6
|
3 |
size 325085
|
transformer/model/pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
size 54773654
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:d65643fe23c672180685635b539688406638af1f7e515cb89505ea7626127400
|
3 |
size 54773654
|
transformer/model/tokenizer_config.json
CHANGED
@@ -1 +1 @@
|
|
1 |
-
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents":
|
|
|
1 |
+
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "full_tokenizer_file": null, "model_max_length": 128, "name_or_path": "Maltehb/-l-ctra-danish-electra-small-cased", "do_basic_tokenize": true, "never_split": null}
|
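Reviewer note: the updated `tokenizer_config.json` is a single JSON object, so the settings that matter for this pipeline (cased input, 128-token maximum length) can be read back with the standard library. The sketch below parses the new-side line copied from the diff above; the variable names are illustrative.

```python
import json

# New-side content of transformer/model/tokenizer_config.json, copied from the diff.
raw = (
    '{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", '
    '"pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", '
    '"tokenize_chinese_chars": true, "strip_accents": null, '
    '"special_tokens_map_file": null, "full_tokenizer_file": null, '
    '"model_max_length": 128, '
    '"name_or_path": "Maltehb/-l-ctra-danish-electra-small-cased", '
    '"do_basic_tokenize": true, "never_split": null}'
)

config = json.loads(raw)

# JSON false/null/true map to Python False/None/True.
print("cased model:", config["do_lower_case"] is False)
print("max length:", config["model_max_length"])
print("base model:", config["name_or_path"])
```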
vocab/strings.json
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:5b50a86603f748496e4fd87a8aaa203a32bf82d4b3768bf54187ff40de3ca6f9
|
3 |
+
size 460120
|
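Reviewer note: the model and vocab files in this commit are stored as Git LFS pointers: small text files with `version`, `oid`, and `size` lines. A minimal sketch of reading one is given below, using the new `vocab/strings.json` pointer from this diff. The helper name `parse_lfs_pointer` is hypothetical, and the parser assumes only the three keys shown here.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:5b50a86603f748496e4fd87a8aaa203a32bf82d4b3768bf54187ff40de3ca6f9
size 460120
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

For the `morphologizer`, `ner`, `parser`, and `pytorch_model.bin` hunks the new-side `size` equals the old-side `size`, so those updates show up only as a changed `oid`.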