|
--- |
|
language: de |
|
library_name: sfst |
|
license: gpl-2.0 |
|
tags: |
|
- sfst |
|
- dwdsmor |
|
- token-classification |
|
- lemmatisation |
|
model-index: |
|
- name: dwdsmor |
|
results: |
|
- task: |
|
type: token-classification |
|
name: Lemmatisation |
|
dataset: |
|
name: Universal Dependencies Treebank (de-hdt) |
|
type: universal_dependencies |
|
config: de_hdt |
|
split: train |
|
metrics: |
|
- type: coverage |
|
value: 0.8415293963067323 |
|
name: Coverage |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage ($() |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage ($,) |
|
- type: coverage |
|
value: 0.9999580703997988 |
|
name: Coverage ($.) |
|
- type: coverage |
|
value: 0.774030155216797 |
|
name: Coverage (ADJA) |
|
- type: coverage |
|
value: 0.7548407611333322 |
|
name: Coverage (ADJD) |
|
- type: coverage |
|
value: 0.9682621529723873 |
|
name: Coverage (ADV) |
|
- type: coverage |
|
value: 0.9989939637826962 |
|
name: Coverage (APPO) |
|
- type: coverage |
|
value: 0.9308645050358152 |
|
name: Coverage (APPR) |
|
- type: coverage |
|
value: 0.9967651071695788 |
|
name: Coverage (APPRART) |
|
- type: coverage |
|
value: 0.7916666666666666 |
|
name: Coverage (APZR) |
|
- type: coverage |
|
value: 0.9999603964317185 |
|
name: Coverage (ART) |
|
- type: coverage |
|
value: 0.9613524039049266 |
|
name: Coverage (CARD) |
|
- type: coverage |
|
value: 0.13320473120462967 |
|
name: Coverage (FM) |
|
- type: coverage |
|
value: 0.7142857142857143 |
|
name: Coverage (ITJ) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (KOKOM) |
|
- type: coverage |
|
value: 0.9995274949083504 |
|
name: Coverage (KON) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (KOUI) |
|
- type: coverage |
|
value: 0.9858579967925354 |
|
name: Coverage (KOUS) |
|
- type: coverage |
|
value: 0.0618080812117821 |
|
name: Coverage (NE) |
|
- type: coverage |
|
value: 0.7440482047389456 |
|
name: Coverage (NN) |
|
- type: coverage |
|
value: 0.9799275737196068 |
|
name: Coverage (PDAT) |
|
- type: coverage |
|
value: 0.9995682832062167 |
|
name: Coverage (PDS) |
|
- type: coverage |
|
value: 0.9879094306440976 |
|
name: Coverage (PIAT) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PIDAT) |
|
- type: coverage |
|
value: 0.9951910051476565 |
|
name: Coverage (PIS) |
|
- type: coverage |
|
value: 0.999888876541838 |
|
name: Coverage (PPER) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PPOSAT) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PPOSS) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PRELAT) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PRELS) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PRF) |
|
- type: coverage |
|
value: 0.9861938278289117 |
|
name: Coverage (PROAV) |
|
- type: coverage |
|
value: 0.3082133784928027 |
|
name: Coverage (PTKA) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PTKANT) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (PTKNEG) |
|
- type: coverage |
|
value: 0.7705097087378641 |
|
name: Coverage (PTKVZ) |
|
- type: coverage |
|
value: 0.0 |
|
name: Coverage (PTKZU) |
|
- type: coverage |
|
value: 0.9551166965888689 |
|
name: Coverage (PWAT) |
|
- type: coverage |
|
value: 0.9937264742785445 |
|
name: Coverage (PWAV) |
|
- type: coverage |
|
value: 0.9946524064171123 |
|
name: Coverage (PWS) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VAFIN) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VAIMP) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VAINF) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VAPP) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VMFIN) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VMINF) |
|
- type: coverage |
|
value: 1.0 |
|
name: Coverage (VMPP) |
|
- type: coverage |
|
value: 0.886487187323461 |
|
name: Coverage (VVFIN) |
|
- type: coverage |
|
value: 0.9596122778675282 |
|
name: Coverage (VVIMP) |
|
- type: coverage |
|
value: 0.8214535019002559 |
|
name: Coverage (VVINF) |
|
- type: coverage |
|
value: 0.829683698296837 |
|
name: Coverage (VVIZU) |
|
- type: coverage |
|
value: 0.7996866513473992 |
|
name: Coverage (VVPP) |
|
- type: coverage |
|
value: 0.4148471615720524 |
|
name: Coverage (XY) |
|
--- |
|
|
|
# DWDSmor |
|
|
|
_SFST/SMOR/DWDS-based German morphology_ |
|
|
|
DWDSmor implements the lemmatisation and morphological analysis of |
|
word forms as well as the generation of paradigms of lexical words in |
|
written German. |
|
|
|
## Usage |
|
|
|
DWDSmor is available via PyPI: |
|
|
|
``` plaintext |
|
pip install dwdsmor |
|
``` |
|
|
|
For lemmatisation: |
|
|
|
``` python-console |
|
>>> import dwdsmor |
|
>>> lemmatizer = dwdsmor.lemmatizer() |
|
>>> assert lemmatizer("getestet", pos={"+V"}) == "testen" |
|
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet" |
|
``` |
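
For running text, a thin wrapper can apply the lemmatizer token by token. The following is a minimal sketch, not part of DWDSmor: `lemmatize_tokens` is a hypothetical helper, and it assumes that the lemmatizer returns `None` for word forms it cannot analyse. The stand-in `toy_lemmatizer` only mimics the call shape shown above so that the example runs without the package installed.

```python
from typing import Callable, Iterable, Optional


# Hypothetical helper, not part of the dwdsmor package.
def lemmatize_tokens(
    lemmatizer: Callable[..., Optional[str]],
    tokens: Iterable[str],
    **kwargs,
) -> list[str]:
    """Lemmatise each token, keeping the surface form when no lemma is found.

    Assumption: the lemmatizer returns None for uncovered word forms.
    """
    lemmas = []
    for token in tokens:
        lemma = lemmatizer(token, **kwargs)
        lemmas.append(token if lemma is None else lemma)
    return lemmas


# Stand-in with the same call shape as the lemmatizer above (illustration only).
def toy_lemmatizer(word, pos=None):
    known = {("getestet", "+V"): "testen", ("getestet", "+ADJ"): "getestet"}
    for tag in pos or ():
        if (word, tag) in known:
            return known[(word, tag)]
    return None


print(lemmatize_tokens(toy_lemmatizer, ["getestet", "xyzzy"], pos={"+V"}))
```

With the package installed, the object returned by `dwdsmor.lemmatizer()` could presumably be passed in place of `toy_lemmatizer`, provided the `None` assumption holds.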
|
|
|
… |
|
|
|
## Development |
|
|
|
This repository provides source code for building DWDSmor lexica and transducers |
|
as well as for using DWDSmor transducers for morphological analysis and paradigm |
|
generation: |
|
|
|
* `dwdsmor/` contains Python packages for using DWDSmor, including |
|
scripts for morphological analysis and for paradigm generation by |
|
means of DWDSmor transducers. |
|
* `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma |
|
format from XML sources of DWDS articles. Sample inputs and outputs can be |
|
found in `samples/`. |
|
* `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the |
|
XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are |
|
not part of this repository. |
|
* `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means |
|
of the XSLT stylesheets in `share/` and the sample lexicon in |
|
`lexicon/sample/wb/`. |
|
* `grammar/` contains an FST grammar derived from SMORLemma, providing the |
|
morphology for building DWDSmor automata from DWDSmor lexica. |
|
* `test/` implements a test suite for the DWDSmor transducers. |
|
|
|
DWDSmor is in active development. In its current stage, DWDSmor supports most |
|
inflection classes and some productive word-formation patterns of written |
|
German. Note that the sample lexicon in `lexicon/sample/wb/` only covers a |
|
small subset of the German vocabulary, and so do the DWDSmor automata compiled |
|
from it. |
|
|
|
|
|
## Prerequisites |
|
|
|
[GNU/Linux](https://www.debian.org/) |
|
: Development, builds and tests of DWDSmor are performed |
|
on [Debian GNU/Linux](https://debian.org/). While other UNIX-like operating |
|
systems such as macOS should work, too, they are not actively supported. |
|
|
|
[Python >= v3.9](https://www.python.org/) |
|
: DWDSmor targets Python as its primary runtime environment. The DWDSmor |
|
transducers can be used via SFST's command-line tools, queried in Python |
|
applications via language-specific |
|
[bindings](https://github.com/gremid/sfst-transduce), or used by the Python |
|
scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for |
|
paradigm generation. |
|
|
|
[Saxon-HE](https://www.saxonica.com/) |
|
: The extraction of lexical entries from XML sources of DWDS articles is |
|
implemented in XSLT 2, for which Saxon-HE is used as the runtime environment. |
|
|
|
[Java (JDK) >= v8](https://openjdk.java.net/) |
|
: Saxon requires a Java runtime. |
|
|
|
[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/) |
|
: A C++ library and toolbox for finite-state transducers (FSTs); please take a |
|
look at its homepage for installation and usage instructions. |
|
|
|
On a Debian-based distribution, install the following packages: |
|
|
|
```sh |
|
apt install python3 default-jdk libsaxonhe-java sfst |
|
``` |
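
Before building, it can help to verify that the prerequisites are actually available. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the executable names in `REQUIRED_TOOLS` (in particular the SFST tools `fst-compiler-utf8` and `fst-infl2`) are assumed to be what the packages above install on your system.

```python
import shutil
import sys

# Assumed executable names; adjust them to your installation.
REQUIRED_TOOLS = ("java", "make", "fst-compiler-utf8", "fst-infl2")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 9)):
    """Check the interpreter against the minimum Python version (>= 3.9)."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("Python >= 3.9 is required")
for tool in missing_tools():
    print(f"missing prerequisite: {tool}")
```

The `which` parameter exists only to make the check easy to test; by default it uses `shutil.which` from the standard library.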
|
|
|
Set up a virtual environment for project builds, for example via Python's `venv`: |
|
|
|
```sh |
|
python3 -m venv .venv |
|
source .venv/bin/activate |
|
``` |
|
|
|
Then run the DWDSmor setup routine in order to install Python dependencies: |
|
|
|
```sh |
|
pip install -e '.[dev]' |
|
``` |
|
|
|
|
|
## Building DWDSmor lexica and transducers |
|
|
|
For building DWDSmor lexica and transducers, run: |
|
|
|
```sh |
|
make all |
|
``` |
|
|
|
Alternatively, you can run: |
|
|
|
```sh |
|
make dwds && make dwds-install && make dwdsmor |
|
``` |
|
|
|
Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are |
|
not part of this repository. |
|
|
|
Alternatively, you can build sample DWDSmor lexica and transducers from the |
|
sample lexicon in `lexicon/sample/wb/` by running: |
|
|
|
```sh |
|
make sample && make sample-install && make dwdsmor |
|
``` |
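
The choice between the two build variants can be sketched as a small script. This is a hypothetical helper, not part of the repository: it assumes that the presence of `*.xml` files below `lexicon/dwds/wb/` indicates that the DWDS sources are available, and it only prints the corresponding `make` command line from above.

```python
from pathlib import Path


# Hypothetical helper, not part of the DWDSmor repository.
def build_targets(repo_root="."):
    """Pick the make targets depending on whether DWDS sources are present."""
    dwds_sources = Path(repo_root) / "lexicon" / "dwds" / "wb"
    # Assumption: the DWDS sources are XML files somewhere below this directory.
    has_sources = dwds_sources.is_dir() and any(dwds_sources.rglob("*.xml"))
    prefix = "dwds" if has_sources else "sample"
    return [prefix, f"{prefix}-install", "dwdsmor"]


print(" && ".join(f"make {target}" for target in build_targets()))
```

Run from the repository root, this prints the sample build command unless DWDS sources are found.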
|
|
|
After building DWDSmor transducers, install them into `lib/`, where the |
|
Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default: |
|
|
|
```sh |
|
make install |
|
``` |
|
|
|
The installed DWDSmor transducers are: |
|
|
|
* `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation |
|
components, for lemmatisation and morphological analysis of word forms in |
|
terms of grammatical categories |
|
* `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation |
|
components, for the generation of morphologically segmented word forms |
|
* `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a |
|
finite word-formation component, for testing purposes |
|
* `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation |
|
components, for lexical analysis of word forms in terms of root lemmas (i.e., |
|
lemmas of ultimate word-formation bases), word-formation process, |
|
word-formation means, and grammatical categories in terms of the |
|
Pattern-and-Restriction Theory of word formation (Nolda 2022) |
|
* `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only and with |
|
DWDS homographic lemma indices, for paradigm generation |
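
To see which of these transducers are present after `make install`, the list above can be turned into a small lookup. This is a sketch with hypothetical names (`TRANSDUCERS`, `installed_transducers`); it assumes only the file layout described above, with `{a,ca}` read as the two file extensions `.a` and `.ca`.

```python
from pathlib import Path

# Transducer names and purposes as listed above.
TRANSDUCERS = {
    "dwdsmor": "lemmatisation and morphological analysis",
    "dwdsmor-morph": "generation of morphologically segmented word forms",
    "dwdsmor-finite": "testing",
    "dwdsmor-root": "lexical analysis in terms of root lemmas",
    "dwdsmor-index": "paradigm generation",
}


def installed_transducers(lib_dir="lib"):
    """Map each transducer name to the automaton files found in lib_dir."""
    lib = Path(lib_dir)
    found = {}
    for name in TRANSDUCERS:
        # `{a,ca}` stands for the two file extensions of each transducer.
        candidates = [lib / f"{name}.{ext}" for ext in ("a", "ca")]
        found[name] = [path for path in candidates if path.is_file()]
    return found


for name, files in installed_transducers().items():
    status = ", ".join(path.name for path in files) or "not installed"
    print(f"{name} ({TRANSDUCERS[name]}): {status}")
```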
|
|
|
|
|
## Testing DWDSmor |
|
|
|
Run `pytest` in order to test basic transducer usage and to check for |
|
potential regressions. |
|
|
|
## Contact |
|
|
|
Feel free to contact [Andreas Nolda](mailto:andreas.nolda@bbaw.de) for |
|
questions regarding the lexicon or the grammar and |
|
[Gregor Middell](mailto:gregor.middell@bbaw.de) for questions related |
|
to the integration of DWDSmor into your corpus-annotation pipeline. |
|
|
|
|
|
## License |
|
|
|
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is |
|
licensed under the GNU General Public License v2.0. The same applies |
|
to the rest of this project. |
|
|
|
## Credits |
|
|
|
DWDSmor is based on the following software and datasets: |
|
|
|
1. [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++ library |
|
and toolbox for finite-state transducers (FSTs) (Schmid 2006) |
|
2. [SMORLemma](https://github.com/rsennrich/SMORLemma) (Sennrich and Kunz 2014), |
|
a modified version of the Stuttgart Morphology |
|
([SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/)) (Schmid, Fitschen, and |
|
Heid 2004) with an alternative lemmatisation component |
|
3. the [DWDS dictionary](https://www.dwds.de/) (BBAW n.d.) replacing the |
|
[IMSLex](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/imslex/) |
|
(Fitschen 2004) as the lexical data source for German words, their grammatical |
|
categories, and their morphological properties. |
|
|
|
## Bibliography |
|
|
|
* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). |
|
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur |
|
deutschen Sprache in Geschichte und Gegenwart. |
|
https://www.dwds.de |
|
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes |
|
System. Ph.D. thesis, Universität Stuttgart. |
|
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf). |
|
* Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on |
|
compounding and blending in German. In *Headedness and/or Grammatical |
|
Anarchy?*, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically |
|
Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, |
|
343–376. |
|
[PDF](https://zenodo.org/record/7142720/files/336-FreywaldSimonMüller-2022-11.pdf). |
|
* Schmid, Helmut (2006). A programming language for finite state transducers. In |
|
*Finite-State Methods and Natural Language Processing: 5th International |
|
Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005*, ed. by Anssi |
|
Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial |
|
Intelligence 4002, Berlin: Springer, 308–309. |
|
[PDF](https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf). |
|
* Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German |
|
computational morphology covering derivation, composition, and inflection. In |
|
LREC 2004: Fourth International Conference on Language Resources and |
|
Evaluation, ed. by Maria T. Lino *et al.*, European Language Resources |
|
Association, 1263–1266. |
|
[PDF](http://www.lrec-conf.org/proceedings/lrec2004/pdf/468.pdf). |
|
* Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon |
|
extracted from Wiktionary. In LREC 2014: Ninth International Conference on |
|
Language Resources and Evaluation, ed. by Nicoletta Calzolari *et al.*, |
|
European Language Resources Association, 1063–1067. |
|
[PDF](http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf). |
|
|