language: de
library_name: sfst
license: gpl-2.0
tags:
- sfst
- dwdsmor
- token-classification
- lemmatisation
model-index:
- name: dwdsmor
  results:
  - task:
      type: token-classification
      name: Lemmatisation
    dataset:
      name: Universal Dependencies Treebank (de-hdt)
      type: universal_dependencies
      config: de_hdt
      split: train
    metrics:
    - type: coverage
      value: 0.8415293963067323
      name: Coverage
    - type: coverage
      value: 1
      name: Coverage ($()
    - type: coverage
      value: 1
      name: Coverage ($,)
    - type: coverage
      value: 0.9999580703997988
      name: Coverage ($.)
    - type: coverage
      value: 0.774030155216797
      name: Coverage (ADJA)
    - type: coverage
      value: 0.7548407611333322
      name: Coverage (ADJD)
    - type: coverage
      value: 0.9682621529723873
      name: Coverage (ADV)
    - type: coverage
      value: 0.9989939637826962
      name: Coverage (APPO)
    - type: coverage
      value: 0.9308645050358152
      name: Coverage (APPR)
    - type: coverage
      value: 0.9967651071695788
      name: Coverage (APPRART)
    - type: coverage
      value: 0.7916666666666666
      name: Coverage (APZR)
    - type: coverage
      value: 0.9999603964317185
      name: Coverage (ART)
    - type: coverage
      value: 0.9613524039049266
      name: Coverage (CARD)
    - type: coverage
      value: 0.13320473120462967
      name: Coverage (FM)
    - type: coverage
      value: 0.7142857142857143
      name: Coverage (ITJ)
    - type: coverage
      value: 1
      name: Coverage (KOKOM)
    - type: coverage
      value: 0.9995274949083504
      name: Coverage (KON)
    - type: coverage
      value: 1
      name: Coverage (KOUI)
    - type: coverage
      value: 0.9858579967925354
      name: Coverage (KOUS)
    - type: coverage
      value: 0.0618080812117821
      name: Coverage (NE)
    - type: coverage
      value: 0.7440482047389456
      name: Coverage (NN)
    - type: coverage
      value: 0.9799275737196068
      name: Coverage (PDAT)
    - type: coverage
      value: 0.9995682832062167
      name: Coverage (PDS)
    - type: coverage
      value: 0.9879094306440976
      name: Coverage (PIAT)
    - type: coverage
      value: 1
      name: Coverage (PIDAT)
    - type: coverage
      value: 0.9951910051476565
      name: Coverage (PIS)
    - type: coverage
      value: 0.999888876541838
      name: Coverage (PPER)
    - type: coverage
      value: 1
      name: Coverage (PPOSAT)
    - type: coverage
      value: 1
      name: Coverage (PPOSS)
    - type: coverage
      value: 1
      name: Coverage (PRELAT)
    - type: coverage
      value: 1
      name: Coverage (PRELS)
    - type: coverage
      value: 1
      name: Coverage (PRF)
    - type: coverage
      value: 0.9861938278289117
      name: Coverage (PROAV)
    - type: coverage
      value: 0.3082133784928027
      name: Coverage (PTKA)
    - type: coverage
      value: 1
      name: Coverage (PTKANT)
    - type: coverage
      value: 1
      name: Coverage (PTKNEG)
    - type: coverage
      value: 0.7705097087378641
      name: Coverage (PTKVZ)
    - type: coverage
      value: 0
      name: Coverage (PTKZU)
    - type: coverage
      value: 0.9551166965888689
      name: Coverage (PWAT)
    - type: coverage
      value: 0.9937264742785445
      name: Coverage (PWAV)
    - type: coverage
      value: 0.9946524064171123
      name: Coverage (PWS)
    - type: coverage
      value: 1
      name: Coverage (VAFIN)
    - type: coverage
      value: 1
      name: Coverage (VAIMP)
    - type: coverage
      value: 1
      name: Coverage (VAINF)
    - type: coverage
      value: 1
      name: Coverage (VAPP)
    - type: coverage
      value: 1
      name: Coverage (VMFIN)
    - type: coverage
      value: 1
      name: Coverage (VMINF)
    - type: coverage
      value: 1
      name: Coverage (VMPP)
    - type: coverage
      value: 0.886487187323461
      name: Coverage (VVFIN)
    - type: coverage
      value: 0.9596122778675282
      name: Coverage (VVIMP)
    - type: coverage
      value: 0.8214535019002559
      name: Coverage (VVINF)
    - type: coverage
      value: 0.829683698296837
      name: Coverage (VVIZU)
    - type: coverage
      value: 0.7996866513473992
      name: Coverage (VVPP)
    - type: coverage
      value: 0.4148471615720524
      name: Coverage (XY)
DWDSmor
SFST/SMOR/DWDS-based German morphology
DWDSmor implements the lemmatisation and morphological analysis of word forms as well as the generation of paradigms of lexical words in written German.
Usage
DWDSmor is available via PyPI:
pip install dwdsmor
For lemmatisation:
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"+V"}) == "testen"
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
…
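To lemmatise a whole sequence of tokens, a thin wrapper along the following lines may help. This is only a minimal sketch: lemmatize_tokens and stub_lemmatizer are hypothetical names that are not part of DWDSmor, the stub merely mimics the two calls shown above so that the example runs without DWDSmor installed, and the fallback to the surface form assumes that the lemmatizer signals a missing analysis by returning None.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def lemmatize_tokens(
    tokens: Iterable[Tuple[str, Set[str]]],
    lemmatize: Callable[..., Optional[str]],
) -> List[Tuple[str, str]]:
    """Pair each word with its lemma, keeping the word itself if no lemma is found."""
    results = []
    for word, pos in tokens:
        # Same call shape as in the session above: lemmatizer(word, pos={...}).
        lemma = lemmatize(word, pos=pos)
        # Assumption: None means "no analysis"; fall back to the surface form.
        results.append((word, lemma if lemma is not None else word))
    return results


def stub_lemmatizer(word: str, pos: Optional[Set[str]] = None) -> Optional[str]:
    """Hypothetical stand-in that only knows the two examples from above."""
    table = {("getestet", "+V"): "testen", ("getestet", "+ADJ"): "getestet"}
    for tag in sorted(pos or ()):
        if (word, tag) in table:
            return table[(word, tag)]
    return None


if __name__ == "__main__":
    tokens = [("getestet", {"+V"}), ("getestet", {"+ADJ"})]
    for word, lemma in lemmatize_tokens(tokens, stub_lemmatizer):
        print(word, lemma)
```

In real use, the object returned by dwdsmor.lemmatizer() would presumably be passed in place of the stub.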
Development
This repository provides source code for building DWDSmor lexica and transducers as well as for using DWDSmor transducers for morphological analysis and paradigm generation:
- dwdsmor/ contains Python packages for using DWDSmor, including scripts for morphological analysis and for paradigm generation by means of DWDSmor transducers.
- share/ contains XSLT stylesheets for extracting lexical entries in SMORLemma format from XML sources of DWDS articles. Sample inputs and outputs can be found in samples/.
- lexicon/dwds/ contains scripts for building DWDSmor lexica by means of the XSLT stylesheets in share/ and DWDS sources in lexicon/dwds/wb/, which are not part of this repository.
- lexicon/sample/ contains scripts for building sample DWDSmor lexica by means of the XSLT stylesheets in share/ and the sample lexicon in lexicon/sample/wb/.
- grammar/ contains an FST grammar derived from SMORLemma, providing the morphology for building DWDSmor automata from DWDSmor lexica.
- test/ implements a test suite for the DWDSmor transducers.
DWDSmor is in active development. In its current stage, DWDSmor supports most inflection classes and some productive word-formation patterns of written German. Note that the sample lexicon in lexicon/sample/wb/ covers only a small subset of the German vocabulary, and so do the DWDSmor automata compiled from it.
Prerequisites
GNU/Linux : Development, builds and tests of DWDSmor are performed on Debian GNU/Linux. Other UNIX-like operating systems such as macOS should work as well, but they are not actively supported.
Python >= v3.9 : DWDSmor targets Python as its primary runtime environment. The DWDSmor transducers can be used via SFST's command-line tools, queried in Python applications via language-specific bindings, or used by the Python scripts dwdsmor.py and paradigm.py for morphological analysis and for paradigm generation.
Saxon-HE : The extraction of lexical entries from XML sources of DWDS articles is implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.
Java (JDK) >= v8 : Saxon requires a Java runtime.
SFST : a C++ library and toolbox for finite-state transducers (FSTs); please take a look at its homepage for installation and usage instructions.
On a Debian-based distribution, install the following packages:
apt install python3 default-jdk libsaxonhe-java sfst
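Before building, it can be useful to verify that the prerequisites are actually visible from the current shell. The following is a minimal sketch using only the Python standard library; check_prerequisites is a hypothetical helper, and the executable names java and fst-infl2 are assumptions about what the Debian packages put on the PATH (Saxon-HE is shipped as a Java archive and is therefore not checked here).

```python
import shutil
import sys

# Minimum Python version stated in the prerequisites.
MIN_PYTHON = (3, 9)

# Assumed executable names; adjust them to your installation.
REQUIRED_TOOLS = ("java", "fst-infl2")


def check_prerequisites(tools=REQUIRED_TOOLS, version=None):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    if version is None:
        version = sys.version_info[:2]
    report = {"python": tuple(version) >= MIN_PYTHON}
    for tool in tools:
        # shutil.which returns None if the executable is not on the PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

A missing entry only indicates that the tool was not found on the PATH; it may still be installed in a non-standard location.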
Set up a virtual environment for project builds, for example via Python's venv:
python3 -m venv .venv
source .venv/bin/activate
Then run the DWDSmor setup routine in order to install Python dependencies:
pip install -e .[dev]
Building DWDSmor lexica and transducers
For building DWDSmor lexica and transducers, run:
make all
Alternatively, you can run:
make dwds && make dwds-install && make dwdsmor
Note that these commands require DWDS sources in lexicon/dwds/wb/, which are not part of this repository.
Alternatively, you can build sample DWDSmor lexica and transducers from the sample lexicon in lexicon/sample/wb/ by running:
make sample && make sample-install && make dwdsmor
After building DWDSmor transducers, install them into lib/, where the Python scripts dwdsmor and dwdsmor-paradigm expect them by default:
make install
The installed DWDSmor transducers are:
- lib/dwdsmor.{a,ca}: transducer with inflection and word-formation components, for lemmatisation and morphological analysis of word forms in terms of grammatical categories
- lib/dwdsmor-morph.{a,ca}: transducer with inflection and word-formation components, for the generation of morphologically segmented word forms
- lib/dwdsmor-finite.{a,ca}: transducer with an inflection component and a finite word-formation component, for testing purposes
- lib/dwdsmor-root.{a,ca}: transducer with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022)
- lib/dwdsmor-index.{a,ca}: transducer with an inflection component only with DWDS homographic lemma indices, for paradigm generation
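To confirm that make install produced all of the files listed above, a small check such as the following can be used. It is a sketch under the assumption that the transducers live in a lib/ directory relative to the working directory; missing_transducers is a hypothetical helper and not part of DWDSmor.

```python
from pathlib import Path

# Transducer base names and file extensions as listed above.
TRANSDUCERS = (
    "dwdsmor",
    "dwdsmor-morph",
    "dwdsmor-finite",
    "dwdsmor-root",
    "dwdsmor-index",
)
EXTENSIONS = ("a", "ca")


def missing_transducers(lib_dir="lib"):
    """Return the names of expected transducer files that are absent from lib_dir."""
    lib = Path(lib_dir)
    return [
        f"{name}.{ext}"
        for name in TRANSDUCERS
        for ext in EXTENSIONS
        if not (lib / f"{name}.{ext}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_transducers()
    if missing:
        print("Missing transducers:", ", ".join(missing))
    else:
        print("All DWDSmor transducers are installed.")
```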
Testing DWDSmor
Run pytest in order to test basic transducer usage and to check for potential regressions.
Contact
Feel free to contact Andreas Nolda for questions regarding the lexicon or the grammar and Gregor Middell for questions related to the integration of DWDSmor into your corpus-annotation pipeline.
License
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is licensed under the GNU General Public Licence v2.0. The same applies to the rest of this project.
Credits
DWDSmor is based on the following software and datasets:
- SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006)
- SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component
- the DWDS dictionary (BBAW n.d.) replacing the IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphological properties.
Bibliography
- Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. https://www.dwds.de
- Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart. PDF
- Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376. PDF.
- Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 308–309. PDF.
- Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266. PDF
- Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067. PDF.