---
language: de
library_name: sfst
license: gpl-2.0
tags:
- sfst
- dwdsmor
- token-classification
- lemmatisation
model-index:
- name: dwdsmor
  results:
  - task:
      type: token-classification
      name: Lemmatisation
    dataset:
      name: Universal Dependencies Treebank (de-hdt)
      type: universal_dependencies
      config: de_hdt
      split: train
    metrics:
    - type: coverage
      value: 0.8415293963067323
      name: Coverage
    - type: coverage
      value: 1.0
      name: Coverage ($()
    - type: coverage
      value: 1.0
      name: Coverage ($,)
    - type: coverage
      value: 0.9999580703997988
      name: Coverage ($.)
    - type: coverage
      value: 0.774030155216797
      name: Coverage (ADJA)
    - type: coverage
      value: 0.7548407611333322
      name: Coverage (ADJD)
    - type: coverage
      value: 0.9682621529723873
      name: Coverage (ADV)
    - type: coverage
      value: 0.9989939637826962
      name: Coverage (APPO)
    - type: coverage
      value: 0.9308645050358152
      name: Coverage (APPR)
    - type: coverage
      value: 0.9967651071695788
      name: Coverage (APPRART)
    - type: coverage
      value: 0.7916666666666666
      name: Coverage (APZR)
    - type: coverage
      value: 0.9999603964317185
      name: Coverage (ART)
    - type: coverage
      value: 0.9613524039049266
      name: Coverage (CARD)
    - type: coverage
      value: 0.13320473120462967
      name: Coverage (FM)
    - type: coverage
      value: 0.7142857142857143
      name: Coverage (ITJ)
    - type: coverage
      value: 1.0
      name: Coverage (KOKOM)
    - type: coverage
      value: 0.9995274949083504
      name: Coverage (KON)
    - type: coverage
      value: 1.0
      name: Coverage (KOUI)
    - type: coverage
      value: 0.9858579967925354
      name: Coverage (KOUS)
    - type: coverage
      value: 0.0618080812117821
      name: Coverage (NE)
    - type: coverage
      value: 0.7440482047389456
      name: Coverage (NN)
    - type: coverage
      value: 0.9799275737196068
      name: Coverage (PDAT)
    - type: coverage
      value: 0.9995682832062167
      name: Coverage (PDS)
    - type: coverage
      value: 0.9879094306440976
      name: Coverage (PIAT)
    - type: coverage
      value: 1.0
      name: Coverage (PIDAT)
    - type: coverage
      value: 0.9951910051476565
      name: Coverage (PIS)
    - type: coverage
      value: 0.999888876541838
      name: Coverage (PPER)
    - type: coverage
      value: 1.0
      name: Coverage (PPOSAT)
    - type: coverage
      value: 1.0
      name: Coverage (PPOSS)
    - type: coverage
      value: 1.0
      name: Coverage (PRELAT)
    - type: coverage
      value: 1.0
      name: Coverage (PRELS)
    - type: coverage
      value: 1.0
      name: Coverage (PRF)
    - type: coverage
      value: 0.9861938278289117
      name: Coverage (PROAV)
    - type: coverage
      value: 0.3082133784928027
      name: Coverage (PTKA)
    - type: coverage
      value: 1.0
      name: Coverage (PTKANT)
    - type: coverage
      value: 1.0
      name: Coverage (PTKNEG)
    - type: coverage
      value: 0.7705097087378641
      name: Coverage (PTKVZ)
    - type: coverage
      value: 0.0
      name: Coverage (PTKZU)
    - type: coverage
      value: 0.9551166965888689
      name: Coverage (PWAT)
    - type: coverage
      value: 0.9937264742785445
      name: Coverage (PWAV)
    - type: coverage
      value: 0.9946524064171123
      name: Coverage (PWS)
    - type: coverage
      value: 1.0
      name: Coverage (VAFIN)
    - type: coverage
      value: 1.0
      name: Coverage (VAIMP)
    - type: coverage
      value: 1.0
      name: Coverage (VAINF)
    - type: coverage
      value: 1.0
      name: Coverage (VAPP)
    - type: coverage
      value: 1.0
      name: Coverage (VMFIN)
    - type: coverage
      value: 1.0
      name: Coverage (VMINF)
    - type: coverage
      value: 1.0
      name: Coverage (VMPP)
    - type: coverage
      value: 0.886487187323461
      name: Coverage (VVFIN)
    - type: coverage
      value: 0.9596122778675282
      name: Coverage (VVIMP)
    - type: coverage
      value: 0.8214535019002559
      name: Coverage (VVINF)
    - type: coverage
      value: 0.829683698296837
      name: Coverage (VVIZU)
    - type: coverage
      value: 0.7996866513473992
      name: Coverage (VVPP)
    - type: coverage
      value: 0.4148471615720524
      name: Coverage (XY)
---

# DWDSmor

_SFST/SMOR/DWDS-based German morphology_

DWDSmor implements the lemmatisation and morphological analysis of word forms as well as the generation of paradigms of lexical words in written German.
## Usage

DWDSmor is available via PyPI:

``` plaintext
pip install dwdsmor
```

For lemmatisation:

``` python-console
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"+V"}) == "testen"
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
```

…

## Development

This repository provides source code for building DWDSmor lexica and transducers as well as for using DWDSmor transducers for morphological analysis and paradigm generation:

* `dwdsmor/` contains Python packages for using DWDSmor, including scripts for morphological analysis and for paradigm generation by means of DWDSmor transducers.
* `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma format from XML sources of DWDS articles. Sample inputs and outputs can be found in `samples/`.
* `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are not part of this repository.
* `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means of the XSLT stylesheets in `share/` and the sample lexicon in `lexicon/sample/wb/`.
* `grammar/` contains an FST grammar derived from SMORLemma, providing the morphology for building DWDSmor automata from DWDSmor lexica.
* `test/` implements a test suite for the DWDSmor transducers.

DWDSmor is in active development. In its current stage, DWDSmor supports most inflection classes and some productive word-formation patterns of written German. Note that the sample lexicon in `lexicon/sample/wb/` covers only a small subset of the German vocabulary, and so do the DWDSmor automata compiled from it.

## Prerequisites

[GNU/Linux](https://www.debian.org/)
: Development, builds and tests of DWDSmor are performed on [Debian GNU/Linux](https://debian.org/). Other UNIX-like operating systems such as macOS should work too, but they are not actively supported.

[Python >= v3.9](https://www.python.org/)
: DWDSmor targets Python as its primary runtime environment. The DWDSmor transducers can be used via SFST's command-line tools, queried in Python applications via language-specific [bindings](https://github.com/gremid/sfst-transduce), or used by the Python scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for paradigm generation.

[Saxon-HE](https://www.saxonica.com/)
: The extraction of lexical entries from XML sources of DWDS articles is implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.

[Java (JDK) >= v8](https://openjdk.java.net/)
: Saxon requires a Java runtime.

[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/)
: A C++ library and toolbox for finite-state transducers (FSTs); see its homepage for installation and usage instructions.

On a Debian-based distribution, install the following packages:

```sh
apt install python3 default-jdk libsaxonhe-java sfst
```

Set up a virtual environment for project builds, for example via Python's `venv`:

```sh
python3 -m venv .venv
source .venv/bin/activate
```

Then run the DWDSmor setup routine in order to install Python dependencies:

```sh
pip install -e .[dev]
```

## Building DWDSmor lexica and transducers

For building DWDSmor lexica and transducers, run:

```sh
make all
```

Alternatively, you can run:

```sh
make dwds && make dwds-install && make dwdsmor
```

Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are not part of this repository.
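
Since these commands depend on DWDS sources that are not shipped with the repository, it can help to check for them before invoking `make`. The following is a minimal sketch, not part of DWDSmor: the helper names `has_dwds_sources` and `plan_build` are hypothetical, and it assumes that the sources count as present as soon as `lexicon/dwds/wb/` contains at least one file.

```python
from pathlib import Path

# Location of the DWDS sources relative to the repository root (from this README).
DWDS_SOURCE_DIR = Path("lexicon") / "dwds" / "wb"

# The two alternative command sequences named in this section.
FULL_BUILD = [["make", "all"]]
STEPWISE_BUILD = [["make", "dwds"], ["make", "dwds-install"], ["make", "dwdsmor"]]


def has_dwds_sources(repo_root):
    """Return True if the DWDS source directory exists and holds at least one file.

    Assumption: any regular file below the directory counts as a source;
    the README does not specify file names or a directory layout.
    """
    source_dir = Path(repo_root) / DWDS_SOURCE_DIR
    return source_dir.is_dir() and any(p.is_file() for p in source_dir.rglob("*"))


def plan_build(repo_root, stepwise=False):
    """Return the make commands for a build, or raise if the sources are missing."""
    if not has_dwds_sources(repo_root):
        raise FileNotFoundError(
            f"no DWDS sources found in {Path(repo_root) / DWDS_SOURCE_DIR}"
        )
    return STEPWISE_BUILD if stepwise else FULL_BUILD
```

The returned commands could then be passed one by one to `subprocess.run(cmd, check=True)` from the repository root.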

Alternatively, you can build sample DWDSmor lexica and transducers from the sample lexicon in `lexicon/sample/wb/` by running:

```sh
make sample && make sample-install && make dwdsmor
```

After building DWDSmor transducers, install them into `lib/`, where the Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default:

```sh
make install
```

The installed DWDSmor transducers are:

* `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation components, for lemmatisation and morphological analysis of word forms in terms of grammatical categories
* `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation components, for the generation of morphologically segmented word forms
* `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a finite word-formation component, for testing purposes
* `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022)
* `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only and with DWDS homographic lemma indices, for paradigm generation

## Testing DWDSmor

Run `pytest` to test basic transducer usage and to check for potential regressions.

## Contact

Feel free to contact [Andreas Nolda](mailto:andreas.nolda@bbaw.de) for questions regarding the lexicon or the grammar and [Gregor Middell](mailto:gregor.middell@bbaw.de) for questions related to the integration of DWDSmor into your corpus-annotation pipeline.

## License

Like the original SMOR and SMORLemma grammars, the DWDSmor grammar is licensed under the GNU General Public License v2.0. The same applies to the rest of this project.

## Credits

DWDSmor is based on the following software and datasets:

1. [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/), a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006)
2. [SMORLemma](https://github.com/rsennrich/SMORLemma) (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology ([SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/)) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component
3. the [DWDS dictionary](https://www.dwds.de/) (BBAW n.d.), replacing [IMSLex](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/imslex/) (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphological properties.

## Bibliography

* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. https://www.dwds.de
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart. [PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf).
* Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In *Headedness and/or Grammatical Anarchy?*, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376. [PDF](https://zenodo.org/record/7142720/files/336-FreywaldSimonMüller-2022-11.pdf).
* Schmid, Helmut (2006). A programming language for finite state transducers. In *Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005*, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 308–309. [PDF](https://www.cis.uni-muenchen.de/~schmid/papers/SFST-PL.pdf).
* Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In *LREC 2004: Fourth International Conference on Language Resources and Evaluation*, ed. by Maria T. Lino *et al.*, European Language Resources Association, 1263–1266. [PDF](http://www.lrec-conf.org/proceedings/lrec2004/pdf/468.pdf).
* Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In *LREC 2014: Ninth International Conference on Language Resources and Evaluation*, ed. by Nicoletta Calzolari *et al.*, European Language Resources Association, 1063–1067. [PDF](http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf).