German Wikipedia LMs

AI & ML interests

language modeling

German Wikipedia LMs (GWLMs)

We present Language Models (BERT, BERT with Token Dropping, TEAMS, T5) pretrained on German Wikipedia.

This is an ongoing project!

German Wikipedia Corpus

We use a recent Wikipedia dump, which can be accessed here. Additionally, a sentence-segmented version (split using NLTK) is available here.
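As a rough illustration of what sentence segmentation with NLTK can look like, here is a minimal sketch. It is not the script used to build the corpus: the helper names `nltk_german_splitter` and `segment_articles` are hypothetical, and the output layout (one sentence per line, an empty line between articles) is an assumption borrowed from common BERT-style pretraining setups.

```python
from typing import Callable, Iterable, Iterator, List


def nltk_german_splitter() -> Callable[[str], List[str]]:
    """Return a sentence splitter backed by NLTK's Punkt model for German.

    Requires the `nltk` package and its Punkt data, which has to be
    downloaded once with `nltk.download(...)`; the resource name depends
    on the installed NLTK version.
    """
    import nltk  # imported lazily so the rest of the sketch has no hard dependency

    return lambda text: nltk.sent_tokenize(text, language="german")


def segment_articles(
    articles: Iterable[str],
    split: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield one sentence per line, followed by an empty line per article.

    The blank-line document boundary is an assumption, not a documented
    property of the published corpus.
    """
    for article in articles:
        for sentence in split(article):
            sentence = sentence.strip()
            if sentence:  # skip empty fragments
                yield sentence
        yield ""  # marks the end of the article


if __name__ == "__main__":
    sample = ["Berlin ist die Hauptstadt von Deutschland. Sie liegt an der Spree."]
    for line in segment_articles(sample, nltk_german_splitter()):
        print(line)
```

Passing the splitter in as a callable keeps the segmentation loop independent of NLTK, so another sentence splitter could be swapped in without touching the rest of the code.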

Fine-tuned Models

We fine-tuned NER models using the SpanMarker library on the GermEval 2014 NER dataset and uploaded the best models.

Acknowledgements

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️

models 13

datasets 8

gwlms/dewiki-20230701-flair-corpus

Viewer • Updated Jun 10, 2024 • 45.6M • 22

gwlms/validation

Viewer • Updated Jan 5, 2024 • 15.6k • 16

gwlms/biofid

Updated Aug 23, 2023 • 3

gwlms/germeval2018

Updated Jul 26, 2023 • 6

gwlms/dewiki-20230701-chunks

Updated Jul 19, 2023 • 231

gwlms/dewiki-20230701-tfrecords-dupe5

Updated Jul 19, 2023 • 211

gwlms/dewiki-20230701-nltk-corpus

Viewer • Updated Jul 19, 2023 • 61.6M • 10

gwlms/dewiki-20230701

Viewer • Updated Jul 19, 2023 • 2.73M • 99