readme: add initial version
- README.md +117 -0
- delpher-corpus.urls +23 -0
- figures/delpher_corpus_stats.png +0 -0
- figures/training_loss.png +0 -0
README.md
ADDED
@@ -0,0 +1,117 @@
# Language Model for Historic Dutch

In this repository we open source a language model for Historic Dutch, trained on the
[Delpher Corpus](https://www.delpher.nl/over-delpher/delpher-open-krantenarchief/download-teksten-kranten-1618-1879),
which includes digitized texts from Dutch newspapers ranging from 1618 to 1879.
6 |
+
|
7 |
+
# Changelog
|
8 |
+
|
9 |
+
* 13.12.2021: Initial version of this repository.

# Model Zoo

The following models for Historic Dutch are available on the Hugging Face Model Hub:

| Model identifier                       | Model Hub link
| -------------------------------------- | -------------------------------------------------------------------
| `dbmdz/bert-base-historic-dutch-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-dutch-cased)
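
As a quick sanity check, the cased model can be loaded with the [Transformers](https://github.com/huggingface/transformers) library, e.g. via the fill-mask pipeline. This is only a minimal sketch; the example sentence and installed packages are illustrative:

```bash
pip3 install transformers torch

python3 -c '
from transformers import pipeline

# Load the cased Historic Dutch model from the Model Hub.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-historic-dutch-cased")

# Illustrative example sentence (not taken from the training corpus).
for prediction in fill_mask("Amsterdam is de [MASK] van Nederland."):
    print(prediction["token_str"], prediction["score"])
'
```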
18 |
+
|
19 |
+
# Stats
|
20 |
+
|
21 |
+
The download urls for all archives can be found [here](delpher-corpus.urls).
|
22 |
+
|
23 |
+
We then used the awesome `alto-tools` from [this](https://github.com/cneud/alto-tools)
|
24 |
+
repository to extract plain text. The following table shows the size overview per year range:
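
The following is a minimal sketch of these two steps; the directory layout and the exact `alto-tools` invocation are assumptions, not necessarily the commands we used:

```bash
mkdir -p archives extracted

# Download all Delpher archives listed in delpher-corpus.urls.
wget --content-disposition --input-file delpher-corpus.urls --directory-prefix archives/

# Unpack the downloaded archives containing the ALTO XML files.
for archive in archives/*.zip; do
    unzip -q "$archive" -d extracted/
done

# Extract plain text from every ALTO XML file with alto-tools
# (https://github.com/cneud/alto-tools) and concatenate it into one file.
find extracted/ -name "*.xml" -exec alto-tools -t {} \; > delpher-corpus.txt
```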

The following table shows the size overview per year range:

| Period    | Extracted plain text size
| --------- | -------------------------:
| 1618-1699 | 170MB
| 1700-1709 | 103MB
| 1710-1719 | 65MB
| 1720-1729 | 137MB
| 1730-1739 | 144MB
| 1740-1749 | 188MB
| 1750-1759 | 171MB
| 1760-1769 | 235MB
| 1770-1779 | 271MB
| 1780-1789 | 414MB
| 1790-1799 | 614MB
| 1800-1809 | 734MB
| 1810-1819 | 807MB
| 1820-1829 | 987MB
| 1830-1839 | 1.7GB
| 1840-1849 | 2.2GB
| 1850-1854 | 1.3GB
| 1855-1859 | 1.7GB
| 1860-1864 | 2.0GB
| 1865-1869 | 2.3GB
| 1870-1874 | 1.9GB
| 1875-1876 | 867MB
| 1877-1879 | 1.9GB

The total training corpus consists of 427,181,269 sentences and 3,509,581,683 tokens (counted via `wc`),
resulting in a total corpus size of 21GB.
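
Counts like these can be reproduced with `wc`, assuming one sentence per line in a concatenated corpus file (the file name below is a placeholder):

```bash
wc -l delpher-corpus.txt    # number of sentences (assuming one sentence per line)
wc -w delpher-corpus.txt    # number of whitespace-separated tokens
du -h delpher-corpus.txt    # total corpus size
```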

The following figure shows the distribution of the number of characters per year:

![Delpher Corpus Stats](/figures/delpher_corpus_stats.png)

# Language Model Pretraining

We use the official [BERT](https://github.com/google-research/bert) implementation to pretrain the model.
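
Before pretraining, the plain text corpus needs to be converted into TFRecords. The BERT repository ships a `create_pretraining_data.py` script for this; the following is only a sketch in which input/output paths and the vocab file are placeholders, not the exact command we ran:

```bash
# Sketch only: paths and vocab file are placeholders; sequence length and
# masked prediction settings match the pretraining command below.
python3 create_pretraining_data.py \
  --input_file="./corpus/delpher-*.txt" \
  --output_file=gs://delpher-bert/tfrecords/delpher.tfrecord \
  --vocab_file=./vocab.txt \
  --do_lower_case=False \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5 \
  --random_seed=12345
```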

The following command is then used to train the model:

```bash
python3 run_pretraining.py --input_file gs://delpher-bert/tfrecords/*.tfrecord \
  --output_dir gs://delpher-bert/bert-base-historic-dutch-cased \
  --bert_config_file ./config.json \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --do_train=True \
  --train_batch_size=128 \
  --num_train_steps=3000000 \
  --learning_rate=1e-4 \
  --save_checkpoints_steps=100000 \
  --keep_checkpoint_max=20 \
  --use_tpu=True \
  --tpu_name=electra-2 \
  --num_tpu_cores=32
```

We train the model for 3M steps using a total batch size of 128 on a v3-32 TPU. The pretraining loss curve can be seen
in the next figure:

![Delpher Pretraining Loss Curve](/figures/training_loss.png)

# Evaluation

We evaluate our model on the preprocessed Europeana NER dataset for Dutch, which was presented in the
["Data Centric Domain Adaptation for Historical Text with OCR Errors"](https://github.com/stefan-it/historic-domain-adaptation-icdar) paper.

The data is available in their repository. We perform a hyper-parameter search over:

* Batch sizes: `[4, 8]`
* Learning rates: `[3e-5, 5e-5]`
* Number of epochs: `[5, 10]`

and report the averaged F1-Score over 5 runs with different seeds. We also include [hmBERT](https://github.com/stefan-it/clef-hipe/blob/main/hlms.md) as a baseline model.
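
As an illustration of this grid search, the following sketch fine-tunes the model with the `run_ner.py` token-classification example script from the Transformers library; the script choice, data paths and output layout are assumptions, not our exact setup:

```bash
# Sketch only: run_ner.py is the token-classification example from the
# Transformers library; the data paths below are placeholders for the
# preprocessed Europeana NER data.
for bs in 4 8; do
  for lr in 3e-5 5e-5; do
    for epochs in 5 10; do
      for seed in 1 2 3 4 5; do
        python3 run_ner.py \
          --model_name_or_path dbmdz/bert-base-historic-dutch-cased \
          --train_file data/europeana-nl/train.json \
          --validation_file data/europeana-nl/dev.json \
          --test_file data/europeana-nl/test.json \
          --per_device_train_batch_size ${bs} \
          --learning_rate ${lr} \
          --num_train_epochs ${epochs} \
          --seed ${seed} \
          --output_dir runs/bs${bs}-lr${lr}-e${epochs}-seed${seed} \
          --do_train --do_eval --do_predict
      done
    done
  done
done
```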

Results:

| Model               | F1-Score (Dev / Test)
| ------------------- | ---------------------
| hmBERT              | (82.73) / 81.34
| Maerz et al. (2021) | - / 84.2
| Ours                | (89.73) / 87.45

# License

All models are licensed under [MIT](LICENSE).

# Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗
delpher-corpus.urls
ADDED
@@ -0,0 +1,23 @@
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_180x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_181x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_174x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1875-6.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_175x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_16xx.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_170x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1870-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_183x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_173x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_172x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1855-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_184x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_176x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1865-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1860-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_182x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1877-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_177x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1850-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_178x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_171x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_179x.zip
figures/delpher_corpus_stats.png
ADDED
figures/training_loss.png
ADDED