hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

Dutch (Delpher Corpus)

More details can be found in our GitHub repository.

Pretraining

We use the official JAX/FLAX example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.

This model was trained with mean_noise_span_length=20.
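To give a feel for what this setting changes, here is a minimal sketch of the span-count arithmetic commonly used in T5-style span corruption. The function name `span_stats`, the noise density of 0.15, and the sequence length of 1024 bytes are illustrative assumptions, not values confirmed for this model.

```python
def span_stats(length, noise_density, mean_noise_span_length):
    """Return (num_noise_tokens, num_noise_spans) for one input sequence.

    Sketch of the arithmetic in T5-style span corruption; the actual
    pretraining script may differ in details.
    """
    # Number of tokens (bytes, for ByT5) that get masked out.
    num_noise_tokens = int(round(length * noise_density))
    # Keep at least one noise token and at least one non-noise token.
    num_noise_tokens = min(max(num_noise_tokens, 1), length - 1)
    # Longer mean spans mean fewer, larger masked regions.
    num_noise_spans = int(round(num_noise_tokens / mean_noise_span_length))
    num_noise_spans = max(num_noise_spans, 1)
    return num_noise_tokens, num_noise_spans


# Assumed values for illustration only: 1024 input bytes, 15% noise density.
for span_length in (3, 20):
    tokens, spans = span_stats(1024, 0.15, span_length)
    print(f"mean_noise_span_length={span_length}: "
          f"{tokens} masked bytes in {spans} spans")
```

Under these assumptions the same number of bytes is masked in both cases, but a mean span length of 20 groups them into far fewer spans than a mean span length of 3.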

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 model on the ICDAR Europeana dataset:

Configuration	Run 1	Run 2	Run 3	Run 4	Run 5	Avg.
`wsFalse-bs4-e10-lr0.00015-poolingfirst`	86.61	85.88	87.65	87.93	88.01	87.22 ± 0.83
`wsFalse-bs8-e10-lr0.00015-poolingfirst`	87.88	87.56	85.62	86.52	87.03	86.92 ± 0.80
`wsFalse-bs4-e10-lr0.00016-poolingfirst`	86.17	85.87	87.77	86.58	87.96	86.87 ± 0.85
`wsFalse-bs8-e10-lr0.00016-poolingfirst`	87.67	86.02	85.66	87.00	85.99	86.47 ± 0.75

These results show no improvement over the model trained with mean_noise_span_length=3, which achieved 87.90 ± 0.71. The best configuration here reaches 87.22 ± 0.83.
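The Avg. column in the table above appears consistent with the mean and the population standard deviation (not the sample standard deviation) over the five runs. The sketch below recomputes it from the per-run scores; the helper name `summarize` is hypothetical and this is not necessarily the script that produced the table.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Format scores as 'mean ± population std', rounded to two decimals."""
    return f"{mean(scores):.2f} ± {pstdev(scores):.2f}"


# Per-run scores copied from the table above.
runs = {
    "wsFalse-bs4-e10-lr0.00015-poolingfirst": [86.61, 85.88, 87.65, 87.93, 88.01],
    "wsFalse-bs8-e10-lr0.00015-poolingfirst": [87.88, 87.56, 85.62, 86.52, 87.03],
    "wsFalse-bs4-e10-lr0.00016-poolingfirst": [86.17, 85.87, 87.77, 86.58, 87.96],
    "wsFalse-bs8-e10-lr0.00016-poolingfirst": [87.67, 86.02, 85.66, 87.00, 85.99],
}

for config, scores in runs.items():
    print(config, summarize(scores))
```

With only five runs per configuration, the spread is of the same order as the gaps between configurations, so the ranking among them should be read with caution.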

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️