hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

English (British Library Corpus - Books)

More details can be found in our GitHub repository.

Pretraining

We use the official JAX/Flax example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.
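For readers new to ByT5, the following is a minimal sketch of the byte-level input representation the model is pretrained on. It assumes the id convention of the original ByT5 tokenizer (ids 0, 1 and 2 reserved for padding, end-of-sequence and unknown, so every UTF-8 byte value is shifted by 3). The function names are illustrative and not part of any library.

```python
# Sketch of ByT5-style byte-level encoding (assumption: 3 reserved ids,
# pad=0, eos=1, unk=2, so each UTF-8 byte b maps to id b + 3).
SPECIAL_OFFSET = 3
EOS_ID = 1


def byte_encode(text: str) -> list[int]:
    """Map text to byte ids and append the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode, skipping the reserved special ids."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if SPECIAL_OFFSET <= i < 256 + SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # The long s (U+017F) is common in historic English print and takes
    # two bytes in UTF-8, so it becomes two ids.
    print(byte_encode("ſ"))               # [200, 194, 1]
    print(byte_decode(byte_encode("Wiſdom")))  # Wiſdom
```

Because the vocabulary is just bytes, characters that are rare in modern text need no special vocabulary entries, at the cost of longer input sequences.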

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 Base model on the English AjMC dataset:

Configuration	Run 1	Run 2	Run 3	Run 4	Run 5	Avg.
`wsFalse-bs4-e10-lr0.00015-poolingfirst`	86.78	87.46	85.75	88.41	86.60	87.00 ± 0.89
`wsFalse-bs8-e10-lr0.00016-poolingfirst`	86.79	86.29	86.67	87.14	85.82	86.54 ± 0.45
`wsFalse-bs4-e10-lr0.00016-poolingfirst`	87.04	87.34	86.63	84.09	87.04	86.43 ± 1.19
`wsFalse-bs8-e10-lr0.00015-poolingfirst`	86.87	86.43	86.88	85.15	85.25	86.12 ± 0.77
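The configuration names in the table pack several hyperparameters into one string. The sketch below splits such a name into fields, assuming that `bs` is the batch size, `e` the number of epochs, `lr` the learning rate and `pooling` the subtoken pooling strategy. These readings are assumptions based on the naming pattern; the meaning of the `ws` flag is not stated here, so it is kept under its original name. `RunConfig` and `parse_config` are hypothetical helpers.

```python
import re
from dataclasses import dataclass


@dataclass
class RunConfig:
    ws: bool              # flag of unstated meaning, kept as-is
    batch_size: int       # assumed meaning of "bs"
    epochs: int           # assumed meaning of "e"
    learning_rate: float  # assumed meaning of "lr"
    pooling: str          # assumed: subtoken pooling strategy


# Pattern derived from the four configuration names in the table.
_CONFIG_RE = re.compile(
    r"^ws(True|False)-bs(\d+)-e(\d+)-lr([0-9.]+)-pooling(\w+)$"
)


def parse_config(name: str) -> RunConfig:
    """Parse a configuration name such as the ones in the table."""
    match = _CONFIG_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {name!r}")
    ws, bs, epochs, lr, pooling = match.groups()
    return RunConfig(
        ws=(ws == "True"),
        batch_size=int(bs),
        epochs=int(epochs),
        learning_rate=float(lr),
        pooling=pooling,
    )


if __name__ == "__main__":
    print(parse_config("wsFalse-bs4-e10-lr0.00015-poolingfirst"))
```

Read this way, the four rows form a 2 × 2 grid over batch size (4, 8) and learning rate (0.00015, 0.00016), with the other settings held fixed.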

The ByT5 Small model achieves 85.65 ± 1.21 on this dataset.
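The Avg. column can be reproduced from the five per-run scores. The sketch below recomputes it with the Python standard library. The reported spreads are consistent with the population standard deviation (dividing by the number of runs rather than by one less), which is an inference from the numbers and not something stated above. The `summarize` helper is illustrative.

```python
from statistics import mean, pstdev

# Per-run scores copied from the table above.
RUNS = {
    "wsFalse-bs4-e10-lr0.00015-poolingfirst": [86.78, 87.46, 85.75, 88.41, 86.60],
    "wsFalse-bs8-e10-lr0.00016-poolingfirst": [86.79, 86.29, 86.67, 87.14, 85.82],
    "wsFalse-bs4-e10-lr0.00016-poolingfirst": [87.04, 87.34, 86.63, 84.09, 87.04],
    "wsFalse-bs8-e10-lr0.00015-poolingfirst": [86.87, 86.43, 86.88, 85.15, 85.25],
}


def summarize(scores: list[float]) -> str:
    """Format scores as 'mean ± population standard deviation'."""
    return f"{mean(scores):.2f} ± {pstdev(scores):.2f}"


if __name__ == "__main__":
    for name, scores in RUNS.items():
        print(f"{name}\t{summarize(scores)}")
    # First row prints: 87.00 ± 0.89
```

Using the sample standard deviation (`statistics.stdev`) instead would give larger spreads, for example about 1.00 for the first row.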

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️