metadata

language:
  - dutch
tags:
  - seq2seq
  - lm-head
datasets:
  - yhavinga/mc4_nl_cleaned
license: apache-2.0
inference: false

t5-base-dutch

Created by Yeb Havinga and Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.

See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, both of which are based on this model.

5 Jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.

Model

Configuration based on google/t5-base
12 layers, 12 heads
Dropout set to 0.1

Dataset

This model was trained on the full configuration of cleaned Dutch mC4, which is the original mC4 with the following filters applied:

Documents that contain words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
Sentences with fewer than 3 words are removed
Sentences with a word of more than 1000 characters are removed
Documents with fewer than 5 sentences are removed
Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
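The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the cleaning script that was actually used: the sentence splitter, the case-insensitive matching, the order in which the rules are applied, and the placeholder `BAD_WORDS` set are all assumptions made for the example.

```python
import re

# Phrases that cause a whole document to be dropped (copied from the list above).
BLOCKED_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]

# Hypothetical placeholder: the real filter uses a selection of the
# Dutch and English bad-word lists, which is not reproduced here.
BAD_WORDS = {"badword"}

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOCUMENT = 5


def split_sentences(text):
    # Assumption: a naive split after ., ! or ? and on newlines.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def clean_document(text):
    """Return the cleaned text, or None if the document should be removed."""
    lowered = text.lower()
    # Document-level phrase filter (assumed to be case-insensitive).
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return None
    # Document-level bad-word filter on whole words.
    if set(re.findall(r"\w+", lowered)) & BAD_WORDS:
        return None
    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue  # sentence too short
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue  # sentence contains an implausibly long word
        kept.append(sentence)
    # Assumption: the sentence count is checked after sentence-level filtering.
    if len(kept) < MIN_SENTENCES_PER_DOCUMENT:
        return None
    return " ".join(kept)
```

A document that passes all checks comes back with its short sentences stripped; any other document yields `None` and would be skipped.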

Tokenization

A SentencePiece tokenizer was trained from scratch on this dataset. The full configuration contains 34B tokens in total.

Training

The model was trained for 1 epoch on the full mc4_nl_cleaned dataset configuration (34B tokens), which amounted to 528,482 steps at a batch size of 128 and took 57 hours. A triangular learning rate schedule was used, with a peak learning rate of 0.005.
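To make the schedule and the training figures concrete, the sketch below implements a generic triangular schedule (linear warm-up to the peak, then linear decay to zero) and derives some rough throughput numbers from the values stated above. The warm-up length `WARMUP_STEPS` is not given in this card and is purely an assumed value, and the derived figures are only approximate because the 34B token count is rounded.

```python
TOTAL_STEPS = 528_482    # from the training description
PEAK_LR = 0.005          # from the training description
BATCH_SIZE = 128         # from the training description
TOTAL_TOKENS = 34e9      # rounded figure from the training description
TRAINING_HOURS = 57      # from the training description
WARMUP_STEPS = 10_000    # assumption: the card does not state the warm-up length


def triangle_lr(step, total_steps=TOTAL_STEPS,
                warmup_steps=WARMUP_STEPS, peak_lr=PEAK_LR):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# Rough derived quantities, for orientation only.
sequences_seen = TOTAL_STEPS * BATCH_SIZE
tokens_per_sequence = TOTAL_TOKENS / sequences_seen
steps_per_second = TOTAL_STEPS / (TRAINING_HOURS * 3600)

if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, triangle_lr(step))
    print(f"sequences seen:      {sequences_seen:,}")
    print(f"tokens per sequence: {tokens_per_sequence:.0f}")
    print(f"steps per second:    {steps_per_second:.2f}")
```

Under these numbers each training sequence averages roughly 500 tokens and the run processed a little over 2.5 steps per second.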

Evaluation

Loss: 1.38
Accuracy: 0.70
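For readers who prefer perplexity, the reported loss can be converted as shown below. This assumes the loss is a mean per-token cross-entropy measured in nats, which the card does not state explicitly.

```python
import math

eval_loss = 1.38  # reported evaluation loss

# Assumption: the loss is mean per-token cross-entropy in nats,
# so perplexity is simply its exponential.
perplexity = math.exp(eval_loss)

print(f"perplexity: {perplexity:.2f}")  # prints "perplexity: 3.97"
```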