---
language:
  - dutch
tags:
  - seq2seq
  - lm-head
datasets:
  - yhavinga/mc4_nl_cleaned
license: apache-2.0
inference: false
---

# t5-base-dutch

Created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by HuggingFace with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.

See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, which are based on this model.

5 Jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.

## Model

- Configuration based on google/t5-base (see the loading sketch below)
- 12 layers, 12 heads
- Dropout set to 0.1
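
A minimal sketch of loading the checkpoint with the transformers library. The repository id `yhavinga/t5-base-dutch` is assumed from this card's location; the pre-trained model has only seen the span-corruption objective, so it needs fine-tuning before downstream use.

```python
# Minimal sketch: load the checkpoint with Hugging Face transformers.
# The repository id "yhavinga/t5-base-dutch" is an assumption based on this card.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")

# Fill in a masked span with the T5 sentinel token <extra_id_0>;
# the raw pre-trained model is only useful for this kind of denoising.
inputs = tokenizer("Het weer is vandaag <extra_id_0>.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```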

## Dataset

This model was trained on the full configuration of cleaned Dutch mC4, which is the original mC4 with the following filters applied (illustrated in the sketch after this list):

- Documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents containing "javascript", "lorem ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed
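
A sketch of this filtering logic, for illustration only: the function names and structure are assumptions, the bad-words check is omitted, and the actual cleaning scripts behind mc4_nl_cleaned may differ.

```python
# Illustrative sketch of the sentence/document filters listed above.
# Structure and names are assumptions; the real cleaning scripts may differ.
BAD_PHRASES = [
    "javascript", "lorem ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]

def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if len(words) < 3:                     # fewer than 3 words
        return False
    if any(len(w) > 1000 for w in words):  # a "word" longer than 1000 characters
        return False
    return True

def keep_document(sentences: list[str]) -> bool:
    text = " ".join(sentences).lower()
    if any(phrase in text for phrase in BAD_PHRASES):  # boilerplate phrases
        return False
    kept = [s for s in sentences if keep_sentence(s)]
    return len(kept) >= 5                  # at least 5 sentences must remain
```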

## Tokenization

A SentencePiece tokenizer was trained from scratch on this dataset. The full dataset configuration comprises a total of 34B tokens.
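
A hedged sketch of how such a tokenizer can be trained with the sentencepiece library; the input file name and the vocabulary size of 32,000 are assumptions, not values taken from this card.

```python
# Sketch: train a SentencePiece tokenizer from scratch on the cleaned corpus.
# The corpus path and vocab size are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="mc4_nl_cleaned_full.txt",  # assumed: plain-text dump, one document per line
    model_prefix="t5-base-dutch-spm",
    vocab_size=32_000,                # assumed; T5 models commonly use ~32k tokens
    model_type="unigram",             # the SentencePiece model type T5 uses
    character_coverage=1.0,
)
```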

## Training

The model was trained on the full mc4_nl_cleaned dataset configuration for 1 epoch, consisting of 34B tokens, for 528,482 steps with a batch size of 128, which took 57 hours. A triangular learning rate schedule was used, with a peak learning rate of 0.005.
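
A triangular schedule is a linear warmup followed by a linear decay. The sketch below expresses it with optax (the community-week projects trained with Flax/JAX); the warmup fraction and the Adafactor optimizer are assumptions, not values from this card.

```python
# Sketch: triangular learning-rate schedule with peak 0.005 over 528,482 steps.
# The warmup fraction and the Adafactor optimizer are assumptions.
import optax

total_steps = 528_482
warmup_steps = total_steps // 10  # assumed 10% warmup
peak_lr = 0.005

warmup = optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                               transition_steps=warmup_steps)
decay = optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps)
lr_schedule = optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

optimizer = optax.adafactor(learning_rate=lr_schedule)
```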

## Evaluation

- Loss: 1.38
- Accuracy: 0.70