---
language:
  - dutch
tags:
  - seq2seq
  - text-generation
datasets:
  - mc4
---

# t5-base-dutch

Created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.

See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, both of which are based on this model.

## Dataset

This model was trained on a cleaned version of the Dutch part of mC4. See the clean directory for the cleaning script. The following filters were applied:

- Documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents containing any of the phrases "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed
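The sentence- and document-level filters above can be sketched as plain Python predicates. This is a minimal illustration, not the project's actual cleaning script: the function names are hypothetical, and the badwords filter is omitted for brevity.

```python
# Phrases whose presence disqualifies a whole document (from the list above).
BAD_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]

def keep_sentence(sentence: str) -> bool:
    """Drop sentences with fewer than 3 words or any word over 1000 characters."""
    words = sentence.split()
    return len(words) >= 3 and all(len(w) <= 1000 for w in words)

def keep_document(sentences: list[str]) -> bool:
    """Drop documents with fewer than 5 kept sentences or a blocked phrase.

    The Dirty/Naughty/Obscene badwords filter from the card is omitted here.
    """
    kept = [s for s in sentences if keep_sentence(s)]
    text = " ".join(kept).lower()
    return len(kept) >= 5 and not any(p in text for p in BAD_PHRASES)
```

Running documents through `keep_document` before training reproduces the length- and phrase-based rules; the real pipeline additionally applies the badwords list.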

## Training

The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an evaluation accuracy of 0.64.
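For a rough sense of scale, the step count and batch size imply the number of sequences seen during pre-training. The sequence length is an assumption (512, the common T5 default) and is not stated in this card:

```python
steps = 63_000
batch_size = 128
assumed_seq_len = 512  # assumption: typical T5 input length, not stated in the card

examples_seen = steps * batch_size             # 8,064,000 sequences
tokens_seen = examples_seen * assumed_seq_len  # ~4.1e9 tokens under this assumption
print(examples_seen, tokens_seen)
```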