---
license: cc-by-sa-4.0
datasets:
- mc4
- oscar-corpus/oscar
language:
- sk
---
# Slovak T5 Base
Monolingual Slovak model, trained from scratch on web data.
This model must be fine-tuned for a specific task; it does not yet support any instructions or prefixes.
After fine-tuning, it is suitable for tasks such as:
- Question answering
- Summarization
- Generation of synthetic data
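A minimal sketch of preparing the model for fine-tuning with the Hugging Face `transformers` library follows. The repository identifier in `MODEL_ID` is a placeholder, not the confirmed Hub path, and the `otázka:` / `kontext:` markers are an arbitrary input format chosen for illustration. Because the pretrained model knows no prefixes, any such format only becomes meaningful through fine-tuning.

```python
# Hypothetical Hub identifier; replace it with the real repository path.
MODEL_ID = "your-namespace/slovak-t5-base"


def build_qa_example(question: str, context: str, answer: str) -> dict:
    """Format one QA pair as plain source/target text.

    The "otázka:" / "kontext:" markers are an assumption of this sketch;
    the model card does not prescribe any input format.
    """
    source = f"otázka: {question} kontext: {context}"
    return {"source": source, "target": answer}


def load_for_finetuning(model_id: str = MODEL_ID):
    """Load the tokenizer and the seq2seq model.

    The import is local so that build_qa_example can be used
    without transformers installed.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def encode(tokenizer, example: dict, max_input_len: int = 512,
           max_target_len: int = 64) -> dict:
    """Tokenize one example; the labels are the tokenized target text."""
    features = tokenizer(example["source"], max_length=max_input_len,
                         truncation=True)
    labels = tokenizer(text_target=example["target"],
                       max_length=max_target_len, truncation=True)
    features["labels"] = labels["input_ids"]
    return features
```

The encoded features can then be passed to any standard sequence-to-sequence training loop. The 512-token input limit mirrors the pretraining input length; the 64-token target limit is an illustrative default.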
## Training data
Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset using [NanoT5](https://github.com/PiotrNawrot/nanoT5) with default settings.
The training corpus contains 14B tokens in total after deduplication.
It consists of Slovak data from:
- mc4
- Oscar
- Wikipedia
- custom collection of newspaper articles
- custom collection of web pages
- Slovak part of the European Parliament Proceedings
## Hyperparameters
- Input length: 512 tokens
- Effective Batch Size: 128
- Steps: 200000
- Optimizer: Adafactor
- Scheduler: Legacy
- Learning Rate: 0.2
- Gradient clip: 1
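As a rough sanity check, the settings above can be combined into an estimate of how many input positions the model processed during pretraining. This is only back-of-the-envelope arithmetic: it ignores padding and the span-corruption objective, and it assumes the 14B corpus figure was counted with the same tokenizer.

```python
# Values taken from the hyperparameter list and the training data section.
INPUT_LENGTH = 512               # tokens per input sequence
EFFECTIVE_BATCH_SIZE = 128       # sequences per optimizer step
STEPS = 200_000                  # optimizer steps
CORPUS_TOKENS = 14_000_000_000   # "14B tokens" after deduplication

# Encoder input positions processed per step and over the whole run.
positions_per_step = INPUT_LENGTH * EFFECTIVE_BATCH_SIZE
total_positions = positions_per_step * STEPS

# Fraction of the corpus covered, under the assumptions stated above.
approx_passes = total_positions / CORPUS_TOKENS

print(f"positions per step: {positions_per_step:,}")
print(f"total positions:    {total_positions:,}")
print(f"approximate passes: {approx_passes:.2f}")
```

Under these assumptions the run covers about 13.1B input positions, which is on the order of a single pass over the corpus.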
## Evaluation
After fine-tuning for question answering on SK-QUAD, the models reach the following scores:
- Slovak T5 Base: 71.31 F1
- umT5 Base: 69.22 F1
- mT5 Base: 65.29 F1
- mT0 Base: 65.17 F1
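The scores above are F1 values. The exact evaluation script is not stated here, so the following is only a sketch of the token-overlap F1 commonly used for SQuAD-style question answering. It assumes whitespace tokenization and lowercasing, and it omits the punctuation and article normalization that official scripts usually apply.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, the score is 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of three predicted tokens match a two-token reference.
    print(token_f1("v roku 1993", "roku 1993"))
```

A dataset-level score is typically the mean of this value over all questions, taking the maximum over reference answers when a question has several.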
## Bias
The model is published as is. We made no specific attempt to clean up the training data.
## License
Free for scientific and commercial use under the terms of: cc-by-sa-4.0
## Credits
- Daniel Hládek @ KEMT FIE TUKE