|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- mc4 |
|
- oscar-corpus/oscar |
|
language: |
|
- sk |
|
--- |
|
|
|
# Slovak T5 Base |
|
|
|
Monolingual Slovak model, trained from scratch on web data. |
|
|
|
This model has to be fine-tuned for a specific task; it does not support any instructions or prefixes yet. A minimal loading sketch is shown below the task list. |
|
|
|
After fine-tuning, it is suitable for tasks such as: |
|
|
|
- Question answering |
|
- Summarization |
|
- Generation of synthetic data |
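
A minimal sketch of loading the checkpoint for fine-tuning with the `transformers` library. The repository id and the Slovak example strings are placeholders, not part of this model card.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder repository id -- replace with the actual checkpoint path.
model_name = "path/to/slovak-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The checkpoint was pre-trained with span corruption only, so every
# downstream task needs supervised input/target pairs, for example:
inputs = tokenizer("Otázka: Kde sídli TUKE? Kontext: ...", return_tensors="pt")
targets = tokenizer("Košice", return_tensors="pt")
loss = model(**inputs, labels=targets.input_ids).loss
loss.backward()
```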
|
|
|
|
|
## Training data |
|
|
|
Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset with [NanoT5](https://github.com/PiotrNawrot/nanoT5) using default settings. |
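
For reference, the Slovak split of mC4 can be streamed with the `datasets` library. This is only a sketch: the `mc4` loading script may have been superseded by `allenai/c4`, which hosts the same multilingual data.

```python
from datasets import load_dataset

# Stream the Slovak portion of mC4 instead of downloading all shards.
# If the "mc4" script is unavailable, "allenai/c4" with config "sk"
# is the usual replacement.
mc4_sk = load_dataset("mc4", "sk", split="train", streaming=True)

for example in mc4_sk.take(3):
    print(example["text"][:80])
```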
|
|
|
The training corpus contains 14B tokens in total after deduplication. |
|
|
|
It consists of the Slovak data from: |
|
- mc4 |
|
- Oscar |
|
- Wikipedia |
|
- custom collection of newspaper articles |
|
- custom collection of web pages |
|
- Slovak part of the European Parliament Proceedings |
|
|
|
|
|
## Hyperparameters |
|
|
|
- Input length: 512 tokens |
|
- Effective Batch Size: 128 |
|
- Steps: 200000 |
|
- Optimizer: Adafactor |
|
- Scheduler: Legacy |
|
- Learning Rate: 0.2 |
|
- Gradient clip: 1 |
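
A rough sketch of how these settings could map onto an optimizer in PyTorch. It assumes that "Legacy" refers to the original T5 inverse square-root schedule and that the warm-up length (not stated here) is 10k steps; the actual NanoT5 run may differ in these details.

```python
import torch
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=0.2,                # peak learning rate from the list above
    relative_step=False,   # use the explicit schedule below instead
    scale_parameter=False,
    warmup_init=False,
    clip_threshold=1.0,    # assumed reading of "Gradient clip: 1"
)

# Assumed "legacy" T5 schedule: constant during warm-up, then 1/sqrt decay.
warmup_steps = 10_000  # hypothetical; the actual warm-up length is not stated
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: (warmup_steps / max(step, warmup_steps)) ** 0.5,
)
```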
|
|
|
## Evaluation |
|
|
|
After fine-tuning for question answering on SK-QUAD, it achieves: |
|
|
|
- Slovak T5 Base: 71.31 F1 |

- umT5 Base: 69.22 F1 |

- mT5 Base: 65.29 F1 |

- mT0 Base: 65.17 F1 |
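
The scores above are SQuAD-style token-overlap F1. A sketch of computing it with the `evaluate` library is shown below; the prediction/reference pair is illustrative and assumes SK-QUAD follows the SQuAD v1 answer format.

```python
import evaluate

squad_metric = evaluate.load("squad")

# Illustrative prediction/reference pair in the SQuAD v1 format.
predictions = [{"id": "q1", "prediction_text": "v Bratislave"}]
references = [{
    "id": "q1",
    "answers": {"text": ["v Bratislave"], "answer_start": [42]},
}]

scores = squad_metric.compute(predictions=predictions, references=references)
print(scores["f1"], scores["exact_match"])
```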
|
|
|
## Bias |
|
|
|
The model is published as is. We did not make any specific attempts to clean up the training data, so the model may reflect biases present in the source web text. |
|
|
|
## License |
|
|
|
Free for scientific and commercial use under the terms of the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. |
|
|
|
## Credits |
|
|
|
- Daniel Hládek @ KEMT FIE TUKE |