|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- mc4 |
|
- oscar-corpus/oscar |
|
language: |
|
- sk |
|
--- |
|
|
|
# Slovak T5 Base |
|
|
|
Monolingual Slovak model, trained from scratch on web data. |
|
|
|
This model has to be fine-tuned for a specific task; it does not support any instructions or prefixes yet. A minimal loading sketch is shown below the task list. |
|
|
|
After fine-tuning, it is suitable for tasks such as: |
|
|
|
- Question answering |
|
- Summarization |
|
- Generation of synthetic data |
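
A minimal sketch of loading the checkpoint for fine-tuning with the `transformers` library. The repository id and the Slovak example strings are placeholders, not part of this model card.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder repository id -- replace with the actual checkpoint path.
model_name = "path/to/slovak-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The checkpoint was pre-trained with span corruption only, so every
# downstream task needs supervised input/target pairs, for example:
inputs = tokenizer("Otázka: Kde sídli TUKE? Kontext: ...", return_tensors="pt")
targets = tokenizer("Košice", return_tensors="pt")
loss = model(**inputs, labels=targets.input_ids).loss
loss.backward()
```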
|
|
|
|
|
## Training data |
|
|
|
Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset with [NanoT5](https://github.com/PiotrNawrot/nanoT5) using default settings. |
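
For reference, the Slovak split of mC4 can be streamed with the `datasets` library. This is only a sketch: the `mc4` loading script may have been superseded by `allenai/c4`, which hosts the same multilingual data.

```python
from datasets import load_dataset

# Stream the Slovak portion of mC4 instead of downloading all shards.
# If the "mc4" script is unavailable, "allenai/c4" with config "sk"
# is the usual replacement.
mc4_sk = load_dataset("mc4", "sk", split="train", streaming=True)

for example in mc4_sk.take(3):
    print(example["text"][:80])
```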
|
|
|
The training corpus contains 14B tokens in total after deduplication. |
|
|
|
It consists of the Slovak data from: |
|
- mc4 |
|
- Oscar |
|
- Wikipedia |
|
- custom collection of newspaper articles |
|
- custom collection of web pages |
|
- Slovak part of the European Parliament Proceedings |
|
|
|
|
|
## Hyperparameters |
|
|
|
- Input length: 512 tokens |
|
- Effective Batch Size: 128 |
|
- Steps: 200000 |
|
- Optimizer: Adafactor |
|
- Scheduler: Legacy |
|
- Learning Rate: 0.2 |
|
- Gradient clip: 1 |
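
A rough sketch of how these settings could map onto an optimizer in PyTorch. It assumes that "Legacy" refers to the original T5 inverse square-root schedule and that the warm-up length (not stated here) is 10k steps; the actual NanoT5 run may differ in these details.

```python
import torch
from transformers.optimization import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=0.2,                # peak learning rate from the list above
    relative_step=False,   # use the explicit schedule below instead
    scale_parameter=False,
    warmup_init=False,
    clip_threshold=1.0,    # assumed reading of "Gradient clip: 1"
)

# Assumed "legacy" T5 schedule: constant during warm-up, then 1/sqrt decay.
warmup_steps = 10_000  # hypothetical; the actual warm-up length is not stated
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: (warmup_steps / max(step, warmup_steps)) ** 0.5,
)
```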
|
|
|
## Evaluation |
|
|
|
After fine-tuning for question answering on SK-QUAD, it achieves: |
|
|
|
- Slovak T5 Base: 71.31 F1 |

- umT5 Base: 69.22 F1 |

- mT5 Base: 65.29 F1 |

- mT0 Base: 65.17 F1 |
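
The scores above are SQuAD-style token-overlap F1. A sketch of computing it with the `evaluate` library is shown below; the prediction/reference pair is illustrative and assumes SK-QUAD follows the SQuAD v1 answer format.

```python
import evaluate

squad_metric = evaluate.load("squad")

# Illustrative prediction/reference pair in the SQuAD v1 format.
predictions = [{"id": "q1", "prediction_text": "v Bratislave"}]
references = [{
    "id": "q1",
    "answers": {"text": ["v Bratislave"], "answer_start": [42]},
}]

scores = squad_metric.compute(predictions=predictions, references=references)
print(scores["f1"], scores["exact_match"])
```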
|
|
|
## Bias |
|
|
|
The model is published as is. We did not make any specific attempts to clean up the training data, so the model may reflect biases present in the source web text. |
|
|
|
## License |
|
|
|
Free for scientific and commercial use under the terms of the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. |
|
|
|
## Credits |
|
|
|
- Daniel Hládek @ KEMT FIE TUKE |