	---
	language:
	- uk
	tags:
	- text2text-generation
	- flair
	library_name: generic
	license: mit
	metrics:
	- perplexity
	datasets:
	- ubertext2.0
	widget:
	- text: "Росія зазнає поразки"
	- text: "Достеменно відомо, що Україна перемагає"
	---

	# Ukrainian flair embeddings (forward, large)

	Trained for 10 epochs on texts from UberText 2.0 and a corpus of scraped Ukrainian texts from Stefan Schweter (54 GB in total).

	This is the forward version of the embeddings. The backward version is available [here](https://huggingface.co/lang-uk/flair-uk-backward-large/).
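Forward and backward Flair models are usually concatenated per token. The sketch below shows one way to do that with `FlairEmbeddings` and `StackedEmbeddings`, assuming `flair` is installed and both weight files have already been downloaded. The function names and the local paths are hypothetical; this card does not state the name of the weights file in the repository. The dimension helper also assumes that the backward model uses the same `hidden_size` (2048) and that neither model has an output projection.

```python
def stacked_embedding_dim(hidden_size: int = 2048, n_models: int = 2) -> int:
    """Expected per-token vector size when concatenating several Flair LMs.

    Assumption: every model is a single-layer LM without an output
    projection, so each one contributes `hidden_size` dimensions.
    """
    return hidden_size * n_models


def embed_tokens(text: str, forward_path: str, backward_path: str):
    """Return (token text, embedding tensor) pairs for `text`.

    `forward_path` and `backward_path` are local paths to the downloaded
    model weights (hypothetical names, pass your own).
    """
    # Imported lazily so the helper above works without flair installed.
    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings

    stacked = StackedEmbeddings(
        [FlairEmbeddings(forward_path), FlairEmbeddings(backward_path)]
    )
    sentence = Sentence(text)
    stacked.embed(sentence)
    return [(token.text, token.embedding) for token in sentence]
```

For example, `embed_tokens("Достеменно відомо, що Україна перемагає", "forward.pt", "backward.pt")` would return one pair per token, with `forward.pt` and `backward.pt` standing in for wherever the two weight files were saved.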

	The character dictionary used for training is in the `flair_dictionary.pkl` file.
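To inspect the dictionary without loading the model, a minimal sketch follows. It assumes the file uses the layout produced by the Flair language-model tutorial: a pickled mapping with `idx2item` (a list of UTF-8 byte strings) and `item2idx`. The function name is hypothetical, and the layout of this particular file is not confirmed by the card. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path


def load_char_dictionary(path):
    """Return the characters stored in a Flair-style dictionary pickle."""
    with Path(path).open("rb") as fh:
        # Assumed layout: {"idx2item": [...], "item2idx": {...}}
        mappings = pickle.load(fh, encoding="latin1")
    chars = []
    for item in mappings["idx2item"]:
        # Items are assumed to be UTF-8 bytes; plain strings pass through.
        chars.append(item.decode("utf-8") if isinstance(item, bytes) else item)
    return chars
```

Calling `load_char_dictionary("flair_dictionary.pkl")` on a file with that layout returns the characters in index order, which makes it easy to check whether a given symbol is covered by the vocabulary.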

	The model params are:
	```python
	is_forward_lm=True,
	hidden_size=2048,
	sequence_length=250,
	mini_batch_size=1024,
	max_epochs=30
	```
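The parameters above match the argument names used by Flair's language-model training API. The sketch below shows how they might be wired together; it is not the author's training script. `FlairLMConfig`, `train`, and all paths are hypothetical, and `nlayers=1` and `character_level=True` are assumptions taken from the Flair tutorial, not from this card. Note that `max_epochs` is an upper bound on training, while the card reports 10 epochs of training.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FlairLMConfig:
    """Values copied from the parameter list in this card."""
    is_forward_lm: bool = True
    hidden_size: int = 2048
    sequence_length: int = 250
    mini_batch_size: int = 1024
    max_epochs: int = 30

    def model_kwargs(self):
        # Arguments consumed by the LanguageModel constructor.
        return {"is_forward_lm": self.is_forward_lm, "hidden_size": self.hidden_size}

    def train_kwargs(self):
        # Arguments consumed by LanguageModelTrainer.train().
        return {
            "sequence_length": self.sequence_length,
            "mini_batch_size": self.mini_batch_size,
            "max_epochs": self.max_epochs,
        }


def train(config, corpus_dir, dictionary_path, output_dir):
    """Train a character-level Flair LM (requires `pip install flair`)."""
    # Imported lazily so the config class is usable without flair installed.
    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

    dictionary = Dictionary.load_from_file(dictionary_path)
    corpus = TextCorpus(
        Path(corpus_dir), dictionary, config.is_forward_lm, character_level=True
    )
    # nlayers=1 is an assumption; the card does not state the layer count.
    model = LanguageModel(dictionary, nlayers=1, **config.model_kwargs())
    LanguageModelTrainer(model, corpus).train(output_dir, **config.train_kwargs())
```

Splitting the values into `model_kwargs()` and `train_kwargs()` mirrors where Flair consumes them: direction and hidden size define the network, while sequence length, batch size, and epoch limit belong to the training loop.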

	For smaller flair embeddings for the Ukrainian language, see [uk-forward](https://huggingface.co/lang-uk/flair-uk-forward).

	For more information on flair embeddings, see [the article](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) or the paper below:

	```bibtex
	@inproceedings{akbik2018coling,
	title={Contextual String Embeddings for Sequence Labeling},
	author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
	booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
	pages = {1638--1649},
	year = {2018}
	}
	```

	For more information on UberText 2.0 please see:
	```bibtex
	@inproceedings{chaplynskyi-2023-introducing,
	title = "Introducing {U}ber{T}ext 2.0: A Corpus of {M}odern {U}krainian at Scale",
	author = "Chaplynskyi, Dmytro",
	booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
	month = may,
	year = "2023",
	address = "Dubrovnik, Croatia",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2023.unlp-1.1",
	pages = "1--10",
	abstract = "This paper addresses the need for massive corpora for a low-resource language and presents the publicly available UberText 2.0 corpus for the Ukrainian language and discusses the methodology of its construction. While the collection and maintenance of such a corpus is more of a data extraction and data engineering task, the corpus itself provides a solid foundation for natural language processing tasks. It can enable the creation of contemporary language models and word embeddings, resulting in a better performance of numerous downstream tasks for the Ukrainian language. In addition, the paper and software developed can be used as a guidance and model solution for other low-resource languages. The resulting corpus is available for download on the project page. It has 3.274 billion tokens, consists of 8.59 million texts and takes up 32 gigabytes of space.",
	}
	```

	Copyright: [Dmytro Chaplynskyi](https://twitter.com/dchaplinsky), [lang-uk](https://lang.org.ua) project, 2023