racai/e4a-covid-distilbert-base-romanian-cased

This model was produced in the Enrich4All project.
It is the baseline model fine-tuned on a COVID-related corpus with the masked language modeling (MLM) task, and it is evaluated by perplexity.
Baseline model: https://huggingface.co/racai/distilbert-base-romanian-cased
Scripts and corpus used for training: https://github.com/racai-ai/e4all-models

Corpus

We built two COVID-19 datasets: a small corpus and a question-answer dataset. The sources were the official websites of Romanian institutions involved in managing the COVID-19 pandemic, such as the Ministry of Health, the Bucharest Public Health Directorate, the National Information Platform on Vaccination against COVID-19, and the Ministry of Foreign Affairs, as well as websites of the European Union. We also harvested the website of a non-profit initiative that, in partnership with the Romanian Government through the Romanian Digitization Authority, developed an extensive platform with sections dedicated to official COVID-19 news and recommendations. News websites were avoided because of the volatile, continuously changing pandemic situation. One reliable additional source was the website of a major private medical clinic (Regina Maria), which provided detailed medical articles on subjects of immediate interest to readers and patients, such as immunity, emerging treatment protocols, and the new Omicron variant of the virus. The corpus was collected and revised manually: the data were checked for grammatical correctness, and missing diacritics were restored.

The corpus consists of 55 UTF-8 encoded documents and contains 147,297 words.

Results

MLM Task	Perplexity
Baseline	68.39
COVID Fine-tuning	5.56
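
The card does not say how perplexity was computed. A common convention for MLM evaluation is to take the exponential of the mean cross-entropy loss over the masked tokens, and the sketch below assumes that convention. The function name `perplexity_from_losses` and the variable names are illustrative and do not come from the project scripts. Only the two perplexity values are taken from the table.

```python
import math


def perplexity_from_losses(losses):
    """Return exp(mean loss) for per-token cross-entropy losses given in nats.

    Assumption: perplexity is defined as exp(mean cross-entropy), which is a
    common convention but is not confirmed by the model card.
    """
    if not losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(losses) / len(losses))


# Perplexities reported in the results table.
BASELINE_PPL = 68.39
FINETUNED_PPL = 5.56

# Mean cross-entropy (nats per masked token) implied by each perplexity,
# under the exp(mean loss) assumption stated above.
baseline_loss = math.log(BASELINE_PPL)
finetuned_loss = math.log(FINETUNED_PPL)

# How many times lower the fine-tuned perplexity is than the baseline.
reduction_factor = BASELINE_PPL / FINETUNED_PPL

if __name__ == "__main__":
    print(f"implied baseline loss:   {baseline_loss:.4f}")
    print(f"implied fine-tuned loss: {finetuned_loss:.4f}")
    print(f"perplexity reduction:    {reduction_factor:.2f}x")
```

Under this assumption, a lower perplexity means the model assigns higher probability to the held-out masked tokens, so the table indicates a much better fit to COVID-related text after fine-tuning. The two figures are only comparable if both were measured on the same evaluation data with the same masking setup.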