RoBERTa Greek base model

A model pretrained on Greek text with the Masked Language Modeling (MLM) objective using Hugging Face's Transformers library. This model is NOT case-sensitive, and all Greek diacritics are retained.

How to use

You can use this model directly with a pipeline for masked language modeling:

# Example sentence taken from the article below, which is not present in the train/eval set:
# https://www.news247.gr/politiki/misologa-maximoy-gia-tin-ekthesi-tsiodra-lytra-gia-ti-thnitotita-ektos-meth.9462425.html
from transformers import pipeline
pipe = pipeline('fill-mask', model='cvcio/roberta-el-news')
pipe(
    'Η κυβέρνηση μουδιασμένη από τη <mask> της έκθεσης Τσιόδρα-Λύτρα, '
    'επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.'
)
# outputs
[
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη δημοσιοποίηση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.5881184339523315, 'token': 20235, 'token_str': ' δημοσιοποίηση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη δημοσίευση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.05952141433954239, 'token': 9696, 'token_str': ' δημοσίευση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη διαχείριση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.029887061566114426, 'token': 4315, 'token_str': ' διαχείριση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη διαρροή της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.022848669439554214, 'token': 24940, 'token_str': ' διαρροή'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη ματαίωση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.01729060709476471, 'token': 46913, 'token_str': ' ματαίωση'
    }
]
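
Each prediction in the output above is a dictionary with `sequence`, `score`, `token`, and `token_str` keys. The following is a minimal sketch of filtering such a list; the helper name `top_tokens` and the 0.02 threshold are hypothetical and not part of the Transformers API, and the sample copies three entries from the output above with the `sequence` field omitted for brevity.

```python
def top_tokens(predictions, min_score=0.02):
    """Return (token, score) pairs whose score is at least min_score.

    `predictions` is a list of dicts shaped like the fill-mask output above.
    `token_str` starts with a space in that output, so it is stripped here.
    """
    return [
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]


# Entries copied from the output above (hypothetical subset, `sequence` omitted)
sample = [
    {"score": 0.5881184339523315, "token": 20235, "token_str": " δημοσιοποίηση"},
    {"score": 0.05952141433954239, "token": 9696, "token_str": " δημοσίευση"},
    {"score": 0.01729060709476471, "token": 46913, "token_str": " ματαίωση"},
]

for token, score in top_tokens(sample):
    print(token, score)
```

With the default threshold, the third entry is dropped because its score is below 0.02.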

Training data

The model was pretrained on 8 million unique news articles (approximately 160M sentences, 33GB of text), collected with MediaWatch between October 2016 and December 2021.

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The only preprocessing step was unescaping HTML entities to the corresponding Unicode characters (e.g. &amp; => &).
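
The unescaping step can be illustrated with Python's standard library. This is only a sketch: the model card does not say which tool was actually used, so `html.unescape` is an assumption, and the sample string is made up for illustration.

```python
import html

# Hypothetical raw text containing escaped HTML entities
raw = "Έρευνα &amp; Τεχνολογία &quot;2021&quot;"

# Convert HTML entities to the corresponding Unicode characters
clean = html.unescape(raw)

print(clean)  # Έρευνα & Τεχνολογία "2021"
```

Applying the same step to input text at inference time keeps it consistent with what the model saw during pretraining.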

Pretraining

The model was pretrained on an NVIDIA A10 GPU for 3 epochs (approximately 760K steps, 182 hours) with a batch size of 14 (x2 gradient accumulation steps = 28) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of 5e-5 and linear decay of the learning rate.
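
The figures above can be related with some quick arithmetic. This sketch assumes that the reported step count (765,414, from the training results) counts optimizer updates, each covering one effective batch; the variable names are illustrative only.

```python
per_device_batch = 14      # batch size
grad_accum_steps = 2       # gradient accumulation steps
total_steps = 765_414      # steps reported in the training results
epochs = 3
hours = 182

# Effective batch size per optimizer update
effective_batch = per_device_batch * grad_accum_steps

# Sequences (up to 512 tokens each) processed in total and per epoch,
# assuming every step consumes one full effective batch
sequences_seen = total_steps * effective_batch
sequences_per_epoch = sequences_seen // epochs

# Average throughput over the whole run
steps_per_second = total_steps / (hours * 3600)

print(effective_batch)             # 28
print(sequences_seen)              # 21431592
print(sequences_per_epoch)         # 7143864
print(round(steps_per_second, 2))  # 1.17
```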

Training results

epochs	steps	train/train_loss	train/loss	eval/loss
3	765,414	0.3960	1.2356	0.9028

Evaluation results

The model was fine-tuned on the NER task using the elNER dataset and achieved the following results:

task	epochs	lr	batch	dataset	precision	recall	f1	accuracy
ner	5	1e-5	16/16	elNER4	0.8954	0.9280	0.9114	0.9872
ner	5	1e-4	16/16	elNER18	0.9069	0.9268	0.9168	0.9823
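
As a sanity check on the table above, the F1 column can be compared with the harmonic mean of the reported precision and recall. This is a sketch that assumes the standard F1 definition; the model card does not state which evaluation tool produced the scores, and small differences are expected because precision and recall are themselves rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# (dataset, precision, recall, reported f1) copied from the table above
rows = [
    ("elNER4", 0.8954, 0.9280, 0.9114),
    ("elNER18", 0.9069, 0.9268, 0.9168),
]

for name, p, r, reported in rows:
    computed = f1_score(p, r)
    # Allow a small tolerance for rounding in the reported values
    print(name, abs(computed - reported) < 5e-4)
```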

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-5
train_batch_size: 14
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 28
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0
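
The `linear` scheduler type decays the learning rate from its initial value towards zero over the course of training. The sketch below shows the idea in plain Python; the function name `linear_decay_lr` is hypothetical, and the warmup length of 0 is an assumption, since no warmup setting is listed above.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay.

    warmup_steps=0 is an assumption; the hyperparameters above do not list it.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 765_414  # step count reported in the training results

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, total_steps))
```

Under these assumptions the learning rate starts at 5e-5, is about 2.5e-5 halfway through, and reaches 0 at the final step.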

Framework versions

Transformers 4.13.0
Pytorch 1.9.0+cu111
Datasets 1.16.1
Tokenizers 0.10.3

Authors

Dimitris Papaevagelou - @andefined

About Us

Civic Information Office is a non-profit organization based in Athens, Greece, that focuses on creating technology and research products for the public interest.
