Model Overview

This is the model presented in the paper "Detecting Text Formality: A Study of Text Classification Approaches".

An XLM-RoBERTa-based classifier trained on XFORMAL, a multilingual formality classification dataset.

Results

All languages

              precision    recall  f1-score   support
0              0.744912  0.927790  0.826354    108019
1              0.889088  0.645630  0.748048     96845
accuracy                           0.794405    204864
macro avg      0.817000  0.786710  0.787201    204864
weighted avg   0.813068  0.794405  0.789337    204864
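
The summary rows can be recomputed from the two per-class rows, which is a quick way to check how to read these tables. The sketch below is only an illustration: it assumes the layout follows the usual classification-report convention (per-class rows, then accuracy, macro average, and support-weighted average) and that row labels 0 and 1 use the same mapping as `id2formality` in the usage example (0 = formal, 1 = informal). The helper names `f1_score`, `macro_avg`, and `weighted_avg` are hypothetical and defined here, not taken from any library.

```python
# Per-class rows copied from the "All languages" table.
rows = {
    0: {"precision": 0.744912, "recall": 0.927790, "f1": 0.826354, "support": 108019},
    1: {"precision": 0.889088, "recall": 0.645630, "f1": 0.748048, "support": 96845},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(metric):
    """Unweighted mean of a metric over the classes."""
    return sum(row[metric] for row in rows.values()) / len(rows)


def weighted_avg(metric):
    """Mean of a metric over the classes, weighted by class support."""
    total = sum(row["support"] for row in rows.values())
    return sum(row[metric] * row["support"] for row in rows.values()) / total


# F1 recomputed from precision and recall, to compare with the f1-score column.
recomputed_f1 = {
    label: f1_score(row["precision"], row["recall"]) for label, row in rows.items()
}
macro_f1 = macro_avg("f1")
weighted_f1 = weighted_avg("f1")
# With one label per example, accuracy equals the support-weighted recall.
accuracy = weighted_avg("recall")

print({label: round(value, 6) for label, value in recomputed_f1.items()})
print(round(macro_f1, 6), round(weighted_f1, 6), round(accuracy, 6))
```

Because the table values are rounded to six decimals, the recomputed numbers may differ from the reported ones in the last digit.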

EN

              precision    recall  f1-score   support
0              0.800053  0.962981  0.873988     22151
1              0.945106  0.725899  0.821124     19449
accuracy                           0.852139     41600
macro avg      0.872579  0.844440  0.847556     41600
weighted avg   0.867869  0.852139  0.849273     41600

FR

              precision    recall  f1-score   support
0              0.746709  0.925738  0.826641     21505
1              0.887305  0.650592  0.750731     19327
accuracy                           0.795504     40832
macro avg      0.817007  0.788165  0.788686     40832
weighted avg   0.813257  0.795504  0.790711     40832

IT

              precision    recall  f1-score   support
0              0.721282  0.914669  0.806545     21528
1              0.864887  0.607135  0.713445     19368
accuracy                           0.769024     40896
macro avg      0.793084  0.760902  0.759995     40896
weighted avg   0.789292  0.769024  0.762454     40896

PT

              precision    recall  f1-score   support
0              0.717546  0.908167  0.801681     21637
1              0.853628  0.599700  0.704481     19323
accuracy                           0.762646     40960
macro avg      0.785587  0.753933  0.753081     40960
weighted avg   0.781743  0.762646  0.755826     40960

How to use

from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# load tokenizer and model weights
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

id2formality = {0: "formal", 1: "informal"}
texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",
    "I feel deep regret and sadness about the situation in international politics.",
]

# prepare the input
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    return_token_type_ids=True,
    truncation=True,
    padding=True,  # pad to the longest text in the batch, not the model maximum
    return_tensors="pt",
)

# inference
output = model(**encoding)

formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
formality_scores
# Output:
# [{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
#  {'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
#  {'formal': 0.936184287071228, 'informal': 0.06381577253341675},
#  {'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
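
Each entry of `formality_scores` maps both labels to a probability. When a single predicted label per text is needed, one option is to take the highest-scoring key. The following is a minimal sketch: `top_label` is a hypothetical helper written for this example, not part of `transformers`, and the scores are copied from the example output so the snippet runs without loading the model.

```python
# Scores copied from the example output; in practice, reuse the
# `formality_scores` list produced by the usage example.
formality_scores = [
    {"formal": 0.993225634098053, "informal": 0.006774314679205418},
    {"formal": 0.8807966113090515, "informal": 0.1192033663392067},
    {"formal": 0.936184287071228, "informal": 0.06381577253341675},
    {"formal": 0.9986615180969238, "informal": 0.0013385231141000986},
]


def top_label(scores):
    """Return (label, probability) for the highest-scoring class."""
    label = max(scores, key=scores.get)
    return label, scores[label]


predictions = [top_label(scores) for scores in formality_scores]
for label, probability in predictions:
    print(f"{label}: {probability:.4f}")
```

Keeping the probability next to the label makes it easy to apply a custom confidence threshold later instead of always taking the most likely class.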

Citation

@inproceedings{dementieva-etal-2023-detecting,
    title = "Detecting Text Formality: A Study of Text Classification Approaches",
    author = "Dementieva, Daryna  and
      Babakov, Nikolay  and
      Panchenko, Alexander",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.31",
    pages = "274--284",
    abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}

Licensing Information

This model is licensed under the OpenRAIL++ License, which supports the development of various technologies—both industrial and academic—that serve the public good.
