
metadata

license: mit
language:
  - ru
metrics:
  - f1
  - roc_auc
  - precision
  - recall
pipeline_tag: text-classification
tags:
  - rubert
  - sentiment
datasets:
  - sismetanin/rureviews
  - RuSentiment
  - LinisCrowd2015
  - LinisCrowd2016
  - KaggleRussianNews

This is a RuBERT model fine-tuned for sentiment classification of short Russian texts. The task is multi-class classification with the following labels:

0: neutral
1: positive
2: negative
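
To make the label mapping concrete, here is a minimal sketch of turning a three-element logit vector into one of the labels above. It assumes the model head emits logits in the id order listed (0, 1, 2); the helper names `softmax` and `decode` and the example logits are hypothetical and are not taken from the model's code.

```python
import math

# Label ids as listed in the model card.
ID2LABEL = {0: "neutral", 1: "positive", 2: "negative"}


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, invented for illustration only.
label, prob = decode([0.2, 3.1, -1.0])
print(label, round(prob, 3))
```

The `pipeline` call in the Usage section performs an equivalent step internally and reports only the top label with its score.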

Usage

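from transformers import pipeline

# Load the fine-tuned model as a text-classification pipeline.
model = pipeline("text-classification", model="seara/rubert-base-cased-russian-sentiment")
# The input means "Hi, I like you!"
model("Привет, ты мне нравишься!")
# [{'label': 'positive', 'score': 0.9818321466445923}]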

Dataset

This model was trained on the union of the following datasets:

Kaggle Russian News Dataset
Linis Crowd 2015
Linis Crowd 2016
RuReviews
RuSentiment

An overview of the training data can be found in S. Smetanin's GitHub repository.

Download links for all Russian sentiment datasets collected by Smetanin can be found in this repository.

Training

Training was done in this project with these parameters:

max_length: 512
batch_size: 64
optimizer: adam
lr: 0.00001
weight_decay: 0
num_epochs: 5
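
The parameters above can be collected in a small config object, sketched below. The class name `TrainingConfig`, the helper `total_optimizer_steps`, and the training-set size of 100,000 are hypothetical; the step count also assumes one optimizer update per batch with the final partial batch kept, which the card does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the parameter list; field names mirror it.
    max_length: int = 512
    batch_size: int = 64
    optimizer: str = "adam"
    lr: float = 1e-5
    weight_decay: float = 0.0
    num_epochs: int = 5


def total_optimizer_steps(cfg, n_train_examples):
    """Optimizer updates over the full run (assumes one update per batch)."""
    steps_per_epoch = math.ceil(n_train_examples / cfg.batch_size)
    return steps_per_epoch * cfg.num_epochs


cfg = TrainingConfig()
# 100_000 is a hypothetical training-set size, not a figure from the card.
print(total_optimizer_steps(cfg, 100_000))
```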

Train/validation/test splits are 80%/10%/10%.
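
An 80%/10%/10% split can be sketched as follows. The card does not say whether the original split was random, stratified by label, or seeded, so the shuffling, the seed, and the function name `split_80_10_10` are assumptions made for illustration.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of the data and cut it into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Stand-in data: integers in place of (text, label) examples.
train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
```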

Eval results (on test split)

	neutral	positive	negative	macro avg	weighted avg
precision	0.71	0.84	0.75	0.77	0.76
recall	0.74	0.84	0.71	0.76	0.76
f1-score	0.73	0.84	0.73	0.76	0.76
auc-roc	0.86	0.95	0.91	0.91	0.90
support	5196	3831	3599	12626	12626
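
The "macro avg" and "weighted avg" columns can be related to the per-class columns with a short calculation, shown here for the precision row. This assumes the usual definitions (unweighted mean over classes, and mean weighted by class support). Because the inputs are already rounded to two decimals, recomputed values for other rows can differ from the table by about 0.01.

```python
# Per-class values copied from the table above.
support = {"neutral": 5196, "positive": 3831, "negative": 3599}
precision = {"neutral": 0.71, "positive": 0.84, "negative": 0.75}


def macro_avg(per_class):
    """Unweighted mean over classes."""
    return sum(per_class.values()) / len(per_class)


def weighted_avg(per_class, support):
    """Mean over classes, weighted by the number of test examples per class."""
    total = sum(support.values())
    return sum(per_class[c] * support[c] for c in per_class) / total


print(round(macro_avg(precision), 2), round(weighted_avg(precision, support), 2))
```

The per-class supports sum to 12626, which is the value shown in both average columns of the support row.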