Model description

This model is a fine-tuned version of sentence-transformers/LaBSE, trained on my news dataset. The goal was a universal classifier for Russian-language news that preserves the base LaBSE model's ability to produce multilingual text embeddings in a single vector space. The model can also classify news articles in the other languages LaBSE supports, but classification quality there will be lower than for Russian-language texts. The training dataset is a well-balanced sample of news from the last five years.

It achieves the following results on the evaluation set:

  • Loss: 0.7314
  • Accuracy: 0.7793
  • F1: 0.7753
  • Precision: 0.7785
  • Recall: 0.7793
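
Recall equals accuracy in the figures above, which is what support-weighted averaging over classes would produce. The card does not state the averaging mode, so treat that as an assumption. The pure-Python sketch below, using made-up labels, illustrates why the two numbers coincide under weighted averaging.

```python
from collections import Counter


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true: list[int], y_pred: list[int]) -> float:
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    # recall(c) = correct[c] / support[c]; weight(c) = support[c] / total
    return sum((correct[c] / support[c]) * (support[c] / total) for c in support)


# Hypothetical toy labels, not taken from the evaluation set.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

The support terms cancel, so weighted recall reduces to the number of correct predictions divided by the number of examples, which is accuracy.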

How to use


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Switch the model to evaluation mode and move it to the target device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Returns embeddings for a list of texts"""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
    embeddings = outputs.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predicts the category for one or more news texts"""

    # Tokenize with padding and truncation, then move the tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Get the model logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits

    # Get the indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map the indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
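
predict_category returns only the arg-max label. When a score or the runner-up categories are useful, one row of outputs.logits (converted with .tolist()) can be passed through a softmax. The sketch below does this in plain Python for illustration; top_k_categories and the example logits are hypothetical and not part of the model's API, and the resulting probabilities are not necessarily calibrated.

```python
import math

# Same label mapping as in the usage snippet above.
id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_categories(logits: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Returns the k most probable categories with their softmax scores."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits for a single article (11 classes).
example_logits = [0.1, 2.3, -0.5, 0.9, -1.2, 0.0, 1.7, -0.3, 0.4, -2.0, -0.8]
print(top_k_categories(example_logits))
```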

Intended uses & limitations

Compared to my specialized model any-news-classifier, which is built specifically for news classification, this model scores noticeably lower on classification metrics.
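
Because the model is meant to keep LaBSE's shared multilingual embedding space, one possible use alongside classification is comparing texts by embedding similarity. The embeddings returned by create_sentence_or_batch_embeddings are L2-normalized, so a plain dot product equals cosine similarity. The sketch below uses toy 3-dimensional unit vectors in place of real embeddings; most_similar is a hypothetical helper.

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def most_similar(query: list[float], candidates: list[list[float]]) -> int:
    """Returns the index of the candidate closest to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy unit vectors standing in for real normalized embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
]
print(most_similar(query, candidates))
```

With real data, query and candidates would be rows of the list returned by create_sentence_or_batch_embeddings, for example a Russian headline and its candidate translations.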

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 10
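
For reference, a linear scheduler with 500 warmup steps ramps the learning rate from 0 up to 1e-05 and then decays it linearly to 0. The sketch below follows the usual linear-with-warmup rule (the same shape as get_linear_schedule_with_warmup in Transformers). The total_steps default of 35960 is an assumption: 10 planned epochs at the 3596 steps per epoch shown in the training results.

```python
def linear_lr_with_warmup(
    step: int,
    base_lr: float = 1e-05,
    warmup_steps: int = 500,
    total_steps: int = 35960,  # assumed: 10 epochs * 3596 steps per epoch
) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 14384, 35960):
    print(step, linear_lr_with_warmup(step))
```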

Training results

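| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|---------------|-------|-------|-----------------|----------|--------|-----------|--------|
| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |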

Framework versions

  • Transformers 4.42.4
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1