metadata

tags:
  - summarization
  - news
language: es
datasets:
  - mlsum

Spanish RoBERTa2RoBERTa (roberta-base-bne) fine-tuned on MLSUM ES for summarization

Model

BSC-TeMU/roberta-base-bne (RoBERTa Checkpoint)

Dataset

MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. The MLSUM authors report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases that motivate the use of a multilingual dataset.

MLSUM es

Results (WIP)

Set	Metric	Value
Test	Rouge2 - mid - precision	11.42
Test	Rouge2 - mid - recall	10.58
Test	Rouge2 - mid - fmeasure	10.69
Test	Rouge1 - fmeasure	28.83
Test	RougeL - fmeasure	23.15
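To make the Rouge2 precision, recall and fmeasure columns easier to read, here is a minimal sketch of how a ROUGE-N score can be computed from n-gram overlap. It is not the evaluation script behind the table: it assumes plain whitespace tokenization, lowercasing, no stemming and no bootstrap aggregation (the "mid" label in the table probably refers to the midpoint of such an aggregate, but that is an assumption). The helper names `ngrams` and `rouge_n` and the example sentences are hypothetical. Scores come out in the 0-1 range, whereas the table appears to use a 0-100 scale.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=2):
    """Simplified ROUGE-N: clipped n-gram overlap between two strings.

    Assumption: whitespace tokenization and lowercasing only; real ROUGE
    tooling usually adds its own tokenizer and optional stemming.
    """
    ref_counts = ngrams(reference.lower().split(), n)
    cand_counts = ngrams(candidate.lower().split(), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Hypothetical reference summary and model output, for illustration only.
reference = "el gobierno aprueba la nueva ley de vivienda"
candidate = "el gobierno aprueba la ley de vivienda"
scores = rouge_n(reference, candidate, n=2)
print({name: round(value, 4) for name, value in scores.items()})
```

Precision is the share of the candidate's bigrams that also occur in the reference, recall is the share of the reference's bigrams that the candidate recovers, and fmeasure is their harmonic mean.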

Usage

import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

# Use a GPU when one is available; otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article; anything beyond 512 tokens is truncated.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token IDs and decode them back to text.
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))

Created by: Narrativa

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI