---
tags:
- summarization
- news
language: es
datasets:
- mlsum
---

# Spanish RoBERTa2RoBERTa (roberta-base-bne) fine-tuned on MLSUM ES for summarization

## Model
Encoder-decoder model initialized from the [BSC-TeMU/roberta-base-bne](https://huggingface.co/BSC-TeMU/roberta-base-bne) RoBERTa checkpoint.

## Dataset
**MLSUM** is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, **Spanish**, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. The MLSUM authors report cross-lingual comparative analyses based on state-of-the-art systems; these highlight existing biases, which motivate the use of a multilingual dataset.

[MLSUM es](https://huggingface.co/datasets/viewer/?dataset=mlsum)

## Results (WIP)

| Set | Metric | Value |
|------|--------|-------|
| Test | ROUGE-2 (mid) precision | 11.42 |
| Test | ROUGE-2 (mid) recall | 10.58 |
| Test | ROUGE-2 (mid) F-measure | 10.69 |
| Test | ROUGE-1 F-measure | 28.83 |
| Test | ROUGE-L F-measure | 23.15 |
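
To illustrate how the precision, recall, and F-measure columns relate, here is a minimal sketch of ROUGE-2 for a single reference/candidate pair. It is not the evaluation code behind the table above: it assumes lowercased whitespace tokenization, no stemming, and no bootstrap aggregation (the "mid" values), and the function names `bigrams` and `rouge2` and the example sentences are made up for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Count consecutive token pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference, candidate):
    """Simplified ROUGE-2 for one pair (assumption: whitespace tokens, no stemming)."""
    ref = bigrams(reference.lower().split())
    cand = bigrams(candidate.lower().split())
    # Clipped overlap: each bigram counts at most as often as it appears in both texts
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Hypothetical example sentences, not taken from MLSUM
scores = rouge2(
    "el gobierno aprueba la nueva ley",
    "el gobierno aprueba una ley",
)
print(scores)
```

Precision is the share of candidate bigrams that also occur in the reference, recall is the share of reference bigrams recovered by the candidate, and the F-measure is their harmonic mean. Scores from a full ROUGE implementation can differ because of tokenization, stemming, and aggregation choices.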

## Usage

```python
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

# Use the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize, padding or truncating the input to 512 tokens
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate with the checkpoint's default generation settings
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))
```

Created by: [Narrativa](https://www.narrativa.com/)

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI