|
--- |
|
language: "it" |
|
license: mit |
|
datasets: |
|
- gsarti/clean_mc4_it |
|
tags: |
|
- bart |
|
- pytorch |
|
pipeline_tag: text2text-generation
|
--- |
|
|
|
# BART-IT: Italian pre-training for the BART sequence-to-sequence model
|
|
|
BART-IT is a sequence-to-sequence model based on the BART architecture and tailored specifically to the Italian language. The model is pre-trained on a [large corpus of Italian text](https://huggingface.co/datasets/gsarti/clean_mc4_it) and can be fine-tuned on a variety of downstream tasks.
|
|
|
## Model description |
|
|
|
The model is a base-sized BART model with a vocabulary of 52,000 tokens and roughly 140M parameters. It is trained from scratch on a large corpus of Italian text and can be fine-tuned on any task that requires a sequence-to-sequence architecture.
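
To sanity-check these figures, the published configuration and parameter count can be inspected directly from the Hub (a minimal sketch using the model identifier from the Usage section below):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Inspect the published configuration
config = AutoConfig.from_pretrained("morenolq/bart-it")
print(config.vocab_size)  # expected: 52000

# Count the parameters of the full model (~140M)
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")
print(f"{sum(p.numel() for p in model.parameters()):,}")
```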
|
|
|
|
|
## Pre-training |
|
|
|
The code used to pre-train BART-IT, together with additional information on the model parameters, can be found [here](https://github.com/MorenoLaQuatra/bart-it).
|
|
|
## Fine-tuning |
|
|
|
The model in this repository is pre-trained only, without any fine-tuning. To use it for a downstream task, fine-tune it on a task-specific dataset.
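
As a rough sketch of what fine-tuning could look like with the `transformers` `Seq2SeqTrainer` (the dataset choice, the `source`/`target` column names, and all hyperparameters below are illustrative assumptions, not the settings used in the paper):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("morenolq/bart-it")
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")

# Any Italian summarization dataset works here; the column names
# "source" and "target" are assumptions - check the dataset card.
dataset = load_dataset("ARTeLab/fanpage")

def preprocess(batch):
    # Tokenize articles and reference summaries (lengths are placeholders)
    model_inputs = tokenizer(batch["source"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-it-finetuned",
    num_train_epochs=3,               # placeholder hyperparameters
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```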
|
|
|
The model has already been fine-tuned for abstractive summarization on three different Italian datasets:
|
|
|
- [FanPage](https://huggingface.co/datasets/ARTeLab/fanpage) - fine-tuned model [here](https://huggingface.co/morenolq/bart-it-fanpage)

- [IlPost](https://huggingface.co/datasets/ARTeLab/ilpost) - fine-tuned model [here](https://huggingface.co/morenolq/bart-it-ilpost)

- [WITS](https://huggingface.co/datasets/Silvia/WITS) - fine-tuned model [here](https://huggingface.co/morenolq/bart-it-WITS)
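
Any of these checkpoints can be used out of the box through the summarization pipeline (a minimal sketch; the input text and generation parameters are illustrative):

```python
from transformers import pipeline

# Load one of the fine-tuned checkpoints listed above
summarizer = pipeline("summarization", model="morenolq/bart-it-fanpage")

article = "Il consiglio comunale ha approvato ieri sera il nuovo piano per la mobilità urbana, che prevede nuove piste ciclabili e il potenziamento del trasporto pubblico."
print(summarizer(article, max_length=64, num_beams=4)[0]["summary_text"])
```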
|
|
|
## Usage |
|
|
|
The pre-trained model can be loaded and used as follows:
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("morenolq/bart-it")
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")

# Tokenize an Italian sentence and generate output with beam search
input_ids = tokenizer.encode("Il modello BART-IT è stato pre-addestrato su un corpus di testo italiano", return_tensors="pt")
outputs = model.generate(input_ids, max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
## Citation
|
|
|
If you find this model useful for your research, please cite the following paper: |
|
|
|
```bibtex |
|
@Article{BARTIT, |
|
AUTHOR = {La Quatra, Moreno and Cagliero, Luca}, |
|
TITLE = {BART-IT: An Efficient Sequence-to-Sequence Model for Italian Text Summarization}, |
|
JOURNAL = {Future Internet}, |
|
VOLUME = {15}, |
|
YEAR = {2023}, |
|
NUMBER = {1}, |
|
ARTICLE-NUMBER = {15}, |
|
URL = {https://www.mdpi.com/1999-5903/15/1/15}, |
|
ISSN = {1999-5903}, |
|
DOI = {10.3390/fi15010015} |
|
} |
|
``` |
|
|