MorenoLaQuatra committed
Commit: 42bd82d
Parent(s): 75856b2

Uploading model and tokenizer

Files changed:
- README.md +65 -0
- config.json +44 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +51 -0
- tokenizer_config.json +61 -0
- training_args.bin +3 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,3 +1,68 @@
---
language: "it"
license: mit
datasets:
- gsarti/clean_mc4_it
tags:
- bart
- pytorch
pipeline:
- text2text-generation
---

# BART-IT: Italian pretraining for BART sequence-to-sequence model

BART-IT is a sequence-to-sequence model based on the BART architecture, specifically tailored to the Italian language. The model is pre-trained on a [large corpus of Italian text](https://huggingface.co/datasets/gsarti/clean_mc4_it) and can be fine-tuned on a variety of tasks.

## Model description

The model is a `base`-sized BART model with a vocabulary size of 52,000 tokens. It has 140M parameters and can be used for any task that requires a sequence-to-sequence model. It is trained from scratch on a large corpus of Italian text and can be fine-tuned on a variety of tasks.
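
As a quick check of the sizes quoted above, the snippet below loads the configuration and counts the parameters. This is a minimal sketch assuming the checkpoint is available on the Hub under `morenolq/bart-it`; the exact total printed may differ slightly from the rounded 140M figure.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Inspect the configuration shipped with the checkpoint (see config.json in this commit)
config = AutoConfig.from_pretrained("morenolq/bart-it")
print(config.vocab_size)                              # 52000
print(config.d_model)                                 # 768
print(config.encoder_layers, config.decoder_layers)   # 6 6

# Count the parameters; the total is roughly 140M
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")
print(sum(p.numel() for p in model.parameters()))
```
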
## Pre-training

The code used to pre-train BART-IT together with additional information on model parameters can be found [here](https://github.com/MorenoLaQuatra/bart-it).
## Fine-tuning

The model in this repository is a pre-trained model without any fine-tuning. In order to use the model for a specific task, you can fine-tune it on a specific dataset (a minimal fine-tuning sketch follows the list below).

The model has been fine-tuned for the abstractive summarization task on 3 different Italian datasets:

- [FanPage](https://huggingface.co/datasets/ARTeLab/fanpage) - fine-tuned model [here](https://huggingface.co/MorenoLaQuatra/bart-it-fanpage)
- [IlPost](https://huggingface.co/datasets/ARTeLab/ilpost) - fine-tuned model [here](https://huggingface.co/MorenoLaQuatra/bart-it-ilpost)
- [WITS](https://huggingface.co/datasets/Silvia/WITS) - fine-tuned model [here](https://huggingface.co/MorenoLaQuatra/bart-it-WITS)
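
As an illustration of the fine-tuning step described above, the sketch below adapts the pre-trained checkpoint to abstractive summarization with the standard `Seq2SeqTrainer` API. It is a minimal sketch, not the recipe used for the released checkpoints: the `source`/`target` column names, sequence lengths, and hyperparameters are assumptions that should be adapted to the dataset you actually use.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("morenolq/bart-it")
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")

# Any Italian summarization dataset works; the column names below are assumptions
dataset = load_dataset("ARTeLab/fanpage", split="train")

def preprocess(batch):
    # Truncate documents to the model's 1024-token context (max_position_embeddings)
    model_inputs = tokenizer(batch["source"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-it-summarization",  # hypothetical output directory
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```
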
## Usage

In order to use the model, you can use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the pre-trained tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained("morenolq/bart-it")
model = AutoModelForSeq2SeqLM.from_pretrained("morenolq/bart-it")

# Encode an Italian input sentence and generate with beam search
input_ids = tokenizer.encode("Il modello BART-IT è stato pre-addestrato su un corpus di testo italiano", return_tensors="pt")
outputs = model.generate(input_ids, max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
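
The snippet above exercises the API on the pre-trained checkpoint, which is not fine-tuned for any downstream task. For abstractive summarization you can load one of the fine-tuned checkpoints listed in the previous section instead; below is a minimal sketch using the high-level `pipeline` API (the input text is only illustrative).

```python
from transformers import pipeline

# Use a fine-tuned checkpoint (here the FanPage one) for abstractive summarization
summarizer = pipeline("summarization", model="MorenoLaQuatra/bart-it-fanpage")

text = (
    "Il modello BART-IT è stato pre-addestrato su un ampio corpus di testo italiano "
    "e successivamente adattato al compito di riassunto astrattivo su dataset giornalistici."
)
print(summarizer(text, max_length=40, num_beams=4)[0]["summary_text"])
```
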
# Citation

If you find this model useful for your research, please cite the following paper:

```bibtex
@Article{BARTIT,
  AUTHOR = {La Quatra, Moreno and Cagliero, Luca},
  TITLE = {BART-IT: An Efficient Sequence-to-Sequence Model for Italian Text Summarization},
  JOURNAL = {Future Internet},
  VOLUME = {15},
  YEAR = {2023},
  NUMBER = {1},
  ARTICLE-NUMBER = {15},
  URL = {https://www.mdpi.com/1999-5903/15/1/15},
  ISSN = {1999-5903},
  DOI = {10.3390/fi15010015}
}
```
config.json
ADDED
@@ -0,0 +1,44 @@
{
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "scale_embedding": false,
  "torch_dtype": "float32",
  "transformers_version": "4.22.1",
  "use_cache": true,
  "vocab_size": 52000
}
merges.txt
ADDED
The diff for this file is too large to render.
See raw diff
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5330a2ce3a0b805d4168188392194e351554bbb8920849196ce57f6c4e402e83
size 563305977
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json
ADDED
@@ -0,0 +1,61 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "__type": "AddedToken",
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "tokenizer_class": "BartTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a128215921a95e7afb04256a99f42f68c405b5883a0c6e05fa5e68d81828fd84
size 3311
vocab.json
ADDED
The diff for this file is too large to render.
See raw diff