---
language:
- cs
tags:
- abstractive summarization
- mbart-cc25
- Czech
license: apache-2.0
datasets:
- private CNC dataset news-based
metrics:
- rouge
- rougeraw
---

# mBART fine-tuned model for Czech abstractive summarization (HT2A-C)

This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on a Czech news dataset to generate abstractive summaries in Czech.

## Task

The model addresses the ``Headline + Text to Abstract`` (HT2A) task: from the headline and text of a Czech news article, it generates a multi-sentence summary that serves as the article's abstract.

## Dataset

The model was trained on the private CNC dataset provided by Czech News Center. The dataset contains about 3/4M (roughly 750K) Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Inputs were truncated and padded to 512 tokens for the encoder and 128 tokens for the decoder.
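
To make the length settings concrete, here is a minimal sketch of fixed-length truncation and padding on plain lists of token IDs. The helper name `fit_to_length`, the dummy token IDs, and the pad ID `1` are assumptions for illustration only. In practice the tokenizer performs this step, and the card does not state which padding ID or truncation side was used.

```python
ENCODER_MAX_LEN = 512  # encoder length stated in the card
DECODER_MAX_LEN = 128  # decoder length stated in the card
PAD_ID = 1             # assumed padding ID, for illustration only


def fit_to_length(token_ids, max_len, pad_id=PAD_ID):
    """Cut a sequence to max_len tokens, or right-pad it up to max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


long_article = list(range(1000, 1700))    # 700 dummy token IDs, too long
short_abstract = list(range(2000, 2040))  # 40 dummy token IDs, too short

encoder_input = fit_to_length(long_article, ENCODER_MAX_LEN)
decoder_target = fit_to_length(short_abstract, DECODER_MAX_LEN)

print(len(encoder_input), len(decoder_target))  # prints "512 128"
```

Every example ends up with the same shape, so batches can be stacked without further handling. The cost is that any article text beyond 512 tokens is not seen by the model.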

## Training

The model was trained on a single NVIDIA Tesla A100 40GB GPU for 60 hours. During training, it saw 3712K documents, corresponding to roughly 5.5 epochs.
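
The figures above imply an approximate epoch size and training throughput. The short calculation below derives them from the stated numbers. These are back-of-the-envelope values, not measurements reported by the author, and the card does not give the exact train/validation split.

```python
DOCS_SEEN = 3_712_000  # documents seen during training (from the card)
EPOCHS = 5.5           # approximate number of epochs (from the card)
TRAIN_HOURS = 60       # wall-clock training time (from the card)

docs_per_epoch = DOCS_SEEN / EPOCHS
docs_per_second = DOCS_SEEN / (TRAIN_HOURS * 3600)

print(f"documents per epoch: ~{docs_per_epoch:,.0f}")       # ~674,909
print(f"throughput: ~{docs_per_second:.1f} documents/s")   # ~17.2
```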

# Use

The example below assumes you are using the provided Summarizer.ipynb file.

```python
from collections import OrderedDict

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NOTE: `Summarizer` is not defined here; it is expected to come from Summarizer.ipynb.


def summ_config():
    cfg = OrderedDict([
        # summarization model - checkpoint from website
        ("model_name", "krotima1/mbart-ht2a-c"),
        # generation settings handed to the summarizer
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.89),
            ("repetition_penalty", 1.2),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # texts to summarize
        ("text", [
            "Input your Czech text",
        ]),
    ])
    return cfg


cfg = summ_config()

# load model
model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])

# init summarizer
summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
summarize(cfg["text"])
```
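
If the notebook is not at hand, the same settings can in principle be passed directly to the `transformers` generation API. The following is a hedged sketch, not the author's `Summarizer` implementation. The function name `summarize_texts` is hypothetical, `no_repeat_ngram_size` is left at the library default, and the sketch does not set mBART language codes, which may be needed depending on how the checkpoint was saved.

```python
# Generation settings copied from the configuration above.
GENERATION_KWARGS = {
    "num_beams": 4,
    "top_k": 40,
    "top_p": 0.92,
    "do_sample": True,
    "temperature": 0.89,
    "repetition_penalty": 1.2,
    "early_stopping": True,
    "max_length": 128,
    "min_length": 10,
}


def summarize_texts(texts, model_name="krotima1/mbart-ht2a-c"):
    """Hypothetical helper: return one generated summary per input text."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate inputs to the 512-token encoder length used in training.
    inputs = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `summarize_texts(["Input your Czech text"])` downloads the checkpoint on first use and returns a list of summary strings. Because `do_sample` is enabled, repeated calls can produce different summaries for the same input.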