bigbird pegasus on the booksum dataset
This is the "latest" version of the model, i.e. the one that has been trained the longest; it is currently at 70k steps.
- GOAL: a summarization model that 1) summarizes the source content accurately and 2) more importantly, IMO, produces summaries that are easy to read and understand (*cough* unlike arXiv *cough*).
- This model attempts to help with that by using the booksum dataset to provide explanatory summarization.
- Explanatory Summary - A summary that both consolidates information and explains why that consolidated information is important.
- This model was trained for seven epochs in total (approx. 70,000 steps) and is closer to finished.
- It will continue to improve (slowly, now that it has been trained for a long time) based on any findings or feedback.
- The starting checkpoint was google/bigbird-pegasus-large-bigpatent.
example usage
An extended example, including a demo of batch summarization, is here.
- create the summarizer object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

# load the fine-tuned model and its tokenizer from the Hub
model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)

# wrap both in a summarization pipeline
summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
)
- define text to be summarized, and pass it through the pipeline. Boom done.
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)
# the pipeline returns a list with one dict per input text
print(result[0]["summary_text"])
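For several documents at once, the same call can be wrapped in a small loop. The following is only a minimal sketch and not the extended demo mentioned above: the helper name `summarize_batch`, the `GEN_KWARGS` constant, and the default `batch_size` are assumptions made for illustration. It expects any callable that behaves like the `summarizer` pipeline object, i.e. one that takes a list of strings and returns one result per string.

```python
from typing import Callable, List, Sequence

# Generation settings copied from the single-text example.
GEN_KWARGS = dict(
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)


def summarize_batch(
    summarizer: Callable,
    texts: Sequence[str],
    batch_size: int = 2,  # hypothetical default; tune to the available memory
) -> List[str]:
    """Summarize texts in small groups and return the summaries in input order."""
    summaries: List[str] = []
    for start in range(0, len(texts), batch_size):
        group = list(texts[start:start + batch_size])
        outputs = summarizer(group, **GEN_KWARGS)
        for out in outputs:
            # Depending on the library version, each result may be a dict
            # or a one-item list holding a dict; handle both.
            if isinstance(out, list):
                out = out[0]
            summaries.append(out["summary_text"])
    return summaries
```

With the pipeline created earlier, usage would look like `summaries = summarize_batch(summarizer, [text_a, text_b, text_c])`, where the `text_*` variables are placeholders for your own documents.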
Alternate Checkpoint
- If you experience runtime or memory issues, try this earlier checkpoint at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster.
- See similar summarization models fine-tuned on booksum but using different architectures: long-t5 base and LED-Large.
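Independently of which checkpoint is used, runtime and memory problems often come from very long inputs. One common workaround, sketched below, is to split the text into smaller pieces, summarize each piece, and join the results. This is an illustrative assumption, not something the model author prescribes: the helper names and the `max_words` default are hypothetical, and a word count is only a rough stand-in for the tokenizer's real token count.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 1024) -> List[str]:
    """Split text on whitespace into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long_text(
    summarizer: Callable,
    text: str,
    max_words: int = 1024,  # hypothetical budget, not a documented model limit
    **gen_kwargs,
) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    partial_summaries = []
    for chunk in split_into_chunks(text, max_words):
        result = summarizer(chunk, **gen_kwargs)
        # For a single string the pipeline returns a list with one dict.
        partial_summaries.append(result[0]["summary_text"])
    return " ".join(partial_summaries)
```

Chunking on word boundaries can cut a passage mid-thought, so summaries near chunk borders may lose context. Splitting on paragraphs or chapters is usually a better fit when the input has that structure.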
Evaluation results
- ROUGE-1 on kmfoda/booksum (test set, verified): 34.076
- ROUGE-2 on kmfoda/booksum (test set, verified): 5.918
- ROUGE-L on kmfoda/booksum (test set, verified): 16.387
- ROUGE-LSUM on kmfoda/booksum (test set, verified): 31.612
- loss on kmfoda/booksum (test set, verified): 3.522
- gen_len on kmfoda/booksum (test set, verified): 254.368
- ROUGE-1 on launch/gov_report (test set, verified): 40.015
- ROUGE-2 on launch/gov_report (test set, verified): 10.741
- ROUGE-L on launch/gov_report (test set, verified): 20.134
- ROUGE-LSUM on launch/gov_report (test set, verified): 36.774
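For readers unfamiliar with the metric, the sketch below shows roughly what a ROUGE-1 score measures: the unigram overlap between a reference summary and a generated one, combined into an F1 value. It is a deliberately simplified illustration (lowercasing and whitespace tokenization only, no stemming) and is not the evaluator that produced the verified numbers listed here. The function name `rouge1_f1` is an assumption, and the result is on a 0 to 1 scale, whereas the scores above appear to be scaled by 100.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the reference "the cat sat on the mat" with the candidate "the cat" gives a precision of 1.0 and a recall of 2/6, so the F1 value is 0.5.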