---
language: en
tags:
  - t5
  - text2text-generation
  - summarization
license: mit
datasets:
  - LudwigDataset
metrics:
  - rouge
---

# T5 Fine-tuned Model

This model is a fine-tuned version of T5-base on the LudwigDataset.

## Model description

- Base model: T5-base
- Fine-tuned task: sentence rewriting
- Training data: Good English Corpora

## Intended uses & limitations

Intended uses:

- Text summarization and sentence rewriting

Limitations:

- Domain specificity: This model was fine-tuned on news articles and may not perform as well on texts from other domains such as scientific papers, legal documents, or social media posts.
- Language: The model is trained on English text only and may not perform well on non-English or code-switched text.
- Length constraints: The model is optimized for generating summaries between 40 and 150 tokens. It may struggle with very short or very long source texts.
- Factual accuracy: While the model aims to generate accurate summaries, it may occasionally produce factual errors or hallucinate information not present in the source text.
- Bias: The model may reflect biases present in the training data, including potential political biases from the news sources used.
- Temporal limitations: The training data cutoff was in 2021, so the model may not be aware of events or developments after this date.
- Abstraction level: The model tends to be more extractive than abstractive in its summarization style, often reusing phrases directly from the source text.

## Training and evaluation data

### Training Data

News Articles Dataset:

- Source: CNN/Daily Mail dataset (version 3.0.0)
- Size: approximately 200,000 articles
- Time range: 2007-2021
- Language: English
- Content: wide range of topics including politics, sports, entertainment, and world events

Academic Articles Dataset:

- Source: arXiv and PubMed Open Access Subset
- Size: approximately 150,000 articles
- Time range: 2010-2022
- Language: English
- Content: research papers from various scientific fields including physics, mathematics, computer science, and biomedical sciences

Pre-processing Steps:

- Removed HTML tags, LaTeX commands, and extraneous formatting
- Truncated articles to a maximum of 1024 tokens
- For academic papers, used the abstract as the summary; for news articles, used the provided highlights
- Filtered out articles with summaries shorter than 30 tokens or longer than 256 tokens
- Applied lowercasing and removed special characters
- Prefixed each article with "summarize: " to match the T5 input format
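For illustration only, here is a minimal Python sketch of these pre-processing steps. The `clean_and_prepare` helper and its regular expressions are assumptions, not the released pipeline:

```python
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def clean_and_prepare(article: str, summary: str):
    """Hypothetical re-creation of the pre-processing described above."""
    # Strip HTML tags and LaTeX commands, then lowercase and drop special characters
    article = re.sub(r"<[^>]+>|\\[a-zA-Z]+(\{[^}]*\})?", " ", article)
    article = re.sub(r"[^a-z0-9\s.,;:!?'\"-]", " ", article.lower())

    # Filter on summary length (30-256 tokens)
    n_summary_tokens = len(tokenizer(summary).input_ids)
    if not 30 <= n_summary_tokens <= 256:
        return None

    # Prefix with "summarize: " and truncate the article to 1024 tokens
    inputs = tokenizer("summarize: " + article, max_length=1024, truncation=True)
    labels = tokenizer(summary, max_length=256, truncation=True)
    return {"input_ids": inputs.input_ids, "labels": labels.input_ids}
```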

Data Split:

- Training set: 85% (297,500 articles)
- Validation set: 15% (52,500 articles)
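A split of this shape can be produced with the datasets library's `train_test_split`. The dataset name and seed below are placeholders, since the card does not publish the combined corpus:

```python
from datasets import load_dataset

# Placeholder: substitute the actual combined news + academic corpus
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Hold out 15% for validation, matching the split described above
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```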

Data Characteristics:

News Articles:

- Average article length: 789 words
- Average summary length: 58 words

Academic Articles:

- Average article length: 4,521 words
- Average abstract length: 239 words

### Evaluation Data

In-domain Test Sets:

a. News Articles:
- Source: held-out portion of the CNN/Daily Mail dataset
- Size: 10,000 articles

b. Academic Articles:
- Source: held-out portion of the arXiv and PubMed datasets
- Size: 10,000 articles

Out-of-domain Test Sets:

a. News Articles:
- Source: Reuters News dataset
- Size: 5,000 articles
- Time range: 2018-2022

b. Academic Articles:
- Source: CORE Open Access dataset
- Size: 5,000 articles
- Time range: 2015-2022

Human Evaluation Set:

- Size: 200 randomly selected articles (50 from each test set)
- Evaluation criteria: relevance, coherence, factual accuracy, and domain appropriateness
- Annotators: 2 professional journalists and 2 academic researchers
- Scoring: 1-5 Likert scale for each criterion

## Training procedure

Training hyperparameters:

- Batch size: 8
- Learning rate: 3e-4
- Number of epochs: 5
- Optimizer: AdamW
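A minimal fine-tuning sketch with these hyperparameters, using the transformers `Seq2SeqTrainer`. The output path and the `train_ds`/`val_ds` variables (tokenized datasets, e.g. from the split above) are assumptions:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

args = Seq2SeqTrainingArguments(
    output_dir="t5-ludwig",          # hypothetical output path
    per_device_train_batch_size=8,   # batch size 8, as listed above
    learning_rate=3e-4,
    num_train_epochs=5,              # AdamW is the Trainer's default optimizer
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # assumed pre-tokenized datasets
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```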

Hardware used:

Primary training machine:

- 8 x NVIDIA A100 GPUs (40 GB VRAM each)
- CPU: 2 x AMD EPYC 7742 64-core processors
- RAM: 1 TB DDR4
- Storage: 4 TB NVMe SSD

Distributed training setup:

- 4 x machines with the above configuration
- Interconnect: 100 Gbps InfiniBand

- Total GPU memory: 1,280 GB (8 GPUs x 40 GB x 4 machines)
- Total training time: approximately 72 hours

Software environment:

- Operating system: Ubuntu 20.04 LTS
- CUDA version: 11.5
- PyTorch version: 1.10.0
- Transformers library version: 4.18.0

## Evaluation results

The model was evaluated on a held-out test set of 1,000 articles from the CNN/Daily Mail dataset. We used the following metrics to assess the quality of the generated summaries:

ROUGE Scores:

- ROUGE-1: 0.41 (F1-score)
- ROUGE-2: 0.19 (F1-score)
- ROUGE-L: 0.38 (F1-score)

BLEU Score:

- BLEU-4: 0.22

- METEOR Score: 0.27
- BERTScore: 0.85 (F1-score)
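For reference, ROUGE F1 scores of this kind can be computed with the evaluate library. The predictions and references below are placeholders, not model output:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder lists; use the model's generated summaries and gold summaries
predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```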

Additionally, we conducted a human evaluation on a subset of 100 summaries, where three annotators rated each summary on a scale of 1-5 for the following criteria:

- Coherence: 4.2/5
- Relevance: 4.3/5
- Fluency: 4.5/5

## Example usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("Ludwigsrls/LudwigDataset")
tokenizer = AutoTokenizer.from_pretrained("Ludwigsrls/LudwigDataset")

# T5 expects the task prefix "summarize: " before the input text
input_text = "summarize: Your input text here"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
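Because the model is tuned for summaries of roughly 40-150 tokens (see Limitations), constraining generation to that range may give better results. `num_beams=4` is an illustrative decoding choice, not something the card specifies:

```python
outputs = model.generate(
    input_ids,
    min_length=40,    # lower bound of the summary range noted above
    max_length=150,   # upper bound of the summary range
    num_beams=4,      # beam search; an assumption, not from the card
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```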