	---
	library_name: transformers
	license: apache-2.0
	base_model: google/pegasus-xsum
	datasets:
	- eilamc14/wikilarge-clean
	language:
	- en
	tags:
	- pegasus
	- text-simplification
	- WikiLarge
	model-index:
	- name: pegasus-xsum-text-simplification
	  results:
	  - task:
	      type: text2text-generation
	      name: Text Simplification
	    dataset:
	      name: ASSET
	      type: facebook/asset
	      url: https://huggingface.co/datasets/facebook/asset
	      split: test
	    metrics:
	    - type: SARI
	      value: 33.80
	    - type: FKGL
	      value: 9.23
	    - type: BERTScore
	      value: 87.54
	    - type: LENS
	      value: 62.46
	    - type: Identical ratio
	      value: 0.29
	    - type: Identical ratio (ci)
	      value: 0.29

	  - task:
	      type: text2text-generation
	      name: Text Simplification
	    dataset:
	      name: MEDEASI
	      type: cbasu/Med-EASi
	      url: https://huggingface.co/datasets/cbasu/Med-EASi
	      split: test
	    metrics:
	    - type: SARI
	      value: 32.68
	    - type: FKGL
	      value: 10.98
	    - type: BERTScore
	      value: 45.14
	    - type: LENS
	      value: 50.55
	    - type: Identical ratio
	      value: 0.30
	    - type: Identical ratio (ci)
	      value: 0.30

	  - task:
	      type: text2text-generation
	      name: Text Simplification
	    dataset:
	      name: OneStopEnglish
	      type: OneStopEnglish
	      url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
	      split: advanced→elementary
	    metrics:
	    - type: SARI
	      value: 37.07
	    - type: FKGL
	      value: 8.66
	    - type: BERTScore
	      value: 77.77
	    - type: LENS
	      value: 60.97
	    - type: Identical ratio
	      value: 0.40
	    - type: Identical ratio (ci)
	      value: 0.40
	---

	# Model Card for pegasus-xsum-text-simplification

	This is one of the models fine-tuned for text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project.

	## Model Details

	### Model Description

	Fine-tuned sequence-to-sequence (encoder–decoder) Transformer for English text simplification.
	Trained on the dataset `eilamc14/wikilarge-clean` (cleaned WikiLarge-style pairs).

	- Model type: Seq2Seq Transformer (encoder–decoder)
	- Language (NLP): English
	- License: `apache-2.0`
	- Finetuned from model: `google/pegasus-xsum`

	### Model Sources

	- Repository (code): https://github.com/eilamc14/Simplify-This
	- Dataset: https://huggingface.co/datasets/eilamc14/wikilarge-clean
	- Paper: https://arxiv.org/abs/2601.05794

	## Uses

	### Direct Use

	The model is intended for English text simplification.

	- Input format: `Simplify: <complex sentence>`
	- Output: `<simplified sentence>`

	Typical uses:
	- Research on automatic text simplification
	- Benchmarking against other simplification systems
	- Demos/prototypes that require simpler English rewrites

	### Downstream Use

	This repository already contains a fine-tuned model specialized for text simplification.

	Further fine-tuning is optional and mainly relevant when:
	- Adapting to a markedly different domain (e.g., medical/legal/news)
	- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
	- Distilling/quantizing for deployment constraints

	When fine-tuning further, keep the same input convention: `Simplify: <...>`.

	### Out-of-Scope Use

	Not intended for:
	- Tasks unrelated to simplification (dialogue, translation, etc.)
	- Production use without additional safety filtering (no toxicity/bias mitigation)
	- Languages other than English
	- High-stakes settings (legal/medical advice, safety-critical decisions)


	## Bias, Risks, and Limitations

	The model was trained on Wikipedia and Simple English Wikipedia alignments (via WikiLarge).
	As a result, it inherits the characteristics and limitations of this data:

	- Domain bias: Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
	- Content bias: Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
	- Simplification quality: The model may:
	  - Over-simplify (drop important details)
	  - Under-simplify (retain complex phrasing)
	  - Produce ungrammatical or awkward rephrasings
	- Language limitation: Only suitable for English. Applying to other languages is unsupported.
	- Safety limitation: The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.


	### Recommendations

	- Evaluation required: Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, human evaluation).
	- Human oversight: Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
	- Attribution: Preserve source attribution where required (Wikipedia → CC BY-SA).
	- Not for high-stakes use: Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.

	## How to Get Started with the Model

	Load the model and tokenizer directly from the Hugging Face Hub:

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	model_id = "eilamc14/pegasus-xsum-text-simplification"  # this card's Pegasus checkpoint, not the BART variant
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

	# Example input
	PREFIX = "Simplify: "
	text = "The committee deemed the proposal unnecessarily complicated."

	# Tokenize and generate
	inputs = tokenizer(PREFIX+text, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
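
To simplify several sentences at once, the same steps can be wrapped in a small helper. This is a minimal sketch rather than the project's code: `simplify_batch` is a hypothetical name, and the padding and truncation settings are assumptions added for batching. The prefix and generation settings are the ones used in the example above.

```python
PREFIX = "Simplify: "  # input convention described in this model card


def simplify_batch(texts, model, tokenizer, max_new_tokens=64, num_beams=4):
    """Simplify a list of sentences with a single generate() call.

    `model` and `tokenizer` are the objects returned by from_pretrained().
    Padding and truncation are assumptions added so inputs can be batched.
    """
    prompts = [PREFIX + t.strip() for t in texts]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `simplify_batch([...], model, tokenizer)` returns one string per input sentence, in the same order.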

	## Training Details

	### Training Data

	The [WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset.

	### Training Procedure

	- Hardware: NVIDIA L4 GPU on Google Colab
	- Objective: Standard sequence-to-sequence cross-entropy loss
	- Training type: Full fine-tuning of all parameters (no LoRA/PEFT used)
	- Batching: Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
	- Evaluation: Monitored on the `validation` split with metrics (SARI and identical_ratio)
	- Stopping criteria: Early stopping callback based on validation performance

	#### Preprocessing

	The dataset was preprocessed by prefixing each source sentence with "Simplify: " and tokenizing both the source (inputs) and target (labels).
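
A minimal sketch of this step, assuming a batch with `source` and `target` columns. The column names and both length limits are assumptions, not values taken from the training code. `text_target` is the tokenizer argument that recent `transformers` releases provide for tokenizing labels.

```python
PREFIX = "Simplify: "


def preprocess(batch, tokenizer, max_source_length=128, max_target_length=128):
    """Prefix each source sentence, then tokenize sources and targets.

    The column names ("source", "target") and length limits are assumptions.
    """
    sources = [PREFIX + s for s in batch["source"]]
    model_inputs = tokenizer(sources, max_length=max_source_length, truncation=True)
    labels = tokenizer(
        text_target=batch["target"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With a `datasets.Dataset`, a function like this would typically be applied via `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)`. No padding is applied here because the card states that batches were padded dynamically.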

	#### Memory & Checkpointing

	To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled:

	```python
	model.config.use_cache = False # required when using gradient checkpointing
	model.gradient_checkpointing_enable() # saves memory at the cost of extra compute
	```

	Notes:
	- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
	- Gradient checkpointing lowers GPU memory use at the cost of slower training, because activations are recomputed during the backward pass.
	- For inference/evaluation, re-enable the cache for faster generation:

	```python
	model.config.use_cache = True
	```

	#### Training Hyperparameters

	The models were trained with Hugging Face `Seq2SeqTrainingArguments`.
	Hyperparameters varied slightly across models and runs as they were tuned, and full logs (batch size, steps, exact LR schedule) were not preserved.
	Below are the typical defaults used:

	- Epochs: 5
	- Evaluation strategy: every 300 steps
	- Save strategy: every 300 steps (keep best model, `eval_loss` as criterion)
	- Learning rate: ~3e-5
	- Batch size: ~8–64, depending on model size
	- Optimizer: `adamw_torch_fused`
	- Precision: bf16
	- Generation config (during eval): `max_length=128`, `num_beams=4`, `predict_with_generate=True`
	- Other settings:
	  - Weight decay: 0.01
	  - Label smoothing: 0.1
	  - Warmup ratio: 0.1
	  - Max grad norm: 0.5
	  - Dataloader workers: 8 (L4 GPU)

	> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly.
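
The typical defaults listed above map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the logged configuration. `output_dir` and the batch size are placeholders, and `load_best_model_at_end` and `greater_is_better` are assumptions inferred from "keep best model". The `eval_strategy` argument is spelled `evaluation_strategy` in older `transformers` releases.

```python
# Keyword arguments for transformers.Seq2SeqTrainingArguments, kept as a plain
# dict so the values can be inspected on a machine without a GPU.
training_kwargs = dict(
    output_dir="outputs",              # placeholder path (assumption)
    num_train_epochs=5,
    eval_strategy="steps",             # "evaluation_strategy" in older releases
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,       # assumption: implied by "keep best model"
    metric_for_best_model="eval_loss",
    greater_is_better=False,           # lower eval_loss is better
    learning_rate=3e-5,
    per_device_train_batch_size=8,     # placeholder; card reports ~8-64
    optim="adamw_torch_fused",
    bf16=True,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    dataloader_num_workers=8,
)

# On hardware with bf16 support:
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**training_kwargs)
```

Keeping `eval_steps` and `save_steps` equal matters when the best checkpoint is reloaded at the end, since each evaluation then has a matching saved checkpoint.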

	## Evaluation

	### Testing Data

	- [ASSET](https://huggingface.co/datasets/facebook/asset) (test subset)
	- [MEDEASI](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
	- [OneStopEnglish](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)

	### Metrics

	- Identical ratio: share of outputs identical to the source after both are normalized with basic, language-agnostic steps (strip, NFKC, collapse spaces)
	- Identical ratio (ci): case-insensitive identical ratio
	- SARI: main simplification metric (higher is better)
	- FKGL: readability grade level (lower is simpler)
	- BERTScore (F1): semantic similarity (higher is better)
	- LENS: composite simplification quality score (higher is better)
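
The identical-ratio definition above can be sketched in a few lines. This illustrates the stated normalization (strip, NFKC, collapse spaces) and is not the project's evaluation code. The function names are hypothetical, and collapsing every whitespace run (not only spaces) is an assumption.

```python
import re
import unicodedata


def normalize(text):
    """NFKC-normalize, collapse whitespace runs to one space, and strip."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def identical_ratio(sources, outputs, case_insensitive=False):
    """Share of outputs equal to their source after normalization."""
    if len(sources) != len(outputs):
        raise ValueError("sources and outputs must have the same length")
    if not sources:
        return 0.0
    same = 0
    for src, out in zip(sources, outputs):
        a, b = normalize(src), normalize(out)
        if case_insensitive:
            a, b = a.casefold(), b.casefold()
        same += a == b
    return same / len(sources)
```

Passing `case_insensitive=True` gives the "(ci)" variant. A high value means the model often returns its input unchanged.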

	### Generation Arguments

	```python
	gen_args = dict(
	    max_new_tokens=64,
	    num_beams=4,
	    length_penalty=1.0,
	    no_repeat_ngram_size=3,
	    early_stopping=True,
	    do_sample=False,
	)
	```

	### Results

	| Dataset        | Identical ratio | Identical ratio (ci) |  SARI |  FKGL | BERTScore |  LENS |
	|----------------|----------------:|---------------------:|------:|------:|----------:|------:|
	| ASSET          |            0.29 |                 0.29 | 33.80 |  9.23 |     87.54 | 62.46 |
	| MEDEASI        |            0.30 |                 0.30 | 32.68 | 10.98 |     45.14 | 50.55 |
	| OneStopEnglish |            0.40 |                 0.40 | 37.07 |  8.66 |     77.77 | 60.97 |


	## Environmental Impact

	- Hardware Type: Single NVIDIA L4 GPU (Google Colab)
	- Hours used: Approx. 5–10
	- Cloud Provider: Google Cloud (via Colab)
	- Compute Region: Unknown (Google Colab dynamic allocation)
	- Carbon Emitted: Estimated to be very low (< a few kg CO₂eq), since training was limited to a single GPU for a small number of hours.

	## Citation

	@misc{simplifythis2025,
	author = {Cohen, Eilam and others},
	title = {Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs},
	year = {2025},
	howpublished = {\url{https://github.com/eilamc14/Simplify-This}},
	note = {GitHub repository},
	urldate = {2025-09-30}
	}