---
license: mit
---

# Overview
A LoRA adapter created by fine-tuning the flan-t5-large model on the [SAMsum training dataset](https://huggingface.co/datasets/samsum).

SAMsum is a corpus of 16k dialogues and their corresponding summaries.

Example entry:
- Dialogue - "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"
- Summary - "Amanda baked cookies and will bring Jerry some tomorrow."

[LoRA](https://github.com/microsoft/LoRA) (Low-Rank Adaptation) is an efficient technique for fine-tuning large models for specific tasks:

> An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
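
In the paper's notation, a frozen pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is augmented with a trainable low-rank update:

$$
h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so each adapted matrix contributes roughly $r(d + k)$ trainable parameters instead of $dk$.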

In this case we fine-tune flan-t5-large on the SAMsum dataset to create a model that is better at dialogue summarization.

# Code

## Notebook Source
[Notebook used to create the LoRA adapter](https://colab.research.google.com/drive/1z_mZL6CIRRA4AeF6GXe-zpfEGqqdMk-f?usp=sharing)

## Load the samsum dataset used to fine-tune flan-t5-large
```
from datasets import load_dataset

# load the SAMsum dialogue/summary corpus
dataset = load_dataset("samsum")
```
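
Each record pairs a short dialogue with a reference summary. To inspect the splits and fields:
```
print(dataset)              # shows the train / validation / test splits
print(dataset["train"][0])  # a single record with 'id', 'dialogue', and 'summary'
```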

## Prepare the dataset
```
... see notebook
# save the tokenized datasets to disk for easy loading later
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
```
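
The full preprocessing lives in the notebook; in essence, it tokenizes the dialogues as model inputs and the summaries as labels. A minimal sketch of that step (the prompt prefix and max lengths below are illustrative assumptions, not necessarily the notebook's exact values):
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def preprocess(batch):
    # tokenize dialogues as inputs (the "summarize: " instruction prefix is an assumption)
    inputs = tokenizer(["summarize: " + d for d in batch["dialogue"]],
                       max_length=512, truncation=True)
    # tokenize the reference summaries as labels
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, batched=True,
                                remove_columns=dataset["train"].column_names)
```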

## Load the flan-t5-large model
Loading in 8-bit greatly reduces the amount of GPU memory required.

When combined with the accelerate library, device_map="auto" will use all available GPUs for training.
```
import torch
from transformers import AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto", torch_dtype=torch.float16)
```
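
As an optional sanity check, you can see how accelerate distributed the layers:
```
# mapping of module names to the devices device_map="auto" assigned them
print(model.hf_device_map)
```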

## Define LoRA config and prepare the model for training
The config below applies LoRA to the query ("q") and value ("v") projection matrices of each attention block.
```
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

# prepare the int-8 model for training
model = prepare_model_for_int8_training(model)

# add the LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
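
`print_trainable_parameters()` reports how small the adapter is compared to the frozen base model. If you want the numbers directly, a rough equivalent is:
```
# count trainable (LoRA) parameters vs. all parameters in the wrapped model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```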

## Create data collator
Data collators form a padded batch from a list of dataset elements. `tokenizer` here is the flan-t5-large tokenizer used during preprocessing.
```
from transformers import DataCollatorForSeq2Seq

# we want the loss to ignore the tokenizer's pad token
label_pad_token_id = -100

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
```
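
To see what the collator produces, you can collate a couple of tokenized examples into a single padded batch (purely an optional check):
```
# pad two tokenized examples into one batch of tensors
features = [tokenized_dataset["train"][i] for i in range(2)]
batch = data_collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```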

## Create the training arguments and trainer
```
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

output_dir = "lora-flan-t5-large"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence warnings during training; re-enable for inference!
```

## Train the model!
Training takes about 5-6 hours on a single T4 GPU.
```
trainer.train()
```

| Step | Training Loss |
|------|---------------|
| 500  | 1.302200      |
| 1000 | 1.306300      |
| 1500 | 1.341500      |
| 2000 | 1.278500      |
| 2500 | 1.237000      |
| 3000 | 1.239200      |
| 3500 | 1.250900      |
| 4000 | 1.202100      |
| 4500 | 1.165300      |
| 5000 | 1.178900      |
| 5500 | 1.181700      |
| 6000 | 1.100600      |
| 6500 | 1.119800      |
| 7000 | 1.105700      |
| 7500 | 1.097900      |
| 8000 | 1.059500      |
| 8500 | 1.047400      |
| 9000 | 1.046100      |

TrainOutput(global_step=9210, training_loss=1.1780610539108094, metrics={'train_runtime': 19217.7668, 'train_samples_per_second': 3.833, 'train_steps_per_second': 0.479, 'total_flos': 8.541847343333376e+16, 'train_loss': 1.1780610539108094, 'epoch': 5.0})
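
## Save the adapter and run inference
After training, only the LoRA weights need to be saved; they can later be loaded on top of the base flan-t5-large model for inference. A sketch of that workflow (the output path, prompt prefix, and generation settings are illustrative, not the notebook's exact code):
```
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel

# save only the small LoRA adapter weights (not the full base model)
model.save_pretrained("lora-flan-t5-large")

# later: load the frozen base model and attach the adapter
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", load_in_8bit=True, device_map="auto", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "lora-flan-t5-large")
model.config.use_cache = True  # re-enable the cache for inference
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

dialogue = "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"
inputs = tokenizer("summarize: " + dialogue, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```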