How to fine-tune this? + Training code
I have tried fine-tuning the model with LoRA (PEFT) using the following target modules: 'lm_head.linear' and 'transformer.embd.wte'. This resulted in better responses, but I feel like something is wrong in my training setup: the model often behaves strangely, and its responses are significantly worse than those from Mistral 7B. Considering Microsoft called this the state-of-the-art model below 13B parameters and said it beats Mistral, it should outperform it, not underperform. I use a high-quality proprietary Q&A dataset, so dataset quality should not be the issue.
Just to confirm, am I using the right 'target_modules', or should I use different ones? Here is my training code:
import os
from dataclasses import dataclass, field
from typing import Optional
import torch
from datasets import load_dataset
from datasets import load_from_disk
from peft import LoraConfig
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
)
from tqdm.notebook import tqdm
from trl import SFTTrainer
from huggingface_hub import interpreter_login
interpreter_login()
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype='float16',
bnb_4bit_use_double_quant=False,
)
device_map = {"": 0}
#Download model
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-2",
quantization_config=bnb_config,
device_map=device_map,
trust_remote_code=True,
use_auth_token=True
)
model.config.pretraining_tp = 1
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=32,
target_modules=['lm_head.linear', 'transformer.embd.wte'], # is this correct?
bias="none",
task_type="CAUSAL_LM",
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
training_arguments = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
optim="paged_adamw_32bit",
save_steps=500, #CHANGE THIS IF YOU WANT IT TO SAVE LESS OFTEN. I WOULDN'T SAVE MORE OFTEN BECAUSE OF SPACE
logging_steps=10,
learning_rate=2e-4,
fp16=False,
bf16=True,
max_grad_norm=.3,
max_steps=10000,
warmup_ratio=.03,
group_by_length=True,
lr_scheduler_type="constant",
)
model.config.use_cache = False
dataset = load_dataset("json", data_files="your_dataset.json", split="train")
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=2048,
tokenizer=tokenizer,
args=training_arguments,
packing=False,
)
trainer.train()
This might be a naive question, but if you load the model in fp16, why do you train with bf16=True?
I'm guessing we need additional target modules and a higher rank given the model is smaller? If you're only using one GPU, the effective batch size is still really small. They trained over a ton of tokens, so I'm wondering if the learning rate might need to be lower as well.
That being said, you made it further than I did; I was running into the gradient checkpointing error (there's already a pull request, so I was hoping that would be merged in). So I haven't experimented nearly enough. Thanks for providing your code, since at least it runs, and you have me beat there...
Regarding your question about bf16 & fp16:
When you load a model in fp16 (float16), it uses less memory, which is great for handling large models. Training, however, needs more numerical headroom. That's where bf16 (bfloat16) comes in: it has the same 16-bit memory footprint as fp16, but it keeps the wide exponent range of fp32 (trading away some precision), which makes the calculations during training more numerically stable. So you get the memory savings of a 16-bit format with fewer overflow and underflow problems while training.
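If it helps, you can see the difference directly with torch.finfo (a quick illustrative check, not from the original post):
import torch

# fp16 and bf16 both use 16 bits, but bf16 trades mantissa precision for a
# much wider exponent range (the same range as fp32), which is why it is
# more forgiving during training.
print(torch.finfo(torch.float16))   # max ~6.55e4, roughly 3 decimal digits of precision
print(torch.finfo(torch.bfloat16))  # max ~3.39e38, roughly 2 decimal digits of precision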
“I'm guessing we need additional target modules + higher rank given the model is smaller?” - Maybe. What I did was run print(model), copy the printed module structure, and paste it into GPT-4, which selected the two modules from my previous message as the ones to target.
Anyway, I have no idea; my only hope is that I've missed some modules or messed something up, because otherwise the training results are disappointing. If you figure it out please let me know, and I'll do the same if I come across any new info.
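As a sanity check on the module names (rather than relying on GPT-4), you can also list the model's linear layers yourself; a small sketch, assuming model is the phi-2 model loaded as in the script above:
import torch.nn as nn

# Print every Linear-like submodule; these names (or their suffixes) are what
# LoRA's target_modules should refer to. bitsandbytes' Linear4bit subclasses
# nn.Linear, so the 4-bit quantized layers show up here as well.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)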
In LoraConfig, having read the LoRA paper, my understanding is that we should target the self-attention layers.
Below is my LoraConfig:
LoraConfig(
r=32,
lora_alpha=16,
target_modules=[
'Wqkv',
'out_proj'
],
bias="none",
lora_dropout=0.05, # Conventional
task_type="CAUSAL_LM",
)
Excellent results! @navanit thank you for confirming the correct target_modules; the model now responds as expected.
Here is an example prompt I gave it: How can advances in artificial intelligence and machine learning contribute to more accurate and timely weather forecasting, and what are the limitations of relying on these technologies for weather predictions?
@cekal
I am facing an error when using your code:
ValueError: PhiForCausalLM does not support gradient checkpointing.
Any workaround?
@Navanit-shorthills which GPU are you using? I'm on 1x A100 on runpod.io (Jupyter notebook). The error you're encountering is due to the incompatibility of the PhiForCausalLM model with gradient checkpointing. To resolve this, you need to disable gradient checkpointing. This might increase memory usage, but it's necessary for this specific model architecture. You may try setting
model.config.gradient_checkpointing = False
right after loading the model. Replace the following section of the previous script with this one and try running it:
# Configure model and training
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype='float16',
bnb_4bit_use_double_quant=False,
)
device_map = {"": 0}
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-2",
quantization_config=bnb_config,
device_map=device_map,
trust_remote_code=True,
use_auth_token=True
)
# Disable gradient checkpointing
model.config.gradient_checkpointing = False
Let me know if that solves the issue or not.
@cekal thanks for the answer. Currently I am using an NVIDIA GeForce RTX 3090 with 24.5 GB of GPU memory; I will see if I can train on it.
@cekal you were right. I tried working around it: after disabling gradient_checkpointing, I started facing a CUDA out-of-memory error. Is there any workaround? With the same GPU I trained Llama 2 7B and Mistral 7B, but I am unable to fine-tune this 2B-parameter model.
@Navanit-shorthills it seems like more people are running into this problem. Instead of trying to turn off gradient checkpointing, which is probably not the most effective approach, try adding checkpointing=True to AutoModelForCausalLM.from_pretrained:
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", checkpointing=True)
But again, I cannot verify that this works, as @rungao2001 got "TypeError: PhiForCausalLM.__init__() got an unexpected keyword argument 'checkpointing'" when applying it. But try it; it might work.
If that doesn't work, try the model.config.gradient_checkpointing = False approach as before, but reduce the batch size and train with a lower max_seq_length (e.g. max_seq_length=2048 ----> max_seq_length=1096). This can, however, produce a less capable model.
A last suggestion if everything fails is to either wait, as it seems more people are encountering this issue, or to use cloud computing like runpod.io (it cost me $15-$20 to fully fine-tune it).
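Another option worth trying (a sketch I have not verified on phi-2): let PEFT prepare the quantized model for training and keep gradient checkpointing off explicitly, instead of only touching model.config:
from peft import prepare_model_for_kbit_training

# Prepares the 4-bit model for training (casts norms/embeddings appropriately);
# use_gradient_checkpointing=False sidesteps the PhiForCausalLM checkpointing
# error at the cost of higher memory use.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)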
@cekal thanks, I was able to fine-tune by decreasing max_seq_length to 720.
Also, I had used the below config.
Still, it's odd: I was able to train Mistral and Llama 2 7B with a 2048 max_seq_length on my 24 GB GPU.
@Deepakvictor it might be because you used a different version of the model. The results I displayed were from my custom fine-tuned version of phi-2, which is currently private.
https://github.com/hiyouga/LLaMA-Factory - this repo seems to support Phi-2; here is my toy working script:
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate llama_factory
MODEL_NAME=phi-2
STAGE=sft
EPOCH=.01 #3.0
DATA=alpaca_gpt4_zh
SAVE_PATH=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH
SAVE_PATH_PREDICT=$SAVE_PATH/Predict
MODEL_PATH=./models/$MODEL_NAME
LoRA_TARGET=Wqkv #q_proj,v_proj
TEMPLATE=default
PREDICTION_SAMPLES=20
if [ ! -d $MODEL_PATH ]; then
echo "Model not found: $MODEL_PATH"
exit 1
fi
if [ ! -d $SAVE_PATH ]; then
mkdir -p $SAVE_PATH
fi
if [ ! -d $SAVE_PATH_PREDICT ]; then
mkdir -p $SAVE_PATH_PREDICT
fi
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--seed 42 \
--stage $STAGE \
--model_name_or_path $MODEL_PATH \
--dataset $DATA \
--val_size .1 \
--val_max_sample 20 \
--finetuning_type lora \
--do_train \
--lora_target $LoRA_TARGET \
--output_dir $SAVE_PATH \
--overwrite_output_dir \
--overwrite_cache \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs $EPOCH \
--do_eval \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--prediction_loss_only \
--plot_loss \
--quantization_bit 4 \
|& tee $SAVE_PATH/train_eval_log.txt
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage $STAGE \
--model_name_or_path $MODEL_PATH \
--do_predict \
--max_samples $PREDICTION_SAMPLES \
--predict_with_generate \
--dataset $DATA \
--template $TEMPLATE \
--finetuning_type lora \
--adapter_name_or_path $SAVE_PATH \
--output_dir $SAVE_PATH_PREDICT \
--per_device_eval_batch_size 1 \
|& tee $SAVE_PATH_PREDICT/predict_log.txt
base_model = "microsoft/phi-2"
new_model = "phi-2-pa"
dataset = datasets.load_from_disk('wiki_pa_train_dataset')
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token=tokenizer.eos_token
tokenizer.padding_side="right"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
base_model,
quantization_config=bnb_config,
# use_flash_attention_2=True, # Phi does not support yet.
trust_remote_code=True,
flash_attn=True,
flash_rotary=True,
fused_dense=True,
low_cpu_mem_usage=True,
device_map={"": 0},
revision="refs/pr/23",
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
training_arguments = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=32,
evaluation_strategy="steps",
eval_steps=2000,
logging_steps=15,
optim="paged_adamw_8bit",
learning_rate=2e-4,
lr_scheduler_type="cosine",
save_steps=2000,
warmup_ratio=0.05,
weight_decay=0.01,
report_to="tensorboard",
max_steps=-1, # -1 means train for the full num_train_epochs; a positive value stops after that many steps
)
peft_config = LoraConfig(
r=32,
lora_alpha=64,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules= ["Wqkv", "fc1", "fc2" ] # ["Wqkv", "out_proj", "fc1", "fc2" ], - 41M params
# modules_to_save=["embed_tokens","lm_head"]
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset['train'],
eval_dataset=dataset['train'], # no separate evaluation dataset; reusing the training split
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=690,
tokenizer=tokenizer,
args=training_arguments,
)
Hi folks, here is my fine-tuning result produced with LLaMA-Factory:
https://huggingface.co/microsoft/phi-2/discussions/35#65819d07ca21d74c214cb3f6
@Navanit-shorthills true, I'm also having the same issue
@pbatra if you find any answer kindly reply in this thread.
Excellent results! @navanit thank you for confirming the correct target_modules, the model now responds as expected.
Here is an example prompt I gave it: How can advances in artificial intelligence and machine learning contribute to more accurate and timely weather forecasting, and what are the limitations of relying on these technologies for weather predictions?
Can you share the final working code that worked for you?
Hello, is this open-source model already trained? Can it be adapted to Chinese? Thanks, I'm a newbie.
phi-2 has a bug in speaking Chinese; it spits out gibberish.
@cekal, I have a question about your fine-tuning code. In this part:
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=2048,
tokenizer=tokenizer,
args=training_arguments,
packing=False,
)
what content does dataset_text_field='text' correspond to? Is it the prompt?
@zoujiulong
In the fine-tuning script, the dataset_text_field parameter of the SFTTrainer specifies the field name in your dataset that contains the text used for training. This is not necessarily a prompt, but rather the actual textual content that you want the model to learn from.
Your dataset, which the script loads with load_dataset("json", data_files="your_dataset.json", split="train"), is expected to be a collection of records, where each record is a JSON object. dataset_text_field='text' means that the trainer will look for a field named "text" in each JSON object of your dataset. This "text" field should contain the actual textual data.
For example, if you are training a language model and your dataset consists of sentences or paragraphs, each JSON object in your dataset file might look like this:
{ "text": "Here is a sample sentence for the language model to learn." }
In this case, "text" is the key in each JSON object that points to the textual data you want the model to train on. If your dataset uses a different field name to store this text, change the dataset_text_field parameter accordingly to match that field name.
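A quick way to confirm your field name before training (a small sketch, reusing the your_dataset.json placeholder from the script above):
from datasets import load_dataset

ds = load_dataset("json", data_files="your_dataset.json", split="train")
print(ds.column_names)   # should include "text" (or whatever you pass as dataset_text_field)
print(ds[0]["text"])     # inspect one record to make sure the content looks right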
@cekal thank you, I see. I'm a beginner, so one more question: your goal is Q&A. With models such as BertForQuestionAnswering, don't you have to provide both the question and the answer text? Why is only one field used here? Is phi-2 able to learn just from plain text and then answer questions afterwards?
very bad model. fine tune not working properly. :(
[ 12/100 00:08 < 01:17, 1.13 it/s, Epoch 0.00/1]
Step Training Loss
1 0.000000
2 0.000000
3 0.000000
4 0.000000
5 0.000000
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 0.000000
@Imran1 model isn't bad, perhaps your code is. 0 loss is obviously wrong. Mind sharing your fine-tuning script?
You can also try this: https://github.com/brevdev/notebooks/blob/e815947d907460c3ed123d49ac6aeab67a9adf22/phi2-finetune-own-data.ipynb
Could you please re-run with the latest update (FP16)? We updated the modeling_phi.py file and disabled the auto-casting on the Attention layer. This is the same fix as the previous code had.
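For anyone unsure what re-running with FP16 means in practice, reloading the updated model in fp16 looks roughly like this (a sketch, not an official snippet):
import torch
from transformers import AutoModelForCausalLM

# Reload phi-2 with the updated modeling code, keeping weights in float16
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)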
@gugarosa I have performed a full fine-tune of phi-2 on a single RTX A6000, but the loss goes to zero very quickly, after just 10 steps. I have tried the latest transformers==4.37.0. Can you help me with this? Thanks.
My implementation follows https://github.com/brevdev/notebooks/blob/e815947d907460c3ed123d49ac6aeab67a9adf22/phi2-finetune-own-data.ipynb, but I commented out the quantization and LoRA parts for full fine-tuning.
Hi @cekal, I am trying to fine-tune and I am using target_modules = ["Wqkv", "out_proj"] after exploring a few notebooks, but it throws an error that the target modules are not present. I checked the model architecture too, and I see this:
PhiForCausalLM(
(model): PhiModel(
(embed_tokens): Embedding(51200, 2560)
(embed_dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0-31): 32 x PhiDecoderLayer(
(self_attn): PhiAttention(
(q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
(k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
(v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
(dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
(rotary_emb): PhiRotaryEmbedding()
)
(mlp): PhiMLP(
(activation_fn): NewGELUActivation()
(fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
(fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
)
(input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
)
(final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)
Can you please suggest what I am missing? I downloaded the model manually due to some network restrictions in my org.
@sbakhtyar Hi, based on the output you provided, try these:
target_modules = [
"q_proj", # Targeting query projection in PhiAttention
"k_proj", # Targeting key projection in PhiAttention
"v_proj", # Targeting value projection in PhiAttention
"dense", # Targeting the dense layer in PhiAttention for output transformation, not sure if appropriate, comment out if not necessary
"fc1", # Targeting the first fully connected layer in PhiMLP
"fc2", # Targeting the second fully connected layer in PhiMLP
]
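Put together, a minimal LoraConfig with these module names might look like the sketch below (reusing r/alpha values from earlier in the thread; adjust to taste):
from peft import LoraConfig

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],
)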
Let me know how it goes!
Hey,
how do I know which attention layers to choose as target_modules? Right now I am using target_modules=["Wqkv", "fc1", "fc2"] for fine-tuning phi-2. In the LoRA paper the authors stated that they only tried their approach on the attention modules and that more research is needed for the MLP modules. Which target_modules should I choose, and why?
I appreciate all answers :)
Hi @cekal,
Can you please share your requirements.txt file? I am trying to fine-tune this model but I am getting an error from the bitsandbytes package:
Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):
CUDA Setup failed despite GPU being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Thanks,
Sorry, I'm new to fine-tuning LLMs and my question might be too basic:
I have a DataFrame with two columns, "prompt" and "completion". The prompt is a statement and the completion is an argument in favor of that statement. I want to fine-tune phi-2 on it.
I don't know whether I should keep the two columns and give them to the model separately as input and label (and if so, how should I give the label text to SFTTrainer?), or merge the two columns into one complete text column and feed that to the model. If the latter, how exactly should I combine the texts, i.e. what special tokens should I put in the middle?
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType # peft-0.7.1
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
)
from trl import SFTTrainer
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype='float16',
bnb_4bit_use_double_quant=False
)
model_path = "/.../phi-2/"
# load model
model = AutoModelForCausalLM.from_pretrained(
model_path,
quantization_config=bnb_config,
# device_map=device_map,
trust_remote_code=True,
# use_auth_token=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
peft_config = LoraConfig(
r=8,
lora_alpha=8,
target_modules=['q_proj',
'k_proj',
'v_proj',
'dense',
'fc1',
'fc2',
],
bias="none",
lora_dropout=0.05, # Conventional
task_type="CAUSAL_LM",
modules_to_save = ["lm_head", "embed_tokens"] # because we added new tokens
)
# add LoRA adaptor
model = get_peft_model(model, peft_config)
from transformers import DataCollatorForSeq2Seq
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
tokenizer,
model=model,
label_pad_token_id=label_pad_token_id,
pad_to_multiple_of=8
)
from datasets import Dataset, concatenate_datasets
training_arguments = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
optim="paged_adamw_32bit",
save_steps=500, #CHANGE THIS IF YOU WANT IT TO SAVE LESS OFTEN. I WOULDN'T SAVE MORE OFTEN BECAUSE OF SPACE
logging_steps=10,
learning_rate=2e-4,
fp16=False,
bf16=True,
max_grad_norm=.3,
max_steps=10000,
warmup_ratio=.03,
group_by_length=True,
lr_scheduler_type="constant"
)
model.config.use_cache = False
train_dataset_object = Dataset.from_pandas(data_df[['sentence_j',
'sentence_i']].rename({'sentence_j':'prompt',
'sentence_i':'completion'},axis=1)) # here is where I'm unsure what to do
# Create a data collator (note: this overrides the collator built above;
# DataCollatorForSeq2Seq takes label_pad_token_id, not pad_token_id)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=label_pad_token_id, pad_to_multiple_of=8)
# Prepare the dataset (note: SFTTrainer expects a Dataset, not a DataLoader, so this loader is not actually used below)
train_dataset_generator = torch.utils.data.DataLoader(train_dataset_object, batch_size=32, collate_fn=data_collator)
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset_object, # was `dataset`, which is undefined in this script
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=2048,
tokenizer=tokenizer,
args=training_arguments,
packing=False,
)
trainer.train()
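One common way to handle the prompt/completion question above (a sketch, not the author's code; the Instruct:/Output: template is an assumption, since phi-2 has no mandated chat format, so any consistent template plus an EOS token should work):
from datasets import Dataset

# Merge "prompt" and "completion" into the single "text" field that
# SFTTrainer's dataset_text_field expects.
def to_text(example):
    return {
        "text": f"Instruct: {example['prompt']}\nOutput: {example['completion']}{tokenizer.eos_token}"
    }

train_dataset_object = Dataset.from_pandas(
    data_df[['sentence_j', 'sentence_i']].rename(
        columns={'sentence_j': 'prompt', 'sentence_i': 'completion'})
)
train_dataset_object = train_dataset_object.map(to_text)
# Then pass train_dataset=train_dataset_object and dataset_text_field="text" to SFTTrainer.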
I want to train the phi-2 model on a CPU. Are the configs the same?
How can we instruction-fine-tune phi-2 so that it will follow instructions?
@gugarosa I have performed full finetune with phi-2 on a single RTX A6000, but the loss is very quickly going to zero for just 10 steps. I have tried with the latest tranformers==4.37.0. Can you help me this? Thanks.
My implementation is followed: https://github.com/brevdev/notebooks/blob/e815947d907460c3ed123d49ac6aeab67a9adf22/phi2-finetune-own-data.ipynb, but I commented out the quantization and lora parts for full finetuning.
Does generation stop when it should for you?
For me, phi-2 and phi-1.5 always generate until max_length is reached, if defined.
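In case it is a decoding issue rather than a model issue, one thing to check (a sketch, assuming a loaded model and tokenizer) is that generate() is given the EOS/pad token ids so it can stop early:
# If the fine-tuning data never contained the EOS token, the model may simply
# never produce it and will run to the length limit; appending eos_token to each
# training example (as in the sketch above) and passing the ids here can help.
inputs = tokenizer("Instruct: What is gradient checkpointing?\nOutput:", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))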