---
library_name: transformers
tags: []
---
## Model Details |
### Model Description |
This model was created to answer questions about KUET (Khulna University of Engineering & Technology).
- **Developed by:** Md. Shahidul Salim |
- **Model type:** Question answering |
- **Language(s) (NLP):** English |
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1 |
## How to Get Started with the Model |
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Import paths follow the original snippet; newer LangChain releases expose
# these classes under langchain_community and langchain_core instead.
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.schema import StrOutputParser

model_name = "shahidul034/KUET_LLM_Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Wrap the loaded model in a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    top_k=30,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

# Expose the pipeline to LangChain as an LLM.
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0})

# Prompt template with a system role and the user question.
template = """
<s>[INST] <<SYS>>
{role}
<</SYS>>
{text} [/INST]
"""
prompt = PromptTemplate(
    input_variables=["role", "text"],
    template=template,
)

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
chain = prompt | llm | StrOutputParser()

ques = "What is KUET?"
ans = chain.invoke({"role": role, "text": ques})
print(ans)
```
## Training Details |
### Training Data |
A custom dataset of question and answer pairs collected from the KUET website.
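The training script in the next section reads one question and answer pair per row from the `Prompt` and `Reply` columns. As a minimal sketch of a sanity check on such rows before training, the helper below keeps only rows with non-empty text in both columns. The helper name `validate_rows` and the sample rows are hypothetical placeholders, not part of the original pipeline or dataset.

```python
REQUIRED_COLUMNS = ("Prompt", "Reply")  # column names used by the training script


def validate_rows(rows):
    """Keep rows that have non-empty string values in every required column."""
    valid = []
    for row in rows:
        if all(
            isinstance(row.get(col), str) and row[col].strip()
            for col in REQUIRED_COLUMNS
        ):
            valid.append(row)
    return valid


# Placeholder rows for illustration only; the real dataset is not shown in this card.
sample_rows = [
    {"Prompt": "What is KUET?", "Reply": "<answer text>"},
    {"Prompt": "<question without an answer>", "Reply": ""},
    {"Prompt": "<question with a missing Reply column>"},
]

clean_rows = validate_rows(sample_rows)
print(len(clean_rows))
```

Rows that fail the check are dropped rather than repaired, which keeps empty targets out of the fine-tuning data.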
### Training Procedure |
```
import pandas as pd
import torch
import transformers
from datasets import Dataset
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
lora_output = "models/lora_KUET_LLM_Mistral"
full_output = "models/full_KUET_LLM_Mistral"

# Load the base model in 8-bit to save GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

# Read the Excel file with Prompt, Reply pairs.
data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx"  # replace here
data_df = pd.read_excel(data_location)


def formatted_text(x):
    """Render one Prompt/Reply row with the tokenizer's chat template."""
    # The prompt wording is kept exactly as in the original script.
    temp = [
        {
            "role": "user",
            "content": "Answer the question concisely as a medical assisstant.\nQuestion: "
            + x["Prompt"],
        },
        {"role": "assistant", "content": x["Reply"]},
    ]
    return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)


# Build the training text; replace Prompt and Reply if the dataset uses different column names.
data_df["text"] = data_df[["Prompt", "Reply"]].apply(formatted_text, axis=1)
print(data_df.iloc[0])
dataset = Dataset.from_pandas(data_df)

# Set PEFT adapter config (r=16, alpha=32).
# Target modules were originally selected for the Zephyr base model.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Target all the linear layers.
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Stabilize output layer and layernorms.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# Set PEFT adapter on model (last step).
model = get_peft_model(model, config)

# Set hyperparameters.
MAXLEN = 512  # maximum sequence length (not passed to the trainer below)
BATCH_SIZE = 24  # not defined in the original script; assumed from the reported train_batch_size
GRAD_ACC = 4  # not defined in the original script; assumed from the reported gradient_accumulation_steps
OPTIMIZER = "paged_adamw_8bit"  # save memory
LR = 5e-06  # slightly smaller than pretraining lr and close to LoRA standard

# Set training config.
training_config = transformers.TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC,
    optim=OPTIMIZER,
    learning_rate=LR,
    fp16=True,  # consider compatibility when using bf16
    logging_steps=10,
    num_train_epochs=2,
    output_dir=lora_output,
    remove_unused_columns=True,
)

# Set collator.
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Set up trainer.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=data_collator,
    args=training_config,
    dataset_text_field="text",
    # callbacks=[early_stop],  # not used yet; LoRA easily overfits
)
trainer.train()
trainer.save_model(lora_output)

# Reload the base model and tokenizer named in the saved PEFT config.
config = PeftConfig.from_pretrained(lora_output)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter, merge it into the base weights, and save the full model.
model = PeftModel.from_pretrained(model, lora_output)
merged_model = model.merge_and_unload()
merged_model.save_pretrained(full_output)
tokenizer.save_pretrained(full_output)
```
#### Preprocessing [optional] |
[More Information Needed] |
#### Training Hyperparameters |
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP
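The reported `total_train_batch_size` follows from the per-device batch size and the gradient accumulation steps. The short sketch below reproduces that arithmetic; the function name is illustrative, and the single-device count is an assumption based on the one GPU listed under Hardware.

```python
def total_train_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices


# train_batch_size=24 and gradient_accumulation_steps=4 come from the list above;
# num_devices=1 is an assumption (a single GPU).
effective = total_train_batch_size(24, 4, num_devices=1)
print(effective)  # 96
```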
#### Speeds, Sizes, Times [optional] |
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
[More Information Needed] |
## Evaluation |
<!-- This section describes the evaluation protocols and provides the results. --> |
### Testing Data, Factors & Metrics |
#### Testing Data |
The test set consists of 194 questions written by students.
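This card does not state which metric was used to score the answers. As one possible approach, assuming a reference answer is available for each test question, the sketch below computes a token-overlap F1 score of the kind commonly used for question answering. The function `token_f1` and the placeholder strings are illustrative and do not describe the author's evaluation.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared by both answers, respecting multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder pairs; in practice each prediction would come from the model
# and each reference from the test set.
pairs = [
    ("<model answer one>", "<reference answer one>"),
    ("<model answer two>", "<reference answer two>"),
]
mean_f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
print(round(mean_f1, 3))
```

Token overlap rewards partially correct answers, but it does not check factual accuracy, so a manual review of a sample of answers would still be useful.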
#### Factors |
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
[More Information Needed] |
#### Metrics |
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
[More Information Needed] |
### Results |
[More Information Needed] |
## Environmental Impact |
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
- **Hours used:** 2 hours |
#### Hardware |
RTX 4090 |