---
library_name: transformers
tags: []
---
## Model Details
### Model Description
This model was fine-tuned to answer questions about KUET (Khulna University of Engineering & Technology).
- **Developed by:** Md. Shahidul Salim
- **Model type:** Question answering
- **Language(s) (NLP):** English
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1
## How to Get Started with the Model
```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "shahidul034/KUET_LLM_Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build a text-generation pipeline around the fine-tuned model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=512,
    do_sample=True,
    top_k=30,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

# Wrap the pipeline for LangChain
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.schema import StrOutputParser

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0})

template = """
<s>[INST] <<SYS>>
{role}
<</SYS>>
{text} [/INST]
"""
prompt = PromptTemplate(
    input_variables=["role", "text"],
    template=template,
)

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
chain = prompt | llm | StrOutputParser()

ques = "What is KUET?"
ans = chain.invoke({"role": role, "text": ques})
print(ans)
```
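If the full-precision checkpoint does not fit in GPU memory, loading in 8-bit is one option. The following is a minimal sketch, not part of the original instructions; it assumes the `bitsandbytes` package is installed and mirrors the 8-bit configuration used during training below:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "shahidul034/KUET_LLM_Mistral"

# Assumption: bitsandbytes is installed; load the merged weights in 8-bit
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

The resulting `model` can then be passed to the `pipeline(...)` call above in place of the full-precision model.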
## Training Details
### Training Data
A custom dataset collected from the KUET website.
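The training script below reads the dataset from an Excel file with `Prompt` and `Reply` columns. As an illustration only (the file name and rows here are hypothetical), the expected layout looks like this:

```
import pandas as pd

# Hypothetical rows illustrating the expected Prompt / Reply columns;
# the real dataset is collected from the KUET website.
data_df = pd.DataFrame(
    {
        "Prompt": [
            "What is KUET?",
            "Where is KUET located?",
        ],
        "Reply": [
            "KUET is Khulna University of Engineering & Technology, a public engineering university.",
            "KUET is located in Khulna, Bangladesh.",
        ],
    }
)
data_df.to_excel("kuet_qa_dataset.xlsx", index=False)  # hypothetical file name
```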
### Training Procedure
```
import torch
import pandas as pd
import transformers
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
lora_output = 'models/lora_KUET_LLM_Mistral'
full_output = 'models/full_KUET_LLM_Mistral'
DEVICE = 'cuda'

# Load the base model in 8-bit to fit on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

### Read an Excel file with Prompt, Reply pairs
data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx"  ## replace here
data_df = pd.read_excel(data_location)

def formatted_text(x):
    # Wrap each Prompt/Reply pair in the tokenizer's chat template
    temp = [
        {"role": "user", "content": """Answer the question concisely as a medical assistant.
Question: """ + x["Prompt"]},
        {"role": "assistant", "content": x["Reply"]},
    ]
    return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)

### Set formatting
data_df["text"] = data_df[["Prompt", "Reply"]].apply(lambda x: formatted_text(x), axis=1)  ## replace Prompt and Reply if the collected dataset has different column names
print(data_df.iloc[0])
dataset = Dataset.from_pandas(data_df)

# Set PEFT adapter config (16:32)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # target all the linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Stabilize the output layer and layernorms for 8-bit training
model = prepare_model_for_kbit_training(model)
# Set PEFT adapter on model (last step)
model = get_peft_model(model, config)

# Set hyperparameters
MAXLEN = 512
BATCH_SIZE = 4
GRAD_ACC = 4
OPTIMIZER = 'paged_adamw_8bit'  # save memory
LR = 5e-06  # slightly smaller than the pretraining lr and close to the LoRA standard

# Set training config
training_config = transformers.TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC,
    optim=OPTIMIZER,
    learning_rate=LR,
    fp16=True,  # consider compatibility when using bf16
    logging_steps=10,
    num_train_epochs=2,
    output_dir=lora_output,
    remove_unused_columns=True,
)

# Set collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Set up trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=data_collator,
    args=training_config,
    dataset_text_field="text",
    # callbacks=[early_stop],  # LoRA easily overfits
)
trainer.train()
trainer.save_model(lora_output)

# Reload the base model and merge the LoRA adapter into it
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained(lora_output)
model = transformers.AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, lora_output)
tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)

merged_model = model.merge_and_unload()
merged_model.save_pretrained(full_output)
tokenizer.save_pretrained(full_output)
```
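The merged weights saved to `full_output` are what this repository hosts. As a hedged sketch (the repository id is shown for illustration, and it assumes you are already authenticated, e.g. via `huggingface-cli login`), the merged model and tokenizer can be uploaded with `push_to_hub`:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

full_output = 'models/full_KUET_LLM_Mistral'

# Reload the merged checkpoint from disk
merged_model = AutoModelForCausalLM.from_pretrained(full_output)
tokenizer = AutoTokenizer.from_pretrained(full_output)

# Upload to the Hugging Face Hub (requires prior authentication)
merged_model.push_to_hub("shahidul034/KUET_LLM_Mistral")
tokenizer.push_to_hub("shahidul034/KUET_LLM_Mistral")
```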
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
The test set consists of 194 questions generated by students.
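As a hedged sketch of how such a test run could be reproduced (the questions file name and its `Question` column are hypothetical), the student questions can be passed through the LangChain pipeline built in the usage example above and the answers saved for review:

```
import pandas as pd

# Hypothetical file holding the 194 student-generated questions in a "Question" column
questions_df = pd.read_csv("kuet_test_questions.csv")

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."

# `chain` is the prompt | llm | StrOutputParser() pipeline defined in the usage example above
answers = [chain.invoke({"role": role, "text": q}) for q in questions_df["Question"]]

questions_df["Answer"] = answers
questions_df.to_csv("kuet_test_answers.csv", index=False)
```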
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hours used:** 2
#### Hardware
RTX 4090
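As a rough, hedged estimate only (the 450 W figure is the RTX 4090's rated TDP, and the carbon-intensity value is an assumed placeholder, not measured data), the calculator's formula reduces to power × time × grid carbon intensity:

```
# Hedged back-of-the-envelope estimate following the ML Impact calculator's formula.
# All inputs below are assumptions, not measurements.
gpu_power_kw = 0.45        # RTX 4090 rated TDP (450 W); actual draw during training may be lower
hours = 2                  # training time reported above
carbon_intensity = 0.475   # assumed grid average in kg CO2eq per kWh; replace with the local value

energy_kwh = gpu_power_kw * hours
emissions_kg = energy_kwh * carbon_intensity
print(f"{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kg CO2eq")
```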