File size: 8,345 Bytes
d4a1475 c33f334 d4a1475 c33f334 d4a1475 c33f334 d4a1475 9846bc7 d4a1475 c33f334 d4a1475 c33f334 d4a1475 c33f334 d4a1475 c33f334 d4a1475 c33f334 d4a1475 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 |
---
library_name: transformers
tags: []
---
## Model Details
### Model Description
This model is created for answering the KUET(Khulna University of Engineering & Technology) information.
- **Developed by:** Md. Shahidul Salim
- **Model type:** Question answering
- **Language(s) (NLP):** English
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1
## How to Get Started with the Model
```
import transformers
from transformers import AutoTokenizer
model_name="shahidul034/KUET_LLM_Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
pipe = pipeline("text-generation",
model=full_output,
tokenizer= tokenizer,
torch_dtype=torch.bfloat16,
device_map="auto",
max_new_tokens = 512,
do_sample=True,
top_k=30,
num_return_sequences=1,
eos_token_id=tokenizer.eos_token_id
)
from langchain import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline = pipe, model_kwargs = {'temperature':0})
from langchain.llms import HuggingFaceTextGenInference
from langchain.llms import HuggingFaceTextGenInference
from langchain import PromptTemplate
from langchain.schema import StrOutputParser
template = """
<s>[INST] <<SYS>>
{role}
<</SYS>>
{text} [/INST]
"""
prompt = PromptTemplate(
input_variables = [
"role",
"text"
],
template = template,
)
role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
chain = prompt | llm | StrOutputParser()
ques="What is KUET?"
ans=chain.invoke({"role": role,"text":ques})
print(ans)
```
[More Information Needed]
## Training Details
### Training Data
Custom dataset, which is collected from the KUET website.
### Training Procedure
```
import os
import torch
from datasets import load_dataset, Dataset
import pandas as pd
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
import transformers
# from peft import AutoPeftModelForCausalLM
from transformers import GenerationConfig
from pynvml import *
import glob
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
lora_output = 'models/lora_KUET_LLM_Mistral'
full_output = 'models/full_KUET_LLM_Mistral'
DEVICE = 'cuda'
bnb_config = BitsAndBytesConfig(
load_in_8bit= True,
# bnb_4bit_quant_type= "nf4",
# bnb_4bit_compute_dtype= torch.bfloat16,
# bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
base_model,
# load_in_4bit=True,
quantization_config=bnb_config,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
model.config.use_cache = False # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token
### read csv with Prompt, Answer pair
data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx" ## replace here
data_df=pd.read_excel( data_location )
def formatted_text(x):
temp = [
# {"role": "system", "content": "Answer as a medical assistant. Respond concisely."},
{"role": "user", "content": """Answer the question concisely as a medical assisstant.
Question: """ + x["Prompt"]},
{"role": "assistant", "content": x["Reply"]}
]
return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)
### set formatting
data_df["text"] = data_df[["Prompt", "Reply"]].apply(lambda x: formatted_text(x), axis=1) ## replace Prompt and Answer if collected dataset has different column names
print(data_df.iloc[0])
dataset = Dataset.from_pandas(data_df)
# Set PEFT adapter config (16:32)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# target modules are currently selected for zephyr base model
config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj","k_proj","o_proj","gate_proj","up_proj","down_proj"], # target all the linear layers for full finetuning
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM")
# stabilize output layer and layernorms
model = prepare_model_for_kbit_training(model, 8)
# Set PEFT adapter on model (Last step)
model = get_peft_model(model, config)
# Set Hyperparameters
MAXLEN=512
BATCH_SIZE=4
GRAD_ACC=4
OPTIMIZER='paged_adamw_8bit' # save memory
LR=5e-06 # slightly smaller than pretraining lr | and close to LoRA standard
# Set training config
training_config = transformers.TrainingArguments(per_device_train_batch_size=BATCH_SIZE,
gradient_accumulation_steps=GRAD_ACC,
optim=OPTIMIZER,
learning_rate=LR,
fp16=True, # consider compatibility when using bf16
logging_steps=10,
num_train_epochs = 2,
output_dir=lora_output,
remove_unused_columns=True,
)
# Set collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer,mlm=False)
# Setup trainer
trainer = SFTTrainer(model=model,
train_dataset=dataset,
data_collator=data_collator,
args=training_config,
dataset_text_field="text",
# callbacks=[early_stop], need to learn, lora easily overfits
)
trainer.train()
trainer.save_model(lora_output)
# Get peft config
from peft import PeftConfig
config = PeftConfig.from_pretrained(lora_output)
# Get base model
model = transformers.AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(base_model)
# Load the Lora model
from peft import PeftModel
model = PeftModel.from_pretrained(model, lora_output)
# Get tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)
merged_model = model.merge_and_unload()
merged_model.save_pretrained(full_output)
tokenizer.save_pretrained(full_output)
```
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
194 questions are generated by students.
[More Information Needed]
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hours used:** 2 hours
#### Hardware
RTX 4090
|