|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model was created to answer questions about KUET (Khulna University of Engineering & Technology).
|
|
|
- **Developed by:** Md. Shahidul Salim |
|
- **Model type:** Question answering |
|
- **Language(s) (NLP):** English |
|
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1 |
|
|
|
|
|
## How to Get Started with the Model |
|
```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "shahidul034/KUET_LLM_Mistral"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    top_k=30,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

from langchain import HuggingFacePipeline, PromptTemplate
from langchain.schema import StrOutputParser

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0})

template = """
<s>[INST] <<SYS>>
{role}
<</SYS>>
{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["role", "text"],
    template=template,
)

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
chain = prompt | llm | StrOutputParser()

ques = "What is KUET?"
ans = chain.invoke({"role": role, "text": ques})
print(ans)
```
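If GPU memory is limited, the merged weights can also be loaded with 8-bit quantization via `bitsandbytes`, as the training script below does for the base model. This is an optional sketch, not part of the original usage example, and assumes `bitsandbytes` is installed:

```
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "shahidul034/KUET_LLM_Mistral"

# Load the merged checkpoint in 8-bit to reduce VRAM usage (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```

The resulting `model` and `tokenizer` can be passed to the same `pipeline` call shown above.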
|
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
A custom dataset collected from the KUET website.
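The training script below expects this data as an Excel file with `Prompt` and `Reply` columns. A minimal sketch of the expected layout, with purely illustrative rows and a hypothetical file name rather than actual dataset entries:

```
import pandas as pd

# Illustrative rows only; the real dataset was collected from the KUET website
data_df = pd.DataFrame({
    "Prompt": ["What is KUET?", "Where is KUET located?"],
    "Reply": [
        "KUET is Khulna University of Engineering & Technology, a public engineering university in Bangladesh.",
        "KUET is located in Khulna, Bangladesh.",
    ],
})
data_df.to_excel("kuet_qa_pairs.xlsx", index=False)  # hypothetical file name
```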
|
|
|
### Training Procedure |
|
|
|
```
import torch
import pandas as pd
from datasets import Dataset
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
lora_output = 'models/lora_KUET_LLM_Mistral'
full_output = 'models/full_KUET_LLM_Mistral'

# Load the base model with 8-bit quantization to fit on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
|
|
|
### Read the Excel file with (Prompt, Reply) pairs
data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx"  ## replace here
data_df = pd.read_excel(data_location)

# Wrap each (Prompt, Reply) pair in the tokenizer's chat template
def formatted_text(x):
    temp = [
        # {"role": "system", "content": "Answer as a medical assistant. Respond concisely."},
        {"role": "user", "content": """Answer the question concisely as a medical assistant.
Question: """ + x["Prompt"]},
        {"role": "assistant", "content": x["Reply"]},
    ]
    return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)

### Set formatting
data_df["text"] = data_df[["Prompt", "Reply"]].apply(lambda x: formatted_text(x), axis=1)  ## replace Prompt and Reply if the collected dataset uses different column names
print(data_df.iloc[0])
dataset = Dataset.from_pandas(data_df)
|
# Set PEFT adapter config (r=16, lora_alpha=32)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Target all the linear projection layers of the Mistral architecture
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Stabilize the output layer and layernorms for 8-bit training
model = prepare_model_for_kbit_training(model)
# Set the PEFT adapter on the model (last step)
model = get_peft_model(model, config)
|
# Set hyperparameters
MAXLEN = 512
BATCH_SIZE = 4
GRAD_ACC = 4
OPTIMIZER = 'paged_adamw_8bit'  # saves memory
LR = 5e-06  # slightly smaller than the pretraining lr and close to the LoRA standard

# Set training config
training_config = transformers.TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC,
    optim=OPTIMIZER,
    learning_rate=LR,
    fp16=True,  # consider hardware compatibility before switching to bf16
    logging_steps=10,
    num_train_epochs=2,
    output_dir=lora_output,
    remove_unused_columns=True,
)

# Set collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
|
|
|
# Set up the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=data_collator,
    args=training_config,
    max_seq_length=MAXLEN,
    dataset_text_field="text",
    # callbacks=[early_stop],  # LoRA overfits easily; consider adding early stopping
)

trainer.train()
trainer.save_model(lora_output)
|
|
|
# Get the PEFT config
from peft import PeftConfig, PeftModel
config = PeftConfig.from_pretrained(lora_output)

# Reload the base model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(model, lora_output)
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer
merged_model.save_pretrained(full_output)
tokenizer.save_pretrained(full_output)
```
|
|
|
#### Preprocessing [optional] |
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
The following hyperparameters were used during training (an illustrative mapping is sketched after this list):
|
- learning_rate: 0.0002 |
|
- train_batch_size: 24 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 4 |
|
- total_train_batch_size: 96 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 2 |
|
- mixed_precision_training: Native AMP |
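For reference, a minimal sketch of how these values could map onto `transformers.TrainingArguments`. Note that the inline training script above uses different settings (for example a learning rate of 5e-06 and a per-device batch size of 4), so this mapping is illustrative rather than the exact call that produced the model:

```
import transformers

# Illustrative mapping of the listed hyperparameters, not the exact configuration used above
training_args = transformers.TrainingArguments(
    output_dir="models/lora_KUET_LLM_Mistral",
    learning_rate=2e-4,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,  # 24 x 4 = 96 total train batch size
    lr_scheduler_type="linear",
    num_train_epochs=2,
    fp16=True,  # "Native AMP" mixed precision
)
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the TrainingArguments defaults
```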
|
|
|
#### Speeds, Sizes, Times [optional] |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
[More Information Needed] |
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The test set consists of 194 questions generated by students.
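A minimal sketch of how such a question set could be run through the chain from the usage example above to collect answers for review; the file `kuet_test_questions.csv` and its `question` column are hypothetical names, not the actual evaluation artifacts:

```
import pandas as pd

# Hypothetical CSV holding the 194 student-written questions, one per row in a "question" column
questions = pd.read_csv("kuet_test_questions.csv")["question"].tolist()

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."

# `chain` is the prompt | llm | StrOutputParser() pipeline built in the usage example
answers = [chain.invoke({"role": role, "text": q}) for q in questions]

pd.DataFrame({"question": questions, "answer": answers}).to_csv("kuet_test_answers.csv", index=False)
```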
|
|
|
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
|
|
## Environmental Impact |
|
|
|
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hours used:** 2
|
|
|
|
|
#### Hardware |
|
NVIDIA GeForce RTX 4090
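As a rough back-of-the-envelope estimate, assuming the RTX 4090 runs near its 450 W board power for the full 2 hours and a grid carbon intensity of about 0.4 kgCO2eq/kWh (an assumption; actual intensity depends on the local grid), the footprint is on the order of a few hundred grams of CO2eq:

```
# Rough estimate only; power draw and grid carbon intensity are assumptions
gpu_power_kw = 0.450         # RTX 4090 board power (450 W), assuming full utilization
hours = 2                    # training time reported above
carbon_intensity = 0.4       # kgCO2eq per kWh, assumed grid average

energy_kwh = gpu_power_kw * hours              # 0.9 kWh
emissions_kg = energy_kwh * carbon_intensity   # ~0.36 kgCO2eq
print(f"{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kgCO2eq")
```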
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|