---
library_name: transformers
tags: []
---

## Model Details

### Model Description

This model was created to answer questions about KUET (Khulna University of Engineering & Technology).

- **Developed by:** Md. Shahidul Salim
- **Model type:** Question answering
- **Language(s) (NLP):** English
- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1

## How to Get Started with the Model

```
import torch
import transformers
from transformers import AutoTokenizer, pipeline

model_name = "shahidul034/KUET_LLM_Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    top_k=30,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

from langchain import HuggingFacePipeline, PromptTemplate
from langchain.schema import StrOutputParser

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})

template = """
[INST] <<SYS>>
{role}
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["role", "text"],
    template=template,
)

role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
chain = prompt | llm | StrOutputParser()

ques = "What is KUET?"
ans = chain.invoke({"role": role, "text": ques})
print(ans)
```
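If you prefer to skip LangChain, the same `pipe` object can be called directly with a prompt written in the `[INST]`/`<<SYS>>` format shown above. The following is a minimal sketch that reuses `pipe` from the snippet above; the question and the `return_full_text` setting are illustrative, not part of the original recipe.

```
# Minimal sketch (assumes `pipe` from the snippet above already exists).
role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
question = "What is KUET?"  # illustrative question

# Build the same instruction-style prompt used by the LangChain template.
prompt_text = f"[INST] <<SYS>>\n{role}\n<</SYS>>\n\n{question} [/INST]"

# return_full_text=False keeps only the newly generated answer.
result = pipe(prompt_text, return_full_text=False)
print(result[0]["generated_text"].strip())
```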
## Training Details

### Training Data

A custom dataset collected from the KUET website.

### Training Procedure

```
import os

import pandas as pd
import torch
import transformers
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig
from trl import SFTTrainer
# from peft import AutoPeftModelForCausalLM

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
lora_output = 'models/lora_KUET_LLM_Mistral'
full_output = 'models/full_KUET_LLM_Mistral'

DEVICE = 'cuda'

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    # load_in_4bit=True,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.use_cache = False  # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

### read the xlsx file with Prompt, Reply pairs
data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx"  ## replace here
data_df = pd.read_excel(data_location)

def formatted_text(x):
    temp = [
        {"role": "user", "content": """Answer the question concisely as a KUET assistant.
Question: """ + x["Prompt"]},
        {"role": "assistant", "content": x["Reply"]}
    ]
    return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)

### set formatting
## replace Prompt and Reply if the collected dataset has different column names
data_df["text"] = data_df[["Prompt", "Reply"]].apply(lambda x: formatted_text(x), axis=1)
print(data_df.iloc[0])
dataset = Dataset.from_pandas(data_df)

# Set PEFT adapter config (16:32)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # target all the linear layers for full finetuning
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# stabilize output layer and layernorms
model = prepare_model_for_kbit_training(model, 8)
# Set PEFT adapter on model (last step)
model = get_peft_model(model, config)

# Set hyperparameters
MAXLEN = 512
BATCH_SIZE = 4
GRAD_ACC = 4
OPTIMIZER = 'paged_adamw_8bit'  # save memory
LR = 5e-06  # slightly smaller than pretraining lr | and close to LoRA standard

# Set training config
training_config = transformers.TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACC,
    optim=OPTIMIZER,
    learning_rate=LR,
    fp16=True,  # consider compatibility when using bf16
    logging_steps=10,
    num_train_epochs=2,
    output_dir=lora_output,
    remove_unused_columns=True,
)

# Set collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Setup trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=data_collator,
    args=training_config,
    dataset_text_field="text",
    # callbacks=[early_stop], need to learn, lora easily overfits
)

trainer.train()
trainer.save_model(lora_output)

# Get peft config
from peft import PeftConfig
config = PeftConfig.from_pretrained(lora_output)

# Get base model
model = transformers.AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)

# Load the LoRA model
from peft import PeftModel
model = PeftModel.from_pretrained(model, lora_output)

# Get tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Merge the adapter into the base model and save the full model
merged_model = model.merge_and_unload()
merged_model.save_pretrained(full_output)
tokenizer.save_pretrained(full_output)
```

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

194 questions generated by students.

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hours used:** 2 hours

#### Hardware

RTX 4090
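Emissions can also be measured directly while training rather than estimated afterwards. Below is a minimal sketch using the codecarbon library (a different tool than the calculator linked above, and not used to produce the figures in this card); it wraps the `trainer.train()` call from the training script.

```
# Minimal sketch with codecarbon (assumes `trainer` from the training script exists).
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="KUET_LLM_Mistral")  # illustrative project name
tracker.start()
trainer.train()                # the fine-tuning step from the training script
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent for the run
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```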