Mistral-7B-text-to-RLHF

This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.1 on the Anthropic/hh-rlhf dataset. It achieves the following results on the evaluation set:

Loss: 0.7952

Model description

Human-in-the-Loop Fine-tuning of Mistral-7B for Enhanced Text Generation and Text-to-SQL

Training data

Full Code - Fine-Tuning with Supervised Fine-Tuning (SFT) GITHUB

Evaluation data

Human-in-the-Loop Fine-tuning of Mistral-7B for Enhanced Text Generation and Text-to-SQL

Full Code GITHUB


import torch
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig

# Initialize the accelerator
accelerator = Accelerator()

# Model checkpoint from my Hugging Face repository
model_id = 'frankmorales2020/Mistral-7B-text-to-RLHF'

# BitsAndBytesConfig 4-bit (NF4) config, if used for your reward model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the reward model and tokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.padding_side = "right"

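# Fall back to EOS if the tokenizer has no pad token, so padding=True works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id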

# Test cases
test_cases = [
    ("What is the capital of France?", "Paris", "London"),
    ("Who painted the Mona Lisa?", "Leonardo da Vinci", "Michelangelo"),
    ("What is the largest planet in our solar system?", "Jupiter", "Mars"),
    ("What would you do if you saw someone drop their wallet?", "Pick it up and return it to them.", "Ignore it."),
    ("What color is the sky?", "Blue", "Green"),
    ("What is the chemical symbol for water?", "H2O", "CO2"),
    # Add more test cases here...
]

def evaluate_example(prompt, chosen, rejected):
    """Score both answers and return True if the chosen one scores higher."""
    inputs = tokenizer(
        [f"{prompt} {chosen}", f"{prompt} {rejected}"],
        return_tensors="pt",
        padding=True,
    ).to(accelerator.device)
    # Inference only: no gradients needed
    with torch.no_grad():
        outputs = model(**inputs)
    chosen_score = outputs.logits[0].item()
    rejected_score = outputs.logits[1].item()
    print(f"Chosen score: {chosen_score}, Rejected score: {rejected_score}")
    return chosen_score > rejected_score

correct_predictions = 0
total_reciprocal_rank = 0.0

for prompt, chosen, rejected in test_cases:
    print("\n")
    print(f"Prompt: {prompt}, Chosen: {chosen}, Rejected: {rejected}")
    print("\n")
    if evaluate_example(prompt, chosen, rejected):
        print("Test passed!")
        correct_predictions += 1
        total_reciprocal_rank += 1.0  # Chosen answer ranked first
    else:
        print("Test failed.")
        total_reciprocal_rank += 0.5  # Chosen answer ranked second of two candidates

accuracy = correct_predictions / len(test_cases)
mrr = total_reciprocal_rank / len(test_cases)

print(f"\nOverall accuracy: {accuracy:.2f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.2f}")
print(f"Number of questions used for MRR calculation: {len(test_cases)}")


Prompt: What is the capital of France?, Chosen: Paris, Rejected: London


Chosen score: 3.890625, Rejected score: -15.375
Test passed!


Prompt: Who painted the Mona Lisa?, Chosen: Leonardo da Vinci, Rejected: Michelangelo


Chosen score: 6.0625, Rejected score: 4.1875
Test passed!


Prompt: What is the largest planet in our solar system?, Chosen: Jupiter, Rejected: Mars


Chosen score: 10.6875, Rejected score: 10.0625
Test passed!


Prompt: What would you do if you saw someone drop their wallet?, Chosen: Pick it up and return it to them., Rejected: Ignore it.


Chosen score: 3.140625, Rejected score: 0.13671875
Test passed!


Prompt: What color is the sky?, Chosen: Blue, Rejected: Green


Chosen score: 11.0625, Rejected score: 4.46875
Test passed!


Prompt: What is the chemical symbol for water?, Chosen: H2O, Rejected: CO2


Chosen score: 0.42578125, Rejected score: -0.68359375
Test passed!

Overall accuracy: 1.00
Mean Reciprocal Rank (MRR): 1.00
Number of questions used for MRR calculation: 6
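
The printed scores can also be read as margins between the chosen and rejected answers. The sketch below recomputes accuracy and MRR from the scores listed above, using the textbook reciprocal rank for a two-candidate ranking (1 for a hit, 1/2 for a miss). Turning a margin into a preference probability with a sigmoid is a Bradley-Terry assumption that is common for reward models but is not stated in this card; `preference_probability` and `reciprocal_rank` are illustrative helper names.

```python
import math

# (chosen_score, rejected_score) pairs copied from the printed output above.
scores = [
    (3.890625, -15.375),       # capital of France
    (6.0625, 4.1875),          # Mona Lisa
    (10.6875, 10.0625),        # largest planet
    (3.140625, 0.13671875),    # dropped wallet
    (11.0625, 4.46875),        # color of the sky
    (0.42578125, -0.68359375), # chemical symbol for water
]

def preference_probability(chosen, rejected):
    """Sigmoid of the score margin (Bradley-Terry assumption, not from the card)."""
    return 1.0 / (1.0 + math.exp(-(chosen - rejected)))

def reciprocal_rank(chosen, rejected):
    """1.0 if the chosen answer ranks first of two candidates, otherwise 0.5."""
    return 1.0 if chosen > rejected else 0.5

accuracy = sum(c > r for c, r in scores) / len(scores)
mrr = sum(reciprocal_rank(c, r) for c, r in scores) / len(scores)
smallest_margin = min(c - r for c, r in scores)

for c, r in scores:
    print(f"margin={c - r:8.4f}  P(chosen preferred)={preference_probability(c, r):.3f}")
print(f"accuracy={accuracy:.2f}  mrr={mrr:.2f}  smallest margin={smallest_margin:.3f}")
```

The smallest margin belongs to the Jupiter/Mars pair, so that test passes by the narrowest gap of the six even though all six pass.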

Training procedure

https://github.com/frank-morales2020/MLxDL/blob/main/FineTuning_LLM_Mistral_7B_Instruct_v0_1_for_text_to_RLHF.ipynb

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 3
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 6
optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.03
num_epochs: 3
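
How these values relate can be sketched with a little arithmetic. The snippet assumes a single training device, which the card does not state; the 507 steps per epoch figure is taken from the training results reported in this card. The examples-per-epoch number is an upper bound, since the last batch of an epoch may be partial, and the warmup figure only matters if the scheduler honors a warmup ratio (a plain constant schedule typically does not).

```python
import math

# Values copied from the hyperparameter list and the reported step counts.
train_batch_size = 3
gradient_accumulation_steps = 2
num_devices = 1          # assumption: one GPU; not stated in the card
steps_per_epoch = 507
num_epochs = 3
warmup_ratio = 0.03

def effective_batch_size(per_device, accumulation, devices):
    """Number of examples consumed per optimizer step."""
    return per_device * accumulation * devices

total_train_batch_size = effective_batch_size(
    train_batch_size, gradient_accumulation_steps, num_devices
)
total_steps = steps_per_epoch * num_epochs
max_examples_per_epoch = steps_per_epoch * total_train_batch_size
warmup_steps_if_applied = math.ceil(warmup_ratio * total_steps)

print(f"total_train_batch_size: {total_train_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"examples per epoch (upper bound): {max_examples_per_epoch}")
print(f"warmup steps, if the scheduler applies them: {warmup_steps_if_applied}")
```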

Training results

Training Loss    Epoch    Step    Validation Loss
1.7876           1.0      507     0.9024
1.0272           2.0      1014    0.7952
0.638            3.0      1521    0.8579
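
As a reading aid, the short sketch below picks the row with the lowest validation loss from the table above. It only re-reads the reported numbers; whether the published weights correspond to that checkpoint is not stated in this card. The `EpochResult` tuple is an illustrative name.

```python
from collections import namedtuple

EpochResult = namedtuple("EpochResult", "train_loss epoch step val_loss")

# Rows copied from the training results table above.
results = [
    EpochResult(1.7876, 1, 507, 0.9024),
    EpochResult(1.0272, 2, 1014, 0.7952),
    EpochResult(0.638, 3, 1521, 0.8579),
]

# Lowest validation loss identifies the best-generalizing epoch in this run.
best = min(results, key=lambda row: row.val_loss)
last = results[-1]

# Training loss still falling while validation loss rises hints at overfitting.
overfitting_hint = last.train_loss < best.train_loss and last.val_loss > best.val_loss

print(f"best epoch: {best.epoch} (step {best.step}, val_loss {best.val_loss})")
print(f"validation loss rose after the best epoch: {overfitting_hint}")
```

The lowest validation loss, 0.7952 at epoch 2, matches the evaluation loss quoted at the top of this card.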

Framework versions

PEFT 0.13.2
Transformers 4.46.1
Pytorch 2.5.0+cu121
Datasets 3.0.2
Tokenizers 0.20.1
