Paper: https://arxiv.org/pdf/2410.00847
Model: URM-LLaMa-3-8B
- Fine-tuned from FsfairX-LLaMA3-RM-v0.1

Brief

URM-LLaMa-3-8B is an uncertainty-aware reward model (RM). It consists of a base model and an uncertainty-aware, attribute-specific value head. The base model is FsfairX-LLaMA3-RM-v0.1.

Attribute Regression

Dataset: HelpSteer2

During training, the uncertainty-aware value head does not output multi-attribute scores directly. Instead, it outputs the parameters of a normal distribution, from which the scores are sampled. The value head is then trained by regressing the sampled scores against the labels. Because sampling on its own is not differentiable, the reparameterization trick is used so that gradients can back-propagate through the sampling step.
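The reparameterization step can be illustrated with a small plain-Python sketch. It is not the training code for this model: the log-standard-deviation parameterization, the squared-error loss, and the names `sample_score` and `mse_grads` are assumptions made for illustration. The point is only that writing a sample as `mu + sigma * eps` makes it a deterministic function of the head outputs, so the chain rule applies.

```python
import math
import random


def sample_score(mu, log_sigma, eps):
    """Reparameterized sample: score = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return mu + math.exp(log_sigma) * eps


def mse_grads(mu, log_sigma, eps, label):
    """Gradients of (score - label)^2 with respect to mu and log_sigma."""
    sigma = math.exp(log_sigma)
    score = mu + sigma * eps
    dloss_dscore = 2.0 * (score - label)
    # Chain rule: d(score)/d(mu) = 1 and d(score)/d(log_sigma) = sigma * eps.
    return dloss_dscore, dloss_dscore * sigma * eps


rng = random.Random(0)                 # fixed seed so the run is repeatable
mu, log_sigma, label = 2.0, -1.0, 3.0  # hypothetical head outputs and label
eps = rng.gauss(0.0, 1.0)              # noise is sampled independently of the parameters
score = sample_score(mu, log_sigma, eps)
grad_mu, grad_log_sigma = mse_grads(mu, log_sigma, eps, label)
print(score, grad_mu, grad_log_sigma)
```

In a framework such as PyTorch the same effect is obtained by building the sample from the head outputs and a separately drawn noise tensor, after which autograd computes these gradients automatically.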

We use the five attributes from HelpSteer2: Helpfulness, Correctness, Coherence, Complexity, and Verbosity. The attribute scores are combined into a single reward by a weighted sum, using the prior weights [0.3, 0.74, 0.46, 0.47, -0.33] recommended by Nemotron-4.
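As a rough illustration of the weighted sum, the sketch below combines five attribute scores with the weights listed above. It assumes the weights follow the same order as the attribute list; the function name `combine_attributes` and the example scores are hypothetical and are not model outputs.

```python
ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
# Assumption: weights are given in the same order as the attributes.
WEIGHTS = [0.3, 0.74, 0.46, 0.47, -0.33]


def combine_attributes(scores):
    """Return the weighted sum of per-attribute scores."""
    if len(scores) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} scores, got {len(scores)}")
    return sum(w * s for w, s in zip(WEIGHTS, scores))


example = [4.0, 4.0, 4.0, 2.0, 2.0]  # hypothetical attribute scores
print(combine_attributes(example))
```

The negative weight on Verbosity means that, with the other attributes held equal, a more verbose response receives a lower combined reward.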

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "LxzGordon/URM-LLaMa-3-8B"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map='auto',
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "when were the first Olympic Games held?"
response1 = "April 1896"
response2 = "April 1892"

resp1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
resp2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

# Format and tokenize the conversations
resp1 = tokenizer.apply_chat_template(resp1, tokenize=False)
resp2 = tokenizer.apply_chat_template(resp2, tokenize=False)
resp1 = tokenizer(resp1, return_tensors="pt").to(model.device)
resp2 = tokenizer(resp2, return_tensors="pt").to(model.device)

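# Score each conversation without tracking gradients
with torch.no_grad():
    # The reward score is read from the first logit of each output
    score1 = model(resp1['input_ids'], attention_mask=resp1['attention_mask']).logits[0][0].item()
    score2 = model(resp2['input_ids'], attention_mask=resp2['attention_mask']).logits[0][0].item()
print(score1, score2)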

# Response 1 score: 3.669522523880005, Response 2 score: 2.5036821365356445
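Once scores are available, a common next step is to keep the response with the higher reward. The helper below is a minimal sketch of that selection in plain Python; `pick_preferred` is a hypothetical name, and the scores are the two values printed in the example above.

```python
def pick_preferred(responses, scores):
    """Return the (response, score) pair with the highest reward score."""
    if not responses or len(responses) != len(scores):
        raise ValueError("responses and scores must be non-empty and the same length")
    best = max(range(len(scores)), key=scores.__getitem__)
    return responses[best], scores[best]


responses = ["April 1896", "April 1892"]
scores = [3.669522523880005, 2.5036821365356445]  # values printed in the example above
best_response, best_score = pick_preferred(responses, scores)
print(best_response, best_score)
```

The same helper works for more than two candidates, for example when selecting the best of several sampled responses.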

Reference

Please cite:

@article{lou2024uncertainty,
  title={Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown},
  author={Lou, Xingzhou and Yan, Dong and Shen, Wei and Yan, Yuzi and Xie, Jian and Zhang, Junge},
  journal={arXiv preprint arXiv:2410.00847},
  year={2024}
}
