---
library_name: transformers
base_model: meta-llama/Llama-3.1-70B-Instruct
datasets:
- infly/INF-ORM-Preference-Magnitude-80K
pipeline_tag: text-classification
---

# INF Outcome Reward Model

## Introduction

[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K) dataset.

**Note: Training details are coming soon!**
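
For a quick look at the training data, the sketch below loads the preference dataset with the `datasets` library and prints its schema and one example. This is only an inspection snippet, not part of any official pipeline; it assumes the dataset exposes a default `train` split and makes no assumptions about its column layout.

```python
from datasets import load_dataset

# Minimal inspection sketch (assumes a default "train" split exists).
# No particular column layout is assumed; just print the schema and one row.
ds = load_dataset("infly/INF-ORM-Preference-Magnitude-80K", split="train")
print(ds.features)  # column names and feature types
print(ds[0])        # one raw preference example
```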

## RewardBench Leaderboard

We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench) locally. As of December 2024, INF-ORM-Llama3.1-70B ranks first on the RewardBench leaderboard.

| Rank | Model                                    | Model Type        | Score | Chat | Chat Hard | Safety | Reasoning |
| :--: | ---------------------------------------- | ----------------- | :---: | :--: | :-------: | :----: | :-------: |
| 1    | **infly/INF-ORM-Llama3.1-70B**           | Custom Classifier | 95.2  | 96.9 | 91.0      | 93.8   | 99.1      |
| 2    | Skywork/Skywork-Reward-Gemma-2-27B-v0.2  | Seq. Classifier   | 94.3  | 96.1 | 89.9      | 93.0   | 98.1      |
| 3    | nvidia/Llama-3.1-Nemotron-70B-Reward     | Custom Classifier | 94.1  | 97.5 | 85.7      | 95.1   | 98.1      |
| 4    | Skywork/Skywork-Reward-Gemma-2-27B       | Seq. Classifier   | 93.8  | 95.8 | 91.4      | 91.9   | 96.1      |
| 5    | SF-Foundation/TextEval-Llama3.1-70B      | Generative        | 93.5  | 94.1 | 90.1      | 93.2   | 96.4      |
| 6    | meta-metrics/MetaMetrics-RM-v1.0         | Custom Classifier | 93.4  | 98.3 | 86.4      | 90.8   | 98.2      |
| 7    | Skywork/Skywork-Critic-Llama-3.1-70B     | Generative        | 93.3  | 96.6 | 87.9      | 93.1   | 95.5      |
| 8    | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier   | 93.1  | 94.7 | 88.4      | 92.7   | 96.7      |
| 9    | nicolinho/QRM-Llama3.1-8B                | Seq. Classifier   | 93.1  | 94.4 | 89.7      | 92.3   | 95.8      |
| 10   | LxzGordon/URM-LLaMa-3.1-8B               | Seq. Classifier   | 92.9  | 95.5 | 88.2      | 91.1   | 97.0      |
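
RewardBench measures, for each category, how often a reward model assigns a higher score to the chosen response than to the rejected one. The sketch below illustrates that pairwise-accuracy computation; it is only an illustration, not the official test script. The `score_conversation` helper and the `prompt`/`chosen`/`rejected` field names are assumptions for this example, and the `model`/`tokenizer` arguments are expected to be created as in the Demo Code section below.

```python
import torch

def score_conversation(model, tokenizer, messages):
    # Hypothetical helper: apply the chat template and return the scalar reward score.
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

def pairwise_accuracy(model, tokenizer, pairs):
    # `pairs` is an iterable of dicts with assumed keys "prompt", "chosen", and "rejected".
    correct, total = 0, 0
    for pair in pairs:
        chosen = [{"role": "user", "content": pair["prompt"]},
                  {"role": "assistant", "content": pair["chosen"]}]
        rejected = [{"role": "user", "content": pair["prompt"]},
                    {"role": "assistant", "content": pair["rejected"]}]
        correct += score_conversation(model, tokenizer, chosen) > score_conversation(model, tokenizer, rejected)
        total += 1
    return correct / total
```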

## Demo Code

Below is an example of using INF-ORM-Llama3.1-70B to obtain the reward scores of two conversations.

```python
from typing import List, Optional

import torch
import torch.nn as nn
from transformers import LlamaPreTrainedModel, LlamaModel, PreTrainedTokenizerFast
from transformers.modeling_outputs import SequenceClassifierOutputWithPast


class INFORMForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        # Value head: a two-layer MLP that maps the final hidden state to a scalar reward.
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels)
        )
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        # Pool the reward from the last non-padding token of each sequence.
        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )


# Load model
model_name = "infly/INF-ORM-Llama3.1-70B"
orm = INFORMForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    num_labels=1,
)

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)

# Two candidate answers to the same math problem: the first divides the 18 oranges among
# 5 people (Tom and his 4 friends), the second incorrectly divides among only the 4 friends.
conv1 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user"
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among himself and his 4 friends (a total of 5 people). 18 ÷ 5 = 3.6 oranges. Each person gets 3.6 oranges.",
        "role": "assistant"
    }
]
conv2 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user"
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among his 4 friends (a total of 4 people). 18 ÷ 4 = 4.5 oranges. Each person gets 4.5 oranges.",
        "role": "assistant"
    }
]

conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to("cuda")
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to("cuda")

# Inference
with torch.no_grad():
    score1 = orm(conv1_tokenized).logits[0][0].item()
    score2 = orm(conv2_tokenized).logits[0][0].item()
    print(f"Score for response 1: {score1}")
    print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 4.96875
# Score for response 2: 2.890625
```
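
The demo produces one scalar score per conversation. A minimal follow-up sketch (not part of the official example) shows how the two scores can be turned into a preference: the higher-scoring response is preferred, and the score gap can optionally be mapped to a Bradley-Terry-style preference probability.

```python
import math

# Assumes score1 and score2 from the demo above.
preferred = 1 if score1 > score2 else 2
p_1_over_2 = 1.0 / (1.0 + math.exp(score2 - score1))  # sigmoid of the score difference
print(f"Preferred response: {preferred}")
print(f"P(response 1 preferred over response 2) ≈ {p_1_over_2:.3f}")
```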

## Declaration and License Agreement

### Declaration

### License Agreement

## Contact

If you have any questions, please feel free to reach us at <23210720070@m.fudan.edu.cn>.

## Citation