---
library_name: transformers
base_model: meta-llama/Llama-3.1-70B-Instruct
datasets:
- infly/INF-ORM-Preference-Magnitude-80K
pipeline_tag: text-classification
---
# INF Outcome Reward Model
## Introduction
[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K).
**Note: Training details are coming soon!**
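While the training details have not yet been released, outcome reward models of this kind are commonly optimized with a Bradley-Terry style pairwise ranking loss over chosen/rejected responses. The sketch below illustrates that generic objective only; the exact loss, any margin or magnitude weighting, and the hyperparameters used for INF-ORM-Llama3.1-70B are assumptions until the training details are published.
```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Generic Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Illustrative sketch of a common reward-model objective,
    not the confirmed INF-ORM training loss.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy scalar rewards for a batch of preference pairs
chosen = torch.tensor([2.1, 0.7, 1.5])
rejected = torch.tensor([0.3, 0.9, -0.2])
print(pairwise_ranking_loss(chosen, rejected))  # smaller when chosen rewards exceed rejected ones
```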
## RewardBench Leaderboard
We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench) locally. As of December 2024, INF-ORM-Llama3.1-70B ranks first on the RewardBench leaderboard.
| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
| :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
| 1 | **infly/INF-ORM-Llama3.1-70B** | Custom Classifier | 95.2 | 96.9 | 91.0 | 93.8 | 99.1 |
| 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
| 5 | SF-Foundation/TextEval-Llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
| 6 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
| 7 | Skywork/Skywork-Critic-Llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
| 8 | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
| 9 | nicolinho/QRM-Llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
| 10 | LxzGordon/URM-LLaMa-3.1-8B | Seq. Classifier | 92.9 | 95.5 | 88.2 | 91.1 | 97.0 |
## Demo Code
We provide example usage of INF-ORM-Llama3.1-70B below, showing how to obtain the reward scores of two conversations.
```python
from typing import List, Optional, Union
import torch
import torch.nn as nn
from transformers import LlamaPreTrainedModel, LlamaModel, PreTrainedTokenizerFast
from transformers.modeling_outputs import SequenceClassifierOutputWithPast
class INFORMForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        # Two-layer scoring head mapping the final hidden state to a scalar reward
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels)
        )
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        # Pool the logit at the last non-padding token of each sequence
        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
# Load model
model_name = "infly/INF-ORM-Llama3.1-70B"
orm = INFORMForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    num_labels=1,
)
# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
conv1 = [ { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?", "role": "user" }, { "content": "Tom has 20 oranges. He gave 5 oranges to Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among himself and his 4 friends (a total of 5 people). 18 ÷ 5 = 3.6 oranges. Each person gets 3.6 oranges.", "role": "assistant" } ]
conv2 = [ { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?", "role": "user" }, { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among his 4 friends (a total of 4 people). 18 ÷ 4 = 4.5 oranges. Each person gets 4.5 oranges.", "role": "assistant" } ]
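# conv1 answers correctly (18 oranges shared among 5 people = 3.6 each), while conv2
# divides among only 4 people and gets 4.5, so the reward model should score conv1 higher.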
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to("cuda")
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to("cuda")
# Inference
with torch.no_grad():
    score1 = orm(conv1_tokenized).logits[0][0].item()
    score2 = orm(conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")
# Output:
# Score for response 1: 4.96875
# Score for response 2: 2.890625
```
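The outputs are unnormalized scalar rewards, where higher means better. If a pairwise preference probability is needed, one common Bradley-Terry style post-processing step is to apply a sigmoid to the score difference. The snippet below is an illustrative convenience built on `score1` and `score2` from the example above, not part of the official usage:
```python
import torch

# Hedged sketch: turn the two scalar rewards into P(response 1 preferred over response 2)
prob_1_over_2 = torch.sigmoid(torch.tensor(score1 - score2)).item()
print(f"P(response 1 preferred): {prob_1_over_2:.3f}")
```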
## Declaration and License Agreement
### Declaration
### License Agreement
## Contact
If you have any questions, please feel free to reach us at <23210720070@m.fudan.edu.cn>.
## Citation