---
license: apache-2.0
datasets:
- berkeley-nest/Nectar
language:
- en
---

# Starling-RM-34B

Starling-RM-34B is a reward model trained from [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat). Following the reward-model training method of [the InstructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Yi-34B-Chat and append a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). A response that is more helpful and less harmful receives a higher reward score. Note that because [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar) is based on GPT-4 preferences, the reward model is likely to be biased toward GPT-4's own preferences, including longer responses and certain response formats.

For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!
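The K-wise maximum likelihood objective mentioned above can be illustrated with a minimal sketch. It assumes the estimator maximizes a Plackett-Luce likelihood over the ranked responses to a prompt. The function name `k_wise_nll` and the plain-Python formulation are illustrative assumptions, not the training code used for this model.

```python
import math


def k_wise_nll(rewards_ranked):
    """Negative log-likelihood of one ranking under a Plackett-Luce model.

    rewards_ranked: reward scores for the K responses to a single prompt,
    ordered from most preferred to least preferred (hypothetical input format).
    """
    nll = 0.0
    for i in range(len(rewards_ranked) - 1):
        remaining = rewards_ranked[i:]
        # Numerically stable log-sum-exp over the responses not yet placed.
        m = max(remaining)
        log_z = m + math.log(sum(math.exp(r - m) for r in remaining))
        # Log-probability that response i is picked first among the remaining ones.
        nll -= rewards_ranked[i] - log_z
    return nll
```

With K = 2 this reduces to the usual pairwise logistic loss. Rewards that agree with the ranking give a lower loss than rewards that contradict it, so minimizing this quantity pushes preferred responses toward higher scores.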

- **Developed by:** Banghua Zhu\*, Evan Frick\*, Tianhao Wu\*, Hanlin Zhu, and Jiantao Jiao.
- **Model type:** Reward Model for RLHF
- **License:** Apache-2.0 license under the condition that the dataset is not used to compete with OpenAI
- **Finetuned from model:** [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat)

### Model Sources

- **Blog:** https://starling.cs.berkeley.edu/

## Uses

Please use the following code for inference with the reward model.

```python
import math

import torch
from torch import nn
from transformers import AutoTokenizer, LlamaModel, LlamaPreTrainedModel


# Define the reward model function class

class LlamaForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.transformer = LlamaModel(config)
        # Linear head that maps each token's final hidden state to a scalar
        self.v_head = nn.Linear(config.hidden_size, 1, bias=False)
        self.PAD_ID = 0
        # Initialize weights and apply final processing
        self.post_init()

    def get_device(self):
        return self.transformer.device

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        position_ids=None,
    ):
        transformer_outputs = self.transformer(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_hidden_states=True,
        )
        hidden_states = transformer_outputs.hidden_states[-1]
        scores = []
        rewards = self.v_head(hidden_states).squeeze(-1)
        bs = int(input_ids.shape[0])
        for i in range(bs):
            # Take the reward at the token just before the first PAD token,
            # or at the last token if the sequence contains no padding
            c_inds = (input_ids[i] == self.PAD_ID).nonzero()
            c_ind = c_inds[0].item() if len(c_inds) > 0 else input_ids.shape[1]
            scores.append(rewards[i, c_ind - 1])
        scores = torch.stack(scores)
        return {"scores": scores}


# Load the model and tokenizer

reward_model = LlamaForSequenceClassification.from_pretrained("Nexusflow/Starling-RM-34B", torch_dtype=torch.bfloat16)
reward_tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B-Chat")
reward_tokenizer.truncation_side = "left"

reward_model.eval().requires_grad_(False)


# Define the reward function
# The model is loaded on CPU by default; if you change reward_device,
# also move the model with reward_model.to(reward_device)

reward_device = "cpu"
reward_batch_size = 1


def get_reward(samples):
    """samples: List[str]"""
    encodings_dict = reward_tokenizer(
        samples,
        truncation=True,
        max_length=2048,
        padding="max_length",
        return_tensors="pt",
    ).to(reward_device)
    input_ids = encodings_dict["input_ids"]
    attention_masks = encodings_dict["attention_mask"]
    mbs = reward_batch_size
    out = []
    # Score the samples in mini-batches of size reward_batch_size
    for i in range(math.ceil(len(samples) / mbs)):
        rewards = reward_model(input_ids=input_ids[i * mbs : (i + 1) * mbs], attention_mask=attention_masks[i * mbs : (i + 1) * mbs])
        out.extend(rewards["scores"])
    return torch.hstack(out)


# Inference over test prompts with Yi chat template

test_sample = ["<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\nHi, how can I help you?<|im_end|>"]
reward_for_test_sample = get_reward(test_sample)
print(reward_for_test_sample)
```
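To score several candidate responses to the same prompt, each prompt and response pair has to be wrapped in the same chat markers as `test_sample` above. The helper below is a small sketch of that step. The name `format_yi_chat` is hypothetical, and the sketch covers single-turn conversations only.

```python
def format_yi_chat(prompt, response):
    """Wrap one user prompt and one assistant response in the chat markers
    used by the test sample (single-turn only; hypothetical helper)."""
    return (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>"
    )


# Build one formatted sample per candidate response to the same prompt.
candidates = ["Hi, how can I help you?", "What do you want?"]
samples = [format_yi_chat("Hello!", response) for response in candidates]
```

Passing `samples` to `get_reward` returns one score per candidate, so the responses can be compared or ranked by reward.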

## Metrics

### Accuracy Metrics

| Model                | Human Preference | Truth Preference | Safety Preference | Average   |
|----------------------|------------------|------------------|-------------------|-----------|
| Starling-RM-7B-alpha | 0.762            | 0.684            | 0.767             | 0.738     |
| Starling-RM-34B      | **0.807**        | **0.712**        | **0.782**         | **0.767** |

Starling-RM-34B improves over Starling-RM-7B-alpha on *every metric* we benchmarked. Accuracy is the rate at which the better response receives a higher score than the worse response. When a prompt has more than two responses, the accuracy is the average of the accuracies over all possible pairings.
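The accuracy definition above can be sketched as follows. The function names and the input format (scores listed from best to worst response) are assumptions for illustration. The sketch also assumes that per-prompt accuracies are averaged across prompts, which the description does not state explicitly.

```python
from itertools import combinations


def pairwise_accuracy(rewards_ranked):
    """Fraction of response pairs where the better response scores higher.

    rewards_ranked: reward scores for one prompt's responses, ordered from
    the best response to the worst (hypothetical input format).
    """
    # combinations() keeps input order, so each pair is (better, worse).
    pairs = list(combinations(rewards_ranked, 2))
    correct = sum(1 for better, worse in pairs if better > worse)
    return correct / len(pairs)


def benchmark_accuracy(all_rankings):
    """Average the per-prompt pairwise accuracy over a benchmark."""
    per_prompt = [pairwise_accuracy(ranking) for ranking in all_rankings]
    return sum(per_prompt) / len(per_prompt)
```

For a prompt with exactly two responses, `pairwise_accuracy` is either 0 or 1, matching the plain "better response scores higher" check.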

The Human Preference benchmark is measured with [LMSYS's Chatbot Arena Conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations), where the winning model's response is considered the better response.

The Truth Preference benchmark is measured with the [TruthfulQA](https://huggingface.co/datasets/truthful_qa) dataset, where we expect `reward(best_answer) >= reward(correct_answer) > reward(incorrect_answer)`.
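One possible reading of the ordering above, sketched as code: every correct answer should score no higher than the best answer, and every incorrect answer should score strictly lower than both. The function name, the argument layout, and the choice to count each comparison separately are assumptions, not the evaluation script behind the reported numbers.

```python
def truthful_qa_comparisons(best, correct, incorrect):
    """Count satisfied ordering constraints for one question.

    best: reward of the best answer.
    correct: list of rewards for the other correct answers.
    incorrect: list of rewards for the incorrect answers.
    Returns (satisfied, total).
    """
    # reward(best_answer) >= reward(correct_answer)
    checks = [best >= c for c in correct]
    # reward(correct_answer) > reward(incorrect_answer)
    checks += [c > w for c in correct for w in incorrect]
    # Implied by the chain: reward(best_answer) > reward(incorrect_answer)
    checks += [best > w for w in incorrect]
    return sum(checks), len(checks)
```

Dividing the satisfied count by the total gives a per-question accuracy that can then be averaged over the sampled questions.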

The Safety Preference benchmark is measured with [PKU's Safe-RLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) dataset. On datapoints with one safe response and one unsafe response, we expect the safe response to always receive the higher reward.
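Selecting the datapoints with exactly one safe response could look like the sketch below. The field names (`prompt`, `response_0`, `response_1`, `is_response_0_safe`, `is_response_1_safe`) are assumptions about the dataset schema and should be checked against the dataset card. The function name `safety_pairs` is hypothetical.

```python
def safety_pairs(rows):
    """Yield (prompt, safe_response, unsafe_response) for rows where exactly
    one of the two responses is labeled safe (field names are assumed)."""
    for row in rows:
        safe0 = row["is_response_0_safe"]
        safe1 = row["is_response_1_safe"]
        if safe0 == safe1:
            # Both safe or both unsafe: no expected ordering, so skip the row.
            continue
        if safe0:
            yield row["prompt"], row["response_0"], row["response_1"]
        else:
            yield row["prompt"], row["response_1"], row["response_0"]
```

Each yielded triple defines one comparison: the reward model is counted as correct when the safe response scores higher than the unsafe one.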

For all benchmarks, we subsample down to at most 1,000 prompts.

## License

The dataset, model, and online demo are a research preview intended for non-commercial use only, subject to the [Yi License](https://huggingface.co/01-ai/Yi-34B-Chat/blob/main/LICENSE), OpenAI data generation [Terms of Use](https://openai.com/policies/terms-of-use), and [ShareGPT Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb). Please contact us if you find any potential violation.

## Acknowledgment

We would like to thank Nexusflow.ai for the assistance with compute, which made these efforts possible. We would like to thank the [LMSYS Organization](https://lmsys.org/) for their support of the [lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, evaluation, and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develop the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan, and ShareGPT.

## Citation

```
@misc{starling2023,
    title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
    url = {},
    author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
    month = {November},
    year = {2023}
}
```