# RLHF Step-2 Reward Model

This repository hosts an RLHF reward model. It was trained on questions and answers from the [Stack Exchange preferences dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences), using [`distilroberta-base`](https://huggingface.co/distilroberta-base) as the base model.
## Usage

You can use this model directly with a pipeline to score question-answer pairs, assigning a scalar reward to each candidate response:
```python
from accelerate import Accelerator
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

# Load the reward model as a single-label sequence classifier in 8-bit,
# placing it on the device owned by the current accelerate process.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "cambioml/rlhf_reward_model",
    num_labels=1,
    # torch_dtype=torch.bfloat16,  # requires `import torch` if enabled
    load_in_8bit=True,
    device_map={"": Accelerator().process_index},
)

reward_tokenizer = AutoTokenizer.from_pretrained("cambioml/rlhf_reward_model")
reward_tokenizer.pad_token = reward_tokenizer.eos_token

# Inference options; pass these at call time (reward_pipe(texts, **reward_kwargs)),
# not as model_kwargs, since they configure tokenization and batching rather
# than model loading. function_to_apply="none" returns raw logits as rewards.
reward_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 32,
    "truncation": True,
    "max_length": 138,
}

reward_pipe = pipeline(
    "sentiment-analysis",
    model=reward_model,
    tokenizer=reward_tokenizer,
    return_token_type_ids=False,
)
```
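With `num_labels=1` and `function_to_apply="none"`, each pipeline call returns, per input text, a list containing a single label dict whose `"score"` field is the raw logit, i.e. the scalar reward. A minimal sketch of pulling those rewards out of the pipeline output (the sample scores below are illustrative, not real model outputs):

```python
# Each pipeline output is a list with one dict per label; with a
# single-label reward head, the reward is the lone entry's "score".
def extract_rewards(pipe_outputs):
    return [output[0]["score"] for output in pipe_outputs]

# Illustrative output shape for a batch of two question-answer texts,
# e.g. from: pipe_outputs = reward_pipe(texts, **reward_kwargs)
sample_outputs = [
    [{"label": "LABEL_0", "score": 1.27}],
    [{"label": "LABEL_0", "score": -0.43}],
]

rewards = extract_rewards(sample_outputs)
print(rewards)  # [1.27, -0.43]
```

A higher score indicates an answer the reward model prefers; in an RLHF loop these scalars are the per-sample rewards fed to the policy-optimization step.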