Commit 42a3baa, committed by hamishivi
1 parent: 81d5a1a

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -22,7 +22,7 @@ This is a reward model used for PPO training trained on the HH-RLHF 60k dataset.
  It was used to train [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-hh-rlhf-60k) model.
 
  For more details, read the paper:
- [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+ [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
 
  ## .Model description
@@ -76,6 +76,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
  year={2024},
+ eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }