Token Classification
Transformers
Safetensors
English
llama
text-generation-inference
Inference Endpoints
hamishivi committed
Commit 11367cd
1 Parent(s): 8cfec27

Update README.md

Files changed (1)
  1. README.md +4 -1
README.md CHANGED
@@ -21,8 +21,10 @@ Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tul
  This is a **value** model produced during the PPO training of [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-mix-rm) model.
  We release the value model as it may provide a good starting point for additional research or improved decoding with our released PPO models.
 
+ At time of writing, you may have to [install transformers from source](https://huggingface.co/docs/transformers/en/installation#install-from-source) to get the `LlamaForTokenClassification` class.
+
  For more details, read the paper:
- [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+ [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
 
  ## .Model description
@@ -76,6 +78,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
  year={2024},
+ eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }
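
For readers landing on this commit, here is a minimal sketch of what loading the value model through the token-classification head mentioned in the diff might look like. It assumes a transformers build that already ships `LlamaForTokenClassification` (install from source if your release predates it), uses a hypothetical placeholder repository id rather than the real one, and assumes the checkpoint exposes a single output label as its value head.

```python
import torch
from transformers import AutoTokenizer, LlamaForTokenClassification

# Hypothetical placeholder: substitute the actual value-model repository id.
model_id = "allenai/<tulu-v2.5-value-model>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForTokenClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# The classification head scores every token position; the score at the final
# position can be read as a scalar value estimate for the prefix seen so far
# (assuming a single-label value head, as is typical for PPO value models).
inputs = tokenizer("Write a short poem about the ocean.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, num_labels)

value_estimate = logits[0, -1, 0].item()
print(value_estimate)
```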