Token Classification
Transformers
Safetensors
English
llama
text-generation-inference
Inference Endpoints
hamishivi committed
Commit 11367cd
1 Parent(s): 8cfec27

Update README.md

Files changed (1)
  1. README.md +4 -1
README.md CHANGED
@@ -21,8 +21,10 @@ Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tul
  This is a **value** model produced during the PPO training of [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-mix-rm) model.
  We release the value model as it may provide a good starting point for additional research or improved decoding with our released PPO models.
 
+ At time of writing, you may have to [install transformers from source](https://huggingface.co/docs/transformers/en/installation#install-from-source) to get the `LlamaForTokenClassification` class.
+
  For more details, read the paper:
- [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+ [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
 
  ## .Model description
@@ -76,6 +78,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
  year={2024},
+ eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }
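
For readers landing on this commit, here is a minimal sketch of what loading the value model through the token-classification head mentioned in the diff might look like. It assumes a transformers build that already ships `LlamaForTokenClassification` (install from source if your release predates it), uses a hypothetical placeholder repository id rather than the real one, and assumes the checkpoint exposes a single output label as its value head.

```python
import torch
from transformers import AutoTokenizer, LlamaForTokenClassification

# Hypothetical placeholder: substitute the actual value-model repository id.
model_id = "allenai/<tulu-v2.5-value-model>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForTokenClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# The classification head scores every token position; the score at the final
# position can be read as a scalar value estimate for the prefix seen so far
# (assuming a single-label value head, as is typical for PPO value models).
inputs = tokenizer("Write a short poem about the ocean.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, num_labels)

value_estimate = logits[0, -1, 0].item()
print(value_estimate)
```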