Update README.md
README.md
CHANGED
@@ -156,13 +156,13 @@ The SFT model was trained using [Llama-Factory](https://github.com/hiyouga/LLaMA
 | Italian Safety Instructions | - | 0 | 0 | 21,426 |
 | Italian Conversations | - | 0 | 0 | 4,843 |
 
-For more details, please check [our tech
+For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 
 ### Online DPO Training
 
 This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.
 
-For more details, please check [our tech
+For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 
 ## Model Evaluation
 
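Note on the Online DPO paragraph (README line 163 in the hunk above): the README describes sampling from the model during training and using a reward model as the judge. The following is a minimal sketch of that idea in plain Python, not the authors' training code and not the TRL API. The function names `dpo_loss` and `pick_pair`, the `reward_fn` callable (a stand-in for the Skywork reward model), and the default `beta=0.1` are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed value.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def pick_pair(completions, reward_fn):
    """Rank sampled completions with a judge; return (chosen, rejected).

    reward_fn is a hypothetical callable mapping a completion to a score.
    """
    ranked = sorted(completions, key=reward_fn)
    return ranked[-1], ranked[0]
```

In the online setting, `pick_pair` would be called on completions freshly sampled from the current policy for each prompt, and the resulting pair fed to `dpo_loss`. This is why only the prompts of the listed datasets are needed at this stage.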