s-conia committed on
Commit
12f220c
1 Parent(s): fd42805

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -156,13 +156,13 @@ The SFT model was trained using [Llama-Factory](https://github.com/hiyouga/LLaMA
  | Italian Safety Instructions | - | 0 | 0 | 21,426 |
  | Italian Conversations | - | 0 | 0 | 4,843 |
 
- For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
+ For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 
  ### Online DPO Training
 
  This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.
 
- For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
+ For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 
  ## Model Evaluation
 
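
Note on the "Online DPO Training" paragraph in this hunk: the README describes sampling from prompts only and letting a reward model act as judge. The following is a minimal sketch of that idea in plain Python, not the TRL implementation and not the authors' training code. The function names `build_online_pair` and `dpo_loss`, the `generate` and `score` callables, and the default `beta=0.1` are all hypothetical and chosen for illustration; in the real setup the policy model would produce the completions and the Skywork reward model would supply the scores.

```python
import math
from typing import Callable, Tuple


def build_online_pair(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Sample two completions for a prompt and order them by judge score.

    Returns (chosen, rejected). `generate` stands in for the current policy
    and `score` for the reward-model judge; both are assumptions here.
    """
    first, second = generate(prompt), generate(prompt)
    if score(prompt, first) >= score(prompt, second):
        return first, second
    return second, first


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the README
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen
    completion over the rejected one, relative to the reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Toy stand-ins: two canned completions and a length-based "judge".
    canned = iter(["Ciao.", "Ciao! Come posso aiutarti oggi?"])
    chosen, rejected = build_online_pair(
        "Saluta l'utente.",
        generate=lambda _prompt: next(canned),
        score=lambda _prompt, completion: float(len(completion)),
    )
    print("chosen:", chosen)
    print("rejected:", rejected)
    print("loss:", dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The "online" part is that pairs are built from the current policy's own samples at each step, so the preference data tracks the model as it changes, rather than coming from a fixed, pre-labelled dataset.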