sapienzanlp/Minerva-7B-instruct-v1.0


s-conia committed 14 days ago

Commit 12f220c • 1 Parent(s): fd42805

Update README.md

Files changed (1)

README.md +2 -2


@@ -156,13 +156,13 @@ The SFT model was trained using [Llama-Factory](https://github.com/hiyouga/LLaMA
 | Italian Safety Instructions                       | -                                                                      | 0        | 0       | 21,426  |
 | Italian Conversations                     | -                                                                      | 0        | 0       | 4,843   |
-For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
+For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 ### Online DPO Training
 This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.
-For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
+For more details, please check [our tech page](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
 ## Model Evaluation
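
Note on the Online DPO paragraph in the diff above: as a rough illustration of the idea it describes, the sketch below shows the standard DPO objective for one preference pair, plus the "online" step in which a judge's scores decide which of two freshly sampled completions counts as chosen. This is not the Minerva training code and does not use the TRL API; the function names `pick_pair` and `dpo_loss`, the `beta=0.1` default, and all numeric inputs are assumptions made for illustration only.

```python
import math


def pick_pair(completions, judge_scores):
    """Order two sampled completions as (chosen, rejected) by judge score.

    Hypothetical helper: in an online setup the scores would come from a
    reward model acting as the judge; here they are plain floats.
    """
    (first, second), (score_first, score_second) = completions, judge_scores
    return (first, second) if score_first >= score_second else (second, first)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen completion
    over the rejected one, relative to a frozen reference model.
    Inputs are summed token log-probabilities of each completion.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Usage with made-up numbers: the judge prefers the second completion.
chosen, rejected = pick_pair(("risposta A", "risposta B"), (0.2, 0.9))
loss = dpo_loss(policy_chosen_logp=-1.0, policy_rejected_logp=-3.0,
                ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the completion the judge preferred.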