Update README.md
README.md CHANGED
@@ -1,4 +1,5 @@
 ---
+base_model: UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3
 datasets:
 - openbmb/UltraFeedback
 language:
@@ -109,14 +110,9 @@ Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)
 
 # Llama-3-Instruct-8B-SPPO-Iter3
 
-This model
-
-
-## Links to Other Models
-- [Llama-3-Instruct-8B-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter1)
-- [Llama-3-Instruct-8B-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2)
-- [Llama-3-Instruct-8B-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3)
-
+This model is a GPTQ quantization of the SPPO model developed with [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 3, starting from the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) architecture. We utilized prompts from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, split into three parts for the three iterations following [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.
+
+
 ### Model Description
 
 - Model type: An 8B parameter GPT-like model fine-tuned on synthetic datasets.