Update README.md
README.md CHANGED
@@ -40,7 +40,7 @@ essential.
 
 The bloomz-3b-dpo-chat model was trained using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, which includes:
 
-
+**Human Preference Data:**
 - **Description:** Annotations of helpfulness and harmlessness, with each entry containing "chosen" and "rejected" text pairs.
 - **Purpose:** To train preference models for Reinforcement Learning from Human Feedback (RLHF), not for supervised training of dialogue agents.
 - **Source:** Data from context-distilled language models, rejection sampling, and an iterated online process.
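As context for the new **Human Preference Data:** block above: each hh-rlhf record pairs a preferred ("chosen") and a dispreferred ("rejected") dialogue transcript sharing the same prompt. A minimal sketch of inspecting that pair format with the Hugging Face `datasets` library follows; the field names `chosen` and `rejected` come from the dataset card, but the snippet itself is illustrative and not part of this model card or its training code.

```python
# Illustrative only: inspect the preference pairs in Anthropic/hh-rlhf.
# Assumes the Hugging Face `datasets` library is installed (pip install datasets).
from datasets import load_dataset

# Each row holds one full dialogue transcript in "chosen" and a divergent
# transcript in "rejected"; the pair shares a common prompt prefix.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

example = dataset[0]
print(example["chosen"])    # preferred (more helpful/harmless) transcript
print(example["rejected"])  # dispreferred transcript for the same prompt
```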
@@ -92,7 +92,7 @@ result
 ```
 
 
-
+### Citation
 
 ```bibtex
 @online{DeBloomzChat,