Update README.md
README.md
CHANGED
@@ -40,7 +40,7 @@ essential.
 
 The bloomz-3b-dpo-chat model was trained using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, which includes:
 
-
+Human Preference Data:
 - **Description:** Annotations of helpfulness and harmlessness, with each entry containing "chosen" and "rejected" text pairs.
 - **Purpose:** To train preference models for Reinforcement Learning from Human Feedback (RLHF), not for supervised training of dialogue agents.
 - **Source:** Data from context-distilled language models, rejection sampling, and an iterated online process.
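
For reference, the "chosen"/"rejected" preference pairs described in this change can be inspected directly with the Hugging Face `datasets` library. This is a minimal sketch, not part of the README diff itself; it only assumes the dataset fields named above.

```python
# Minimal sketch: inspect the hh-rlhf preference pairs referenced in the README.
from datasets import load_dataset

# Each record holds a "chosen" and a "rejected" conversation transcript --
# the preference-pair format consumed by RLHF/DPO preference training.
ds = load_dataset("Anthropic/hh-rlhf", split="train")

example = ds[0]
print(example["chosen"][:200])    # preferred (more helpful/harmless) transcript
print(example["rejected"][:200])  # dispreferred transcript
```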