Update README.md
README.md CHANGED
@@ -40,7 +40,7 @@ essential.
 
 The bloomz-3b-dpo-chat model was trained using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, which includes:
 
-
+**Human Preference Data:**
 - **Description:** Annotations of helpfulness and harmlessness, with each entry containing "chosen" and "rejected" text pairs.
 - **Purpose:** To train preference models for Reinforcement Learning from Human Feedback (RLHF), not for supervised training of dialogue agents.
 - **Source:** Data from context-distilled language models, rejection sampling, and an iterated online process.
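As context for the new **Human Preference Data:** block above: each hh-rlhf record pairs a preferred ("chosen") and a dispreferred ("rejected") dialogue transcript sharing the same prompt. A minimal sketch of inspecting that pair format with the Hugging Face `datasets` library follows; the field names `chosen` and `rejected` come from the dataset card, but the snippet itself is illustrative and not part of this model card or its training code.

```python
# Illustrative only: inspect the preference pairs in Anthropic/hh-rlhf.
# Assumes the Hugging Face `datasets` library is installed (pip install datasets).
from datasets import load_dataset

# Each row holds one full dialogue transcript in "chosen" and a divergent
# transcript in "rejected"; the pair shares a common prompt prefix.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

example = dataset[0]
print(example["chosen"])    # preferred (more helpful/harmless) transcript
print(example["rejected"])  # dispreferred transcript for the same prompt
```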
@@ -92,7 +92,7 @@ result
 ```
 
 
-
+### Citation
 
 ```bibtex
 @online{DeBloomzChat,