Jellywibble committed on
Commit
233c98c
1 Parent(s): e2c0a74

Create README.md

Files changed (1)
  1. README.md +71 -0
README.md ADDED
@@ -0,0 +1,71 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
+ pipeline_tag: text-classification
6
+ tags:
7
+ - pytorch
8
+ - reward_model
9
+ - transformers
10
+ - RLHF
11
+ library_name: transformers
12
+ ---
13
+
14
+ This model is part of the Chai reward-model series. It uses the GPT2 architecture with a classification head and is optimised for the user accepting the completion generated by the base model.
15
+
16
+ Its training dataset consists of purely user-generated content [retry_and_continue_50m_reward_model](https://huggingface.co/datasets/ChaiML/retry_and_continue_50m_reward_model), where a user has the option to decline the generated response via the retry button or end the conversation.
17
+
18
+ ## Model details
19
+ - Developed by [Chai Research](https://www.chai-research.com/)
20
+ - Model type: Transformer-based Classification Model
21
+ - Language: English
22
+ - License: cc-by-nc-4.0
23
+ - Contact: for general correspondence, please email [hello@chai-research.com](mailto:hello@chai-research.com?subject=Huggingface%20Model%20Inquiry)
24
+
25
+ ## Uses and limitations
26
+ ### Intended use
27
+ This reward model was developed primarily for commercial purposes. It learns an inner representation of response quality, as rated by humans, that can be used for best-of-N sampling and Reinforcement Learning with the PPO framework.
28
+
29
+ In addition to scientific uses, you may also further fine-tune and adapt this reward model for deployment, as long as your use is in accordance with the Creative Commons Attribution Non Commercial 4.0 (cc-by-nc-4.0) license, i.e. non-commercial use. This model works with the Transformers library. If you decide to use this pre-trained reward model as a basis for your fine-tuned model, please note that you need to conduct your own risk and bias assessment.
30
+
31
+ ### Out-of-scope use
32
+
33
+ This reward model is **not** intended for deployment as-is. It is not a product and cannot be used for human-facing interactions without supervision.
34
+
35
+ This model **has not** been optimised for common reward-model objectives such as harmfulness, truthfulness and helpfulness; it was trained only on user actions recorded on the Chai mobile app platform. Therefore, this model will **not** rank responses appropriately when evaluated on common open-source datasets. All base model responses within the training data were generated using an in-house variant of [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B), so model performance may degrade when the input is generated by other language models.
36
+
37
+ ### How to use
38
+
39
+ This reward model can be loaded with `AutoModelForSequenceClassification`, together with a GPT2 tokenizer whose `pad_token_id` is set to the EOS token id. The padding and truncation sides need to be set according to the configuration used during model training.
40
+ ```python
41
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
42
+
43
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
44
+ model = AutoModelForSequenceClassification.from_pretrained("ChaiML/gpt2_medium_retry_and_continue_12m_reward_model")
45
+ tokenizer.pad_token_id = 50256
46
+ tokenizer.truncation_side = 'left'
47
+ tokenizer.padding_side = 'right'
48
+ tokens = tokenizer(candidates, return_tensors='pt', return_attention_mask=True, padding='longest', truncation=True, max_length=256)  # candidates: the text(s) to score (a string or a list of strings)
49
+ reward = model(**tokens).logits
50
+ ```
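The intended-use section mentions best-of-N sampling. As a minimal sketch of that selection step, the function below picks the highest-scoring candidate. The names `best_of_n`, `score_fn`, and `toy_score_fn` are hypothetical, and the sketch assumes that a higher reward logit means a better response, which the model card does not state explicitly.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[Sequence[str]], List[float]],
) -> str:
    """Return the candidate with the highest reward score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = score_fn(candidates)
    if len(scores) != len(candidates):
        raise ValueError("score_fn must return one score per candidate")
    # Index of the maximum score; ties resolve to the earliest candidate.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


# Stand-in scorer for illustration only (scores by text length).
# With the real reward model, score_fn would tokenize the candidates as in
# the snippet above and return model(**tokens).logits[:, 0].tolist().
def toy_score_fn(texts: Sequence[str]) -> List[float]:
    return [float(len(t)) for t in texts]


print(best_of_n(["Hi", "Hello there!", "Hey"], toy_score_fn))  # prints "Hello there!"
```

Passing the scorer as a callable keeps the selection logic independent of how the reward is computed, so the same function works with a batch call to the reward model.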
51
+
52
+ ## Model training
53
+ ### Training dataset
54
+ This model was trained by randomly sampling 12 million rows out of the [ChaiML/retry_and_continue_50m_reward_model](https://huggingface.co/datasets/ChaiML/retry_and_continue_50m_reward_model) dataset.
55
+ The original dataset contains over 50 million rows of completions (chatbot responses), along with the number of remaining user messages within their corresponding conversations and whether the user pressed the "retry" button (where the completion is rejected and resampled). The model used to generate these completions is an in-house variant of [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B), with the following sampling parameters:
56
+
57
+ <figure style="width:30em">
58
+
59
+ | Parameters | Value |
60
+ | ---------------------- | ----------- |
61
+ | temperature | 0.72 |
62
+ | repetition_penalty | 1.13125 |
63
+ | max_new_tokens | 64 |
64
+ | top_p | 0.725 |
65
+ | top_k | 0 |
66
+ | eos_token_id | 198 |
67
+ | do_sample | True |
68
+ </figure>
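For readers who want to generate comparable completions, the table above can be written as a dictionary of keyword arguments. This is only a sketch: the name `SAMPLING_KWARGS` is hypothetical, and passing it to the Transformers `generate` method is an assumption about usage, since the in-house GPT-J variant and its serving code are not part of this repository.

```python
# Sampling parameters copied from the table above.
SAMPLING_KWARGS = {
    "temperature": 0.72,
    "repetition_penalty": 1.13125,
    "max_new_tokens": 64,
    "top_p": 0.725,
    "top_k": 0,           # 0 disables top-k filtering in Transformers
    "eos_token_id": 198,  # generation stops when this token id is produced
    "do_sample": True,
}

# Assumed usage with a causal language model and tokenized inputs:
#   output_ids = base_model.generate(**inputs, **SAMPLING_KWARGS)
print(SAMPLING_KWARGS)
```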
69
+
70
+ ### Training procedure
71
+ The `gpt2_medium_retry_and_continue_12m_reward_model` was trained using a [gpt2-medium](https://huggingface.co/gpt2-medium) base model and a classification head with a single output. Binary Cross Entropy loss was used. The model was trained on 4xA40 GPUs with a per-device batch size of 16 and gradient accumulation of 1 (an effective batch size of 64), at a learning rate of 1e-5 for 2 epochs, for a total of 375,000 steps. Tensor parallelism and pipeline parallelism were used to distribute the model across GPUs.
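The step count quoted above follows from the other figures, and the arithmetic can be checked with a short sketch. The `bce_with_logits` helper is a hypothetical, plain-Python illustration of binary cross-entropy on a single-output head. It is not the training code, and how the binary label was derived from the dataset columns is not specified here.

```python
import math

# Figures quoted in the training description above.
num_gpus = 4
per_device_batch_size = 16
gradient_accumulation_steps = 1
rows_sampled = 12_000_000
epochs = 2

effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
total_steps = rows_sampled * epochs // effective_batch_size


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy for one example, computed from a raw logit."""
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


print(effective_batch_size, total_steps)  # prints "64 375000"
```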