Joblib
English
llm
human-feedback
weak supervision
data filtering
Inference Endpoints
Christopher Glaze committed on
Commit
1358e52
1 Parent(s): ce46e2d

Update readme

Files changed (4)
  1. .gitignore +2 -1
  2. README.md +32 -1
  3. curating_model_eval.png +0 -0
  4. tests.py +10 -2
.gitignore CHANGED
@@ -1 +1,2 @@
- **/__pycache__
+ **/__pycache__
+ .DS_Store
README.md CHANGED
@@ -6,4 +6,35 @@ widget:
  dataset: open-assistant
  language:
  - en
- ---
+ ---
+
+ # Summary
+ Instruction tuning has emerged as an important step in developing performant large language models (LLMs) for generative AI tasks. While industry-backed LLMs such as ChatGPT, Bard, Claude, and even the open-source Llama 2 have relied on massive, expensive proprietary datasets unavailable to the public, the open-source community has banded together to create similar datasets, such as OpenAssistant and Dolly, that are available to everyone. However, high variance in the quality and distribution of responses collected by volunteers has limited the quality of the resulting open-source models.
+
+ This model (1) classifies instructions under a standardized schema that can be applied across datasets and (2) scores response quality on a scale of 0-1. The purpose is to measure and track instruction diversity across training sets, and to enable filtering on response quality for more targeted fine-tuning.
+
+ The instruction classification schema is based on prior work in large language models:
+
+ * <strong>Open-qa</strong>: question-answering without context, e.g., “When was Google founded?”
+ * <strong>Closed-qa</strong>: question-answering over a provided context, e.g., “Look at the following paragraph and tell me how many mentions of fruit there are.”
+ * <strong>Brainstorming</strong>: e.g., “Give me some ideas for planning a beach trip.”
+ * <strong>Generation</strong>: e.g., “Write me an essay comparing baroque with minimalist music.”
+ * <strong>Summarization</strong>: e.g., “Summarize the main points from this news article.”
+ * <strong>Other</strong>: anything that does not fit the previous five categories.
+
+ # Model evaluation
+ Model response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that the response quality score predicted preferences for the dataset response over ChatGPT's:
+
+ <center>
+ <img src="curating_model_eval.png" width="300"/>
+ </center>
+
+ # Usage
+ The model accepts either a single dict or a list of dicts as input. Each dict needs an ```instruction``` field at a bare minimum (in which case the model simply classifies the instruction). If a ```response``` field is included, a response quality score is also returned. Users can optionally provide a ```dataset``` field, which only changes model predictions if it matches one of the sources we trained on: dolly, helpful-instructions, or open-assistant.
+
+ ## Example
+ Input:
+ ```{'instruction': 'What are ways I can stay energized throughout the day?', 'response': 'Drink lots of coffee!'}```
+
+ Model output:
+ ```{'instruction class': 'brainstorming', 'instruction class confidence': 0.9683452, 'response quality': 0.08076164}```
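As a sketch of the input contract described in the README's Usage section, payloads can be assembled like this (the `make_record` helper is hypothetical, not part of the model; field names follow the README):

```python
# Hypothetical helper for assembling model inputs. Per the README,
# "instruction" is required; "response" and "dataset" are optional.
def make_record(instruction, response=None, dataset=None):
    record = {"instruction": instruction}
    if response is not None:
        record["response"] = response
    if dataset is not None:
        record["dataset"] = dataset
    return record

# A single dict with only "instruction": classification only.
classify_only = make_record("When was Google founded?")

# A batched payload with responses to score, shaped like tests.py.
payload = {"inputs": [
    make_record("What are ways I can stay energized throughout the day?",
                response="Drink lots of coffee!"),
    make_record("What are ways I can stay energized throughout the day?",
                response="Drink lots of coffee!",
                dataset="open-assistant"),
]}
print(classify_only)
print(payload)
```

Omitting the optional fields rather than passing empty strings keeps the single-dict and batched forms interchangeable.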
curating_model_eval.png ADDED
tests.py CHANGED
@@ -14,9 +14,17 @@ pred=response_model_handler(payload)
 print(pred)

 payload = {'inputs': [{"instruction": "What are some ways to stay energized throughout the day?",
-                       "response": "Drink lots of coffee!"},
+                       "response": "Drink lots of coffee!",
+                       "dataset": ''},
+                      {"instruction": "What are some ways to stay energized throughout the day?",
+                       "response": "Drink lots of coffee!",
+                       "dataset": 'dolly'},
                       {"instruction": "What are some ways to stay energized throughout the day?",
-                       "response": "Buy lots of sailboats!"}]}
+                       "response": "Drink lots of coffee!",
+                       "dataset": 'open-assistant'},
+                      {"instruction": "What are some ways to stay energized throughout the day?",
+                       "response": "Drink lots of coffee!",
+                       "dataset": 'helpful_instructions'}]}

 # test the handler
 pred=response_model_handler(payload)
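The four-variant payload in the updated tests.py can also be generated programmatically rather than repeated as literals; a minimal sketch, using the same instruction, response, and dataset values as the test:

```python
# Sweep the optional "dataset" field over the values exercised in
# tests.py: an empty string plus the three trained sources.
DATASETS = ["", "dolly", "open-assistant", "helpful_instructions"]
INSTRUCTION = "What are some ways to stay energized throughout the day?"
RESPONSE = "Drink lots of coffee!"

payload = {"inputs": [
    {"instruction": INSTRUCTION, "response": RESPONSE, "dataset": d}
    for d in DATASETS
]}
print(payload)
```

Holding the instruction and response fixed while varying only the dataset field isolates the effect of the dataset hint on the returned scores.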