Christopher Glaze committed
Commit 1358e52 · 1 parent: ce46e2d
Update readme
- .gitignore +2 -1
- README.md +32 -1
- curating_model_eval.png +0 -0
- tests.py +10 -2
.gitignore
CHANGED
@@ -1 +1,2 @@
-**/__pycache__
+**/__pycache__
+.DS_Store
README.md
CHANGED
@@ -6,4 +6,35 @@ widget:
 dataset: open-assistant
 language:
 - en
----
+---
+
+# Summary
+Instruction tuning has emerged as an important step in developing performant large language models (LLMs) for generative AI tasks. While industry-backed LLMs such as ChatGPT, Bard, Claude, and even the open-source Llama 2 have relied on massive, expensive proprietary datasets unavailable to the public, the open-source community has banded together to create similar datasets, such as OpenAssistant and Dolly, that are available to everyone. However, high variance in the quality and distribution of responses collected by volunteers has limited the quality of the resulting open-source models.
+
+This model (1) classifies instructions using a standardized schema that can be applied across datasets and (2) scores response quality on a scale from 0 to 1. The purpose is to measure and track instruction diversity across training sets and to enable filtering by response quality for more targeted fine-tuning.
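As a rough illustration of the filtering use case, the sketch below keeps only records whose quality score clears a threshold. It assumes the predictions are dicts keyed by `response quality`, as in the example output shown in this README; the helper name `filter_by_quality`, the 0.5 threshold, and the sample scores are hypothetical and not part of the model's API.

```python
from typing import Dict, List


def filter_by_quality(predictions: List[Dict], min_quality: float = 0.5) -> List[Dict]:
    """Keep predictions whose 'response quality' score is at least min_quality.

    Records without a score (for example, instruction-only inputs) are dropped.
    """
    return [
        p for p in predictions
        if p.get("response quality") is not None
        and p["response quality"] >= min_quality
    ]


# Illustrative values only, not real model output.
predictions = [
    {"instruction class": "brainstorming", "response quality": 0.08},
    {"instruction class": "open-qa", "response quality": 0.91},
    {"instruction class": "generation"},  # no response supplied, so no score
]

kept = filter_by_quality(predictions, min_quality=0.5)
print(kept)
```

The right threshold depends on the dataset and on how much data the fine-tuning run can afford to lose, so it is worth inspecting the score distribution before choosing one.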
+
+The instruction classification schema is based on prior work in large language models:
+
+* <strong>Open-qa</strong>: question answering without context, e.g., “When was Google founded?”
+* <strong>Closed-qa</strong>: question answering from a provided context, e.g., “Look at the following paragraph and tell me how many mentions of fruit there are.”
+* <strong>Brainstorming</strong>: e.g., “Give me some ideas for planning a beach trip.”
+* <strong>Generation</strong>: e.g., “Write me an essay comparing baroque with minimalist music.”
+* <strong>Summarization</strong>: e.g., “Summarize the main points from this news article.”
+* <strong>Other</strong>: anything that does not fit the previous five categories.
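One way to track instruction diversity with this schema is to compute the share of each class in a training set. The sketch below is a minimal example under assumptions: the lowercase label strings in `SCHEMA` are inferred from the category names above (only `brainstorming` appears in the example output), and `class_distribution` is a hypothetical helper, not something shipped with the model.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumed label strings; only 'brainstorming' is confirmed by the example output.
SCHEMA = ("open-qa", "closed-qa", "brainstorming", "generation", "summarization", "other")


def class_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of labels falling into each schema class."""
    counts = Counter(labels)
    unknown = set(counts) - set(SCHEMA)
    if unknown:
        raise ValueError(f"labels outside the schema: {sorted(unknown)}")
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in SCHEMA}


dist = class_distribution(["brainstorming", "brainstorming", "open-qa", "other"])
print(dist)
```

Comparing these fractions across datasets gives a quick view of which instruction types are over- or under-represented.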
+
+# Model evaluation
+Model response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that the response quality score predicted preferences for the dataset response over ChatGPT's:
+
+<center>
+<img src="curating_model_eval.png" width="300"/>
+</center>
+
+# Usage
+The model accepts either a single dictionary or a list of dictionaries as input. Each dictionary needs at minimum an ```instruction``` field, in which case the model only classifies the instruction. If a ```response``` field is included, a response quality score is also returned. Users can also provide an optional ```dataset``` field, which changes model predictions only if it names one of the sources we trained on: dolly, helpful-instructions or open-assistant.
+
+## Example
+Input:
+```{'instruction': 'What are ways I can stay energized throughout the day?', 'response': 'Drink lots of coffee!'}```
+
+Model output:
+```{'instruction class': 'brainstorming', 'instruction class confidence': 0.9683452, 'response quality': 0.08076164}```
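The input rules described under Usage can be sketched as a small payload builder. This is a hedged example: `build_payload` is a hypothetical helper, and the `{'inputs': [...]}` wrapper is taken from `tests.py` in this commit rather than from any documented API. The sketch always wraps records in a list, which is an assumption about what the handler expects.

```python
from typing import Dict, Iterable, Union


def build_payload(records: Union[Dict, Iterable[Dict]]) -> Dict:
    """Wrap one record or several records in an {'inputs': [...]} payload.

    Each record must have a non-empty 'instruction'; 'response' and 'dataset'
    are copied through only when present. Other keys are dropped.
    """
    if isinstance(records, dict):
        records = [records]

    cleaned = []
    for i, rec in enumerate(records):
        if not rec.get("instruction"):
            raise ValueError(f"record {i} is missing the required 'instruction' field")
        item = {"instruction": rec["instruction"]}
        for key in ("response", "dataset"):
            if key in rec:
                item[key] = rec[key]
        cleaned.append(item)
    return {"inputs": cleaned}


payload = build_payload(
    {
        "instruction": "What are ways I can stay energized throughout the day?",
        "response": "Drink lots of coffee!",
    }
)
print(payload)
```

The resulting `payload` has the same shape as the one passed to `response_model_handler` in `tests.py`.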
curating_model_eval.png
ADDED
tests.py
CHANGED
@@ -14,9 +14,17 @@ pred=response_model_handler(payload)
 print(pred)
 
 payload = {'inputs': [{"instruction": "What are some ways to stay energized throughout the day?",
-"response": "Drink lots of coffee!"
+"response": "Drink lots of coffee!",
+"dataset": ''},
+{"instruction": "What are some ways to stay energized throughout the day?",
+"response": "Drink lots of coffee!",
+"dataset": 'dolly'},
 {"instruction": "What are some ways to stay energized throughout the day?",
-"response": "
+"response": "Drink lots of coffee!",
+"dataset": 'open-assistant'},
+{"instruction": "What are some ways to stay energized throughout the day?",
+"response": "Drink lots of coffee!",
+"dataset": 'helpful_instructions'}]}
 
 # test the handler
 pred=response_model_handler(payload)