The model pipeline currently consists of a chain of two xgboost algorithms, one for instruction classification and one for response quality scoring.
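A minimal sketch of what such a two-stage chain could look like, assuming the `xgboost` Python package and hypothetical embedding features and labels (the actual training code, features, and chaining mechanism are not described in this README):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                # hypothetical instruction embeddings
instruction_class = rng.integers(0, 3, 100)  # e.g. generation / brainstorming / open-qa
quality = rng.integers(0, 2, 100)            # hypothetical response-quality labels

# Stage 1: classify the instruction type.
clf = xgb.XGBClassifier(n_estimators=50).fit(X, instruction_class)

# Stage 2: score response quality, feeding the stage-1 prediction in as an
# extra feature, one plausible way to "chain" the two models.
X_chained = np.column_stack([X, clf.predict(X)])
scorer = xgb.XGBClassifier(n_estimators=50).fit(X_chained, quality)
print(scorer.predict_proba(X_chained)[:3, 1])  # predicted quality scores
```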
# Model evaluation
## Instruction classification
Instruction classification scores were measured against internally developed ground truth, with an out-of-sample accuracy of 78% and a macro-averaged F1 score of 70%. The largest error mode appears linked to basic uncertainty about how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as indicated in the metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distributions of ground-truth and predicted classes, so the model remains useful for tracking overall instruction diversity and representation.
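As a sketch of how these metrics and the distribution check could be computed, assuming scikit-learn and hypothetical `y_true`/`y_pred` label arrays (the internally developed ground truth is not published here):

```python
from collections import Counter

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels; the real ground truth was developed internally.
y_true = ["open-qa", "generation", "brainstorming", "open-qa", "generation"]
y_pred = ["open-qa", "generation", "open-qa", "open-qa", "generation"]

print(accuracy_score(y_true, y_pred))             # reported: 78%
print(f1_score(y_true, y_pred, average="macro"))  # reported: 70%

# Bias check: compare the class distributions of ground truth and
# predictions, rather than per-example agreement.
print(Counter(y_true))
print(Counter(y_pred))
```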
## Response quality
Response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (gpt-3.5-turbo). Our evaluation confirmed that higher response quality scores predicted rater preference for the dataset response over ChatGPT's:
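As a hedged sketch (not the actual evaluation harness) of how such a tally might relate predicted quality scores to rater preference, assuming hypothetical `quality` scores and blinded `preferred_dataset` judgments:

```python
import pandas as pd

# Hypothetical ratings: `quality` is the model's predicted response-quality
# score; `preferred_dataset` is True when the blinded rater preferred the
# dataset response over the ChatGPT (gpt-3.5-turbo) response.
ratings = pd.DataFrame(
    {
        "quality": [0.2, 0.4, 0.6, 0.8, 0.9, 0.3, 0.7, 0.95],
        "preferred_dataset": [False, False, True, True, True, False, True, True],
    }
)

# Bucket by predicted quality and check that the preference rate for the
# dataset response rises with the score.
ratings["bucket"] = pd.cut(ratings["quality"], bins=[0.0, 0.5, 1.0])
print(ratings.groupby("bucket", observed=True)["preferred_dataset"].mean())
```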