jeffreywpli committed
Commit 6c49093
1 Parent(s): 279205a

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -47,7 +47,7 @@ The model pipeline currently consists of a chain of two xgboost algorithms, one
 
 # Model evaluation
 ## Instruction classification
-Instruction classification scores were measured against ground truth developed internally, with an out-of-sample accuracy/macro-averaged F1 score of 78%/70%. The largest error mode appears linked to basic uncertainty about how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. Thus, the model remains useful for tracking overall instruction diversity and representation.
+Instruction classification scores were measured against ground truth developed internally, with an out-of-sample accuracy/macro-averaged F1 score of 78%/70%. The largest error mode appears linked to basic uncertainty about how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distributions of ground-truth and predicted classes. Thus, the model remains useful for tracking overall instruction diversity and representation.
 
 ## Response quality
 Response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that response quality predicted preferences for the dataset response over ChatGPT's:
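
The 78%/70% accuracy / macro-averaged F1 figures in the diff above are standard multi-class metrics: macro F1 averages per-class F1 scores with equal class weight, so minority classes count as much as majority ones. A minimal pure-Python sketch of the computation (the labels below are hypothetical, for illustration only — they are not the internal ground-truth data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal class weight."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical instruction-class labels, mirroring the taxonomy named above
y_true = ["open-qa", "generation", "brainstorming", "open-qa", "generation"]
y_pred = ["open-qa", "generation", "open-qa", "open-qa", "generation"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))  # → 0.8 0.6
```

Note how the one misclassified ```brainstorming``` example drags macro F1 (0.6) well below accuracy (0.8), which matches the reported gap between the two metrics: ambiguity on rarer classes costs more under macro averaging.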