jeffreywpli committed
Commit 6c49093
1 Parent(s): 279205a

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -47,7 +47,7 @@ The model pipeline currently consists of a chain of two xgboost algorithms, one
 
 # Model evaluation
 ## Instruction classification
-Instruction classification scores were measured against ground truth developed internally, with an out-of-sample accuracy/macro-averaged F1 score of 78%/70%. The largest error mode appears linked to basic uncertainty about how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distribution of ground truth classes with the predicted. Thus, the model remains useful for tracking overall instruction diversity and representation.
+Instruction classification scores were measured against ground truth developed internally, with an out-of-sample accuracy/macro-averaged F1 score of 78%/70%. The largest error mode appears linked to basic uncertainty about how to classify an instruction. For example, "What are a few words that can be used to describe running?" could be interpreted as a ```generation``` task to write a brief snippet describing running, a ```brainstorming``` task to simply come up with ideas for writing about running, or (as was indicated in metadata associated with the instruction) as an ```open-qa``` task to answer what running is. However, model predictions appear unbiased when comparing the distributions of ground-truth and predicted classes. Thus, the model remains useful for tracking overall instruction diversity and representation.
 
 ## Response quality
 Response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 turbo). Our evaluation confirmed that response quality predicted preferences for the dataset response over ChatGPT's:
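
The 78%/70% accuracy / macro-averaged F1 figures in the diff above are standard multi-class metrics: macro F1 averages per-class F1 scores with equal class weight, so minority classes count as much as majority ones. A minimal pure-Python sketch of the computation (the labels below are hypothetical, for illustration only — they are not the internal ground-truth data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal class weight."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical instruction-class labels, mirroring the taxonomy named above
y_true = ["open-qa", "generation", "brainstorming", "open-qa", "generation"]
y_pred = ["open-qa", "generation", "open-qa", "open-qa", "generation"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))  # → 0.8 0.6
```

Note how the one misclassified ```brainstorming``` example drags macro F1 (0.6) well below accuracy (0.8), which matches the reported gap between the two metrics: ambiguity on rarer classes costs more under macro averaging.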