---
license: apache-2.0
widget:
- instruction: How can I stay energized throughout the day?
  response: Drink lots of coffee!
  dataset: open-assistant
language:
- en
---

# Summary

Instruction tuning has emerged as an important step in developing performant large language models (LLMs) for generative AI tasks. While industry-backed LLMs such as ChatGPT, Bard, Claude, and even the open-source Llama 2 have relied on massive, expensive proprietary datasets unavailable to the public, the open-source community has banded together to create similar datasets, such as OpenAssistant and Dolly, that are available to everyone. However, high variance in the quality and distribution of responses collected by volunteers has limited the quality of the resulting open-source models.

This model (1) classifies instructions with a standardized schema that can be applied across datasets and (2) scores response quality on a scale of 0-1. The purpose is to measure and track instruction diversity across training sets, and to enable filtering based on response quality for more targeted fine-tuning.

The instruction classification schema is based on prior work in large language models:

* Open-qa: question-answering without context, e.g., “When was Google founded?”
* Closed-qa: question-answering from a provided context, e.g., “Look at the following paragraph and tell me how many mentions of fruit there are.”
* Brainstorming: e.g., “Give me some ideas for planning a beach trip.”
* Generation: e.g., “Write me an essay comparing baroque with minimalist music.”
* Summarization: e.g., “Summarize the main points from this news article.”
* Other: anything that does not fit the previous five categories.

# Model evaluation

Model response quality scores were evaluated with double-blind A/B testing that compared dataset responses against responses generated by ChatGPT (version 3.5 Turbo). Our evaluation confirmed that the response quality score predicted preferences for the dataset response over ChatGPT's: