|
--- |
|
language: en |
|
license: apache-2.0 |
|
library_name: transformers |
|
model-index: |
|
- name: digital-socrates-7b |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: AI2 Reasoning Challenge (25-Shot) |
|
type: ai2_arc |
|
config: ARC-Challenge |
|
split: test |
|
args: |
|
num_few_shot: 25 |
|
metrics: |
|
- type: acc_norm |
|
value: 54.44 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: HellaSwag (10-Shot) |
|
type: hellaswag |
|
split: validation |
|
args: |
|
num_few_shot: 10 |
|
metrics: |
|
- type: acc_norm |
|
value: 75.99 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MMLU (5-Shot) |
|
type: cais/mmlu |
|
config: all |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 51.41 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: TruthfulQA (0-shot) |
|
type: truthful_qa |
|
config: multiple_choice |
|
split: validation |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: mc2 |
|
value: 44.88 |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: Winogrande (5-shot) |
|
type: winogrande |
|
config: winogrande_xl |
|
split: validation |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 73.09 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: GSM8k (5-shot) |
|
type: gsm8k |
|
config: main |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 17.89 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=allenai/digital-socrates-7b |
|
name: Open LLM Leaderboard |
|
--- |
|
|
|
This is the Digital Socrates 7B (DS-7B) model described in our paper: <b>Digital Socrates: Evaluating LLMs through explanation critiques</b> (ACL Anthology link: https://aclanthology.org/2024.acl-long.302, arXiv link: https://arxiv.org/abs/2311.09613). |
|
|
|
The recommended, better performing 13B model can be found at https://huggingface.co/allenai/digital-socrates-13b |
|
|
|
The DS-7B model is a fine-tuned version of [Llama-2-7b-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat), please |
|
review its [guidelines](https://huggingface.co/meta-llama/Llama-2-7b-chat/blob/main/USE_POLICY.md) and |
|
[licensing](https://huggingface.co/meta-llama/Llama-2-7b-chat#model-details) before use. |
|
|
|
|
|
|
|
# What is Digital Socrates? |
|
Digital Socrates is an open-source, automatic explanation-critiquing model. It is useful for revealing insights about student |
|
models by examining their reasoning chains. The use of Digital Socrates allows for nuanced, interpretable automatic |
|
evaluation of explanations without expensive API calls or human annotations. |
|
|
|
While LLMs can provide reasoned explanations along with their answers, the nature and |
|
quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the |
|
explanation capabilities of modern models and to create a nuanced, interpretable explanation |
|
evaluation tool that can generate such characterizations automatically. |
|
|
|
Given a **question** (together with the multiple-choice options and gold answer), |
|
```commandline |
|
When Dennis operates his lawnmower, he notices the engine makes a lot of noise. He also notices that the engine gets very hot. Which best describes the heat and noise generated from the lawnmower? |
|
(A) a change in phase |
|
(B) thermal expansion |
|
(C) an increase in entropy |
|
(D) mechanical advantage |
|
|
|
Answer sheet: (C) |
|
``` |
|
as well as a student model explanation and answer, |
|
``` |
|
1) The question states that the lawnmower engine makes a lot of noise. |
|
2) The question states that the lawnmower engine gets very hot. |
|
3) Noise and heat are both forms of energy. |
|
4) The noise and heat generated from the lawnmower are a result of the conversion of energy from the fuel to mechanical energy. |
|
|
|
Answer: (D) |
|
``` |
|
|
|
Digital Socrates gives a **critique of the model-generated explanation** that provides localized feedback on the |
|
most significant flaw (if any) in the explanation, suggestions to address the identified flaw, as well as a numeric rating indicating the explanation quality: |
|
``` |
|
The explanation states or suggests the following: |
|
* Main flaw (standalone statement): "The noise and heat generated from the lawnmower are a result of the conversion of energy from the fuel to mechanical energy." |
|
* Dimension: incorrect_information |
|
|
|
Consider these points for revising the explanation: |
|
* General: Remember that noise and heat are not forms of energy. They are byproducts of energy conversion. |
|
* Specific: In this case, the noise and heat generated by the lawnmower are not a result of the conversion of energy from the fuel to mechanical energy. They are byproducts of the combustion process. |
|
|
|
Explanation score: 2 |
|
``` |
|
|
|
|
|
Remarkably, despite being orders of magnitude smaller than GPT-4, our Digital Socrates models are |
|
capable of generating critiques close to GPT-4 critiques in terms of human rating and other |
|
quantitative measures (correlation of explanation scores given and error category matches). |
|
Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for |
|
revealing insights about student models by examining their reasoning chains. |
|
|
|
We invite you to try out Digital Socrates for your own application! |
|
|
|
|
|
|
|
# How to use Digital Socrates? |
|
We provide a quick example of how you can try out Digital Socrates with just a few lines of code: |
|
|
|
'DSCritiqueBank-V1' used below can be downloaded from our [dataset page](https://allenai.org/data/digital-socrates). |
|
``` |
|
import json |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
# Load model and tokenizer |
|
model_path = "allenai/digital-socrates-7b" |
|
model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda:0") |
|
tokenizer = AutoTokenizer.from_pretrained(model_path) |
|
|
|
# Define input data |
|
question = "When Dennis operates his lawnmower, he notices the engine makes a lot of noise. He also notices that the engine gets very hot. Which best describes the heat and noise generated from the lawnmower? (A) a change in phase (B) thermal expansion (C) an increase in entropy (D) mechanical advantage" |
|
explanation = "1) The question states that the lawnmower engine makes a lot of noise.\n2) The question states that the lawnmower engine gets very hot.\n3) Noise and heat are both forms of energy.\n4) The noise and heat generated from the lawnmower are a result of the conversion of energy from the fuel to mechanical energy." |
|
answerkey = "C" |
|
predictedanswer = "D" |
|
|
|
# construct prompt (Llama conventions) |
|
with open("../DSCritiqueBank-V1/DSCB-prompts.json") as file: |
|
prompts = json.load(file) |
|
|
|
system_prompt = prompts['digital_socrates_v1']['system'] |
|
user_prompt = prompts['digital_socrates_v1']['main'].replace("[[QUESTION]]", question).replace("[[EXPLANATION]]", explanation).replace("[[PREDICTEDANSWER]]", predictedanswer).replace("[[ANSWERKEY]]", answerkey) |
|
|
|
full_prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>{user_prompt} [/INST]\n\n" |
|
|
|
# Run model |
|
input_ids = tokenizer.encode(full_prompt, return_tensors="pt").to("cuda:0") |
|
output = model.generate(input_ids, max_new_tokens=512, temperature=0) |
|
res = tokenizer.batch_decode(output, skip_special_tokens=True) |
|
``` |
|
Print the output: |
|
``` |
|
>>> print(res[0].split("[/INST]")[-1]) |
|
|
|
The explanation states or suggests the following: |
|
* Main flaw (standalone statement): "The noise and heat generated from the lawnmower are a result of the conversion of energy from the fuel to mechanical energy." |
|
* Dimension: incorrect_information |
|
|
|
Consider these points for revising the explanation: |
|
* General: Remember that noise and heat are not forms of energy. They are byproducts of energy conversion. |
|
* Specific: In this case, the noise and heat generated by the lawnmower are not a result of the conversion of energy from the fuel to mechanical energy. They are byproducts of the combustion process. |
|
|
|
Explanation score: 2 |
|
``` |
|
|
|
|
|
|
|
# More details about Digital Socrates ... |
|
For more details about Digital Socrates, please refer to our: |
|
* 📄Paper: https://arxiv.org/abs/2311.09613 |
|
* 💻Dataset: https://allenai.org/data/digital-socrates |
|
|
|
|
|
# Citation |
|
|
|
``` |
|
@inproceedings{gu-etal-2024-digital, |
|
title = "Digital Socrates: Evaluating {LLM}s through Explanation Critiques", |
|
author = "Gu, Yuling and |
|
Tafjord, Oyvind and |
|
Clark, Peter", |
|
editor = "Ku, Lun-Wei and |
|
Martins, Andre and |
|
Srikumar, Vivek", |
|
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", |
|
month = aug, |
|
year = "2024", |
|
address = "Bangkok, Thailand", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2024.acl-long.302", |
|
pages = "5559--5586", |
|
} |
|
``` |
|
|
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_allenai__digital-socrates-7b) |
|
|
|
| Metric |Value| |
|
|---------------------------------|----:| |
|
|Avg. |52.95| |
|
|AI2 Reasoning Challenge (25-Shot)|54.44| |
|
|HellaSwag (10-Shot) |75.99| |
|
|MMLU (5-Shot) |51.41| |
|
|TruthfulQA (0-shot) |44.88| |
|
|Winogrande (5-shot) |73.09| |
|
|GSM8k (5-shot) |17.89| |
|
|
|
|