diff --git "a/table_result/2407.00087v2_output.json" "b/table_result/2407.00087v2_output.json" --- "a/table_result/2407.00087v2_output.json" +++ "b/table_result/2407.00087v2_output.json" @@ -26,10 +26,13 @@ "[paragraph id = 4] Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets." ], "table_html": "
\n

\n\n\n\n\n\nScienceQA\nWin Rate\n\n\n\n vs \n69.76%\n\n vs \n73.76%\n\nA-OKVQA\nWin Rate\n\n vs \n69.11%\n\n vs \n66.96%\n\n\n

\n
Table 1: We train baseline models, MM-CoT, with the ARES pipeline and ask GPT-4o to evaluate which rationale reasoning is better. We compare each baseline for two model sizes ( and ) and two tasks (ScienceQA and A-OKVQA).
\n
", - "perturb_sentence_id": 2, + "perturb_sentence_id": [ + 2, + 3 + ], "output": { - "perturbed_statement": "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ). We conduct our experiments with two different model sizes, Flan-Mini and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", - "perturbed_explanation": "1. The original explanation states that the experiments were conducted with Flan-Base and Flan-Large models, ensuring a fair comparison and control over model scale variations. 2. The statement is incorrect because it lists Flan-Mini and Flan-Large as the model sizes instead, which misrepresents the actual experimental setup as only Flan-Base and Flan-Large models were utilized." + "perturbed_statement": "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).We conduct our experiments with two different model sizes, Flan-Base and Flan-Small with ViT feature, on ScienceQA and A-OKVQA.", + "perturbed_explanation": "The original explanation is: This statement outlines the methodology used in evaluating rationale chains, which includes randomizing them for a fair comparison and testing with two model sizes, Flan-Base and Flan-Large with ViT feature, across ScienceQA and A-OKVQA datasets. 1. The statement mistakenly refers to one of the model sizes as Flan-Small instead of Flan-Large. This is incorrect because the model sizes actually used were Flan-Base and Flan-Large, not Flan-Small." } }, { @@ -83,10 +86,13 @@ "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03" ], "table_html": "
\n
\n

\n\n\n\n\n\nModel\nSize\nNAT\nSOC\nLAN\nTXT\nIMG\nNO\nG1-6\nG7-12\nAvg\n\nHuman\n-\n90.23\n84.97\n87.48\n89.60\n87.50\n88.10\n91.59\n82.42\n88.40\n\nMCAN (Yu et al., 2019 ###reference_b40###)\n95M\n56.08\n46.23\n58.09\n59.43\n51.17\n55.40\n51.65\n59.72\n54.54\n\nTop-Down (Anderson et al., 2018 ###reference_b1###)\n70M\n59.50\n54.33\n61.82\n62.90\n54.88\n59.79\n57.27\n62.16\n59.02\n\nBAN (Kim et al., 2018 ###reference_b16###)\n112M\n60.88\n46.57\n66.64\n62.61\n52.60\n65.51\n56.83\n63.94\n59.37\n\nDFAF (Peng et al., 2019 ###reference_b28###)\n74M\n64.03\n48.82\n63.55\n65.88\n54.49\n64.11\n57.12\n67.17\n60.72\n\nViLT (Kim et al., 2021 ###reference_b17###)\n113M\n60.48\n63.89\n60.27\n63.20\n61.38\n57.00\n60.72\n61.90\n61.14\n\nPatch-TRM (Lu et al., 2022b ###reference_b25###)\n90M\n65.19\n46.79\n65.55\n66.96\n55.28\n64.95\n58.04\n67.50\n61.42\n\nVisualBERT (Li et al., 2019 ###reference_b20###)\n111M\n59.33\n69.18\n61.18\n62.71\n62.17\n58.54\n62.96\n59.92\n61.87\n\nUnifiedQABase (Khashabi et al., 2020 ###reference_b15###)\n223M\n68.16\n69.18\n74.91\n63.78\n61.38\n77.84\n72.98\n65.00\n70.12\n\nUnifiedQABase w/ CoT (Lu et al., 2022a ###reference_b24###)\n223M\n71.00\n76.04\n78.91\n66.42\n66.53\n81.81\n77.06\n68.82\n74.11\n\nLLaMA-Adapter (Zhang et al., 2023a ###reference_b42###)\n6B\n84.37\n88.30\n84.36\n83.72\n80.32\n86.90\n85.83\n84.05\n85.19\n\nLLaVA (Liu et al., 2023a ###reference_b22###)\n13B\n90.36\n95.95*\n88.00\n89.49\n88.00\n90.66\n90.93\n90.90*\n90.92\n\nInstructBLIP (Dai et al., 2023 ###reference_b8###)\n11B\n-\n-\n-\n-\n90.70*\n-\n-\n-\n-\n\n (Zhang et al., 2023b ###reference_b45###)\n251M+251M\n84.59\n92.46\n83.45\n83.87\n83.29\n85.64\n86.34\n85.23\n85.95\n\n (Ours)\n251M+30M\n87.92\n92.58\n85.91\n86.61\n85.82\n88.36\n88.88\n87.48\n88.38\n\n (Zhang et al., 2023b ###reference_b45###)\n790M+790M\n90.76\n93.59\n86.55\n89.69\n87.85\n89.55\n90.90\n89.12\n90.26\n\n (Ours)\n790M+76M\n91.21*\n92.80\n89.45*\n90.27*\n88.35\n91.22*\n91.48*\n90.38\n91.09*\n\n\n

\n
\n
Table 2: \nMain results on the ScienceQA test set (%). Size = backbone size. Question classes: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. Other results are sourced from Lu et al. (2022a) and Zhang et al. (2023b). Results in bold represent the better performance compared to the corresponding baseline. (*) indicates the best performance.\n
\n
", - "perturb_sentence_id": 20, + "perturb_sentence_id": [ + 20, + 21 + ], "output": { - "perturbed_statement": "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 64.83", - "perturbed_explanation": "The original explanation establishes the accuracy metrics of the models as follows: IPVR (OPT-66B) scored 48.6, ViLBERT scored 49.1, followed by other models achieving 60.96, 65.41, 65.68, and 68.03 sequentially. The statement inaccurately describes the final model's accuracy as 64.83 instead of the correct 68.03. This misrepresentation introduces an erroneous sequence, disrupting the claim's alignment with provided information. This demonstrates the importance of maintaining accurate data representation in summaries." + "perturbed_statement": "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 50.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "perturbed_explanation": "The original explanation notes that the model accuracy figures provided are specific and verifiable based on given context. 1. The IPVR (OPT-66B) model accuracy is stated as 48.6, which can be cross-referenced. 2. Changing IPVR (OPT-66B) accuracy to 50.6 introduces a factual error, as the original accuracy is documented as 48.6, not 50.6." } }, { @@ -130,10 +136,13 @@ "[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study." ], "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\nIPVR (OPT-66B)\n48.6\n\nViLBERT\n49.1\n\n\n60.96\n\n (Ours)\n65.41\n\n\n65.68\n\n (Ours)\n68.03\n\n\n

\n
Table 3: Results of ARES on A-OKVQA. We mainly compare different-sized MM-CoT baselines (Zhang et al., 2023b). We retrain the MM-CoTs and run the ARES pipeline on these models. We evaluate these models on the validation set because the test set is hidden.
\n
", - "perturb_sentence_id": 11, + "perturb_sentence_id": [ + 11, + 12 + ], "output": { - "perturbed_statement": "[paragraph id = 11] In our experiments, MM-CoT models perform around 15% better than the reported accuracy in (Zhang et al., 2023b ).ARES achieves 6.75% gains against and 3.15% for .", - "perturbed_explanation": "1. The original statement claims a 10% improvement by MM-CoT models in accuracy over reported metrics and outlines specific gains achieved by ARES as 4.45% and 2.35% in respective contexts. 2. The statement inaccurately attributes MM-CoT models with a 15% performance improvement and modifies ARES gains to 6.75% and 3.15%, which conflicts with the consistent data representation in the study." + "perturbed_statement": "[paragraph id = 11] In our experiments, MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b ).ARES achieves 4.45% gains against XYZ baseline and 2.35% for another baseline.", + "perturbed_explanation": "1. The statement claims MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b), whereas the context mentions a 10% improvement, not 20%. 2. The addition of 'against XYZ baseline and for another baseline' adds unintended specificity not present in the context, which simply discusses percentage gains without associating them with specific baselines." } }, { @@ -175,10 +184,13 @@ "[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES." ], "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\n\n\n\n85.95\n\n1st RL\n86.70\n\n1st ARES\n87.81\n\n1st ARES & 2nd RL\n87.88\n\n2nd ARES\n88.38\n\n\n

\n
Table 4: Ablation study: The accuracy gradually improves as each stage of ARES is added.
\n
", - "perturb_sentence_id": 19, + "perturb_sentence_id": [ + 19, + 20 + ], "output": { - "perturbed_statement": "[paragraph id = 19] 1st ARES & 2nd RL represents the second ARES on 1st RL, and finally, 2nd RL refers to two rounds of RL.Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", - "perturbed_explanation": "Original Explanation: The statement provides the mapping of iterations between RL and ARES processes and their related accuracy metrics. The context specifies that '1st ARES & 2nd RL' indicates the second RL following the execution of 1st ARES. Furthermore, '2nd ARES' refers to completing a second round of ARES execution. Explanation: '1st ARES & 2nd RL' being described as 'the second ARES on 1st RL' is incorrect as it implies a contradictory order of operations, while '2nd RL' equating to two RL rounds misrepresents the definition and separate context of 2nd ARES." + "perturbed_statement": "[paragraph id = 19] 1st ARES & 2nd RL represents the second ARES on 1st RL, and finally, 2nd ARES refers to one round of ARES.Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "perturbed_explanation": "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES. This implies a sequential progression through different stages involving ARES and RL. 2. The statement incorrectly states that '1st ARES & 2nd RL represents the second ARES on 1st RL' and '2nd ARES refers to one round of ARES'. In contrast, it should correctly describe the order of operations as involving a second RL following the 1st ARES and two rounds of ARES for the '2nd ARES' phase, rather than one round as mistakenly mentioned." } }, { @@ -222,10 +234,13 @@ "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 )." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", - "perturb_sentence_id": 9, + "perturb_sentence_id": [ + 9, + 10 + ], "output": { - "perturbed_statement": "[paragraph id = 9] We additionally shift the reward distribution to center it at a mean value of 10 (Zheng et al., 2023 ). Therefore, the actual range extends to 20.", - "perturbed_explanation": "The original explanation is as follows: The reward distribution is stated to be adjusted to center it around its mean, however, specific numeric values might have been mentioned for both the center and range limits. Deviating from these specifics alters the accuracy in referencing key findings mentioned in the corresponding document. It is essential to ensure consistency between the textual statement and the documented contexts and findings to preserve clarity and reliability. Hence, verifying these particular aspects is vital for accurate comprehension and application of the stated information." + "perturbed_statement": "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits limited behaviors (please refer to Section 4 ).", + "perturbed_explanation": "1. The original statement explains the characteristics of the RL fine-tuned model, indicating that it exhibits emergent behaviors due to nuanced scores. 2. The statement incorrectly states that the RL fine-tuned model exhibits limited behaviors instead of emergent behaviors, which contradicts the information that it yields emergent behaviors as detailed in the context." } }, { @@ -288,10 +303,13 @@ "[paragraph id = 18] (See Appendix C.2 .)" ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", - "perturb_sentence_id": 12, + "perturb_sentence_id": [ + 12, + 13 + ], "output": { - "perturbed_statement": "[paragraph id = 12] (See Appendix C.2 .) In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", - "perturbed_explanation": "The original explanation mentions that the rationale chains in prior steps are maintained, and Appendix C.2 is referred to for details about reducing repetitive sentences during the RL stage. However: 1. Appendix C.2 does not pertain to the SFT stage dataset correction but to repetitive sentence removal during the RL stage. 2. Associating Appendix C.2 with the SFT stage introduces misalignment in the documentation and potential confusion for readers seeking accurate references." + "perturbed_statement": "[paragraph id = 12] (See Table 5 .) In order to collect the corrected dataset for the RL stage, we let Haiku refer to the given problem and correct the answer as the prompt.", + "perturbed_explanation": "1. The original statement discusses collecting a corrected dataset for the SFT stage by letting Haiku refer to the given problem and correct the answer as the prompt. 2. The statement is incorrect because it mentions collecting the corrected dataset for the RL stage, whereas the context suggests this process is associated with the SFT stage. The RL stage is mentioned in relation to generating repetitive sentences, not dataset correction." } }, { @@ -344,10 +362,13 @@ "[paragraph id = 26] For more detailed settings, please refer to Appendix C ." ], "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for the A-OKVQA test set
\n
", - "perturb_sentence_id": 25, + "perturb_sentence_id": [ + 25, + 26 + ], "output": { - "perturbed_statement": "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a small number of weights, enabling 50x–70x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ). For more detailed settings, please refer to Appendix C.", - "perturbed_explanation": "The original statement describes the performance improvements in inference time when utilizing the LoRA adapter relative to MM-CoT, specifying a range of 2x–14x based on experimental observations. 1. The inference speed-up values involve precise calculations obtained from benchmarking different models, with specific ranges assigned. 2. Claiming a significantly higher range of 50x–70x misrepresents the reported findings and could lead to misconceptions about the actual levels of improvement achievable using the technology outlined." + "perturbed_statement": "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a large number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8). For more detailed settings, please refer to Appendix C.", + "perturbed_explanation": "The original explanation likely highlighted that the LoRA adapter efficiently uses the rationale model's features with a small number of weights, which results in significantly faster inference speeds compared to MM-CoT. 2. The statement incorrectly mentions a 'large number of weights', contradicting the efficiency highlighted by the small number of weights that enable faster processing. LoRA's advantage in speed is due to its minimal weight requirement, not a large one." } }, { @@ -390,10 +411,13 @@ "[paragraph id = 8] This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input." ], "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for the A-OKVQA test set
\n
", - "perturb_sentence_id": 7, + "perturb_sentence_id": [ + 7, + 8 + ], "output": { - "perturbed_statement": "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 1.5x–7x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ). This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input.", - "perturbed_explanation": "Original Explanation: The statement details the performance benefits of the framework compared to MM-CoT in terms of answer generation speed, supported by specific speed ratios reported in Table 7 and Table 8. 1. The mentioned speed advantage as 1.5x–7x deviates from the actual reported range of 2x–14x, therefore misrepresenting the evaluated benchmarks mentioned in the context. Accurate reference to reported data is crucial for conveying findings precisely." + "perturbed_statement": "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, but generates answers 2x–14x slower than MM-CoT, which uses a separate inference model (See Table 7 and Table 8). This speed disadvantage arises from the fact that our framework requires an additional rationale as input, unlike the separate inference model framework that does not require any rationale before generating input.", + "perturbed_explanation": "The original explanation highlights two key aspects: 1) The LoRA adapter framework generates answers significantly faster (2x-14x) than the MM-CoT framework. 2) The speed advantage is due to not needing a rationale as input, unlike MM-CoT, which requires generating a rationale first. The statement introduces inaccuracies in the following ways: 1) It claims the LoRA framework is slower (2x–14x) than MM-CoT, which contradicts the actual performance speedup detailed earlier. 2) It incorrectly suggests that the LoRA framework needs an additional rationale as input, reversing the actual rationale requirement described." } }, { @@ -443,10 +467,13 @@ "[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", - "perturb_sentence_id": 11, + "perturb_sentence_id": [ + 11, + 12 + ], "output": { - "perturbed_statement": "[paragraph id = 11] gets a positive score because it is an incomplete sentence (Table 13). If there is no correction stage, the probability of is simply increased.", - "perturbed_explanation": "1. The original explanation states that the sentence receives a negative score for being incomplete; this emphasizes the importance of sentence completeness.2. The statement is incorrect because an incomplete sentence cannot logically receive a positive score, nor can its probability increase without the correction process. Such details contradict the reasoning provided." + "perturbed_statement": "[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 18). If there is no correction stage, the probability of is simply increased.", + "perturbed_explanation": "1. The original explanation indicates that the context discusses a negative score given to an incomplete sentence and the probability of it being reduced if not corrected. 2. The statement is incorrect because it references Table 18 instead of Table 13, and it claims that the probability of the incomplete sentence is increased without the correction stage, which contradicts the context where the probability decreases if the valuable part is ignored." } }, { @@ -479,10 +506,13 @@ "[paragraph id = 7] The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 )." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", - "perturb_sentence_id": 6, + "perturb_sentence_id": [ + 6, + 7 + ], "output": { - "perturbed_statement": "[paragraph id = 6] As a result of RL, we observe that rationale chains are consistently well-structured and complete (see Table 13 and 15). The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through SFT and appropriately guides it (Table 13 and 17).", - "perturbed_explanation": "1. The RL fine-tuning stage may lead to errors like repetitive or incomplete sentences, as elaborated in paragraph 1. 2. The SFT stage aims to address these errors, as mentioned in paragraph 2. 3. The statement incorrectly states that RL produces consistently well-structured outcomes, contradicting the fact that RL can result in repetitive or incomplete rationale chains. Additionally, it erroneously suggests that the model's fine-tuning guidance through SFT contradicts the actual sequence of processes leading to improved outputs." + "perturbed_statement": "[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are neither repetitive nor truncated (see Table 13 and 12 ). The SFT stage, with correction feedback, reflects a different direction from how the model is fine-tuned through RL and misguides it (Table 13 and 16 ).", + "perturbed_explanation": "This explanation addresses two aspects presented in the statement. 1. The first aspect is regarding the impact of RL fine-tuning. The original statement mentions that RL resulted in sentences within rationale chains being repetitive or truncated. 2. The second aspect involves the role of SFT. The SFT stage is designed to guide and correct the errors introduced during RL fine-tuning. The statement incorrectly claims that sentences are neither repetitive nor truncated, and that SFT misguides the model, both of which contradict the described scenario where RL leads to errors that SFT aims to correct." } }, { @@ -512,10 +542,13 @@ "[paragraph id = 6] We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Changed Empty Rationale Reasoning Through RL
\nProblem: Select the solid.
\nChoices: [rain, water in a fishbowl, hammer]
\nAnswer: hammer
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\n<empty>\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nRain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you pour water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood.The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you put water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A rain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood. Both iron and wood are solids.\n\n
\n
Table 14: Example of an initially empty rationale that begins generating rationale reasoning after the RL stage.
\n
", - "perturb_sentence_id": 4, + "perturb_sentence_id": [ + 4, + 5 + ], "output": { - "perturbed_statement": "[paragraph id = 4] In particular, 20% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", - "perturbed_explanation": "Original Explanation:\n1. The original statement specifies that 9.5% of ScienceQA problems lack rationale reasoning (solution) data.\n2. This is supported by Table 14, which details the model's performance changes before and after the RL stage.\nExplanation for statement:\n3. The statement incorrectly indicates that 20% of problems lack rationale reasoning, which contradicts the accurate percentage specified.\n4. This factual inconsistency misrepresents the proportion of problems affected, altering the displayed dataset characteristics." + "perturbed_statement": "[paragraph id = 4] In particular, 15% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14).", + "perturbed_explanation": "Original Explanation: In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward. \n\n1. The statement incorrectly claims that 15% of the problems have empty rationale reasoning data, whereas the correct percentage is 9.5%. \n2. This alteration changes the statistical representation of the data and may lead to misconceptions about the prevalence of problems lacking rationale reasoning in ScienceQA." } }, { @@ -543,10 +576,13 @@ "[paragraph id = 8] The process of elimination is a useful method for deriving answers when options are given." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Process of Elimination After the RL Stage
\nProblem: Which is a compound sentence?
\nChoices: [\"Last night we heard an owl outside in the oak tree.\",
               \"I ate all of my dinner, so I can have some pie.\"]
\nAnswer: \"I ate all of my dinner, so I can have some pie.\"
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie. The second sentence does not have a comma, so it is not a compound sentence. Last night we heard an owl outside in the oak tree.\n\n
\n
Table 15: Example of the process of elimination rationale reasoning after the RL stage.
\n
", - "perturb_sentence_id": 7, + "perturb_sentence_id": [ + 7, + 8 + ], "output": { - "perturbed_statement": "[paragraph id = 7] Second, the training model begins to explain why other options are the answer (See Table 15). The process of elimination is a useful method for deriving answers when options are given.", - "perturbed_explanation": "The original explanation states that the training model explains why certain options are not the answer. The statement now suggests the training model explains why other options are the answer, creating an inconsistency: 3. The methodology described emphasizes identifying incorrect options rather than determining alternative correct options." + "perturbed_statement": "[paragraph id = 7] Second, the training model begins to explain why other options are the answer (See Table 15 ). The process of elimination is a useful method for deriving answers when options are given.", + "perturbed_explanation": "The process of elimination is a useful method for deriving answers when options are given because it allows individuals to identify and discard alternatives that do not meet the given criteria or constraints. 2. However, the statement incorrectly claims that the training model explains why other options are the answer, which contradicts the concept of elimination that focuses on discarding incorrect options rather than validating them as correct answers." } }, { @@ -588,10 +624,13 @@ "[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Comparison of Corrected Rationales After RL
\n\nImage\n\n\n\"[Uncaptioned\n\n
\n\nQuestion\n\n\n\nWill these magnets attract or repel each other?\n\n
\n\nChoices\n\n\n\nrepel, attract\n\n
\n\nAfter RL\n\n\n\nWill these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are the same repel. So, these magnets will repel each other. The picture shows two magnets that are the same shape and the same size. Both magnets are the same size. So, these magnets will repel each other.\n\n
\n\nCorrected Rationale\n\n\n\nThe magnets will attract each other because the north pole of one magnet is closest to the south pole of the other magnet. Opposite poles of magnets attract each other in this diagram. Therefore, the correct answer is that the magnets will attract.\n\n
\n\nQuestion\n\n\n\nHow long is a pen?\n\n
\n\nChoices\n\n\n\n20 kilometers, 20 centimeters\n\n
\n\nAfter RL\n\n\n\nThe better estimate for the length of a pen is 20 centimeters. 20 kilometers is too long. 20 centimeters is too short. 20 kilometers is too long.\n\n
\n\nCorrected Rationale\n\n\n\nThe better estimate for the length of a pen is 20 centimeters. 20 kilometers is an unreasonable estimate, as it is far too long for a typical pen. 20 centimeters is a more accurate and reasonable estimate for the length of a standard pen.\n\n
\n\nImage\n\n\n\"[Uncaptioned\n\n
\n\nQuestion\n\n\n\nWhat is the capital of South Carolina?\n\n
\n\nChoices\n\n\n\nColumbia, Montgomery, Charleston, Harrisburg\n\n
\n\nAfter RL\n\n\n\nColumbia is the capital of South Carolina.nThe city of Columbia is Charleston, South Carolina. Charleston is the smallest city in South Carolina.nHayeburg is the smallest city in South Carolina. Montgomery is the smallest city in South Carolina. But the city of Columbia is home to many cultural and historical landmarks.\n\n
\n\nCorrected Rationale\n\n\n\nColumbia is the correct answer as it is the capital city of South Carolina. It is the largest city in the state and serves as the seat of government. The other options, Montgomery (the capital of Alabama), Charleston (a city in South Carolina), and Harrisburg (the capital of Pennsylvania), are not the capital of South Carolina.\n\n
\n
Table 16: Examples of how the correction stage corrects mistakes after the RL stage.
\n
", - "perturb_sentence_id": 17, + "perturb_sentence_id": [ + 17, + 18 + ], "output": { - "perturbed_statement": "[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017) to encourage the model to deviate significantly from the original model. The KL divergence penalty further encourages deviation.", - "perturbed_explanation": "Original Explanation: Reinforcement learning (RL) is tuned using policy optimization methods such as PPO (Proximal Policy Optimization) to refine the model while ensuring its outputs remain within desirable boundaries. The KL divergence penalty typically serves to constrain the model away from overfitting or drastically changing from its prior state during training.1. The statement inaccurately suggests that RL intends to force the model to deviate significantly from its original parameters, which contradicts the principle of fine-tuning, aiming for controlled improvement without unnecessary departure from foundational training.2. The description of the KL divergence as promoting deviation directly opposes its role in ensuring restrictive and guided model behavior alterations." + "perturbed_statement": "[paragraph id = 17] Additionally, RL is primarily fine-tuned through DQN (Schulman et al., 2017) to prevent the model from deviating too much from the original model. The KL divergence penalty further prevents deviation.", + "perturbed_explanation": "1. The original explanation notes that RL is primarily fine-tuned through PPO (Proximal Policy Optimization), which efficiently balances exploration and exploitation. It also emphasizes the role of the KL divergence penalty in preventing the model's significant deviation from the base model. 2. The statement incorrectly claims that RL is primarily fine-tuned through DQN (Deep Q-Network). DQN is not used for fine-tuning in the context described; PPO is the correct method used, according to the cited reference (Schulman et al., 2017). This introduces a factual error in the description of the fine-tuning process." } }, { @@ -631,10 +670,13 @@ "[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Comparison of Generated Rationales
\n\nImage\n\n\n\"[Uncaptioned\n\n
\n\nQuestion\n\n\n\nWhich property do these three objects have in common?\n\n
\n\nChoices\n\n\n\nrough, yellow, sticky\n\n
\n\nOriginal \n
Solution
\n
\n
\n\nLook at each object. For each object, decide if it has that property. A rough object feels scratchy when you touch it. The rain boots and the rubber duck are not rough. A sticky object can attach or stick to other things. The rain boots and the rubber duck are not sticky. Yellow is a color. This color is yellow. All three objects are yellow. The property that all three objects have in common is yellow.\n\n
\n\n\n\n\n\nLook at each object. For each object, decide if it has that property. A sticky object can attach or stick to other things. The rubber duck is not sticky. Yellow is a color. This color is yellow. The rubber duck and the rain boots are yellow, but the lemon is not. A rough object feels scratchy when you touch it. All three objects are rough. The property that all three objects have in common is rough.\n\n
\n\nCorrection \n
feedback on
\n
\n
\n\nThe property that all three objects have in common is yellow. The rubber duck and the rain boots are both yellow, and the lemon is also yellow. Although the objects may have other properties like roughness, the common property among them is the color yellow.\n\n
\n\n (Ours)\n\n\n\nThe three objects shown - rubber duck, rain boots, and lemon - all share the common property of being yellow in color. The rubber duck is made of a smooth, non-sticky material, the rain boots are also a yellow-colored object, and the lemon is also a yellow-colored liquid. Therefore, the correct answer is that the common property among these three objects is that they are yellow. The other options, rough and sticky, do not apply to all three objects, as the rubber duck is not rough, and the rain boots are not sticky.\n\n
\n\nQuestion\n\n\n\nWhich word would you find on a dictionary page with the following guide words? lever - loan\n\n
\n\nChoices\n\n\n\nlamb, liquid\n\n
\n\nOriginal \n
Solution
\n
\n
\n\nPut the words in alphabetical order. Since liquid is between the guide words lever - loan, it would be found on that page.\n\n
\n\n\n\n\n\nPut the words in alphabetical order. Since lamb is between the guide words lever - loan, it would be found on that page.\n\n
\n\nCorrection \n
feedback on
\n
\n
\n\nThe word \"liquid\" would be found on the dictionary page with the guide words \"lever - loan\" because it falls alphabetically between those two words.\n\n
\n\n (Ours)\n\n\n\nThe word \"liquid\" would be found on the dictionary page with the guide words \"lever - loan\" because it falls alphabetically between those two words. Therefore, the word \"liquid\" would be the word found on that dictionary page.\n\n
\n
Table 17: Comparison of generated rationales by baseline, baseline with correction feedback, and our model (ARES).
\n
", - "perturb_sentence_id": 16, + "perturb_sentence_id": [ + 16, + 17 + ], "output": { - "perturbed_statement": "[paragraph id = 16] However, the corrected reasoning for MM-CoT without the RL stage contains excessive information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples). Table 4 also demonstrates that inference accuracy gradually degrades as each part of ARES is executed.", - "perturbed_explanation": "The original explanation lists that MM-CoT without RL stage suffers from a lack of information compared to ARES that employs RL and that inference accuracy for ARES enhances progressively. However, 1) the information content for MM-CoT without RL should not be described as `excessive` since it's observed as being insufficient, and 2) the inference accuracy for ARES improves with execution, rather than degrading, which directly contradicts the observations presented in the original texts." + "perturbed_statement": "[paragraph id = 16] Nonetheless, the corrected reasoning for MM-CoT with the RL stage has insufficient information compared to the reasoning of ARES without RL (refer to Table 17 for more examples). Table 4 also shows that inference accuracy remains the same as each part of ARES is executed.", + "perturbed_explanation": "The original explanation highlights: 1. The statement discusses comparison in reasoning between MM-CoT and ARES, specifically emphasizing that the RL stage in ARES enhances reasoning, unlike MM-CoT without the RL stage, which has insufficient information. 2. In Table 4, it is stated that inference accuracy for ARES gradually improves as each part is executed. The statement is incorrect because: 3. It falsely claims that MM-CoT with the RL stage has insufficient information compared to ARES without RL, which contradicts the context where ARES's reasoning benefits from RL. 4. It incorrectly claims that Table 4 shows inference accuracy remains the same as ARES components execute, while the context indicates improvement with execution." } } ] \ No newline at end of file