diff --git "a/table_result/2407.00087v2_output.json" "b/table_result/2407.00087v2_output.json" new file mode 100644--- /dev/null +++ "b/table_result/2407.00087v2_output.json" @@ -0,0 +1,640 @@ +[ + { + "path": "table_paper/2407.00087v2.json", + "table_id": "1", + "section": "4.3", + "all_context": [ + "We check whether ARES improves the quality of rationale reasoning compared to the baseline model.", + "GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.", + "We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).", + "We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", + "Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets.", + "" + ], + "target_context_ids": [ + 0, + 1, + 2, + 3, + 4 + ], + "selected_paragraphs": [ + "[paragraph id = 0] We check whether ARES improves the quality of rationale reasoning compared to the baseline model.", + "[paragraph id = 1] GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.", + "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).", + "[paragraph id = 3] We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", + "[paragraph id = 4] Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets." + ], + "table_html": "
\n

\n\n\n\n\n\nScienceQA\nWin Rate\n\n\n\n vs \n69.76%\n\n vs \n73.76%\n\nA-OKVQA\nWin Rate\n\n vs \n69.11%\n\n vs \n66.96%\n\n\n

\n
Table 1: We train baseline models, MM-CoT, with the ARES pipeline and ask GPT-4o to evaluate which rationale reasoning is better. We compare each baseline for two model sizes (Base and Large) and two tasks (ScienceQA and A-OKVQA).
\n
", + "perturb_sentence_id": 2, + "output": { + "perturbed_statement": "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ). We conduct our experiments with two different model sizes, Flan-Mini and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", + "perturbed_explanation": "1. The original explanation states that the experiments were conducted with Flan-Base and Flan-Large models, ensuring a fair comparison and control over model scale variations. 2. The statement is incorrect because it lists Flan-Mini and Flan-Large as the model sizes instead, which misrepresents the actual experimental setup as only Flan-Base and Flan-Large models were utilized." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "2", + "section": "4.4", + "all_context": [ + "We investigate whether the improved rationale also contributes to answer inference accuracy.", + "Table 2 shows the main results of answer inference on the ScienceQA.", + "We evaluate our base model against the MM-CoT baseline.", + "achieves a 2.79% improvement compared to the corresponding baseline ().", + "The large model () shows some minimal improvement compared to the corresponding baseline.", + "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", + "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", + "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", + "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", + "Table 3 shows the results of answer inference on the A-OKVQA.", + "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", + "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", + "ARES achieves 4.45% gains against and 2.35% for .", + "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", + "Figure 3 shows the rationale reasoning for 4 cases.", + "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", + "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", + "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", + "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", + "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "" + ], + "target_context_ids": [ + 1, + 4, + 5, + 7, + 12, + 13, + 18, + 19, + 20 + ], + "selected_paragraphs": [ + "[paragraph id = 1] Table 2 shows the main results of answer inference on the ScienceQA.", + "[paragraph id = 4] The large model () shows some minimal improvement compared to the corresponding baseline.", + "[paragraph id = 5] However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", + "[paragraph id = 7] The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES 
pipeline rounds.", + "[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .", + "[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", + "[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", + "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03" + ], + "table_html": "
\n
\n

\n\n\n\n\n\nModel\nSize\nNAT\nSOC\nLAN\nTXT\nIMG\nNO\nG1-6\nG7-12\nAvg\n\nHuman\n-\n90.23\n84.97\n87.48\n89.60\n87.50\n88.10\n91.59\n82.42\n88.40\n\nMCAN (Yu et al., 2019)\n95M\n56.08\n46.23\n58.09\n59.43\n51.17\n55.40\n51.65\n59.72\n54.54\n\nTop-Down (Anderson et al., 2018)\n70M\n59.50\n54.33\n61.82\n62.90\n54.88\n59.79\n57.27\n62.16\n59.02\n\nBAN (Kim et al., 2018)\n112M\n60.88\n46.57\n66.64\n62.61\n52.60\n65.51\n56.83\n63.94\n59.37\n\nDFAF (Peng et al., 2019)\n74M\n64.03\n48.82\n63.55\n65.88\n54.49\n64.11\n57.12\n67.17\n60.72\n\nViLT (Kim et al., 2021)\n113M\n60.48\n63.89\n60.27\n63.20\n61.38\n57.00\n60.72\n61.90\n61.14\n\nPatch-TRM (Lu et al., 2022b)\n90M\n65.19\n46.79\n65.55\n66.96\n55.28\n64.95\n58.04\n67.50\n61.42\n\nVisualBERT (Li et al., 2019)\n111M\n59.33\n69.18\n61.18\n62.71\n62.17\n58.54\n62.96\n59.92\n61.87\n\nUnifiedQABase (Khashabi et al., 2020)\n223M\n68.16\n69.18\n74.91\n63.78\n61.38\n77.84\n72.98\n65.00\n70.12\n\nUnifiedQABase w/ CoT (Lu et al., 2022a)\n223M\n71.00\n76.04\n78.91\n66.42\n66.53\n81.81\n77.06\n68.82\n74.11\n\nLLaMA-Adapter (Zhang et al., 2023a)\n6B\n84.37\n88.30\n84.36\n83.72\n80.32\n86.90\n85.83\n84.05\n85.19\n\nLLaVA (Liu et al., 2023a)\n13B\n90.36\n95.95*\n88.00\n89.49\n88.00\n90.66\n90.93\n90.90*\n90.92\n\nInstructBLIP (Dai et al., 2023)\n11B\n-\n-\n-\n-\n90.70*\n-\n-\n-\n-\n\n (Zhang et al., 2023b)\n251M+251M\n84.59\n92.46\n83.45\n83.87\n83.29\n85.64\n86.34\n85.23\n85.95\n\n (Ours)\n251M+30M\n87.92\n92.58\n85.91\n86.61\n85.82\n88.36\n88.88\n87.48\n88.38\n\n (Zhang et al., 2023b)\n790M+790M\n90.76\n93.59\n86.55\n89.69\n87.85\n89.55\n90.90\n89.12\n90.26\n\n (Ours)\n790M+76M\n91.21*\n92.80\n89.45*\n90.27*\n88.35\n91.22*\n91.48*\n90.38\n91.09*\n\n\n

\n
\n
Table 2: \nMain results on the ScienceQA test set (%). Size = backbone size. Question classes: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. Other results are sourced from Lu et al. (2022a) and Zhang et al. (2023b). Results in bold represent the better performance compared to the corresponding baseline. (*) indicates the best performance.\n
\n
", + "perturb_sentence_id": 20, + "output": { + "perturbed_statement": "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 64.83", + "perturbed_explanation": "The original explanation establishes the accuracy metrics of the models as follows: IPVR (OPT-66B) scored 48.6, ViLBERT scored 49.1, followed by other models achieving 60.96, 65.41, 65.68, and 68.03 sequentially. The statement inaccurately describes the final model's accuracy as 64.83 instead of the correct 68.03. This misrepresentation introduces an erroneous sequence, disrupting the claim's alignment with provided information. This demonstrates the importance of maintaining accurate data representation in summaries." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "3", + "section": "4.4", + "all_context": [ + "We investigate whether the improved rationale also contributes to answer inference accuracy.", + "Table 2 shows the main results of answer inference on the ScienceQA.", + "We evaluate our base model against the MM-CoT baseline.", + "achieves a 2.79% improvement compared to the corresponding baseline ().", + "The large model () shows some minimal improvement compared to the corresponding baseline.", + "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", + "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", + "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", + "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", + "Table 3 shows the results of answer inference on the A-OKVQA.", + "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", + "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", + "ARES achieves 4.45% gains against and 2.35% for .", + "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", + "Figure 3 shows the rationale reasoning for 4 cases.", + "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", + "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", + "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", + "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", + "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "" + ], + "target_context_ids": [ + 10, + 11, + 12, + 13 + ], + "selected_paragraphs": [ + "[paragraph id = 10] We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", + "[paragraph id = 11] In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", + "[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .", + "[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study." + ], + "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\nIPVR (OPT-66B)\n48.6\n\nViLBERT\n49.1\n\n\n60.96\n\n (Ours)\n65.41\n\n\n65.68\n\n (Ours)\n68.03\n\n\n

\n
Table 3: Results of ARES on A-OKVQA. We mainly compare different-sized MM-CoT baselines (Zhang et al., 2023b). We retrain the MM-CoTs and run the ARES pipeline on these models. We evaluate these models on the validation set because the test set is hidden.
\n
", + "perturb_sentence_id": 11, + "output": { + "perturbed_statement": "[paragraph id = 11] In our experiments, MM-CoT models perform around 15% better than the reported accuracy in (Zhang et al., 2023b ).ARES achieves 6.75% gains against and 3.15% for .", + "perturbed_explanation": "1. The original statement claims a 10% improvement by MM-CoT models in accuracy over reported metrics and outlines specific gains achieved by ARES as 4.45% and 2.35% in respective contexts. 2. The statement inaccurately attributes MM-CoT models with a 15% performance improvement and modifies ARES gains to 6.75% and 3.15%, which conflicts with the consistent data representation in the study." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "4", + "section": "4.4", + "all_context": [ + "We investigate whether the improved rationale also contributes to answer inference accuracy.", + "Table 2 shows the main results of answer inference on the ScienceQA.", + "We evaluate our base model against the MM-CoT baseline.", + "achieves a 2.79% improvement compared to the corresponding baseline ().", + "The large model () shows some minimal improvement compared to the corresponding baseline.", + "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", + "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", + "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", + "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", + "Table 3 shows the results of answer inference on the A-OKVQA.", + "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", + "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", + "ARES achieves 4.45% gains against and 2.35% for .", + "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", + "Figure 3 shows the rationale reasoning for 4 cases.", + "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", + "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", + "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", + "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", + "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "" + ], + "target_context_ids": [ + 17, + 18, + 19 + ], + "selected_paragraphs": [ + "[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", + "[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES." + ], + "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\n\n\n\n85.95\n\n1st RL\n86.70\n\n1st ARES\n87.81\n\n1st ARES & 2nd RL\n87.88\n\n2nd ARES\n88.38\n\n\n

\n
Table 4: Ablation study: The accuracy gradually improves as each stage of ARES is added.
\n
", + "perturb_sentence_id": 19, + "output": { + "perturbed_statement": "[paragraph id = 19] 1st ARES & 2nd RL represents the second ARES on 1st RL, and finally, 2nd RL refers to two rounds of RL.Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "perturbed_explanation": "Original Explanation: The statement provides the mapping of iterations between RL and ARES processes and their related accuracy metrics. The context specifies that '1st ARES & 2nd RL' indicates the second RL following the execution of 1st ARES. Furthermore, '2nd ARES' refers to completing a second round of ARES execution. Explanation: '1st ARES & 2nd RL' being described as 'the second ARES on 1st RL' is incorrect as it implies a contradictory order of operations, while '2nd RL' equating to two RL rounds misrepresents the definition and separate context of 2nd ARES." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "5", + "section": "2.2", + "all_context": [ + "Reinforcement Learning (RL) fine-tunes our model to maximize sum of sentence rewards from an advanced AI model such as GPT-4 and Claude 3 Opus.", + "The RL objective is as follows: where is a discount factor.", + "We use Proximal Policy Optimization (PPO) (Schulman et al., 2017 ) to achieve this RL objective, treating sentences as actions (Equation 3 ).", + "where is the original policy (baseline model) and is an advantage estimator at timestep .", + "PPO is commonly leveraged in Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ).", + "PPO s conservative update prevents the training model from deviating too far from the original model, thus avoiding degeneration.", + "Sentence-Level Nuanced Feedback: We request a score between and for each sentence in CoT through the advanced AI for RL.", + "The closer the score is to , the more relevant and helpful it is to solving the problem.", + "Table 5 presents the prompt format.", + "We additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).", + "Therefore, the actual range is from to .", + "Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 ).", + "This allows us to understand the direction in which the model is intended to change through RL.", + "Advantages of Using Advanced AI for Score Feedback: Although calling the API has disadvantages, such as incurring costs or facing usage limits, there exist several advantages to using the advanced AI for feedback.", + "First, there is no need to train a reward model.", + "Second, as the RL fine-tuned model begins to generate out-of-distribution outputs that differ from the data used to train the reward model, it becomes challenging for the trained reward model to provide accurate rewards.", + "However, this out-of-distribution problem is effectively addressed with the advanced AI.", + "RL Challenge: One of the challenging factors for RL is hyperparameter tuning (Eimer et al., 2023 ).", + "This often results in generating repetitive words and truncated sentences (Ouyang et al., 2022 ).", + "Additionally, as the model size increases, finding working hyperparameters becomes infeasible for individuals.", + "To alleviate this issue, we utilize correction feedback from the advanced AI as the second stage (Section 2.3 ), and proceed with the supervised fine-tuning to stabilize the RL fine-tuned model.", + "" + ], + "target_context_ids": [ + 8, + 9, + 10, + 11 + ], + 
"selected_paragraphs": [ + "[paragraph id = 8] Table 5 presents the prompt format.", + "[paragraph id = 9] We additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).", + "[paragraph id = 10] Therefore, the actual range is from to .", + "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 )." + ], + "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", + "perturb_sentence_id": 9, + "output": { + "perturbed_statement": "[paragraph id = 9] We additionally shift the reward distribution to center it at a mean value of 10 (Zheng et al., 2023 ). Therefore, the actual range extends to 20.", + "perturbed_explanation": "The original explanation is as follows: The reward distribution is stated to be adjusted to center it around its mean, however, specific numeric values might have been mentioned for both the center and range limits. Deviating from these specifics alters the accuracy in referencing key findings mentioned in the corresponding document. It is essential to ensure consistency between the textual statement and the documented contexts and findings to preserve clarity and reliability. Hence, verifying these particular aspects is vital for accurate comprehension and application of the stated information." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "5", + "section": "3", + "all_context": [ + "Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.", + "We choose ScienceQA because it contains reasoning chains to derive the answer.", + "Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).", + "In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).", + "Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.", + "MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).", + "This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).", + "We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.", + "Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.", + "We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.", + "We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.", + "To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.", + "(See Table 5 .)", + "In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", + "We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", + "The RL stage often makes the training model generate repetitive sentences.", + "This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", + "To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated 
rationale to the prompt.", + "(See Appendix C.2 .)", + "Training Details: For the RL stage, we use a learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.", + "The learning rate for is with epochs for PPO and a batch size of for both tasks.", + "We proceed with 2 rounds of our pipeline for and 2 rounds for for ScienceQA.", + "For A-OKVQA, we proceed with 1 round for both model sizes.", + "For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.", + "Additionally, we replace MM-CoT s inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).", + "The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", + "For more detailed settings, please refer to Appendix C .", + "Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.", + "First, we evaluate ARES s rationale reasoning quality against baseline models since we enhance our model based on them.", + "For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).", + "The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).", + "Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).", + "This evaluation is also performed on both model sizes and tasks.", + "Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .", + "" + ], + "target_context_ids": [ + 12, + 13, + 14, + 15, + 16, + 17, + 18 + ], + "selected_paragraphs": [ + "[paragraph id = 12] (See Table 5 .)", + "[paragraph id = 13] In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", + "[paragraph id = 14] We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", + "[paragraph id = 15] The RL stage often makes the training model generate repetitive sentences.", + "[paragraph id = 16] This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", + "[paragraph id = 17] To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.", + "[paragraph id = 18] (See Appendix C.2 .)" + ], + "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", + "perturb_sentence_id": 12, + "output": { + "perturbed_statement": "[paragraph id = 12] (See Appendix C.2 .) In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", + "perturbed_explanation": "The original explanation mentions that the rationale chains in prior steps are maintained, and Appendix C.2 is referred to for details about reducing repetitive sentences during the RL stage. However: 1. Appendix C.2 does not pertain to the SFT stage dataset correction but to repetitive sentence removal during the RL stage. 2. Associating Appendix C.2 with the SFT stage introduces misalignment in the documentation and potential confusion for readers seeking accurate references." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "8", + "section": "3", + "all_context": [ + "Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.", + "We choose ScienceQA because it contains reasoning chains to derive the answer.", + "Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).", + "In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).", + "Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.", + "MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).", + "This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).", + "We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.", + "Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.", + "We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.", + "We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.", + "To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.", + "(See Table 5 .)", + "In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", + "We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", + "The RL stage often makes the training model generate repetitive sentences.", + "This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", + "To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.", + "(See Appendix C.2 .)", + "Training Details: For the RL stage, we use a 
learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.", + "The learning rate for is with epochs for PPO and a batch size of for both tasks.", + "We proceed with 2 rounds of our pipeline for and 2 rounds for for ScienceQA.", + "For A-OKVQA, we proceed with 1 round for both model sizes.", + "For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.", + "Additionally, we replace MM-CoT s inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).", + "The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", + "For more detailed settings, please refer to Appendix C .", + "Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.", + "First, we evaluate ARES s rationale reasoning quality against baseline models since we enhance our model based on them.", + "For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).", + "The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).", + "Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).", + "This evaluation is also performed on both model sizes and tasks.", + "Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .", + "" + ], + "target_context_ids": [ + 25, + 26 + ], + "selected_paragraphs": [ + "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", + "[paragraph id = 26] For more detailed settings, please refer to Appendix C ." + ], + "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for A-OKVQA test set
\n
", + "perturb_sentence_id": 25, + "output": { + "perturbed_statement": "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a small number of weights, enabling 50x–70x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ). For more detailed settings, please refer to Appendix C.", + "perturbed_explanation": "The original statement describes the performance improvements in inference time when utilizing the LoRA adapter relative to MM-CoT, specifying a range of 2x–14x based on experimental observations. 1. The inference speed-up values involve precise calculations obtained from benchmarking different models, with specific ranges assigned. 2. Claiming a significantly higher range of 50x–70x misrepresents the reported findings and could lead to misconceptions about the actual levels of improvement achievable using the technology outlined." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "8", + "section": "5", + "all_context": [ + "Chain-of-Thought (CoT) is a multi-step reasoning method for problem-solving that encourages LLMs to consider the intermediate reasoning steps.", + "Zero-Shot-CoT (Kojima et al., 2023 ) promotes CoT by using prompts such as \"Let s think step by step\" for LLMs.", + "For Few-Shot-CoT (Zhang et al., 2022b ; Wei et al., 2023 ), a few examples with reasoning processes are provided, allowing the model to refer to these examples and understand how to perform CoT.", + "Wei et al.", + "(2023 ) reveal that this CoT technique positively impacts the performance of large models (B), but has minimal effect on smaller models.", + "MM-CoT (Zhang et al., 2023b ) suggest that CoT is beneficial even for relatively small models, such as 200M, if the model that generates intermediate reasoning and the model that infers the answer are separated.", + "We find that simply adding a LoRA adapter (Hu et al., 2021 ) to the reasoning model results in comparable performance.", + "This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).", + "This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input.", + "Reinforcement Learning from Human Feedback (RLHF) (Glaese et al., 2022 ; Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ) align LLMs with user preferences.", + "Ouyang et al.", + "(2022 ) collects ranked feedback from human labelers and uses this feedback to perform Reinforcement Learning (RL).", + "Constitutional AI (CAI) (Bai et al., 2022 ) collects ranked AI feedback rather than costly human feedback and handles harmfulness with RL.", + "Both approaches learn outcome-supervised reward models (ORM) using ranking-based feedback.", + "Lightman et al.", + "(2023 ), instead, propose a process-supervised reward model (PRM) that leverages sentence-level feedback for CoT.", + "Lightman et al.", + "(2023 ); Luo et al.", + "(2024 ) evaluate each trained ORM and PRM with searching algorithms such as best-of- or Monte Carlo Tree Search (MCTS) by selecting the highest-scored solution, demonstrating that the PRM-selected solution outperforms the ORM-selected one.", + "Wang et al.", + "(2024 ) perform RL using PRM, providing heuristic 
sentence-level scores for math problems that are simple to grade.", + "As an LLM is trained with RL and starts generating outputs different from the original distribution, these reward models would not correctly provide rewards (Pitis, 2023 ; Byun and Perrault, 2024 ).", + "Instead of training a reward model for a more general task, we perform RL by requesting sentence-level rewards from advanced AI models such as GPT-4.", + "Model Accuracy 85.95 1st RL 86.70 1st ARES 87.81 1st ARES & 2nd RL 87.88 2nd ARES 88.38", + "" + ], + "target_context_ids": [ + 7, + 8 + ], + "selected_paragraphs": [ + "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).", + "[paragraph id = 8] This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input." + ], + "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for A-OKVQA test set
\n
", + "perturb_sentence_id": 7, + "output": { + "perturbed_statement": "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 1.5x–7x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ). This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input.", + "perturbed_explanation": "Original Explanation: The statement details the performance benefits of the framework compared to MM-CoT in terms of answer generation speed, supported by specific speed ratios reported in Table 7 and Table 8. 1. The mentioned speed advantage as 1.5x–7x deviates from the actual reported range of 2x–14x, therefore misrepresenting the evaluated benchmarks mentioned in the context. Accurate reference to reported data is crucial for conveying findings precisely." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "13", + "section": "2.3", + "all_context": [ + "The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.", + "However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.", + "(See examples in Appendix D .)", + "Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.", + "We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .", + "Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.", + "This reduces the burden of RL s exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.", + "How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.", + "The direction of learning is determined by the reward (scalar) value.", + "However, the opposite direction of the reward is sometimes required.", + "For example, suppose there is a truncated sentence in CoT.", + "gets a negative score because it is an incomplete sentence (Table 13 ).", + "If there is no correction stage, the probability of is simply reduced.", + "What if contains some valuable part?", + "This valuable part is ignored, and its probability decreases.", + "To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", + "Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", + "Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model.", + "The KL divergence penalty further prevents deviation.", + 
"However, this penalty often causes the model s degeneration.", + "As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.", + "While this aims to align the training model with specific preferences, it tends to anchor the model to the pretraining dataset.", + "Instead, we perform supervised fine-tuning through the Teacher s correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.", + "" + ], + "target_context_ids": [ + 11, + 12, + 13, + 14, + 15, + 16 + ], + "selected_paragraphs": [ + "[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 13 ).", + "[paragraph id = 12] If there is no correction stage, the probability of is simply reduced.", + "[paragraph id = 13] What if contains some valuable part?", + "[paragraph id = 14] This valuable part is ignored, and its probability decreases.", + "[paragraph id = 15] To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", + "[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts." + ], + "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", + "perturb_sentence_id": 11, + "output": { + "perturbed_statement": "[paragraph id = 11] gets a positive score because it is an incomplete sentence (Table 13). If there is no correction stage, the probability of is simply increased.", + "perturbed_explanation": "1. The original explanation states that the sentence receives a negative score for being incomplete; this emphasizes the importance of sentence completeness.2. The statement is incorrect because an incomplete sentence cannot logically receive a positive score, nor can its probability increase without the correction process. Such details contradict the reasoning provided." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "13", + "section": "4.2", + "all_context": [ + "Despite the benefits of RL, hyperparameter tuning often requires massive effort.", + "Without meticulous tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.", + "To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.", + "SFT is more stable than RL.", + "We evaluate how well the SFT stage corrects errors caused by the RL stage for various RL hyperparameters.", + "We test various RL hyperparameters such as learning rate = {5e-6, 1e-5, 2e-5, 5e-5}, batch size = {2, 4, 8, 16, 32}, and PPO epoch = {5, 10, 15}.", + "As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).", + "The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 ).", + "However, excessive RL learning rates or epochs cause serious degeneration of the model, such as producing no output or generating strange words, and the results of correction feedback are also unreasonable.", + "ScienceQA Win Rate vs 69.76% vs 73.76% A-OKVQA Win Rate vs 69.11% vs 66.96% Model Size NAT SOC LAN TXT IMG NO G1-6 G7-12 Avg Human - 90.23 84.97 87.48 89.60 87.50 88.10 91.59 82.42 88.40 MCAN (Yu et al., 2019 ) 95M 56.08 46.23 58.09 59.43 51.17 55.40 51.65 59.72 54.54 Top-Down (Anderson et al., 2018 ) 70M 59.50 54.33 61.82 62.90 54.88 59.79 57.27 62.16 59.02 BAN (Kim et al., 2018 ) 112M 60.88 46.57 66.64 62.61 52.60 65.51 56.83 63.94 59.37 DFAF (Peng et al., 2019 ) 74M 64.03 48.82 63.55 65.88 54.49 64.11 57.12 67.17 60.72 ViLT (Kim et al., 2021 ) 113M 60.48 63.89 60.27 63.20 61.38 57.00 60.72 61.90 61.14 Patch-TRM (Lu et al., 2022b ) 90M 65.19 46.79 65.55 66.96 55.28 64.95 58.04 67.50 61.42 VisualBERT (Li et al., 2019 ) 111M 59.33 69.18 61.18 62.71 62.17 58.54 62.96 59.92 61.87 UnifiedQABase (Khashabi et al., 2020 ) 223M 68.16 69.18 74.91 63.78 61.38 77.84 72.98 65.00 70.12 UnifiedQABase w/ CoT (Lu et al., 2022a ) 223M 71.00 76.04 78.91 66.42 66.53 81.81 77.06 68.82 74.11 LLaMA-Adapter (Zhang et al., 2023a ) 6B 84.37 88.30 84.36 83.72 80.32 86.90 85.83 84.05 85.19 LLaVA (Liu et al., 2023a ) 13B 90.36 95.95* 88.00 89.49 88.00 90.66 90.93 90.90* 90.92 InstructBLIP (Dai et al., 2023 ) 11B - - - - 90.70* - - - - (Zhang et al., 2023b ) 251M+251M 84.59 92.46 83.45 83.87 83.29 85.64 86.34 85.23 85.95 (Ours) 251M+30M 87.92 92.58 85.91 86.61 85.82 88.36 88.88 87.48 88.38 (Zhang et al., 2023b ) 790M+790M 90.76 93.59 86.55 89.69 87.85 89.55 90.90 89.12 90.26 (Ours) 790M+76M 91.21* 92.80 89.45* 90.27* 88.35 91.22* 91.48* 90.38 91.09*", + "" + ], + "target_context_ids": [ + 1, + 2, + 6, + 7 + ], + "selected_paragraphs": [ + "[paragraph id = 1] Without meticulous 
tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.", + "[paragraph id = 2] To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.", + "[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).", + "[paragraph id = 7] The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 )." + ], + "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", + "perturb_sentence_id": 6, + "output": { + "perturbed_statement": "[paragraph id = 6] As a result of RL, we observe that rationale chains are consistently well-structured and complete (see Table 13 and 15). The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through SFT and appropriately guides it (Table 13 and 17).", + "perturbed_explanation": "1. The RL fine-tuning stage may lead to errors like repetitive or incomplete sentences, as elaborated in paragraph 1. 2. The SFT stage aims to address these errors, as mentioned in paragraph 2. 3. The statement incorrectly states that RL produces consistently well-structured outcomes, contradicting the fact that RL can result in repetitive or incomplete rationale chains. Additionally, it erroneously suggests that the model's fine-tuning guidance through SFT contradicts the actual sequence of processes leading to improved outputs." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "14", + "section": "4.1", + "all_context": [ + "Through RL, a training model is aligned to a specific preference.", + "Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.", + "However, this process produces some interesting additional results.", + "First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.", + "In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", + "The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", + "We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large.", + "Second, the training model begins to explain why other options are not the answer (See Table 15 ).", + "The process of elimination is a useful method for deriving answers when options are given.", + "" + ], + "target_context_ids": [ + 4, + 5, + 6 + ], + "selected_paragraphs": [ + "[paragraph id = 4] In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", + "[paragraph id = 5] The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", + "[paragraph id = 6] We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large." + ], + "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Changed Empty Rationale Reasoning Through RL
\nProblem: Select the solid.
\nChoices: [rain, water in a fishbowl, hammer]
\nAnswer: hammer
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\n<empty>\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nRain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you pour water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood.The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you put water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A rain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood. Both iron and wood are solids.\n\n
\n
Table 14: Example of an empty rationale reasoning that starts generating rationale reasoning after the RL stage.
\n
", + "perturb_sentence_id": 4, + "output": { + "perturbed_statement": "[paragraph id = 4] In particular, 20% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", + "perturbed_explanation": "Original Explanation:\n1. The original statement specifies that 9.5% of ScienceQA problems lack rationale reasoning (solution) data.\n2. This is supported by Table 14, which details the model's performance changes before and after the RL stage.\nExplanation for statement:\n3. The statement incorrectly indicates that 20% of problems lack rationale reasoning, which contradicts the accurate percentage specified.\n4. This factual inconsistency misrepresents the proportion of problems affected, altering the displayed dataset characteristics." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "15", + "section": "4.1", + "all_context": [ + "Through RL, a training model is aligned to a specific preference.", + "Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.", + "However, this process produces some interesting additional results.", + "First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.", + "In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", + "The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", + "We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large.", + "Second, the training model begins to explain why other options are not the answer (See Table 15 ).", + "The process of elimination is a useful method for deriving answers when options are given.", + "" + ], + "target_context_ids": [ + 7, + 8 + ], + "selected_paragraphs": [ + "[paragraph id = 7] Second, the training model begins to explain why other options are not the answer (See Table 15 ).", + "[paragraph id = 8] The process of elimination is a useful method for deriving answers when options are given." + ], + "table_html": "
Example of the Process of Elimination After the RL Stage
Problem: Which is a compound sentence?
Choices: ["Last night we heard an owl outside in the oak tree.",
          "I ate all of my dinner, so I can have some pie."]
Answer: "I ate all of my dinner, so I can have some pie."
Before RL [Rationale reasoning]: The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie.
After RL [Rationale reasoning]: The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie. The second sentence does not have a comma, so it is not a compound sentence. Last night we heard an owl outside in the oak tree.
Table 15: Example of the process of elimination rationale reasoning after the RL stage.
", + "perturb_sentence_id": 7, + "output": { + "perturbed_statement": "[paragraph id = 7] Second, the training model begins to explain why other options are the answer (See Table 15). The process of elimination is a useful method for deriving answers when options are given.", + "perturbed_explanation": "The original explanation states that the training model explains why certain options are not the answer. The statement now suggests the training model explains why other options are the answer, creating an inconsistency: 3. The methodology described emphasizes identifying incorrect options rather than determining alternative correct options." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "16", + "section": "2.3", + "all_context": [ + "The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.", + "However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.", + "(See examples in Appendix D .)", + "Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.", + "We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .", + "Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.", + "This reduces the burden of RL s exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.", + "How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.", + "The direction of learning is determined by the reward (scalar) value.", + "However, the opposite direction of the reward is sometimes required.", + "For example, suppose there is a truncated sentence in CoT.", + "gets a negative score because it is an incomplete sentence (Table 13 ).", + "If there is no correction stage, the probability of is simply reduced.", + "What if contains some valuable part?", + "This valuable part is ignored, and its probability decreases.", + "To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", + "Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", + "Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model.", + "The KL divergence penalty further prevents deviation.", + "However, this penalty often causes the model s degeneration.", + "As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.", + "While this aims to align the training model with specific preferences, it tends to anchor the model to the pretraining dataset.", 
+ "Instead, we perform supervised fine-tuning through the Teacher s correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.", + "" + ], + "target_context_ids": [ + 16, + 17 + ], + "selected_paragraphs": [ + "[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", + "[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model." + ], + "table_html": "
Comparison of Corrected Rationales After RL
Image: [Uncaptioned image]
Question: Will these magnets attract or repel each other?
Choices: repel, attract
After RL: Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are the same repel. So, these magnets will repel each other. The picture shows two magnets that are the same shape and the same size. Both magnets are the same size. So, these magnets will repel each other.
Corrected Rationale: The magnets will attract each other because the north pole of one magnet is closest to the south pole of the other magnet. Opposite poles of magnets attract each other in this diagram. Therefore, the correct answer is that the magnets will attract.
Question: How long is a pen?
Choices: 20 kilometers, 20 centimeters
After RL: The better estimate for the length of a pen is 20 centimeters. 20 kilometers is too long. 20 centimeters is too short. 20 kilometers is too long.
Corrected Rationale: The better estimate for the length of a pen is 20 centimeters. 20 kilometers is an unreasonable estimate, as it is far too long for a typical pen. 20 centimeters is a more accurate and reasonable estimate for the length of a standard pen.
Image: [Uncaptioned image]
Question: What is the capital of South Carolina?
Choices: Columbia, Montgomery, Charleston, Harrisburg
After RL: Columbia is the capital of South Carolina.\nThe city of Columbia is Charleston, South Carolina. Charleston is the smallest city in South Carolina.\nHayeburg is the smallest city in South Carolina. Montgomery is the smallest city in South Carolina. But the city of Columbia is home to many cultural and historical landmarks.
Corrected Rationale: Columbia is the correct answer as it is the capital city of South Carolina. It is the largest city in the state and serves as the seat of government. The other options, Montgomery (the capital of Alabama), Charleston (a city in South Carolina), and Harrisburg (the capital of Pennsylvania), are not the capital of South Carolina.
Table 16: Examples of how the correction stage corrects mistakes after the RL stage.
", + "perturb_sentence_id": 17, + "output": { + "perturbed_statement": "[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017) to encourage the model to deviate significantly from the original model. The KL divergence penalty further encourages deviation.", + "perturbed_explanation": "Original Explanation: Reinforcement learning (RL) is tuned using policy optimization methods such as PPO (Proximal Policy Optimization) to refine the model while ensuring its outputs remain within desirable boundaries. The KL divergence penalty typically serves to constrain the model away from overfitting or drastically changing from its prior state during training.1. The statement inaccurately suggests that RL intends to force the model to deviate significantly from its original parameters, which contradicts the principle of fine-tuning, aiming for controlled improvement without unnecessary departure from foundational training.2. The description of the KL divergence as promoting deviation directly opposes its role in ensuring restrictive and guided model behavior alterations." + } + }, + { + "path": "table_paper/2407.00087v2.json", + "table_id": "17", + "section": "4.4", + "all_context": [ + "We investigate whether the improved rationale also contributes to answer inference accuracy.", + "Table 2 shows the main results of answer inference on the ScienceQA.", + "We evaluate our base model against the MM-CoT baseline.", + "achieves a 2.79% improvement compared to the corresponding baseline ().", + "The large model () shows some minimal improvement compared to the corresponding baseline.", + "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", + "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", + "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", + "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", + "Table 3 shows the results of answer inference on the A-OKVQA.", + "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", + "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", + "ARES achieves 4.45% gains against and 2.35% for .", + "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", + "Figure 3 shows the rationale reasoning for 4 cases.", + "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", + "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", + "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", + "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", + "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", + "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", + "" + ], + "target_context_ids": [ + 16, + 17 + ], + "selected_paragraphs": [ + "[paragraph id = 16] However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES 
that performs RL (refer to Table 17 for more examples).", + "[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed." + ], + "table_html": "
Comparison of Generated Rationales
Image: [Uncaptioned image]
Question: Which property do these three objects have in common?
Choices: rough, yellow, sticky
Original Solution: Look at each object. For each object, decide if it has that property. A rough object feels scratchy when you touch it. The rain boots and the rubber duck are not rough. A sticky object can attach or stick to other things. The rain boots and the rubber duck are not sticky. Yellow is a color. This color is yellow. All three objects are yellow. The property that all three objects have in common is yellow.
Baseline (MM-CoT): Look at each object. For each object, decide if it has that property. A sticky object can attach or stick to other things. The rubber duck is not sticky. Yellow is a color. This color is yellow. The rubber duck and the rain boots are yellow, but the lemon is not. A rough object feels scratchy when you touch it. All three objects are rough. The property that all three objects have in common is rough.
Correction feedback on the baseline: The property that all three objects have in common is yellow. The rubber duck and the rain boots are both yellow, and the lemon is also yellow. Although the objects may have other properties like roughness, the common property among them is the color yellow.
ARES (Ours): The three objects shown - rubber duck, rain boots, and lemon - all share the common property of being yellow in color. The rubber duck is made of a smooth, non-sticky material, the rain boots are also a yellow-colored object, and the lemon is also a yellow-colored liquid. Therefore, the correct answer is that the common property among these three objects is that they are yellow. The other options, rough and sticky, do not apply to all three objects, as the rubber duck is not rough, and the rain boots are not sticky.
Question: Which word would you find on a dictionary page with the following guide words? lever - loan
Choices: lamb, liquid
Original Solution: Put the words in alphabetical order. Since liquid is between the guide words lever - loan, it would be found on that page.
Baseline (MM-CoT): Put the words in alphabetical order. Since lamb is between the guide words lever - loan, it would be found on that page.
Correction feedback on the baseline: The word "liquid" would be found on the dictionary page with the guide words "lever - loan" because it falls alphabetically between those two words.
ARES (Ours): The word "liquid" would be found on the dictionary page with the guide words "lever - loan" because it falls alphabetically between those two words. Therefore, the word "liquid" would be the word found on that dictionary page.
Table 17: Comparison of generated rationales by baseline, baseline with correction feedback, and our model (ARES).
", + "perturb_sentence_id": 16, + "output": { + "perturbed_statement": "[paragraph id = 16] However, the corrected reasoning for MM-CoT without the RL stage contains excessive information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples). Table 4 also demonstrates that inference accuracy gradually degrades as each part of ARES is executed.", + "perturbed_explanation": "The original explanation lists that MM-CoT without RL stage suffers from a lack of information compared to ARES that employs RL and that inference accuracy for ARES enhances progressively. However, 1) the information content for MM-CoT without RL should not be described as `excessive` since it's observed as being insufficient, and 2) the inference accuracy for ARES improves with execution, rather than degrading, which directly contradicts the observations presented in the original texts." + } + } +] \ No newline at end of file