[ { "path": "table_paper/2407.00087v2.json", "table_id": "1", "section": "4.3", "all_context": [ "We check whether ARES improves the quality of rationale reasoning compared to the baseline model.", "GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.", "We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).", "We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", "Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets.", "" ], "target_context_ids": [ 0, 1, 2, 3, 4 ], "selected_paragraphs": [ "[paragraph id = 0] We check whether ARES improves the quality of rationale reasoning compared to the baseline model.", "[paragraph id = 1] GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.", "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).", "[paragraph id = 3] We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.", "[paragraph id = 4] Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets." ], "table_html": "
\n

\n\n\n\n\n\nScienceQA\nWin Rate\n\n\n\n vs \n69.76%\n\n vs \n73.76%\n\nA-OKVQA\nWin Rate\n\n vs \n69.11%\n\n vs \n66.96%\n\n\n

\n
Table 1: We train baseline models, MM-CoT, with the ARES pipeline and ask GPT-4o to evaluate which rationale reasoning is better. We compare each baseline for two model sizes ( and ) and two tasks (ScienceQA and A-OKVQA).
\n
", "perturb_sentence_id": [ 2, 3 ], "output": { "perturbed_statement": "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).We conduct our experiments with two different model sizes, Flan-Base and Flan-Small with ViT feature, on ScienceQA and A-OKVQA.", "perturbed_explanation": "The original explanation is: This statement outlines the methodology used in evaluating rationale chains, which includes randomizing them for a fair comparison and testing with two model sizes, Flan-Base and Flan-Large with ViT feature, across ScienceQA and A-OKVQA datasets. 1. The statement mistakenly refers to one of the model sizes as Flan-Small instead of Flan-Large. This is incorrect because the model sizes actually used were Flan-Base and Flan-Large, not Flan-Small." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "2", "section": "4.4", "all_context": [ "We investigate whether the improved rationale also contributes to answer inference accuracy.", "Table 2 shows the main results of answer inference on the ScienceQA.", "We evaluate our base model against the MM-CoT baseline.", "achieves a 2.79% improvement compared to the corresponding baseline ().", "The large model () shows some minimal improvement compared to the corresponding baseline.", "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", "Table 3 shows the results of answer inference on the A-OKVQA.", "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", "ARES achieves 4.45% gains against and 2.35% for .", "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", "Figure 3 shows the rationale reasoning for 4 cases.", "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "" ], "target_context_ids": [ 1, 4, 5, 7, 12, 13, 18, 19, 20 ], "selected_paragraphs": [ "[paragraph id = 1] Table 2 shows the main results of answer inference on the ScienceQA.", "[paragraph id = 4] The large model () shows some minimal improvement compared to the corresponding baseline.", "[paragraph id = 5] However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", "[paragraph id = 7] The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES 
pipeline rounds.", "[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .", "[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", "[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03" ], "table_html": "
\n
\n

\n\n\n\n\n\nModel\nSize\nNAT\nSOC\nLAN\nTXT\nIMG\nNO\nG1-6\nG7-12\nAvg\n\nHuman\n-\n90.23\n84.97\n87.48\n89.60\n87.50\n88.10\n91.59\n82.42\n88.40\n\nMCAN (Yu et al., 2019 ###reference_b40###)\n95M\n56.08\n46.23\n58.09\n59.43\n51.17\n55.40\n51.65\n59.72\n54.54\n\nTop-Down (Anderson et al., 2018 ###reference_b1###)\n70M\n59.50\n54.33\n61.82\n62.90\n54.88\n59.79\n57.27\n62.16\n59.02\n\nBAN (Kim et al., 2018 ###reference_b16###)\n112M\n60.88\n46.57\n66.64\n62.61\n52.60\n65.51\n56.83\n63.94\n59.37\n\nDFAF (Peng et al., 2019 ###reference_b28###)\n74M\n64.03\n48.82\n63.55\n65.88\n54.49\n64.11\n57.12\n67.17\n60.72\n\nViLT (Kim et al., 2021 ###reference_b17###)\n113M\n60.48\n63.89\n60.27\n63.20\n61.38\n57.00\n60.72\n61.90\n61.14\n\nPatch-TRM (Lu et al., 2022b ###reference_b25###)\n90M\n65.19\n46.79\n65.55\n66.96\n55.28\n64.95\n58.04\n67.50\n61.42\n\nVisualBERT (Li et al., 2019 ###reference_b20###)\n111M\n59.33\n69.18\n61.18\n62.71\n62.17\n58.54\n62.96\n59.92\n61.87\n\nUnifiedQABase (Khashabi et al., 2020 ###reference_b15###)\n223M\n68.16\n69.18\n74.91\n63.78\n61.38\n77.84\n72.98\n65.00\n70.12\n\nUnifiedQABase w/ CoT (Lu et al., 2022a ###reference_b24###)\n223M\n71.00\n76.04\n78.91\n66.42\n66.53\n81.81\n77.06\n68.82\n74.11\n\nLLaMA-Adapter (Zhang et al., 2023a ###reference_b42###)\n6B\n84.37\n88.30\n84.36\n83.72\n80.32\n86.90\n85.83\n84.05\n85.19\n\nLLaVA (Liu et al., 2023a ###reference_b22###)\n13B\n90.36\n95.95*\n88.00\n89.49\n88.00\n90.66\n90.93\n90.90*\n90.92\n\nInstructBLIP (Dai et al., 2023 ###reference_b8###)\n11B\n-\n-\n-\n-\n90.70*\n-\n-\n-\n-\n\n (Zhang et al., 2023b ###reference_b45###)\n251M+251M\n84.59\n92.46\n83.45\n83.87\n83.29\n85.64\n86.34\n85.23\n85.95\n\n (Ours)\n251M+30M\n87.92\n92.58\n85.91\n86.61\n85.82\n88.36\n88.88\n87.48\n88.38\n\n (Zhang et al., 2023b ###reference_b45###)\n790M+790M\n90.76\n93.59\n86.55\n89.69\n87.85\n89.55\n90.90\n89.12\n90.26\n\n (Ours)\n790M+76M\n91.21*\n92.80\n89.45*\n90.27*\n88.35\n91.22*\n91.48*\n90.38\n91.09*\n\n\n

\n
\n
Table 2: \nMain results on the ScienceQA test set (%). Size = backbone size. Question classes: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. Other results are sourced from Lu et al. (2022a) and Zhang et al. (2023b). Results in bold represent the better performance compared to the corresponding baseline. (*) indicates the best performance.\n
\n
", "perturb_sentence_id": [ 20, 21 ], "output": { "perturbed_statement": "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 50.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "perturbed_explanation": "The original explanation notes that the model accuracy figures provided are specific and verifiable based on given context. 1. The IPVR (OPT-66B) model accuracy is stated as 48.6, which can be cross-referenced. 2. Changing IPVR (OPT-66B) accuracy to 50.6 introduces a factual error, as the original accuracy is documented as 48.6, not 50.6." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "3", "section": "4.4", "all_context": [ "We investigate whether the improved rationale also contributes to answer inference accuracy.", "Table 2 shows the main results of answer inference on the ScienceQA.", "We evaluate our base model against the MM-CoT baseline.", "achieves a 2.79% improvement compared to the corresponding baseline ().", "The large model () shows some minimal improvement compared to the corresponding baseline.", "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", "Table 3 shows the results of answer inference on the A-OKVQA.", "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", "ARES achieves 4.45% gains against and 2.35% for .", "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", "Figure 3 shows the rationale reasoning for 4 cases.", "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "" ], "target_context_ids": [ 10, 11, 12, 13 ], "selected_paragraphs": [ "[paragraph id = 10] We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", "[paragraph id = 11] In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", "[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .", "[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study." ], "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\nIPVR (OPT-66B)\n48.6\n\nViLBERT\n49.1\n\n\n60.96\n\n (Ours)\n65.41\n\n\n65.68\n\n (Ours)\n68.03\n\n\n

\n
Table 3: Results of ARES on A-OKVQA. We mainly compare different-sized MM-CoT baselines (Zhang et al., 2023b). We retrain the MM-CoTs and run the ARES pipeline on these models. We evaluate these models on the validation set because the test set is hidden.
\n
", "perturb_sentence_id": [ 11, 12 ], "output": { "perturbed_statement": "[paragraph id = 11] In our experiments, MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b ).ARES achieves 4.45% gains against XYZ baseline and 2.35% for another baseline.", "perturbed_explanation": "1. The statement claims MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b), whereas the context mentions a 10% improvement, not 20%. 2. The addition of 'against XYZ baseline and for another baseline' adds unintended specificity not present in the context, which simply discusses percentage gains without associating them with specific baselines." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "4", "section": "4.4", "all_context": [ "We investigate whether the improved rationale also contributes to answer inference accuracy.", "Table 2 shows the main results of answer inference on the ScienceQA.", "We evaluate our base model against the MM-CoT baseline.", "achieves a 2.79% improvement compared to the corresponding baseline ().", "The large model () shows some minimal improvement compared to the corresponding baseline.", "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", "Table 3 shows the results of answer inference on the A-OKVQA.", "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", "ARES achieves 4.45% gains against and 2.35% for .", "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", "Figure 3 shows the rationale reasoning for 4 cases.", "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "" ], "target_context_ids": [ 17, 18, 19 ], "selected_paragraphs": [ "[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", "[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES." ], "table_html": "
\n

\n\n\n\n\n\nModel\nAccuracy\n\n\n\n\n85.95\n\n1st RL\n86.70\n\n1st ARES\n87.81\n\n1st ARES & 2nd RL\n87.88\n\n2nd ARES\n88.38\n\n\n

\n
Table 4: Ablation study: The accuracy gradually improves as each stage of ARES is added.
\n
", "perturb_sentence_id": [ 19, 20 ], "output": { "perturbed_statement": "[paragraph id = 19] 1st ARES & 2nd RL represents the second ARES on 1st RL, and finally, 2nd ARES refers to one round of ARES.Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "perturbed_explanation": "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES. This implies a sequential progression through different stages involving ARES and RL. 2. The statement incorrectly states that '1st ARES & 2nd RL represents the second ARES on 1st RL' and '2nd ARES refers to one round of ARES'. In contrast, it should correctly describe the order of operations as involving a second RL following the 1st ARES and two rounds of ARES for the '2nd ARES' phase, rather than one round as mistakenly mentioned." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "5", "section": "2.2", "all_context": [ "Reinforcement Learning (RL) fine-tunes our model to maximize sum of sentence rewards from an advanced AI model such as GPT-4 and Claude 3 Opus.", "The RL objective is as follows: where is a discount factor.", "We use Proximal Policy Optimization (PPO) (Schulman et al., 2017 ) to achieve this RL objective, treating sentences as actions (Equation 3 ).", "where is the original policy (baseline model) and is an advantage estimator at timestep .", "PPO is commonly leveraged in Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ).", "PPO s conservative update prevents the training model from deviating too far from the original model, thus avoiding degeneration.", "Sentence-Level Nuanced Feedback: We request a score between and for each sentence in CoT through the advanced AI for RL.", "The closer the score is to , the more relevant and helpful it is to solving the problem.", "Table 5 presents the prompt format.", "We additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).", "Therefore, the actual range is from to .", "Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 ).", "This allows us to understand the direction in which the model is intended to change through RL.", "Advantages of Using Advanced AI for Score Feedback: Although calling the API has disadvantages, such as incurring costs or facing usage limits, there exist several advantages to using the advanced AI for feedback.", "First, there is no need to train a reward model.", "Second, as the RL fine-tuned model begins to generate out-of-distribution outputs that differ from the data used to train the reward model, it becomes challenging for the trained reward model to provide accurate rewards.", "However, this out-of-distribution problem is effectively addressed with the advanced AI.", "RL Challenge: One of the challenging factors for RL is hyperparameter tuning (Eimer et al., 2023 ).", "This often results in generating repetitive words and truncated sentences (Ouyang et al., 2022 ).", "Additionally, as the model size increases, finding working hyperparameters becomes infeasible for individuals.", "To alleviate this issue, we utilize correction feedback from the advanced AI as the second stage (Section 2.3 ), and proceed with the supervised fine-tuning to stabilize the RL fine-tuned model.", "" ], "target_context_ids": [ 8, 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 8] Table 5 presents the prompt format.", "[paragraph id = 9] We 
additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).", "[paragraph id = 10] Therefore, the actual range is from to .", "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 )." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", "perturb_sentence_id": [ 9, 10 ], "output": { "perturbed_statement": "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits limited behaviors (please refer to Section 4 ).", "perturbed_explanation": "1. The original statement explains the characteristics of the RL fine-tuned model, indicating that it exhibits emergent behaviors due to nuanced scores. 2. The statement incorrectly states that the RL fine-tuned model exhibits limited behaviors instead of emergent behaviors, which contradicts the information that it yields emergent behaviors as detailed in the context." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "5", "section": "3", "all_context": [ "Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.", "We choose ScienceQA because it contains reasoning chains to derive the answer.", "Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).", "In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).", "Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.", "MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).", "This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).", "We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.", "Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.", "We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.", "We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.", "To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.", "(See Table 5 .)", "In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", "We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", "The RL stage often makes the training model generate repetitive sentences.", "This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", "To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.", "(See Appendix C.2 .)", "Training Details: For the RL stage, we use a learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.", "The learning rate for is with epochs for PPO and a batch size of for both tasks.", "We proceed with 2 rounds of our pipeline for 
and 2 rounds for for ScienceQA.", "For A-OKVQA, we proceed with 1 round for both model sizes.", "For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.", "Additionally, we replace MM-CoT s inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).", "The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", "For more detailed settings, please refer to Appendix C .", "Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.", "First, we evaluate ARES s rationale reasoning quality against baseline models since we enhance our model based on them.", "For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).", "The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).", "Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).", "This evaluation is also performed on both model sizes and tasks.", "Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .", "" ], "target_context_ids": [ 12, 13, 14, 15, 16, 17, 18 ], "selected_paragraphs": [ "[paragraph id = 12] (See Table 5 .)", "[paragraph id = 13] In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", "[paragraph id = 14] We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", "[paragraph id = 15] The RL stage often makes the training model generate repetitive sentences.", "[paragraph id = 16] This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", "[paragraph id = 17] To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.", "[paragraph id = 18] (See Appendix C.2 .)" ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n
\n\nFeedback\n\n\n\nPrompt Structure\n\n
\n\nSentence-Level \n
Nuanced Feedback
\n
\n
\n\n[Prompt when Image is provided]\n
There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n
\n
Question: <Question>\n
Options: <Choices>\n
Hint: <Hint>\n
Answer: <Answer>\n
\n
{\n
\"<Rationale 1>\": xxx,\n
\"<Rationale 2>\": xxx,\n
\"<Rationale 3>\": xxx\n
}\n\n
\n
[Prompt when no Image is provided]\n
There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <same as the prompt when the Image is provided>
\n
\n
\n
Table 5: Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.
\n
", "perturb_sentence_id": [ 12, 13 ], "output": { "perturbed_statement": "[paragraph id = 12] (See Table 5 .) In order to collect the corrected dataset for the RL stage, we let Haiku refer to the given problem and correct the answer as the prompt.", "perturbed_explanation": "1. The original statement discusses collecting a corrected dataset for the SFT stage by letting Haiku refer to the given problem and correct the answer as the prompt. 2. The statement is incorrect because it mentions collecting the corrected dataset for the RL stage, whereas the context suggests this process is associated with the SFT stage. The RL stage is mentioned in relation to generating repetitive sentences, not dataset correction." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "8", "section": "3", "all_context": [ "Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.", "We choose ScienceQA because it contains reasoning chains to derive the answer.", "Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).", "In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).", "Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.", "MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).", "This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).", "We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.", "Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.", "We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.", "We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.", "To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.", "(See Table 5 .)", "In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.", "We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.", "The RL stage often makes the training model generate repetitive sentences.", "This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.", "To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.", "(See Appendix C.2 .)", "Training Details: For the RL stage, we use a learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.", "The learning 
rate for is with epochs for PPO and a batch size of for both tasks.", "We proceed with 2 rounds of our pipeline for and 2 rounds for for ScienceQA.", "For A-OKVQA, we proceed with 1 round for both model sizes.", "For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.", "Additionally, we replace MM-CoT s inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).", "The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", "For more detailed settings, please refer to Appendix C .", "Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.", "First, we evaluate ARES s rationale reasoning quality against baseline models since we enhance our model based on them.", "For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).", "The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).", "Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).", "This evaluation is also performed on both model sizes and tasks.", "Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .", "" ], "target_context_ids": [ 25, 26 ], "selected_paragraphs": [ "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).", "[paragraph id = 26] For more detailed settings, please refer to Appendix C ." ], "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for A-OKVQA test set
\n
", "perturb_sentence_id": [ 25, 26 ], "output": { "perturbed_statement": "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a large number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8). For more detailed settings, please refer to Appendix C.", "perturbed_explanation": "The original explanation likely highlighted that the LoRA adapter efficiently uses the rationale model's features with a small number of weights, which results in significantly faster inference speeds compared to MM-CoT. 2. The statement incorrectly mentions a 'large number of weights', contradicting the efficiency highlighted by the small number of weights that enable faster processing. LoRA's advantage in speed is due to its minimal weight requirement, not a large one." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "8", "section": "5", "all_context": [ "Chain-of-Thought (CoT) is a multi-step reasoning method for problem-solving that encourages LLMs to consider the intermediate reasoning steps.", "Zero-Shot-CoT (Kojima et al., 2023 ) promotes CoT by using prompts such as \"Let s think step by step\" for LLMs.", "For Few-Shot-CoT (Zhang et al., 2022b ; Wei et al., 2023 ), a few examples with reasoning processes are provided, allowing the model to refer to these examples and understand how to perform CoT.", "Wei et al.", "(2023 ) reveal that this CoT technique positively impacts the performance of large models (B), but has minimal effect on smaller models.", "MM-CoT (Zhang et al., 2023b ) suggest that CoT is beneficial even for relatively small models, such as 200M, if the model that generates intermediate reasoning and the model that infers the answer are separated.", "We find that simply adding a LoRA adapter (Hu et al., 2021 ) to the reasoning model results in comparable performance.", "This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).", "This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input.", "Reinforcement Learning from Human Feedback (RLHF) (Glaese et al., 2022 ; Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ) align LLMs with user preferences.", "Ouyang et al.", "(2022 ) collects ranked feedback from human labelers and uses this feedback to perform Reinforcement Learning (RL).", "Constitutional AI (CAI) (Bai et al., 2022 ) collects ranked AI feedback rather than costly human feedback and handles harmfulness with RL.", "Both approaches learn outcome-supervised reward models (ORM) using ranking-based feedback.", "Lightman et al.", "(2023 ), instead, propose a process-supervised reward model (PRM) that leverages sentence-level feedback for CoT.", "Lightman et al.", "(2023 ); Luo et al.", "(2024 ) evaluate each trained ORM and PRM with searching algorithms such as best-of- or Monte Carlo Tree Search (MCTS) by selecting the highest-scored solution, demonstrating that the PRM-selected solution outperforms the ORM-selected one.", "Wang et al.", "(2024 ) perform RL using PRM, providing heuristic sentence-level scores for math problems that are simple to grade.", "As an LLM is trained with RL and starts generating outputs 
different from the original distribution, these reward models would not correctly provide rewards (Pitis, 2023 ; Byun and Perrault, 2024 ).", "Instead of training a reward model for a more general task, we perform RL by requesting sentence-level rewards from advanced AI models such as GPT-4.", "Model Accuracy 85.95 1st RL 86.70 1st ARES 87.81 1st ARES & 2nd RL 87.88 2nd ARES 88.38", "" ], "target_context_ids": [ 7, 8 ], "selected_paragraphs": [ "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).", "[paragraph id = 8] This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input." ], "table_html": "
\n

\n\n\n\n\n\nModel\nRationale\nInference\nTotal\n\n\n\n\n6m\n2m\n8m\n\n\n-\n3m\n3m\n\n\n16m\n3m\n19m\n\n\n-\n6m\n6m\n\n\n

\n
Table 8: Time Comparison between MM-CoT and ARES models for A-OKVQA test set
\n
", "perturb_sentence_id": [ 7, 8 ], "output": { "perturbed_statement": "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, but generates answers 2x–14x slower than MM-CoT, which uses a separate inference model (See Table 7 and Table 8). This speed disadvantage arises from the fact that our framework requires an additional rationale as input, unlike the separate inference model framework that does not require any rationale before generating input.", "perturbed_explanation": "The original explanation highlights two key aspects: 1) The LoRA adapter framework generates answers significantly faster (2x-14x) than the MM-CoT framework. 2) The speed advantage is due to not needing a rationale as input, unlike MM-CoT, which requires generating a rationale first. The statement introduces inaccuracies in the following ways: 1) It claims the LoRA framework is slower (2x–14x) than MM-CoT, which contradicts the actual performance speedup detailed earlier. 2) It incorrectly suggests that the LoRA framework needs an additional rationale as input, reversing the actual rationale requirement described." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "13", "section": "2.3", "all_context": [ "The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.", "However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.", "(See examples in Appendix D .)", "Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.", "We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .", "Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.", "This reduces the burden of RL s exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.", "How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.", "The direction of learning is determined by the reward (scalar) value.", "However, the opposite direction of the reward is sometimes required.", "For example, suppose there is a truncated sentence in CoT.", "gets a negative score because it is an incomplete sentence (Table 13 ).", "If there is no correction stage, the probability of is simply reduced.", "What if contains some valuable part?", "This valuable part is ignored, and its probability decreases.", "To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", "Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", "Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model 
from deviating too much from the original model.", "The KL divergence penalty further prevents deviation.", "However, this penalty often causes the model s degeneration.", "As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.", "While this aims to align the training model with specific preferences, it tends to anchor the model to the pretraining dataset.", "Instead, we perform supervised fine-tuning through the Teacher s correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.", "" ], "target_context_ids": [ 11, 12, 13, 14, 15, 16 ], "selected_paragraphs": [ "[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 13 ).", "[paragraph id = 12] If there is no correction stage, the probability of is simply reduced.", "[paragraph id = 13] What if contains some valuable part?", "[paragraph id = 14] This valuable part is ignored, and its probability decreases.", "[paragraph id = 15] To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", "[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", "perturb_sentence_id": [ 11, 12 ], "output": { "perturbed_statement": "[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 18). If there is no correction stage, the probability of is simply increased.", "perturbed_explanation": "1. The original explanation indicates that the context discusses a negative score given to an incomplete sentence and the probability of it being reduced if not corrected. 2. The statement is incorrect because it references Table 18 instead of Table 13, and it claims that the probability of the incomplete sentence is increased without the correction stage, which contradicts the context where the probability decreases if the valuable part is ignored." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "13", "section": "4.2", "all_context": [ "Despite the benefits of RL, hyperparameter tuning often requires massive effort.", "Without meticulous tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.", "To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.", "SFT is more stable than RL.", "We evaluate how well the SFT stage corrects errors caused by the RL stage for various RL hyperparameters.", "We test various RL hyperparameters such as learning rate = {5e-6, 1e-5, 2e-5, 5e-5}, batch size = {2, 4, 8, 16, 32}, and PPO epoch = {5, 10, 15}.", "As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).", "The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 ).", "However, excessive RL learning rates or epochs cause serious degeneration of the model, such as producing no output or generating strange words, and the results of correction feedback are also unreasonable.", "ScienceQA Win Rate vs 69.76% vs 73.76% A-OKVQA Win Rate vs 69.11% vs 66.96% Model Size NAT SOC LAN TXT IMG NO G1-6 G7-12 Avg Human - 90.23 84.97 87.48 89.60 87.50 88.10 91.59 82.42 88.40 MCAN (Yu et al., 2019 ) 95M 56.08 46.23 58.09 59.43 51.17 55.40 51.65 59.72 54.54 Top-Down (Anderson et al., 2018 ) 70M 59.50 54.33 61.82 62.90 54.88 59.79 57.27 62.16 59.02 BAN (Kim et al., 2018 ) 112M 60.88 46.57 66.64 62.61 52.60 65.51 56.83 63.94 59.37 DFAF (Peng et al., 2019 ) 74M 64.03 48.82 63.55 65.88 54.49 64.11 57.12 67.17 60.72 ViLT (Kim et al., 2021 ) 113M 60.48 63.89 60.27 63.20 61.38 57.00 60.72 61.90 61.14 Patch-TRM (Lu et al., 2022b ) 90M 65.19 46.79 65.55 66.96 55.28 64.95 58.04 67.50 61.42 VisualBERT (Li et al., 2019 ) 111M 59.33 69.18 61.18 62.71 62.17 58.54 62.96 59.92 61.87 UnifiedQABase (Khashabi et al., 2020 ) 223M 68.16 69.18 74.91 63.78 61.38 77.84 72.98 65.00 70.12 UnifiedQABase w/ CoT (Lu et al., 2022a ) 223M 71.00 76.04 78.91 66.42 66.53 81.81 77.06 68.82 74.11 LLaMA-Adapter (Zhang et al., 2023a ) 6B 84.37 88.30 84.36 83.72 80.32 86.90 85.83 84.05 85.19 LLaVA (Liu et al., 2023a ) 13B 90.36 95.95* 88.00 89.49 88.00 90.66 90.93 90.90* 90.92 InstructBLIP (Dai et al., 2023 ) 11B - - - - 90.70* - - - - (Zhang et al., 2023b ) 251M+251M 84.59 92.46 83.45 83.87 83.29 85.64 86.34 85.23 85.95 (Ours) 251M+30M 87.92 92.58 85.91 86.61 85.82 88.36 88.88 87.48 88.38 (Zhang et al., 2023b ) 790M+790M 90.76 93.59 86.55 89.69 87.85 89.55 90.90 89.12 90.26 (Ours) 790M+76M 91.21* 92.80 89.45* 90.27* 88.35 91.22* 91.48* 90.38 91.09*", "" ], "target_context_ids": [ 1, 2, 6, 7 ], "selected_paragraphs": [ "[paragraph id 
= 1] Without meticulous tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.", "[paragraph id = 2] To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.", "[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).", "[paragraph id = 7] The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 )." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Example of the Incomplete\nSentence After the RL Stage
\nProblem: What is the volume of a large soup pot?
\nChoices: [7 liters, 7 milliliters]
\nAnswer: 7 liters
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nThe better estimate for the volume of a large soup pot is-\n\n
\n
Table 13: Example of the incomplete sentence after the RL stage.
\n
", "perturb_sentence_id": [ 6, 7 ], "output": { "perturbed_statement": "[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are neither repetitive nor truncated (see Table 13 and 12 ). The SFT stage, with correction feedback, reflects a different direction from how the model is fine-tuned through RL and misguides it (Table 13 and 16 ).", "perturbed_explanation": "This explanation addresses two aspects presented in the statement. 1. The first aspect is regarding the impact of RL fine-tuning. The original statement mentions that RL resulted in sentences within rationale chains being repetitive or truncated. 2. The second aspect involves the role of SFT. The SFT stage is designed to guide and correct the errors introduced during RL fine-tuning. The statement incorrectly claims that sentences are neither repetitive nor truncated, and that SFT misguides the model, both of which contradict the described scenario where RL leads to errors that SFT aims to correct." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "14", "section": "4.1", "all_context": [ "Through RL, a training model is aligned to a specific preference.", "Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.", "However, this process produces some interesting additional results.", "First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.", "In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", "The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", "We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large.", "Second, the training model begins to explain why other options are not the answer (See Table 15 ).", "The process of elimination is a useful method for deriving answers when options are given.", "" ], "target_context_ids": [ 4, 5, 6 ], "selected_paragraphs": [ "[paragraph id = 4] In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", "[paragraph id = 5] The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", "[paragraph id = 6] We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large." ], "table_html": "
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Changed Empty Rationale Reasoning Through RL
\nProblem: Select the solid.
\nChoices: [rain, water in a fishbowl, hammer]
\nAnswer: hammer
\n\nBefore RL\n\n\n\n[Rationale reasoning]\n\n
\n\n<empty>\n\n
\n\nAfter RL\n\n\n\n[Rationale reasoning]\n\n
\n\nRain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you pour water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood.The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you put water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A rain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood. Both iron and wood are solids.\n\n
\n
Table 14: Example of a problem with empty rationale reasoning for which the model starts generating rationale reasoning after the RL stage.
\n
", "perturb_sentence_id": [ 4, 5 ], "output": { "perturbed_statement": "[paragraph id = 4] In particular, 15% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14).", "perturbed_explanation": "Original Explanation: In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward. \n\n1. The statement incorrectly claims that 15% of the problems have empty rationale reasoning data, whereas the correct percentage is 9.5%. \n2. This alteration changes the statistical representation of the data and may lead to misconceptions about the prevalence of problems lacking rationale reasoning in ScienceQA." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "15", "section": "4.1", "all_context": [ "Through RL, a training model is aligned to a specific preference.", "Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.", "However, this process produces some interesting additional results.", "First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.", "In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.", "The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).", "We observe this especially when utilizing PPO s advantage normalization or when the learning rate is large.", "Second, the training model begins to explain why other options are not the answer (See Table 15 ).", "The process of elimination is a useful method for deriving answers when options are given.", "" ], "target_context_ids": [ 7, 8 ], "selected_paragraphs": [ "[paragraph id = 7] Second, the training model begins to explain why other options are not the answer (See Table 15 ).", "[paragraph id = 8] The process of elimination is a useful method for deriving answers when options are given." ], "table_html": "
Example of the Process of Elimination After the RL Stage

Problem: Which is a compound sentence?
Choices: ["Last night we heard an owl outside in the oak tree.",
          "I ate all of my dinner, so I can have some pie."]
Answer: "I ate all of my dinner, so I can have some pie."

Before RL [Rationale reasoning]:
The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie.

After RL [Rationale reasoning]:
The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie. The second sentence does not have a comma, so it is not a compound sentence. Last night we heard an owl outside in the oak tree.

Table 15: Example of the process of elimination rationale reasoning after the RL stage.
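The Section 4.1 context accompanying Table 15 describes RL as raising the probability of sentences that receive good rewards and lowering it for incorrect or meaningless ones, which is how behaviours such as the process of elimination emerge. As a rough, self-contained illustration of that reward-weighting idea only (the actual pipeline fine-tunes a multimodal Flan-T5 policy with PPO, not this toy REINFORCE-style update, and every model, span, and reward value below is a placeholder):

```python
# Toy sketch: how per-sentence rewards raise/lower sentence probabilities.
# REINFORCE-style surrogate loss; the paper itself uses PPO, so treat this
# purely as an illustration of the reward-weighting idea.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 100, 32

# Placeholder autoregressive LM (stands in for the multimodal policy model).
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                          # logits: (B, T, vocab)

policy = TinyLM()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# A generated rationale of 12 tokens, split into three "sentences" of 4 tokens,
# with one scalar reward per sentence (e.g., from an AI-feedback scorer).
tokens = torch.randint(0, vocab_size, (1, 12))
sentence_spans = [(0, 4), (4, 8), (8, 12)]
sentence_rewards = [1.0, -0.5, 0.8]                 # helpful, meaningless, helpful

logits = policy(tokens)                             # teacher-forcing shift omitted for brevity
log_probs = torch.log_softmax(logits, dim=-1)
token_logp = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)

# Broadcast each sentence-level reward onto its tokens.
token_rewards = torch.zeros_like(token_logp)
for (start, end), r in zip(sentence_spans, sentence_rewards):
    token_rewards[:, start:end] = r

# Positive rewards increase the log-probability of their sentences,
# negative rewards decrease it.
loss = -(token_rewards * token_logp).mean()
loss.backward()
optimizer.step()
print(f"surrogate loss: {loss.item():.4f}")
```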
", "perturb_sentence_id": [ 7, 8 ], "output": { "perturbed_statement": "[paragraph id = 7] Second, the training model begins to explain why other options are the answer (See Table 15 ). The process of elimination is a useful method for deriving answers when options are given.", "perturbed_explanation": "The process of elimination is a useful method for deriving answers when options are given because it allows individuals to identify and discard alternatives that do not meet the given criteria or constraints. 2. However, the statement incorrectly claims that the training model explains why other options are the answer, which contradicts the concept of elimination that focuses on discarding incorrect options rather than validating them as correct answers." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "16", "section": "2.3", "all_context": [ "The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.", "However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.", "(See examples in Appendix D .)", "Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.", "We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .", "Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.", "This reduces the burden of RL s exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.", "How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.", "The direction of learning is determined by the reward (scalar) value.", "However, the opposite direction of the reward is sometimes required.", "For example, suppose there is a truncated sentence in CoT.", "gets a negative score because it is an incomplete sentence (Table 13 ).", "If there is no correction stage, the probability of is simply reduced.", "What if contains some valuable part?", "This valuable part is ignored, and its probability decreases.", "To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.", "Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", "Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model.", "The KL divergence penalty further prevents deviation.", "However, this penalty often causes the model s degeneration.", "As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.", "While this aims to align the training model with specific preferences, it 
tends to anchor the model to the pretraining dataset.", "Instead, we perform supervised fine-tuning through the Teacher s correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.", "" ], "target_context_ids": [ 16, 17 ], "selected_paragraphs": [ "[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.", "[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model." ], "table_html": "
Comparison of Corrected Rationales After RL

Image: [Uncaptioned image]
Question: Will these magnets attract or repel each other?
Choices: repel, attract
After RL: Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are the same repel. So, these magnets will repel each other. The picture shows two magnets that are the same shape and the same size. Both magnets are the same size. So, these magnets will repel each other.
Corrected Rationale: The magnets will attract each other because the north pole of one magnet is closest to the south pole of the other magnet. Opposite poles of magnets attract each other in this diagram. Therefore, the correct answer is that the magnets will attract.

Question: How long is a pen?
Choices: 20 kilometers, 20 centimeters
After RL: The better estimate for the length of a pen is 20 centimeters. 20 kilometers is too long. 20 centimeters is too short. 20 kilometers is too long.
Corrected Rationale: The better estimate for the length of a pen is 20 centimeters. 20 kilometers is an unreasonable estimate, as it is far too long for a typical pen. 20 centimeters is a more accurate and reasonable estimate for the length of a standard pen.

Image: [Uncaptioned image]
Question: What is the capital of South Carolina?
Choices: Columbia, Montgomery, Charleston, Harrisburg
After RL: Columbia is the capital of South Carolina. The city of Columbia is Charleston, South Carolina. Charleston is the smallest city in South Carolina. Hayeburg is the smallest city in South Carolina. Montgomery is the smallest city in South Carolina. But the city of Columbia is home to many cultural and historical landmarks.
Corrected Rationale: Columbia is the correct answer as it is the capital city of South Carolina. It is the largest city in the state and serves as the seat of government. The other options, Montgomery (the capital of Alabama), Charleston (a city in South Carolina), and Harrisburg (the capital of Pennsylvania), are not the capital of South Carolina.

Table 16: Examples of how the correction stage corrects mistakes after the RL stage.
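The Section 2.3 context attached to Table 16 notes that the RL stage is run with PPO and a KL-divergence penalty so the policy does not drift too far from the original model, with correction feedback then handled by supervised fine-tuning rather than a PPO-ptx-style pretraining term. Below is a minimal sketch of the standard KL-penalized reward shaping used in PPO-based RLHF pipelines; the beta value, tensor shapes, and function name are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of KL-penalized reward shaping as used in PPO-based RLHF.
# Assumed, illustrative interface: per-token log-probs from the current policy
# and from the frozen reference model, plus one sequence-level reward.
import torch

def shaped_rewards(policy_logp: torch.Tensor,
                   ref_logp: torch.Tensor,
                   sequence_reward: float,
                   beta: float = 0.05) -> torch.Tensor:
    """Per-token rewards = -beta * approximate KL, with the scalar
    feedback reward added on the final generated token."""
    kl_penalty = -beta * (policy_logp - ref_logp)   # penalize drift from the reference model
    rewards = kl_penalty.clone()
    rewards[..., -1] += sequence_reward             # sparse reward at the end of the sequence
    return rewards

# Example with made-up numbers: 6 generated tokens, one rationale-quality score.
policy_logp = torch.tensor([[-1.2, -0.8, -2.0, -0.5, -1.1, -0.9]])
ref_logp    = torch.tensor([[-1.0, -0.9, -1.8, -0.7, -1.0, -1.2]])
print(shaped_rewards(policy_logp, ref_logp, sequence_reward=1.0))
```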
", "perturb_sentence_id": [ 17, 18 ], "output": { "perturbed_statement": "[paragraph id = 17] Additionally, RL is primarily fine-tuned through DQN (Schulman et al., 2017) to prevent the model from deviating too much from the original model. The KL divergence penalty further prevents deviation.", "perturbed_explanation": "1. The original explanation notes that RL is primarily fine-tuned through PPO (Proximal Policy Optimization), which efficiently balances exploration and exploitation. It also emphasizes the role of the KL divergence penalty in preventing the model's significant deviation from the base model. 2. The statement incorrectly claims that RL is primarily fine-tuned through DQN (Deep Q-Network). DQN is not used for fine-tuning in the context described; PPO is the correct method used, according to the cited reference (Schulman et al., 2017). This introduces a factual error in the description of the fine-tuning process." } }, { "path": "table_paper/2407.00087v2.json", "table_id": "17", "section": "4.4", "all_context": [ "We investigate whether the improved rationale also contributes to answer inference accuracy.", "Table 2 shows the main results of answer inference on the ScienceQA.", "We evaluate our base model against the MM-CoT baseline.", "achieves a 2.79% improvement compared to the corresponding baseline ().", "The large model () shows some minimal improvement compared to the corresponding baseline.", "However, it s worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).", "This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).", "The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.", "Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.", "Table 3 shows the results of answer inference on the A-OKVQA.", "We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.", "In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).", "ARES achieves 4.45% gains against and 2.35% for .", "In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.", "Figure 3 shows the rationale reasoning for 4 cases.", "The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.", "However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", "Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.", "1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.", "1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.", "Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03", "" ], "target_context_ids": [ 16, 17 ], "selected_paragraphs": [ "[paragraph id = 16] However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).", "[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed." ], "table_html": "
Comparison of Generated Rationales

Image: [Uncaptioned image]
Question: Which property do these three objects have in common?
Choices: rough, yellow, sticky
Original Solution: Look at each object. For each object, decide if it has that property. A rough object feels scratchy when you touch it. The rain boots and the rubber duck are not rough. A sticky object can attach or stick to other things. The rain boots and the rubber duck are not sticky. Yellow is a color. This color is yellow. All three objects are yellow. The property that all three objects have in common is yellow.
Baseline: Look at each object. For each object, decide if it has that property. A sticky object can attach or stick to other things. The rubber duck is not sticky. Yellow is a color. This color is yellow. The rubber duck and the rain boots are yellow, but the lemon is not. A rough object feels scratchy when you touch it. All three objects are rough. The property that all three objects have in common is rough.
Correction feedback on the baseline: The property that all three objects have in common is yellow. The rubber duck and the rain boots are both yellow, and the lemon is also yellow. Although the objects may have other properties like roughness, the common property among them is the color yellow.
ARES (Ours): The three objects shown - rubber duck, rain boots, and lemon - all share the common property of being yellow in color. The rubber duck is made of a smooth, non-sticky material, the rain boots are also a yellow-colored object, and the lemon is also a yellow-colored liquid. Therefore, the correct answer is that the common property among these three objects is that they are yellow. The other options, rough and sticky, do not apply to all three objects, as the rubber duck is not rough, and the rain boots are not sticky.

Question: Which word would you find on a dictionary page with the following guide words? lever - loan
Choices: lamb, liquid
Original Solution: Put the words in alphabetical order. Since liquid is between the guide words lever - loan, it would be found on that page.
Baseline: Put the words in alphabetical order. Since lamb is between the guide words lever - loan, it would be found on that page.
Correction feedback on the baseline: The word "liquid" would be found on the dictionary page with the guide words "lever - loan" because it falls alphabetically between those two words.
ARES (Ours): The word "liquid" would be found on the dictionary page with the guide words "lever - loan" because it falls alphabetically between those two words. Therefore, the word "liquid" would be the word found on that dictionary page.

Table 17: Comparison of generated rationales by baseline, baseline with correction feedback, and our model (ARES).
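The "correction feedback" rows in Table 17 are the kind of teacher-corrected rationales that the pipeline feeds back through ordinary supervised fine-tuning to stabilize the RL-tuned model. The sketch below shows that step as plain teacher-forced cross-entropy with the Hugging Face seq2seq API; the checkpoint name and the single training pair (paraphrasing the dictionary example) are illustrative stand-ins, not the authors' training script or data.

```python
# Sketch: stabilizing the RL-tuned model with supervised fine-tuning (SFT)
# on teacher-corrected rationales, as ordinary seq2seq cross-entropy.
# "google/flan-t5-small" is an illustrative stand-in for the actual checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One corrected pair, paraphrasing the dictionary example in Table 17.
question = ("Which word would you find on a dictionary page with the "
            "guide words lever - loan? Choices: lamb, liquid.")
corrected_rationale = ("The word 'liquid' would be found on the page because "
                       "it falls alphabetically between 'lever' and 'loan'.")

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(corrected_rationale, return_tensors="pt").input_ids

# Standard teacher-forced cross-entropy on the corrected target rationale.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
print(f"SFT loss on corrected rationale: {loss.item():.4f}")
```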
", "perturb_sentence_id": [ 16, 17 ], "output": { "perturbed_statement": "[paragraph id = 16] Nonetheless, the corrected reasoning for MM-CoT with the RL stage has insufficient information compared to the reasoning of ARES without RL (refer to Table 17 for more examples). Table 4 also shows that inference accuracy remains the same as each part of ARES is executed.", "perturbed_explanation": "The original explanation highlights: 1. The statement discusses comparison in reasoning between MM-CoT and ARES, specifically emphasizing that the RL stage in ARES enhances reasoning, unlike MM-CoT without the RL stage, which has insufficient information. 2. In Table 4, it is stated that inference accuracy for ARES gradually improves as each part is executed. The statement is incorrect because: 3. It falsely claims that MM-CoT with the RL stage has insufficient information compared to ARES without RL, which contradicts the context where ARES's reasoning benefits from RL. 4. It incorrectly claims that Table 4 shows inference accuracy remains the same as ARES components execute, while the context indicates improvement with execution." } } ]