[
{
"path": "table_paper/2407.00087v2.json",
"table_id": "1",
"section": "4.3",
"all_context": [
"We check whether ARES improves the quality of rationale reasoning compared to the baseline model.",
"GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.",
"We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).",
"We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.",
"Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets.",
""
],
"target_context_ids": [
0,
1,
2,
3,
4
],
"selected_paragraphs": [
"[paragraph id = 0] We check whether ARES improves the quality of rationale reasoning compared to the baseline model.",
"[paragraph id = 1] GPT-4o evaluates which rationale chain is better between the rationale generated by ARES and the rationale generated by the baseline model.",
"[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ).",
"[paragraph id = 3] We conduct our experiments with two different model sizes, Flan-Base and Flan-Large with ViT feature, on ScienceQA and A-OKVQA.",
"[paragraph id = 4] Table 1 shows that ARES achieves around 70% win rate against each corresponding baseline model for both datasets."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T1\">\n<p class=\"ltx_p ltx_align_center\" id=\"S4.T1.8\"><span class=\"ltx_text\" id=\"S4.T1.8.8\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"S4.T1.8.8.8\" style=\"width:181.8pt;height:108pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"S4.T1.8.8.8.8\"><span class=\"ltx_text\" id=\"S4.T1.8.8.8.8.8\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T1.8.8.8.8.8.8\">\n<span class=\"ltx_thead\">\n<span class=\"ltx_tr\" id=\"S4.T1.8.8.8.8.8.8.9.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S4.T1.8.8.8.8.8.8.9.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.8.8.8.8.8.8.9.1.1.1\">ScienceQA</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T1.8.8.8.8.8.8.9.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.8.8.8.8.8.8.9.1.2.1\">Win Rate</span></span></span>\n</span>\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"S4.T1.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T1.2.2.2.2.2.2.2.2\"> vs </span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.2.2.2.2.2.2.2.3\">69.76%</span></span>\n<span class=\"ltx_tr\" id=\"S4.T1.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T1.4.4.4.4.4.4.4.2\"> vs </span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T1.4.4.4.4.4.4.4.3\">73.76%</span></span>\n<span class=\"ltx_tr\" id=\"S4.T1.8.8.8.8.8.8.10.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T1.8.8.8.8.8.8.10.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T1.8.8.8.8.8.8.10.1.1.1\">A-OKVQA</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T1.8.8.8.8.8.8.10.1.2\"><span class=\"ltx_text ltx_font_bold\" 
id=\"S4.T1.8.8.8.8.8.8.10.1.2.1\">Win Rate</span></span></span>\n<span class=\"ltx_tr\" id=\"S4.T1.6.6.6.6.6.6.6\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T1.6.6.6.6.6.6.6.2\"> vs </span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T1.6.6.6.6.6.6.6.3\">69.11%</span></span>\n<span class=\"ltx_tr\" id=\"S4.T1.8.8.8.8.8.8.8\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T1.8.8.8.8.8.8.8.2\"> vs </span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T1.8.8.8.8.8.8.8.3\">66.96%</span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span>We train baseline models, MM-CoT, with the ARES pipeline and ask <span class=\"ltx_text ltx_font_typewriter\" id=\"S4.T1.14.1\">GPT-4o</span> to evaluate which rationale reasoning is better. We compare each baseline for two model sizes ( and ) and two tasks (ScienceQA and A-OKVQA).</figcaption>\n</figure>",
"perturb_sentence_id": [
2,
3
],
"output": {
"perturbed_statement": "[paragraph id = 2] We randomly shuffle the rationale chains and provide them as Option A and Option B (see Appendix A.3 ) for a fair evaluation (Yu et al., 2023 ). We conduct our experiments with two different model sizes, Flan-Base and Flan-Small with ViT feature, on ScienceQA and A-OKVQA.",
"perturbed_explanation": "The original explanation is: This statement outlines the methodology used in evaluating rationale chains, which includes randomizing them for a fair comparison and testing with two model sizes, Flan-Base and Flan-Large with ViT feature, across ScienceQA and A-OKVQA datasets. 1. The statement mistakenly refers to one of the model sizes as Flan-Small instead of Flan-Large. This is incorrect because the model sizes actually used were Flan-Base and Flan-Large, not Flan-Small."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "2",
"section": "4.4",
"all_context": [
"We investigate whether the improved rationale also contributes to answer inference accuracy.",
"Table 2 shows the main results of answer inference on the ScienceQA.",
"We evaluate our base model against the MM-CoT baseline.",
"achieves a 2.79% improvement compared to the corresponding baseline ().",
"The large model () shows some minimal improvement compared to the corresponding baseline.",
"However, it's worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).",
"This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).",
"The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.",
"Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.",
"Table 3 shows the results of answer inference on the A-OKVQA.",
"We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.",
"In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).",
"ARES achieves 4.45% gains against and 2.35% for .",
"In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.",
"Figure 3 shows the rationale reasoning for 4 cases.",
"The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.",
"However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).",
"Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.",
"1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.",
"Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
""
],
"target_context_ids": [
1,
4,
5,
7,
12,
13,
18,
19,
20
],
"selected_paragraphs": [
"[paragraph id = 1] Table 2 shows the main results of answer inference on the ScienceQA.",
"[paragraph id = 4] The large model () shows some minimal improvement compared to the corresponding baseline.",
"[paragraph id = 5] However, it's worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).",
"[paragraph id = 7] The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.",
"[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .",
"[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.",
"[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.",
"[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03"
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T2\">\n<div class=\"ltx_inline-block ltx_transformed_outer\" id=\"S4.T2.4\" style=\"width:433.6pt;height:189.8pt;vertical-align:-0.6pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(-154.5pt,67.4pt) scale(0.58387,0.58387) ;\">\n<p class=\"ltx_p\" id=\"S4.T2.4.4\"><span class=\"ltx_text\" id=\"S4.T2.4.4.4\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"S4.T2.4.4.4.4\" style=\"width:742.7pt;height:325pt;vertical-align:-1.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"S4.T2.4.4.4.4.4\"><span class=\"ltx_text\" id=\"S4.T2.4.4.4.4.4.4\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T2.4.4.4.4.4.4.4\">\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.5.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.1.1\">Model</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.2.1\">Size</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.3.1\">NAT</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.4.1\">SOC</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.5.1\">LAN</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.6\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.6.1\">TXT</span></span>\n<span 
class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.7\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.7.1\">IMG</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.8.1\">NO</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.9.1\">G1-6</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.10\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.10.1\">G7-12</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.11\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.5.1.11.1\">Avg</span></span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.6.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.1\">Human</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.2\">-</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.3\">90.23</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.4\">84.97</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.5\">87.48</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.6\">89.60</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.7\">87.50</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.8\">88.10</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.9\">91.59</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.10\">82.42</span>\n<span class=\"ltx_td 
ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.6.2.11\">88.40</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.7.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.1\">MCAN <cite class=\"ltx_cite ltx_citemacro_citep\">(Yu et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib40\" title=\"\">2019 ###reference_b40###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.2\">95M</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.3\">56.08</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.4\">46.23</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.5\">58.09</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.6\">59.43</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.7\">51.17</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.8\">55.40</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.9\">51.65</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.10\">59.72</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.7.3.11\">54.54</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.8.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.1\">Top-Down <cite class=\"ltx_cite ltx_citemacro_citep\">(Anderson et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib1\" title=\"\">2018 ###reference_b1###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.2\">70M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.3\">59.50</span>\n<span class=\"ltx_td 
ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.4\">54.33</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.5\">61.82</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.6\">62.90</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.7\">54.88</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.8\">59.79</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.9\">57.27</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.10\">62.16</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.8.4.11\">59.02</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.9.5\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.1\">BAN <cite class=\"ltx_cite ltx_citemacro_citep\">(Kim et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib16\" title=\"\">2018 ###reference_b16###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.2\">112M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.3\">60.88</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.4\">46.57</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.5\">66.64</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.6\">62.61</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.7\">52.60</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.8\">65.51</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.9\">56.83</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.10\">63.94</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.9.5.11\">59.37</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.10.6\">\n<span class=\"ltx_td ltx_align_left 
ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.1\">DFAF <cite class=\"ltx_cite ltx_citemacro_citep\">(Peng et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib28\" title=\"\">2019 ###reference_b28###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.2\">74M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.3\">64.03</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.4\">48.82</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.5\">63.55</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.6\">65.88</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.7\">54.49</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.8\">64.11</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.9\">57.12</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.10\">67.17</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.10.6.11\">60.72</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.11.7\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.1\">ViLT <cite class=\"ltx_cite ltx_citemacro_citep\">(Kim et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib17\" title=\"\">2021 ###reference_b17###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.2\">113M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.3\">60.48</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.4\">63.89</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.5\">60.27</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.6\">63.20</span>\n<span class=\"ltx_td ltx_align_center\" 
id=\"S4.T2.4.4.4.4.4.4.4.11.7.7\">61.38</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.8\">57.00</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.9\">60.72</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.10\">61.90</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.11.7.11\">61.14</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.12.8\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.1\">Patch-TRM <cite class=\"ltx_cite ltx_citemacro_citep\">(Lu et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib25\" title=\"\">2022b ###reference_b25###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.2\">90M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.3\">65.19</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.4\">46.79</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.5\">65.55</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.6\">66.96</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.7\">55.28</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.8\">64.95</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.9\">58.04</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.10\">67.50</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.12.8.11\">61.42</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.13.9\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.1\">VisualBERT <cite class=\"ltx_cite ltx_citemacro_citep\">(Li et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib20\" title=\"\">2019 ###reference_b20###</a>)</cite></span>\n<span 
class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.2\">111M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.3\">59.33</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.4\">69.18</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.5\">61.18</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.6\">62.71</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.7\">62.17</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.8\">58.54</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.9\">62.96</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.10\">59.92</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.13.9.11\">61.87</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.14.10\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.1\">UnifiedQA<sub class=\"ltx_sub\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.1.1\">Base</sub> <cite class=\"ltx_cite ltx_citemacro_citep\">(Khashabi et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib15\" title=\"\">2020 ###reference_b15###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.2\">223M</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.3\">68.16</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.4\">69.18</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.5\">74.91</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.6\">63.78</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.7\">61.38</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" 
id=\"S4.T2.4.4.4.4.4.4.4.14.10.8\">77.84</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.9\">72.98</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.10\">65.00</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.14.10.11\">70.12</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.15.11\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.1\">UnifiedQA<sub class=\"ltx_sub\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.1.1\">Base</sub> w/ CoT <cite class=\"ltx_cite ltx_citemacro_citep\">(Lu et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib24\" title=\"\">2022a ###reference_b24###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.2\">223M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.3\">71.00</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.4\">76.04</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.5\">78.91</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.6\">66.42</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.7\">66.53</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.8\">81.81</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.9\">77.06</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.10\">68.82</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.15.11.11\">74.11</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.16.12\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.1\">LLaMA-Adapter <cite class=\"ltx_cite ltx_citemacro_citep\">(Zhang et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib42\" 
title=\"\">2023a ###reference_b42###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.2\">6B</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.3\">84.37</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.4\">88.30</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.5\">84.36</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.6\">83.72</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.7\">80.32</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.8\">86.90</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.9\">85.83</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.10\">84.05</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.4.4.4.4.4.16.12.11\">85.19</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.17.13\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.1\">LLaVA <cite class=\"ltx_cite ltx_citemacro_citep\">(Liu et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib22\" title=\"\">2023a ###reference_b22###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.2\">13B</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.3\">90.36</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.4\">95.95*</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.5\">88.00</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.6\">89.49</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.7\">88.00</span>\n<span class=\"ltx_td 
ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.8\">90.66</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.9\">90.93</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.10\">90.90*</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.17.13.11\">90.92</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.18.14\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.1\">InstructBLIP <cite class=\"ltx_cite ltx_citemacro_citep\">(Dai et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib8\" title=\"\">2023 ###reference_b8###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.2\">11B</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.3\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.4\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.5\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.6\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.7\">90.70*</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.8\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.9\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.10\">-</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.4.4.4.4.4.4.4.18.14.11\">-</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.1.1.1.1.1.1.1.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.1\"> <cite class=\"ltx_cite ltx_citemacro_citep\">(Zhang et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib45\" title=\"\">2023b ###reference_b45###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" 
id=\"S4.T2.1.1.1.1.1.1.1.1.2\">251M+251M</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.3\">84.59</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.4\">92.46</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.5\">83.45</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.6\">83.87</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.7\">83.29</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.8\">85.64</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.9\">86.34</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.10\">85.23</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.1.1.1.1.1.1.11\">85.95</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.2.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T2.2.2.2.2.2.2.2.2.1\"> (Ours)</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.2\">251M+30M</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.3.1\">87.92</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.4.1\">92.58</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.5.1\">85.91</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.6\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.6.1\">86.61</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.7\"><span class=\"ltx_text ltx_font_bold\" 
id=\"S4.T2.2.2.2.2.2.2.2.2.7.1\">85.82</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.8.1\">88.36</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.9.1\">88.88</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.10\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.10.1\">87.48</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T2.2.2.2.2.2.2.2.2.11\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.2.2.2.2.2.2.2.2.11.1\">88.38</span></span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.3.3.3.3.3.3.3.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.1\"> <cite class=\"ltx_cite ltx_citemacro_citep\">(Zhang et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib45\" title=\"\">2023b ###reference_b45###</a>)</cite></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.2\">790M+790M</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.3\">90.76</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.4\">93.59</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.5\">86.55</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.6\">89.69</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.7\">87.85</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.8\">89.55</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.9\">90.90</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.10\">89.12</span>\n<span class=\"ltx_td ltx_align_center 
ltx_border_t\" id=\"S4.T2.3.3.3.3.3.3.3.3.11\">90.26</span></span>\n<span class=\"ltx_tr\" id=\"S4.T2.4.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.1\"> (Ours)</span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.2\">790M+76M</span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.3.1\">91.21*</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.4\">92.80</span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.5.1\">89.45*</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.6\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.6.1\">90.27*</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.7\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.7.1\">88.35</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.8.1\">91.22*</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.9.1\">91.48*</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.10\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.10.1\">90.38</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.4.4.4.4.4.4.4.4.11\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.4.4.4.4.4.4.4.4.11.1\">91.09*</span></span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n</span></div>\n<figcaption class=\"ltx_caption 
ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 2: </span>\nMain results on the ScienceQA test set (%). Size = backbone size. Question classes: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. Other results are sourced from <cite class=\"ltx_cite ltx_citemacro_citet\">Lu et al. (<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib24\" title=\"\">2022a</a>)</cite> and <cite class=\"ltx_cite ltx_citemacro_citet\">Zhang et al. (<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib45\" title=\"\">2023b</a>)</cite>. Results in bold represent the better performance than the corresponding baseline. (*) indicates the best performance.\n</figcaption>\n</figure>",
"perturb_sentence_id": [
20,
21
],
"output": {
"perturbed_statement": "[paragraph id = 20] Model Accuracy IPVR (OPT-66B) 50.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
"perturbed_explanation": "The original explanation notes that the model accuracy figures provided are specific and verifiable based on given context. 1. The IPVR (OPT-66B) model accuracy is stated as 48.6, which can be cross-referenced. 2. Changing IPVR (OPT-66B) accuracy to 50.6 introduces a factual error, as the original accuracy is documented as 48.6, not 50.6."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "3",
"section": "4.4",
"all_context": [
"We investigate whether the improved rationale also contributes to answer inference accuracy.",
"Table 2 shows the main results of answer inference on the ScienceQA.",
"We evaluate our base model against the MM-CoT baseline.",
"achieves a 2.79% improvement compared to the corresponding baseline ().",
"The large model () shows some minimal improvement compared to the corresponding baseline.",
"However, it's worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).",
"This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).",
"The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.",
"Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.",
"Table 3 shows the results of answer inference on the A-OKVQA.",
"We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.",
"In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).",
"ARES achieves 4.45% gains against and 2.35% for .",
"In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.",
"Figure 3 shows the rationale reasoning for 4 cases.",
"The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.",
"However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).",
"Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.",
"1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.",
"Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
""
],
"target_context_ids": [
10,
11,
12,
13
],
"selected_paragraphs": [
"[paragraph id = 10] We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.",
"[paragraph id = 11] In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).",
"[paragraph id = 12] ARES achieves 4.45% gains against and 2.35% for .",
"[paragraph id = 13] In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T3\">\n<p class=\"ltx_p\" id=\"S4.T3.4\"><span class=\"ltx_text\" id=\"S4.T3.4.4\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"S4.T3.4.4.4\" style=\"width:146.5pt;height:126pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"S4.T3.4.4.4.4\"><span class=\"ltx_text\" id=\"S4.T3.4.4.4.4.4\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T3.4.4.4.4.4.4\">\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"S4.T3.4.4.4.4.4.4.5.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T3.4.4.4.4.4.4.5.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.4.4.4.4.4.4.5.1.1.1\">Model</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_tt\" id=\"S4.T3.4.4.4.4.4.4.5.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.4.4.4.4.4.4.5.1.2.1\">Accuracy</span></span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.4.4.4.4.4.4.6.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T3.4.4.4.4.4.4.6.2.1\">IPVR (OPT-66B)</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.4.4.4.4.4.4.6.2.2\">48.6</span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.4.4.4.4.4.4.7.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T3.4.4.4.4.4.4.7.3.1\">ViLBERT</span>\n<span class=\"ltx_td ltx_align_center\" id=\"S4.T3.4.4.4.4.4.4.7.3.2\">49.1</span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.1.1.1.1.1.1.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T3.1.1.1.1.1.1.1.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.1.1.1.1.1.2\">60.96</span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S4.T3.2.2.2.2.2.2.2.1\"> (Ours)</span>\n<span class=\"ltx_td ltx_align_center\" 
id=\"S4.T3.2.2.2.2.2.2.2.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.2.2.2.2.2.2.2.2.1\">65.41</span></span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.3.3.3.3.3.3.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S4.T3.3.3.3.3.3.3.3.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.3.3.3.3.3.3.3.2\">65.68</span></span>\n<span class=\"ltx_tr\" id=\"S4.T3.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T3.4.4.4.4.4.4.4.1\"> (Ours)</span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.4.4.4.4.4.4.4.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.4.4.4.4.4.4.4.2.1\">68.03</span></span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 3: </span>Results of ARES on A-OKVQA. We mainly compare different-sized MM-CoT baselines <cite class=\"ltx_cite ltx_citemacro_citep\">(Zhang et al., <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00087v2#bib.bib45\" title=\"\">2023b</a>)</cite>. We retrain the MM-CoTs and run the ARES pipeline on these models. We evaluate these models on the validation set because the test set is hidden.</figcaption>\n</figure>",
"perturb_sentence_id": [
11,
12
],
"output": {
"perturbed_statement": "[paragraph id = 11] In our experiments, MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b ).ARES achieves 4.45% gains against XYZ baseline and 2.35% for another baseline.",
"perturbed_explanation": "1. The statement claims MM-CoT models perform around 20% better than the reported accuracy in (Zhang et al., 2023b), whereas the context mentions a 10% improvement, not 20%. 2. The addition of 'against XYZ baseline and for another baseline' adds unintended specificity not present in the context, which simply discusses percentage gains without associating them with specific baselines."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "4",
"section": "4.4",
"all_context": [
"We investigate whether the improved rationale also contributes to answer inference accuracy.",
"Table 2 shows the main results of answer inference on the ScienceQA.",
"We evaluate our base model against the MM-CoT baseline.",
"achieves a 2.79% improvement compared to the corresponding baseline ().",
"The large model () shows some minimal improvement compared to the corresponding baseline.",
"However, it's worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a ).",
"This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).",
"The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.",
"Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.",
"Table 3 shows the results of answer inference on the A-OKVQA.",
"We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.",
"In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).",
"ARES achieves 4.45% gains against and 2.35% for .",
"In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.",
"Figure 3 shows the rationale reasoning for 4 cases.",
"The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.",
"However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).",
"Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.",
"1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.",
"Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
""
],
"target_context_ids": [
17,
18,
19
],
"selected_paragraphs": [
"[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.",
"[paragraph id = 18] 1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"[paragraph id = 19] 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES."
],
"table_html": "<figure class=\"ltx_table\" id=\"S5.T4\">\n<p class=\"ltx_p\" id=\"S5.T4.5\"><span class=\"ltx_text\" id=\"S5.T4.5.5\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"S5.T4.5.5.5\" style=\"width:158.8pt;height:108pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"S5.T4.5.5.5.5\"><span class=\"ltx_text\" id=\"S5.T4.5.5.5.5.5\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S5.T4.5.5.5.5.5.5\">\n<span class=\"ltx_thead\">\n<span class=\"ltx_tr\" id=\"S5.T4.5.5.5.5.5.5.6.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S5.T4.5.5.5.5.5.5.6.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.5.5.5.5.5.5.6.1.1.1\">Model</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T4.5.5.5.5.5.5.6.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.5.5.5.5.5.5.6.1.2.1\">Accuracy</span></span></span>\n</span>\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"S5.T4.1.1.1.1.1.1.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S5.T4.1.1.1.1.1.1.1.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T4.1.1.1.1.1.1.1.2\">85.95</span></span>\n<span class=\"ltx_tr\" id=\"S5.T4.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S5.T4.2.2.2.2.2.2.2.1\"><span class=\"ltx_text ltx_markedasmath\" id=\"S5.T4.2.2.2.2.2.2.2.1.1\">1st RL</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S5.T4.2.2.2.2.2.2.2.2\">86.70</span></span>\n<span class=\"ltx_tr\" id=\"S5.T4.3.3.3.3.3.3.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S5.T4.3.3.3.3.3.3.3.1\"><span class=\"ltx_text ltx_markedasmath\" id=\"S5.T4.3.3.3.3.3.3.3.1.1\">1st ARES</span></span>\n<span class=\"ltx_td ltx_align_center\" 
id=\"S5.T4.3.3.3.3.3.3.3.2\">87.81</span></span>\n<span class=\"ltx_tr\" id=\"S5.T4.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S5.T4.4.4.4.4.4.4.4.1\"><span class=\"ltx_text ltx_markedasmath\" id=\"S5.T4.4.4.4.4.4.4.4.1.1\">1st ARES &</span><span class=\"ltx_text ltx_markedasmath\" id=\"S5.T4.4.4.4.4.4.4.4.1.2\"> 2nd RL</span></span>\n<span class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.4.4.4.4.4.2\">87.88</span></span>\n<span class=\"ltx_tr\" id=\"S5.T4.5.5.5.5.5.5.5\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S5.T4.5.5.5.5.5.5.5.1\"><span class=\"ltx_text ltx_markedasmath\" id=\"S5.T4.5.5.5.5.5.5.5.1.1\">2nd ARES</span></span>\n<span class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T4.5.5.5.5.5.5.5.2\">88.38</span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n<figcaption class=\"ltx_caption\"><span class=\"ltx_tag ltx_tag_table\">Table 4: </span>Ablation study: The accuracy gradually improves as each stage of ARES is added.</figcaption>\n</figure>",
"perturb_sentence_id": [
19,
20
],
"output": {
"perturbed_statement": "[paragraph id = 19] 1st ARES & 2nd RL represents the second ARES on 1st RL, and finally, 2nd ARES refers to one round of ARES.Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
"perturbed_explanation": "1. 1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES. This implies a sequential progression through different stages involving ARES and RL. 2. The statement incorrectly states that '1st ARES & 2nd RL represents the second ARES on 1st RL' and '2nd ARES refers to one round of ARES'. In contrast, it should correctly describe the order of operations as involving a second RL following the 1st ARES and two rounds of ARES for the '2nd ARES' phase, rather than one round as mistakenly mentioned."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "5",
"section": "2.2",
"all_context": [
"Reinforcement Learning (RL) fine-tunes our model to maximize sum of sentence rewards from an advanced AI model such as GPT-4 and Claude 3 Opus.",
"The RL objective is as follows: where is a discount factor.",
"We use Proximal Policy Optimization (PPO) (Schulman et al., 2017 ) to achieve this RL objective, treating sentences as actions (Equation 3 ).",
"where is the original policy (baseline model) and is an advantage estimator at timestep .",
"PPO is commonly leveraged in Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ).",
"PPO's conservative update prevents the training model from deviating too far from the original model, thus avoiding degeneration.",
"Sentence-Level Nuanced Feedback: We request a score between and for each sentence in CoT through the advanced AI for RL.",
"The closer the score is to , the more relevant and helpful it is to solving the problem.",
"Table 5 presents the prompt format.",
"We additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).",
"Therefore, the actual range is from to .",
"Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 ).",
"This allows us to understand the direction in which the model is intended to change through RL.",
"Advantages of Using Advanced AI for Score Feedback: Although calling the API has disadvantages, such as incurring costs or facing usage limits, there exist several advantages to using the advanced AI for feedback.",
"First, there is no need to train a reward model.",
"Second, as the RL fine-tuned model begins to generate out-of-distribution outputs that differ from the data used to train the reward model, it becomes challenging for the trained reward model to provide accurate rewards.",
"However, this out-of-distribution problem is effectively addressed with the advanced AI.",
"RL Challenge: One of the challenging factors for RL is hyperparameter tuning (Eimer et al., 2023 ).",
"This often results in generating repetitive words and truncated sentences (Ouyang et al., 2022 ).",
"Additionally, as the model size increases, finding working hyperparameters becomes infeasible for individuals.",
"To alleviate this issue, we utilize correction feedback from the advanced AI as the second stage (Section 2.3 ), and proceed with the supervised fine-tuning to stabilize the RL fine-tuned model.",
""
],
"target_context_ids": [
8,
9,
10,
11
],
"selected_paragraphs": [
"[paragraph id = 8] Table 5 presents the prompt format.",
"[paragraph id = 9] We additionally shift the reward distribution by to center it at (Zheng et al., 2023 ).",
"[paragraph id = 10] Therefore, the actual range is from to .",
"[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits emergent behaviors (please refer to Section 4 )."
],
"table_html": "<figure class=\"ltx_table\" id=\"A1.T5\">\n<table class=\"ltx_tabular ltx_centering ltx_align_middle\" id=\"A1.T5.1\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A1.T5.1.1.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt\" id=\"A1.T5.1.1.1.1\" style=\"width:113.8pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.1.1.1.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.1.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.1.1.1.1.1.1\">Feedback</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt\" id=\"A1.T5.1.1.1.2\" style=\"width:313.0pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.1.1.2.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.1.1.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.1.1.2.1.1.1\">Prompt Structure</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T5.1.2.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A1.T5.1.2.2.1\" style=\"width:113.8pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.2.2.1.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.2.2.1.1.1\">Sentence-Level \n<br class=\"ltx_break\"/>Nuanced Feedback</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A1.T5.1.2.2.2\" style=\"width:313.0pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.2.2.2.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.2.2.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.2.2.2.1.1.1\">[Prompt when Image is provided]\n<br class=\"ltx_break\"/></span>There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. 
The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/>Question: <Question>\n<br class=\"ltx_break\"/>Options: <Choices>\n<br class=\"ltx_break\"/>Hint: <Hint>\n<br class=\"ltx_break\"/>Answer: <Answer>\n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/>{\n<br class=\"ltx_break\"/>\"<Rationale 1>\": xxx,\n<br class=\"ltx_break\"/>\"<Rationale 2>\": xxx,\n<br class=\"ltx_break\"/>\"<Rationale 3>\": xxx\n<br class=\"ltx_break\"/>}\n\n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.2.2.2.1.1.2\">[Prompt when no Image is provided]\n<br class=\"ltx_break\"/></span>There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <span class=\"ltx_text ltx_font_italic\" id=\"A1.T5.1.2.2.2.1.1.3\"><same as the prompt when the Image is provided></span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 5: </span>Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.</figcaption>\n</figure>",
"perturb_sentence_id": [
9,
10
],
"output": {
"perturbed_statement": "[paragraph id = 11] Using these nuanced scores, the RL fine-tuned model exhibits limited behaviors (please refer to Section 4 ).",
"perturbed_explanation": "1. The original statement explains the characteristics of the RL fine-tuned model, indicating that it exhibits emergent behaviors due to nuanced scores. 2. The statement incorrectly states that the RL fine-tuned model exhibits limited behaviors instead of emergent behaviors, which contradicts the information that it yields emergent behaviors as detailed in the context."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "5",
"section": "3",
"all_context": [
"Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.",
"We choose ScienceQA because it contains reasoning chains to derive the answer.",
"Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).",
"In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).",
"Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.",
"MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).",
"This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).",
"We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.",
"Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.",
"We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.",
"We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.",
"To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.",
"(See Table 5 .)",
"In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.",
"We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.",
"The RL stage often makes the training model generate repetitive sentences.",
"This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.",
"To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.",
"(See Appendix C.2 .)",
"Training Details: For the RL stage, we use a learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.",
"The learning rate for is with epochs for PPO and a batch size of for both tasks.",
"We proceed with 2 rounds of our pipeline for and 2 rounds for for ScienceQA.",
"For A-OKVQA, we proceed with 1 round for both model sizes.",
"For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.",
"Additionally, we replace MM-CoT's inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).",
"The LoRA adapter effectively utilizes the rationale model s features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).",
"For more detailed settings, please refer to Appendix C .",
"Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.",
"First, we evaluate ARES's rationale reasoning quality against baseline models since we enhance our model based on them.",
"For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).",
"The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).",
"Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).",
"This evaluation is also performed on both model sizes and tasks.",
"Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .",
""
],
"target_context_ids": [
12,
13,
14,
15,
16,
17,
18
],
"selected_paragraphs": [
"[paragraph id = 12] (See Table 5 .)",
"[paragraph id = 13] In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.",
"[paragraph id = 14] We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.",
"[paragraph id = 15] The RL stage often makes the training model generate repetitive sentences.",
"[paragraph id = 16] This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.",
"[paragraph id = 17] To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.",
"[paragraph id = 18] (See Appendix C.2 .)"
],
"table_html": "<figure class=\"ltx_table\" id=\"A1.T5\">\n<table class=\"ltx_tabular ltx_centering ltx_align_middle\" id=\"A1.T5.1\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A1.T5.1.1.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt\" id=\"A1.T5.1.1.1.1\" style=\"width:113.8pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.1.1.1.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.1.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.1.1.1.1.1.1\">Feedback</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt\" id=\"A1.T5.1.1.1.2\" style=\"width:313.0pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.1.1.2.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.1.1.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.1.1.2.1.1.1\">Prompt Structure</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A1.T5.1.2.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A1.T5.1.2.2.1\" style=\"width:113.8pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.2.2.1.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.2.2.1.1.1\">Sentence-Level \n<br class=\"ltx_break\"/>Nuanced Feedback</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A1.T5.1.2.2.2\" style=\"width:313.0pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A1.T5.1.2.2.2.1\">\n<span class=\"ltx_p\" id=\"A1.T5.1.2.2.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.2.2.2.1.1.1\">[Prompt when Image is provided]\n<br class=\"ltx_break\"/></span>There exists a set comprising Image, Options, Hint, and Answer for a Question. The reasoning process used to deduce the answer is provided in JSON format. Fill in \"xxx\" with values ranging from 0.0 to 1.0, in increments of 0.1. 
The reasoning may include the starting point of thought, the process of elimination, or true statements, although these may not appear to be directly related to the answer at first glance. A value closer to 0.0 indicates a completely incorrect rationale, 0.5 indicates a neutral rationale such as the initial thought process or true statements that guide later guesses towards the answer, and a value closer to 1.0 denotes a correct or relevant rationale for the question. Please just fill the \"xxx\" parts and only return the JSON format. If a sentence is repetitive (appeared before), then give 0.0. \n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/>Question: <Question>\n<br class=\"ltx_break\"/>Options: <Choices>\n<br class=\"ltx_break\"/>Hint: <Hint>\n<br class=\"ltx_break\"/>Answer: <Answer>\n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/>{\n<br class=\"ltx_break\"/>\"<Rationale 1>\": xxx,\n<br class=\"ltx_break\"/>\"<Rationale 2>\": xxx,\n<br class=\"ltx_break\"/>\"<Rationale 3>\": xxx\n<br class=\"ltx_break\"/>}\n\n<br class=\"ltx_break\"/>\n<br class=\"ltx_break\"/><span class=\"ltx_text ltx_font_bold\" id=\"A1.T5.1.2.2.2.1.1.2\">[Prompt when no Image is provided]\n<br class=\"ltx_break\"/></span>There exists a set comprising Options, Hint, and Answer for a Question. The reasoning process … <span class=\"ltx_text ltx_font_italic\" id=\"A1.T5.1.2.2.2.1.1.3\"><same as the prompt when the Image is provided></span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 5: </span>Prompt structure for sentence-level nuanced feedback in a question-answering system. The table outlines the format for prompts when an image is provided and when no image is provided, detailing how to score the rationale for each sentence in terms of correctness and relevance.</figcaption>\n</figure>",
"perturb_sentence_id": [
12,
13
],
"output": {
"perturbed_statement": "[paragraph id = 12] (See Table 5 .) In order to collect the corrected dataset for the RL stage, we let Haiku refer to the given problem and correct the answer as the prompt.",
"perturbed_explanation": "1. The original statement discusses collecting a corrected dataset for the SFT stage by letting Haiku refer to the given problem and correct the answer as the prompt. 2. The statement is incorrect because it mentions collecting the corrected dataset for the RL stage, whereas the context suggests this process is associated with the SFT stage. The RL stage is mentioned in relation to generating repetitive sentences, not dataset correction."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "8",
"section": "3",
"all_context": [
"Data: We first evaluate our proposed method on the ScienceQA (Lu et al., 2022a ) dataset, a large-scale, multi-modal science dataset designed to assess multi-hop reasoning abilities.",
"We choose ScienceQA because it contains reasoning chains to derive the answer.",
"Each problem consists of a question, multiple options, multi-modal contexts, a correct answer, and an annotated lecture or solution chain (note that around lack the solution chain).",
"In addition, we conduct experiments on A-OKVQA (Schwenk et al., 2022 ), a knowledge-based multi-modal benchmark with a diverse set of challenging questions paired with rationales, demanding non-trivial commonsense knowledge (see Appendix B ).",
"Baselines: We mainly compare our method with Multimodal-CoT (MM-CoT) (Zhang et al., 2023b ) as the baseline because it utilizes reasoning chains to solve multi-modal tasks.",
"MM-CoT leverages two distinct models: the first generates a rationale for a given problem, and the second, an inference model, takes the concatenated input (problem and generated rationale).",
"This separated framework shows improved performance, even for relatively small models such as (Chia et al., 2023 ) (M) and (M).",
"We use the rationale model provided by MM-CoT for ScienceQA and retrain the rationale model ourselves for A-OKVQA because there is no provided model.",
"Prompts for Feedback: Since our proposed ARES requests different types of feedback for each stage, a corresponding prompt exists separately.",
"We use Claude 3 Haiku for all training to get feedback because it is approximately times cheaper than the top competing models, yet still demonstrates decent performance.",
"We first request scores ranging from to for each sentence in CoT to proceed with the RL stage.",
"To obtain reasonable scores, we let Haiku consider the starting point of thought, the process of elimination, or true statements.",
"(See Table 5 .)",
"In order to collect the corrected dataset for the SFT stage, we let Haiku refer to the given problem and correct the answer as the prompt.",
"We ask Haiku to maintain the format of the existing rationale chains as much as possible and correct only the parts that require correction.",
"The RL stage often makes the training model generate repetitive sentences.",
"This repetition is not easily removed even by GPT-4 when the repetitive sentence exists in the middle of rationale reasoning.",
"To reduce the burden of feedback, we simply hard-code the removal of repetitive sentences before adding the generated rationale to the prompt.",
"(See Appendix C.2 .)",
"Training Details: For the RL stage, we use a learning rate of and epochs for PPO with a batch size of for both ScienceQA and A-OKVQA.",
"The learning rate for is with epochs for PPO and a batch size of for both tasks.",
"We proceed with 2 rounds of our pipeline for and 2 rounds for for ScienceQA.",
"For A-OKVQA, we proceed with 1 round for both model sizes.",
"For the SFT stage for correction, we follow the hyperparameters used in MM-CoT for both model sizes.",
"Additionally, we replace MM-CoT's inference model, which is the same size as the rationale model, with the Low-Rank Adaptation (LoRA) (Hu et al., 2021 ) added to the rationale model (Figure 4 ).",
"The LoRA adapter effectively utilizes the rationale model's features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).",
"For more detailed settings, please refer to Appendix C .",
"Evaluation Metrics: We use two main metrics to test how our pipeline (ARES) improves rationale reasoning quality.",
"First, we evaluate ARES's rationale reasoning quality against baseline models since we enhance our model based on them.",
"For two different model sizes ( and ) and two tasks (ScienceQA and A-OKVQA), rationale reasoning quality is evaluated by GPT-4o-2024-05-13 and the win rate is calculated (Section 4.3 ).",
"The GPT-4 series is actively used as an evaluation metric, replacing human judgment for various domains (Liu et al., 2023b ; Sottana et al., 2023 ).",
"Second, we assess how the improved rationale reasoning impacts answer accuracy (Section 4.4 ).",
"This evaluation is also performed on both model sizes and tasks.",
"Additionally, we analyze how the RL stage fine-tunes the training model and maximizes the sum of rewards in Section 4.1 .",
""
],
"target_context_ids": [
25,
26
],
"selected_paragraphs": [
"[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a small number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8 ).",
"[paragraph id = 26] For more detailed settings, please refer to Appendix C ."
],
"table_html": "<figure class=\"ltx_table\" id=\"A3.T8\">\n<p class=\"ltx_p ltx_align_center\" id=\"A3.T8.4\"><span class=\"ltx_text\" id=\"A3.T8.4.4\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"A3.T8.4.4.4\" style=\"width:210.6pt;height:90pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"A3.T8.4.4.4.4\"><span class=\"ltx_text\" id=\"A3.T8.4.4.4.4.4\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"A3.T8.4.4.4.4.4.4\">\n<span class=\"ltx_thead\">\n<span class=\"ltx_tr\" id=\"A3.T8.4.4.4.4.4.4.5.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.1.1\">Model</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.2.1\">Rationale</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.3.1\">Inference</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.4.1\">Total</span></span></span>\n</span>\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"A3.T8.1.1.1.1.1.1.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.2\">6m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.3\">2m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.4\">8m</span></span>\n<span class=\"ltx_tr\" 
id=\"A3.T8.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.2\">-</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.3\">3m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.4\">3m</span></span>\n<span class=\"ltx_tr\" id=\"A3.T8.3.3.3.3.3.3.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.2\">16m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.3\">3m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.4\">19m</span></span>\n<span class=\"ltx_tr\" id=\"A3.T8.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.2\">-</span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.3\">6m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.4\">6m</span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 8: </span>Time Comparison between MM-CoT and ARES models for A-OKVQA test set</figcaption>\n</figure>",
"perturb_sentence_id": [
25,
26
],
"output": {
"perturbed_statement": "[paragraph id = 25] The LoRA adapter effectively utilizes the rationale model's features with a large number of weights, enabling 2x–14x faster inference compared to MM-CoT, which introduces a separate inference model (See the time comparison in Table 7 and Table 8). For more detailed settings, please refer to Appendix C.",
"perturbed_explanation": "1. The original explanation highlights that the LoRA adapter efficiently uses the rationale model's features with a small number of weights, which results in significantly faster inference speeds compared to MM-CoT. 2. The statement incorrectly mentions a 'large number of weights', contradicting the efficiency highlighted by the small number of weights that enable faster processing. LoRA's advantage in speed is due to its minimal weight requirement, not a large one."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "8",
"section": "5",
"all_context": [
"Chain-of-Thought (CoT) is a multi-step reasoning method for problem-solving that encourages LLMs to consider the intermediate reasoning steps.",
"Zero-Shot-CoT (Kojima et al., 2023 ) promotes CoT by using prompts such as \"Let's think step by step\" for LLMs.",
"For Few-Shot-CoT (Zhang et al., 2022b ; Wei et al., 2023 ), a few examples with reasoning processes are provided, allowing the model to refer to these examples and understand how to perform CoT.",
"Wei et al.",
"(2023 ) reveal that this CoT technique positively impacts the performance of large models (B), but has minimal effect on smaller models.",
"MM-CoT (Zhang et al., 2023b ) suggest that CoT is beneficial even for relatively small models, such as 200M, if the model that generates intermediate reasoning and the model that infers the answer are separated.",
"We find that simply adding a LoRA adapter (Hu et al., 2021 ) to the reasoning model results in comparable performance.",
"This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).",
"This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input.",
"Reinforcement Learning from Human Feedback (RLHF) (Glaese et al., 2022 ; Ouyang et al., 2022 ) and AI Feedback (RLAIF) (Bai et al., 2022 ) align LLMs with user preferences.",
"Ouyang et al.",
"(2022 ) collects ranked feedback from human labelers and uses this feedback to perform Reinforcement Learning (RL).",
"Constitutional AI (CAI) (Bai et al., 2022 ) collects ranked AI feedback rather than costly human feedback and handles harmfulness with RL.",
"Both approaches learn outcome-supervised reward models (ORM) using ranking-based feedback.",
"Lightman et al.",
"(2023 ), instead, propose a process-supervised reward model (PRM) that leverages sentence-level feedback for CoT.",
"Lightman et al.",
"(2023 ); Luo et al.",
"(2024 ) evaluate each trained ORM and PRM with searching algorithms such as best-of- or Monte Carlo Tree Search (MCTS) by selecting the highest-scored solution, demonstrating that the PRM-selected solution outperforms the ORM-selected one.",
"Wang et al.",
"(2024 ) perform RL using PRM, providing heuristic sentence-level scores for math problems that are simple to grade.",
"As an LLM is trained with RL and starts generating outputs different from the original distribution, these reward models would not correctly provide rewards (Pitis, 2023 ; Byun and Perrault, 2024 ).",
"Instead of training a reward model for a more general task, we perform RL by requesting sentence-level rewards from advanced AI models such as GPT-4.",
"Model Accuracy 85.95 1st RL 86.70 1st ARES 87.81 1st ARES & 2nd RL 87.88 2nd ARES 88.38",
""
],
"target_context_ids": [
7,
8
],
"selected_paragraphs": [
"[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, and generates answers 2x–14x faster than MM-CoT, which uses a separate inference model (See Table 7 and Table 8 ).",
"[paragraph id = 8] This speed advantage arises from the fact that our framework does not require a rationale as input, whereas the separate inference model framework must first generate the rationale before using it as input."
],
"table_html": "<figure class=\"ltx_table\" id=\"A3.T8\">\n<p class=\"ltx_p ltx_align_center\" id=\"A3.T8.4\"><span class=\"ltx_text\" id=\"A3.T8.4.4\">\n<span class=\"ltx_inline-block ltx_transformed_outer\" id=\"A3.T8.4.4.4\" style=\"width:210.6pt;height:90pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1,1) ;\">\n<span class=\"ltx_p\" id=\"A3.T8.4.4.4.4\"><span class=\"ltx_text\" id=\"A3.T8.4.4.4.4.4\">\n<span class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"A3.T8.4.4.4.4.4.4\">\n<span class=\"ltx_thead\">\n<span class=\"ltx_tr\" id=\"A3.T8.4.4.4.4.4.4.5.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.1.1\">Model</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.2.1\">Rationale</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.3.1\">Inference</span></span>\n<span class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.5.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"A3.T8.4.4.4.4.4.4.5.1.4.1\">Total</span></span></span>\n</span>\n<span class=\"ltx_tbody\">\n<span class=\"ltx_tr\" id=\"A3.T8.1.1.1.1.1.1.1\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.2\">6m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.3\">2m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.1.1.1.1.1.1.1.4\">8m</span></span>\n<span class=\"ltx_tr\" 
id=\"A3.T8.2.2.2.2.2.2.2\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.2\">-</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.3\">3m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.2.2.2.2.2.2.2.4\">3m</span></span>\n<span class=\"ltx_tr\" id=\"A3.T8.3.3.3.3.3.3.3\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.2\">16m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.3\">3m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_t\" id=\"A3.T8.3.3.3.3.3.3.3.4\">19m</span></span>\n<span class=\"ltx_tr\" id=\"A3.T8.4.4.4.4.4.4.4\">\n<span class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.1\"></span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.2\">-</span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.3\">6m</span>\n<span class=\"ltx_td ltx_align_center ltx_border_b ltx_border_t\" id=\"A3.T8.4.4.4.4.4.4.4.4\">6m</span></span>\n</span>\n</span></span></span>\n</span></span></span></p>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 8: </span>Time Comparison between MM-CoT and ARES models for A-OKVQA test set</figcaption>\n</figure>",
"perturb_sentence_id": [
7,
8
],
"output": {
"perturbed_statement": "[paragraph id = 7] This framework enables the LoRA adapter to effectively utilize all features, from raw text to latent features, but generates answers 2x–14x slower than MM-CoT, which uses a separate inference model (See Table 7 and Table 8). This speed disadvantage arises from the fact that our framework requires an additional rationale as input, unlike the separate inference model framework that does not require any rationale before generating input.",
"perturbed_explanation": "The original explanation highlights two key aspects: 1) The LoRA adapter framework generates answers significantly faster (2x-14x) than the MM-CoT framework. 2) The speed advantage is due to not needing a rationale as input, unlike MM-CoT, which requires generating a rationale first. The statement introduces inaccuracies in the following ways: 1) It claims the LoRA framework is slower (2x–14x) than MM-CoT, which contradicts the actual performance speedup detailed earlier. 2) It incorrectly suggests that the LoRA framework needs an additional rationale as input, reversing the actual rationale requirement described."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "13",
"section": "2.3",
"all_context": [
"The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.",
"However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.",
"(See examples in Appendix D .)",
"Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.",
"We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .",
"Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.",
"This reduces the burden of RL's exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.",
"How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.",
"The direction of learning is determined by the reward (scalar) value.",
"However, the opposite direction of the reward is sometimes required.",
"For example, suppose there is a truncated sentence in CoT.",
"gets a negative score because it is an incomplete sentence (Table 13 ).",
"If there is no correction stage, the probability of is simply reduced.",
"What if contains some valuable part?",
"This valuable part is ignored, and its probability decreases.",
"To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.",
"Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.",
"Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model.",
"The KL divergence penalty further prevents deviation.",
"However, this penalty often causes the model's degeneration.",
"As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.",
"While this aims to align the training model with specific preferences, it tends to anchor the model to the pretraining dataset.",
"Instead, we perform supervised fine-tuning through the Teacher's correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.",
""
],
"target_context_ids": [
11,
12,
13,
14,
15,
16
],
"selected_paragraphs": [
"[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 13 ).",
"[paragraph id = 12] If there is no correction stage, the probability of is simply reduced.",
"[paragraph id = 13] What if contains some valuable part?",
"[paragraph id = 14] This valuable part is ignored, and its probability decreases.",
"[paragraph id = 15] To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.",
"[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T13\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T13.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T13.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_top ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T13.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.1.1.1.1\">Example of the Incomplete\nSentence After the RL Stage</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T13.1.2.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" colspan=\"2\" id=\"A4.T13.1.2.1.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.2.1.1.1\">Problem:</span> What is the volume of a large soup pot?</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.3.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T13.1.3.2.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.3.2.1.1\">Choices:</span> [7 liters, 7 milliliters]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.4.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T13.1.4.3.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.4.3.1.1\">Answer:</span> 7 liters</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.5.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.5.4.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.5.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.5.4.1.1.1\" style=\"width:113.8pt;\">Before RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.5.4.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.5.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.5.4.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.5.4.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" 
id=\"A4.T13.1.6.5\">\n<td class=\"ltx_td ltx_align_top\" id=\"A4.T13.1.6.5.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"A4.T13.1.6.5.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.6.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.6.5.2.1.1\" style=\"width:313.0pt;\">The better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.7.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.7.6.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.7.6.1.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.7.6.1.1.1\" style=\"width:113.8pt;\">After RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.7.6.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.7.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.7.6.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.7.6.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.8.7\">\n<td class=\"ltx_td ltx_align_top ltx_border_bb\" id=\"A4.T13.1.8.7.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"A4.T13.1.8.7.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.8.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.8.7.2.1.1\" style=\"width:313.0pt;\">The better estimate for the volume of a large soup pot <span class=\"ltx_text\" id=\"A4.T13.1.8.7.2.1.1.1\" style=\"color:#FF0000;\">is-</span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 13: </span>Example of the incomplete sentence after the RL stage.</figcaption>\n</figure>",
"perturb_sentence_id": [
11,
12
],
"output": {
"perturbed_statement": "[paragraph id = 11] gets a negative score because it is an incomplete sentence (Table 18). If there is no correction stage, the probability of is simply increased.",
"perturbed_explanation": "1. The original explanation indicates that the context discusses a negative score given to an incomplete sentence and the probability of it being reduced if not corrected. 2. The statement is incorrect because it references Table 18 instead of Table 13, and it claims that the probability of the incomplete sentence is increased without the correction stage, which contradicts the context where the probability decreases if the valuable part is ignored."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "13",
"section": "4.2",
"all_context": [
"Despite the benefits of RL, hyperparameter tuning often requires massive effort.",
"Without meticulous tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.",
"To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.",
"SFT is more stable than RL.",
"We evaluate how well the SFT stage corrects errors caused by the RL stage for various RL hyperparameters.",
"We test various RL hyperparameters such as learning rate = {5e-6, 1e-5, 2e-5, 5e-5}, batch size = {2, 4, 8, 16, 32}, and PPO epoch = {5, 10, 15}.",
"As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).",
"The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 ).",
"However, excessive RL learning rates or epochs cause serious degeneration of the model, such as producing no output or generating strange words, and the results of correction feedback are also unreasonable.",
"ScienceQA Win Rate vs 69.76% vs 73.76% A-OKVQA Win Rate vs 69.11% vs 66.96% Model Size NAT SOC LAN TXT IMG NO G1-6 G7-12 Avg Human - 90.23 84.97 87.48 89.60 87.50 88.10 91.59 82.42 88.40 MCAN (Yu et al., 2019 ) 95M 56.08 46.23 58.09 59.43 51.17 55.40 51.65 59.72 54.54 Top-Down (Anderson et al., 2018 ) 70M 59.50 54.33 61.82 62.90 54.88 59.79 57.27 62.16 59.02 BAN (Kim et al., 2018 ) 112M 60.88 46.57 66.64 62.61 52.60 65.51 56.83 63.94 59.37 DFAF (Peng et al., 2019 ) 74M 64.03 48.82 63.55 65.88 54.49 64.11 57.12 67.17 60.72 ViLT (Kim et al., 2021 ) 113M 60.48 63.89 60.27 63.20 61.38 57.00 60.72 61.90 61.14 Patch-TRM (Lu et al., 2022b ) 90M 65.19 46.79 65.55 66.96 55.28 64.95 58.04 67.50 61.42 VisualBERT (Li et al., 2019 ) 111M 59.33 69.18 61.18 62.71 62.17 58.54 62.96 59.92 61.87 UnifiedQABase (Khashabi et al., 2020 ) 223M 68.16 69.18 74.91 63.78 61.38 77.84 72.98 65.00 70.12 UnifiedQABase w/ CoT (Lu et al., 2022a ) 223M 71.00 76.04 78.91 66.42 66.53 81.81 77.06 68.82 74.11 LLaMA-Adapter (Zhang et al., 2023a ) 6B 84.37 88.30 84.36 83.72 80.32 86.90 85.83 84.05 85.19 LLaVA (Liu et al., 2023a ) 13B 90.36 95.95* 88.00 89.49 88.00 90.66 90.93 90.90* 90.92 InstructBLIP (Dai et al., 2023 ) 11B - - - - 90.70* - - - - (Zhang et al., 2023b ) 251M+251M 84.59 92.46 83.45 83.87 83.29 85.64 86.34 85.23 85.95 (Ours) 251M+30M 87.92 92.58 85.91 86.61 85.82 88.36 88.88 87.48 88.38 (Zhang et al., 2023b ) 790M+790M 90.76 93.59 86.55 89.69 87.85 89.55 90.90 89.12 90.26 (Ours) 790M+76M 91.21* 92.80 89.45* 90.27* 88.35 91.22* 91.48* 90.38 91.09*",
""
],
"target_context_ids": [
1,
2,
6,
7
],
"selected_paragraphs": [
"[paragraph id = 1] Without meticulous tuning, the RL fine-tuned model may produce errors such as repetitive or incomplete sentences.",
"[paragraph id = 2] To address these issues, we add a supervised fine-tuning (SFT) stage after RL to correct these errors.",
"[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are repetitive or truncated (see Table 13 and 12 ).",
"[paragraph id = 7] The SFT stage, with correction feedback, reflects the direction in which the model is fine-tuned through RL and appropriately guides it (Table 13 and 16 )."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T13\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T13.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T13.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_top ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T13.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.1.1.1.1\">Example of the Incomplete\nSentence After the RL Stage</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T13.1.2.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" colspan=\"2\" id=\"A4.T13.1.2.1.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.2.1.1.1\">Problem:</span> What is the volume of a large soup pot?</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.3.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T13.1.3.2.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.3.2.1.1\">Choices:</span> [7 liters, 7 milliliters]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.4.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T13.1.4.3.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.4.3.1.1\">Answer:</span> 7 liters</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.5.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.5.4.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.5.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.5.4.1.1.1\" style=\"width:113.8pt;\">Before RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.5.4.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.5.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.5.4.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.5.4.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" 
id=\"A4.T13.1.6.5\">\n<td class=\"ltx_td ltx_align_top\" id=\"A4.T13.1.6.5.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"A4.T13.1.6.5.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.6.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.6.5.2.1.1\" style=\"width:313.0pt;\">The better estimate for the volume of a large soup pot is 7 liters. 7 milliliters is too little.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.7.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.7.6.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.7.6.1.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.7.6.1.1.1\" style=\"width:113.8pt;\">After RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T13.1.7.6.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.7.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.7.6.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T13.1.7.6.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T13.1.8.7\">\n<td class=\"ltx_td ltx_align_top ltx_border_bb\" id=\"A4.T13.1.8.7.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"A4.T13.1.8.7.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T13.1.8.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T13.1.8.7.2.1.1\" style=\"width:313.0pt;\">The better estimate for the volume of a large soup pot <span class=\"ltx_text\" id=\"A4.T13.1.8.7.2.1.1.1\" style=\"color:#FF0000;\">is-</span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 13: </span>Example of the incomplete sentence after the RL stage.</figcaption>\n</figure>",
"perturb_sentence_id": [
6,
7
],
"output": {
"perturbed_statement": "[paragraph id = 6] As a result of RL, we observe that some of the sentences in rationale chains are neither repetitive nor truncated (see Table 13 and 12 ). The SFT stage, with correction feedback, reflects a different direction from how the model is fine-tuned through RL and misguides it (Table 13 and 16 ).",
"perturbed_explanation": "This explanation addresses two aspects presented in the statement. 1. The first aspect is regarding the impact of RL fine-tuning. The original statement mentions that RL resulted in sentences within rationale chains being repetitive or truncated. 2. The second aspect involves the role of SFT. The SFT stage is designed to guide and correct the errors introduced during RL fine-tuning. The statement incorrectly claims that sentences are neither repetitive nor truncated, and that SFT misguides the model, both of which contradict the described scenario where RL leads to errors that SFT aims to correct."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "14",
"section": "4.1",
"all_context": [
"Through RL, a training model is aligned to a specific preference.",
"Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.",
"However, this process produces some interesting additional results.",
"First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.",
"In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.",
"The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).",
"We observe this especially when utilizing PPO's advantage normalization or when the learning rate is large.",
"Second, the training model begins to explain why other options are not the answer (See Table 15 ).",
"The process of elimination is a useful method for deriving answers when options are given.",
""
],
"target_context_ids": [
4,
5,
6
],
"selected_paragraphs": [
"[paragraph id = 4] In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.",
"[paragraph id = 5] The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).",
"[paragraph id = 6] We observe this especially when utilizing PPO's advantage normalization or when the learning rate is large."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T14\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T14.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T14.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_top ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T14.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.1.1.1.1\">Changed Empty Rationale Reasoning Through RL</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T14.1.2.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" colspan=\"2\" id=\"A4.T14.1.2.1.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.2.1.1.1\">Problem:</span> Select the solid.</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.3.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T14.1.3.2.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.3.2.1.1\">Choices:</span> [rain, water in a fishbowl, hammer]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.4.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T14.1.4.3.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.4.3.1.1\">Answer:</span> hammer</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.5.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T14.1.5.4.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.5.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.5.4.1.1.1\" style=\"width:113.8pt;\">Before RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T14.1.5.4.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.5.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.5.4.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.5.4.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.6.5\">\n<td 
class=\"ltx_td ltx_align_top\" id=\"A4.T14.1.6.5.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"A4.T14.1.6.5.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.6.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.6.5.2.1.1\" style=\"width:313.0pt;\"><empty></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.7.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T14.1.7.6.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.7.6.1.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.7.6.1.1.1\" style=\"width:113.8pt;\">After RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T14.1.7.6.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.7.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.7.6.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T14.1.7.6.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T14.1.8.7\">\n<td class=\"ltx_td ltx_align_top ltx_border_bb\" id=\"A4.T14.1.8.7.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"A4.T14.1.8.7.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T14.1.8.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T14.1.8.7.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text\" id=\"A4.T14.1.8.7.2.1.1.1\" style=\"color:#0000FF;\">Rain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you pour water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. 
Many hammers are made of iron and wood.The water in a fishbowl is a liquid. A liquid takes the shape of any container it is in. If you put water from a fishbowl into a different container, the water will take the shape of that container. But the water will still take up the same amount of space. A rain is a liquid. A liquid takes the shape of any container it is in. If you put rainwater into a bucket, the rainwater will take the shape of the bucket. But the rainwater will still take up the same amount of space. A hammer is a solid. A solid has a size and shape of its own. Many hammers are made of iron and wood. Both iron and wood are solids.</span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 14: </span>Example of an empty rationale reasoning that starts generating rationale reasoning after the RL stage.</figcaption>\n</figure>",
"perturb_sentence_id": [
4,
5
],
"output": {
"perturbed_statement": "[paragraph id = 4] In particular, 15% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14).",
"perturbed_explanation": "Original Explanation: In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data. The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward. \n\n1. The statement incorrectly claims that 15% of the problems have empty rationale reasoning data, whereas the correct percentage is 9.5%. \n2. This alteration changes the statistical representation of the data and may lead to misconceptions about the prevalence of problems lacking rationale reasoning in ScienceQA."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "15",
"section": "4.1",
"all_context": [
"Through RL, a training model is aligned to a specific preference.",
"Essentially, the model increases the probability of helpful sentences receiving good rewards and reduces the probability of incorrect or meaningless sentences.",
"However, this process produces some interesting additional results.",
"First, it supplements rationale reasoning for some problems where rationale reasoning is insufficient.",
"In particular, 9.5% of problems in ScienceQA have empty rationale reasoning (solution) data.",
"The model generates nothing before the RL stage for these problems but starts generating reasoning chains afterward (See Table 14 ).",
"We observe this especially when utilizing PPO's advantage normalization or when the learning rate is large.",
"Second, the training model begins to explain why other options are not the answer (See Table 15 ).",
"The process of elimination is a useful method for deriving answers when options are given.",
""
],
"target_context_ids": [
7,
8
],
"selected_paragraphs": [
"[paragraph id = 7] Second, the training model begins to explain why other options are not the answer (See Table 15 ).",
"[paragraph id = 8] The process of elimination is a useful method for deriving answers when options are given."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T15\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T15.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T15.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_top ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T15.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.1.1.1.1\">Example of the Process of Elimination After the RL Stage</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T15.1.2.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" colspan=\"2\" id=\"A4.T15.1.2.1.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.2.1.1.1\">Problem:</span> Which is a compound sentence?</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.3.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T15.1.3.2.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.3.2.1.1\">Choices:</span> [\"Last night we heard an owl outside in the oak tree.\",</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.4.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T15.1.4.3.1\"> \"I ate all of my dinner, so I can have some pie.\"]</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.5.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" colspan=\"2\" id=\"A4.T15.1.5.4.1\">\n<span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.5.4.1.1\">Answer:</span> \"I ate all of my dinner, so I can have some pie.\"</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.6.5\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T15.1.6.5.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T15.1.6.5.1.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.6.5.1.1.1\" style=\"width:113.8pt;\">Before RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T15.1.6.5.2\">\n<span class=\"ltx_inline-block 
ltx_align_top\" id=\"A4.T15.1.6.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.6.5.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.6.5.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.7.6\">\n<td class=\"ltx_td ltx_align_top\" id=\"A4.T15.1.7.6.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"A4.T15.1.7.6.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T15.1.7.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.7.6.2.1.1\" style=\"width:313.0pt;\">The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.8.7\">\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T15.1.8.7.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T15.1.8.7.1.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.8.7.1.1.1\" style=\"width:113.8pt;\">After RL</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"A4.T15.1.8.7.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T15.1.8.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.8.7.2.1.1\" style=\"width:313.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T15.1.8.7.2.1.1.1\">[Rationale reasoning]</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T15.1.9.8\">\n<td class=\"ltx_td ltx_align_top ltx_border_bb\" id=\"A4.T15.1.9.8.1\"></td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"A4.T15.1.9.8.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T15.1.9.8.2.1\">\n<span class=\"ltx_p\" id=\"A4.T15.1.9.8.2.1.1\" style=\"width:313.0pt;\">The first sentence is the compound sentence. It is made up of two simple sentences joined by a comma and the conjunction so. I ate all of my dinner, so I can have some pie. 
<span class=\"ltx_text\" id=\"A4.T15.1.9.8.2.1.1.1\" style=\"color:#0000FF;\">The second sentence does not have a comma, so it is not a compound sentence. Last night we heard an owl outside in the oak tree.</span></span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 15: </span>Example of the process of elimination rationale reasoning after the RL stage.</figcaption>\n</figure>",
"perturb_sentence_id": [
7,
8
],
"output": {
"perturbed_statement": "[paragraph id = 7] Second, the training model begins to explain why other options are the answer (See Table 15 ). The process of elimination is a useful method for deriving answers when options are given.",
        "perturbed_explanation": "1. The process of elimination is a useful method for deriving answers when options are given because it allows individuals to identify and discard alternatives that do not meet the given criteria or constraints. 2. However, the statement incorrectly claims that the training model explains why other options are the answer, which contradicts the concept of elimination, which focuses on discarding incorrect options rather than validating them as correct answers."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "16",
"section": "2.3",
"all_context": [
"The RL fine-tuning procedure makes model changes to maximize the reward sum, such as correcting mistakes or explaining why other options cannot be the answer.",
"However, without highly tuned hyperparameters (Eimer et al., 2023 ), the model after the RL phase may result in errors such as repeated sentences, truncated sentences, or incorrect content for some data points.",
"(See examples in Appendix D .)",
"Correction Feedback: Given the success of LLMs and LMMs in a wide range of areas (Brown et al., 2020 ; Chowdhery et al., 2022 ; Zhang et al., 2022a ), we are not restricted to requesting feedback in the form of scores.",
"We request correction feedback from advanced AI (Teacher) for sentences containing errors after the RL process, and obtain a corrected dataset .",
"Since the supervised fine-tuning is more stable and finding appropriate hyperparameters is easier than RL, we proceed with supervised fine-tuning using exactly as in common autoregressive model (Vaswani et al., 2023 ) training to stabilize the RL fine-tuned model.",
"This reduces the burden of RL's exhaustive hyperparameter tuning and properly guides the direction in which the training model wants to change.",
"How Correction Feedback Helps RL: RL increases the probability of positively rewarded actions (or sentences) and decreases the probability for negative rewards.",
"The direction of learning is determined by the reward (scalar) value.",
"However, the opposite direction of the reward is sometimes required.",
"For example, suppose there is a truncated sentence in CoT.",
"gets a negative score because it is an incomplete sentence (Table 13 ).",
"If there is no correction stage, the probability of is simply reduced.",
"What if contains some valuable part?",
"This valuable part is ignored, and its probability decreases.",
"To alleviate this issue, we instead receive the corrected sentence as feedback and encourage the training model to generate complete sentences, which is very challenging to achieve with only RL.",
"Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.",
"Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model.",
"The KL divergence penalty further prevents deviation.",
"However, this penalty often causes the model's degeneration.",
"As a solution, InstructGPT (Ouyang et al., 2022 ) proposes PPO-ptx, where the supervised fine-tuning term with the pretraining dataset is included in the loss function.",
"While this aims to align the training model with specific preferences, it tends to anchor the model to the pretraining dataset.",
"Instead, we perform supervised fine-tuning through the Teacher's correction feedback to allow the training model to more freely adapt and meet specific preferences without degeneration.",
""
],
"target_context_ids": [
16,
17
],
"selected_paragraphs": [
"[paragraph id = 16] Table 16 shows more examples of how the correction stage helps the RL stage by maintaining the reasoning context while changing the erroneous parts.",
"[paragraph id = 17] Additionally, RL is primarily fine-tuned through PPO (Schulman et al., 2017 ) to prevent the model from deviating too much from the original model."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T16\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T16.2\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T16.2.3.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_middle ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T16.2.3.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.3.1.1.1\">Comparison of Corrected Rationales After RL</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T16.1.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.1.1.2\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.1.1.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.1.1.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.1.1.2.1.1.1\">Image</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.1.1.1\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.1.1.1.1\"><img alt=\"[Uncaptioned image]\" class=\"ltx_graphics ltx_img_landscape\" height=\"37\" id=\"A4.T16.1.1.1.1.g1\" src=\"x7.png\" width=\"332\"/>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.4.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.4.1.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.4.1.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.4.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.4.1.1.1.1.1\">Question</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.4.1.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.4.1.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.4.1.2.1.1\">Will these magnets attract or repel each other?</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" 
id=\"A4.T16.2.5.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.5.2.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.5.2.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.5.2.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.5.2.1.1.1.1\">Choices</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.5.2.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.5.2.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.5.2.2.1.1\">repel, <span class=\"ltx_text\" id=\"A4.T16.2.5.2.2.1.1.1\" style=\"background-color:#FFFF00;\">attract</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.6.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.6.3.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.6.3.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.6.3.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.6.3.1.1.1.1\">After RL</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.6.3.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.6.3.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.6.3.2.1.1\">Will these magnets attract or repel? To find out, look at which poles are closest to each other. The north pole of one magnet is closest to the south pole of the other magnet. Poles that are the same repel. So, these magnets will repel each other. The picture shows two magnets that are the same shape and the same size. Both magnets are the same size. 
So, these magnets will repel each other.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.7.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.7.4.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.7.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.7.4.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.7.4.1.1.1.1\">Corrected Rationale</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.7.4.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.7.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.7.4.2.1.1\">The magnets will attract each other because the north pole of one magnet is closest to the south pole of the other magnet. Opposite poles of magnets attract each other in this diagram. Therefore, the correct answer is that the magnets will attract.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.8.5\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T16.2.8.5.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.8.5.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.8.5.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.8.5.1.1.1.1\">Question</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T16.2.8.5.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.8.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.8.5.2.1.1\">How long is a pen?</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.9.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.9.6.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.9.6.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.9.6.1.1.1\"><span 
class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.9.6.1.1.1.1\">Choices</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.9.6.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.9.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.9.6.2.1.1\">20 kilometers, <span class=\"ltx_text\" id=\"A4.T16.2.9.6.2.1.1.1\" style=\"background-color:#FFFF00;\">20 centimeters</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.10.7\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.10.7.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.10.7.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.10.7.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.10.7.1.1.1.1\">After RL</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.10.7.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.10.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.10.7.2.1.1\">The better estimate for the length of a pen is 20 centimeters. 20 kilometers is too long. 20 centimeters is too short. 
20 kilometers is too long.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.11.8\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.11.8.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.11.8.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.11.8.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.11.8.1.1.1.1\">Corrected Rationale</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.11.8.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.11.8.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.11.8.2.1.1\">The better estimate for the length of a pen is 20 centimeters. 20 kilometers is an unreasonable estimate, as it is far too long for a typical pen. 20 centimeters is a more accurate and reasonable estimate for the length of a standard pen.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T16.2.2.2\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.2.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.2.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.2.2.1.1.1\">Image</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T16.2.2.1\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.2.1.1\"><img alt=\"[Uncaptioned image]\" class=\"ltx_graphics ltx_img_landscape\" height=\"190\" id=\"A4.T16.2.2.1.1.g1\" src=\"x8.png\" width=\"332\"/>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.12.9\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.12.9.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.12.9.1.1\">\n<span class=\"ltx_p\" 
id=\"A4.T16.2.12.9.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.12.9.1.1.1.1\">Question</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.12.9.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.12.9.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.12.9.2.1.1\">What is the capital of South Carolina?</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.13.10\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.13.10.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.13.10.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.13.10.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.13.10.1.1.1.1\">Choices</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.13.10.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.13.10.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.13.10.2.1.1\"><span class=\"ltx_text\" id=\"A4.T16.2.13.10.2.1.1.1\" style=\"background-color:#FFFF00;\">Columbia</span>, Montgomery, Charleston, Harrisburg</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.14.11\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.14.11.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.14.11.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.14.11.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.14.11.1.1.1.1\">After RL</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T16.2.14.11.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.14.11.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.14.11.2.1.1\">Columbia is the capital of South Carolina.nThe city of Columbia is Charleston, South 
Carolina. Charleston is the smallest city in South Carolina.\nHayeburg is the smallest city in South Carolina. Montgomery is the smallest city in South Carolina. But the city of Columbia is home to many cultural and historical landmarks.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T16.2.15.12\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A4.T16.2.15.12.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.15.12.1.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.15.12.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T16.2.15.12.1.1.1.1\">Corrected Rationale</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A4.T16.2.15.12.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T16.2.15.12.2.1\">\n<span class=\"ltx_p\" id=\"A4.T16.2.15.12.2.1.1\">Columbia is the correct answer as it is the capital city of South Carolina. It is the largest city in the state and serves as the seat of government. The other options, Montgomery (the capital of Alabama), Charleston (a city in South Carolina), and Harrisburg (the capital of Pennsylvania), are not the capital of South Carolina.</span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 16: </span>Examples of how the correction stage corrects mistakes after the RL stage.</figcaption>\n</figure>",
"perturb_sentence_id": [
17,
18
],
"output": {
"perturbed_statement": "[paragraph id = 17] Additionally, RL is primarily fine-tuned through DQN (Schulman et al., 2017) to prevent the model from deviating too much from the original model. The KL divergence penalty further prevents deviation.",
"perturbed_explanation": "1. The original explanation notes that RL is primarily fine-tuned through PPO (Proximal Policy Optimization), which efficiently balances exploration and exploitation. It also emphasizes the role of the KL divergence penalty in preventing the model's significant deviation from the base model. 2. The statement incorrectly claims that RL is primarily fine-tuned through DQN (Deep Q-Network). DQN is not used for fine-tuning in the context described; PPO is the correct method used, according to the cited reference (Schulman et al., 2017). This introduces a factual error in the description of the fine-tuning process."
}
},
{
"path": "table_paper/2407.00087v2.json",
"table_id": "17",
"section": "4.4",
"all_context": [
"We investigate whether the improved rationale also contributes to answer inference accuracy.",
"Table 2 shows the main results of answer inference on the ScienceQA.",
"We evaluate our base model against the MM-CoT baseline.",
"achieves a 2.79% improvement compared to the corresponding baseline ().",
"The large model () shows some minimal improvement compared to the corresponding baseline.",
"However, it's worth noting that despite this seemingly small gain, beats B LLaVA (Liu et al., 2023a).",
"This minimal improvement may be due to the 9.5% of ScienceQA problems needing more rationale reasoning (around 9.5% problems have empty rationale reasoning).",
"The RL stages can only eliminate some empty rationale reasoning, which requires numerous ARES pipeline rounds.",
"Above all, our main goal is to assess how the RL stage works and how the SFT stage aids RL.",
"Table 3 shows the results of answer inference on the A-OKVQA.",
"We retrain and and evaluate these on the validation set as in (Zhang et al., 2023b ) because the test set is hidden.",
"In our experiments, MM-CoT models perform around 10% better than the reported accuracy in (Zhang et al., 2023b ).",
"ARES achieves 4.45% gains against and 2.35% for .",
"In addition, we demonstrate that two stages, RL and SFT, are essential through an ablation study.",
"Figure 3 shows the rationale reasoning for 4 cases.",
"The baseline model (MM-CoT) produces the same rationale reasoning as the dataset.",
"However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).",
"Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed.",
"1st RL indicates a single RL run on MM-CoT, and 1st ARES means one round of the ARES pipeline.",
"1st ARES & 2nd RL represents the second RL on 1st ARES, and finally, 2nd ARES refers to two rounds of ARES.",
"Model Accuracy IPVR (OPT-66B) 48.6 ViLBERT 49.1 60.96 (Ours) 65.41 65.68 (Ours) 68.03",
""
],
"target_context_ids": [
16,
17
],
"selected_paragraphs": [
"[paragraph id = 16] However, the corrected reasoning for MM-CoT without the RL stage has insufficient information compared to the reasoning of ARES that performs RL (refer to Table 17 for more examples).",
"[paragraph id = 17] Table 4 also shows that inference accuracy gradually improves as each part of ARES is executed."
],
"table_html": "<figure class=\"ltx_table\" id=\"A4.T17\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"A4.T17.7\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"A4.T17.7.8.1\">\n<th class=\"ltx_td ltx_align_center ltx_align_middle ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"A4.T17.7.8.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.8.1.1.1\">Comparison of Generated Rationales</span></th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"A4.T17.1.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.1.1.2\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.1.1.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.1.1.2.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.1.1.2.1.1.1\">Image</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.1.1.1\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.1.1.1.1\"><img alt=\"[Uncaptioned image]\" class=\"ltx_graphics ltx_img_landscape\" height=\"155\" id=\"A4.T17.1.1.1.1.g1\" src=\"x9.png\" width=\"415\"/>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.9.1\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.9.1.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.9.1.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.9.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.9.1.1.1.1.1\">Question</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.9.1.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.9.1.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.9.1.2.1.1\">Which property do these three objects have in common?</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" 
id=\"A4.T17.7.10.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.10.2.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.10.2.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.10.2.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.10.2.1.1.1.1\">Choices</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.10.2.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.10.2.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.10.2.2.1.1\">rough, <span class=\"ltx_text\" id=\"A4.T17.7.10.2.2.1.1.1\" style=\"background-color:#FFFF00;\">yellow</span>, sticky</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.11.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.11.3.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.11.3.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.11.3.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.11.3.1.1.1.1\">Original \n<br class=\"ltx_break\"/>Solution</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.11.3.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.11.3.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.11.3.2.1.1\">Look at each object. For each object, decide if it has that property. A rough object feels scratchy when you touch it. The rain boots and the rubber duck are not rough. A sticky object can attach or stick to other things. The rain boots and the rubber duck are not sticky. Yellow is a color. This color is yellow. All three objects are yellow. 
The property that all three objects have in common is yellow.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.2.2\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.2.2.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.2.2.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.2.2.1.1.1\"></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.2.2.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.2.2.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.2.2.2.1.1\">Look at each object. For each object, decide if it has that property. A sticky object can attach or stick to other things. The rubber duck is not sticky. Yellow is a color. This color is yellow. The rubber duck and the rain boots are yellow, but the lemon is not. A rough object feels scratchy when you touch it. All three objects are rough. The property that all three objects have in common is rough.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.3.3\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.3.3.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.3.3.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.3.3.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.3.3.1.1.1.1\">Correction \n<br class=\"ltx_break\"/>feedback on </span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.3.3.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.3.3.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.3.3.2.1.1\">The property that all three objects have in common is yellow. The rubber duck and the rain boots are both yellow, and the lemon is also yellow. 
Although the objects may have other properties like roughness, the common property among them is the color yellow.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.4.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.4.4.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.4.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.4.4.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.4.4.1.1.1.1\"> (Ours)</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.4.4.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.4.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.4.4.2.1.1\">The three objects shown - rubber duck, rain boots, and lemon - all share the common property of being yellow in color. The rubber duck is made of a smooth, non-sticky material, the rain boots are also a yellow-colored object, and the lemon is also a yellow-colored liquid. Therefore, the correct answer is that the common property among these three objects is that they are yellow. 
The other options, rough and sticky, do not apply to all three objects, as the rubber duck is not rough, and the rain boots are not sticky.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.12.4\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T17.7.12.4.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.12.4.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.12.4.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.12.4.1.1.1.1\">Question</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_tt ltx_border_tt\" id=\"A4.T17.7.12.4.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.12.4.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.12.4.2.1.1\">Which word would you find on a dictionary page with the following guide words? lever - loan</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.13.5\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.13.5.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.13.5.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.13.5.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.13.5.1.1.1.1\">Choices</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.13.5.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.13.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.13.5.2.1.1\">lamb, <span class=\"ltx_text\" id=\"A4.T17.7.13.5.2.1.1.1\" style=\"background-color:#FFFF00;\">liquid</span></span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.14.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.14.6.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.14.6.1.1\">\n<span class=\"ltx_p\" 
id=\"A4.T17.7.14.6.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.14.6.1.1.1.1\">Original \n<br class=\"ltx_break\"/>Solution</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.7.14.6.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.14.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.14.6.2.1.1\">Put the words in alphabetical order. Since liquid is between the guide words lever - loan, it would be found on that page.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.5.5\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.5.5.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.5.5.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.5.5.1.1.1\"></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.5.5.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.5.5.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.5.5.2.1.1\">Put the words in alphabetical order. 
Since lamb is between the guide words lever - loan, it would be found on that page.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.6.6\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.6.6.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.6.6.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.6.6.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.6.6.1.1.1.1\">Correction \n<br class=\"ltx_break\"/>feedback on </span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_t\" id=\"A4.T17.6.6.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.6.6.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.6.6.2.1.1\">The word \"liquid\" would be found on the dictionary page with the guide words \"lever - loan\" because it falls alphabetically between those two words.</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"A4.T17.7.7\">\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A4.T17.7.7.1\" style=\"width:56.9pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.7.1.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.7.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"A4.T17.7.7.1.1.1.1\"> (Ours)</span></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_middle ltx_border_bb ltx_border_t\" id=\"A4.T17.7.7.2\" style=\"width:341.4pt;\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"A4.T17.7.7.2.1\">\n<span class=\"ltx_p\" id=\"A4.T17.7.7.2.1.1\">The word \"liquid\" would be found on the dictionary page with the guide words \"lever - loan\" because it falls alphabetically between those two words. 
Therefore, the word \"liquid\" would be the word found on that dictionary page.</span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 17: </span>Comparison of generated rationales by baseline, baseline with correction feedback, and our model (ARES).</figcaption>\n</figure>",
"perturb_sentence_id": [
16,
17
],
"output": {
"perturbed_statement": "[paragraph id = 16] Nonetheless, the corrected reasoning for MM-CoT with the RL stage has insufficient information compared to the reasoning of ARES without RL (refer to Table 17 for more examples). Table 4 also shows that inference accuracy remains the same as each part of ARES is executed.",
"perturbed_explanation": "The original explanation highlights: 1. The statement discusses comparison in reasoning between MM-CoT and ARES, specifically emphasizing that the RL stage in ARES enhances reasoning, unlike MM-CoT without the RL stage, which has insufficient information. 2. In Table 4, it is stated that inference accuracy for ARES gradually improves as each part is executed. The statement is incorrect because: 3. It falsely claims that MM-CoT with the RL stage has insufficient information compared to ARES without RL, which contradicts the context where ARES's reasoning benefits from RL. 4. It incorrectly claims that Table 4 shows inference accuracy remains the same as ARES components execute, while the context indicates improvement with execution."
}
}
] |