[
{
"path": "table_paper/2407.00102v1.json",
"table_id": "1",
"section": "4.2",
"all_context": [
"We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.",
"Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.",
"In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.",
"We keep the rest of the training protocol the same to allow for a fair comparison.",
"Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.",
"We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.",
"And the same method is applied to fine-tune our data.",
"We report our main results in Table 1 .",
"Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.",
"Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.",
"In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].",
"The improvements verify the better training effects of our data since less data amount and same model are used.",
"In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.",
"We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.",
"Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.",
"This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.",
"By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.",
"To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.",
"This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.",
"As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.",
"By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.",
""
],
"target_context_ids": [
7,
9,
10,
11
],
"selected_paragraphs": [
"[paragraph id = 7] We report our main results in Table 1 .",
"[paragraph id = 9] Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.",
"[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].",
"[paragraph id = 11] The improvements verify the better training effects of our data since less data amount and same model are used."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T1\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S4.T1.7\" style=\"width:493.9pt;height:201.1pt;vertical-align:-0.9pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(-27.4pt,11.1pt) scale(0.9,0.9) ;\">\n<table class=\"ltx_tabular ltx_align_middle\" id=\"S4.T1.7.7\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T1.3.3.3\">\n<td class=\"ltx_td ltx_align_left ltx_border_tt\" id=\"S4.T1.3.3.3.4\">Method</td>\n<td class=\"ltx_td ltx_align_left ltx_border_tt\" id=\"S4.T1.3.3.3.5\">LLM</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.3.3.3.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.6.1.1\" style=\"width:14.2pt;\">Res.</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.3.3.3.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.7.1.1\" style=\"width:19.9pt;\">PT</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_tt\" id=\"S4.T1.3.3.3.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.8.1.1\" style=\"width:25.6pt;\">IT</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.1.1.1.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.1.1.1.1.1\">\n<span class=\"ltx_p\" id=\"S4.T1.1.1.1.1.1.1\" style=\"width:22.8pt;\">VQA</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.3.3.3.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.9.1.1\" style=\"width:22.8pt;\">GQA</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" 
id=\"S4.T1.3.3.3.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.10.1.1\" style=\"width:22.8pt;\">VisWiz</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.2.2.2.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.2.2.2.2.1\">\n<span class=\"ltx_p\" id=\"S4.T1.2.2.2.2.1.1\" style=\"width:22.8pt;\">SQA</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_tt\" id=\"S4.T1.3.3.3.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.3.3.3.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.3.3.3.3.1.1\" style=\"width:22.8pt;\">VQA</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.8.1\">\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T1.7.7.8.1.1\">BLIP-2<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib19\" title=\"\">19</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T1.7.7.8.1.2\">Vicuna-13B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.4.1.1\" style=\"width:19.9pt;\">129M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t\" id=\"S4.T1.7.7.8.1.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.5.1.1\" style=\"width:25.6pt;\">-</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" 
id=\"S4.T1.7.7.8.1.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.6.1.1\" style=\"width:22.8pt;\">41.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.7.1.1\" style=\"width:22.8pt;\">41</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.8.1.1\" style=\"width:22.8pt;\">19.6</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.9.1.1\" style=\"width:22.8pt;\">61</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.8.1.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.8.1.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.8.1.10.1.1\" style=\"width:22.8pt;\">42.5</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.9.2\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.9.2.1\">InstructBLIP<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib9\" title=\"\">9</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.9.2.2\">Vicuna-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.4\">\n<span class=\"ltx_inline-block ltx_align_top\" 
id=\"S4.T1.7.7.9.2.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.4.1.1\" style=\"width:19.9pt;\">129M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.9.2.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.5.1.1\" style=\"width:25.6pt;\">1.2M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.6.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.7.1.1\" style=\"width:22.8pt;\">49.2</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.8.1.1\" style=\"width:22.8pt;\">34.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.9.1.1\" style=\"width:22.8pt;\">60.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.9.2.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.9.2.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.9.2.10.1.1\" style=\"width:22.8pt;\">50.1</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.10.3\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.10.3.1\">InstructBLIP<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib9\" title=\"\">9</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" 
id=\"S4.T1.7.7.10.3.2\">Vicuna-13B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.4.1.1\" style=\"width:19.9pt;\">129M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.10.3.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.5.1.1\" style=\"width:25.6pt;\">1.2M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.6.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.7.1.1\" style=\"width:22.8pt;\">49.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.8.1.1\" style=\"width:22.8pt;\">33.4</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.10.3.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.9.1.1\" style=\"width:22.8pt;\">63.1</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.10.3.10\">\n<span class=\"ltx_inline-block ltx_align_top\" 
id=\"S4.T1.7.7.10.3.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.10.3.10.1.1\" style=\"width:22.8pt;\">50.7</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.11.4\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.11.4.1\">Shikra<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib6\" title=\"\">6</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.11.4.2\">Vicuna-13B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.4.1.1\" style=\"width:19.9pt;\">600K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.11.4.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.5.1.1\" style=\"width:25.6pt;\">5.5M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.6.1.1\" style=\"width:22.8pt;\">77.4</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.7.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.8.1.1\" 
style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.9.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.11.4.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.11.4.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.11.4.10.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.12.5\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.12.5.1\">IDEFICS-9B <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib16\" title=\"\">16</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.12.5.2\">LLaMA-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.4.1.1\" style=\"width:19.9pt;\">353M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.12.5.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.5.1.1\" style=\"width:25.6pt;\">1M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.6.1.1\" style=\"width:22.8pt;\">50.9</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify 
ltx_align_top\" id=\"S4.T1.7.7.12.5.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.7.1.1\" style=\"width:22.8pt;\">38.4</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.8.1.1\" style=\"width:22.8pt;\">35.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.9.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.12.5.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.12.5.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.12.5.10.1.1\" style=\"width:22.8pt;\">25.9</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.13.6\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.13.6.1\">IDEFICS-80B<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib16\" title=\"\">16</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.13.6.2\">LLaMA-65B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.3.1.1\" style=\"width:14.2pt;\">224</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.4.1.1\" style=\"width:19.9pt;\">353M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.13.6.5\">\n<span class=\"ltx_inline-block ltx_align_top\" 
id=\"S4.T1.7.7.13.6.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.5.1.1\" style=\"width:25.6pt;\">1M</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.6.1.1\" style=\"width:22.8pt;\">60.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.7.1.1\" style=\"width:22.8pt;\">45.2</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.8.1.1\" style=\"width:22.8pt;\">36.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.9.1.1\" style=\"width:22.8pt;\">–</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.13.6.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.13.6.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.13.6.10.1.1\" style=\"width:22.8pt;\">30.9</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.5.5.5\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.5.5.5.3\">Qwen-VL<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib1\" title=\"\">1</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.5.5.5.4\">Qwen-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.5.1.1\" 
style=\"width:14.2pt;\">448</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.4.4.4.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.4.4.4.1.1\">\n<span class=\"ltx_p\" id=\"S4.T1.4.4.4.1.1.1\" style=\"width:19.9pt;\">1.4B<sup class=\"ltx_sup\" id=\"S4.T1.4.4.4.1.1.1.1\">†</sup></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.5.5.5.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.2.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.2.1.1\" style=\"width:25.6pt;\">50M<sup class=\"ltx_sup\" id=\"S4.T1.5.5.5.2.1.1.1\">†</sup></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.6.1.1\" style=\"width:22.8pt;\">78.8</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.7.1.1\" style=\"width:22.8pt;\">59.3</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.8.1.1\" style=\"width:22.8pt;\">35.2</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.9.1.1\" style=\"width:22.8pt;\">67.1</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.5.5.5.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.5.5.5.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.5.5.5.10.1.1\" style=\"width:22.8pt;\">63.8</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.7\">\n<td class=\"ltx_td 
ltx_align_left\" id=\"S4.T1.7.7.7.3\">Qwen-VL-Chat<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib1\" title=\"\">1</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.7.4\">Qwen-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.5.1.1\" style=\"width:14.2pt;\">448</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.6.6.6.1\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.6.6.6.1.1\">\n<span class=\"ltx_p\" id=\"S4.T1.6.6.6.1.1.1\" style=\"width:19.9pt;\">1.4B<sup class=\"ltx_sup\" id=\"S4.T1.6.6.6.1.1.1.1\">†</sup></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.7.2\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.2.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.2.1.1\" style=\"width:25.6pt;\">50M<sup class=\"ltx_sup\" id=\"S4.T1.7.7.7.2.1.1.1\">†</sup></span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.6.1.1\" style=\"width:22.8pt;\">78.2</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.7.1.1\" style=\"width:22.8pt;\">57.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.8.1.1\" style=\"width:22.8pt;\">38.9</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.9\">\n<span class=\"ltx_inline-block 
ltx_align_top\" id=\"S4.T1.7.7.7.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.9.1.1\" style=\"width:22.8pt;\">68.2</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.7.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.7.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.7.10.1.1\" style=\"width:22.8pt;\">61.5</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.14.7\">\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.14.7.1\">LLAVA-V1.5<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib25\" title=\"\">25</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left\" id=\"S4.T1.7.7.14.7.2\">Vicuna-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.3.1.1\" style=\"width:14.2pt;\">336</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.4.1.1\" style=\"width:19.9pt;\">558K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r\" id=\"S4.T1.7.7.14.7.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.5.1.1\" style=\"width:25.6pt;\">665K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.6.1.1\" style=\"width:22.8pt;\">78.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.7.1.1\" 
style=\"width:22.8pt;\">62.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.8.1.1\" style=\"width:22.8pt;\">50.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.9.1.1\" style=\"width:22.8pt;\">66.8</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top\" id=\"S4.T1.7.7.14.7.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.14.7.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.14.7.10.1.1\" style=\"width:22.8pt;\">58.2</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.15.8\">\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T1.7.7.15.8.1\">+ SVIT-Core-157K<cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib39\" title=\"\">39</a>]</cite>\n</td>\n<td class=\"ltx_td ltx_align_left ltx_border_t\" id=\"S4.T1.7.7.15.8.2\">Vicuna-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.3.1.1\" style=\"width:14.2pt;\">336</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.4.1.1\" style=\"width:19.9pt;\">558K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_r ltx_border_t\" id=\"S4.T1.7.7.15.8.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.5.1.1\" 
style=\"width:25.6pt;\">+157K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.6.1.1\" style=\"width:22.8pt;\">75.9</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.7.1.1\" style=\"width:22.8pt;\">57.1</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.8.1.1\" style=\"width:22.8pt;\">49.1</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.9.1.1\" style=\"width:22.8pt;\">69.0</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_t\" id=\"S4.T1.7.7.15.8.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.15.8.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.15.8.10.1.1\" style=\"width:22.8pt;\">56.3</span>\n</span>\n</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T1.7.7.16.9\">\n<td class=\"ltx_td ltx_align_left ltx_border_bb\" id=\"S4.T1.7.7.16.9.1\">+ Ours</td>\n<td class=\"ltx_td ltx_align_left ltx_border_bb\" id=\"S4.T1.7.7.16.9.2\">Vicuna-7B</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.3\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.3.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.3.1.1\" style=\"width:14.2pt;\">336</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" 
id=\"S4.T1.7.7.16.9.4\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.4.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.4.1.1\" style=\"width:19.9pt;\">558K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb ltx_border_r\" id=\"S4.T1.7.7.16.9.5\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.5.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.5.1.1\" style=\"width:25.6pt;\">+7K</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.6\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.6.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.6.1.1\" style=\"width:22.8pt;\">77.9</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.7\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.7.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.7.1.1\" style=\"width:22.8pt;\">61.8</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.8\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.8.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.8.1.1\" style=\"width:22.8pt;\">51.1</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.9\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.9.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.9.1.1\" style=\"width:22.8pt;\">69.5</span>\n</span>\n</td>\n<td class=\"ltx_td ltx_align_justify ltx_align_top ltx_border_bb\" id=\"S4.T1.7.7.16.9.10\">\n<span class=\"ltx_inline-block ltx_align_top\" id=\"S4.T1.7.7.16.9.10.1\">\n<span class=\"ltx_p\" id=\"S4.T1.7.7.16.9.10.1.1\" style=\"width:22.8pt;\">57.3</span>\n</span>\n</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span><span 
class=\"ltx_text ltx_font_bold\" id=\"S4.T1.9.1\">Comparison with SoTA methods on 5 benchmarks.</span> We achieves better performance on all benchmarks than SVIT-Core-157K. Res, PT, and IT indicate input image resolution, and the number of samples in the pretraining and instruction tuning stage, respectively.\nBenchmark names are abbreviated due to space limits. VQA-v2 <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib12\" title=\"\">12</a>]</cite>, GQA <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib15\" title=\"\">15</a>]</cite>, VisWiz <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib13\" title=\"\">13</a>]</cite>, ScienceQA-IMG <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib27\" title=\"\">27</a>]</cite>, TextVQA <cite class=\"ltx_cite ltx_citemacro_cite\">[<a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00102v1#bib.bib33\" title=\"\">33</a>]</cite>. More details can be found in the Evaluation Metrics section of the Appendix.</figcaption>\n</figure>",
"perturb_sentence_id": [
10,
11
],
"output": {
"perturbed_statement": "[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 3.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].The improvements verify the better training effects of our data since less data amount and same model are used.",
"perturbed_explanation": "The statement should accurately reflect the information presented in the research results. \n\n1. The statement claimed an improvement of 3.0 points in VisWiz [13 ], which is incorrect. The correct improvement noted in the results was 2.0 points. Therefore, stating 3.0 points is a factual inaccuracy.\n\n2. The alteration in point 1 changes the reported improvement incorrectly, making the statement not consistent with the numerical results achieved and described in the previous paragraphs."
}
},
{
"path": "table_paper/2407.00102v1.json",
"table_id": "2",
"section": "4.2",
"all_context": [
"We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.",
"Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.",
"In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.",
"We keep the rest of the training protocol the same to allow for a fair comparison.",
"Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.",
"We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.",
"And the same method is applied to fine-tune our data.",
"We report our main results in Table 1 .",
"Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.",
"Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.",
"In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].",
"The improvements verify the better training effects of our data since less data amount and same model are used.",
"In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.",
"We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.",
"Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.",
"This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.",
"By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.",
"To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.",
"This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.",
"As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.",
"By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.",
""
],
"target_context_ids": [
12,
14,
15,
16
],
"selected_paragraphs": [
"[paragraph id = 12] In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.",
"[paragraph id = 14] Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.",
"[paragraph id = 15] This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.",
"[paragraph id = 16] By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance."
],
"table_html": "<figure class=\"ltx_table ltx_align_floatright\" id=\"S4.T2\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S4.T2.1\" style=\"width:166.8pt;height:126pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(0.0pt,0.0pt) scale(1.0,1.0) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S4.T2.1.1\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_tt\" id=\"S4.T2.1.1.1.1.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.1.1.1.1.1.1\" style=\"font-size:90%;\">Strategy</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"3\" id=\"S4.T2.1.1.1.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.1.1.1.1.2.1\" style=\"font-size:90%;\">Scenario 1</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.2.2\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.1.1.2.2.1\"></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.2.2.2\"><span class=\"ltx_text\" id=\"S4.T2.1.1.2.2.2.1\" style=\"font-size:90%;\">SQA</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.2.2.3\"><span class=\"ltx_text\" id=\"S4.T2.1.1.2.2.3.1\" style=\"font-size:90%;\">TextVQA</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.2.2.4\"><span class=\"ltx_text\" id=\"S4.T2.1.1.2.2.4.1\" style=\"font-size:90%;\">GQA</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.3.3\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t\" id=\"S4.T2.1.1.3.3.1\"><span class=\"ltx_text\" id=\"S4.T2.1.1.3.3.1.1\" style=\"font-size:90%;\">DIS</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.3.3.2\"><span class=\"ltx_text\" id=\"S4.T2.1.1.3.3.2.1\" style=\"font-size:90%;\">57.06</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.3.3.3\"><span 
class=\"ltx_text\" id=\"S4.T2.1.1.3.3.3.1\" style=\"font-size:90%;\">56.13</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.1.1.3.3.4\"><span class=\"ltx_text\" id=\"S4.T2.1.1.3.3.4.1\" style=\"font-size:90%;\">61.06</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.4.4\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.1.1.4.4.1\"><span class=\"ltx_text\" id=\"S4.T2.1.1.4.4.1.1\" style=\"font-size:90%;\">DIL</span></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.4.4.2\"><span class=\"ltx_text\" id=\"S4.T2.1.1.4.4.2.1\" style=\"font-size:90%;\">68.82</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.4.4.3\"><span class=\"ltx_text\" id=\"S4.T2.1.1.4.4.3.1\" style=\"font-size:90%;\">56.30</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.4.4.4\"><span class=\"ltx_text\" id=\"S4.T2.1.1.4.4.4.1\" style=\"font-size:90%;\">60.87</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.5.5\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r\" id=\"S4.T2.1.1.5.5.1\"><span class=\"ltx_text\" id=\"S4.T2.1.1.5.5.1.1\" style=\"font-size:90%;\">DIQ</span></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.5.5.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.1.1.5.5.2.1\" style=\"font-size:90%;\">69.56</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.5.5.3\"><span class=\"ltx_text\" id=\"S4.T2.1.1.5.5.3.1\" style=\"font-size:90%;\">56.84</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.1.1.5.5.4\"><span class=\"ltx_text\" id=\"S4.T2.1.1.5.5.4.1\" style=\"font-size:90%;\">61.16</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1.6.6\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" colspan=\"4\" id=\"S4.T2.1.1.6.6.1\"><span class=\"ltx_text ltx_font_italic\" id=\"S4.T2.1.1.6.6.1.1\" style=\"font-size:90%;\">Result with Data Curriculum</span></th>\n</tr>\n<tr class=\"ltx_tr\" 
id=\"S4.T2.1.1.7.7\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r\" id=\"S4.T2.1.1.7.7.1\"><span class=\"ltx_text\" id=\"S4.T2.1.1.7.7.1.1\" style=\"font-size:90%;\">Ours</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.1.1.7.7.2\"><span class=\"ltx_text\" id=\"S4.T2.1.1.7.7.2.1\" style=\"font-size:90%;\">69.51</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.1.1.7.7.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.1.1.7.7.3.1\" style=\"font-size:90%;\">57.25</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.1.1.7.7.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.1.1.7.7.4.1\" style=\"font-size:90%;\">61.80</span></td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\" style=\"font-size:90%;\"><span class=\"ltx_tag ltx_tag_table\">Table 2: </span>Results across different methods.</figcaption>\n</figure>",
"perturb_sentence_id": [
12,
13
],
"output": {
"perturbed_statement": "[paragraph id = 12] In Table 2, we use the top-right corner in the left panel of Figure 8 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.",
"perturbed_explanation": "1. Original Explanation: The statement describes how the use of the top 5% selected by DIQ led to better performance because DIQ selects data evenly from the entire region, unlike DIS and DIL, which focus on areas with high clip scores or loss. 2. The statement incorrectly cites Figure 8; the original text refers to Figure 7 and never mentions Figure 8, so the figure reference in the statement is factually wrong."
}
}
]