[ { "path": "table_paper/2407.00102v1.json", "table_id": "1", "section": "4.2", "all_context": [ "We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.", "Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.", "In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.", "We keep the rest of the training protocol the same to allow for a fair comparison.", "Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.", "We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.", "And the same method is applied to fine-tune our data.", "We report our main results in Table 1 .", "Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.", "Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].", "The improvements verify the better training effects of our data since less data amount and same model are used.", "In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.", "To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.", "This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.", "As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.", "By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.", "" ], "target_context_ids": [ 7, 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 7] We report our main results in Table 1 .", "[paragraph id = 9] Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in 
SQA [27 ].", "[paragraph id = 11] The improvements verify the better training effects of our data since less data amount and same model are used." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
MethodLLM\n\nRes.\n\n\n\nPT\n\n\n\nIT\n\n\n\nVQA\n\n\n\nGQA\n\n\n\nVisWiz\n\n\n\nSQA\n\n\n\nVQA\n\n
BLIP-2[19]\nVicuna-13B\n\n224\n\n\n\n129M\n\n\n\n-\n\n\n\n41.0\n\n\n\n41\n\n\n\n19.6\n\n\n\n61\n\n\n\n42.5\n\n
InstructBLIP[9]\nVicuna-7B\n\n224\n\n\n\n129M\n\n\n\n1.2M\n\n\n\n\n\n\n\n49.2\n\n\n\n34.5\n\n\n\n60.5\n\n\n\n50.1\n\n
InstructBLIP[9]\nVicuna-13B\n\n224\n\n\n\n129M\n\n\n\n1.2M\n\n\n\n\n\n\n\n49.5\n\n\n\n33.4\n\n\n\n63.1\n\n\n\n50.7\n\n
Shikra[6]\nVicuna-13B\n\n224\n\n\n\n600K\n\n\n\n5.5M\n\n\n\n77.4\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
IDEFICS-9B [16]\nLLaMA-7B\n\n224\n\n\n\n353M\n\n\n\n1M\n\n\n\n50.9\n\n\n\n38.4\n\n\n\n35.5\n\n\n\n\n\n\n\n25.9\n\n
IDEFICS-80B[16]\nLLaMA-65B\n\n224\n\n\n\n353M\n\n\n\n1M\n\n\n\n60.0\n\n\n\n45.2\n\n\n\n36.0\n\n\n\n\n\n\n\n30.9\n\n
Qwen-VL[1]\nQwen-7B\n\n448\n\n\n\n1.4B\n\n\n\n50M\n\n\n\n78.8\n\n\n\n59.3\n\n\n\n35.2\n\n\n\n67.1\n\n\n\n63.8\n\n
Qwen-VL-Chat[1]\nQwen-7B\n\n448\n\n\n\n1.4B\n\n\n\n50M\n\n\n\n78.2\n\n\n\n57.5\n\n\n\n38.9\n\n\n\n68.2\n\n\n\n61.5\n\n
LLAVA-V1.5[25]\nVicuna-7B\n\n336\n\n\n\n558K\n\n\n\n665K\n\n\n\n78.5\n\n\n\n62.0\n\n\n\n50.0\n\n\n\n66.8\n\n\n\n58.2\n\n
+ SVIT-Core-157K[39]\nVicuna-7B\n\n336\n\n\n\n558K\n\n\n\n+157K\n\n\n\n75.9\n\n\n\n57.1\n\n\n\n49.1\n\n\n\n69.0\n\n\n\n56.3\n\n
+ OursVicuna-7B\n\n336\n\n\n\n558K\n\n\n\n+7K\n\n\n\n77.9\n\n\n\n61.8\n\n\n\n51.1\n\n\n\n69.5\n\n\n\n57.3\n\n
\n
\n
Table 1: Comparison with SoTA methods on 5 benchmarks. We achieves better performance on all benchmarks than SVIT-Core-157K. Res, PT, and IT indicate input image resolution, and the number of samples in the pretraining and instruction tuning stage, respectively.\nBenchmark names are abbreviated due to space limits. VQA-v2 [12], GQA [15], VisWiz [13], ScienceQA-IMG [27], TextVQA [33]. More details can be found in the Evaluation Metrics section of the Appendix.
\n
", "perturb_sentence_id": [ 10, 11 ], "output": { "perturbed_statement": "[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 3.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].The improvements verify the better training effects of our data since less data amount and same model are used.", "perturbed_explanation": "The statement should accurately reflect the information presented in the research results. \n\n1. The statement claimed an improvement of 3.0 points in VisWiz [13 ], which is incorrect. The correct improvement noted in the results was 2.0 points. Therefore, stating 3.0 points is a factual inaccuracy.\n\n2. The alteration in point 1 changes the reported improvement incorrectly, making the statement not consistent with the numerical results achieved and described in the previous paragraphs." } }, { "path": "table_paper/2407.00102v1.json", "table_id": "2", "section": "4.2", "all_context": [ "We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.", "Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.", "In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.", "We keep the rest of the training protocol the same to allow for a fair comparison.", "Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.", "We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.", "And the same method is applied to fine-tune our data.", "We report our main results in Table 1 .", "Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.", "Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].", "The improvements verify the better training effects of our data since less data amount and same model are used.", "In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.", "To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.", "This provides a 
smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.", "As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.", "By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.", "" ], "target_context_ids": [ 12, 14, 15, 16 ], "selected_paragraphs": [ "[paragraph id = 12] In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "[paragraph id = 14] Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "[paragraph id = 15] This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "[paragraph id = 16] By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
StrategyScenario 1
SQATextVQAGQA
DIS57.0656.1361.06
DIL68.8256.3060.87
DIQ69.5656.8461.16
Result with Data Curriculum
Ours69.5157.2561.80
\n
\n
Table 2: Results across different methods.
\n
", "perturb_sentence_id": [ 12, 13 ], "output": { "perturbed_statement": "[paragraph id = 12] In Table 2, we use the top-right corner in the left panel of Figure 8 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "perturbed_explanation": "1. Original Explanation: The statement describes how the use of the top 5% selected by DIQ led to better performance because DIQ selects data evenly from the entire region, unlike DIS and DIL, which focus on areas with high clip scores or loss. 2. The statement incorrectly mentions Figure 8, whereas there is no mention of Figure 8; the correct reference should be Figure 7. This alters the factual accuracy of the reference provided within the context." } } ]