[ { "path": "table_paper/2407.00102v1.json", "table_id": "1", "section": "4.2", "all_context": [ "We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.", "Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.", "In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.", "We keep the rest of the training protocol the same to allow for a fair comparison.", "Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.", "We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.", "And the same method is applied to fine-tune our data.", "We report our main results in Table 1 .", "Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.", "Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].", "The improvements verify the better training effects of our data since less data amount and same model are used.", "In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "Based on these 
insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.", "To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.", "This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.", "As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.", "By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.", "" ], "target_context_ids": [ 7, 9, 10, 11 ], "selected_paragraphs": [ "[paragraph id = 7] We report our main results in Table 1 .", "[paragraph id = 9] Furthermore, it surpassed the base model on SQA [27 ] and VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].", "[paragraph id = 11] The improvements verify the better training effects of our data since less data amount and same model are used." ], "table_html": "
<table>
<tr><th>Method</th><th>LLM</th><th>Res.</th><th>PT</th><th>IT</th><th>VQA<sup>v2</sup></th><th>GQA</th><th>VisWiz</th><th>SQA</th><th>VQA<sup>T</sup></th></tr>
<tr><td>BLIP-2 [19]</td><td>Vicuna-13B</td><td>224</td><td>129M</td><td>-</td><td>41.0</td><td>41</td><td>19.6</td><td>61</td><td>42.5</td></tr>
<tr><td>InstructBLIP [9]</td><td>Vicuna-7B</td><td>224</td><td>129M</td><td>1.2M</td><td></td><td>49.2</td><td>34.5</td><td>60.5</td><td>50.1</td></tr>
<tr><td>InstructBLIP [9]</td><td>Vicuna-13B</td><td>224</td><td>129M</td><td>1.2M</td><td></td><td>49.5</td><td>33.4</td><td>63.1</td><td>50.7</td></tr>
<tr><td>Shikra [6]</td><td>Vicuna-13B</td><td>224</td><td>600K</td><td>5.5M</td><td>77.4</td><td></td><td></td><td></td><td></td></tr>
<tr><td>IDEFICS-9B [16]</td><td>LLaMA-7B</td><td>224</td><td>353M</td><td>1M</td><td>50.9</td><td>38.4</td><td>35.5</td><td></td><td>25.9</td></tr>
<tr><td>IDEFICS-80B [16]</td><td>LLaMA-65B</td><td>224</td><td>353M</td><td>1M</td><td>60.0</td><td>45.2</td><td>36.0</td><td></td><td>30.9</td></tr>
<tr><td>Qwen-VL [1]</td><td>Qwen-7B</td><td>448</td><td>1.4B</td><td>50M</td><td>78.8</td><td>59.3</td><td>35.2</td><td>67.1</td><td>63.8</td></tr>
<tr><td>Qwen-VL-Chat [1]</td><td>Qwen-7B</td><td>448</td><td>1.4B</td><td>50M</td><td>78.2</td><td>57.5</td><td>38.9</td><td>68.2</td><td>61.5</td></tr>
<tr><td>LLAVA-V1.5 [25]</td><td>Vicuna-7B</td><td>336</td><td>558K</td><td>665K</td><td>78.5</td><td>62.0</td><td>50.0</td><td>66.8</td><td>58.2</td></tr>
<tr><td>+ SVIT-Core-157K [39]</td><td>Vicuna-7B</td><td>336</td><td>558K</td><td>+157K</td><td>75.9</td><td>57.1</td><td>49.1</td><td>69.0</td><td>56.3</td></tr>
<tr><td>+ Ours</td><td>Vicuna-7B</td><td>336</td><td>558K</td><td>+7K</td><td>77.9</td><td>61.8</td><td>51.1</td><td>69.5</td><td>57.3</td></tr>
</table>
Table 1: Comparison with SoTA methods on 5 benchmarks. We achieve better performance on all benchmarks than SVIT-Core-157K. Res., PT, and IT indicate the input image resolution and the number of samples in the pretraining and instruction-tuning stages, respectively.\nBenchmark names are abbreviated due to space limits: VQA-v2 [12], GQA [15], VisWiz [13], ScienceQA-IMG [27], TextVQA [33]. More details can be found in the Evaluation Metrics section of the Appendix.
\n
", "perturb_sentence_id": 10, "output": { "perturbed_statement": "[paragraph id = 10] In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 3.0 points in GQA [15 ], 1.5 points in VQAV2 [12 ], 1.2 points in TextVQA [33 ], 1.8 points in VisWiz [13 ], and 0.3 points in SQA [27 ]. The improvements verify the better training effects of our data since less data amount and same model are used.", "perturbed_explanation": "1. The original explanation states that the data achieved performance increases by specific margins across various datasets, as given in the results above. 2. The statement now includes invalid performance margins for some datasets, such as '3.0 points in GQA' instead of '4.7 points' and '1.5 points in VQAV2' instead of '2.0 points,' which do not match the results provided. This error alters the reported evidence of the model's effectiveness." } }, { "path": "table_paper/2407.00102v1.json", "table_id": "2", "section": "4.2", "all_context": [ "We use the LLaVA-v1.5-7B [25 ] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data.", "Subsequently, we fine-tune this model with LoRA [14 ] during the follow-up experiments.", "In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only.", "We keep the rest of the training protocol the same to allow for a fair comparison.", "Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3.", "We use the SVIT-core-157K [39 ] dataset for continuous fine-tuning to establish a baseline.", "And the same method is applied to fine-tune our data.", "We report our main results in Table 1 .", "Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup.", "Furthermore, it surpassed the base model on SQA [27 ] and 
VisWiz [13 ], reaching state-of-the-art (SOTA) performance.", "In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39 ] by 4.7 points in GQA [15 ], 2.0 points in VQAV2 [12 ], 1.0 point in TextVQA [33 ], 2.0 points in VisWiz [13 ], and 0.5 points in SQA [27 ].", "The improvements verify the better training effects of our data since less data amount and same model are used.", "In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.", "To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss.", "This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process.", "As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization.", "By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.", "" ], "target_context_ids": [ 12, 14, 15, 16 
], "selected_paragraphs": [ "[paragraph id = 12] In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.", "[paragraph id = 14] Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data.", "[paragraph id = 15] This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance.", "[paragraph id = 16] By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance." ], "table_html": "
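<table>
<tr><th rowspan=2>Strategy</th><th colspan=3>Scenario 1</th></tr>
<tr><th>SQA</th><th>TextVQA</th><th>GQA</th></tr>
<tr><td>DIS</td><td>57.06</td><td>56.13</td><td>61.06</td></tr>
<tr><td>DIL</td><td>68.82</td><td>56.30</td><td>60.87</td></tr>
<tr><td>DIQ</td><td>69.56</td><td>56.84</td><td>61.16</td></tr>
<tr><td colspan=4>Result with Data Curriculum</td></tr>
<tr><td>Ours</td><td>69.51</td><td>57.25</td><td>61.80</td></tr>
</table>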
Table 2: Results across different methods.
\n
", "perturb_sentence_id": 12, "output": { "perturbed_statement": "[paragraph id = 12] In Table 2, we use the top-left corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment. We found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately. We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss.", "perturbed_explanation": "1. The original explanation clarified that the selection criterion was based on the top-right corner.\n2. The statement specifies the top-left corner, which inconsistency diminishes the accuracy of the description regarding the location derived from Figure 7." } } ]