[ { "path": "table_paper/2407.00115v3.json", "table_id": "2", "section": "5.1", "all_context": [ "CIFAR-100: image classification.", "As shown in Table 1 , we conduct image classification on the CIFAR-100 dataset to demonstrate the generalization performance of our RLKD method across 11 teacher-student pairs, including RN-56 & RN-20, etc.", "Among them, 5 pairs of teacher and student models (VGG-13 & MN-V2, etc.)", "are characterized by distinct architectural frameworks.", "These experimental designs we employed provide a diverse and comprehensive assessment environment.", "When the teacher and student networks share the same architecture, the experimental results show that our RLKD method has a strong generalization capacity and also exhibits superior performance compared to CTKD.", "Specifically, in the case of RN-110 & RN-20, our method outperforms Vanilla KD by 0.78% (71.44% vs 70.66%) and CTKD by 0.36% (71.44% vs 71.08%).", "Moreover, in the case where the teacher and student networks have different architectures, the powerful generalization capacity of our RLKD is also validated.", "To validate the generalization of our RLKD method across different KD frameworks, we conduct experiments on 6 currently leading KD frameworks (see Table 3 ), including DKD, PKT, etc.", "When applied to the teacher-student pair RN110 & RN32, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%).", "Experiments conducted on other 5 KD frameworks (e.g.", "PKT, etc.)", "further confirm the strong generalization of our RLKD.", "Both the accuracy and stability of the proposed RLKD are significantly superior to CTKD; this can be attributed to the fact that our RLKD method considers the future rewards of the instance temperature adjustment operations.", "ImageNet: image classification.", "To validate the scalability of our method and its applicability in complex scenarios involving large datasets, we 
further conduct image classification on ImageNet.", "Table 2 details the top-1 and top-5 accuracy.", "Using CTKD and our RLKD as the adaptable plug-in approach, we incorporate them into 5 current leading distillation frameworks (i.e.", "KD, PKT, RKD, SRRL, and DKD).", "The experimental results obtained from these 5 KD frameworks unequivocally demonstrate the excellent scalability of our method.", "Remarkably, our RLKD exhibits robust performance on large datasets like ImageNet.", "For instance, in the Vanilla KD and SRRL frameworks, our method achieves improvements of 0.2% (90.51% vs 90.31%) and 0.11% (90.52% vs 90.41%) respectively.", "In contrast, CTKD obtains much fewer improvements on these KD frameworks, with gains of just 0.02% (90.33% vs 90.31%) and 0.01% (90.42% vs 90.41%) respectively, about 10 times lower.", "We think the superior performance of RLKD can be attributed to its RL-based framework in instance temperature adjustment, which considers the future benefits of these adjustments.", "Additionally, unlike CTKD, our RLKD also takes into account the student model's grasp of individual instances during instance temperature adjustment.", "MS-COCO: object detection.", "To verify whether our RLKD method possesses robustness across other visual tasks, we execute object detection on the MS-COCO dataset.", "As shown in Table 4 , in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD outperforms Vanilla KD by 1.36% (31.49% vs 30.13%) and CTKD by 0.28% (31.49% vs 31.21%), respectively.", "Additionally, for detecting objects with varying sizes – evaluated by the AP metrics for large (APl), medium (APm) and small (APs) objects, our RLKD also shows a significant enhancement, consistently surpasses CTKD across all size categories.", "Results demonstrate the robustness of our approach, where instance temperature adjustment is treated as a sequential decision-making task, enabling consideration of future benefits.", "" ], "target_context_ids": [ 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26 ], "selected_paragraphs": [ "[paragraph id = 16] Table 2 details the top-1 and top-5 accuracy.", "[paragraph id = 17] Using CTKD and our RLKD as the adaptable plug-in approach, we incorporate them into 5 current leading distillation frameworks (i.e.", "[paragraph id = 18] KD, PKT, RKD, SRRL, and DKD).", "[paragraph id = 19] The experimental results obtained from these 5 KD frameworks unequivocally demonstrate the excellent scalability of our method.", "[paragraph id = 20] Remarkably, our RLKD exhibits robust performance on large dataset like ImageNet.", "[paragraph id = 21] For instance, in the Vanilla KD and SRRL frameworks, our method achieves improvement of 0.2% (90.51% vs 90.31%) and 0.11% (90.52% vs 90.41%) respectively.", "[paragraph id = 22] In contrast, CTKD obtains much fewer improvement on these KD frameworks, with gains of just 0.02% (90.33% vs 90.31%) and 0.01% (90.42% vs 90.41%) respectively, about 10 times lower.", "[paragraph id = 23] We think the superior performance of RLKD can be attributed to its RL-based framework in instance temperature adjustment, which considers the future benefits of these adjustments.", "[paragraph id = 24] Additionally, unlike CTKD, our RLKD also takes into account the student model s grasp of individual instances during instance temperature adjustment.", "[paragraph id = 25] MS-COCO: object detection.", "[paragraph id = 26] To verify whether our RLKD method possesses robustness across other visual tasks, we execute object detection on the MS-COCO dataset." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Teacher | Student | Vanilla KD | +CTKD | +Ours | PKT | +CTKD | +Ours | RKD | +CTKD | +Ours | SRRL | +CTKD | +Ours | DKD | +CTKD | +Ours
Top-1 | 73.96 | 70.26 | 70.83 | 71.28 | 71.39 | 70.92 | 71.31 | 71.53 | 70.94 | 71.13 | 71.37 | 71.01 | 71.25 | 71.38 | 71.13 | 71.47 | 71.62
Top-5 | 91.58 | 89.50 | 90.31 | 90.33 | 90.51 | 90.25 | 90.30 | 90.42 | 90.33 | 90.34 | 90.45 | 90.41 | 90.42 | 90.52 | 90.31 | 90.44 | 90.56
\n
\n
Table 2: Top-1 and Top-5 accuracy on ImageNet with ResNet-34 as teacher and ResNet-18 as student.
\n
", "perturb_sentence_id": [ 22, 23 ], "output": { "perturbed_statement": "[paragraph id = 22] In contrast, CTKD obtains much greater improvement on these KD frameworks, with gains of 2% (90.31% vs 88.31%) and 1% (90.41% vs 89.41%) respectively, about 10 times higher.We think the superior performance of RLKD can be attributed to its RL-based framework in instance temperature adjustment, which considers the future benefits of these adjustments.", "perturbed_explanation": "The original explanation states that CTKD obtains much fewer improvements on the KD frameworks mentioned, with specific gains of just 0.02% and 0.01%, which are about 10 times lower than those of RLKD. 1. CTKD obtains a marginal improvement on each KD framework. 2. These improvements are significantly smaller compared to the improvements achieved by RLKD. In the statement, CTKD's gains are misrepresented as much greater than they are, with exaggerated gains of 2% and 1%, respectively, compared to RLKD. This exaggeration contradicts the experimental results discussed, which highlight RLKD's superior performance and scalability." 
} }, { "path": "table_paper/2407.00115v3.json", "table_id": "3", "section": "5.1", "all_context": [ "CIFAR-100: image classification.", "As shown in Table 1 , we conduct image classification on the CIFAR-100 dataset to demonstrate the generalization performance of our RLKD method across 11 teacher-student pairs, including RN-56 & RN-20, etc.", "Among them, 5 pairs of teacher and student models (VGG-13 & MN-V2, etc.)", "are characterized by distinct architectural frameworks.", "These experimental designs we employed provide a diverse and comprehensive assessment environment.", "When the teacher and student networks share the same architecture, the experimental results show that our RLKD method has a strong generalization capacity and also exhibits superior performance compared to CTKD.", "Specifically, in the case of RN-110 & RN-20, our method outperforms Vanilla KD by 0.78% (71.44% vs 70.66%) and CTKD by 0.36% (71.44% vs 71.08%).", "Moreover, in the case where the teacher and student networks have different architectures, the powerful generalization capacity of our RLKD is also validated.", "To validate the generalization of our RLKD method across different KD frameworks, we conduct experiments on 6 currently leading KD frameworks (see Table 3 ), including DKD, PKT, etc.", "When applied to the teacher-student pair RN110 & RN32, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%).", "Experiments conducted on other 5 KD frameworks (e.g.", "PKT, etc.)", "further confirm the strong generalization of our RLKD.", "Both the accuracy and stability of the proposed RLKD are significantly superior to CTKD; this can be attributed to the fact that our RLKD method considers the future rewards of the instance temperature adjustment operations.", "ImageNet: image classification.", "To validate the scalability of our method and its applicability in complex scenarios involving large datasets, we 
further conduct image classification on ImageNet.", "Table 2 details the top-1 and top-5 accuracy.", "Using CTKD and our RLKD as the adaptable plug-in approach, we incorporate them into 5 current leading distillation frameworks (i.e.", "KD, PKT, RKD, SRRL, and DKD).", "The experimental results obtained from these 5 KD frameworks unequivocally demonstrate the excellent scalability of our method.", "Remarkably, our RLKD exhibits robust performance on large datasets like ImageNet.", "For instance, in the Vanilla KD and SRRL frameworks, our method achieves improvements of 0.2% (90.51% vs 90.31%) and 0.11% (90.52% vs 90.41%) respectively.", "In contrast, CTKD obtains much fewer improvements on these KD frameworks, with gains of just 0.02% (90.33% vs 90.31%) and 0.01% (90.42% vs 90.41%) respectively, about 10 times lower.", "We think the superior performance of RLKD can be attributed to its RL-based framework in instance temperature adjustment, which considers the future benefits of these adjustments.", "Additionally, unlike CTKD, our RLKD also takes into account the student model's grasp of individual instances during instance temperature adjustment.", "MS-COCO: object detection.", "To verify whether our RLKD method possesses robustness across other visual tasks, we execute object detection on the MS-COCO dataset.", "As shown in Table 4 , in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD outperforms Vanilla KD by 1.36% (31.49% vs 30.13%) and CTKD by 0.28% (31.49% vs 31.21%), respectively.", "Additionally, for detecting objects with varying sizes – evaluated by the AP metrics for large (APl), medium (APm) and small (APs) objects, our RLKD also shows a significant enhancement, consistently surpasses CTKD across all size categories.", "Results demonstrate the robustness of our approach, where instance temperature adjustment is treated as a sequential decision-making task, enabling consideration of future benefits.", "" ], "target_context_ids": [ 8, 9, 10, 11, 
12 ], "selected_paragraphs": [ "[paragraph id = 8] To validate the generalization of our RLKD method across different KD frameworks, we conduct experiments on 6 currently leading KD frameworks (see Table 3 ), including DKD, PKT, etc.", "[paragraph id = 9] When applied to the teacher-student pair RN110 & RN32, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%).", "[paragraph id = 10] Experiments conducted on other 5 KD frameworks (e.g.", "[paragraph id = 11] PKT, etc.)", "[paragraph id = 12] further confirm the strong generalization of our RLKD." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Teacher | RN-56 | RN-110 | RN-110 | WRN-40-2 | WRN-40-2 | RN-32x4 | RN-32x4
Acc | 72.34 | 74.31 | 74.31 | 75.61 | 75.61 | 79.42 | 79.42
Student | RN-20 | RN-32 | RN-20 | WRN-16-2 | WRN-40-1 | SN-V1 | SN-V2
Acc | 69.06 | 71.14 | 69.06 | 73.26 | 71.98 | 70.70 | 71.82
PKT | 70.85 | 73.36 | 70.88 | 74.82 | 74.01 | 74.39 | 75.10
+CTKD | 71.13 | 73.49 | 71.07 | 75.34 | 74.11 | 74.63 | 75.52
+Ours | 71.41 | 73.68 | 71.34 | 75.62 | 74.23 | 74.89 | 75.78
SP | 70.84 | 73.09 | 70.74 | 74.88 | 73.77 | 74.97 | 75.59
+CTKD | 71.29 | 73.42 | 71.17 | 75.30 | 73.97 | 75.28 | 75.79
+Ours | 71.65 | 73.70 | 71.51 | 75.61 | 74.22 | 75.31 | 76.04
VID | 70.62 | 73.02 | 70.59 | 74.89 | 73.60 | 74.81 | 75.24
+CTKD | 70.81 | 73.38 | 71.11 | 75.20 | 73.75 | 75.23 | 75.48
+Ours | 71.09 | 73.70 | 71.39 | 75.48 | 74.02 | 75.58 | 75.81
CRD | 71.69 | 73.63 | 71.38 | 75.53 | 74.36 | 75.13 | 75.90
+CTKD | 72.13 | 74.08 | 72.02 | 75.71 | 74.72 | 75.41 | 76.20
+Ours | 72.29 | 74.41 | 72.28 | 76.03 | 74.98 | 75.68 | 76.55
SRRL | 71.13 | 73.48 | 71.09 | 75.69 | 74.18 | 75.36 | 75.90
+CTKD | 71.41 | 73.81 | 71.52 | 75.90 | 74.38 | 75.62 | 75.97
+Ours | 71.61 | 74.02 | 71.81 | 76.23 | 74.64 | 75.90 | 76.06
DKD | 71.43 | 73.66 | 71.28 | 75.70 | 74.54 | 75.44 | 76.48
+CTKD | 71.62 | 73.91 | 71.65 | 75.85 | 74.57 | 75.88 | 76.91
+Ours | 71.89 | 74.27 | 71.91 | 76.02 | 74.90 | 76.02 | 77.21
\n
\n
Table 3: Student network Top-1 accuracy on the CIFAR-100 dataset.
\n
", "perturb_sentence_id": [ 9, 10 ], "output": { "perturbed_statement": "[paragraph id = 9] When applied to the teacher-student pair RN110 & RN56, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%).Experiments conducted on other 5 KD frameworks (e.g. ", "perturbed_explanation": "The original explanation is: 1. The statement describes an experiment involving the teacher-student pair RN110 & RN32 within the DKD framework, claiming an improvement and comparison to the CTKD accuracy. 2. The statement's comparison is explicit in mentioning two specific accuracy percentages (74.27% and 73.66%). 3. The improvement over another framework (CTKD) is indicated by percentages (74.27% vs 73.91%). The altered statement is incorrect because: 4. It changes the teacher-student pair from RN110 & RN32 to RN110 & RN56, which is not mentioned in the context provided, thus altering a key detail of the experiment description." 
} }, { "path": "table_paper/2407.00115v3.json", "table_id": "4", "section": "5.1", "all_context": [ "CIFAR-100: image classification.", "As shown in Table 1 , we conduct image classification on the CIFAR-100 dataset to demonstrate the generalization performance of our RLKD method across 11 teacher-student pairs, including RN-56 & RN-20, etc.", "Among them, 5 pairs of teacher and student models (VGG-13 & MN-V2, etc.)", "are characterized by distinct architectural frameworks.", "These experimental designs we employed provide a diverse and comprehensive assessment environment.", "When the teacher and student networks share the same architecture, the experimental results show that our RLKD method has a strong generalization capacity and also exhibits superior performance compared to CTKD.", "Specifically, in the case of RN-110 & RN-20, our method outperforms Vanilla KD by 0.78% (71.44% vs 70.66%) and CTKD by 0.36% (71.44% vs 71.08%).", "Moreover, in the case where the teacher and student networks have different architectures, the powerful generalization capacity of our RLKD is also validated.", "To validate the generalization of our RLKD method across different KD frameworks, we conduct experiments on 6 currently leading KD frameworks (see Table 3 ), including DKD, PKT, etc.", "When applied to the teacher-student pair RN110 & RN32, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%).", "Experiments conducted on other 5 KD frameworks (e.g.", "PKT, etc.)", "further confirm the strong generalization of our RLKD.", "Both the accuracy and stability of the proposed RLKD are significantly superior to CTKD; this can be attributed to the fact that our RLKD method considers the future rewards of the instance temperature adjustment operations.", "ImageNet: image classification.", "To validate the scalability of our method and its applicability in complex scenarios involving large datasets, we 
further conduct image classification on ImageNet.", "Table 2 details the top-1 and top-5 accuracy.", "Using CTKD and our RLKD as the adaptable plug-in approach, we incorporate them into 5 current leading distillation frameworks (i.e.", "KD, PKT, RKD, SRRL, and DKD).", "The experimental results obtained from these 5 KD frameworks unequivocally demonstrate the excellent scalability of our method.", "Remarkably, our RLKD exhibits robust performance on large datasets like ImageNet.", "For instance, in the Vanilla KD and SRRL frameworks, our method achieves improvements of 0.2% (90.51% vs 90.31%) and 0.11% (90.52% vs 90.41%) respectively.", "In contrast, CTKD obtains much fewer improvements on these KD frameworks, with gains of just 0.02% (90.33% vs 90.31%) and 0.01% (90.42% vs 90.41%) respectively, about 10 times lower.", "We think the superior performance of RLKD can be attributed to its RL-based framework in instance temperature adjustment, which considers the future benefits of these adjustments.", "Additionally, unlike CTKD, our RLKD also takes into account the student model's grasp of individual instances during instance temperature adjustment.", "MS-COCO: object detection.", "To verify whether our RLKD method possesses robustness across other visual tasks, we execute object detection on the MS-COCO dataset.", "As shown in Table 4 , in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD outperforms Vanilla KD by 1.36% (31.49% vs 30.13%) and CTKD by 0.28% (31.49% vs 31.21%), respectively.", "Additionally, for detecting objects with varying sizes – evaluated by the AP metrics for large (APl), medium (APm) and small (APs) objects, our RLKD also shows a significant enhancement, consistently surpasses CTKD across all size categories.", "Results demonstrate the robustness of our approach, where instance temperature adjustment is treated as a sequential decision-making task, enabling consideration of future benefits.", "" ], "target_context_ids": [ 26, 27, 28, 29 
], "selected_paragraphs": [ "[paragraph id = 26] To verify whether our RLKD method possesses robustness across other visual tasks, we execute object detection on the MS-COCO dataset.", "[paragraph id = 27] As shown in Table 4 , in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD outperforms Vanilla KD by 1.36% (31.49% vs 30.13%) and CTKD by 0.28% (31.49% vs 31.21%), respectively.", "[paragraph id = 28] Additionally, for detecting objects with varying sizes – evaluated by the AP metrics for large (APl), medium (APm) and small (APs) objects, our RLKD also shows a significant enhancement, consistently surpasses CTKD across all size categories.", "[paragraph id = 29] Results demonstrate the robustness of our approach, where instance temperature adjustment is treated as a sequential decision-making task, enabling consideration of future benefits." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
 | mAP | AP50 | AP75 | APl | APm | APs
T: RN-101 | 42.04 | 62.48 | 45.88 | 54.60 | 45.55 | 25.22
S: RN-18 | 33.26 | 53.61 | 35.26 | 43.16 | 35.68 | 18.96
Vanilla KD | 33.97 | 54.66 | 36.62 | 44.14 | 36.67 | 18.71
+CTKD | 34.51 | 55.32 | 36.95 | 44.76 | 37.17 | 19.01
+Ours | 34.73 | 55.61 | 37.19 | 45.27 | 37.30 | 19.12
T: RN-50 | 40.22 | 61.02 | 43.81 | 51.98 | 43.53 | 24.16
S: MN-V2 | 29.47 | 48.87 | 30.90 | 38.86 | 30.77 | 16.33
Vanilla KD | 30.13 | 50.28 | 31.35 | 39.56 | 31.91 | 16.69
+CTKD | 31.21 | 52.12 | 32.01 | 41.11 | 33.44 | 18.09
+Ours | 31.49 | 52.57 | 33.23 | 41.71 | 33.65 | 18.31
\n
\n
Table 4: Results of our RLKD on the MS-COCO dataset, utilizing Faster-RCNN [27] with FPN [22]. We conduct experiments with the following teacher-student pairings: RN-101 paired with RN-18, and RN-50 paired with MN-V2.
\n
", "perturb_sentence_id": [ 27, 28 ], "output": { "perturbed_statement": "[paragraph id = 27] As shown in Table 4 , in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD underperforms Vanilla KD by 1.36% (29.49% vs 30.13%) and CTKD by 0.28% (31.21% vs 31.49%), respectively.Additionally, for detecting objects with varying sizes – evaluated by the AP metrics for large (APl), medium (APm) and small (APs) objects, our RLKD fails to show significant enhancement and is consistently surpassed by CTKD across all size categories.", "perturbed_explanation": "1. The original statement asserts that RLKD outperforms Vanilla KD by 1.36% in the mAP metric and CTKD by 0.28%. Moreover, for detecting objects of varied sizes, RLKD consistently surpasses CTKD, showing significant enhancement in all size categories. 2. The statement claims that RLKD underperforms compared to both Vanilla KD and CTKD in the mAP metric, implying that the mAP of RLKD is incorrectly lower than both. Additionally, it states RLKD fails to show significant enhancement, incorrectly asserting that CTKD surpasses RLKD in all object size categories." 
} }, { "path": "table_paper/2407.00115v3.json", "table_id": "5", "section": "5.2", "all_context": [ "In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies.", "All experiments are conducted on the CIFAR-100 dataset with respect to the image classification task, and utilize the Vanilla KD framework.", "Uncertainty score.", "We conduct experiments on 4 sets of teacher-student network pairs to test the effectiveness of the uncertainty score in our state representation.", "As shown in Table 5 , when incorporating uncertainty score into state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair.", "This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model's mastery of the training instances.", "Instance reward calibration.", "As shown in Table 6 , when incorporating an instance reward calibration strategy into our RLKD method, a promotive effect across 4 different sets of teacher-student pairs (RN-56 & RN-20, etc.)", "is achieved.", "E.g., our instance temperature calibration strategy boosts the performance of RN-110 & RN-32 pair by 0.55% (73.81% vs 73.26%).", "We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.", "Efficient exploration.", "As shown in Table 7 , we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs.", "The experimental results demonstrate that our effective exploration strategy 
facilitates the performance of the student model across 4 teacher-student pairs.", "In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%).", "We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.", "Selection of high-quality training examples.", "As shown in Table 8 , we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples.", "Interestingly, we observe that when using the top 10% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the performance 71.21% of the student model when using the training data ranked from 10% to 20%.", "This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "We think this may be due to utilizing the top 10% samples causing overfitting in the agent.", "Furthermore, in the teacher-student pair RN-56 & RN-20, when conducting the mix-up method on the training data ranked from 10% to 20% using the training data ranked 40% to 50%, there is a performance increase of 0.19% (71.40% vs 71.21%).", "The experimental results verify the validity of our mix-up method: combining instances of varying knowledge values can produce high-quality training data.", "" ], "target_context_ids": [ 0, 3, 4, 5 ], "selected_paragraphs": [ "[paragraph id = 0] In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies.", "[paragraph id = 3] We conduct experiments on 4 sets of teacher-student network pairs to test the 
effectiveness of the uncertainty score in our state representation.", "[paragraph id = 4] As shown in Table 5 , when incorporating uncertainty score into state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair.", "[paragraph id = 5] This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model s mastery of the training instances." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Teacher | RN-56 | RN-110 | WRN-40-2 | VGG-13
Student | RN-20 | RN-32 | WRN-16-2 | VGG-8
Ours w/o US | 71.16 | 73.68 | 75.61 | 73.57
Ours w US | 71.40 | 73.81 | 75.79 | 73.75
\n
\n
Table 5: Ablation study of the uncertainty score (US) feature.
\n
", "perturb_sentence_id": [ 4, 5 ], "output": { "perturbed_statement": "[paragraph id = 4] As shown in Table 5, when incorporating uncertainty score into state representation, our method shows an improvement of 1.50% (72.66% vs 71.16%) in the RN-56 & RN-20 teacher-student pair. This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model's mastery of the training instances.", "perturbed_explanation": "1. The statement originally reports an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair after incorporating the uncertainty score into the state representation. 2. The statement now claims a much larger improvement of 1.50% (72.66% vs 71.16%), which is incorrect. The original improvement figure of 0.24% is consistent with careful experimentation and analysis. The altered improvement figure introduces an error, suggesting a much larger enhancement than actually observed, thus misrepresenting the effectiveness of the uncertainty score in the experiments conducted." 
} }, { "path": "table_paper/2407.00115v3.json", "table_id": "6", "section": "5.2", "all_context": [ "In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies.", "All experiments are conducted on the CIFAR-100 dataset with respect to the image classification task, and utilize the Vanilla KD framework.", "Uncertainty score.", "We conduct experiments on 4 sets of teacher-student network pairs to test the effectiveness of the uncertainty score in our state representation.", "As shown in Table 5 , when incorporating uncertainty score into state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair.", "This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model's mastery of the training instances.", "Instance reward calibration.", "As shown in Table 6 , when incorporating an instance reward calibration strategy into our RLKD method, a promotive effect across 4 different sets of teacher-student pairs (RN-56 & RN-20, etc.)", "is achieved.", "E.g., our instance temperature calibration strategy boosts the performance of RN-110 & RN-32 pair by 0.55% (73.81% vs 73.26%).", "We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.", "Efficient exploration.", "As shown in Table 7 , we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs.", "The experimental results demonstrate that our effective exploration strategy 
facilitates the performance of the student model across 4 teacher-student pairs.", "In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%).", "We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.", "Selection of high-quality training examples.", "As shown in Table 8 , we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples.", "Interestingly, we observe that when using the top 10% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the performance 71.21% of the student model when using the training data ranked from 10% to 20%.", "This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "We think this may be due to utilizing the top 10% samples causing overfitting in the agent.", "Furthermore, in the teacher-student pair RN-56 & RN-20, when conducting the mix-up method on the training data ranked from 10% to 20% using the training data ranked 40% to 50%, there is a performance increase of 0.19% (71.40% vs 71.21%).", "The experimental results verify the validity of our mix-up method: combining instances of varying knowledge values can produce high-quality training data.", "" ], "target_context_ids": [ 7, 8, 9, 10 ], "selected_paragraphs": [ "[paragraph id = 7] As shown in Table 6 , when incorporating an instance reward calibration strategy into our RLKD method, a promotive effect across 4 different sets of teacher-student pairs (RN-56 & RN-20, etc.)", "[paragraph id = 8] is achieved.", "[paragraph id = 9] E.g., our instance temperature calibration strategy boosts the performance of RN-110 & RN-32 pair by 0.55% (73.81% 
vs 73.26%).", "[paragraph id = 10] We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action." ], "table_html": "
\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Teacher | RN-56 | RN-110 | WRN-40-2 | VGG-13
Student | RN-20 | RN-32 | WRN-16-2 | VGG-8
Ours w/o IRA | 70.91 | 73.26 | 75.39 | 73.32
Ours w IRA | 71.40 | 73.81 | 75.79 | 73.75
\n
\n
Table 6: Ablation on instance reward calibration (IRA) strategy.
\n
", "perturb_sentence_id": [ 9, 10 ], "output": { "perturbed_statement": "[paragraph id = 9] E.g., our instance reward calibration strategy boosts the performance of RN-110 & RN-32 pair by 10.5% (83.81% vs 73.26%). We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.", "perturbed_explanation": "The original explanation posits that the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action. 1. The statement claims that the instance reward calibration strategy boosts performance by 10.5%, which is factually incorrect. The original text does not provide any mention of RN-110 & RN-32; instead, the context discusses the promotive effect across 4 different sets of teacher-student pairs, but it does not specify the exact percentage improvement or identify such pairs like RN-110 & RN-32. 2. It inaccurately reports a performance percentage of 83.81%, which is not supported or indicated in the context." 
} }, { "path": "table_paper/2407.00115v3.json", "table_id": "7", "section": "5.2", "all_context": [ "In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies.", "All experiments are conducted on the CIFAR-100 dataset with respect to the image classification task, and utilize the Vanilla KD framework.", "Uncertainty score.", "We conduct experiments on 4 sets of teacher-student network pairs to test the effectiveness of the uncertainty score in our state representation.", "As shown in Table 5 , when incorporating uncertainty score into state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair.", "This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model s mastery of the training instances.", "Instance reward calibration.", "As shown in Table 6 , when incorporating an instance reward calibration strategy into our RLKD method, a promotive effect across 4 different sets of the teacher-student pairs (RN-56 & RN-20, etc.)", "is achieved.", "E.g., our instance temperature calibration strategy boosts the performance of RN-110 & RN-32 pair by 0.55% (73.81% vs 73.26%).", "We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.", "Efficient exploration.", "As shown in Table 7 , we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs.", "The experimental results demonstrate that our effective exploration strategy 
improves the performance of the student model across 4 teacher-student pairs.", "In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%).", "We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.", "Selection of high-quality training examples.", "As shown in Table 8 , we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples.", "Interestingly, we observe that when using the top 10% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the 71.21% performance of the student model when using the training data ranked from 10% to 20%.", "This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "We think this may be due to utilizing the top 10% samples causing overfitting in the agent.", "Furthermore, in the teacher-student pair RN-56 & RN-20, when conducting the mix-up method on the training data ranked from 10% to 20% using the training data ranked 40% to 50%, there is a performance increase of 0.19% (71.40% vs 71.21%).", "The experimental results verify the validity of our mix-up method: combining instances of varying knowledge values can produce high-quality training data.", "" ], "target_context_ids": [ 12, 13, 14, 15 ], "selected_paragraphs": [ "[paragraph id = 12] As shown in Table 7 , we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs.", "[paragraph id = 13] The experimental results demonstrate that our effective exploration strategy improves the performance of the student model across 4 teacher-student pairs.", "[paragraph id = 14] In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%).", "[paragraph id = 15] We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD." ], "table_html": "
\n
<table>
<tr><td>Teacher</td><td>RN-56</td><td>RN-110</td><td>WRN-40-2</td><td>VGG-13</td></tr>
<tr><td>Student</td><td>RN-20</td><td>RN-32</td><td>WRN-16-2</td><td>VGG-8</td></tr>
<tr><td>Ours w/o EE</td><td>71.03</td><td>73.52</td><td>75.50</td><td>73.45</td></tr>
<tr><td>Ours w EE</td><td>71.40</td><td>73.81</td><td>75.79</td><td>73.75</td></tr>
</table>
\n
\n
Table 7: Ablation study of the efficient exploration (EE) strategy.
\n
", "perturb_sentence_id": [ 14, 15 ], "output": { "perturbed_statement": "[paragraph id = 14] In the experiments involving the MN-60 & MN-30 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%). We attribute this success to the strategy enables the agent to learn valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.", "perturbed_explanation": "1. The original explanation describes that the efficient exploration strategy results in a performance improvement of 0.37% for the RN-56 & RN-20 teacher-student pair. 2. The statement incorrectly refers to the MN-60 & MN-30 teacher-student pair instead of the RN-56 & RN-20 pair, which is contradicting the context. The MN-60 & MN-30 teacher-student pair is not mentioned, making the performance improvement reference factually incorrect based on the available information." } }, { "path": "table_paper/2407.00115v3.json", "table_id": "8", "section": "5.2", "all_context": [ "In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies.", "All experiments are conducted on the CIFAR-100 dataset with respect to the image classification task, and utilize the Vanilla KD framework.", "Uncertainty score.", "We conduct experiments on 4 sets of teacher-student network pairs to test the effectiveness of the uncertainty score in our state representation.", "As shown in Table 5 , when incorporating uncertainty score into state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) in the RN-56 & RN-20 teacher-student pair.", "This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking 
into account the student model's mastery of the training instances.", "Instance reward calibration.", "As shown in Table 6 , when incorporating an instance reward calibration strategy into our RLKD method, a promotive effect across 4 different sets of the teacher-student pairs (RN-56 & RN-20, etc.)", "is achieved.", "E.g., our instance temperature calibration strategy boosts the performance of RN-110 & RN-32 pair by 0.55% (73.81% vs 73.26%).", "We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.", "Efficient exploration.", "As shown in Table 7 , we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs.", "The experimental results demonstrate that our effective exploration strategy improves the performance of the student model across 4 teacher-student pairs.", "In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%).", "We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.", "Selection of high-quality training examples.", "As shown in Table 8 , we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples.", "Interestingly, we observe that when using the top 10% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the 71.21% performance of the student model when using the training data ranked from 10% to 20%.", "This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "We think this may be due to utilizing the top 10% samples causing overfitting in the agent.", "Furthermore, in the teacher-student pair RN-56 & RN-20, when conducting the mix-up method on the training data ranked from 10% to 20% using the training data ranked 40% to 50%, there is a performance increase of 0.19% (71.40% vs 71.21%).", "The experimental results verify the validity of our mix-up method: combining instances of varying knowledge values can produce high-quality training data.", "" ], "target_context_ids": [ 16, 17, 18, 19, 20 ], "selected_paragraphs": [ "[paragraph id = 16] Selection of high-quality training examples.", "[paragraph id = 17] As shown in Table 8 , we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples.", "[paragraph id = 18] Interestingly, we observe that when using the top 10% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the 71.21% performance of the student model when using the training data ranked from 10% to 20%.", "[paragraph id = 19] This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "[paragraph id = 20] We think this may be due to utilizing the top 10% samples causing overfitting in the agent." ], "table_html": "
\n
<table>
<tr><td>Teacher</td><td>Student</td><td></td><td></td><td></td><td></td></tr>
<tr><td>72.34</td><td>69.06</td><td>70.92</td><td>71.21</td><td>71.27</td><td>71.40</td></tr>
<tr><td>75.61</td><td>73.26</td><td>75.33</td><td>75.57</td><td>75.61</td><td>75.79</td></tr>
</table>
\n
\n
Table 8: Comparison of different high-quality training sample selection strategies. The teacher-student pairs corresponding to the second and third rows are respectively RN-56 & RN-20 and WRN-40-2 & WRN-16-2. “” denotes the mix-up operation.
\n
", "perturb_sentence_id": [ 18, 19 ], "output": { "perturbed_statement": "[paragraph id = 18] Interestingly, we observe that when using the top 5% of high-quality training data, the performance of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is not as good as the performance 71.21% of the student model when using the training data ranked from 5% to 15%. This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2.", "perturbed_explanation": "The original explanation highlights the performance of the student model under different selections of high-quality training data. Specifically, it focuses on the observation that using the top 10% of data seems to cause a reduction in performance compared to using data ranked from 10% to 20%, implying issues such as overfitting when only top 10% samples are used. The statement is incorrect for the following reasons: 1. The statement incorrectly mentions using the top 5% of high-quality training data, whereas the correct top percentage mentioned is the top 10%. 2. It incorrectly suggests using a range from 5% to 15% for comparison, whereas the correct range is from 10% to 20%." } } ]