Add model-based results for MedNLI, RadNLI for GPT-3.5 and GPT-4
This PR adds (negative) data contamination results for MedNLI and RadNLI.
Similar to earlier PRs (e.g. PR 3), this follows the method outlined in [Golchin and Surdeanu 2024](https://arxiv.org/pdf/2308.08493.pdf) to evaluate GPT-3.5 and GPT-4. The only differences in the implementation are that (1) multiple runs were performed on each split, each on a different data partition, and (2) the models were accessed through Azure OpenAI (opted out of human review and HIPAA-compliant), following MIMIC's DUA.
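As a rough illustration of difference (1), drawing a separate data partition for each run could look like the sketch below. The function name, partition size, number of runs, and seed are all hypothetical; the PR does not state the values actually used.

```python
import random


def disjoint_partitions(n_items, n_runs, partition_size, seed=0):
    """Return n_runs non-overlapping lists of example indices, one per run.

    Hypothetical helper: the sizes and the seed are illustrative assumptions.
    """
    if n_runs * partition_size > n_items:
        raise ValueError("not enough items for disjoint partitions")
    indices = list(range(n_items))
    # A fixed seed keeps the partitions reproducible across reruns.
    random.Random(seed).shuffle(indices)
    return [
        sorted(indices[r * partition_size:(r + 1) * partition_size])
        for r in range(n_runs)
    ]


# Example: 3 runs of 10 examples each, drawn from a split of 100 examples.
partitions = disjoint_partitions(n_items=100, n_runs=3, partition_size=10)
```

Keeping the partitions disjoint means no example is evaluated twice, so each run gives independent evidence about the same split.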
A sanitized version of the results that keeps the data index, label, outputs, and contamination evaluation results __without original input sentences__ can be found [here](https://github.com/j-chim/time-travel-in-llms/tree/main/results).
While the ROUGE-based contamination detection method identified some potential positives, the best-performing detector (GPT-4 ICL) did not consider these instances to be true contaminations. As such, this PR reports negative results (0% contamination for all splits on both datasets under the examined method).
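The reasoning above can be sketched as follows: overlap-based flags are counted, but only the ICL detector's verdict feeds the reported percentage. This is a simplified illustration, not the paper's procedure. The fixed ROUGE-L threshold, the field names, and the toy strings are assumptions made up for this example, and `icl_says_contaminated` stands in for a verdict that would come from a GPT-4 few-shot prompt.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased, whitespace-split tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def summarize(instances, rouge_threshold=0.8):
    """Count ROUGE-L flags, but base the reported rate on the ICL verdict only.

    rouge_threshold is an illustrative assumption, not a value from the paper.
    """
    rouge_flagged = sum(
        rouge_l_f1(i["reference"], i["completion"]) >= rouge_threshold
        for i in instances
    )
    icl_confirmed = sum(bool(i["icl_says_contaminated"]) for i in instances)
    pct = 100.0 * icl_confirmed / len(instances) if instances else 0.0
    return {
        "rouge_flagged": rouge_flagged,
        "icl_confirmed": icl_confirmed,
        "contamination_pct": pct,
    }


# Toy, made-up strings (not taken from MedNLI or RadNLI).
instances = [
    {"reference": "the patient has no fever",
     "completion": "the patient has no fever today",
     "icl_says_contaminated": False},
    {"reference": "lungs are clear",
     "completion": "heart size is normal",
     "icl_says_contaminated": False},
]
summary = summarize(instances)
```

In this toy case one completion is flagged by the overlap score, yet the reported rate stays at 0% because the ICL verdict is negative for every instance.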
- contamination_report.csv +8 -2
@@ -446,6 +446,12 @@ nyu-mll/glue;wnli;GPT-3.5;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.
 samsum;;GPT-4;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.08493;3
 samsum;;GPT-3.5;model;0.0;;0.0;model-based;https://arxiv.org/pdf/2308.08493;3
 
-EdinburghNLP/xsum;;GPT-4;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
-EdinburghNLP/xsum;;GPT-3.5;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
+EdinburghNLP/xsum;;GPT-4;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
+EdinburghNLP/xsum;;GPT-3.5;model;0.0;;100.0;model-based;https://arxiv.org/pdf/2308.08493;
+
+MedNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+MedNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+
+RadNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
+RadNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
 
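For readers who want to check the added rows programmatically, here is a minimal parsing sketch. The column names in `ASSUMED_COLUMNS` are assumptions: the diff shows only positional, semicolon-separated values, so the reading of the three numeric fields as train, dev, and test percentages is a guess. Check it against the header row of `contamination_report.csv`.

```python
import csv
import io

# Assumed column names; the diff only shows positional values.
ASSUMED_COLUMNS = [
    "dataset", "subset", "model", "kind",
    "train", "dev", "test",
    "approach", "reference", "pr",
]


def parse_rows(text):
    """Parse semicolon-delimited report rows into dicts, skipping blank lines."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=";"):
        if not fields:  # blank separator lines between dataset groups
            continue
        row = dict(zip(ASSUMED_COLUMNS, fields))
        for split in ("train", "dev", "test"):
            # Empty fields (no value reported for a split) become None.
            row[split] = float(row[split]) if row[split] else None
        rows.append(row)
    return rows


added = """\
MedNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
MedNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;

RadNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
RadNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;
"""
rows = parse_rows(added)
all_zero = all(r[s] == 0.0 for r in rows for s in ("train", "dev", "test"))
```

Under these assumptions, `all_zero` being true matches the negative result described in the PR text.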