--- id: "duckdb-leakage-detection" title: "Leakage Detection" slug: "duckdb-leakage-detection-query" description: "Detect data leakage between train and test datasets by calculating the overlap percentage." code: | WITH overlapping_rows AS ( SELECT COUNT(*) AS overlap_count FROM train INTERSECT SELECT COUNT(*) AS overlap_count FROM test ), total_unique_rows AS ( SELECT COUNT(*) AS total_count FROM ( SELECT * FROM train UNION SELECT * FROM test ) combined ) SELECT overlap_count, total_count, (overlap_count * 100.0 / total_count) AS overlap_percentage FROM overlapping_rows, total_unique_rows; --- # DuckDB Leakage Detection Query This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage. ```sql WITH overlapping_rows AS ( SELECT COUNT(*) AS overlap_count FROM train INTERSECT SELECT COUNT(*) AS overlap_count FROM test ), total_unique_rows AS ( SELECT COUNT(*) AS total_count FROM ( SELECT * FROM train UNION SELECT * FROM test ) combined ) SELECT overlap_count, total_count, (overlap_count * 100.0 / total_count) AS overlap_percentage FROM overlapping_rows, total_unique_rows; ``` There is a very good in depth explanation of leakage in public datasets in this [article](https://huggingface.co/blog/lbourdois/lle). # Example Query You can check out this [link](https://huggingface.co/datasets/stanfordnlp/imdb?sql_console=true&sql=WITH%0A++++overlapping_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+train%0A++++++++INTERSECT%0A++++++++SELECT+COUNT%28*%29+AS+overlap_count%0A++++++++FROM+test%0A++++%29%2C%0A++++total_unique_rows+AS+%28%0A++++++++SELECT+COUNT%28*%29+AS+total_count%0A++++++++FROM+%28%0A++++++++++++SELECT+*+FROM+train%0A++++++++++++UNION%0A++++++++++++SELECT+*+FROM+test%0A++++++++%29+combined%0A++++%29%0ASELECT%0A++++overlap_count%2C%0A++++total_count%2C%0A++++%28overlap_count+*+100.0+%2F+total_count%29+AS+overlap_percentage%0AFROM+overlapping_rows%2C+total_unique_rows%3B%0A) for the leakage query for the IMDB dataset. ![leakage-query-imdb](./leakage-query-image.png)