Original PDF dataset

#1
by Neronuser - opened

Hi,

Thanks for building the ViDoRe benchmark. Both academia and industry were lacking an open, varied, and practical document retrieval dataset.
However, are there any plans to share the original documents in PDF format? Your research discusses the difference between purely textual retrieval and vision-based retrieval, but the original PDFs contain text, images, and coordinates for every page element, which makes hybrid approaches possible, for example combining text with layout coordinates instead of using plain text alone. These probably won't beat ColPali, but to get the full picture it would be interesting to see where they fall on the leaderboard.

Apologies if the original PDFs are already shared and I missed them. If so, could someone point me to the dataset?
