	INTRODUCTION

	CDIP [252]), or are restricted to a single domain or a small set of document
	types.
	We posit that larger, fundamental questions in DU remain unanswered due to a
	lack of sufficiently complex datasets and benchmarks with a rich methodology
	covering evaluation beyond the independent and identically distributed (i.i.d.)
	test set setting. While there exist performant models for DU subtasks such
	as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis
	and recognition tasks to models that can reason about and understand documents. A
	truly end-to-end DU solution must handle the complexity and variety of real-world
	documents and subtasks, which could be expressed as natural language
	questions. Moreover, it should be able to generalize to any question on any
	document and reason over multiple pages and modalities.
	The following research questions are addressed in Chapters 4 and 5:
	RQ 6. How can we iteratively close the gap between research and practice in DU?
	RQ 7. How can we design a resource that comprehensively challenges the
	state-of-the-art?
	RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs?
	How can these be incorporated in a benchmark to allow proper measurements
	of future improvements?
	However, moving the goalpost beyond a single-page context inevitably requires
	us to reconsider the research challenge of efficiency in DU. The rise of LLMs
	has enabled a new generation of DU pipelines, which are more flexible and
	easier to maintain than separate and specialized subtask modules, but also
	more computationally demanding. Importantly, most LLMs are not designed
	to handle the multimodality and long context windows of multipage documents,
	and are often unaware of the visual and layout semantics of documents.
	The research questions for Chapter 6 address the efficiency challenge in DU:
	RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for
	more focused information extraction?
	RQ 10. To what degree can model compression resolve the problem of efficiency
	in processing documents?