INTRODUCTION
CDIP [252]), or are restricted to a single domain or a small set of document
types.
We posit that larger, fundamental questions in DU remain unanswered due to a
lack of sufficiently complex datasets and benchmarks with a rich methodology
covering evaluation beyond the independent and identically distributed (i.i.d.)
test set setting. While there exist performant models for DU subtasks such
as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis
and recognition tasks to models that can reason about and understand documents. A
truly end-to-end DU solution must handle the complexity and variety of
real-world documents and subtasks, which could be expressed as natural language
questions. Moreover, it should be able to generalize to any question on any
document and reason over multiple pages and modalities.
The following research questions are addressed in Chapters 4 and 5: | |
RQ 6. How can we iteratively close the gap between research and practice in DU? | |
RQ 7. How can we design a resource that comprehensively challenges the state-of-the-art?
RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs? | |
How can these be incorporated in a benchmark to allow proper measurements | |
of future improvements? | |
However, moving the goalpost beyond a single-page context inevitably requires | |
us to reconsider the research challenge of efficiency in DU. The rise of LLMs | |
has enabled a new generation of DU pipelines, which are more flexible and | |
easier to maintain than separate and specialized subtask modules, but also | |
more computationally demanding. Importantly, most LLMs are not designed
to handle the multimodality and long contexts of multipage documents,
and are often unaware of the visual and layout semantics of documents.
The research questions for Chapter 6 address the efficiency challenge in DU: | |
RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for | |
more focused information extraction? | |
RQ 10. To what degree can model compression resolve the problem of efficiency | |
in processing documents? | |