Abstract
Human communication is increasingly document-based, requiring machines
to understand a wide variety of visually rich documents to assist humans in
their daily lives. Amid the digital evolution, documents continue to facilitate
crucial human and organizational interactions but remain tethered to manual
processing, which causes inefficiency. We examine why organizations lag in adopting
automated document processing solutions and outline two primary challenges:
the complexity of processing long, multimodal documents algorithmically and
the necessity for reliability and control over associated risks. Automated
decision-making is key to improving the efficiency of document processing, but the current
state-of-the-art technology is not yet reliable and robust enough to be deployed
in autonomous systems.
The practical objective is to develop Intelligent Automation systems
capable of estimating confidence in their actions, thereby increasing throughput
without accruing additional costs due to errors. We analyze the key challenges
and propose solutions to bridge the gap between research and practical
applications, with a focus on realistic datasets and experimental methodologies.
Building upon foundations of Document Understanding (DU), this dissertation
introduces advanced methodologies combining Machine Learning, Natural
Language Processing, and Computer Vision.
Addressing the evident gaps in research, this work presents novel methods
for predictive uncertainty quantification alongside practical frameworks for
evaluating the robustness and reliability of DU technologies. The contribution
culminates in the introduction of two novel multipage document classification
datasets and a multifaceted benchmark, DUDE, designed to rigorously
challenge and assess the state of the art in DU. Extensive experiments across
these datasets reveal that while advancements have been made, significant
room for improvement remains, particularly in long-context modeling for
multipage document processing and in calibrated, selective document visual
question answering. Efficient DU is also explored, revealing the effectiveness of