Abstract
Human communication is increasingly document-based, requiring machines
to understand a wide variety of visually-rich documents to assist humans in
their daily lives. Amid the digital evolution, documents continue to facilitate
crucial human and organizational interactions but are tethered to manual
processing, causing inefficiency. We examine why organizations lag in adopting
automated document processing solutions and outline two primary challenges:
the complexity of processing long, multimodal documents algorithmically and
the necessity for reliability and control over associated risks. Automated
decision-making is key to improving the efficiency of document processing, but the current
state-of-the-art technology is not yet reliable and robust enough to be deployed
in autonomous systems.
The practical objective is to develop Intelligent Automation (IA) systems
capable of estimating confidence in their actions, thereby increasing throughput
without accruing additional costs due to errors. We analyze the key challenges
and propose solutions to bridge the gap between research and practical
applications, with a focus on realistic datasets and experimental methodologies.
Building upon foundations of Document Understanding (DU), this dissertation
introduces advanced methodologies combining Machine Learning, Natural
Language Processing, and Computer Vision.
Addressing the evident gaps in research, this work presents novel methods
for predictive uncertainty quantification (PUQ) alongside practical frameworks for
evaluating the robustness and reliability of DU technologies. The contribution
culminates in the introduction of two novel multipage document classification
datasets and a multifaceted benchmark, DUDE, designed to rigorously
challenge and assess the state-of-the-art in DU. Extensive experiments across
these datasets reveal that while advancements have been made, significant
room for improvement remains, particularly in long-context modeling for
multipage document processing and calibrated, selective document visual
question answering. Efficient DU is also explored, revealing the effectiveness of