RQ 2. In which settings are PUQ methods most useful, i.e., which failure sources
/ distribution shifts are they most sensitive to?
RQ 3. How can we obtain better PUQ estimates without overrelying on
computationally prohibitive methods, e.g., Deep Ensemble [238]?
RQ 4. How strongly do particular priors, neural architectures, or
hyperparameters influence the quality of PUQ estimation?
In a later chapter (Chapter 5), we introduce a complex benchmark for generic
DU that additionally tests for robustness to domain, visual and layout shifts,
and explores the novel problem of hallucination and control in natural language
generation (NLG) with LLMs from the perspective of calibrated and selective
DocVQA. The general task formulation involves a natural language question (on
content, aspect, form, visual/layout), an input document, and a set of reference
answers. The model is expected to provide a natural language answer, an answer
confidence and a (binary) abstention decision. Evaluation is done in terms of
answer correctness, calibration and selective prediction. On the one hand, one
expects a model to lower its confidence when unsure about the correctness of a
predicted answer. On the other hand, one expects a model to abstain from
answering and refrain from hallucinating on unanswerable questions (which
were explicitly added to the dataset).
RQ 5. How severe is the problem of hallucination and control in LLMs when
evaluated in a selective, free-form DocVQA task setting?
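To make the task formulation above concrete, the following is a minimal sketch of how an answer, a confidence and an abstention decision could be scored for correctness, calibration and selective prediction. It is illustrative only and not the benchmark's evaluation code: the names `Prediction`, `answer_correct`, `expected_calibration_error` and `selective_scores` are hypothetical, correctness is assumed to be a case-insensitive exact match, an empty reference list is assumed to mark an unanswerable question, and calibration is assumed to be measured with an equal-width binned expected calibration error.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    answer: str            # free-form natural language answer
    confidence: float      # answer confidence in [0, 1]
    abstain: bool          # binary abstention decision
    references: List[str]  # assumption: empty list = unanswerable question


def answer_correct(p: Prediction) -> bool:
    """Assumed metric: case-insensitive exact match against any reference.

    For unanswerable questions there is no reference, so any produced
    answer counts as incorrect (a hallucination).
    """
    norm = p.answer.strip().lower()
    return any(norm == r.strip().lower() for r in p.references)


def expected_calibration_error(preds: List[Prediction], n_bins: int = 10) -> float:
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(answer_correct(p) for p in b) / len(b)
        conf = sum(p.confidence for p in b) / len(b)
        ece += len(b) / len(preds) * abs(acc - conf)
    return ece


def selective_scores(preds: List[Prediction]) -> Tuple[float, float]:
    """Return (coverage, selective accuracy) over the non-abstained predictions."""
    answered = [p for p in preds if not p.abstain]
    coverage = len(answered) / len(preds)
    if not answered:
        return coverage, 0.0
    accuracy = sum(answer_correct(p) for p in answered) / len(answered)
    return coverage, accuracy
```

Under these assumptions, a model that answers an unanswerable question lowers its selective accuracy, while one that abstains only lowers its coverage; a high confidence on a wrong answer raises the calibration error.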
1.2.2 Realistic and Efficient Document Understanding
The second part of the dissertation focuses on the more applied research questions
of realistic and efficient DU. The overall objective is to make DU technology
more generically applicable (Chapter 5), evaluation more in sync with real-world
requirements (Chapters 4 and 5), and more efficient at modeling the multimodal
and compositional nature of documents (Chapters 5 and 6).
Due to the proximity to business applications and the risks of leaking personal
information, DU research benchmarks have diverged substantially from the
real-world distributions of document data. For instance, DU datasets are often
limited to single-page document images, are from outdated sources (e.g., IIT- | |