PROBLEM STATEMENT AND QUESTIONS
RQ 2. In which settings are PUQ methods most useful, i.e., which failure sources
/ distribution shifts are they most sensitive to?
RQ 3. How can we obtain better PUQ estimates without overrelying on
computationally prohibitive methods, e.g., Deep Ensemble [238]?
RQ 4. How strongly do the choice of prior, neural architecture and
hyperparameters influence the quality of PUQ estimation?
In a later chapter (Chapter 5), we introduce a complex benchmark for generic
DU that additionally tests for robustness to domain, visual and layout shifts,
and explores the novel problem of hallucination and control in natural language
generation (NLG) with LLMs from the perspective of calibrated and selective
DocVQA. The general task formulation involves a natural language question (on
content, aspect, form, visual/layout), an input document, and a set of reference
answers. The model is expected to provide a natural language answer, an answer
confidence and a (binary) abstention decision. Evaluation is done in terms of
answer correctness, calibration and selective prediction. On the one hand, one
expects a model to lower confidence when unsure about the correctness of a
predicted answer. On the other hand, one expects a model to abstain from
answering and refrain from hallucinations on unanswerable questions (which
were explicitly added to the dataset).
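To make the three evaluation axes concrete, the following is a minimal Python sketch under stated assumptions, not the benchmark's actual scoring code. The `Prediction` record, the function names, and the `correct` flag (standing in for a comparison against the reference answers) are all hypothetical. Calibration is illustrated with an equal-width binned expected calibration error, and selective prediction with coverage and selective risk over the non-abstained answers; the benchmark itself may use different metrics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """Hypothetical record of one model output for one question."""
    answer: str        # natural language answer
    confidence: float  # answer confidence in [0, 1]
    abstain: bool      # binary abstention decision
    correct: bool      # assumed: answer matches one of the reference answers


def expected_calibration_error(preds: List[Prediction], n_bins: int = 10) -> float:
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if not preds:
        raise ValueError("preds must not be empty")
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        accuracy = sum(p.correct for p in bucket) / len(bucket)
        mean_conf = sum(p.confidence for p in bucket) / len(bucket)
        ece += len(bucket) / len(preds) * abs(accuracy - mean_conf)
    return ece


def coverage_and_selective_risk(preds: List[Prediction]) -> Tuple[float, float]:
    """Coverage: fraction of questions answered. Risk: error rate among answered ones."""
    if not preds:
        raise ValueError("preds must not be empty")
    answered = [p for p in preds if not p.abstain]
    coverage = len(answered) / len(preds)
    risk = sum(not p.correct for p in answered) / len(answered) if answered else 0.0
    return coverage, risk


# Toy data, invented purely for illustration.
example = [
    Prediction("42", 0.95, abstain=False, correct=True),
    Prediction("Paris", 0.85, abstain=False, correct=False),
    Prediction("2019", 0.55, abstain=False, correct=True),
    Prediction("", 0.15, abstain=True, correct=False),
]
ece = expected_calibration_error(example)
coverage, risk = coverage_and_selective_risk(example)
```

In this sketch, a model that lowers its confidence on wrong answers reduces the calibration error, while a model that abstains on questions it would answer incorrectly reduces selective risk at the cost of coverage.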
RQ 5. How severe is the problem of hallucination and control in LLMs when
evaluated in a selective, free-form DocVQA task setting?
1.2.2 Realistic and Efficient Document Understanding
The second part of the dissertation focuses on the more applied research questions
of realistic and efficient DU. The overall objective is to make DU technology
more generically applicable (Chapter 5) and more efficient at modeling the
multimodal and compositional nature of documents (Chapters 5 and 6), and to
bring evaluation more in sync with real-world requirements (Chapters 4 and 5).
Due to the proximity to business applications and the risks of leaking personal
information, DU research benchmarks have diverged substantially from the
real-world distributions of document data. For instance, DU datasets are often
limited to single-page document images, are from outdated sources (e.g., IIT-