INTRODUCTION

Documents are containers of information that are easily shareable. The concept
of a document dates back to when humans started writing and has been a
cornerstone of human communication ever since. In the age of digital technology,
documents are still the primary means of communication between humans and
organizations and form the backbone of many business processes. Human
communication is increasingly happening through digital channels, and the
COVID-19 pandemic has only accelerated this trend. We are increasingly living
in a “document society” [53], dependent on documents in our daily lives or for
recording second-hand knowledge. With instant gratification as the norm in
the digital age, people expect similarly seamless interactions with businesses and
governments. While digitization has increased the speed and ease of document-based
communication, document processing remains a largely human effort with
organizations drowning under the sheer volume of documents they receive.
So why have organizations not switched en masse to
automated document processing?
The answer lies partly in (I) the complexity of the task and partly in
(II) the need for reliability and risk control.
(I) While it might be straightforward for a human (white-collar) worker to read
a long, structured document, understand its contents, categorize it, and extract
crucial information accordingly, this is not so easy for a machine. This could be
perceived as an instance of Moravec’s paradox [319], which observes that tasks
that are easy for humans are often hard for machines, and vice versa. However, in
recent times, significant strides forward have been made thanks to technological
advances combining Natural Language Processing (NLP), Computer Vision
(CV) and Machine Learning (ML). Document Understanding (DU) is
the umbrella term for both the end-to-end solution and the research field
that studies how to make machines interpret and understand documents (elaborated
on in Section 2.3). It has seen a surge in interest in the past few years, with
the rise of large-scale pretrained Language and Vision models (LLMs, VLMs)
[52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.
What makes DU challenging is that it encompasses multiple subtasks, each of
which is a research field in its own right, such as Optical Character Recognition
(OCR), Document Layout Analysis (DLA), Document Classification (DC), Key
Information Extraction (KIE), Visual Question Answering (VQA), etc. The
complexity of the task is further increased by the fact that documents are
multimodal, containing both text and images, and that they are compositional,
i.e., the meaning of the document is not just the sum of its parts. Information
can appear in a wide range of forms including text, images, tables or graphs,
and be spread across multiple pages. Moreover, the meaning of a document