INTRODUCTION
Documents are containers of information that are easily shareable. The concept
of a document dates back to when humans started writing and has been a
cornerstone of human communication ever since. In the age of digital technology,
documents are still the primary means of communication between humans and
organizations and form the backbone of many business processes. Human
communication is increasingly happening through digital channels, and the
COVID-19 pandemic has only accelerated this trend. We are increasingly living
in a “document society” [53], dependent on documents in our daily lives or for
recording second-hand knowledge. With instant gratification as the norm in
the digital age, people expect similar seamless interactions with businesses and
governments. While digitization has increased the speed and ease of
document-based communication, document processing remains a largely human effort, with
organizations drowning under the sheer volume of documents they receive.
So why have organizations not switched en masse to
automated document processing?
The answer lies partly in (I) the complexity of the task and partly in
(II) the need for reliability and risk control.
(I) While it might be straightforward for a human (white-collar) worker to read
a long, structured document, understand its contents, categorize it, and extract
crucial information accordingly, this is not so easy for a machine. This could be
perceived as an instance of Moravec’s paradox [319], which states that tasks
that are easy for humans are hard for machines, and vice versa. However, in
recent times, significant strides forward have been made thanks to technological
advances combining Natural Language Processing (NLP), Computer Vision
(CV) and Machine Learning (ML). Document Understanding (DU) is
the umbrella term for both the end-to-end solution and the research field
that studies how to make machines interpret and understand documents (elaborated
on in Section 2.3). It has seen a surge in interest in the past few years, with
the rise of large-scale pretrained Language and Vision models (LLM, VLM)
[52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.
What makes DU challenging is that it encompasses multiple subtasks, each of
which is a research field in its own right, such as Optical Character Recognition
(OCR), Document Layout Analysis (DLA), Document Classification (DC), Key
Information Extraction (KIE), and Visual Question Answering (VQA). The
complexity of the task is further increased by the fact that documents are
multimodal, containing both text and images, and that they are compositional,
i.e., the meaning of the document is not just the sum of its parts. Information
can appear in a wide range of forms, including text, images, tables or graphs,
and be spread across multiple pages. Moreover, the meaning of a document