2 INTRODUCTION
Documents are containers of information that are easily shareable. The concept of a document dates back to when humans started writing and has been a cornerstone of human communication ever since. In the age of digital technology, documents are still the primary means of communication between humans and organizations and form the backbone of many business processes. Human communication is increasingly happening through digital channels, and the COVID-19 pandemic has only accelerated this trend. We are increasingly living in a “document society” [53], dependent on documents in our daily lives or for recording second-hand knowledge. With instant gratification as the norm in the digital age, people expect similarly seamless interactions with businesses and governments. While digitization has increased the speed and ease of document-based communication, document processing remains a largely human effort, with organizations drowning under the sheer volume of documents they receive.
So why have organizations not switched en masse to automated document processing? The answer lies partly in (I) the complexity of the task and partly in (II) the need for reliability and risk control.
(I) While it might be straightforward for a human (white-collar) worker to read a long, structured document, understand its contents, categorize it, and extract crucial information accordingly, this is not so easy for a machine. This could be perceived as an instance of Moravec’s paradox [319], which states that tasks that are easy for humans are hard for machines, and vice versa. In recent years, however, significant progress has been made thanks to technological advances combining Natural Language Processing (NLP), Computer Vision (CV), and Machine Learning (ML).
Document Understanding (DU) is the umbrella term for both the end-to-end solution and the research field that studies how to make machines interpret and understand documents (elaborated on in Section 2.3). It has seen a surge in interest in the past few years with the rise of large-scale pretrained Language and Vision models (LLM, VLM) [52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs. What makes DU challenging is that it encompasses multiple subtasks, each of which is a research field in its own right, such as Optical Character Recognition (OCR), Document Layout Analysis (DLA), Document Classification (DC), Key Information Extraction (KIE), and Visual Question Answering (VQA). The complexity of the task is further increased by the fact that documents are multimodal, containing both text and images, and that they are compositional, i.e., the meaning of the document is not just the sum of its parts. Information can appear in a wide range of forms, including text, images, tables, or graphs, and can be spread across multiple pages. Moreover, the meaning of a document