	INTRODUCTION

	Documents are containers of information that are easily shareable. The concept
	of a document dates back to when humans started writing and has been a
	cornerstone of human communication ever since. In the age of digital technology,
	documents are still the primary means of communication between humans and
	organizations and form the backbone of many business processes. Human
	communication is increasingly happening through digital channels, and the
	COVID-19 pandemic has only accelerated this trend. We are increasingly living
	in a “document society” [53], dependent on documents in our daily lives or for
	recording second-hand knowledge. With instant gratification as the norm in
	the digital age, people expect similarly seamless interactions with businesses and
	governments. While digitization has increased the speed and ease of document-based
	communication, document processing remains a largely human effort with
	organizations drowning under the sheer volume of documents they receive.
	So why have organizations not switched en masse to
	automated document processing?
	The answer lies partly in (I) the complexity of the task and partly in
	(II) the need for reliability and risk control.
	(I) While it might be straightforward for a human (white-collar) worker to read
	a long, structured document, understand its contents, categorize it, and extract
	crucial information accordingly, this is not so easy for a machine. This could be
	perceived as an instance of Moravec’s paradox [319], which states that tasks
	that are easy for humans are hard for machines, and vice versa. However, in
	recent times, significant strides forward have been made thanks to technological
	advances combining Natural Language Processing (NLP), Computer Vision
	(CV) and Machine Learning (ML). Document Understanding (DU) is
	the umbrella term for both the end-to-end solution and the research field
	that studies how machines can interpret and understand documents (elaborated
	on in Section 2.3). It has seen a surge in interest in the past few years, with
	the rise of large-scale pretrained Language and Vision models (LLM, VLM)
	[52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.
	What makes DU challenging is that it encompasses multiple subtasks, each of
	which is a research field in its own right, such as Optical Character Recognition
	(OCR), Document Layout Analysis (DLA), Document Classification (DC), Key
	Information Extraction (KIE), Visual Question Answering (VQA), etc. The
	complexity of the task is further increased by the fact that documents are
	multimodal, containing both text and images, and that they are compositional,
	i.e., the meaning of the document is not just the sum of its parts. Information
	can appear in a wide range of forms including text, images, tables or graphs,
	and be spread across multiple pages. Moreover, the meaning of a document