Definition 1. A probabilistic predictor is a function $f : \mathcal{X} \to \Delta_{\mathcal{Y}}$ that outputs a conditional probability distribution $P(y' \mid x)$ over outputs $y' \in \mathcal{Y}$ for an i.i.d. drawn sample $(x, y)$.

Definition 2 (Probability Simplex). Let $\Delta_{\mathcal{Y}} := \{v \in \mathbb{R}_{\geq 0}^{|\mathcal{Y}|} : \|v\|_1 = 1\}$ be the probability simplex of dimension $|\mathcal{Y}| - 1$, a geometric representation of a probability space in which each vertex represents a mutually exclusive label and each point has an associated probability vector $v$ [368].

Figure 2.1 illustrates a multi-class classifier, where $\mathcal{Y} = [K]$ for $K = 3$ classes.

Figure 2.1. Scatter plot of a ternary problem ($K = 3$, $N = 100$) in the probability simplex, showing an overconfident misprediction (the image is of a Shiba Inu dog) and a correct, sharp prediction (a clear image of a Beagle).

In practice, loss functions are proper scoring rules [330], $S : \Delta_{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$, that measure the quality of a probabilistic prediction $P(\hat{y} \mid x)$ given the true label $y$. The cross-entropy (CE) loss is a popular loss function for classification, while the mean-squared error (MSE) loss is commonly used for regression; see the sketch at the end of this section for an illustration. In Section 2.2, we discuss the evaluation of probabilistic predictors in more detail, including the calibration of confidence estimates and the detection of out-of-distribution samples.

2.1.3 Architectures

Throughout the chapters of this thesis, we primarily use the following NN architectures: Convolutional Neural Networks (CNNs) and Transformer networks. We will briefly introduce the building blocks of these architectures, with a focus on how they are used in the context of document understanding.
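To make Definitions 1 and 2 and the role of the cross-entropy loss concrete, the following is a minimal sketch in Python with NumPy; all function and variable names are illustrative and not part of the thesis. It maps raw classifier scores into the probability simplex via a softmax, verifies the simplex constraints, and scores the resulting prediction with the cross-entropy loss.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to a point in the probability simplex (Definition 2)."""
    z = logits - logits.max()          # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy(probs, y):
    """Cross-entropy loss, a proper scoring rule: -log P(y | x)."""
    return -np.log(probs[y])

# Ternary problem (K = 3), matching the setting of Figure 2.1.
logits = np.array([2.0, 0.5, -1.0])   # illustrative raw outputs for one sample x
probs = softmax(logits)               # P(y' | x) for y' in {0, 1, 2}

# The output is a valid point in the simplex: non-negative, sums to 1.
assert np.all(probs >= 0) and np.isclose(probs.sum(), 1.0)

print(cross_entropy(probs, 0))        # low loss: sharp and correct prediction
print(cross_entropy(probs, 2))        # high loss: overconfident misprediction
```

The two evaluations mirror the two cases in Figure 2.1: a proper scoring rule rewards a confident prediction only when it is also correct, and penalizes the overconfident misprediction heavily.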