Equation (2.4) defines regularized empirical risk minimization (RERM), where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the trade-off between the empirical risk (denoted R̂) and the regularization term.
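For concreteness, and assuming the standard formulation consistent with the notation above, the RERM objective of Equation (2.4) can be written as

    θ̂ = arg min_θ  R̂(θ) + λ Ψ(θ),

where λ = 0 recovers plain empirical risk minimization and larger values of λ increasingly favor solutions that are simple under Ψ.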
All these concepts will be revisited in the context of neural networks in Section 2.1.1, where we will also discuss the optimization of the model parameters θ, how inference differs for probabilistic models used to estimate uncertainty (Section 2.2.5), and how regularization affects confidence estimation and calibration (Section 2.2.4).
2.1.1 Neural Networks
An artificial neural network (NN) is a mathematical approximation inspired by data processing in the human brain [396]. It can be represented as a network topology of interconnected neurons organized in layers that successively refine intermediate feature representations of the input [448], yielding features useful for the task at hand, e.g., classifying an animal by its size, shape, and fur, or detecting the sentiment of a review by focusing on adjectives.
A basic NN building block is the linear layer, an affine function of the input: f(x) = Wx + b, where the bias term b is a constant vector that shifts the decision boundary away from the origin, and the weight matrix W holds most of the parameters and rotates the decision boundary in input space. Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) introduce non-linearity into the model, which is required for learning complex functions.
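To make this building block concrete, the following minimal NumPy sketch (an illustration only; the dimensions, initialization, and input values are hypothetical) implements an affine layer followed by a ReLU activation:

    import numpy as np

    def linear_layer(x, W, b):
        # Affine transformation f(x) = W x + b
        return W @ x + b

    def relu(z):
        # Element-wise non-linearity; without it, stacked linear layers
        # would collapse into a single linear map
        return np.maximum(0.0, z)

    # Hypothetical example: map a 4-dimensional input to 3 hidden units
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))   # weight matrix (rotates/scales the input)
    b = np.zeros(3)               # bias vector (shifts the decision boundary)
    x = rng.normal(size=4)        # example input

    h = relu(linear_layer(x, W, b))
    print(h.shape)                # (3,)

Stacking several such layers, with non-linearities between them, is what allows the resulting composition to represent functions that a single linear map cannot.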
The first deep learning (DL) network (stacking multiple linear layers) dates back to 1965 [191], yet the term 'Deep Learning' was coined in 1986 [398]. The first successful DL application was a demonstration of digit recognition in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent success of DL is attributed to the availability of large datasets, the increase in computational power, the development of new algorithms and architectures, and the commercial interest of large companies.
Consider a conventional DL architecture as a composition of parameterized functions. Each function consists of a configuration of layers (e.g., convolution, pooling, activation function, normalization, embeddings) that determines the type of input transformation (e.g., convolutional, recurrent, attention), with (trainable) parameters that are linear or non-linear w.r.t. the input x. Given the type of input, e.g., language, which is naturally discrete-sequential, or vision, which presents a