Equation (2.4) defines regularized empirical risk minimization (RERM), where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the trade-off between the empirical risk (denoted R̂) and the regularization term.
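For concreteness, and assuming the standard formulation consistent with the notation above, the RERM objective of Equation (2.4) can be written as

    θ̂ = arg min_θ  R̂(θ) + λ Ψ(θ),

where λ = 0 recovers plain empirical risk minimization and larger values of λ increasingly favor solutions that are simple under Ψ.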
All these concepts will be revisited in the context of neural networks in Section 2.1.1, where we will also discuss the optimization of the model parameters θ, how inference differs for probabilistic models used to estimate uncertainty (Section 2.2.5), and how regularization affects confidence estimation and calibration (Section 2.2.4).
2.1.1 Neural Networks
An artificial neural network (NN) is a mathematical approximation inspired by data processing in the human brain [396]. It can be represented as a network topology of interconnected neurons organized in layers that successively refine intermediate feature representations of the input [448], yielding features useful for the task at hand, e.g., classifying an animal by its size, shape, and fur, or detecting the sentiment of a review by focusing on adjectives.
A basic NN building block is the linear layer, an affine function of the input: f(x) = Wx + b, where the bias term b is a constant vector that shifts the decision boundary away from the origin, and the weight matrix W holds most of the parameters and rotates the decision boundary in input space. Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) introduce non-linearity into the model, which is required for learning complex functions.
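To make this building block concrete, the following minimal NumPy sketch (an illustration only; the dimensions, initialization, and input values are hypothetical) implements an affine layer followed by a ReLU activation:

    import numpy as np

    def linear_layer(x, W, b):
        # Affine transformation f(x) = W x + b
        return W @ x + b

    def relu(z):
        # Element-wise non-linearity; without it, stacked linear layers
        # would collapse into a single linear map
        return np.maximum(0.0, z)

    # Hypothetical example: map a 4-dimensional input to 3 hidden units
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))   # weight matrix (rotates/scales the input)
    b = np.zeros(3)               # bias vector (shifts the decision boundary)
    x = rng.normal(size=4)        # example input

    h = relu(linear_layer(x, W, b))
    print(h.shape)                # (3,)

Stacking several such layers, with non-linearities between them, is what allows the resulting composition to represent functions that a single linear map cannot.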
The first deep learning (DL) network (stacking multiple linear layers) dates back to 1965 [191], yet the term 'Deep Learning' was coined in 1986 [398]. The first successful DL application was a demonstration of digit recognition in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent success of DL is attributed to the availability of large datasets, the increase in computational power, the development of new algorithms and architectures, and the commercial interest of large companies.
Consider a conventional DL architecture as a composition of parameterized functions. Each function consists of a configuration of layers (e.g., convolution, pooling, activation function, normalization, embeddings) that determines the type of input transformation (e.g., convolutional, recurrent, attention), with (trainable) parameters that are linear or non-linear w.r.t. the input x. Given the type of input, e.g., language, which is naturally discrete-sequential, or vision, which presents a