possible functions. The objective is to find a function $f \in \mathcal{F}$ that minimizes the risk, or, even better, attains the Bayes risk

$$f^* = \inf_{f \in \mathcal{F}} R(f), \tag{2.2}$$

which is the minimum achievable risk over all functions in $\mathcal{F}$; when $\mathcal{F}$ is rich enough, under the squared loss this minimizer is the conditional mean $f^*(x) = \mathbb{E}[Y \mid X = x]$. The latter is only realizable with infinite data or with access to the data-generating distribution $P(X, Y)$. In practice, the minimizer in Equation (2.2) is unknown, and the goal is to find a
function $\hat{f}$ that minimizes the empirical risk

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f), \qquad \hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)), \tag{2.3}$$

where $(x_i, y_i)$ are $N$ independently and identically distributed (i.i.d.) samples drawn from an unknown distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. This is known as empirical risk minimization (ERM), a popular approach to supervised learning, under which three important processes are defined.
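Before turning to these processes, the sketch below makes the gap between Equations (2.2) and (2.3) concrete by contrasting the empirical risk on a finite sample with a Monte Carlo estimate of the true risk, which is feasible here only because the data-generating distribution is assumed known; the distribution, candidate function, and squared loss are all illustrative choices, not part of the formalism above.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(y, y_hat):
    # Squared loss ell(y, f(x)); an illustrative choice of loss function.
    return (y - y_hat) ** 2

def f(x):
    # A fixed candidate function from the hypothesis class F.
    return 2.5 * x

def sample(n):
    # Hypothetical data-generating distribution P(X, Y): Y = 3X + noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 3.0 * x + rng.normal(scale=0.1, size=n)
    return x, y

# Empirical risk (Equation 2.3) on a small i.i.d. sample.
x_n, y_n = sample(50)
empirical_risk = np.mean(loss(y_n, f(x_n)))

# Monte Carlo approximation of the true risk R(f), possible only
# because we can draw unlimited samples from P(X, Y).
x_m, y_m = sample(1_000_000)
approx_true_risk = np.mean(loss(y_m, f(x_m)))

print(f"empirical risk (N=50): {empirical_risk:.4f}")
print(f"approx. true risk:     {approx_true_risk:.4f}")
```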
Training or model fitting is the process of estimating the parameters $\theta$ of a model, which is done by minimizing a suitable loss function $\ell$ over a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ i.i.d. samples.
Inference or prediction is the process of estimating the output of a model for a given input, which is typically done by computing the posterior probability $P(y \mid x)$ over the output space $\mathcal{Y}$. In classification the output is a discrete label, while in regression it is a continuous value.
Evaluation involves measuring the quality of a model's predictions, which is typically done by computing a suitable evaluation metric over a test set $\mathcal{D}_{\text{test}}$ of i.i.d. samples that were not used for training.
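The following is a minimal end-to-end sketch of these three processes for a linear model fit by ERM under the squared loss; the synthetic data, the closed-form least-squares fit, and the mean-squared-error metric are illustrative assumptions rather than the only possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. samples; in practice P(X, Y) is unknown.
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=200)
x_train, y_train = x[:150], y[:150]   # training set D
x_test, y_test = x[150:], y[150:]     # held-out test set D_test

def design(x):
    # Append a bias column so theta = (slope, intercept).
    return np.hstack([x, np.ones((len(x), 1))])

# Training: estimate theta by minimizing the empirical squared loss
# (ordinary least squares admits a closed-form ERM solution).
theta, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)

# Inference: predict outputs for inputs unseen during training.
y_pred = design(x_test) @ theta

# Evaluation: score the predictions on the test set.
test_mse = np.mean((y_test - y_pred) ** 2)
print(f"theta = {theta}, test MSE = {test_mse:.4f}")
```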
However, ERM has caveats concerning generalization to unseen data: it requires additional assumptions on the hypothesis class $\mathcal{F}$, known as inductive biases, and/or regularization to penalize the complexity of the function class $\mathcal{F}$ [445]. In neural networks (discussed in detail in Section 2.1.1), the former is controlled by the architecture of the network, while the latter involves constraining the parameters or adding a regularization term to the loss function:
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Psi(\theta), \tag{2.4}$$

where $\hat{R}(f)$ is the empirical risk from Equation (2.3), $\Psi(\theta)$ penalizes the complexity of the parameters, and $\lambda \geq 0$ controls the strength of the regularization.
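As one concrete instance of Equation (2.4), choosing the squared-norm penalty $\Psi(\theta) = \|\theta\|_2^2$ for a linear model under the squared loss yields ridge regression, which admits a closed-form solution; the sketch below is illustrative, with the function name and data being assumptions (and with $\lambda$ absorbing the $1/N$ factor of the empirical risk).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized ERM with Psi(theta) = ||theta||^2 (Equation 2.4):
    minimizes ||X @ theta - y||^2 + lam * ||theta||^2, whose closed-form
    solution is theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam shrinks the parameters toward zero, trading a worse fit
# on the training set for lower model complexity.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)
for lam in (0.0, 1.0, 100.0):
    theta = ridge_fit(X, y, lam)
    print(f"lam = {lam:6.1f}   ||theta|| = {np.linalg.norm(theta):.3f}")
```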