possible functions. The objective is to find a function $f \in \mathcal{F}$ that minimizes the risk, or, even better, attains the Bayes risk

$$f^* = \inf_{f \in \mathcal{F}} R(f), \tag{2.2}$$

which is the minimum achievable risk over all functions in $\mathcal{F}$; when $\mathcal{F}$ is rich enough, under the squared loss this minimizer is the conditional mean $f^*(x) = \mathbb{E}[Y \mid X = x]$. The latter is only realizable with infinite data or with access to the data-generating distribution $P(X, Y)$. In practice, the minimizer in Equation (2.2) is unknown, and the goal is to find a
function $\hat{f}$ that minimizes the empirical risk

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f), \qquad \hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)), \tag{2.3}$$

where $(x_i, y_i)$ are $N$ independently and identically distributed (i.i.d.) samples drawn from an unknown distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. This is known as empirical risk minimization (ERM), a popular approach to supervised learning, under which three important processes are defined.
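Before turning to these processes, the sketch below makes the gap between Equations (2.2) and (2.3) concrete by contrasting the empirical risk on a finite sample with a Monte Carlo estimate of the true risk, which is feasible here only because the data-generating distribution is assumed known; the distribution, candidate function, and squared loss are all illustrative choices, not part of the formalism above.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(y, y_hat):
    # Squared loss ell(y, f(x)); an illustrative choice of loss function.
    return (y - y_hat) ** 2

def f(x):
    # A fixed candidate function from the hypothesis class F.
    return 2.5 * x

def sample(n):
    # Hypothetical data-generating distribution P(X, Y): Y = 3X + noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 3.0 * x + rng.normal(scale=0.1, size=n)
    return x, y

# Empirical risk (Equation 2.3) on a small i.i.d. sample.
x_n, y_n = sample(50)
empirical_risk = np.mean(loss(y_n, f(x_n)))

# Monte Carlo approximation of the true risk R(f), possible only
# because we can draw unlimited samples from P(X, Y).
x_m, y_m = sample(1_000_000)
approx_true_risk = np.mean(loss(y_m, f(x_m)))

print(f"empirical risk (N=50): {empirical_risk:.4f}")
print(f"approx. true risk:     {approx_true_risk:.4f}")
```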
Training or model fitting is the process of estimating the parameters $\theta$ of a model, which is done by minimizing a suitable loss function $\ell$ over a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ i.i.d. samples.
Inference or prediction is the process of estimating the output of a model for a given input, which is typically done by computing the posterior probability $P(y \mid x)$ over the output space $\mathcal{Y}$. In classification the output is a discrete label, while in regression it is a continuous value.
Evaluation involves measuring the quality of a model's predictions, which is typically done by computing a suitable evaluation metric over a test set $\mathcal{D}_{\text{test}}$ of i.i.d. samples that were not used for training.
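The following is a minimal end-to-end sketch of these three processes for a linear model fit by ERM under the squared loss; the synthetic data, the closed-form least-squares fit, and the mean-squared-error metric are illustrative assumptions rather than the only possible choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. samples; in practice P(X, Y) is unknown.
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=200)
x_train, y_train = x[:150], y[:150]   # training set D
x_test, y_test = x[150:], y[150:]     # held-out test set D_test

def design(x):
    # Append a bias column so theta = (slope, intercept).
    return np.hstack([x, np.ones((len(x), 1))])

# Training: estimate theta by minimizing the empirical squared loss
# (ordinary least squares admits a closed-form ERM solution).
theta, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)

# Inference: predict outputs for inputs unseen during training.
y_pred = design(x_test) @ theta

# Evaluation: score the predictions on the test set.
test_mse = np.mean((y_test - y_pred) ** 2)
print(f"theta = {theta}, test MSE = {test_mse:.4f}")
```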
However, ERM has caveats concerning generalization to unseen data: it requires additional assumptions on the hypothesis class $\mathcal{F}$, known as inductive biases, and/or regularization to penalize the complexity of the function class $\mathcal{F}$ [445]. In neural networks (discussed in detail in Section 2.1.1), the former is controlled by the architecture of the network, while the latter involves constraining the parameters or adding a regularization term to the loss function:
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) + \lambda \Psi(\theta), \tag{2.4}$$

where $\hat{R}(f)$ is the empirical risk from Equation (2.3), $\Psi(\theta)$ penalizes the complexity of the parameters, and $\lambda \geq 0$ controls the strength of the regularization.
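As one concrete instance of Equation (2.4), choosing the squared-norm penalty $\Psi(\theta) = \|\theta\|_2^2$ for a linear model under the squared loss yields ridge regression, which admits a closed-form solution; the sketch below is illustrative, with the function name and data being assumptions (and with $\lambda$ absorbing the $1/N$ factor of the empirical risk).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized ERM with Psi(theta) = ||theta||^2 (Equation 2.4):
    minimizes ||X @ theta - y||^2 + lam * ||theta||^2, whose closed-form
    solution is theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam shrinks the parameters toward zero, trading a worse fit
# on the training set for lower model complexity.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)
for lam in (0.0, 1.0, 100.0):
    theta = ridge_fit(X, y, lam)
    print(f"lam = {lam:6.1f}   ||theta|| = {np.linalg.norm(theta):.3f}")
```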