For the sake of completeness, there exist different notions of calibration, differing in the subset of predictions considered over ∆Y [463]:
I. top-1 [156]
II. top-r [159]
III. canonical calibration [51]
Formally, a classifier $f$ is said to be canonically calibrated iff
\[
P(Y = y_k \mid f(X) = \rho) = \rho_k \qquad \forall k \in [K],\ \forall \rho \in [0, 1]^K, \quad \text{where } K = |\mathcal{Y}|. \tag{2.17}
\]
However, this strictest notion of calibration becomes infeasible to estimate once the cardinality of the output space exceeds a certain size [157].
For discrete target spaces with a large number of classes, there is considerable interest in knowing whether a model is calibrated on its less likely predictions as well. Several relaxed notions of calibration have therefore been proposed, which are more feasible to compute and allow models to be compared on a more equal footing. These include: top-label [157], top-r [159], within-top-r [159], and marginal calibration [229, 231, 342, 492].
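As an illustration, the following is a minimal sketch of how top-label calibration might be assessed with a binned expected calibration error (ECE). The binning scheme and function name are illustrative choices, not a prescribed implementation from the cited works.

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=15):
    """Binned expected calibration error over top-1 (top-label) predictions.

    probs:  (N, K) array of predicted class probabilities
    labels: (N,)   array of true class indices
    """
    confidences = probs.max(axis=1)      # top-label confidence
    predictions = probs.argmax(axis=1)   # top-label prediction
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| in this bin, weighted by the bin's mass
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

For a perfectly top-label-calibrated classifier, per-bin accuracy matches per-bin confidence and the estimate approaches zero.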
2.2.5 Predictive Uncertainty Quantification
Bayes’ theorem [26] is a fundamental result in probability theory, which provides a principled way to update beliefs about an event given new evidence. Bayesian Deep Learning (BDL) methods build on these solid mathematical foundations and promise reliable predictive uncertainty quantification (PUQ) [124, 136, 140, 238, 301, 325, 326, 464, 466, 496].
The Bayesian approach consists of casting learning and prediction as an inference task about hypotheses (uncertain quantities, with θ representing all BNN parameters: weights w, biases b, and the model structure) from training data (measurable quantities, D = {(x_i, y_i)}_{i=1}^N = (X, Y)).
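Concretely, this inference task takes the standard textbook form of a parameter posterior and a posterior predictive distribution (the test-point notation x*, y* is introduced here only for illustration):

\[
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
\qquad
p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta .
\]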
Bayesian Neural Networks (BNNs) are in theory able to avoid the pitfalls of stochastic non-convex optimization of non-linear tunable functions with many high-dimensional parameters [300]. More specifically, BNNs capture the uncertainty in the NN parameters by learning a distribution over them rather than a single point estimate. This offers advantages in terms of data efficiency, resistance to overfitting thanks to the regularization induced by parameter priors, model complexity control, and robustness to noise due to the probabilistic nature of the model. However, BNNs come with their own challenges, such as the increased computational cost of learning and inference, the difficulty of specifying appropriate weight or function priors, and the need for specialized training algorithms or architectural extensions.
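To make the predictive side concrete, below is a minimal sketch of Monte Carlo prediction with a toy one-hidden-layer regression network whose weights are drawn from an assumed diagonal-Gaussian approximate posterior. The posterior means and standard deviations here are placeholders chosen for illustration; in practice they would come from a BNN inference procedure such as variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed diagonal-Gaussian approximate posterior over the parameters theta
# (means and standard deviations are placeholders, not learned values).
post_mean = {"W1": rng.normal(size=(1, 16)), "b1": np.zeros(16),
             "W2": rng.normal(size=(16, 1)), "b2": np.zeros(1)}
post_std = {k: 0.1 * np.ones_like(v) for k, v in post_mean.items()}

def forward(x, theta):
    """One-hidden-layer regression network; x has shape (N, 1)."""
    h = np.tanh(x @ theta["W1"] + theta["b1"])
    return h @ theta["W2"] + theta["b2"]

def predict(x, n_samples=100):
    """Monte Carlo estimate of the posterior predictive mean and spread."""
    preds = []
    for _ in range(n_samples):
        # Draw one parameter sample from the (assumed) approximate posterior
        theta = {k: post_mean[k] + post_std[k] * rng.normal(size=post_mean[k].shape)
                 for k in post_mean}
        preds.append(forward(x, theta))
    preds = np.stack(preds)                 # (n_samples, N, 1)
    return preds.mean(axis=0), preds.std(axis=0)

x_test = np.linspace(-2, 2, 5).reshape(-1, 1)
mean, std = predict(x_test)
print(np.hstack([x_test, mean, std]))      # per-input predictive mean and spread
```

The spread of the sampled predictions reflects the uncertainty over the parameters; averaging over samples approximates the integral in the posterior predictive distribution above.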