26
FUNDAMENTALS
the true class. This measure penalizes sharp probabilities more heavily when
over- or under-confidence places them close to the wrong edge or class.
\[
\ell_{\mathrm{NLL}}(f) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \cdot \log\left(f_k(x_i)\right) \tag{2.10}
\]
• Brier Score [50] is a scoring rule that measures the accuracy of a
probabilistic classifier and is related to the mean-squared error (MSE) loss
function. Brier score is more commonly used in industrial practice since it
is an ℓ2 metric (score between 0 and 1), yet it penalizes tail probabilities
less severely than NLL.
\[
\ell_{\mathrm{BS}}(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\mathbb{I}[y_i = k] - f_k(x_i)\right)^2 \tag{2.11}
\]
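As an illustration of Equations (2.10) and (2.11), the following is a minimal sketch in plain Python, not a reference implementation. The function names `nll` and `brier_score`, the 0-based class labels, and the `eps` clipping constant are assumptions introduced for this example; each row of `probs` is assumed to be the probability vector f(x_i) over K classes.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, following Eq. (2.10).

    The indicator I[y_i = k] selects only the probability of the true
    class, so the inner sum over k reduces to a single log term.
    `eps` (an assumption of this sketch) avoids log(0).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))
    return total / len(labels)


def brier_score(probs, labels):
    """Multi-class Brier score, following Eq. (2.11).

    Sums the squared difference between the one-hot label and the
    predicted probability over all K classes, averaged over N samples.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((float(k == y) - p_k) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


if __name__ == "__main__":
    labels = [0, 0]
    # Two binary predictions that are increasingly confident in the wrong class.
    mildly_wrong = [[0.10, 0.90], [0.10, 0.90]]
    very_wrong = [[0.001, 0.999], [0.001, 0.999]]
    for name, probs in [("mild", mildly_wrong), ("sharp", very_wrong)]:
        print(name, "NLL:", nll(probs, labels), "BS:", brier_score(probs, labels))
```

Running the script on the two hypothetical inputs shows the contrast described in the text: NLL keeps growing as the probability assigned to the true class approaches zero, whereas the Brier score stays bounded and changes only slightly.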
All metrics that follow require a CSF g(x) to be defined and can pertain to
specific evaluation settings [389], which are tested in Section 3.4.5.
Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate
top-1 prediction miscalibration. A calibration estimator (Definition 7) measures
the Lp norm difference between a model’s posterior and the true likelihood of
being correct.
Definition 7 (Lp Calibration Error). [231, 463]
The Lp calibration error of f : X → ∆Y over the joint distribution (X × Y )
with the Lp norm p ∈ [1, ∞) is given by:


\[
\mathrm{CE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[ \left\lVert \mathbb{E}[Y \mid f(X)] - f(X) \right\rVert_p^p \right] \tag{2.12}
\]
The popular ECE metric [332] with condition I[Y = ŷ] is a special case of the
above with p = 1, where the expectation is approximated using a histogram.
MaxCE defines the worst-case risk version with p = ∞, effectively reporting on
the bin with the highest error. As part of Chapter 5, we contributed a novel
empirical estimator of top-1 calibration for the task of VQA, where the exact
accuracy condition I[Y = ŷ] in ECE is replaced by I[ANLS(y, ŷ) > τ ]. Prior
work [329] used a similar strategy of thresholding continuous quality scores to
be able to estimate ECE.
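The special cases above (histogram-based ECE, MaxCE as the worst bin, and a thresholded correctness indicator) can be sketched as follows. This is a hedged illustration and not the estimator contributed in Chapter 5: the function names, the equal-width binning with left-closed intervals, the default `n_bins=10`, and the default `tau=0.5` are all assumptions of this example. The quality scores passed to `thresholded_correctness` are assumed to be precomputed (for instance ANLS values); the sketch does not compute ANLS itself.

```python
def thresholded_correctness(scores, tau=0.5):
    """Turn continuous quality scores into 0/1 correctness via I[score > tau].

    `tau` is a hypothetical default; the scores are assumed to be given.
    """
    return [1 if s > tau else 0 for s in scores]


def calibration_errors(confidences, correct, n_bins=10):
    """Histogram estimates of top-1 calibration error.

    Returns (ECE, MaxCE): the sample-weighted average and the maximum of
    the per-bin gap |accuracy - mean confidence|. Bins are equal-width
    over [0, 1]; a confidence of exactly 1.0 is placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))

    ece, max_ce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += (len(members) / n) * gap
        max_ce = max(max_ce, gap)
    return ece, max_ce


if __name__ == "__main__":
    # Hypothetical top-1 confidences and quality scores for six predictions.
    confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
    quality = [1.0, 0.8, 0.7, 0.2, 0.9, 0.3]
    correct = thresholded_correctness(quality, tau=0.5)
    print(calibration_errors(confidences, correct))
```

Passing exact-match indicators as `correct` gives the standard ECE setting, while passing thresholded quality scores gives the relaxed variant; the binning logic is the same in both cases.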
In practice, ECE is implemented as a histogram binning estimator that
discretizes predicted probabilities into ranges of possible values for which
conditional expectation can be estimated. Concretely, the probability space
is partitioned into B bins bi with i ∈ {1, ..., B}, where for each bin bi the gap
between observed accuracy and bin confidence P̄_b is measured, with a final