the true class. This measure more heavily penalizes sharp probabilities that are pushed towards the wrong extreme or class through over- or under-confidence.
\[
\ell_{\mathrm{NLL}}(f) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{I}[y_i = k] \cdot \log\big(f_k(x_i)\big) \tag{2.10}
\]
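For concreteness, the snippet below is a minimal NumPy sketch of the estimator in Eq. (2.10); the array names probs (the $N \times K$ matrix of predicted probabilities $f_k(x_i)$) and labels (the integer ground-truth classes $y_i$) are illustrative assumptions, not notation used elsewhere in this text.

\begin{verbatim}
import numpy as np

def nll(probs, labels, eps=1e-12):
    """Average negative log-likelihood, Eq. (2.10).

    probs  : (N, K) array of predicted class probabilities f_k(x_i)
    labels : (N,)   array of integer ground-truth classes y_i
    """
    # Probability assigned by the model to the true class of each sample.
    true_class_probs = probs[np.arange(len(labels)), labels]
    # eps guards against log(0) for fully confident wrong predictions.
    return float(-np.mean(np.log(true_class_probs + eps)))
\end{verbatim}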
• Brier Score [50] is a scoring rule that measures the accuracy of a probabilistic classifier and is related to the mean-squared error (MSE) loss function. The Brier score is more commonly used in industrial practice since it is an $\ell_2$ metric (score between 0 and 1), yet it penalizes tail probabilities less severely than NLL.
\[
\ell_{\mathrm{BS}}(f) = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \big(\mathbb{I}[y_i = k] - f_k(x_i)\big)^2 \tag{2.11}
\]
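Analogously, the following is a minimal sketch of Eq. (2.11) under the same assumed probs/labels convention; the explicit one-hot encoding makes the relation to the MSE loss visible.

\begin{verbatim}
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score, Eq. (2.11): squared error between the
    one-hot encoded label and the predicted probability vector,
    averaged over the N samples."""
    n, k = probs.shape
    one_hot = np.zeros((n, k))
    one_hot[np.arange(n), labels] = 1.0
    return float(np.mean(np.sum((one_hot - probs) ** 2, axis=1)))
\end{verbatim}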
All of the following metrics require a CSF $g(x)$ to be defined, and can pertain to specific evaluation settings [389] tested in Section 3.4.5.
Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate top-1 prediction miscalibration. A calibration estimator (Definition 7) measures the $L_p$ norm difference between a model's posterior and the true likelihood of being correct.
Definition 7 ($L_p$ Calibration Error). [231, 463]
The $L_p$ calibration error of $f : X \to \Delta^Y$ over the joint distribution $(X \times Y)$ with the $L_p$ norm, $p \in [1, \infty)$, is given by:
\[
\mathrm{CE}_p(f)^p = \mathbb{E}_{(X,Y)}\big[\, \lVert \mathbb{E}[Y \mid f(X)] - f(X) \rVert_p^p \,\big] \tag{2.12}
\]
The popular ECE metric [332] with condition $\mathbb{I}[Y = \hat{y}]$ is a special case of the above with $p = 1$, where the expectation is approximated using a histogram. MaxCE defines the worst-case risk version with $p = \infty$, effectively reporting the bin with the highest error. As part of Chapter 5, we contributed a novel empirical estimator of top-1 calibration for the task of VQA, where the exact accuracy condition $\mathbb{I}[Y = \hat{y}]$ in ECE is replaced by $\mathbb{I}[\mathrm{ANLS}(y, \hat{y}) > \tau]$. Prior work [329] used a similar strategy of thresholding continuous quality scores to be able to estimate ECE.
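The next paragraph describes the standard histogram-binning estimator; a minimal NumPy sketch under the assumption of $B$ equal-width confidence bins is given below. The names confidences and correct are illustrative; passing a thresholded quality indicator such as $\mathbb{I}[\mathrm{ANLS}(y, \hat{y}) > \tau]$ as correct, instead of exact-match accuracy, yields the VQA variant discussed above.

\begin{verbatim}
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Histogram-binned estimator of top-1 ECE (p = 1).

    confidences : (N,) top-1 confidence scores g(x_i) in [0, 1]
    correct     : (N,) binary correctness indicators per sample
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to one of the B equal-width bins (digitizing
    # against the interior edges puts confidence 1.0 in the last bin).
    bin_ids = np.digitize(confidences, edges[1:-1])
    error = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        # Gap between observed accuracy and average confidence in the bin,
        # weighted by the fraction of samples that fall into the bin.
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        error += in_bin.mean() * gap
    return float(error)
\end{verbatim}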
In practice, ECE is implemented as a histogram binning estimator that discretizes predicted probabilities into ranges of possible values for which the conditional expectation can be estimated. Concretely, the probability space is partitioned into $B$ bins $b_i$ with $i \in \{1, \ldots, B\}$, where for each bin $b_i$ the gap between observed accuracy and bin confidence $\bar{P}_b$ is measured, with a final