the true class. This measure more heavily penalizes sharp probabilities, which are close to the wrong edge or class by over/under-confidence.

$$\ell_{\mathrm{NLL}}(f) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \cdot \log\left(f_k(x_i)\right) \tag{2.10}$$

• Brier Score [50] is a scoring rule that measures the accuracy of a probabilistic classifier and is related to the mean-squared error (MSE) loss function. Brier score is more commonly used in industrial practice since it is an $\ell_2$ metric (score between 0 and 1), yet it penalizes tail probabilities less severely than NLL.

$$\ell_{\mathrm{BS}}(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\mathbb{I}[y_i = k] - f_k(x_i)\right)^2 \tag{2.11}$$

All metrics following require a CSF $g(x)$ to be defined, and can pertain to specific evaluation settings [389] tested in Section 3.4.5.

Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate top-1 prediction miscalibration. A calibration estimator (Definition 7) measures the $L_p$ norm difference between a model's posterior and the true likelihood of being correct.

Definition 7 ($L_p$ Calibration Error). [231, 463] The $L_p$ calibration error of $f : X \to \Delta Y$ over the joint distribution $(X \times Y)$ with the $L_p$ norm $p \in [1, \infty)$ is given by:

$$\mathrm{CE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[\, \lVert \mathbb{E}[Y \mid f(X)] - f(X) \rVert_p^p \,\right] \tag{2.12}$$

The popular ECE metric [332] with condition $\mathbb{I}[Y = \hat{y}]$ is a special case of the above with $p = 1$, where the expectation is approximated using a histogram. MaxCE defines the worst-case risk version with $p = \infty$, effectively reporting on the bin with the highest error. As part of Chapter 5, we contributed a novel empirical estimator of top-1 calibration for the task of VQA, where the exact accuracy condition $\mathbb{I}[Y = \hat{y}]$ in ECE is replaced by $\mathbb{I}[\mathrm{ANLS}(y, \hat{y}) > \tau]$. Prior work [329] used a similar strategy of thresholding continuous quality scores to be able to estimate ECE.

In practice, ECE is implemented as a histogram binning estimator that discretizes predicted probabilities into ranges of possible values for which conditional expectation can be estimated.
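As an illustration, the scoring rules in Eqs. (2.10) and (2.11) and a histogram estimate of top-1 ECE and MaxCE can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names `nll`, `brier_score` and `binned_calibration_errors` are hypothetical, bins are assumed to be equal-width over [0, 1], each bin's gap is assumed to be weighted by its share of samples (the common choice for ECE), and the `eps` clipping is an added numerical safeguard.

```python
import numpy as np


def nll(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    """Eq. (2.10): mean negative log-probability assigned to the true class."""
    n = labels.shape[0]
    true_class_probs = probs[np.arange(n), labels]
    # eps guards against log(0); it is an assumption, not part of Eq. (2.10).
    return float(-np.mean(np.log(np.clip(true_class_probs, eps, 1.0))))


def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Eq. (2.11): squared error against the one-hot label, summed over classes."""
    num_classes = probs.shape[1]
    one_hot = np.eye(num_classes)[labels]
    return float(np.mean(np.sum((one_hot - probs) ** 2, axis=1)))


def binned_calibration_errors(confidences, correct, num_bins: int = 10):
    """Histogram estimate of top-1 ECE (p = 1) and MaxCE (worst bin)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assumption: equal-width bins over [0, 1].
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Digitizing against the interior edges yields bin ids 0 .. num_bins - 1;
    # a confidence of exactly 1.0 falls into the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece, max_ce = 0.0, 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
        max_ce = max(max_ce, gap)
    return float(ece), float(max_ce)


# Usage on a toy batch of 4 samples and 3 classes (made-up numbers).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])
confidences = probs.max(axis=1)             # top-1 confidence
correct = probs.argmax(axis=1) == labels    # condition I[Y = y_hat]
print(nll(probs, labels), brier_score(probs, labels))
print(binned_calibration_errors(confidences, correct, num_bins=2))
```

Replacing `correct` with a thresholded quality score, such as the ANLS condition mentioned above, leaves the binning logic unchanged; only the per-sample correctness indicator differs.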
Concretely, the probability space is partitioned into $B$ bins $b_i$ with $i \in \{1, \ldots, B\}$, where for each bin $b_i$ the gap between observed accuracy and bin confidence $\bar{P}_b$ is measured, with a final