The standard curve metric can be obtained by sorting all CSF estimates and evaluating risk (FP / (TP + FP)) and coverage ((TP + FP) / (TP + FP + FN + TN)) for each threshold t (P if above threshold) from high to low, together with their respective correctness (T if correct). This is normally based on exact match, yet for generative evaluation in Section 5.3.5, we have applied ANLS thresholding instead. Formulated this way, the best possible AURC is constrained by the model’s test error (1-ANLS) and the number of test instances. AURC might be more sensible for evaluating in a high-accuracy regime (e.g., 95% accuracy), where risk can be better controlled and error tolerance is an a priori system-level decision [115]. This metric was used in every chapter of Part II.
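As a minimal illustrative sketch (not the exact implementation used in this thesis; the function names and the averaging of risk over all cut-offs are assumptions), the risk-coverage curve and its AURC can be computed from per-instance confidence scores and correctness indicators as follows:

import numpy as np

def risk_coverage_curve(confidences, correct):
    # Sort instances by CSF estimate from most to least confident and sweep
    # the threshold over every possible cut-off point.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    covered = np.arange(1, n + 1)          # TP + FP at each cut-off
    errors = np.cumsum(1.0 - correct)      # FP at each cut-off
    risk = errors / covered                # FP / (TP + FP)
    coverage = covered / n                 # (TP + FP) / (TP + FP + FN + TN)
    return coverage, risk

def aurc(confidences, correct):
    # Average risk over all coverage levels (lower is better); the best
    # achievable value is bounded by the model's overall test error.
    _, risk = risk_coverage_curve(confidences, correct)
    return float(np.mean(risk))

Here, correct is the binary correctness signal, e.g. exact match or, for generative evaluation, an indicator that the ANLS score exceeds the chosen threshold.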
For the evaluation under distribution shift in Chapter 3, we have used binary classification metrics following [172]: the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPR), which are threshold-independent measures that summarize detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability that a randomly chosen out-of-distribution sample is assigned a higher confidence score than a randomly chosen in-distribution sample. AUPR is more informative under class imbalance.
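As a sketch of how these metrics are computed in practice (assuming out-of-distribution instances are labelled as the positive class, that the per-instance detection score is oriented as described above, and using scikit-learn's average precision as the AUPR estimate), both are directly available in scikit-learn:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical toy data: 1 = out-of-distribution (positive), 0 = in-distribution.
labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.5, 0.2])   # detection scores per instance

auroc = roc_auc_score(labels, scores)            # threshold-independent ranking quality
aupr = average_precision_score(labels, scores)   # more informative under class imbalance
print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}")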
2.2.4 Calibration
The study of calibration originated in the meteorology and statistics literature, primarily in the context of proper loss functions [330] for evaluating probabilistic forecasts. Calibration promises i) interpretability, ii) system integration, iii) active learning, and iv) improved accuracy. A calibrated model, as defined in Definition 8, can be interpreted as a probabilistic model, which can be integrated into a larger system, and can guide active learning with potentially fewer samples. Research into calibration regained popularity after repeated empirical observations of overconfidence in DNNs [156, 339].
Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of an empirical predictor f, which states that on finite-sample data it converges to a solution where the confidence scoring function reflects the probability ρ of being correct. Perfect calibration, CE(f) = 0, is satisfied iff:

P(Y = Ŷ | f(X) = ρ) = ρ,   ∀ρ ∈ [0, 1]   (2.15)
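Since Equation (2.15) cannot be verified exactly from a finite sample, calibration error is typically approximated with a binned estimator. The sketch below (an illustrative assumption, not the specific estimator analysed later) computes the common equal-width-bin expected calibration error:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Weighted average of |empirical accuracy - mean confidence| over
    # equal-width confidence bins, a finite-sample proxy for CE(f).
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences > lo) & (confidences <= hi)
        if i == 0:                          # include exact zeros in the first bin
            in_bin |= confidences == 0.0
        if not np.any(in_bin):
            continue
        weight = in_bin.mean()              # fraction of samples falling in the bin
        acc = correct[in_bin].mean()        # empirical P(Y = Ŷ) within the bin
        conf = confidences[in_bin].mean()   # average predicted confidence ρ
        ece += weight * abs(acc - conf)
    return ece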
Below, we characterize calibration research in two directions: (A) CSF evaluation with both theoretical guarantees and practical estimation methodologies
• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]