• Theoretical frameworks that generalize over existing metrics and support the design of novel metrics [43, 231, 492, 493]

• Specializations to a particular task, such as multi-class classification [463], regression [228, 428], or structured prediction [227]

• Alternative error estimation procedures, based on histogram regression [156, 331, 332, 340, 343], kernels [230, 370, 492, 493], or splines [159]

(B) Calibration methods for improving the reliability of a model by adapting the CSF or by inducing calibration during training of $f$:

• Learn a post-hoc forecaster $F \colon f(X) \to [0, 1]$ on top of $f$ (overview: [298]; see the temperature-scaling sketch at the end of this section)

• Modify the training procedure with regularization (overview: [277, 370])

Due to its importance in practice, we provide more detail on train-time calibration methods. It has been shown for a broad class of loss functions that risk minimization leads to Fisher-consistent, Bayes-optimal classifiers in the asymptotic limit [25, 495]. Such proper loss functions can be shown to decompose into a sum of terms that includes both accuracy and calibration error [144, 177]. However, there is no guarantee, neither with finite data nor in the asymptotic limit, that classifiers trained with proper loss functions containing such a calibration term will eventually be well-calibrated: in practice, the calibration term is entangled with the other optimization terms, which often leads to sub-optimal calibration. For this reason, recent studies [12, 230, 492] have derived trainable estimators of calibration to obtain an explicit handle ($\gamma > 0$) for penalizing miscalibration, i.e., by jointly optimizing the risk $R(f) = \mathbb{E}_{X,Y}[\ell(Y, f(X))]$ and a parameterized calibration error (CE) as in Equation (2.16); a code sketch of this objective is given at the end of this section.

$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \, \big( R(f) + \gamma \, \mathrm{CE}(f) \big) \tag{2.16}$$

Many of these methods implicitly or explicitly maximize the entropy of predictions, or the entropy relative to another probability distribution, e.g., Entropy Regularization [361], Label Smoothing (LS) [327], Focal Loss [324], and Margin-based LS [277] (sketched at the end of this section), alongside more direct (differentiable), kernel-based calibration error estimation [211, 230, 370, 492, 493, 526]. We had expected community contributions to the DUDE competition (Chapter 5) to take advantage of this wealth of calibration methods, yet the majority of submissions used uncalibrated models with MSP, which indicates a need for more education on the importance of calibration in practice.
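As an illustration of a post-hoc forecaster $F$, the sketch below implements temperature scaling, one of the simplest and most widely used instances: a single scalar $T > 0$ rescales the logits, fitted by minimizing the negative log-likelihood on held-out data while $f$ itself stays frozen. This is a minimal sketch, assuming PyTorch and pre-computed, detached validation tensors `logits` and `labels`; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Fit a scalar temperature T > 0 by minimizing NLL on a held-out set."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().detach()

# Usage: the recalibrated confidence is softmax(logits / T).max(dim=-1).
```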
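The joint objective of Equation (2.16) can be made concrete with a differentiable calibration penalty. The sketch below combines the empirical risk (cross-entropy) with a kernel-based calibration term over top-label confidences, in the spirit of the MMCE-style estimators cited above; the Laplacian kernel, its bandwidth, and `gamma` are illustrative assumptions, not a faithful reproduction of any single cited estimator.

```python
import torch
import torch.nn.functional as F

def kernel_calibration_error(logits, labels, bandwidth=0.4):
    """Squared kernel-based calibration error estimate over a batch."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)         # c_i: top-label confidence
    correct = (pred == labels).float()     # a_i = 1[prediction is correct]
    resid = conf - correct                 # calibration residual (c_i - a_i)
    # Laplacian kernel k(c_i, c_j) = exp(-|c_i - c_j| / bandwidth)
    k = torch.exp(-torch.cdist(conf[:, None], conf[:, None], p=1) / bandwidth)
    m = conf.numel()
    return (resid[:, None] * resid[None, :] * k).sum() / (m * m)

def joint_loss(logits, labels, gamma=1.0):
    risk = F.cross_entropy(logits, labels)         # R(f), empirical risk
    ce = kernel_calibration_error(logits, labels)  # CE(f), trainable estimate
    return risk + gamma * ce                       # Equation (2.16)
```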
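Finally, the entropy-based regularizers mentioned above amount to small modifications of the standard cross-entropy loss. The sketch below shows a confidence penalty in the style of Entropy Regularization [361], where low-entropy (overconfident) predictions are penalized; `beta` is an illustrative hyperparameter. Label Smoothing [327] has a comparable entropy-raising effect and is available directly via the `label_smoothing` argument of PyTorch's cross-entropy in recent versions.

```python
import torch.nn.functional as F

def entropy_regularized_loss(logits, labels, beta=0.1):
    """Cross-entropy minus beta * H(p): raising prediction entropy
    discourages overconfident outputs."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return F.cross_entropy(logits, labels) - beta * entropy

# Label Smoothing via the built-in argument (recent PyTorch versions):
# loss = F.cross_entropy(logits, labels, label_smoothing=0.1)
```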