• Theoretical frameworks that generalize over existing metrics and support the design of novel metrics [43, 231, 492, 493]

• Specializations to a particular task, such as multi-class classification [463], regression [228, 428], or structured prediction [227]

• Alternative error estimation procedures, based on histogram regression [156, 331, 332, 340, 343], kernels [230, 370, 492, 493], or splines [159]

(B) Calibration methods for improving the reliability of a model by adapting the CSF or by inducing calibration during training of $f$:

• Learn a post-hoc forecaster $F \colon f(X) \to [0, 1]$ on top of $f$ (overview: [298]; see the temperature-scaling sketch at the end of this section)

• Modify the training procedure with regularization (overview: [277, 370])

Due to its importance in practice, we provide more detail on train-time calibration methods. It has been shown for a broad class of loss functions that risk minimization leads to Fisher-consistent, Bayes-optimal classifiers in the asymptotic limit [25, 495]. Such proper loss functions can be shown to decompose into a sum of terms that includes both accuracy and calibration error [144, 177]. However, there is no guarantee, neither with finite data nor in the asymptotic limit, that classifiers trained with proper loss functions containing such a calibration term will eventually be well-calibrated: in practice, the calibration term is entangled with the other optimization terms, which often leads to sub-optimal calibration. For this reason, recent studies [12, 230, 492] have derived trainable estimators of calibration to obtain an explicit handle ($\gamma > 0$) for penalizing miscalibration, i.e., by jointly optimizing the risk $R(f) = \mathbb{E}_{X,Y}[\ell(Y, f(X))]$ and a parameterized calibration error (CE) as in Equation (2.16); a code sketch of this objective is given at the end of this section.

$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \, \big( R(f) + \gamma \, \mathrm{CE}(f) \big) \tag{2.16}$$

Many of these methods implicitly or explicitly maximize the entropy of predictions, or the entropy relative to another probability distribution, e.g., Entropy Regularization [361], Label Smoothing (LS) [327], Focal Loss [324], and Margin-based LS [277] (sketched at the end of this section), alongside more direct (differentiable), kernel-based calibration error estimation [211, 230, 370, 492, 493, 526]. We had expected community contributions to the DUDE competition (Chapter 5) to take advantage of this wealth of calibration methods, yet the majority of submissions used uncalibrated models with MSP, which indicates a need for more education on the importance of calibration in practice.
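As an illustration of a post-hoc forecaster $F$, the sketch below implements temperature scaling, one of the simplest and most widely used instances: a single scalar $T > 0$ rescales the logits, fitted by minimizing the negative log-likelihood on held-out data while $f$ itself stays frozen. This is a minimal sketch, assuming PyTorch and pre-computed, detached validation tensors `logits` and `labels`; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Fit a scalar temperature T > 0 by minimizing NLL on a held-out set."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().detach()

# Usage: the recalibrated confidence is softmax(logits / T).max(dim=-1).
```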
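The joint objective of Equation (2.16) can be made concrete with a differentiable calibration penalty. The sketch below combines the empirical risk (cross-entropy) with a kernel-based calibration term over top-label confidences, in the spirit of the MMCE-style estimators cited above; the Laplacian kernel, its bandwidth, and `gamma` are illustrative assumptions, not a faithful reproduction of any single cited estimator.

```python
import torch
import torch.nn.functional as F

def kernel_calibration_error(logits, labels, bandwidth=0.4):
    """Squared kernel-based calibration error estimate over a batch."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)         # c_i: top-label confidence
    correct = (pred == labels).float()     # a_i = 1[prediction is correct]
    resid = conf - correct                 # calibration residual (c_i - a_i)
    # Laplacian kernel k(c_i, c_j) = exp(-|c_i - c_j| / bandwidth)
    k = torch.exp(-torch.cdist(conf[:, None], conf[:, None], p=1) / bandwidth)
    m = conf.numel()
    return (resid[:, None] * resid[None, :] * k).sum() / (m * m)

def joint_loss(logits, labels, gamma=1.0):
    risk = F.cross_entropy(logits, labels)         # R(f), empirical risk
    ce = kernel_calibration_error(logits, labels)  # CE(f), trainable estimate
    return risk + gamma * ce                       # Equation (2.16)
```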
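Finally, the entropy-based regularizers mentioned above amount to small modifications of the standard cross-entropy loss. The sketch below shows a confidence penalty in the style of Entropy Regularization [361], where low-entropy (overconfident) predictions are penalized; `beta` is an illustrative hyperparameter. Label Smoothing [327] has a comparable entropy-raising effect and is available directly via the `label_smoothing` argument of PyTorch's cross-entropy in recent versions.

```python
import torch.nn.functional as F

def entropy_regularized_loss(logits, labels, beta=0.1):
    """Cross-entropy minus beta * H(p): raising prediction entropy
    discourages overconfident outputs."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return F.cross_entropy(logits, labels) - beta * entropy

# Label Smoothing via the built-in argument (recent PyTorch versions):
# loss = F.cross_entropy(logits, labels, label_smoothing=0.1)
```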