average weighted by the number of samples per bin $|b_i|$.

$$
\mathrm{ECE} = \sum_{i=1}^{B} \frac{|b_i|}{N} \left| \mathrm{acc}(b_i) - \bar{P}_b(b_i) \right| \tag{2.13}
$$

To minimize the drawbacks inherited from histogram binning, as suggested by the literature [231, 342, 393, 463], we have applied an equal-mass binning scheme with 100 bins (close to $\sqrt{N}$). While plenty of histogram-based ECE estimator implementations exist, many design hyperparameters are not reported or exposed:

I. $\ell_p$ norm
II. The number of bins (beyond the unfounded default of $|B| = 15$)
III. Different binning schemes (equal-range, equal-mass)
IV. Binning range to define the operating zone
V. Proxy used as bin accuracy (e.g., lower-edge, center, upper-edge)

We upstreamed¹ a generic implementation of binning-based ECE as part of the ICDAR 2023 DUDE competition (Chapter 5).

Alternative formulations have been developed for multi-class [342, 370, 492] and multi-label calibration [493, 520]. Measurements of “strong” calibration, over the full predicted vector instead of the winning class, are reported less often in practice. Possible reasons are that they either produce class-wise scores based on adaptive thresholds or require estimating a kernel-based calibration error to derive hypothesis tests. While we are mindful of alternatives (revisited in Section 2.2.4), we have found that the simpler “weak” calibration measured by ECE meets the practical requirements for most of our benchmarking.

Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible trade-offs between coverage (proportion of the test set, in %) and risk (error, in %, at a given coverage). The metric explicitly assesses i.i.d. failure detection performance, as desired for safe deployment. It is well suited as a primary evaluation metric because it remains effective whether the underlying prediction models are the same or different (unlike AUROC or AUPR).
Its most general form (without any curve approximation), with a task-specific evaluation metric $\ell$ and CSF $g$, is defined as:

$$
\mathrm{AURC}(f, g) = \mathbb{E}_{x \sim P(X)} \left[ \frac{\mathbb{E}_{(\tilde{x}, \tilde{y}) \sim P_{XY}} \big[ \ell([f(\tilde{x})], \tilde{y}) \, \mathbb{I}[g(\tilde{x}) > g(x)] \big]}{\mathbb{E}_{\tilde{x} \sim P_X} \big[ \mathbb{I}[g(\tilde{x}) > g(x)] \big]} \right] \tag{2.14}
$$

This captures the intuition that the CSF $g$ should be able to rank instances by their risk, and that the risk should be low for instances with high confidence.

¹ https://huggingface.co/spaces/jordyvl/ece
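As a rough illustration of Eqs. (2.13) and (2.14), the NumPy sketch below estimates both metrics from per-sample confidences. It is not the upstreamed implementation referenced above. The function names `binned_ece` and `empirical_aurc` and their parameters are hypothetical, and several choices are assumptions: the mean confidence per bin serves as $\bar{P}_b(b_i)$, the operating zone for equal-range binning is fixed to $[0, 1]$, the $\ell_p$ variant aggregates as $(\sum_i \frac{|b_i|}{N} |\cdot|^p)^{1/p}$, and AURC is approximated by averaging the selective risk of the top-$k$ most confident samples over all $k$ (ties broken by sort order), rather than by the strict inequality of Eq. (2.14).

```python
import numpy as np


def binned_ece(confidences, correct, n_bins=100, scheme="equal-mass", p=1):
    """Hypothetical helper: histogram-binned ECE in the spirit of Eq. (2.13).

    confidences: winning-class probabilities in [0, 1], shape (N,)
    correct:     1 where the winning class was right, else 0, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size

    if scheme == "equal-mass":
        # Sort by confidence, then cut into bins of (nearly) equal size.
        order = np.argsort(confidences, kind="stable")
        bins = np.array_split(order, n_bins)
    elif scheme == "equal-range":
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes to the last bin.
        bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
        bins = [np.flatnonzero(bin_ids == i) for i in range(n_bins)]
    else:
        raise ValueError(f"unknown binning scheme: {scheme!r}")

    total = 0.0
    for idx in bins:
        if idx.size == 0:
            continue  # empty bins carry zero weight
        # Gap between bin accuracy and mean bin confidence (assumed proxy).
        gap = abs(correct[idx].mean() - confidences[idx].mean())
        total += (idx.size / n) * gap**p
    return float(total ** (1.0 / p))


def empirical_aurc(confidences, losses):
    """Hypothetical helper: top-k empirical estimate of AURC.

    confidences: CSF values g(x), higher means more confident, shape (N,)
    losses:      per-sample task loss, e.g. 0/1 error, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # Rank samples from most to least confident.
    order = np.argsort(-confidences, kind="stable")
    # Selective risk when only the k most confident samples are kept.
    selective_risk = np.cumsum(losses[order]) / np.arange(1, losses.size + 1)
    # Average over all coverage levels k/N.
    return float(selective_risk.mean())
```

On a toy input of four predictions with confidence 0.9, of which three are correct, `binned_ece(..., n_bins=1)` gives 0.15 while `binned_ece(..., n_bins=2)` gives 0.25 under equal-mass binning, which illustrates why the number of bins and the binning scheme should be reported. Similarly, `empirical_aurc` is lower when the errors sit at the low-confidence end of the ranking than when they sit at the high-confidence end.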