RELIABILITY AND ROBUSTNESS
average weighted by the number of samples per bin $|b_i|$.

$$\mathrm{ECE} = \sum_{i=1}^{B} \frac{|b_i|}{N} \left| \mathrm{acc}(b_i) - \bar{P}_b(b_i) \right| \tag{2.13}$$
To minimize the drawbacks inherited from histogram binning, as suggested
by the literature [231, 342, 393, 463], we have applied an equal-mass binning
scheme with 100 bins (close to $\sqrt{N}$). While plenty of histogram-based ECE
estimator implementations exist, many design hyperparameters are not reported
or exposed:
I. The $\ell_p$ norm
II. The number of bins (beyond the unfounded default of $|B| = 15$)
III. Different binning schemes (equal-range, equal-mass)
IV. Binning range to define the operating zone
V. Proxy used as bin accuracy (e.g., lower-edge, center, upper-edge)
We upstreamed¹ a generic implementation of binning-based ECE as part of
the ICDAR 2023 DUDE competition (Chapter 5).
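To make Eq. (2.13) and the hyperparameters listed above concrete, the following is a minimal sketch of a histogram-binned ECE estimator. It is not the upstreamed implementation; the function name `binned_ece` and its parameters (`n_bins`, `scheme`, `p`) are hypothetical names chosen for illustration. It assumes top-1 confidences in $[0, 1]$, uses the mean confidence per bin as $\bar{P}_b(b_i)$, and generalizes the absolute gap to an $\ell_p$ norm in the common way (weighted $p$-th powers followed by a $p$-th root).

```python
import numpy as np


def binned_ece(confidences, correct, n_bins=15, scheme="equal-mass", p=1):
    """Histogram-binned ECE over top-1 confidences (illustrative sketch)."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    n = conf.size

    if scheme == "equal-mass":
        # Sort by confidence, then split into bins of (nearly) equal sample count.
        order = np.argsort(conf, kind="stable")
        bins = np.array_split(order, n_bins)
    elif scheme == "equal-range":
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes to the last bin.
        idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
        bins = [np.flatnonzero(idx == b) for b in range(n_bins)]
    else:
        raise ValueError(f"unknown binning scheme: {scheme!r}")

    total = 0.0
    for members in bins:
        if members.size == 0:
            continue  # empty bins carry zero weight
        # |acc(b_i) - mean confidence of b_i|, weighted by |b_i| / N
        gap = abs(hit[members].mean() - conf[members].mean())
        total += (members.size / n) * gap ** p
    return total ** (1.0 / p)


# Two equal-mass bins: gaps of 0.1 and 0.2, each holding half of the samples.
example = binned_ece([0.6, 0.6, 0.8, 0.8], [1, 0, 1, 1], n_bins=2)
```

Under these assumptions, a bin count close to $\sqrt{N}$ could be passed as `n_bins=int(np.sqrt(len(confidences)))`. The binning range (item IV) and alternative bin proxies (item V) are deliberately left out to keep the sketch short.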
Alternative formulations have been developed for multi-class [342, 370, 492]
and multi-label calibration [493, 520]. Measurements of “strong” calibration,
over the full predicted vector instead of the winning class, are reported less
often in practice. Possible reasons are that they either yield class-wise
scores based on adaptive thresholds or require estimating a kernel-based
calibration error to derive hypothesis tests. While we are mindful of
alternatives (revisited in Section 2.2.4), we have found that the simpler
“weak” calibration measured by ECE meets the practical requirements for most
of our benchmarking.
Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible
trade-offs between coverage (the percentage of the test set retained) and risk
(the error percentage at a given coverage). The metric explicitly assesses
i.i.d. failure detection performance as desired for safe deployment. It has
advantages as a primary evaluation metric given that it is effective whether
the underlying prediction models are the same or different (as opposed to
AUROC or AUPR). Its most general form (without any curve approximation), with
a task-specific evaluation metric $\ell$ and CSF $g$, is defined as:
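$$\mathrm{AURC}(f, g) = \mathbb{E}_{x \sim P(X)} \left[ \frac{\mathbb{E}_{(\tilde{x}, \tilde{y}) \sim P_{XY}} \left[ \ell([f(\tilde{x})], \tilde{y}) \, \mathbb{I}[g(\tilde{x}) > g(x)] \right]}{\mathbb{E}_{\tilde{x} \sim P_X} \left[ \mathbb{I}[g(\tilde{x}) > g(x)] \right]} \right] \tag{2.14}$$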
This captures the intuition that the CSF g should be able to rank instances by
their risk, and that the risk should be low for instances with high confidence.
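As an illustration of this ranking intuition, here is a minimal sketch of a plug-in AURC estimate on a finite test set. The function name `empirical_aurc` is hypothetical. It assumes that per-sample losses $\ell$ and confidence scores $g(x)$ are already computed, and it follows a common empirical convention in which each threshold includes the sample itself (so $\geq$ rather than the strict $>$ of Eq. (2.14)), which avoids an empty denominator for the most confident sample. Ties are broken by input order.

```python
import numpy as np


def empirical_aurc(confidences, losses):
    """Plug-in AURC estimate: mean selective risk over all coverage levels."""
    conf = np.asarray(confidences, dtype=float)
    loss = np.asarray(losses, dtype=float)

    # Rank samples from most to least confident (stable, so ties keep input order).
    order = np.argsort(-conf, kind="stable")
    sorted_loss = loss[order]
    n = sorted_loss.size

    # Selective risk at coverage k/n: mean loss over the k most confident samples.
    risks = np.cumsum(sorted_loss) / np.arange(1, n + 1)

    # Average the risk over the n coverage levels 1/n, 2/n, ..., 1.
    return float(risks.mean())


# Errors (loss 1) sit at the low-confidence end, so the ranking is informative.
good_ranking = empirical_aurc([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
# Same losses, but the errors now receive the highest confidence.
bad_ranking = empirical_aurc([0.6, 0.7, 0.8, 0.9], [0, 0, 1, 1])
```

With identical losses, the estimate is lower when the errors are assigned low confidence (`good_ranking`) than when they are assigned high confidence (`bad_ranking`), which is the behavior the metric is meant to reward.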
¹ https://huggingface.co/spaces/jordyvl/ece