RELIABILITY AND ROBUSTNESS
average weighted by the number of samples per bin $|b_i|$:

\mathrm{ECE} = \sum_{i=1}^{B} \frac{|b_i|}{N} \left| \mathrm{acc}(b_i) - \bar{P}_b(b_i) \right|        (2.13)
To minimize the drawbacks inherited from histogram binning, as suggested by the literature [231, 342, 393, 463], we have applied an equal-mass binning scheme with 100 bins (close to $\sqrt{N}$). While plenty of histogram-based ECE estimator implementations exist, many design hyperparameters are not reported or exposed (see the sketch after this list):
I. The $\ell_p$ norm
II. The number of bins (beyond the unfounded default of $|B| = 15$)
III. Different binning schemes (equal-range, equal-mass)
IV. The binning range to define the operating zone
V. The proxy used as bin accuracy (e.g., lower-edge, center, upper-edge)
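To make these hyperparameters concrete, the following is a minimal NumPy sketch of a binning-based ECE estimator in the spirit of Equation 2.13. It is an illustration, not the upstreamed implementation: the function name, its defaults, and the use of mean confidence as the bin proxy are assumptions.

import numpy as np

def binned_ece(confidences, correct, n_bins=100, scheme="equal-mass", p=1):
    """Sketch of a binning-based l_p ECE estimator (Equation 2.13 for p=1)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = confidences.shape[0]

    if scheme == "equal-mass":
        # Edges at confidence quantiles, so each bin holds roughly N / n_bins samples.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    else:  # "equal-range"
        edges = np.linspace(0.0, 1.0, n_bins + 1)

    # Assign samples to bins; clip so the maximum confidence falls in the last bin.
    bin_idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                      0, n_bins - 1)

    total = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        acc = correct[mask].mean()  # acc(b_i)
        # Mean confidence as the bin proxy for \bar{P}_b(b_i); hyperparameter V
        # lists lower-edge, center, or upper-edge of the bin as alternatives.
        conf = confidences[mask].mean()
        total += (mask.sum() / N) * abs(acc - conf) ** p
    return total ** (1.0 / p)

With scheme="equal-mass" and n_bins close to $\sqrt{N}$, this mirrors the configuration described above; equal-range binning with n_bins=15 reproduces the common default.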
We upstreamed¹ a generic implementation of binning-based ECE as part of the ICDAR 2023 DUDE competition (Chapter 5).
Alternative formulations have been developed for multi-class [342, 370, 492] and multi-label calibration [493, 520]. Measurements of “strong” calibration, computed over the full predicted probability vector instead of only the winning class, are less often reported in practice. Possible reasons are that they yield class-wise scores based on adaptive thresholds, or require estimating a kernel-based calibration error to derive hypothesis tests. While we are mindful of these alternatives (revisited in Section 2.2.4), we have found that the simpler “weak” calibration measured by ECE meets the practical requirements for most of our benchmarking.
Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible trade-offs between coverage (the proportion of the test set retained) and risk (the error rate at a given coverage). The metric explicitly assesses i.i.d. failure detection performance, as desired for safe deployment. It has advantages as a primary evaluation metric, since it remains effective whether the underlying prediction models are the same or different (as opposed to AUROC or AUPR). Its most general form (without any curve approximation), with a task-specific evaluation metric $\ell$ and CSF $g$, is defined as:

\mathrm{AURC}(f, g) = \mathbb{E}_{x \sim P_X} \left[ \frac{\mathbb{E}_{(\tilde{x}, \tilde{y}) \sim P_{XY}} \left[ \ell(f(\tilde{x}), \tilde{y}) \, \mathbb{I}[g(\tilde{x}) > g(x)] \right]}{\mathbb{E}_{\tilde{x} \sim P_X} \left[ \mathbb{I}[g(\tilde{x}) > g(x)] \right]} \right]        (2.14)
This captures the intuition that the CSF g should be able to rank instances by
their risk, and that the risk should be low for instances with high confidence.
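In practice, Equation 2.14 is typically computed on a finite test set by ranking samples by their CSF value and averaging the selective risk over all coverage levels. Below is a minimal sketch of this discrete approximation; the function name and the 0/1-loss usage example are illustrative assumptions, and ties in $g$ are ignored for simplicity.

import numpy as np

def empirical_aurc(losses, confidences):
    """Discrete approximation of AURC: mean selective risk over coverages 1/N..1."""
    losses = np.asarray(losses, dtype=float)
    # Rank samples from most to least confident according to the CSF g.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    sorted_losses = losses[order]
    # Risk at coverage k/N = mean loss over the k most confident samples.
    coverages = np.arange(1, losses.shape[0] + 1)
    risks = np.cumsum(sorted_losses) / coverages
    return risks.mean()

# Illustrative usage with a 0/1 task loss and max-softmax as the CSF:
# losses = (y_pred != y_true).astype(float)
# aurc = empirical_aurc(losses, probs.max(axis=1))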
¹ https://huggingface.co/spaces/jordyvl/ece