RELIABILITY AND ROBUSTNESS
average weighted by the number of samples per bin $|b_i|$:

\mathrm{ECE} = \sum_{i=1}^{B} \frac{|b_i|}{N} \left| \mathrm{acc}(b_i) - \bar{P}_b(b_i) \right|        (2.13)
To minimize the drawbacks inherited from histogram binning, as suggested by the literature [231, 342, 393, 463], we have applied an equal-mass binning scheme with 100 bins (close to $\sqrt{N}$). While plenty of histogram-based ECE estimator implementations exist, many design hyperparameters are not reported or exposed (see the sketch after this list):
I. The $\ell_p$ norm
II. The number of bins (beyond the unfounded default of $|B| = 15$)
III. Different binning schemes (equal-range, equal-mass)
IV. The binning range to define the operating zone
V. The proxy used as bin accuracy (e.g., lower-edge, center, upper-edge)
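To make these hyperparameters concrete, the following is a minimal NumPy sketch of a binning-based ECE estimator in the spirit of Equation 2.13. It is an illustration, not the upstreamed implementation: the function name, its defaults, and the use of mean confidence as the bin proxy are assumptions.

import numpy as np

def binned_ece(confidences, correct, n_bins=100, scheme="equal-mass", p=1):
    """Sketch of a binning-based l_p ECE estimator (Equation 2.13 for p=1)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = confidences.shape[0]

    if scheme == "equal-mass":
        # Edges at confidence quantiles, so each bin holds roughly N / n_bins samples.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    else:  # "equal-range"
        edges = np.linspace(0.0, 1.0, n_bins + 1)

    # Assign samples to bins; clip so the maximum confidence falls in the last bin.
    bin_idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                      0, n_bins - 1)

    total = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        acc = correct[mask].mean()  # acc(b_i)
        # Mean confidence as the bin proxy for \bar{P}_b(b_i); hyperparameter V
        # lists lower-edge, center, or upper-edge of the bin as alternatives.
        conf = confidences[mask].mean()
        total += (mask.sum() / N) * abs(acc - conf) ** p
    return total ** (1.0 / p)

With scheme="equal-mass" and n_bins close to $\sqrt{N}$, this mirrors the configuration described above; equal-range binning with n_bins=15 reproduces the common default.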
We upstreamed¹ a generic implementation of binning-based ECE as part of the ICDAR 2023 DUDE competition (Chapter 5).
Alternative formulations have been developed for multi-class [342, 370, 492] and multi-label calibration [493, 520]. Measurements of “strong” calibration, computed over the full predicted probability vector instead of only the winning class, are less often reported in practice. Possible reasons are that they yield class-wise scores based on adaptive thresholds, or require estimating a kernel-based calibration error to derive hypothesis tests. While we are mindful of these alternatives (revisited in Section 2.2.4), we have found that the simpler “weak” calibration measured by ECE meets the practical requirements for most of our benchmarking.
Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible trade-offs between coverage (the proportion of the test set retained) and risk (the error rate at a given coverage). The metric explicitly assesses i.i.d. failure detection performance, as desired for safe deployment. It has advantages as a primary evaluation metric, since it remains effective whether the underlying prediction models are the same or different (as opposed to AUROC or AUPR). Its most general form (without any curve approximation), with a task-specific evaluation metric $\ell$ and CSF $g$, is defined as:

\mathrm{AURC}(f, g) = \mathbb{E}_{x \sim P_X} \left[ \frac{\mathbb{E}_{(\tilde{x}, \tilde{y}) \sim P_{XY}} \left[ \ell(f(\tilde{x}), \tilde{y}) \, \mathbb{I}[g(\tilde{x}) > g(x)] \right]}{\mathbb{E}_{\tilde{x} \sim P_X} \left[ \mathbb{I}[g(\tilde{x}) > g(x)] \right]} \right]        (2.14)
This captures the intuition that the CSF g should be able to rank instances by
their risk, and that the risk should be low for instances with high confidence.
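In practice, Equation 2.14 is typically computed on a finite test set by ranking samples by their CSF value and averaging the selective risk over all coverage levels. Below is a minimal sketch of this discrete approximation; the function name and the 0/1-loss usage example are illustrative assumptions, and ties in $g$ are ignored for simplicity.

import numpy as np

def empirical_aurc(losses, confidences):
    """Discrete approximation of AURC: mean selective risk over coverages 1/N..1."""
    losses = np.asarray(losses, dtype=float)
    # Rank samples from most to least confident according to the CSF g.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    sorted_losses = losses[order]
    # Risk at coverage k/N = mean loss over the k most confident samples.
    coverages = np.arange(1, losses.shape[0] + 1)
    risks = np.cumsum(sorted_losses) / coverages
    return risks.mean()

# Illustrative usage with a 0/1 task loss and max-softmax as the CSF:
# losses = (y_pred != y_true).astype(float)
# aurc = empirical_aurc(losses, probs.max(axis=1))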
¹ https://huggingface.co/spaces/jordyvl/ece