The standard curve metric can be obtained by sorting all CSF estimates from high
to low and, for each threshold t, evaluating risk (FP / (TP + FP)) and coverage
((TP + FP) / (TP + FP + FN + TN)), where an instance counts as positive (P) if its
CSF estimate lies above the threshold and as true (T) if its prediction is correct.
Correctness is normally based on exact match, yet for generative evaluation
in Section 5.3.5 we have applied ANLS thresholding instead. Formulated
this way, the best possible AURC is constrained by the model’s test error
(1-ANLS) and the number of test instances. AURC might be more sensible for
evaluation in a high-accuracy regime (e.g., 95% accuracy), where risk can be
better controlled and error tolerance is an a priori system-level decision [115].
This metric was used in every chapter of Part II.
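As a concrete illustration, the sketch below computes a risk-coverage curve and its
AURC from per-instance confidence scores and correctness labels. It is a minimal
NumPy-based sketch under the formulation above; the function names and the averaging
over all coverage levels are illustrative assumptions, not the thesis code.

import numpy as np

def risk_coverage_curve(confidences, correct):
    """Trace selective risk against coverage by sorting CSF estimates high to low."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    accepted = np.arange(1, n + 1)          # instances above threshold: TP + FP
    errors = np.cumsum(1.0 - correct)       # accepted but incorrect: FP
    risk = errors / accepted                # FP / (TP + FP)
    coverage = accepted / n                 # (TP + FP) / (TP + FP + FN + TN)
    return coverage, risk

def aurc(confidences, correct):
    """Area under the risk-coverage curve, averaged over all coverage levels."""
    _, risk = risk_coverage_curve(confidences, correct)
    return float(np.mean(risk))

# Toy usage: four predictions, only the third (lower-confidence) one is wrong.
print(aurc([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 1]))
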
For the evaluation under distribution shift in Chapter 3, we have used binary
classification metrics following [172], Area Under the Receiver Operating
Characteristic Curve (AUROC) and Area Under the Precision-Recall
Curve (AUPR), which are threshold-independent measures that summarize
detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability
that a randomly chosen out-of-distribution sample is assigned a higher confidence
score than a randomly chosen in-distribution sample. AUPR is more informative
under class imbalance.
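The sketch below shows how these two summary metrics could be computed for a toy
out-of-distribution detection setup. It assumes scikit-learn is available, that
out-of-distribution is treated as the positive class, and that a detection score
such as one minus the maximum softmax probability is used; the exact score and its
sign convention depend on the method and are illustrative here.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels: 1 = out-of-distribution (positive), 0 = in-distribution (negative).
is_ood = np.array([0, 0, 0, 1, 1])
# Illustrative detection scores (e.g., 1 - max softmax probability),
# where higher values are meant to indicate out-of-distribution inputs.
ood_score = np.array([0.10, 0.25, 0.30, 0.80, 0.45])

# AUROC: probability that a random OOD sample scores higher than a random ID sample.
auroc = roc_auc_score(is_ood, ood_score)
# AUPR: threshold-free precision-recall summary, more informative under class imbalance.
aupr = average_precision_score(is_ood, ood_score)
print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}")
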
2.2.4 Calibration
The study of calibration originated in the meteorology and statistics literature,
primarily in the context of proper loss functions [330] for evaluating
probabilistic forecasts. Calibration promises i) interpretability, ii) system
integration, iii) active learning, and iv) improved accuracy. A calibrated model,
as defined in Definition 8, can be interpreted as a probabilistic model, which can
be integrated into a larger system, and can guide active learning with potentially
fewer samples. Research into calibration regained popularity after repeated
empirical observations of overconfidence in DNNs [156, 339].
Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of
an empirical predictor f, which states that on finite-sample data it converges
to a solution where the confidence scoring function reflects the probability ρ of
being correct. Perfect calibration, CE(f) = 0, is satisfied iff:

P(Y = Ŷ | f(X) = ρ) = ρ,  ∀ρ ∈ [0, 1]    (2.15)
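In practice, the calibration error implied by Definition 8 is typically estimated by
binning confidences. The sketch below gives a minimal binned estimator of the expected
calibration error for top-1 confidences, assuming NumPy and equal-width bins; the
function name, bin count, and weighting are illustrative defaults, not a specific
estimator from the cited works.

import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected calibration error: coverage-weighted gap between mean confidence
    and empirical accuracy within each equal-width confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += (mask.sum() / total) * gap
    return err

# A perfectly calibrated predictor would yield an estimate close to 0.
print(ece([0.9, 0.8, 0.6, 0.55], [1, 1, 1, 0]))
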
Below, we characterize calibration research in two directions: (A) CSF evaluation,
with both theoretical guarantees and practical estimation methodologies:
• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]