fit the model to the data well and encourage the approximate posterior to be as close as possible to the true posterior distribution. Even a non-Bayesian, classic NN can be interpreted in this framework as an approximate, degenerate posterior distribution, i.e., a Dirac delta function centered on the MAP estimate of the parameters, q(θ|D; φ) = δ(θ − θ̂_MAP). More PUQ methods based on different posterior approximations are discussed in detail in Chapter 3, with additional updates on the state of the art.
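To make the degenerate-posterior view concrete, the following minimal Python sketch (the linear predictor, the Gaussian variational posterior, and all numbers are illustrative assumptions, not taken from the source) shows that Monte Carlo averaging of predictions under q(θ|D; φ) = δ(θ − θ̂_MAP) collapses to a single deterministic forward pass, whereas a non-degenerate posterior yields a genuine predictive average.

import numpy as np

# Under the degenerate posterior q(θ|D; φ) = δ(θ − θ̂_MAP), every posterior
# sample equals θ̂_MAP, so the Monte Carlo predictive mean collapses to one
# deterministic forward pass. Predictor and posterior below are toy choices.
rng = np.random.default_rng(0)
theta_map = np.array([1.5, -0.7])   # assumed MAP estimate of the parameters
x = np.array([0.3, 2.0])            # a single input

def predict(theta, x):
    return theta @ x                # toy linear predictor

# Non-degenerate posterior: Gaussian centered on the MAP estimate.
samples_vi = theta_map + 0.1 * rng.standard_normal((1000, 2))
pred_vi = np.mean([predict(t, x) for t in samples_vi])

# Degenerate (Dirac delta) posterior: all samples are identical.
samples_delta = np.tile(theta_map, (1000, 1))
pred_delta = np.mean([predict(t, x) for t in samples_delta])

print(pred_vi)      # close to predict(theta_map, x), with Monte Carlo noise
print(pred_delta)   # exactly predict(theta_map, x)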
2.2.6 Failure Prediction
Based on the principle of selective prediction [138, 139], failure prediction is the task of predicting whether a model will fail on a given input. Every chapter following Chapter 3 addresses this topic in the context of its respective task. Since failure prediction is an important topic in IA-DU that is attracting increasing interest [81, 114, 127, 193, 391], a brief overview of how it provides a unified perspective is warranted. We refer the reader to [171, 536] for a comprehensive survey.
Failure prediction subsumes many related tasks in the sense that it requires a failure source to be defined to form a binary classification task. The failure source can be i.i.d. mispredictions; covariate shift (e.g., input corruptions, concept drift, domain shift); or a new class, domain, modality, task, or concept. The goal is to anticipate these failures before they occur, allowing for more reliable and robust ML systems.
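As a minimal sketch of this binary-classification framing, the following illustrative Python snippet (all data is synthetic; the confidence construction and the roughly 90% accuracy are assumptions chosen for the demonstration) builds failure labels from i.i.d. mispredictions and scores a confidence scoring function (CSF) by how well it ranks failures, using AUROC.

import numpy as np
from sklearn.metrics import roc_auc_score

# Failure prediction as binary classification, assuming the failure source is
# i.i.d. misprediction; for other sources (covariate shift, new classes, ...),
# only the construction of the `failure` labels changes.
rng = np.random.default_rng(0)

# Hypothetical classifier outputs on a held-out set (synthetic stand-ins).
y_true = rng.integers(0, 10, size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, (y_true + 1) % 10)  # ~90% acc
confidence = np.clip(
    rng.normal(0.95, 0.05, 1000) - 0.3 * (y_pred != y_true), 0.0, 1.0
)  # assumed CSF: lower on errors, as a well-ranked score would be

# Binary failure labels: 1 where the model errs, 0 where it is correct.
failure = (y_pred != y_true).astype(int)

# A good failure predictor assigns low confidence to failures; the AUROC of
# (1 - confidence) against the failure labels measures this separability.
print("failure-prediction AUROC:", roc_auc_score(failure, 1.0 - confidence))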
First, note that calibration does not imply failure prediction: a model calibrated w.r.t. i.i.d. data can still be overconfident on OOD inputs [549]. Example 2.2.1 sketches the independent requirements of calibration and confidence ranking.
Example 2.2.1. Classifier A scores 90% accuracy on the test set, with a CSF using the entire range [0, 1]. Classifier B scores 92% accuracy on the test set, but its CSF always reports 0.92 for any input. Which classifier is preferred in a real-world setting?
• Classifier B is well calibrated in aggregate (its constant confidence of 0.92 matches its 92% accuracy), but it is not possible to know whether it will fail on a given input.
• Classifier A might be less calibrated, but its CSF spans the full range and can provide the separability needed to predict failure on a given input.
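This contrast can be simulated directly. In the following illustrative sketch (all numbers and the Beta-distributed confidence model are assumptions, not the source's data), classifier B's constant CSF has a near-zero aggregate calibration gap yet an AUROC of exactly 0.5 (no separability), while classifier A's spread-out CSF is worse calibrated but ranks failures below successes.

import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for Example 2.2.1 (illustrative assumptions throughout).
rng = np.random.default_rng(1)
n = 10_000

# Classifier B: 92% accuracy, constant CSF of 0.92.
correct_b = rng.random(n) < 0.92
csf_b = np.full(n, 0.92)

# Classifier A: 90% accuracy; CSF spread over [0, 1] and correlated with
# correctness (an assumed stand-in for a well-ranked confidence score).
correct_a = rng.random(n) < 0.90
csf_a = np.clip(rng.beta(5, 2, n) * np.where(correct_a, 1.0, 0.6), 0.0, 1.0)

# Aggregate calibration gap: |mean confidence - accuracy|.
print("calibration gap A:", abs(csf_a.mean() - correct_a.mean()))  # large
print("calibration gap B:", abs(csf_b.mean() - correct_b.mean()))  # ~0

# Separability: AUROC of the CSF for predicting correctness.
print("AUROC A:", roc_auc_score(correct_a, csf_a))  # > 0.5: failures rankable
print("AUROC B:", roc_auc_score(correct_b, csf_b))  # = 0.5: no separability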
Specific to OOD failure prediction, [527] provides a comprehensive categorization of failure tasks and methods.