domain or task). If the model has access to only a limited number of training samples from the new distribution, this is referred to as few-shot learning (or, with no samples at all, zero-shot learning); if it is able to adapt to new distributions over time, or to accumulate knowledge across different tasks without retraining from scratch [87], this is referred to as continual or incremental learning.
Many of these settings are referred to in business as ‘out-of-the-box’ or ‘self-learning’, yet without any formal definitions. Domain and task generalization are major selling points of pretrained LLMs, which are able to perform well on a wide range of tasks and domains. For very different distributions, e.g., a different task/expected output or an additional domain/input modality, it is often necessary to fine-tune the model on a small amount of data from the new distribution, which is known as transfer learning. Specific to LLMs, instruction tuning is a form of transfer learning in which samples from a new distribution are augmented with natural language instructions [69, 532]. This approach has been used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an effort to reduce the amount of annotated data required to generalize to unseen domains and questions.
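To make the format concrete, the following is a minimal sketch of instruction-style data preparation for a DocVQA sample; the template, field names, and the build_prompt helper are illustrative assumptions, not the exact setup used in Chapter 5.

```python
# Minimal sketch of instruction tuning data preparation (hypothetical
# template and field names; not the exact format used in Chapter 5).

INSTRUCTION_TEMPLATE = (
    "You are given the text extracted from a document.\n"
    "Document: {document}\n"
    "Answer the following question using only the document.\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(sample: dict) -> dict:
    """Wrap a raw (document, question, answer) sample with a natural
    language instruction; the target remains the plain answer string."""
    prompt = INSTRUCTION_TEMPLATE.format(
        document=sample["document_text"],
        question=sample["question"],
    )
    return {"input": prompt, "target": sample["answer"]}

example = {
    "document_text": "Invoice #123. Total due: $45.00.",
    "question": "What is the total amount due?",
    "answer": "$45.00",
}
print(build_prompt(example)["input"])
```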
2.2.2 Confidence Estimation
A quintessential component of reliability and robustness is the ability of a model to estimate its own uncertainty, or, inversely, to translate its outputs into probabilities or ‘confidence’ scores (Definition 6).
Definition 6 [Confidence Scoring Function]. Any function g : X → R whose continuous output aims to separate a model’s failures from its correct predictions can be interpreted as a confidence scoring function (CSF) [193].
Note that while it is preferable for the outputs of g to lie in [0, 1] for easier thresholding, this is not a strict requirement.
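In practice, a CSF is typically paired with a threshold to decide whether to accept a prediction or abstain, as in selective prediction. The following is a minimal sketch under that assumption; the names model, g, and the threshold value tau are illustrative placeholders.

```python
def predict_with_abstention(model, g, x, tau: float = 0.8):
    """Return the model's prediction only if the CSF g deems the input
    confident enough; otherwise abstain (e.g., defer to a human).
    `model`, `g`, and `tau` are illustrative placeholders."""
    y_hat = model(x)
    confidence = g(x)  # higher values should indicate correct predictions
    if confidence >= tau:
        return y_hat
    return None  # abstain / defer
```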
Circling back to the question of why one needs a CSF, there are multiple reasons: i) ML models are continually improving, yet zero test error is an illusion; even a toy dataset such as MNIST is not perfectly separable; ii) once a model is deployed, performance deterioration is expected as the i.i.d. assumption breaks; iii) generative models are prone to hallucinations [198], requiring control mechanisms and guardrails to guide them.
Below, we present some common CSFs used in practice [114, 172, 194, 539], where for convenience the subscript notation g_k(x) is used to denote the k-th element of the output vector g(x).
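As a concrete preview, one of the simplest and most widely used CSFs is the maximum softmax probability; a minimal sketch is given below, assuming the model exposes class logits (the helper names are illustrative).

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)

def msp_confidence(logits: np.ndarray) -> float:
    """Maximum softmax probability: the probability assigned to the
    predicted class, a confidence score in [0, 1]."""
    probs = softmax(logits)
    return float(probs.max(axis=-1))

# Example: a 3-class prediction whose logits favour class 0.
print(msp_confidence(np.array([2.0, 0.5, -1.0])))  # ~0.79
```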