domain or task). If the model has access to only a limited number of samples for
training on the new distribution, this is referred to as few-shot learning; if it
has no samples at all, zero-shot learning. If it is able to adapt to new
distributions over time, or to accumulate knowledge across different tasks
without retraining from scratch [87], this is referred to as continual or
incremental learning.
Many of these settings are marketed in business as "out-of-the-box" or
"self-learning", yet without any formal definition. Domain and task generalization are
major selling points of pretrained LLMs, which are able to perform well on a
wide range of tasks and domains. In the case of very different distributions, e.g.,
a different task/expected output or an additional domain/input modality, it is
often necessary to fine-tune the model on a small amount of data from the new
distribution, which is known as transfer learning. Specific to LLMs, instruction
tuning is a form of transfer learning, where samples from a new distribution are
appended with natural language instructions [69, 532]. This approach has been
used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an
effort to reduce the amount of annotated data required to generalize to unseen
domains and questions.
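To make this concrete, the sketch below shows how a DocVQA sample might be
wrapped in a natural language instruction before fine-tuning. The prompt
wording and function name are illustrative assumptions, not the exact template
used in Chapter 5.

    # Illustrative instruction formatting for a DocVQA sample.
    # The instruction wording is a hypothetical template, not the one
    # actually used in Chapter 5.

    def format_docvqa_sample(document_text: str, question: str) -> str:
        """Wrap a (document, question) pair in a natural language instruction."""
        return (
            "You are given the OCR text of a document. "
            "Answer the question using only information from the document.\n\n"
            f"Document:\n{document_text}\n\n"
            f"Question: {question}\n"
            "Answer:"
        )

    prompt = format_docvqa_sample(
        "Invoice No. 1234 ... Total due: $56.00",
        "What is the total amount due?",
    )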
2.2.2 Confidence Estimation
A quintessential component of reliability and robustness is a model's ability
to estimate its own uncertainty, or, inversely, to translate model outputs into
probabilities or ‘confidence’ scores (Definition 6).
Definition 6 [Confidence Scoring Function]. Any function g : X → R
whose continuous output aims to separate a model’s failures from correct
predictions can be interpreted as a confidence scoring function (CSF) [193].
Note that while it is preferable for the range of g to lie in [0, 1] for easier
thresholding, this is not a strict requirement.
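In practice, a CSF is typically used by thresholding its output: predictions
whose confidence falls below a threshold are rejected, deferred to a human, or
routed to a fallback system. A minimal sketch of this selective-prediction use,
with illustrative names not taken from the thesis:

    import numpy as np

    def selective_accuracy(confidences: np.ndarray,
                           correct: np.ndarray,
                           tau: float) -> tuple[float, float]:
        """Accuracy and coverage when only predictions with g(x) >= tau are kept.

        confidences: CSF outputs g(x), one per test sample.
        correct: boolean array, True where the model's prediction was correct.
        A good CSF yields higher accuracy on the accepted subset as tau grows.
        """
        accepted = confidences >= tau
        coverage = float(accepted.mean())
        accuracy = float(correct[accepted].mean()) if accepted.any() else float("nan")
        return accuracy, coverage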
Circling back to the question of why one needs a CSF, there are multiple reasons:
i) ML models are continually improving, yet zero test error is an illusion; even
a toy dataset such as MNIST is not perfectly separable; ii) once a model is
deployed, performance deterioration is to be expected as i.i.d. assumptions
break; iii) generative models are prone to hallucinations [198], requiring
control mechanisms and guardrails to guide them.
Below, we present some common CSFs used in practice [114, 172, 194, 539], where
for convenience a subscript denotes the k-th element of the output vector g(x),
written g_k(x).
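As one concrete instance, a widely used CSF for classifiers is the maximum
softmax probability, which takes the largest predicted class probability as the
confidence score. A minimal sketch, assuming a model that outputs a vector of
logits per sample:

    import numpy as np

    def softmax(logits: np.ndarray) -> np.ndarray:
        """Numerically stable softmax over the last axis."""
        shifted = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    def max_softmax_probability(logits: np.ndarray) -> np.ndarray:
        """CSF g(x) = max_k softmax(f(x))_k, the predicted class's probability."""
        return softmax(logits).max(axis=-1)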