domain or task). If the model has access to only a limited number of samples for
training on the new distribution, this is referred to as few-shot learning; if it
has no samples at all, zero-shot learning. If it is able to adapt to new
distributions over time, or to accumulate knowledge across different tasks
without retraining from scratch [87], this is referred to as continual or
incremental learning.
Many of these settings are marketed in business as "out-of-the-box" or
"self-learning", yet without any formal definition. Domain and task generalization are
major selling points of pretrained LLMs, which are able to perform well on a
wide range of tasks and domains. In the case of very different distributions, e.g.,
a different task/expected output or an additional domain/input modality, it is
often necessary to fine-tune the model on a small amount of data from the new
distribution, which is known as transfer learning. Specific to LLMs, instruction
tuning is a form of transfer learning, where samples from a new distribution are
appended with natural language instructions [69, 532]. This approach has been
used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an
effort to reduce the amount of annotated data required to generalize to unseen
domains and questions.
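To make this concrete, the sketch below shows how a DocVQA sample might be
wrapped in a natural language instruction before fine-tuning. The prompt
wording and function name are illustrative assumptions, not the exact template
used in Chapter 5.

    # Illustrative instruction formatting for a DocVQA sample.
    # The instruction wording is a hypothetical template, not the one
    # actually used in Chapter 5.

    def format_docvqa_sample(document_text: str, question: str) -> str:
        """Wrap a (document, question) pair in a natural language instruction."""
        return (
            "You are given the OCR text of a document. "
            "Answer the question using only information from the document.\n\n"
            f"Document:\n{document_text}\n\n"
            f"Question: {question}\n"
            "Answer:"
        )

    prompt = format_docvqa_sample(
        "Invoice No. 1234 ... Total due: $56.00",
        "What is the total amount due?",
    )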
2.2.2 Confidence Estimation
A quintessential component of reliability and robustness is a model's ability
to estimate its own uncertainty, or, inversely, to translate model outputs into
probabilities or ‘confidence’ scores (Definition 6).
Definition 6 [Confidence Scoring Function]. Any function g : X → R
whose continuous output aims to separate a model’s failures from correct
predictions can be interpreted as a confidence scoring function (CSF) [193].
Note that while it is preferable for the range of g to lie in [0, 1] for easier
thresholding, this is not a strict requirement.
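In practice, a CSF is typically used by thresholding its output: predictions
whose confidence falls below a threshold are rejected, deferred to a human, or
routed to a fallback system. A minimal sketch of this selective-prediction use,
with illustrative names not taken from the thesis:

    import numpy as np

    def selective_accuracy(confidences: np.ndarray,
                           correct: np.ndarray,
                           tau: float) -> tuple[float, float]:
        """Accuracy and coverage when only predictions with g(x) >= tau are kept.

        confidences: CSF outputs g(x), one per test sample.
        correct: boolean array, True where the model's prediction was correct.
        A good CSF yields higher accuracy on the accepted subset as tau grows.
        """
        accepted = confidences >= tau
        coverage = float(accepted.mean())
        accuracy = float(correct[accepted].mean()) if accepted.any() else float("nan")
        return accuracy, coverage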
Circling back to the question of why one needs a CSF, there are multiple reasons:
i) ML models are continually improving, yet zero test error is an illusion; even
a toy dataset such as MNIST is not perfectly separable; ii) once a model is
deployed, performance deterioration is to be expected as i.i.d. assumptions
break; iii) generative models are prone to hallucinations [198], requiring
control mechanisms and guardrails to guide them.
Below, we present some common CSFs used in practice [114, 172, 194, 539], where
for convenience a subscript denotes the k-th element of the output vector g(x),
written g_k(x).
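As one concrete instance, a widely used CSF for classifiers is the maximum
softmax probability, which takes the largest predicted class probability as the
confidence score. A minimal sketch, assuming a model that outputs a vector of
logits per sample:

    import numpy as np

    def softmax(logits: np.ndarray) -> np.ndarray:
        """Numerically stable softmax over the last axis."""
        shifted = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    def max_softmax_probability(logits: np.ndarray) -> np.ndarray:
        """CSF g(x) = max_k softmax(f(x))_k, the predicted class's probability."""
        return softmax(logits).max(axis=-1)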