of a foundation model for DU tasks (Chapters 4 to 6) or to contrast with 1-D
CNNs in text classification (Chapter 3). Note that the authors of [265] share
our concern that NLP needs a new ‘playground’ with more realistic tasks and
benchmarks, extending beyond sentence-level contexts to more complex
document-level tasks. Alternative sub-quadratic architectures have begun to
address the Transformer’s computational inefficiency on long sequences, e.g.,
Mamba [152] and LongNet [99]. Time will tell whether these can challenge the
Transformer’s dominance in foundation models.
2.2 Reliability and Robustness
Chapter 3 covers in detail the basic relation between uncertainty
quantification, calibration, and distributional generalization or detection
tasks. Here, we focus on the more general concepts of reliability and
robustness, and how they relate to concepts used throughout the rest of the
thesis. Next, we discuss the need for confidence estimation and appropriate
evaluation metrics, followed by short summaries of the main research trends in
calibration and uncertainty quantification.
Emerging guidance and regulations [2, 3, 475] place increasing importance on
the reliability and robustness of ML systems, particularly once they are used
in the public sphere or in safety-critical applications. In ML, reliability and
robustness are often used interchangeably [78, 420, 455], yet they are distinct
concepts, and it is important to understand the difference between them. This
thesis uses the following definitions of reliability and robustness, adapted
from the systems engineering literature [395]:
Definition 3 [Reliability]. Reliability is the ability of a system to
consistently perform its intended function in a specific, known environment for
a specific period of time, with a specific level of expected accuracy [395]. In
the ML context, this entails all evaluation under the i.i.d. assumption,
allowing for some benign shifts of the distribution, including predictive
performance evaluation with task-dependent metrics (accuracy, F1, perplexity,
etc.), calibration, selective prediction, uncertainty estimation, etc.
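To make two of the evaluation concepts above concrete, the following is a minimal sketch (not taken from the thesis; function names and the binning scheme are illustrative) of how calibration and selective prediction are commonly quantified: binned expected calibration error (ECE), and the error rate on the most-confident fraction of predictions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |mean accuracy - mean confidence| per bin,
    weighted by the fraction of samples falling in that bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

def selective_risk(confidences, correct, coverage=0.8):
    """Selective prediction: error rate when answering only on the
    `coverage` fraction of most-confident predictions."""
    order = np.argsort(-confidences)  # most confident first
    keep = order[: int(np.ceil(coverage * len(order)))]
    return 1.0 - correct[keep].mean()
```

A well-calibrated, reliable model has low ECE, and its selective risk decreases as coverage is reduced, i.e., confidence ranks correct predictions above incorrect ones.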
Reliability requires clearly specifying the role an ML component plays in a
larger system, and defining the expected behavior of the system as a function
of alignment with the training data distribution. This is particularly
important in the context of black-box models, where the inner workings of the
model are not transparent to the user. In this case, the user needs to be aware
of the model’s limitations, e.g., model misspecification, lack of training data, and the