of a foundation model for DU tasks (Chapters 4 to 6) or to contrast with 1-D CNNs in text classification (Chapter 3). Note that the authors of [265] share our concern that NLP needs a new ‘playground’ with more realistic tasks and benchmarks, extending beyond sentence-level contexts to more complex document-level tasks. Alternative sub-quadratic architectures have started to address the Transformer’s computational inefficiency on long sequences, e.g., Mamba [152] and LongNet [99]. Time will tell whether these can challenge the Transformer’s dominance in foundation models.

2.2 Reliability and Robustness

Chapter 3 covers in depth the basic relation between uncertainty quantification, calibration, and distributional generalization or detection tasks. Here, we focus on the more general concepts of reliability and robustness, and how they relate to the concepts used throughout the rest of the thesis. We then discuss the need for confidence estimation and appropriate evaluation metrics, followed by short summaries of the main research trends in calibration and uncertainty quantification.

Emerging guidance and regulations [2, 3, 475] place increasing importance on the reliability and robustness of ML systems, particularly once they are deployed in the public sphere or in safety-critical applications. In ML, reliability and robustness are often used interchangeably [78, 420, 455], yet they are distinct concepts, and it is important to understand the difference between them. This thesis uses the following definitions of reliability and robustness, adapted from the systems engineering literature [395]:

Definition 3 [Reliability]. Reliability is the ability of a system to consistently perform its intended function in a specific, known environment for a specific period of time, with a specific level of expected accuracy [395]. Closer to the ML context, this entails all evaluation under the i.i.d.
assumption, allowing for some benign shifts of the distribution. This includes predictive performance evaluation with task-dependent metrics (accuracy, F1, perplexity, etc.), as well as calibration, selective prediction, and uncertainty estimation. Reliability requires clearly specifying the role an ML component plays in a larger system, and defining the expected behavior of the system as a function of alignment with the training data distribution. This is particularly important in the context of black-box models, where the inner workings of the model are not transparent to the user. In this case, the user needs to be aware of the model’s limitations, e.g., model misspecification, lack of training data, and the