For simplicity, consider the evaluation of a single non-list ground truth answer $G$ and prediction $\hat{P}$, with string lengths $|G|$ and $|\hat{P}|$, respectively.

\[
LD(G, \hat{P}) =
\begin{cases}
1 & \text{if } NA(G) \wedge |\hat{P}| > 0, \\
0 & \text{if } NA(G) \wedge |\hat{P}| = 0, \\
|G| & \text{if } |\hat{P}| = 0, \\
LD(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{if } G[0] = \hat{P}[0], \\
1 + \min
\begin{cases}
LD(\mathrm{tail}(G), \hat{P}) & \text{(deletion)}, \\
LD(G, \mathrm{tail}(\hat{P})) & \text{(insertion)}, \\
LD(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{(substitution)}
\end{cases}
& \text{if } G[0] \neq \hat{P}[0]
\end{cases}
\tag{2.7}
\]

The conditions are tested in order, and the first one that holds is applied. The normalized similarity metric is then defined as

\[
NLS(G, \hat{P}) = 1 - \frac{LD(G, \hat{P})}{\max(1, |G|, |\hat{P}|)}.
\]

Given multiple ground truth answer variants $G_i = \{a_1, a_2, \ldots\}$ and a predicted answer $\hat{P}_{Q_i}$ for each question $Q_i$ in the test set of size $N$, we define the complete metric as follows:

\[
ANLS = \frac{1}{N} \sum_{i=1}^{N} \max_{a \in G_i} s\left(a, \hat{P}_{Q_i}\right)
\tag{2.8}
\]

\[
s\left(a, \hat{P}_{Q_i}\right) =
\begin{cases}
NLS\left(a, \hat{P}_{Q_i}\right) & \text{if } NLS\left(a, \hat{P}_{Q_i}\right) > \tau, \\
0 & \text{if } NLS\left(a, \hat{P}_{Q_i}\right) < \tau,
\end{cases}
\tag{2.9}
\]

where we follow prior literature [39, 449] in setting the threshold $\tau = 0.5$. For list-type questions, Hungarian matching is performed following [449], using the NLS between each ground truth answer part and each predicted answer part.

Proper scoring rules [330] are used for generic evaluation of predictive performance. They compute a score at the instance level while measuring both the quality of the predictive function and the predicted probability distribution (as they are not compatible with an arbitrary CSF):

• Negative Log Likelihood (NLL) [378] is both a popular loss function (cross-entropy) and a scoring rule, which only penalizes (wrong) log probabilities $q_i$ given to the true class, with $I$ an indicator function defining