RELIABILITY AND ROBUSTNESS

Consider, for simplicity, the evaluation of a single non-list ground truth answer
$G$ and prediction $\hat{P}$, with string lengths $|G|$ and $|\hat{P}|$, respectively.

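\[
\mathrm{LD}(G, \hat{P}) =
\begin{cases}
1 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| > 0, \\
0 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| = 0, \\
|G| & \text{if } |\hat{P}| = 0, \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{if } G[0] = \hat{P}[0], \\
1 + \min
\begin{cases}
\mathrm{LD}(\mathrm{tail}(G), \hat{P}) & \text{if } G[0] \neq \hat{P}[0] \text{ (deletion)}, \\
\mathrm{LD}(G, \mathrm{tail}(\hat{P})) & \text{if } G[0] \neq \hat{P}[0] \text{ (insertion)}, \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{if } G[0] \neq \hat{P}[0] \text{ (substitution)}
\end{cases}
\end{cases}
\tag{2.7}
\]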
Each of the conditions is tested in turn, and the first one that is true is executed.
The normalized similarity metric is then defined as
\[
\mathrm{NLS}(G, \hat{P}) = 1 - \frac{\mathrm{LD}(G, \hat{P})}{\max(1, |G|, |\hat{P}|)}.
\]
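As an illustration of Eq. (2.7) and the NLS definition, the following Python sketch computes LD with an iterative two-row table instead of the recursion, which should give the same value for answerable ground truths. The function names `levenshtein`, `ld`, and `nls` are hypothetical. Two choices are assumptions not stated in the text: a not-answerable ground truth is represented as `None` with $|G|$ taken as 0, and an empty $G$ with a non-empty prediction returns $|\hat{P}|$.

```python
from typing import Optional


def levenshtein(g: str, p: str) -> int:
    """Edit distance between g and p, computed row by row."""
    if not p:
        return len(g)  # third case of Eq. (2.7)
    if not g:
        return len(p)  # assumption: symmetric base case, not listed in Eq. (2.7)
    prev = list(range(len(p) + 1))  # distances from the empty prefix of g
    for i, gc in enumerate(g, start=1):
        curr = [i]
        for j, pc in enumerate(p, start=1):
            if gc == pc:
                curr.append(prev[j - 1])  # matching heads: no cost
            else:
                curr.append(1 + min(prev[j],        # deletion
                                    curr[j - 1],    # insertion
                                    prev[j - 1]))   # substitution
        prev = curr
    return prev[-1]


def ld(g: Optional[str], p: str) -> int:
    """LD of Eq. (2.7); g=None stands for NA(G) (assumed encoding)."""
    if g is None:
        return 1 if len(p) > 0 else 0
    return levenshtein(g, p)


def nls(g: Optional[str], p: str) -> float:
    """Normalized similarity: 1 - LD / max(1, |G|, |P|)."""
    g_len = 0 if g is None else len(g)  # assumption: |G| = 0 when NA(G)
    return 1.0 - ld(g, p) / max(1, g_len, len(p))
```

For instance, `nls("kitten", "sitting")` evaluates to 1 − 3/7, since three edits separate the two strings and the longer one has seven characters.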

Given multiple ground truth answer variants $G_i = \{a_1, a_2, \ldots\}$ and a predicted
answer $\hat{P}_{Q_i}$ for each question $Q_i$ in the test set of size $N$, we define the
complete metric as follows:
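\[
\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{a \in G_i} s\left(a, \hat{P}_{Q_i}\right)
\tag{2.8}
\]
\[
s\left(a, \hat{P}_{Q_i}\right) =
\begin{cases}
\mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) & \text{if } \mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) > \tau \\
0 & \text{if } \mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) < \tau
\end{cases},
\tag{2.9}
\]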

where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
In the case of a list-type question, Hungarian matching is performed following
[449] according to NLS between each ground truth answer part and each
prediction answer part.
Proper scoring rules [330] are used for generic evaluation of predictive
performance. They compute a score at the instance level and measure both
the quality of the predictive function and that of the predicted probability
distribution (as they are not compatible with an arbitrary CSF):
• Negative Log Likelihood (NLL) [378] is both a popular loss function
(cross-entropy) and scoring rule which only penalizes (wrong) log
probabilities qi given to the true class, with I an indicator function defining