🔍 Today's pick in Interpretability & Analysis of LMs: AtP*: An efficient and scalable method for localizing LLM behaviour to components by J. Kramár, T. Lieberum, R. Shah, and @NeelNanda
The attribution patching method (AtP) provides a fast and effective approximation of activation patching, requiring only two forward passes and one backward pass to estimate the contribution of every network component for a given prompt pair.
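To make the recipe concrete, here is a minimal PyTorch sketch of the AtP estimate. The activation caches and `metric_fn` are illustrative assumptions (the caching, e.g. via forward hooks, is omitted), not the paper's code:

```python
import torch

def atp_estimates(clean_acts, noise_acts, metric_fn):
    """Minimal AtP sketch: first-order estimate of each component's effect.

    clean_acts / noise_acts map component names to activations cached on the
    clean / noise forward pass (the two forward passes). metric_fn recomputes
    the scalar behaviour metric from the clean activations so that gradients
    can flow back to them.
    """
    # Turn cached activations into leaf tensors so .grad gets populated.
    clean_acts = {k: v.detach().requires_grad_(True) for k, v in clean_acts.items()}

    metric = metric_fn(clean_acts)  # recompute the metric from clean activations
    metric.backward()               # one backward pass yields all gradients at once

    # AtP estimate per component: (noise_act - clean_act) · d metric / d clean_act
    return {
        name: torch.sum((noise_acts[name] - act) * act.grad).item()
        for name, act in clean_acts.items()
    }
```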
While previous work highlighted the effectiveness of attribution patching, the authors identify two settings in which AtP produces false negatives:
- When estimating the contribution of pre-activation components, if the clean and noise inputs don't lie in the same activation region, the first-order approximation provided by the gradient leads to large errors (Fig. 3; see the toy example after this list).
- When the sum of direct and indirect effects is close to 0, even small approximation errors introduced by nonlinearities can greatly affect the estimated contribution.
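The first failure mode is easy to reproduce in miniature. Below, a sigmoid stands in for the saturating attention softmax (my toy example, not the paper's figure): the clean and noise inputs sit in opposite saturation regions, so the gradient at the clean point badly underestimates the true patching effect:

```python
import torch

# Clean and noise inputs in opposite saturation regions of a sigmoid.
x_clean = torch.tensor(-4.0, requires_grad=True)
x_noise = torch.tensor(4.0)

torch.sigmoid(x_clean).backward()

true_effect = (torch.sigmoid(x_noise) - torch.sigmoid(x_clean)).item()  # ~0.96
atp_effect = ((x_noise - x_clean) * x_clean.grad).item()                # ~0.14
```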
The authors propose two changes to the AtP method to mitigate these issues:
- recomputing the attention softmax for the selected component, and then taking a linear approximation to the remaining part of the model (QK fix);
- iteratively zeroing gradients at layers contributing to the indirect effects causing cancellation (GradDrop; sketched below).
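A hedged sketch of the GradDrop idea follows. Here `metric_fn(acts, drop)` is assumed to recompute the metric while detaching layer `drop`'s contribution to the residual stream, and the mean-of-absolute-values aggregation is an illustrative choice; the paper's exact procedure differs in its details:

```python
import torch

def graddrop_estimates(clean_acts, noise_acts, metric_fn, n_layers):
    """Illustrative GradDrop sketch (not the paper's exact aggregation).

    For each layer, rerun the backward pass with gradients stopped at that
    layer's residual-stream contribution, collect the per-layer AtP
    estimates, then aggregate them.
    """
    estimates = {name: [] for name in clean_acts}
    for drop in range(n_layers):
        acts = {k: v.detach().requires_grad_(True) for k, v in clean_acts.items()}
        metric_fn(acts, drop).backward()  # gradients with layer `drop` detached
        for name, act in acts.items():
            estimates[name].append(
                torch.sum((noise_acts[name] - act) * act.grad).item()
            )
    # Averaging absolute per-layer estimates avoids the cancellation that
    # suppressed the plain AtP estimate.
    return {k: sum(abs(e) for e in v) / n_layers for k, v in estimates.items()}
```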
AtP and AtP* are compared across several patching settings on Pythia models and found to be effective while much less computationally expensive than alternative approaches. The authors also propose a new methodology to bound the magnitude of AtP* false negatives given a sample budget and a desired confidence level.
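The paper's diagnostic is more involved, but the flavour of such a bound can be shown with a generic binomial argument (my sketch, not the authors' procedure): if each of n independent verification patches would flag a given missed component with the same probability, then observing zero flags yields an upper confidence bound on that probability:

```python
def miss_prob_upper_bound(n_samples: int, confidence: float = 0.95) -> float:
    """Generic binomial bound, not the paper's exact diagnostic: after
    n_samples independent checks all come back clean, solve
    (1 - p)^n <= 1 - confidence for the per-check detection probability p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)

print(miss_prob_upper_bound(100))  # ~0.03, the classic "rule of three"
```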
📄 Paper: AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745)
🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9