Today's pick in Interpretability & Analysis of LMs: Can Large Language Models Explain Themselves? by
@andreasmadsen
Sarath Chandar &
@sivareddyg
LLMs can provide wrong but convincing explanations for their behavior, which may lead to misplaced confidence in their predictions. This study uses self-consistency checks to measure the faithfulness of LLM explanations: if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without those words. Results demonstrate that the faithfulness of LLM self-explanations cannot be reliably trusted, as it proves to be highly task- and model-dependent, with bigger models generally producing more faithful explanations.
Paper: Can Large Language Models Explain Themselves? (2401.07927)
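To make the self-consistency idea concrete, here is a minimal sketch of a redaction-style check, not the paper's exact protocol. The helper `query_llm`, the prompts, and the sentiment-classification framing are all assumptions for illustration; plug in whatever model API and task you actually use.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return its text reply."""
    raise NotImplementedError  # wire this up to your own API client


def self_consistency_check(text: str, labels: list[str]) -> bool:
    """Return True if the model's self-explanation passes the redaction check."""
    # 1. Ask the model for a prediction on the original input.
    prediction = query_llm(
        f"Classify this review as one of {labels}:\n{text}"
    ).strip()

    # 2. Ask the model which words were important for that prediction
    #    (its self-explanation).
    reply = query_llm(
        f"Which words in the review were most important for predicting "
        f"'{prediction}'? Reply with a comma-separated list.\n{text}"
    )
    important = {w.strip().lower() for w in reply.split(",") if w.strip()}

    # 3. Redact the words the model claimed were important.
    redacted = " ".join(
        "[REDACTED]" if w.lower().strip(".,!?") in important else w
        for w in text.split()
    )

    # 4. Re-classify the redacted text. If the explanation were faithful,
    #    the model should no longer reproduce the same prediction.
    new_prediction = query_llm(
        f"Classify this review as one of {labels}:\n{redacted}"
    ).strip()
    return new_prediction != prediction
```

If the prediction survives the removal of the supposedly important words, the explanation is judged unfaithful for that input; aggregating this over a dataset gives a task- and model-level faithfulness estimate.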