Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
Abstract
Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), which stems from the complexity of extracting reasoning from the model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) for interpreting and controlling LLMs by modeling multi-token words. Prior research explored the LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. Since most words are composed of several tokens, we extend the LRH to multi-token words, enabling its use on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames that share a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3 model families, demonstrating gender and language biases, exposing harmful content, and also showing the potential to remediate it, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
Community
The Frame Representation Hypothesis is a robust framework for understanding and controlling LLMs: we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames that share a common concept.
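As a rough illustration of the idea (not the paper's exact construction), the sketch below builds a word frame from a model's token-embedding vectors and averages several word frames into a concept frame. The model name, fixed frame length, and zero-padding used to average frames of different lengths are assumptions made for the example.

```python
# Rough sketch (not the paper's exact construction): represent a word as a frame,
# i.e. an ordered sequence of its token vectors, and a concept as the mean of the
# frames of words sharing that concept. Model name, frame length, and zero-padding
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/Phi-3-mini-4k-instruct"  # any Hugging Face causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
E = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model)

def word_frame(word: str, max_tokens: int = 4) -> torch.Tensor:
    """Ordered sequence of unit-norm token vectors for a word, padded to max_tokens."""
    ids = tokenizer(word, add_special_tokens=False)["input_ids"][:max_tokens]
    vecs = torch.nn.functional.normalize(E[ids], dim=-1)         # (n_tokens, d_model)
    pad = vecs.new_zeros(max_tokens - vecs.shape[0], vecs.shape[1])
    return torch.cat([vecs, pad], dim=0)                         # (max_tokens, d_model)

def concept_frame(words: list[str]) -> torch.Tensor:
    """Average the frames of words that share a common concept."""
    return torch.stack([word_frame(w) for w in words]).mean(dim=0)

concept = concept_frame(["woman", "girl", "mother", "daughter", "queen"])
```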
We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice: at each step, the top-k candidate tokens are derived from the LLM, and the one maximizing correlation with the target Concept Frame is chosen.
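A minimal sketch of this decoding loop is shown below. It reuses the model, tokenizer, embedding matrix E, and concept frame from the previous sketch, and it scores candidates by cosine similarity against the concept frame's rows, a simplification of the paper's frame correlation; the prompt, k, and step count are illustrative.

```python
# Minimal sketch of Top-k Concept-Guided Decoding, reusing model, tokenizer, E, and
# concept from the sketch above. Scoring here is plain cosine similarity between a
# candidate token vector and the rows of the concept frame (a simplification of the
# paper's frame correlation); prompt, k, and step count are illustrative.
import torch

@torch.no_grad()
def guided_decode(prompt: str, concept: torch.Tensor, k: int = 10, steps: int = 30) -> str:
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    for _ in range(steps):
        logits = model(ids).logits[0, -1]                        # next-token logits
        topk = torch.topk(logits, k).indices                     # top-k candidate ids
        cand = torch.nn.functional.normalize(E[topk], dim=-1)    # (k, d_model)
        scores = (cand @ concept.T).max(dim=-1).values           # best match per candidate
        next_id = topk[scores.argmax()].view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(guided_decode("The new CEO walked in and", concept))
```

Restricting the choice to the model's own top-k candidates keeps the output fluent, since every candidate already has high model probability, while the concept score nudges each step toward the target concept.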
We use the Open Multilingual WordNet to generate Concept Frames that can both guide the model's text generation and expose biases or vulnerabilities.
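One way to gather the words behind a concept is NLTK's Open Multilingual WordNet interface; the sketch below is an assumption-laden example (the paper's exact tooling may differ), with an illustrative synset name, whose lemmas could then be fed to a concept-frame builder such as the one sketched above.

```python
# Sketch using NLTK's Open Multilingual WordNet interface (the paper's exact tooling
# may differ). The synset name is illustrative; the resulting lemmas could be fed to
# a concept-frame builder such as concept_frame() above.
import nltk
nltk.download("wordnet")
nltk.download("omw-1.4")
from nltk.corpus import wordnet as wn

def concept_words(synset_name: str, lang: str = "eng") -> list[str]:
    """Lemma names for a synset (e.g. 'woman.n.01') in a given OMW language."""
    return [name.replace("_", " ") for name in wn.synset(synset_name).lemma_names(lang)]

print(concept_words("woman.n.01"))          # English lemmas
print(concept_words("woman.n.01", "fra"))   # French lemmas via the OMW
```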
CAUTION: There are examples containing sensitive material that may be distressing for some audiences.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings (2024)
- Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models (2024)
- LBPE: Long-token-first Tokenization to Improve Large Language Models (2024)
- Training Large Language Models to Reason in a Continuous Latent Space (2024)
- Training and Evaluating Language Models with Template-based Data Generation (2024)
- A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models (2024)
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering (2024)
Really interesting work, especially showing how these models have many vulnerabilities that need to be fixed.
It would also be interesting to explore these effects further on other language models with more diverse language capabilities.