arxiv:2412.07334

Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

Published on Dec 10 · Submitted by pvalois on Dec 11
Abstract

Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), one that stems from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH), to interpret and control LLMs by modeling multi-token words. Prior research explored the LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. As most words are composed of several tokens, we extend the LRH to multi-token words, thereby enabling its use on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3 model families, demonstrating gender and language biases, exposing harmful content, but also showing the potential to remediate it, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git

Community

Paper author · Paper submitter

The Frame Representation Hypothesis is a robust framework for understanding and controlling LLMs: we propose that words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames sharing a common concept.

[Figure: overview.jpg]
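For readers who want to experiment, here is a minimal sketch of the idea in Python. It assumes frames are built from the model's input embedding rows and that frames of different lengths are zero-padded before averaging; both are illustrative simplifications rather than the paper's exact construction (see the repository for the real implementation).

```python
# Minimal sketch: words as frames (ordered token-vector sequences) and
# concepts as averages of word frames. The padding/averaging details are
# illustrative assumptions, not the paper's exact construction.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # any causal LM from the tested families
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
E = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def word_frame(word: str) -> torch.Tensor:
    """A word's frame: the ordered sequence of its token vectors."""
    ids = tok(word, add_special_tokens=False)["input_ids"]
    return E[ids]  # (num_tokens, hidden_dim)

def concept_frame(words: list[str]) -> torch.Tensor:
    """A concept: the average of the frames of words sharing that concept."""
    frames = [word_frame(w) for w in words]
    n = max(f.shape[0] for f in frames)
    # Zero-pad shorter frames so they can be stacked and averaged (assumption).
    padded = [F.pad(f, (0, 0, 0, n - f.shape[0])) for f in frames]
    return torch.stack(padded).mean(dim=0)  # (n, hidden_dim)
```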

We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice: the top-k candidate tokens are obtained from the LLM, and the one that maximizes correlation with the target Concept Frame is chosen.

[Figure: guidance.png]
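A simplified decoding loop might look like the following. The scoring rule here, maximum cosine similarity between a candidate token's vector and the rows of the concept frame, is a stand-in for the frame correlation used in the paper; `concept_frame` refers to the sketch above.

```python
# Sketch of Top-k Concept-Guided Decoding: at each step, take the model's
# top-k candidate tokens and keep the one best aligned with the concept
# frame. The cosine-similarity score is an illustrative stand-in.
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_decode(model, tok, prompt: str, concept: torch.Tensor,
                  k: int = 8, max_new_tokens: int = 30) -> str:
    E = model.get_input_embeddings().weight
    ids = tok(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits[0, -1]      # next-token scores
        top = torch.topk(logits, k).indices              # model's top-k candidates
        # Score each candidate by its best match against the frame's rows.
        sims = F.cosine_similarity(E[top].unsqueeze(1),  # (k, 1, d)
                                   concept.unsqueeze(0), # (1, n, d)
                                   dim=-1).max(dim=1).values
        nxt = top[sims.argmax()].view(1, 1)
        ids = torch.cat([ids, nxt], dim=-1)
        if nxt.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# e.g. guided_decode(model, tok, "The doctor said", concept_frame(["woman", "female"]))
```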

We use the Open Multilingual WordNet to generate Concept Frames that can both guide the model's text generation and expose biases or vulnerabilities.
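For example, concept word lists can be pulled from the Open Multilingual WordNet through NLTK; the synset and language codes below are just examples, and the resulting words would feed the `concept_frame` sketch above.

```python
# Sketch: collecting multilingual lemmas for a concept from the Open
# Multilingual WordNet via NLTK (synset/language choices are examples).
import nltk
nltk.download("wordnet")
nltk.download("omw-1.4")
from nltk.corpus import wordnet as wn

synset = wn.synset("woman.n.01")
words = (synset.lemma_names("eng")
         + synset.lemma_names("fra")
         + synset.lemma_names("jpn"))
words = [w.replace("_", " ") for w in words]  # WordNet joins multiword lemmas with "_"
```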

CAUTION: There are examples containing sensitive material that may be distressing for some audiences.

[Figures: example-men.png, example-women.png, example-children.png, example-women-2.png]


Really interesting work, especially in showing how these models have many vulnerabilities that need to be fixed.
It would also be interesting to explore these effects on other language models with more diverse language capabilities.
