Post 680
So, what is #MechanisticInterpretability 🤔
Mechanistic Interpretability (MI) is the discipline of opening the black box of large language models (and other neural networks) to understand the underlying circuits, features, and mechanisms that give rise to specific behaviours.
Instead of treating a model as a monolithic function, we can:
1. Trace how input tokens propagate through attention heads & MLP layers
2. Identify localized “circuit motifs” (e.g. induction heads)
3. Develop methods to systematically break down or “edit” these circuits to confirm we understand the causal structure (see the sketch below)
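A minimal sketch of step 3 in code, assuming the TransformerLens library. The prompt and the particular head are illustrative choices on my part (head 9.9 of GPT-2 small is one of the “name mover” heads reported in the IOI circuit work); the point is the pattern, not the specific numbers.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

# Baseline prediction, no intervention
baseline_logits = model(tokens)

LAYER, HEAD = 9, 9  # illustrative: a reported "name mover" head in GPT-2 small

def zero_ablate_head(value, hook):
    # value has shape [batch, pos, head_index, d_head]:
    # the per-head attention output before it is mixed back into the stream
    value[:, :, HEAD, :] = 0.0
    return value

# Re-run the forward pass with that one head knocked out
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", zero_ablate_head)],
)

# If the head is causally involved, ablating it should shift the prediction
for name, logits in [("baseline", baseline_logits), ("ablated", ablated_logits)]:
    top_token = logits[0, -1].argmax().item()
    print(name, repr(model.tokenizer.decode(top_token)))
```

Zero-ablation is the bluntest intervention; in practice, mean-ablation or activation patching usually gives cleaner causal evidence, but the loop is the same: intervene on a component, compare outputs, update your hypothesis about the circuit.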
Mechanistic Interpretability aims to yield human-understandable explanations of how advanced models represent and manipulate concepts, which hopefully leads to:
1. Trust & Reliability
2. Safety & Alignment
3. Better Debugging / Development Insights
https://bsky.app/profile/mechanistics.bsky.social/post/3lgvvv72uls2x