suggestion: notate the canonical SAEs

#9
by dribnet - opened

Initially this huggingface release highlighted a family of default SAEs in a special "canonical" folder. This was very helpful as it highlighted which of the many obscurely named models should be used as the default for any particular set of parameters (layer, width, etc). But these canonical folders were recently removed here and now are only updated and maintained externally.

In order to encourage building a set of compatible tools around these models, it would be useful to continue to indicate which models in this release are blessed as canonical and should be used as defaults. If this is not at least explained in the README, people will likely not realize that this designation exists, and downstream research efforts risk being incompatible with one another because they unknowingly use different non-canonical subsets of the SAE models.

Google org

Copy-and-pasting my comment on the Open Source Mech Interp slack:

Canonical SAEs deleted from Gemma Scope: decide which L0 is best for your use case instead

Hi all, when we released Gemma Scope I added "canonical" SAEs for some of the 2B MLP and residual stream SAEs: a copy of one of the SAEs at each site with an L0 we thought was good (closest to 100).

However, the release was in flux at the last moment, meaning the measured L0s released with all SAEs changed, and therefore sometimes the canonical SAE on HuggingFace was not the SAE with labelled L0 closest to 100 :melting_face:

This has caused lots of confusion, so I'm deleting them now (thanks to Joseph Bloom and Samuel Marks for surfacing problems here). If you really need them, you can still download them with the snippet in thread.

I can certainly point people towards the canonical SAEs for the 2B-Resid-PT and 2B-MLP-PT -- does that resolve the issue or do you think more is needed?

My $0.02: It's difficult enough for someone approaching interpretability to choose which layers and widths to focus on without also forcing them to choose from a dozen different sparsity settings with no guidance on which one to use (e.g., gemma-scope-9b-pt-res-layer_20-width_131k has 14). If the intention of this release is to have people using, understanding, and comparing downstream results on these models, then not encouraging sensible defaults is counterproductive.

To expand further on reasoning: the canonical folders should not be restored since silent inconsistency with SAELens is much worse than extra steps to select the right one.

Note that https://huggingface.co/google/gemma-scope-2b-pt-res#4-how-can-i-use-these-saes-straight-away for example provides code to load the canonical SAEs.

I'll add this snippet to all repos and find a way to print the average L0 too.
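For readers following along, here is a minimal sketch of the selection rule discussed in this thread (pick the SAE whose labelled L0 is closest to 100) that also prints the labelled L0 for each candidate. It is not the snippet from the README. The folder names in `SAE_PATHS`, the `average_l0_<N>` naming pattern, and the helper names are all assumptions made for illustration. As noted earlier in the thread, the labels changed late in the release, so a rule based on names is only as reliable as the labels themselves.

```python
import re

# Hypothetical folder names for one site (layer and width fixed).
# The `average_l0_<N>` suffix is an assumed naming convention.
SAE_PATHS = [
    "layer_20/width_16k/average_l0_20",
    "layer_20/width_16k/average_l0_71",
    "layer_20/width_16k/average_l0_139",
    "layer_20/width_16k/average_l0_408",
]

L0_PATTERN = re.compile(r"average_l0_(\d+)$")


def labelled_l0(path: str) -> int:
    """Return the L0 encoded in the folder name (not a measured value)."""
    match = L0_PATTERN.search(path)
    if match is None:
        raise ValueError(f"no average_l0 label found in {path!r}")
    return int(match.group(1))


def closest_to_target(paths: list[str], target: int = 100) -> str:
    """Pick the path whose labelled L0 is nearest to `target`.

    On a tie, the first matching path in `paths` is returned.
    """
    return min(paths, key=lambda p: abs(labelled_l0(p) - target))


if __name__ == "__main__":
    for path in SAE_PATHS:
        print(f"{path}: labelled L0 = {labelled_l0(path)}")
    print("closest to 100:", closest_to_target(SAE_PATHS))
```

With the made-up list above, the rule selects `layer_20/width_16k/average_l0_71`. A different target may suit other use cases, which is why `target` is a parameter rather than a constant.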

ArthurConmyGDM changed discussion status to closed