Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
AI & ML interests
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Paper • 2602.15515 • Published
taufeeque/mbpp-hardcode
Viewer • Updated • 974 • 1.22k
AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe
Updated • 2
models 629
AlignmentResearch/diverse-deception-probe-olmo-3-32b-think
Updated
AlignmentResearch/diverse-deception-probe-gemma-3-12b-it
Updated
AlignmentResearch/diverse-deception-probe-qwen3-8b
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-think
Updated
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.001-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl1-det1-seed3-mbpp_probe
Updated • 4
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.01-det1-seed3-mbpp_probe
Updated • 2
datasets 99
AlignmentResearch/deceptive-followup-v19
Viewer • Updated • 49.4k
AlignmentResearch/deceptive-followup-v17
Viewer • Updated • 44.2k
AlignmentResearch/deceptive-followup-v16
Viewer • Updated • 42.6k
AlignmentResearch/deceptive-followup-v15
Viewer • Updated • 44.8k • 51
AlignmentResearch/deceptive-followup-v13
Viewer • Updated • 39.7k • 120
AlignmentResearch/deceptive-followup-v11
Viewer • Updated • 32.6k • 30
AlignmentResearch/deceptive-followup-v9
Viewer • Updated • 30.3k • 25
AlignmentResearch/deceptive-followup-v7
Viewer • Updated • 28k • 89
AlignmentResearch/deceptive-followup-v6
Viewer • Updated • 24.7k • 47
AlignmentResearch/deceptive-followup-v5
Viewer • Updated • 21k • 50