PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models • arXiv:2406.15513 • Published Jun 20, 2024
ProgressGym: Alignment with a Millennium of Moral Progress • arXiv:2406.20087 • Published Jun 28, 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark • arXiv:2304.03279 • Published Apr 6, 2023
When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning • arXiv:2402.17747 • Published Feb 27, 2024
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game • arXiv:2311.01011 • Published Nov 2, 2023