To submit a new benchmark to the library:

Implement the new benchmark in a standard format (such as the METR Task Standard). This includes specifying the exact instructions for each task, as well as the task environment provided inside the container in which the agent runs.
We encourage developers to support running their tasks on separate VMs and to specify the exact hardware requirements for each task as part of the task environment.
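
As a rough illustration of the two points above, the sketch below shows one way a task family might bundle exact per-task instructions with per-task hardware requirements. It is loosely modeled on the `TaskFamily` pattern used by the METR Task Standard, but it is not a conforming implementation: the `HardwareSpec` class, the `hardware` field, the task names, and the version string are all assumptions made for this example, so check the standard you target for its actual required fields.

```python
from dataclasses import dataclass
from typing import Dict, Optional, TypedDict


@dataclass(frozen=True)
class HardwareSpec:
    """Hypothetical per-task hardware requirements (illustrative only)."""
    cpus: int
    memory_gb: int
    gpu: Optional[str] = None  # e.g. a GPU model name, or None if not needed


class Task(TypedDict):
    instructions: str       # exact text shown to the agent
    expected_answer: str    # used only by the scorer, never shown to the agent
    hardware: HardwareSpec  # assumed field, not taken from any standard


class TaskFamily:
    # Placeholder value; use the version of the standard you actually target.
    standard_version = "0.0.0"

    @staticmethod
    def get_tasks() -> Dict[str, Task]:
        """Return every task in the family, keyed by task name."""
        return {
            "count_lines": {
                "instructions": "Report the number of lines in /home/agent/data.txt.",
                "expected_answer": "42",
                "hardware": HardwareSpec(cpus=2, memory_gb=4),
            },
            "sum_column": {
                "instructions": "Report the sum of column 2 in /home/agent/table.csv.",
                "expected_answer": "1300",
                "hardware": HardwareSpec(cpus=4, memory_gb=16),
            },
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        """Return the exact instructions given to the agent for this task."""
        return t["instructions"]

    @staticmethod
    def score(t: Task, submission: str) -> float:
        """Exact-match scoring: 1.0 if the trimmed submission matches, else 0.0."""
        return 1.0 if submission.strip() == t["expected_answer"] else 0.0
```

Keeping the hardware requirements next to the instructions means a runner can read both from the same task definition when it provisions a VM or container for that task.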