### To submit **a new agent** to the CORE leaderboard, follow these steps:
1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory from which it is invoked on each run. The content of this file must be in JSON format and include **at least** the keys `cost` and `agent_trace`:
```json
{
"cost": 0.59,
"agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
}
```
- **`cost`**: A float representing the total cost (USD) of API calls made by the agent. We recommend using [Weave](https://github.com/wandb/weave) for easy cost logging.
- **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by [SWE-Bench](https://www.swebench.com/submit.html):
- Human-readable.
- Reflects the intermediate steps your system took that led to the final solution.
- Generated with the inference process, not post-hoc.
If you have any trouble implementing this, feel free to reach out to us for support.
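For reference, here is a minimal sketch of one way to produce this file, assuming your agent is written in Python. The helper name and the `total_cost_usd` / `steps` variables are illustrative; only the file name and the two keys are required by the harness.
```python
import json

def write_agent_trace(total_cost_usd: float, steps: list[str]) -> None:
    """Illustrative helper that writes the agent_trace.log file described above."""
    record = {
        "cost": total_cost_usd,           # total USD spent on API calls
        "agent_trace": "\n".join(steps),  # human-readable intermediate steps
    }
    # The harness looks for this file in the base directory the agent is invoked from.
    with open("agent_trace.log", "w") as f:
        json.dump(record, f)

# Example: call this once at the end of a run.
# write_agent_trace(0.59, ["Read the task README", "Installed dependencies", "Ran the analysis script"])
```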
2. **Run your agent** on all tasks of the test set. You will almost certainly need to run your agent with our Azure VM harness (using the `--use_azure` flag) to avoid prohibitively long experiment times. Set the `--experiment_name` flag to the name of your agent. You can submit results for any of the three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, or CORE-Bench-Hard.
3. **Submit the following two directories from the harness**:
- `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
- `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
    - These files are automatically generated by the harness when you run your agent. Do not modify them manually.
Compress these directories into two `.tar.gz` or `.zip` files and email them to [zss@princeton.edu](mailto:zss@princeton.edu) (see the sketch at the end of this page for one way to create the archives). If the files are too large to email, please upload them to Google Drive, Dropbox, etc., and email the link. **In the body of the email, please also include the name of your agent as you wish it to be displayed on the leaderboard.**
4. [Optional] We highly encourage you to also submit the files of your agent (i.e., `benchmark/agents/[agent_name]`) so that we can verify its performance on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
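For steps 3 and 4, any standard archiving tool works. As a convenience, here is a sketch using Python's `shutil`; the `experiment_name` and `agent_name` values are placeholders for your own.
```python
import shutil

experiment_name = "my_agent"  # the value you passed to --experiment_name
agent_name = "my_agent"       # your directory under benchmark/agents (optional, step 4)

# Step 3: the results and logs produced by the harness.
shutil.make_archive(f"results_{experiment_name}", "gztar",
                    root_dir="benchmark/results", base_dir=experiment_name)
shutil.make_archive(f"logs_{experiment_name}", "gztar",
                    root_dir="benchmark/logs", base_dir=experiment_name)

# Step 4 (optional): your agent's own files.
shutil.make_archive(f"agent_{agent_name}", "gztar",
                    root_dir="benchmark/agents", base_dir=agent_name)
```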