We've been improving Transformers Agents a lot lately.
So with @sergeipetrov we set out to prove that it's the best agent framework out there.
To do that, we took on the GAIA leaderboard, the most comprehensive benchmark out there for evaluating LLM agents. Its questions make you explore different flavours of pain:
🛠️ Require using tools, at least a web browser
🔢 Rigorous logic, with many questions having strong math aspects
🖼️ Multimodal, the agent had to handle all kinds of file types: 🖼️, 🎬...
👣 Multi-step, with many questions requiring over 10 steps to be solved.
Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"

(no file attached of course, the agent has to find all the info)
➡️ We used Transformers Agents' React Code Agent, which writes its actions in code. We also built a new planning component that we'll incorporate into the framework. More info soon in a blog post!
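For the curious, here's a toy sketch of what "writing actions in code" can look like. It is not the Transformers Agents implementation: `scripted_model`, `run_agent` and `final_answer` are hypothetical names made up for this example, and the "LLM" is just a scripted stand-in so the loop runs offline.

```python
import contextlib
import io


def scripted_model(history):
    """Hypothetical stand-in for an LLM: returns the next code action."""
    actions = [
        "minutes = 2 * 60 + 35\nprint(minutes)",
        "final_answer(minutes)",
    ]
    # One action per turn: count the observations seen so far.
    step = sum(1 for role, _ in history if role == "observation")
    return actions[step]


class FinalAnswer(Exception):
    """Raised by the final_answer tool to stop the loop."""

    def __init__(self, value):
        self.value = value


def final_answer(value):
    raise FinalAnswer(value)


def run_agent(task, model, max_steps=5):
    history = [("task", task)]
    state = {"final_answer": final_answer}  # variables persist across steps
    for _ in range(max_steps):
        code = model(history)  # the action is a Python snippet
        history.append(("action", code))
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                # Assumption: plain exec for brevity. Never do this with
                # untrusted model output; use a sandboxed interpreter.
                exec(code, state)
        except FinalAnswer as done:
            return done.value
        except Exception as err:
            buffer.write(f"Error: {err!r}")  # errors become observations
        history.append(("observation", buffer.getvalue()))
    return None  # ran out of steps


answer = run_agent("How many minutes are in 2 h 35 min?", scripted_model)
```

The point of the design: because each action is code, one step can chain several operations and keep intermediate variables around for the next step, and anything printed (or any error) is fed back to the model as the observation.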
Results:
Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's AutoGen.
On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.