𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫𝐬 𝐀𝐠𝐞𝐧𝐭𝐬 𝐫𝐞𝐚𝐜𝐡𝐞𝐬 𝐭𝐡𝐞 𝐭𝐨𝐩 𝐨𝐟 𝐭𝐡𝐞 𝐆𝐀𝐈𝐀 𝐥𝐞𝐚𝐝𝐞𝐫𝐛𝐨𝐚𝐫𝐝! 🥳

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To do so, we took on the 𝗚𝗔𝗜𝗔 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ 𝗥𝗲𝗾𝘂𝗶𝗿𝗲 𝘂𝘀𝗶𝗻𝗴 𝘁𝗼𝗼𝗹𝘀, at least a web browser
🔢 𝗥𝗶𝗴𝗼𝗿𝗼𝘂𝘀 𝗹𝗼𝗴𝗶𝗰, with many questions having a strong math component
🖼️ 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹, the agent has to handle all file types: 🔊, 🖼️, 🎬...
👣 𝗠𝘂𝗹𝘁𝗶-𝘀𝘁𝗲𝗽, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(𝘯𝘰 𝘧𝘪𝘭𝘦 𝘢𝘵𝘵𝘢𝘤𝘩𝘦𝘥 𝘰𝘧 𝘤𝘰𝘶𝘳𝘴𝘦, 𝘵𝘩𝘦 𝘢𝘨𝘦𝘯𝘵 𝘩𝘢𝘴 𝘵𝘰 𝘧𝘪𝘯𝘥 𝘢𝘭𝘭 𝘵𝘩𝘦 𝘪𝘯𝘧𝘰)

➡️ We used Transformers Agents' React Code Agent, which writes its actions as code. We also created a new planning component that we'll incorporate into the framework. More info soon in a blog post!
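
To give a feel for what "writing actions as code" means, here is a toy sketch of a ReAct-style loop. It is not the Transformers Agents implementation or its API: `run_code_agent`, `make_scripted_llm`, `fake_search` and `final_answer` are hypothetical names made up for illustration, and a scripted function stands in for the LLM.

```python
import contextlib
import io


def run_code_agent(scripted_llm, tools, max_steps=5):
    """Toy ReAct-style loop: the model emits Python code as its action,
    we execute it, and feed the printed output back as the observation."""
    answer = {}

    def final_answer(value):
        # Hypothetical helper the generated code calls to end the run.
        answer["value"] = value

    # Shared namespace, so variables persist from one step to the next.
    namespace = {**tools, "final_answer": final_answer}
    history = []  # (code, observation) pairs shown to the model
    for _ in range(max_steps):
        code = scripted_llm(history)  # model proposes the next action
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)  # act: run the generated code
            observation = buffer.getvalue().strip()
        except Exception as err:  # errors become observations too
            observation = f"Error: {err!r}"
        history.append((code, observation))
        if "value" in answer:
            return answer["value"], history
    return None, history


def make_scripted_llm(actions):
    # Stand-in for a real LLM: replays a fixed list of code actions.
    def scripted_llm(history):
        return actions[len(history)]
    return scripted_llm


def fake_search(query):
    # Hypothetical tool; a real agent would call a web browser here.
    return {"minutes in 2 hours": "120"}.get(query, "no result")


llm = make_scripted_llm([
    "result = search('minutes in 2 hours')\nprint(result)",
    "final_answer(int(result) + 5)",
])
answer, history = run_code_agent(llm, {"search": fake_search})
print(answer)
```

Each step's printed output becomes the observation for the next one, and an exception is fed back as an observation too instead of crashing the run, so the model gets a chance to correct itself.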

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
🚀 Our submission ranks #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's AutoGen.
🥇 On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

𝙂𝙤 𝙘𝙝𝙚𝙘𝙠 𝙤𝙪𝙩 𝙩𝙝𝙚 𝙡𝙚𝙖𝙙𝙚𝙧𝙗𝙤𝙖𝙧𝙙 👉 gaia-benchmark/leaderboard
