Dang Nguyen

dangmn

AI & ML interests

None yet

Recent Activity

Organizations

Tianyi Lab @ UMD

dangmn's activity

upvoted 12 papers 3 months ago
New activity in gaia-benchmark/leaderboard 3 months ago

Error at test submission

#22 opened 4 months ago by dangmn
reacted to m-ric's post with 🔥 6 months ago
๐“๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐€๐ ๐ž๐ง๐ญ๐ฌ ๐ซ๐ž๐š๐œ๐ก๐ž๐ฌ ๐ญ๐ก๐ž ๐ญ๐จ๐ฉ ๐จ๐Ÿ ๐†๐€๐ˆ๐€ ๐ฅ๐ž๐š๐๐ž๐ซ๐›๐จ๐š๐ซ๐! ๐Ÿฅณ

We've been improving Transformers Agents a lot lately.

So with @sergeipetrov we set out to prove that it's the best agent framework out there.

To prove this, we set out to top the leaderboard of GAIA, the most comprehensive benchmark out there for evaluating LLM agents.
Its questions make you explore different flavours of pain:

🛠️ Require using tools, at least a web browser
🔢 Rigorous logic, many questions having strong math aspects
🖼️ Multimodal, the agent had to handle all file types: 🔊, 🖼️, 🎬...
👣 Multi-step, with many questions requiring over 10 steps to be solved.

Some Level 3 questions are crazy hard 😳
> "In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute?"
(no file attached of course, the agent has to find all the info)

➡️ We used Transformers Agents' React Code Agent, which writes its actions as code. We also created a new planning component that we'll incorporate into the framework. More info soon in a blog post!
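
For readers unfamiliar with the idea of an agent that "writes its actions in code", here is a minimal sketch of such a loop. It is not the Transformers Agents implementation: `scripted_model`, `run_code_agent`, and the `final_answer` tool are hypothetical names made up for this illustration, the "model" is a hard-coded script standing in for an LLM, and the planning component mentioned above is not represented at all.

```python
import contextlib
import io


class FinalAnswer(Exception):
    """Raised by the final_answer tool to end the agent loop."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def final_answer(value):
    # Hypothetical tool: the agent calls it from its code to return a result.
    raise FinalAnswer(value)


def scripted_model(history):
    # Hypothetical stand-in for an LLM: picks the next code action
    # based on how many actions have already been taken.
    step = sum(1 for role, _ in history if role == "action")
    script = [
        "minutes = 5 * 60 + 21\nprint(minutes)",
        "final_answer(minutes)",
    ]
    return script[step]


def run_code_agent(task, model, tools, max_steps=10):
    history = [("task", task)]
    state = dict(tools)  # variables persist from one step to the next
    for _ in range(max_steps):
        code = model(history)
        history.append(("action", code))
        buffer = io.StringIO()
        try:
            # Execute the action and capture what it prints.
            # NOTE: exec on model output is unsafe outside a sandbox.
            with contextlib.redirect_stdout(buffer):
                exec(code, state)
        except FinalAnswer as done:
            return done.value, history
        except Exception as err:
            # Errors are fed back as observations so the model can retry.
            history.append(("observation", f"Error: {err!r}"))
            continue
        history.append(("observation", buffer.getvalue().strip()))
    return None, history


answer, history = run_code_agent(
    "Convert 5 h 21 min to minutes.",
    scripted_model,
    {"final_answer": final_answer},
)
print(answer)
```

Each step alternates an action (a code snippet) with an observation (its printed output or error), and the loop stops as soon as the code calls the `final_answer` tool or the step budget runs out.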

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
๐Ÿš€ Our submission scores #2 overall on the test set and #1 on the validation set. On both sets we're the leading submission based on a public framework, beating Microsoft's Autogen.
๐Ÿฅ‡ On both sets we are #1 on the hardest Level 3 questions, reaching nearly 20%.

Go check out the leaderboard 👉 gaia-benchmark/leaderboard