Evals - a michaelchen Collection

Evals

updated Jul 23

SciCode: A Research Coding Benchmark Curated by Scientists

Paper • 2407.13168 • Published Jul 18 • 13

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Paper • 2407.15711 • Published Jul 22 • 9

The Vision of Autonomic Computing: Can LLMs Make It a Reality?

Paper • 2407.14402 • Published Jul 19 • 13