GIFT Eval
GIFT-Eval: A Benchmark for General Time Series Forecasting
Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
GIFT-Eval: A Benchmark for General Time Series Forecasting
A realistic benchmark with real CRM tasks for LLM agents.
View and submit LLM benchmark evaluations
Filter and view LLM benchmark data
Explore efficient reasoning techniques with large language models
Generate captions and chat about images