BigCodeArena: Judging code generations end to end with code executions

Community Article · Published October 7, 2025

Evaluating the quality of AI-generated code is notoriously difficult. While humans can easily spot whether a piece of code "looks right," determining if it actually works correctly, handles edge cases properly, and produces the intended result requires running and testing it. That's why, today, we're thrilled to announce BigCodeArena -- the first human-in-the-loop platform for evaluating code generation models through execution.

Inspired by LMArena for LLMs, we've built a platform that allows anyone to compare code generation models side-by-side, but with a crucial difference: you can actually run the code and see what it produces. Just submit a coding task, watch two different models generate solutions, execute both programs, and vote on which model produced better results. The outcomes are organized into a leaderboard that displays the community's highest-rated models.

Motivation

The field of code generation has long struggled with reliable evaluation methods. Traditional benchmarks like HumanEval test code against predefined test cases, but these represent only a tiny fraction of real-world programming tasks. Human evaluation platforms exist for general chatbots, but they fall short for code: reading raw source code and mentally simulating its execution is cognitively demanding and error-prone, especially for longer programs or complex UI applications.

Consider this scenario:

You ask two AI models to build a responsive photo gallery website. Both generate code that looks syntactically correct. But which one is actually better? Without running the code, it's nearly impossible to tell. One might produce a beautiful, functional grid layout, while the other might have subtle bugs or poor styling that only become apparent when rendered in a browser.

Figure: Demo of the BigCodeArena comparison workflow.

This observation led us to a key insight: execution feedback is essential for humans to judge code quality reliably. That's exactly what BigCodeArena provides.

The BigCodeArena Platform

BigCodeArena extends the Chatbot Arena framework with powerful features specifically designed for code evaluation:

Real-Time Execution

Every code snippet generated by models is automatically executed in isolated sandbox environments. Whether it's a Python script, a React web app, a PyGame game, or a C++ algorithm, you can see the actual output, not just the source code.
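To make this concrete, here is a minimal sketch of what executing a generated Python snippet in an isolated process with a timeout might look like. It is an illustration only (the function name and return format are made up for this post); the real BigCodeArena sandboxes add container-level isolation, dependency installation, and UI rendering on top of plain process isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_python_snippet(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a separate process and capture its output.

    Simplified illustration: no network or filesystem restrictions are enforced here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,  # keep file side effects inside the temporary directory
            )
            return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}
        except subprocess.TimeoutExpired:
            return {"stdout": "", "stderr": f"timed out after {timeout_s}s", "exit_code": None}

print(run_python_snippet("print(sum(range(10)))"))  # {'stdout': '45\n', 'stderr': '', 'exit_code': 0}
```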

Multi-Language & Framework Support

We currently support 10 languages (Python, JavaScript, TypeScript, HTML, C, C++, Java, Go, Rust, and Markdown) and 8 execution environments:

  • Web Frameworks: React, Vue, Core Web (vanilla HTML/CSS/JS)
  • Python Frameworks: Streamlit, Gradio, PyGame
  • Diagrams: Mermaid
  • General Purpose Interpreters: Python and JavaScript code interpreters, plus compiled language runners

Interactive Testing

Unlike static code comparison, you can actually interact with the generated applications:

  • Click buttons and test UI elements in web apps
  • Play the games generated by models
  • Edit the code and re-run it to test modifications
  • View visual outputs like plots, charts, and diagrams

Figure: The BigCodeArena interface with side-by-side code execution.

Multi-Turn Conversations

Real programming isn't one-and-done. BigCodeArena supports multi-turn interactions, allowing you to refine requirements, ask for features to be added, or request bug fixes -- just like working with a real coding assistant.

What We've Learned: 5 Months of Community Evaluation

Since launching in February 2025, BigCodeArena has collected over 14,000 conversations from more than 500 unique users, with 4,700+ high-quality preference votes comparing 10 frontier LLMs.

Programming Topics in the Wild

Our users have explored remarkably diverse coding scenarios:

  • Web Design (36%): Building responsive websites, interactive dashboards, and web applications
  • Problem Solving (23%): Algorithms, data structures, and computational challenges
  • Game Development (16%): Creating interactive games with physics, collision detection, and graphics
  • Scientific Computing (14%): Data analysis, visualization, and numerical simulations
  • Creative Coding (8%): Artistic visualizations, generative art, and experimental interfaces
  • Diagram Creation (3%): Flowcharts, system architectures, and data visualizations

Language and Framework Popularity

Python dominates with over 4,000 conversations, followed by JavaScript/TypeScript (3,359), HTML (1,601), and C++ (642). Among frameworks, direct Python interpreters lead usage (6,000 sessions), with React (2,729), Core Web (1,574), Streamlit (1,254), and PyGame (1,087) also seeing heavy use.

User Interaction Patterns

Most interactions are focused and efficient: 76% of conversations consist of just 2 turns (one request, one response), with a mean conversation length of 4.12 messages. However, the platform supports extended multi-turn debugging sessions when needed, with some conversations exceeding 10 turns as users refine complex applications.

Model Rankings from Community Votes

From our 14K conversations, we filtered for high-quality pairwise comparisons: conversations with at least two turns and actual code execution. This yielded 4,731 voting samples, with each evaluated model receiving at least 700 votes. We aggregate these votes into Elo ratings using the Bradley-Terry model, which estimates the probability that one model beats another based on head-to-head comparisons.

To ensure robust rankings, we use 100 bootstrap resamples to construct 95% confidence intervals, so we can identify statistically significant performance differences between models.
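If you want to reproduce this kind of ranking on your own vote data, the sketch below fits Bradley-Terry strengths with simple minorization-maximization updates and bootstraps confidence intervals. It assumes a hypothetical vote table with model_a, model_b, and winner columns, ignores ties, and is not the exact aggregation code behind the leaderboard.

```python
import numpy as np
import pandas as pd

def fit_bradley_terry(votes: pd.DataFrame, models: list[str], iters: int = 200) -> pd.Series:
    """Fit Bradley-Terry strengths from pairwise votes; report on an Elo-like scale."""
    idx = {m: i for i, m in enumerate(models)}
    wins = np.zeros((len(models), len(models)))
    for _, row in votes.iterrows():
        a, b = idx[row["model_a"]], idx[row["model_b"]]
        if row["winner"] == "model_a":
            wins[a, b] += 1
        elif row["winner"] == "model_b":
            wins[b, a] += 1

    # Minorization-maximization updates for the Bradley-Terry model.
    p = np.ones(len(models))
    for _ in range(iters):
        n_ij = wins + wins.T                                   # comparisons per pair
        denom = (n_ij / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)
        p = np.maximum(p, 1e-12)                               # guard models with zero wins
        p /= p.sum()
    return pd.Series(1000 + 400 * np.log10(p / p.mean()), index=models)

def bootstrap_intervals(votes: pd.DataFrame, models: list[str],
                        n_boot: int = 100, alpha: float = 0.05) -> pd.DataFrame:
    """Percentile confidence intervals from bootstrap resamples of the vote table."""
    resamples = [
        fit_bradley_terry(votes.sample(len(votes), replace=True), models)
        for _ in range(n_boot)
    ]
    return pd.concat(resamples, axis=1).quantile([alpha / 2, 1 - alpha / 2], axis=1).T
```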

Figure: Elo ratings of the evaluated models under the three comparison settings, with 95% confidence intervals.

We evaluate models under three settings to control for different factors:

  1. All Data: Uses all pairwise comparisons regardless of execution environment or programming language
  2. Environment Matched: Only compares models when both were executed in the same sandbox (e.g., both in React or both in PyGame)
  3. Language Matched: Further restricts comparisons to the same programming language
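As a rough illustration of how these settings can be realized as filters over the vote table, here is a small sketch. The env_a/env_b and lang_a/lang_b column names are hypothetical; the actual pipeline may record this metadata differently.

```python
import pandas as pd

def filter_comparisons(df: pd.DataFrame, setting: str) -> pd.DataFrame:
    """Select pairwise comparisons for one of the three ranking settings."""
    if setting == "all":
        return df
    if setting == "env_matched":      # both responses ran in the same sandbox
        return df[df["env_a"] == df["env_b"]]
    if setting == "lang_matched":     # same sandbox and same programming language
        return df[(df["env_a"] == df["env_b"]) & (df["lang_a"] == df["lang_b"])]
    raise ValueError(f"unknown setting: {setting}")
```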

Rankings remain remarkably consistent across all three settings, revealing clear performance tiers:

Top Tier: o3-mini and o1-mini consistently lead with the highest Elo ratings and tight confidence intervals. These models maintain top performance regardless of environment or language constraints, showing strong robustness across coding scenarios. Claude-3.5-Sonnet follows closely, particularly excelling when language is controlled.

Mid Tier: GPT-4o, o1, and Gemini-2.0-Pro/Flash form a competitive middle tier. GPT-4o shows some sensitivity to language matching, suggesting room for improvement in multilingual consistency.

Open Source Models: Qwen2.5 variants and Llama-3.3-70B lag behind frontier proprietary models, highlighting the performance gap that remains between leading closed and open models.


Figure: Overall win-rate heatmaps (percentage of pairwise comparisons won) for each model, broken down by programming language (left) and execution environment (right). Each category only includes models that appear in at least 3 conversation sessions.

Performance Across Languages

Breaking down performance by programming language reveals interesting patterns:

  • Top-tier models like o3-mini and o1-mini achieve dominant win rates in mainstream languages like Python, Java, and C++
  • Gemini-2.0-Pro shows particular strength in Rust, achieving the highest win rate in that category
  • Different models exhibit distinct areas of expertise, with frontier models excelling in different niches
  • Open models like Qwen2.5 variants show inconsistent performance, particularly struggling with Rust and Go

Performance Across Execution Environments

Analyzing win rates by execution environment reveals how models handle different runtime contexts:

Robust Performers: o3-mini maintains consistently strong performance across React, Streamlit, Gradio, Core Web, and PyGame, demonstrating excellent environmental adaptability.

Stable but Selective: Claude-3.5-Sonnet and Gemini-2.0-Flash show generally stable performance but with reduced win rates in complex UI-heavy environments like Vue and Mermaid.

Framework-Specific Weaknesses: Qwen2.5 models, while competitive in some web frameworks (Core Web, React), struggle significantly with interactive and visualization-oriented environments like PyGame, Vue, and Mermaid. These environments often require precise handling of control flow, graphics rendering, and package dependencies.

These results highlight an important insight: aggregate Elo scores don't tell the whole story. Some models remain brittle under specific runtime constraints, and execution environment matters significantly for real-world deployment.

Two New Benchmarks: BigCodeReward and AutoCodeArena

To advance research beyond crowdsourced evaluation, we're releasing two complementary benchmarks:

BigCodeReward: Evaluating Reward Models for Code

Building on our 4,700+ preference votes, BigCodeReward tests how well LLMs can judge code quality when acting as reward models. The key finding? Execution results dramatically improve judgment accuracy.

When models can see execution outputs (screenshots of web apps, game visuals, program logs), their alignment with human preferences increases substantially:

  • Claude-Sonnet-4: 56.7% → 62.3% accuracy
  • GPT-4o: 54.6% → 63.8% accuracy
  • Qwen2.5-VL-72B: 58.7% → 66.2% accuracy

This reinforces our core thesis: you can't reliably judge code without running it -- and this applies to both humans and AI judges.
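To illustrate the kind of setup BigCodeReward measures, here is a hedged sketch of a pairwise judge prompt that optionally includes execution results, together with a simple agreement metric against human votes. The prompt wording and field names (stdout, screenshot_path) are hypothetical, not the benchmark's actual harness.

```python
def build_judge_prompt(task, code_a, code_b, exec_a=None, exec_b=None):
    """Assemble a pairwise judging prompt, optionally attaching execution results."""
    parts = [
        "You are judging two candidate solutions to the same coding task.",
        f"Task:\n{task}",
        f"Solution A:\n{code_a}",
        f"Solution B:\n{code_b}",
    ]
    for name, result in (("A", exec_a), ("B", exec_b)):
        if result is not None:
            parts.append(f"Execution output for Solution {name}:\n{result.get('stdout', '')}")
            if result.get("screenshot_path"):
                parts.append(f"[Screenshot of Solution {name}'s rendered UI attached]")
    parts.append("Answer with exactly one of: A, B, or TIE.")
    return "\n\n".join(parts)

def judge_accuracy(judge_votes, human_votes):
    """Fraction of comparisons where the judge's verdict matches the human vote."""
    return sum(j == h for j, h in zip(judge_votes, human_votes)) / len(human_votes)
```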

AutoCodeArena: Automated Code Generation Benchmarks

Inspired by Arena-Hard-Auto, AutoCodeArena provides a scalable way to evaluate new models without waiting for thousands of human votes. We carefully selected 600 representative prompts from our crowdsourced data, spanning all programming topics and frameworks.

Using an automated LLM judge (Claude-3.7-Sonnet) to evaluate code execution results against a GPT-4.1 baseline, we can rapidly benchmark new models. This approach enables weekly leaderboard updates as new models are released.
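As a sketch of how a win rate against a fixed baseline can be computed from per-prompt judge verdicts, assuming ties count as half a win (a common convention that may differ from the official scoring):

```python
import numpy as np

def win_rate_vs_baseline(verdicts, n_boot=1000, seed=0):
    """Win rate of a candidate over the baseline, with a bootstrapped 95% CI.

    `verdicts` holds one judge decision per prompt: "candidate", "baseline", or "tie".
    """
    scores = np.array([1.0 if v == "candidate" else 0.5 if v == "tie" else 0.0
                       for v in verdicts])
    rng = np.random.default_rng(seed)
    boots = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return scores.mean(), (lo, hi)
```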

Our automated benchmark evaluated 20+ cutting-edge models, including recently released systems:

Top Performers:

  1. GPT-5 -- Establishes new state-of-the-art by a significant margin
  2. Claude-Opus-4 and Claude-Sonnet-4 -- Strong second tier, excelling in reasoning-heavy tasks
  3. Qwen3-Coder, Kimi-K2, GLM-4.5 -- Leading open models that narrow the gap with mid-tier proprietary systems


Figure: Win rates of recent LLMs on AutoCodeArena against a GPT-4.1 baseline, judged by Claude-3.7-Sonnet. The 50% mark represents parity with GPT-4.1. Models above this line outperform the baseline, while those below underperform. Error bars show 95% confidence intervals. Note: Claude-3.7-Sonnet is excluded from rankings to avoid self-judgment bias, and GPT-4.1 appears only as the reference baseline.

The results show that while proprietary models maintain an edge, open-source models are rapidly closing the gap, with some approaching GPT-4.1-level performance.

Try It Yourself

BigCodeArena is open to everyone -- no account required! Visit https://huggingface.co/spaces/bigcode/arena to:

  • Compare code from recent frontier LLMs (e.g., Qwen3, DeepSeek-V3.X, and the latest proprietary models)
  • Test web apps, games, visualizations, and algorithms
  • See real execution results, not just source code
  • Vote on your preferences to help improve the leaderboard
  • Explore multi-turn coding conversations

Whether you're building a React dashboard, creating a PyGame game, solving algorithmic challenges, or generating creative visualizations, BigCodeArena lets you see which models truly deliver.

Open Source Everything

Following the BigCode Project's commitment to transparency, we're releasing:

  • Codebase: Full evaluation pipelines and Gradio application source (GitHub)
  • Crowdsourced Data: 14K raw conversations and 4.7K preference votes (Hugging Face Collection)
  • Benchmarks: BigCodeReward and AutoCodeArena datasets

What's Next?

We envision BigCodeArena as a long-term project that evolves with the community:

  • Expanded Language Support: More programming languages and frameworks
  • Live Benchmarks: Continuously refreshed evaluation prompts to prevent overfitting
  • Agent-Based Evaluation: Using AI agents to interact with web apps for deeper testing
  • Better Reward Models: Advancing automated code quality assessment
  • Community Contributions: We welcome new execution environments, evaluation criteria, and model additions. PRs are always welcome!

Conclusion

Evaluating code isn't like evaluating text -- you need to run it, test it, and interact with it. BigCodeArena makes this possible at scale, combining human judgment with real execution feedback to create the most reliable evaluation platform for code generation models.

Join us in building the future of code generation evaluation. Write a prompt, compare the models, and vote for your favorite. Your feedback helps the entire community understand which models truly deliver on the promise of AI-assisted programming.

We'd love to hear your feedback! Connect with us on GitHub, join discussions in the Hugging Face Space community tab, or reach out to the BigCode Project at contact@bigcode-project.org.

Acknowledgements

We thank Leandro von Werra for his valuable suggestions and feedback on the blog.

Citation

@article{zhuo2025bigcodearena,
    title={BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution},
    author={Terry Yue Zhuo and Xiaolong Jin and Hange Liu and Juyong Jiang and Tianyang Liu and Chen Gong and Bhupesh Bishnoi and Vaisakhi Mishra and Marek Suppa and Noah Ziems and Saiteja Utpala and Ming Xu and Guangyu Song and Kaixin Li and Yuhan Cao and Bo Liu and Zheng Liu and Sabina Abdurakhmanova and Wenhao Yu and Mengzhao Jia and Jihan Yao and Kenneth Hamilton and Kumar Shridhar and Minh Chien Vu and Dingmin Wang and Jiawei Liu and Zijian Wang and Qian Liu and Binyuan Hui and Meg Risdal and Ahsen Khaliq and Atin Sood and Zhenchang Xing and Wasi Uddin Ahmad and John Grundy and David Lo and Banghua Zhu and Xiaoning Du and Torsten Scholak and Leandro von Werra},
    year={2025}
}

Try BigCodeArena now: Hugging Face Space

Read the paper: Hugging Face

Run the code: GitHub

Explore the collection: Hugging Face Collection
