arxiv:2510.04363

MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

Published on Oct 5

Authors:

Abstract

MacroBench evaluates LLMs' ability to generate reusable browser automation scripts from natural language by assessing code quality, safety, and functionality across various tasks.

AI-generated summary

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.04363 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.04363 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.