HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Abstract
HomeSafe-Bench is a comprehensive benchmark for evaluating vision-language models on unsafe action detection in household environments, accompanied by HD-Guard, a hierarchical dual-brain architecture that balances real-time safety monitoring with detection accuracy.
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of common-sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation, and features 438 diverse cases across six functional areas with fine-grained, multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
Community
As embodied agents and household robots move from controlled factories into real homes, safety becomes a critical challenge.
Unlike structured environments, households are dynamic and unpredictable, where perception delays or missing common-sense reasoning can easily lead to dangerous actions (e.g., placing metal in a microwave).
However, current safety benchmarks mostly focus on text, static images, or general hazards, leaving a key question unanswered:
Can Vision-Language Models detect unsafe robot actions in real household scenarios?
To address this gap, we introduce HomeSafe-Bench and HD-Guard, a benchmark and real-time detection system for household embodied agent safety.
🚀 TL;DR
• HomeSafe-Bench — a video-based benchmark for unsafe action detection in household environments
• 438 hazard cases across 6 household functional areas with fine-grained annotations
• HD-Guard — a hierarchical dual-brain safety monitor combining fast screening with deep reasoning
• Achieves strong latency–accuracy trade-offs for real-world deployment
📊 HomeSafe-Bench
HomeSafe-Bench evaluates Vision-Language Models (VLMs) on detecting unsafe behaviors of embodied agents in household scenarios.
Key properties
- 🎥 438 hazard videos
- 🏠 6 household functional areas
- 🧠 multi-dimensional annotations
- ⚠️ diverse unsafe behaviors
The dataset is generated through a hybrid pipeline combining LLM-driven hazard discovery, physical simulation, and video generation, ensuring both physical realism and scenario diversity.
🧠 HD-Guard: Dual-Brain Safety Detection
We further propose HD-Guard, a hierarchical dual-brain architecture for real-time safety monitoring.
FastBrain
- lightweight streaming VLM
- high-frequency frame-level safety screening
SlowBrain
- large-scale VLM
- deep multimodal reasoning for uncertain cases
This design enables continuous monitoring while maintaining strong reasoning capability, achieving an effective latency–performance balance.
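The fast/slow coordination described above can be sketched as a small event loop: a lightweight screener labels every frame immediately, and only uncertain frames are queued for asynchronous deep reasoning. This is a minimal illustrative sketch, not the paper's implementation; all class names, thresholds, and the `risk_score` field are hypothetical stand-ins for the actual VLM calls.

```python
import queue
import threading

SAFE, UNSAFE, UNCERTAIN = "safe", "unsafe", "uncertain"

class FastBrain:
    """Stand-in for a lightweight streaming VLM (hypothetical thresholds)."""
    def screen(self, frame):
        score = frame.get("risk_score", 0.0)  # placeholder for model confidence
        if score > 0.8:
            return UNSAFE
        if score < 0.2:
            return SAFE
        return UNCERTAIN

class SlowBrain:
    """Stand-in for a large VLM doing deep multimodal reasoning."""
    def reason(self, frame):
        return UNSAFE if frame.get("risk_score", 0.0) > 0.5 else SAFE

class DualBrainGuard:
    """Fast path answers every frame; uncertain frames go to an async worker."""
    def __init__(self):
        self.fast, self.slow = FastBrain(), SlowBrain()
        self.pending = queue.Queue()
        self.alerts = []
        threading.Thread(target=self._drain, daemon=True).start()

    def on_frame(self, frame):
        label = self.fast.screen(frame)
        if label == UNSAFE:
            self.alerts.append(frame)   # immediate low-latency alert
        elif label == UNCERTAIN:
            self.pending.put(frame)     # defer to SlowBrain, non-blocking

    def _drain(self):
        while True:
            frame = self.pending.get()
            if self.slow.reason(frame) == UNSAFE:
                self.alerts.append(frame)
            self.pending.task_done()
```

The key design point this sketch captures is that the SlowBrain never sits on the per-frame critical path: clearly safe and clearly unsafe frames are resolved at FastBrain frequency, and only the ambiguous middle band pays the cost of deep reasoning.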
🔍 Key Findings
Our experiments reveal major limitations in current VLM safety detection:
- models miss critical visual entities
- models struggle with temporal grounding
- models show weak causal reasoning for hazards
HD-Guard mitigates these issues and significantly improves practical safety detection for household embodied agents.
✨ Contributions
• HomeSafe-Bench — the first benchmark for detecting unsafe actions by household embodied agents
• HD-Guard — hierarchical dual-brain safety monitoring architecture
• Comprehensive analysis of VLM limitations in real-world safety detection
💡 HomeSafe-Bench highlights an urgent challenge:
As robots enter everyday homes, reliable safety monitoring becomes essential for embodied AI deployment.