Abstract
HaluMem, a benchmark for evaluating memory hallucinations in AI systems, identifies and analyzes hallucinations across memory extraction, updating, and question answering stages using large-scale human-AI interaction datasets.
Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations rely primarily on end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation-level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct two user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns respectively, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question-answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
Community
HaluMem: Evaluating Hallucinations in Memory Systems of Agents
We introduce HaluMem, the first operation-level benchmark designed to systematically evaluate hallucinations in memory systems of AI agents. Unlike conventional black-box QA benchmarks, HaluMem opens up the internal mechanisms of memory processing, tracking how hallucinations arise, propagate, and impact final outputs across three key operations: memory extraction, updating, and question answering.
🧩 How It Works
HaluMem decomposes memory workflows into granular, traceable stages. It introduces a "three-step hallucination tracing mechanism," which monitors hallucinations from information extraction to memory revision and final retrieval. Each step is paired with fine-grained gold annotations, enabling precise hallucination localization and quantitative analysis.
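To make the idea concrete, here is a minimal sketch of what stage-level tracing could look like; the class, field names, and scoring rules below are illustrative assumptions, not HaluMem's actual API:

```python
# Illustrative sketch of operation-level hallucination tracing.
# Record fields and scoring rules here are assumptions, not HaluMem's API.
from dataclasses import dataclass, field

@dataclass
class StageTrace:
    """Gold annotations vs. system output for one memory operation stage."""
    stage: str                                  # "extraction", "updating", or "qa"
    gold: set[str] = field(default_factory=set)
    predicted: set[str] = field(default_factory=set)

    def fabricated(self) -> set[str]:
        """Items the system produced that have no gold counterpart."""
        return self.predicted - self.gold

    def omitted(self) -> set[str]:
        """Gold items the system failed to produce."""
        return self.gold - self.predicted

def trace_pipeline(stages: list[StageTrace]) -> dict[str, dict[str, int]]:
    """Count fabrications and omissions per stage, not just at final QA."""
    return {s.stage: {"fabricated": len(s.fabricated()),
                      "omitted": len(s.omitted())} for s in stages}
```

Scoring fabricated and omitted items at each stage is what allows hallucinations to be localized before they ever surface in a QA answer.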
🌍 Why It Matters
By exposing hallucinations at their operational roots, HaluMem transforms memory evaluation from opaque outcome testing into interpretable mechanism analysis, thereby laying the empirical foundation for next-generation reliable memory systems for LLMs.
📊 A Dual-Scale Dataset for Realistic Evaluation
HaluMem, built via a six-stage pipeline blending automation, GPT-4o refinement, and human checks, includes:
- HaluMem-Medium: standard multi-turn dialogues (~160k tokens/user);
- HaluMem-Long: 1M-token sessions capturing context drift and memory decay.
Together, they model realistic long-term user-AI interactions to evaluate both short-term reliability and long-term robustness (a loading sketch follows).
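Both scales are distributed on the Hugging Face Hub (see Resources below). A minimal loading sketch with the `datasets` library, assuming the two scales are exposed as named configurations; the config names here are guesses, so check the dataset card for the actual layout:

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# The config names below are assumptions; consult the dataset card of
# IAAR-Shanghai/HaluMem for the actual configuration and split names.
from datasets import load_dataset

medium = load_dataset("IAAR-Shanghai/HaluMem", name="HaluMem-Medium")  # hypothetical config
long_ctx = load_dataset("IAAR-Shanghai/HaluMem", name="HaluMem-Long")  # hypothetical config

print(medium)  # inspect the splits and fields that are actually available
```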
💾 Three Categories of Memory Points
HaluMem categorizes internal memory points into three interpretable types, enabling targeted evaluation of how hallucinations affect distinct aspects of long-term knowledge (a typed sketch follows the list):
- Persona Memory: stable attributes such as identity, interests, personality traits, and enduring beliefs.
- Event Memory: chronological records of experiences, decisions, or specific life events that evolve over time.
- Relationship Memory: representations of social connections, interactions, and opinions toward other entities or users.
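A small sketch of how such typed memory points might be represented in code; the class and field names are hypothetical, chosen only to mirror the three categories above:

```python
# Illustrative representation of a typed memory point.
# Class and field names are hypothetical, not HaluMem's schema.
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    PERSONA = "persona"            # stable attributes: identity, interests, beliefs
    EVENT = "event"                # time-ordered experiences, decisions, life events
    RELATIONSHIP = "relationship"  # social ties and opinions toward other entities

@dataclass
class MemoryPoint:
    content: str          # the extracted fact, e.g. "User is a vegetarian"
    mem_type: MemoryType  # which of the three categories it belongs to
    turn_id: int          # dialogue turn where the fact was introduced

point = MemoryPoint("User is a vegetarian", MemoryType.PERSONA, turn_id=42)
```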
❓ Six Rich Categories of Evaluation Questions
HaluMem automatically generates six categories of evaluation questions, covering the full spectrum from factual recall to contradiction detection: Basic Fact Recall, Multi-hop Inference, Dynamic Update, Memory Boundary, Generalization & Application, and Memory Conflict.
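As an illustration of how results can be broken down along these categories, here is a small per-category accuracy tally; the category strings come from the post, while the result-record layout is an assumption:

```python
# Per-category accuracy bookkeeping. The category names come from the post;
# the result-record layout ({"category", "correct"}) is an assumption.
from collections import defaultdict

CATEGORIES = [
    "Basic Fact Recall", "Multi-hop Inference", "Dynamic Update",
    "Memory Boundary", "Generalization & Application", "Memory Conflict",
]

def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """results: [{"category": str, "correct": bool}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}
```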
🔬 Key Experimental Findings
We conducted a comprehensive evaluation of several state-of-the-art memory systems, including Mem0, Memobase, SuperMemory, and Zep, under consistent parameter settings to ensure a fair comparison. Evaluations of additional memory systems, such as MemOS, will be added on an ongoing basis.
- Universal Long-Context Degradation: all systems perform worse on HaluMem-Long, revealing major weaknesses under extended, noisy contexts.
- Weak Memory Extraction: recall < 60% and accuracy < 62%; systems struggle to capture complete and correct memories (the metric definitions are sketched after this list).
- Fragile Updating: correct-update rate < 50%, omission rate > 50%; the extraction and updating stages are poorly connected.
- QA Bottleneck: answer accuracy < 56%, heavily dependent on upstream memory quality.
- Core Insight: current memory systems lack robustness to context drift, interference, and long-term accumulation, underscoring the need for stronger relevance filtering and stable long-term memory linkage.
- Overall Low Accuracy: all systems show modest accuracy across question categories, revealing substantial room for improvement.
- Strength in Detection: all systems handle memory-boundary and conflict questions relatively well, indicating good recognition of unknown or misleading information.
- Weak in Reasoning: performance drops markedly on multi-hop inference, dynamic update, and generalization & application, showing ongoing challenges in reasoning and preference tracking.
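For reference, here is how the headline percentages plausibly decompose into set-based metrics; these are the standard definitions the numbers suggest, not HaluMem's verified scoring code:

```python
# Plausible set-based definitions behind the reported figures; these are
# standard formulas, not HaluMem's official scoring implementation.
def extraction_metrics(gold: set, extracted: set) -> dict:
    """Recall and precision-style accuracy of memory extraction."""
    recall = len(gold & extracted) / len(gold) if gold else 1.0              # reported: < 60%
    accuracy = len(gold & extracted) / len(extracted) if extracted else 1.0  # reported: < 62%
    return {"recall": recall, "accuracy": accuracy}

def update_metrics(required_updates: set, applied_updates: set) -> dict:
    """Fraction of required updates actually applied, and the omission rate."""
    correct = (len(required_updates & applied_updates) / len(required_updates)
               if required_updates else 1.0)                                 # reported: < 50%
    return {"correct_update": correct, "omission": 1.0 - correct}            # reported: > 50%
```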
📚 Dataset & Resources
🧩 Hugging Face: IAAR-Shanghai/HaluMem
📄 Paper: arXiv:2511.03506
💻 Code & Benchmark Suite: github.com/MemTensor/HaluMem
The following related papers were recommended by the Semantic Scholar API:
- SGMem: Sentence Graph Memory for Long-Term Conversational Agents (2025)
- Evaluating Long-Term Memory for Long-Context Question Answering (2025)
- MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues (2025)
- Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue (2025)
- PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency (2025)
- Mem-α: Learning Memory Construction via Reinforcement Learning (2025)
- Memory-QA: Answering Recall Questions Based on Multimodal Memories (2025)