HaluMem: Evaluating Hallucinations in Memory Systems of Agents

arXiv:2511.03506 · Published on Nov 5 · Submitted by Ding Chen on Nov 11 · #1 Paper of the day

Abstract

HaluMem, a benchmark for evaluating memory hallucinations in AI systems, identifies and analyzes hallucinations across memory extraction, updating, and question answering stages using large-scale human-AI interaction datasets.

AI-generated summary

Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations rely primarily on end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation-level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, respectively, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.

Community

HaluMem: Evaluating Hallucinations in Memory Systems of Agents

We introduce HaluMem, the first operation-level benchmark designed to systematically evaluate hallucinations in memory systems of AI agents. Unlike conventional black-box QA benchmarks, HaluMem opens up the internal mechanisms of memory processing, tracking how hallucinations arise, propagate, and impact final outputs across three key operations: memory extraction, updating, and question answering.
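
As a rough mental model, the three evaluated operations can be captured in a minimal interface. This is an illustrative sketch; the class and method names are assumptions for exposition, not the API of HaluMem or of any evaluated system:

```python
# A minimal sketch of the three operations HaluMem evaluates. All names here
# are illustrative assumptions, not the benchmark's or any system's real API.
from abc import ABC, abstractmethod

class MemorySystem(ABC):
    """An agent memory system reduced to the three evaluated stages."""

    @abstractmethod
    def extract(self, dialogue_turns: list[str]) -> list[str]:
        """Memory extraction: distill memory points from raw dialogue."""

    @abstractmethod
    def update(self, new_points: list[str]) -> None:
        """Memory updating: revise stored memories as new information arrives."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Memory question answering: respond from the current memory store."""
```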

🧩 How It Works

HaluMem decomposes memory workflows into granular, traceable stages. It introduces a "three-step hallucination tracing mechanism" that monitors hallucinations from information extraction through memory revision to final retrieval. Each step is paired with fine-grained gold annotations, enabling precise hallucination localization and quantitative analysis.
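
To make the tracing idea concrete, here is a minimal sketch of stage-wise hallucination localization, assuming gold annotations are available per stage and that memory points can be compared by exact string match (the benchmark's actual matching and judging procedure may differ):

```python
# Hedged sketch: compare system outputs against gold annotations at each
# stage so an error can be attributed to extraction, updating, or QA.
def localize_hallucinations(system_out: dict[str, set[str]],
                            gold: dict[str, set[str]]) -> dict[str, dict[str, int]]:
    report = {}
    for stage in ("extraction", "updating", "qa"):
        sys_points, gold_points = system_out[stage], gold[stage]
        report[stage] = {
            "fabricated": len(sys_points - gold_points),  # unsupported by gold
            "omitted": len(gold_points - sys_points),     # gold points missed
            "correct": len(sys_points & gold_points),
        }
    return report

# Toy usage: a fabrication introduced at extraction propagates downstream.
sys_out = {"extraction": {"likes tea", "owns a dog"},
           "updating": {"likes tea", "owns a dog"},
           "qa": {"owns a dog"}}
gold = {"extraction": {"likes tea"}, "updating": {"likes tea"}, "qa": set()}
print(localize_hallucinations(sys_out, gold))
```

In this toy example, the fabricated point introduced at extraction survives updating and surfaces at QA, which is exactly the propagation pattern the benchmark is designed to expose.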

πŸ“ˆ Why It Matters

By exposing hallucinations at their operational roots, HaluMem transforms memory evaluation from opaque outcome testing into interpretable mechanism analysis, thereby laying the empirical foundation for next-generation reliable memory systems for LLMs.

πŸ“š A Dual-Scale Dataset for Realistic Evaluation

HaluMem is built via a six-stage pipeline that blends automation, GPT-4o refinement, and human checks. It includes two dataset variants:

  • HaluMem-Medium — standard multi-turn dialogues (~160k tokens/user);
  • HaluMem-Long — 1M-token sessions capturing context drift and memory decay.

Together, they model realistic long-term user–AI interactions to evaluate both short-term reliability and long-term robustness.
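
For readers who want to try the data, a minimal loading sketch follows. The repository id comes from the resources section below; whether the Medium and Long variants are exposed as configs, splits, or separate files is an assumption, so inspect the dataset card first:

```python
from datasets import load_dataset

# Repo id is from the resources section of this post. How the Medium/Long
# variants are organized (configs vs. splits) is an assumption; print the
# returned object and check the dataset card before relying on field names.
ds = load_dataset("IAAR-Shanghai/HaluMem")
print(ds)
```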

πŸ’Ύ Three Categories of Memory Points

HaluMem categorizes internal memory points into three interpretable types (sketched in code after the list), enabling targeted evaluation of how hallucinations affect distinct aspects of long-term knowledge:

  • Persona Memory — stable attributes such as identity, interests, personality traits, and enduring beliefs.
  • Event Memory — chronological records of experiences, decisions, or specific life events that evolve over time.
  • Relationship Memory — representations of social connections, interactions, and opinions toward other entities or users.
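
A minimal sketch of how these three categories might be modeled in code; the field names are assumptions for illustration, not the dataset's actual schema:

```python
# Illustrative data model for the three memory-point categories.
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    PERSONA = "persona"            # stable attributes and enduring beliefs
    EVENT = "event"                # time-stamped experiences and decisions
    RELATIONSHIP = "relationship"  # social connections and opinions

@dataclass
class MemoryPoint:
    content: str
    mtype: MemoryType
    turn_id: int  # dialogue turn the point came from (assumed field)

p = MemoryPoint("Prefers oat milk in coffee", MemoryType.PERSONA, turn_id=42)
```
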
❓ Six Rich Categories of Evaluation Questions

HaluMem automatically generates six categories of evaluation questions, covering the full spectrum from factual recall to contradiction detection: Basic Fact Recall, Multi-hop Inference, Dynamic Update, Memory Boundary, Generalization & Application, and Memory Conflict.
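
A small sketch of per-category scoring over these six types; the category keys and the result-record shape are assumed for illustration, not the benchmark's actual output format:

```python
from collections import defaultdict

# Category keys mirror the six question types listed above.
CATEGORIES = [
    "basic_fact_recall", "multi_hop_inference", "dynamic_update",
    "memory_boundary", "generalization_application", "memory_conflict",
]

def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """results: [{'category': str, 'correct': bool}, ...] (assumed shape)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}

print(accuracy_by_category([
    {"category": "basic_fact_recall", "correct": True},
    {"category": "multi_hop_inference", "correct": False},
]))
```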

πŸ”¬ Key Experimental Findings

We conducted a comprehensive evaluation of several state-of-the-art memory systems, including Mem0, Memobase, SuperMemory, and Zep, under consistent parameter settings to ensure fair comparison. Evaluations of additional memory systems, such as Memos, will be added on an ongoing basis.

  • Universal Long-Context Degradation — All systems perform worse on HaluMem-Long, revealing major weaknesses under extended, noisy contexts.
  • Weak Memory Extraction — Recall < 60% and accuracy < 62%; systems struggle to capture complete and correct memories (see the metric sketch after this list).
  • Fragile Updating — Correct update < 50%, omission > 50%; the extraction and updating stages are poorly connected.
  • QA Bottleneck — Answer accuracy < 56%, heavily dependent on upstream memory quality.
  • Core Insight — Current memory systems lack robustness to context drift, interference, and long-term accumulation, underscoring the need for stronger relevance filtering and stable long-term memory linkage.
  • Overall Low Accuracy — All systems show modest accuracy across question categories, revealing substantial room for improvement.
  • Strength in Detection — All systems handle memory boundary and conflict questions relatively well, indicating good recognition of unknown or misleading information.
  • Weak in Reasoning — Performance drops markedly on multi-hop inference, dynamic update, and generalization & application questions, showing ongoing challenges in reasoning and preference tracking.
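
For concreteness, here is a minimal sketch of the extraction-stage numbers cited above, interpreting recall as coverage of gold memory points and accuracy as the precision of extracted ones. Both the interpretation and the exact-string matching are assumptions standing in for the benchmark's real procedure:

```python
def extraction_metrics(extracted: set[str], gold: set[str]) -> dict[str, float]:
    """Recall over gold memory points; 'accuracy' as precision of extractions."""
    matched = extracted & gold
    return {
        "recall": len(matched) / len(gold) if gold else 0.0,
        "accuracy": len(matched) / len(extracted) if extracted else 0.0,
    }

print(extraction_metrics({"likes tea", "owns a dog"},
                         {"likes tea", "plays chess"}))
# -> {'recall': 0.5, 'accuracy': 0.5}
```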

🌐 Dataset & Resources
🧩 Hugging Face: IAAR-Shanghai/HaluMem
πŸ“„ Paper: arXiv:2511.03506
πŸ’» Code & Benchmark Suite: github.com/MemTensor/HaluMem
