arxiv:2604.07338

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Published on Apr 8 · Submitted by Yuechen Jiang on Apr 10

AI-generated summary

Vision-language models demonstrate limited capability in inferring structured cultural metadata from visual input, showing inconsistent performance across different cultures and metadata types.

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
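
The paper's evaluation code is not included on this page, so the sketch below is only a rough illustration of the reported scoring dimensions (exact-match, partial-match, and attribute-level accuracy). It substitutes simple string matching for the paper's LLM-as-Judge semantic alignment, and the attribute names and partial-match rule are assumptions, not the authors' protocol.

    # Illustrative scorer (not the authors' code): a string-matching stand-in for
    # the LLM-as-Judge, producing exact-match, partial-match, and per-attribute
    # scores for a single prediction. Attribute names are assumptions.
    ATTRIBUTES = ["creator", "culture", "period", "origin"]

    def normalize(value: str) -> str:
        """Lowercase and collapse whitespace so 'Edo Period' matches 'edo period'."""
        return " ".join(value.lower().split())

    def score_prediction(pred: dict, ref: dict) -> dict:
        """Score one predicted metadata record against its reference annotation."""
        per_attr = {}
        for attr in ATTRIBUTES:
            p, r = normalize(pred.get(attr, "")), normalize(ref.get(attr, ""))
            if p and p == r:
                per_attr[attr] = 1.0  # exact agreement on this attribute
            elif p and r and (p in r or r in p):
                per_attr[attr] = 0.5  # partial overlap, e.g. "Japan" vs "Edo, Japan"
            else:
                per_attr[attr] = 0.0
        return {
            "exact_match": all(v == 1.0 for v in per_attr.values()),
            "partial_match": any(v > 0.0 for v in per_attr.values()),
            "attribute_scores": per_attr,
        }

    # Made-up example; in the paper's setup, scores are aggregated per cultural region.
    pred = {"creator": "Unknown", "culture": "Japanese", "period": "Edo period", "origin": "Japan"}
    ref = {"creator": "Unknown", "culture": "Japanese", "period": "Edo period (1615-1868)", "origin": "Edo, Japan"}
    print(score_prediction(pred, ref))

In the paper's actual framework, a judge LLM decides whether a prediction is semantically aligned with the reference annotation rather than relying on substring overlap.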

Community

We introduce Appear2Meaning, a cross-cultural benchmark for structured cultural metadata inference from images.

Unlike standard image captioning, this task requires models to predict non-observable attributes such as culture, period, origin, and creator from visual input alone. The dataset contains 750 curated objects from the Getty and the Metropolitan Museum of Art, covering multiple object types and four cultural regions.
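
As a rough sketch of what one benchmark item might look like (the field names and values below are illustrative, not the released dataset's actual schema), the model receives only the image and must predict the structured metadata fields:

    # Hypothetical record layout; field names and values are our own illustration.
    example_item = {
        "image": "images/met_object_001.jpg",      # object photograph (path is made up)
        "object_type": "ceramic vessel",
        "source_museum": "The Metropolitan Museum of Art",
        "cultural_region": "East Asia",
        "metadata": {                              # reference annotations to infer
            "culture": "Chinese",
            "period": "Ming dynasty",
            "origin": "Jingdezhen, China",
            "creator": "Unknown",
        },
    }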

Key Findings

  • Structured metadata inference is significantly harder than image captioning
  • Models capture partial signals but fail at coherent multi-attribute prediction
  • Strong variation across cultural regions, with East Asia performing better than other regions
  • Frequent errors include cross-cultural misattribution and period compression

Why it matters

This work highlights the gap between visual perception and culturally grounded reasoning in VLMs, providing a benchmark for studying bias, generalization, and structured multimodal inference in cultural heritage.
