arxiv:2604.07338

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Published on Apr 8 · Submitted by Yuechen Jiang on Apr 10

AI-generated summary

Vision-language models demonstrate limited capability in inferring structured cultural metadata from visual input, showing inconsistent performance across different cultures and metadata types.

Abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
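
The paper's evaluation code is not included on this page, so the sketch below is only a rough illustration of the reported scoring dimensions (exact-match, partial-match, and attribute-level accuracy). It substitutes simple string matching for the paper's LLM-as-Judge semantic alignment, and the attribute names and partial-match rule are assumptions, not the authors' protocol.

    # Illustrative scorer (not the authors' code): a string-matching stand-in for
    # the LLM-as-Judge, producing exact-match, partial-match, and per-attribute
    # scores for a single prediction. Attribute names are assumptions.
    ATTRIBUTES = ["creator", "culture", "period", "origin"]

    def normalize(value: str) -> str:
        """Lowercase and collapse whitespace so 'Edo Period' matches 'edo period'."""
        return " ".join(value.lower().split())

    def score_prediction(pred: dict, ref: dict) -> dict:
        """Score one predicted metadata record against its reference annotation."""
        per_attr = {}
        for attr in ATTRIBUTES:
            p, r = normalize(pred.get(attr, "")), normalize(ref.get(attr, ""))
            if p and p == r:
                per_attr[attr] = 1.0  # exact agreement on this attribute
            elif p and r and (p in r or r in p):
                per_attr[attr] = 0.5  # partial overlap, e.g. "Japan" vs "Edo, Japan"
            else:
                per_attr[attr] = 0.0
        return {
            "exact_match": all(v == 1.0 for v in per_attr.values()),
            "partial_match": any(v > 0.0 for v in per_attr.values()),
            "attribute_scores": per_attr,
        }

    # Made-up example; in the paper's setup, scores are aggregated per cultural region.
    pred = {"creator": "Unknown", "culture": "Japanese", "period": "Edo period", "origin": "Japan"}
    ref = {"creator": "Unknown", "culture": "Japanese", "period": "Edo period (1615-1868)", "origin": "Edo, Japan"}
    print(score_prediction(pred, ref))

In the paper's actual framework, a judge LLM decides whether a prediction is semantically aligned with the reference annotation rather than relying on substring overlap.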

Community

We introduce Appear2Meaning, a cross-cultural benchmark for structured cultural metadata inference from images.

Unlike standard image captioning, this task requires models to predict non-observable attributes such as culture, period, origin, and creator from visual input alone. The dataset contains 750 curated objects from the Getty and the Metropolitan Museum of Art, covering multiple object types and four cultural regions.
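
As a rough sketch of what one benchmark item might look like (the field names and values below are illustrative, not the released dataset's actual schema), the model receives only the image and must predict the structured metadata fields:

    # Hypothetical record layout; field names and values are our own illustration.
    example_item = {
        "image": "images/met_object_001.jpg",      # object photograph (path is made up)
        "object_type": "ceramic vessel",
        "source_museum": "The Metropolitan Museum of Art",
        "cultural_region": "East Asia",
        "metadata": {                              # reference annotations to infer
            "culture": "Chinese",
            "period": "Ming dynasty",
            "origin": "Jingdezhen, China",
            "creator": "Unknown",
        },
    }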

Key Findings

  • Structured metadata inference is significantly harder than image captioning
  • Models capture partial signals but fail at coherent multi-attribute prediction
  • Strong variation across cultural regions, with East Asia performing better than other regions
  • Frequent errors include cross-cultural misattribution and period compression

Why it matters

This work highlights the gap between visual perception and culturally grounded reasoning in VLMs, providing a benchmark for studying bias, generalization, and structured multimodal inference in cultural heritage.
