TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
Abstract
TAU, a benchmark of culturally specific Taiwanese soundmarks, reveals that state-of-the-art large audio-language models underperform compared to local humans, highlighting the need for localized evaluations.
Large audio-language models (LALMs) are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
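The abstract does not specify the item format or scoring protocol, so the sketch below is only a plausible illustration of what a TAU-style multiple-choice item and the implied accuracy metric might look like. All field names (`clip_path`, `question`, `choices`, `answer`) and the example clip are hypothetical, not taken from the released dataset.

```python
# Minimal sketch of a TAU-style multiple-choice item and accuracy scoring.
# Hypothetical schema: the paper's actual data format may differ.
from dataclasses import dataclass


@dataclass
class TAUItem:
    clip_path: str    # path to a Taiwanese soundmark audio clip
    question: str     # question about the non-semantic audio cue
    choices: list[str]  # answer options
    answer: int       # index of the correct choice


def accuracy(items: list[TAUItem], predictions: list[int]) -> float:
    """Fraction of multiple-choice items answered correctly."""
    correct = sum(p == item.answer for item, p in zip(items, predictions))
    return correct / len(items)


# Example: the melody played by Taiwanese garbage trucks is a classic
# soundmark that locals recognize instantly.
items = [
    TAUItem(
        clip_path="clips/garbage_truck.wav",
        question="What public service does this melody announce in Taiwan?",
        choices=["Ice cream vendor", "Garbage collection",
                 "School dismissal", "Night market opening"],
        answer=1,
    ),
]
print(f"accuracy = {accuracy(items, predictions=[1]):.2%}")
```

Framing items this way makes the human-vs-model comparison straightforward: both answer the same fixed choice set, and accuracy is directly comparable across LALMs and local annotators.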
Community
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence (2025)
- DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture (2025)
- Grounding Multilingual Multimodal LLMs With Cultural Knowledge (2025)
- WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning (2025)
- MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios (2025)