arxiv:2509.26542

Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Published on Sep 30 · Submitted by Yueqian Lin on Oct 1
Abstract

AI-generated summary: VERA is a benchmark for evaluating reasoning ability in voice-interactive systems, revealing significant performance gaps compared to text models and highlighting challenges in real-time interaction.

We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
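To make the "decoupled cascade" design concrete, here is a minimal sketch of the general pattern the abstract describes (speech recognition, text-based reasoning, then spoken narration). The function names and types are hypothetical placeholders for illustration, not the paper's implementation or any specific model API.

```python
# Minimal sketch of a decoupled cascade voice pipeline: the reasoning step
# works purely in text and is separated from the narration step.
# All functions here are hypothetical placeholders, not VERA's code or a
# real model API.

def transcribe(audio: bytes) -> str:
    """ASR: convert the spoken question to text (placeholder)."""
    raise NotImplementedError

def reason(question: str) -> str:
    """Text-model reasoning: produce a textual answer (placeholder)."""
    raise NotImplementedError

def narrate(answer: str) -> bytes:
    """TTS: render the textual answer as speech (placeholder)."""
    raise NotImplementedError

def cascade_turn(audio_in: bytes) -> bytes:
    # Reasoning never touches audio, which tends to improve accuracy but
    # adds latency and can introduce grounding/consistency errors between
    # what was reasoned and what is narrated.
    question = transcribe(audio_in)
    answer = reason(question)
    return narrate(answer)
```

Native streaming and end-to-end systems, by contrast, interleave or fuse these stages, which is where the paper reports distinct error signatures.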

Community

Yueqian Lin (paper author, paper submitter)

VERA benchmarks reasoning in voice systems: 2,931 voice-native episodes across 5 tracks. It exposes a large Voice Reasoning Gap (text ≈54% vs voice ≈11%) and a ~1.5s "real-time plateau" near ~10% accuracy.
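For readers skimming the headline numbers, a minimal sketch of how a macro-averaged score and the resulting text-vs-voice gap are computed over the five tracks; the per-track values below are illustrative placeholders, not results from the paper.

```python
# Sketch of macro-averaged accuracy and the text-vs-voice gap over VERA's
# five tracks (Math, Web, Science, Long-Context, Factual). The numbers
# below are placeholders for illustration, not the paper's results.

TRACKS = ["Math", "Web", "Science", "Long-Context", "Factual"]

def macro_avg(per_track: dict) -> float:
    """Unweighted mean over tracks, so each track counts equally
    regardless of how many episodes it contains."""
    return sum(per_track[t] for t in TRACKS) / len(TRACKS)

# Placeholder per-track accuracies (fractions), NOT the paper's results.
text_scores  = {"Math": 0.70, "Web": 0.50, "Science": 0.55,
                "Long-Context": 0.50, "Factual": 0.45}
voice_scores = {"Math": 0.06, "Web": 0.10, "Science": 0.12,
                "Long-Context": 0.15, "Factual": 0.14}

gap = macro_avg(text_scores) - macro_avg(voice_scores)
print(f"text macro: {macro_avg(text_scores):.1%}, "
      f"voice macro: {macro_avg(voice_scores):.1%}, gap: {gap:.1%}")
```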


