Talk Arena
Talking is a natural and intuitive way for people to interact with AI assistants. However, most research evaluates Large Audio Models using a set of static benchmarks, which are effective for assessing isolated tasks but may not capture how well models interact with people in real-world scenarios. Therefore, we introduce Talk Arena, an interactive open platform for evaluating Large Audio Models through interactions with users in real-world settings. We used Talk Arena's dynamic evaluation to benchmark five large audio models and correlated these results with those on 18 static speech comprehension benchmarks.
Recent efforts towards creating multimodal models have resulted in LLMs capable of processing audio inputs such as speech. Speech is a low-friction interface that expands social and phonetic interaction opportunities with end users. Prior work has benchmarked audio models on a set of disjoint static audio tests such as sarcasm or humor detection. However, such static benchmarks lack the complex dynamics of real user interactions and preferences. Inspired by arena-style evaluations for text LLMs, we introduce Talk Arena, an open platform for evaluating Large Audio Models with pairwise human preferences. Talk Arena helps to reveal insights on:
- Which Large Audio Model do users prefer the most? Users vote on their preferences using self-initiated prompts, which better reflects the actual user experience.
- Are static speech comprehension benchmarks predictive of user preferences in interactive settings? Talk Arena reveals a gap between the mainstream evaluation method for audio models and actual user preferences.
Try it now at talkarena.org, where you can also find a version of this article with interactive visuals.
Static Evaluation
Tasks and Datasets
We select all speech comprehension benchmarks from existing holistic evaluation sets for audio models (namely AudioBench and AIR-Bench). There are 18 different datasets in total, and we evaluate 11 different large audio models on them.
The datasets cover a wide range of tasks spanning three categories: Speaker Cognitive State, Speaker Identity, and Speech Content Understanding. They include Humor Detection, Sarcasm Detection, Intent Detection, Emotion Recognition, Relationship Classification, Gender Classification, Age Classification, Accent Classification, Speech Grounding, Language Identification, Speech Entity Recognition, Speech Question Answering, and Speech Instruction Following.
Result Analysis
To ensure robustness, we report model performance averaged over three different prompt variations.
For the public_sg_speech, openhermes, and alpaca datasets, we report the cfm metric. For other tasks, we report macro F1 scores.
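To make the aggregation concrete, here is a minimal sketch of how macro F1 could be averaged over prompt variations; the prompt texts and the `model.predict` interface are hypothetical stand-ins, not our actual evaluation harness.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical prompt variations for an intent-classification benchmark.
PROMPT_VARIANTS = [
    "What is the intent of the speaker?",
    "Classify the intent expressed in this audio clip.",
    "Which intent label best describes this utterance?",
]

def evaluate_model(model, dataset, prompts=PROMPT_VARIANTS):
    """Return macro F1 averaged over prompt variations.

    `model.predict(audio, prompt)` and the dataset of (audio, label)
    pairs are placeholders for the actual evaluation harness.
    """
    scores = []
    for prompt in prompts:
        preds = [model.predict(audio, prompt) for audio, _ in dataset]
        labels = [label for _, label in dataset]
        scores.append(f1_score(labels, preds, average="macro"))
    return float(np.mean(scores))
```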
In general, closed-source models like Gemini and GPT4o top the leaderboard: Gemini has the highest performance on SLURP intent classification (F1: 91.4), MELD emotion recognition (F1: 26.9), and CN_college_listen (F1: 66.1), while GPT4o performs best on MUSTARD sarcasm detection (F1: 53.6), IEMOCAP emotion recognition (F1: 31.5), CallHome relationship classification (F1: 59.7), and Commonvoice accent classification (F1: 35.3).
Among the open-source models, Qwen2-Audio demonstrates outstanding performance on speech QA and gender/age classification, while DiVA shows humor detection and speech instruction following capabilities that outperform all other models. Both also perform relatively well on the remaining tasks, demonstrating good generalizability. NextGPT and PandaGPT perform relatively worse, especially on tasks like intent and emotion recognition, accent recognition, and instruction following. Both use the same encoder architecture (ImageBind), which suggests limitations of ImageBind for encoding audio features.
We also evaluate a sequential pipeline of Whisper plus Llama3-8B-Instruct. It shows relatively good performance on tasks like emotion recognition and speech QA, which suggests that some data instances can be solved from the transcript alone. However, for every task there are speech models that outperform the Whisper+Llama3 pipeline. This suggests that information such as emotion, relationship, and sarcasm can be embedded in vocal cues and requires understanding beyond the transcript.
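For reference, a cascade of this kind can be approximated in a few lines with Hugging Face pipelines, assuming a recent transformers version; the checkpoint names are assumptions (the exact Whisper variant is not specified here), and the real pipeline may differ in prompting details.

```python
from transformers import pipeline

# Assumed checkpoints; the exact Whisper variant used in the evaluation
# may differ.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def cascade_answer(audio_path: str, instruction: str) -> str:
    """Transcribe speech with Whisper, then answer with a text-only LLM.

    The LLM only ever sees the transcript, so vocal cues such as tone,
    emotion, or sarcasm are lost at this step.
    """
    transcript = asr(audio_path)["text"]
    messages = [
        {"role": "user", "content": f"{instruction}\n\nTranscript: {transcript}"}
    ]
    # Recent transformers chat pipelines append the assistant reply to the chat.
    output = llm(messages, max_new_tokens=256)
    return output[0]["generated_text"][-1]["content"]
```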
Interactive Evaluation
User Preference
As an initial effort, we collected a total of 5000 votes on Prolific using Talk Arena for pairwise comparisons among GPT4o, Gemini-1.5-pro, Typhoon, Qwen2-Audio, and DiVA, the top-performing models from the static evaluation. For each of the ten pairings, we collected 500 votes from more than 50 different crowdworkers, for a total of 359 different voters.
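This post does not detail the vote-aggregation method, but as a rough illustration, the sketch below ranks models by pairwise win rate; the vote format is hypothetical, and a Bradley-Terry or Elo-style fit would be a natural alternative.

```python
from collections import defaultdict

# Each vote records the two models compared and the winner
# (None could denote a tie). The entries here are illustrative only.
votes = [
    ("GPT4o", "DiVA", "GPT4o"),
    ("Qwen2-Audio", "Gemini-1.5-pro", "Gemini-1.5-pro"),
]

def win_rate_ranking(votes):
    """Rank models by their fraction of pairwise wins."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, winner in votes:
        games[model_a] += 1
        games[model_b] += 1
        if winner is not None:
            wins[winner] += 1
    rates = {model: wins[model] / games[model] for model in games}
    return sorted(rates, key=rates.get, reverse=True)
```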
Comparison with Static Evaluation
We compare the user preference ranking from interactive evaluation with the model rankings from static evaluation by computing the top-k Kendall Tau distance between them.
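As one concrete interpretation, the sketch below counts discordant pairs among the top-k models of the interactive ranking; the exact variant of the distance used may differ, and the example rankings in the comment are hypothetical.

```python
from itertools import combinations

def topk_kendall_tau_distance(interactive_rank, static_rank, k):
    """Count pairs of models ordered differently by the two rankings,
    restricted to the top-k models of the interactive ranking.

    Both rankings are lists of model names ordered from best to worst.
    """
    top_models = interactive_rank[:k]
    pos_inter = {m: i for i, m in enumerate(interactive_rank)}
    pos_static = {m: i for i, m in enumerate(static_rank)}
    discordant = 0
    for a, b in combinations(top_models, 2):
        # A pair is discordant if the two rankings disagree on its order.
        if (pos_inter[a] - pos_inter[b]) * (pos_static[a] - pos_static[b]) < 0:
            discordant += 1
    return discordant

# Example with hypothetical rankings:
# topk_kendall_tau_distance(
#     ["GPT4o", "Gemini-1.5-pro", "DiVA", "Qwen2-Audio", "Typhoon"],
#     ["Gemini-1.5-pro", "GPT4o", "Qwen2-Audio", "DiVA", "Typhoon"],
#     k=3,
# )
```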
Here are some observations:
- None of the static benchmarks yields exactly the same ranking as the interactive evaluation
- Rankings on emotion recognition and language detection benchmarks are the most similar to the interactive ranking
- Rankings on gender detection and nuanced intent detection (humor, sarcasm) correlate poorly with the interactive ranking
These are our observations from the Prolific study, and we hope to draw firmer conclusions as votes from the public accumulate.
Looking Forward
We are inspired by how the Chatbot Arena has rapidly accelerated research on real-world applications of language models in conversational systems. As we look ahead, we aim to similarly focus the development of speech-enabled language models on user needs, rather than limiting innovation to what current benchmarks can measure.
Incorporating Human Preferences in Speech Data We currently store no data other than votes, but long-term we want to work with the community to build frameworks for consensual data sharing. Speech data requires special care, since it can inherently identify individuals or even be used to train models that mimic their voices. We would love for data from Talk Arena to help directly improve open-source and academic speech models, but clear consent processes and careful data handling are prerequisites to making this possible in a way that is both useful and ethical.
Managing Free-Form Conversational Dynamics Speech conversations flow differently than text chats: they are more dynamic and less strictly turn-based. These qualities are what make speech compelling for users, but they present challenges for Arena-style evaluation. As more conversational speech systems are released, we are looking at how to assess these natural speech interactions effectively.
Developing Robust Static Benchmarks While interactive feedback from users is invaluable, we also recognize that it is often too slow for measuring intermediate progress during model development. Using qualitative insights from paid participants, along with general correlations with public ratings, we are hopeful that findings from Talk Arena can inform static evaluations that are better aligned with user preferences and provide more rapid, inexpensive feedback.
Collaboration
We are open to collaboration in many ways! If you are interested in contributing to this project, please feel free to contact us at ellamzli@stanford.edu, held@stanford.edu, or diyiy@cs.stanford.edu.