Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Abstract
Phoenix-VAD, an LLM-based model, enables reliable semantic endpoint detection in streaming audio interactions, enhancing human-computer interaction.
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, which hinders seamless audio interaction. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD combines the semantic comprehension capability of the LLM with a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and semantically incomplete speech scenarios indicate that Phoenix-VAD achieves competitive performance. Furthermore, this design allows the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
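To make the idea of sliding-window streaming inference concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not specify window sizes, features, or the model interface, so `stream_endpoints`, `window_size`, and `toy_is_endpoint` are all hypothetical names. In particular, `toy_is_endpoint` is a trivial silence check that only stands in for the LLM-based semantic classifier.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence


def stream_endpoints(
    chunks: Iterable[Sequence[float]],
    is_endpoint: Callable[[List[Sequence[float]]], bool],
    window_size: int = 4,  # assumed value, not taken from the paper
) -> Iterator[int]:
    """Yield the index of each chunk at which the classifier reports an endpoint."""
    # The deque keeps only the most recent `window_size` chunks.
    window: deque = deque(maxlen=window_size)
    for index, chunk in enumerate(chunks):
        window.append(chunk)
        # The classifier sees only the current window, never the full stream.
        if is_endpoint(list(window)):
            yield index
            window.clear()  # start fresh for the next utterance


def toy_is_endpoint(window: List[Sequence[float]]) -> bool:
    """Hypothetical stand-in: two consecutive all-zero chunks count as an endpoint."""
    if len(window) < 2:
        return False
    return all(max(abs(x) for x in chunk) == 0.0 for chunk in window[-2:])


if __name__ == "__main__":
    audio = [[0.5], [0.3], [0.0], [0.0], [0.7], [0.0], [0.0]]
    print(list(stream_endpoints(audio, toy_is_endpoint)))
```

Because the decision function is passed in as a parameter, the endpoint detector can be swapped or retrained without touching the streaming loop, which loosely mirrors the abstract's point that the prediction module can be optimized independently of the dialogue model.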
Community
Awesome paper! Will the weights/code be open sourced?