Samba-ASR: State-of-the-Art Speech Recognition Leveraging Structured State-Space Models
Abstract
We propose Samba-ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention mechanisms to capture dependencies, Samba-ASR effectively models both local and global temporal dependencies using efficient state-space dynamics, achieving remarkable performance gains. By addressing the limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba-ASR achieves superior accuracy and efficiency. Experimental results demonstrate that Samba-ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state of the art in ASR. Extensive evaluations on benchmark datasets show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the computational efficiency and parameter optimization of the Mamba architecture make Samba-ASR a scalable and robust solution for diverse ASR tasks.

Our contributions include:

- A new Samba-ASR architecture demonstrating the superiority of SSMs over transformer-based models for speech sequence processing.
- A comprehensive evaluation on public benchmarks showcasing state-of-the-art performance.
- An analysis of computational efficiency, robustness to noise, and sequence generalization.

This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging state-space modeling advancements, Samba-ASR sets a new benchmark for ASR performance and future research.
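The abstract's central technical claim is that state-space dynamics can replace self-attention for modeling temporal dependencies at a cost that grows linearly, rather than quadratically, with sequence length. The sketch below illustrates the plain discretized linear SSM recurrence (h_t = Ā·h_{t-1} + B̄·x_t, y_t = C·h_t) that Mamba builds on. It is a toy NumPy illustration, not the paper's implementation; the names `ssm_scan`, `A_bar`, `B_bar`, and `C` are assumptions introduced here for clarity.

```python
# Minimal sketch of the linear state-space recurrence behind Mamba-style models
# (illustration only; not the authors' implementation).
#
#   h_t = A_bar * h_{t-1} + B_bar * x_t    (state update)
#   y_t = C . h_t                          (readout)
#
# The scan visits each of the L frames once, so cost grows linearly with
# sequence length, unlike the O(L^2) pairwise comparisons of self-attention.
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """x: (L, D) input frames; A_bar, B_bar, C: (N,) diagonal SSM parameters,
    shared across the D feature channels in this toy variant."""
    L, D = x.shape
    N = A_bar.shape[0]
    h = np.zeros((D, N))      # one length-N state per feature channel
    y = np.zeros((L, D))
    for t in range(L):
        # decay the previous state and inject the current input frame
        h = A_bar * h + B_bar * x[t][:, None]
        # project the state back to one output value per channel
        y[t] = h @ C
    return y

# Toy usage: 1,000 audio frames with 80 mel features, state size 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 80)).astype(np.float32)
A_bar = np.exp(-rng.uniform(0.01, 0.5, size=16))   # stable per-dimension decay
B_bar = rng.standard_normal(16) * 0.1
C = rng.standard_normal(16) * 0.1
print(ssm_scan(x, A_bar, B_bar, C).shape)          # -> (1000, 80)
```

Mamba extends this basic recurrence by making the SSM parameters input-dependent (the "selective" mechanism) and computing the scan with a hardware-aware parallel algorithm, but the linear-in-length cost shown here is what the abstract contrasts with attention's quadratic scaling.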
Community
Nice job! but why so stingy?! 😢
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Late fusion ensembles for speech recognition on diverse input audio representations (2024)
- A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario (2024)
- Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages (2024)
- CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition (2024)
- Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition (2024)
- ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction (2024)
- Comparative Analysis of ASR Methods for Speech Deepfake Detection (2024)