OWSM: Open Whisper-style Speech Model

OWSM aims to develop fully open speech foundation models using publicly available data and open-source toolkits, including ESPnet.

Inference examples can be found on our project page, and an interactive demo is available online.

OWSM v3.1 is an improved version of OWSM v3. It significantly outperforms OWSM v3 on almost all evaluation benchmarks. We do not add any new training data; instead, we adopt a state-of-the-art speech encoder, E-Branchformer.

This is a base-sized model with 101M parameters, trained on 180k hours of publicly available speech data. It supports the following speech-to-text tasks:

  • Speech recognition
  • Any-to-any-language speech translation
  • Utterance-level alignment
  • Long-form transcription
  • Language identification

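The tasks above can be run through ESPnet's `Speech2Text` interface. The sketch below shows a minimal English ASR example; the audio path `example.wav` is a placeholder, and the exact symbol values (`<eng>`, `<asr>`) follow the OWSM tag convention but should be checked against the model's token list.

```python
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Load the model from Hugging Face (downloads weights on first use).
s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf_base",
    lang_sym="<eng>",   # source language tag
    task_sym="<asr>",   # task tag: <asr> for recognition, <st_xxx> for translation
)

# Read a 16 kHz mono waveform and decode it.
speech, rate = sf.read("example.wav")
text, *_ = s2t(speech)[0]
print(text)
```

For speech translation, swapping `task_sym` to the target-language translation tag should select the corresponding decoding task; language identification uses the model's dedicated tag instead of a fixed `lang_sym`.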
Citations

OWSM-CTC

@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan  and
      Sudo, Yui  and
      Shakeel, Muhammad  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month = "8",
    url = "https://aclanthology.org/2024.acl-long.549",
}

OWSM v3.1 and v3.2

@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  url={https://arxiv.org/pdf/2406.09282},
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  url={https://arxiv.org/pdf/2401.16658},
}

Initial OWSM (v1, v2, v3)

@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  url={https://arxiv.org/pdf/2309.13876},
}