pyf98 committed
Commit a853413 • 1 Parent(s): e4bf2d8

update version

Files changed (3)
  1. README.md +10 -8
  2. app.py +8 -23
  3. requirements.txt +4 -4
README.md CHANGED
@@ -1,19 +1,21 @@
  ---
  title: OWSM Demo
- emoji: 👀
+ emoji: 🔊
  colorFrom: green
  colorTo: blue
  sdk: gradio
- sdk_version: 4.36.0
+ sdk_version: 5.4.0
  app_file: app.py
  pinned: false
- python_version: 3.9
+ python_version: 3.11
+ preload_from_hub:
+ - espnet/owsm_v3.1_ebf
  models:
- - espnet/owsm_v1
- - espnet/owsm_v2
- - espnet/owsm_v2_ebranchformer
- - espnet/owsm_v3
- - espnet/owsm_v3.1_ebf
+ - espnet/owsm_v1
+ - espnet/owsm_v2
+ - espnet/owsm_v2_ebranchformer
+ - espnet/owsm_v3
+ - espnet/owsm_v3.1_ebf
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
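
For orientation, the front matter above pins the Space to Gradio 5.4.0 on Python 3.11 and preloads the espnet/owsm_v3.1_ebf checkpoint at build time. A minimal sketch of an app entry point that would run under this configuration is shown below; it is hypothetical and not the Space's actual app.py, which exposes language, task, and timestamp options.

```python
import gradio as gr

def transcribe(audio_path: str) -> str:
    """Placeholder callback: the real demo runs OWSM v3.1 inference here."""
    return f"(transcription of {audio_path})"

# Gradio 5.x Interface with an audio file input and a text output,
# matching the sdk/sdk_version declared in the README front matter.
demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Speech input"),
    outputs=gr.Textbox(label="Transcription"),
    title="OWSM Demo",
)

if __name__ == "__main__":
    demo.launch()
```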
app.py CHANGED
@@ -13,12 +13,9 @@ OWSM (pronounced as "awesome") is a series of Open Whisper-style Speech Models f
  We reproduce Whisper-style training using publicly available data and an open-source toolkit [ESPnet](https://github.com/espnet/espnet).
  For more details, please check our [website](https://www.wavlab.org/activities/2024/owsm/) or [paper](https://arxiv.org/abs/2309.13876) (Peng et al., ASRU 2023).

- The latest demo uses OWSM v3.1, an improved version of OWSM v3.
- OWSM v3.1 outperforms OWSM v3 in almost all evaluation benchmarks while being faster during inference.
- Note that we do not use extra training data. Instead, we utilize a state-of-the-art speech encoder, [E-Branchformer](https://arxiv.org/abs/2210.00077), to enhance the speech modeling capability.
-
+ The latest demo uses OWSM v3.1 based on [E-Branchformer](https://arxiv.org/abs/2210.00077).
  OWSM v3.1 has 1.02B parameters and is trained on 180k hours of labelled data. It supports various speech-to-text tasks:
- - Speech recognition for 151 languages
+ - Speech recognition in 151 languages
  - Any-to-any language speech translation
  - Utterance-level timestamp prediction
  - Long-form transcription
@@ -35,30 +32,18 @@ Please consider citing the following related papers if you find our work helpful
  <p>

  ```
+ @inproceedings{peng2024owsm31,
+   title={OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer},
+   author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
+   booktitle={Proc. INTERSPEECH},
+   year={2024}
+ }
  @inproceedings{peng2023owsm,
    title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
    author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
    booktitle={Proc. ASRU},
    year={2023}
  }
- @inproceedings{peng23b_interspeech,
-   author={Yifan Peng and Kwangyoun Kim and Felix Wu and Brian Yan and Siddhant Arora and William Chen and Jiyang Tang and Suwon Shon and Prashant Sridhar and Shinji Watanabe},
-   title={{A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks}},
-   year=2023,
-   booktitle={Proc. INTERSPEECH},
- }
- @inproceedings{kim2023branchformer,
-   title={E-branchformer: Branchformer with enhanced merging for speech recognition},
-   author={Kim, Kwangyoun and Wu, Felix and Peng, Yifan and Pan, Jing and Sridhar, Prashant and Han, Kyu J and Watanabe, Shinji},
-   booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
-   year={2023},
- }
- @InProceedings{pmlr-v162-peng22a,
-   title = {Branchformer: Parallel {MLP}-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding},
-   author = {Peng, Yifan and Dalmia, Siddharth and Lane, Ian and Watanabe, Shinji},
-   booktitle = {Proceedings of the 39th International Conference on Machine Learning},
-   year = {2022},
- }
  ```

  </p>
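
The updated description in app.py lists OWSM v3.1's speech-to-text tasks. A short inference sketch using ESPnet's Speech2Text interface is given below; it follows the pattern on the espnet/owsm_v3.1_ebf model card rather than the Space's actual app.py, and argument names such as lang_sym and task_sym may differ across ESPnet releases.

```python
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Download and load OWSM v3.1 (E-Branchformer encoder, 1.02B parameters).
speech2text = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    lang_sym="<eng>",  # language token; assumed default for English ASR
    task_sym="<asr>",  # task token; the demo also supports translation tasks
    beam_size=5,
)

# OWSM models are trained on 16 kHz mono audio.
speech, rate = sf.read("sample_16k.wav")
assert rate == 16000, "resample the input to 16 kHz first"

# The model returns n-best hypotheses; each entry starts with the decoded text.
nbest = speech2text(speech)
print(nbest[0][0])
```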
requirements.txt CHANGED
@@ -1,4 +1,4 @@
- torch==2.1.0
- torchaudio
- espnet @ git+https://github.com/espnet/espnet@7bcb169291f5d4a9b1fd00f8bfe554de84e50024
- espnet_model_zoo
+ torch == 2.4.1
+ torchaudio == 2.4.1
+ espnet @ git+https://github.com/espnet/espnet@v.202409
+ espnet_model_zoo @ git+https://github.com/espnet/espnet_model_zoo@8b7301923c1c529a126c86c16fc73ce356b94d62
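
As a quick way to confirm that the pinned stack above resolved as intended inside the Space, a small (hypothetical) startup check could print the installed versions:

```python
# Hypothetical sanity check: the pins above expect torch/torchaudio 2.4.1
# and the espnet release installed from the v.202409 git tag.
import espnet
import torch
import torchaudio

print("torch:", torch.__version__)            # expected 2.4.1 (a +cpu/+cu suffix may appear)
print("torchaudio:", torchaudio.__version__)  # expected 2.4.1
print("espnet:", espnet.__version__)          # expected to match the 202409 release
```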