pyf98 committed
Commit a853413 • 1 Parent(s): e4bf2d8

update version

Files changed (3)
  1. README.md +10 -8
  2. app.py +8 -23
  3. requirements.txt +4 -4
README.md CHANGED
@@ -1,19 +1,21 @@
  ---
  title: OWSM Demo
- emoji: 👀
+ emoji: 🔊
  colorFrom: green
  colorTo: blue
  sdk: gradio
- sdk_version: 4.36.0
+ sdk_version: 5.4.0
  app_file: app.py
  pinned: false
- python_version: 3.9
+ python_version: 3.11
+ preload_from_hub:
+ - espnet/owsm_v3.1_ebf
  models:
- - espnet/owsm_v1
- - espnet/owsm_v2
- - espnet/owsm_v2_ebranchformer
- - espnet/owsm_v3
- - espnet/owsm_v3.1_ebf
+ - espnet/owsm_v1
+ - espnet/owsm_v2
+ - espnet/owsm_v2_ebranchformer
+ - espnet/owsm_v3
+ - espnet/owsm_v3.1_ebf
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
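
For orientation, the front matter above pins the Space to Gradio 5.4.0 on Python 3.11 and preloads the espnet/owsm_v3.1_ebf checkpoint at build time. A minimal sketch of an app entry point that would run under this configuration is shown below; it is hypothetical and not the Space's actual app.py, which exposes language, task, and timestamp options.

```python
import gradio as gr

def transcribe(audio_path: str) -> str:
    """Placeholder callback: the real demo runs OWSM v3.1 inference here."""
    return f"(transcription of {audio_path})"

# Gradio 5.x Interface with an audio file input and a text output,
# matching the sdk/sdk_version declared in the README front matter.
demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Speech input"),
    outputs=gr.Textbox(label="Transcription"),
    title="OWSM Demo",
)

if __name__ == "__main__":
    demo.launch()
```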
app.py CHANGED
@@ -13,12 +13,9 @@ OWSM (pronounced as "awesome") is a series of Open Whisper-style Speech Models f
  We reproduce Whisper-style training using publicly available data and an open-source toolkit [ESPnet](https://github.com/espnet/espnet).
  For more details, please check our [website](https://www.wavlab.org/activities/2024/owsm/) or [paper](https://arxiv.org/abs/2309.13876) (Peng et al., ASRU 2023).

- The latest demo uses OWSM v3.1, an improved version of OWSM v3.
- OWSM v3.1 outperforms OWSM v3 in almost all evaluation benchmarks while being faster during inference.
- Note that we do not use extra training data. Instead, we utilize a state-of-the-art speech encoder, [E-Branchformer](https://arxiv.org/abs/2210.00077), to enhance the speech modeling capability.
-
+ The latest demo uses OWSM v3.1 based on [E-Branchformer](https://arxiv.org/abs/2210.00077).
  OWSM v3.1 has 1.02B parameters and is trained on 180k hours of labelled data. It supports various speech-to-text tasks:
- - Speech recognition for 151 languages
+ - Speech recognition in 151 languages
  - Any-to-any language speech translation
  - Utterance-level timestamp prediction
  - Long-form transcription
@@ -35,30 +32,18 @@ Please consider citing the following related papers if you find our work helpful
  <p>

  ```
+ @inproceedings{peng2024owsm31,
+   title={OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer},
+   author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
+   booktitle={Proc. INTERSPEECH},
+   year={2024}
+ }
  @inproceedings{peng2023owsm,
    title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
    author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
    booktitle={Proc. ASRU},
    year={2023}
  }
- @inproceedings{peng23b_interspeech,
-   author={Yifan Peng and Kwangyoun Kim and Felix Wu and Brian Yan and Siddhant Arora and William Chen and Jiyang Tang and Suwon Shon and Prashant Sridhar and Shinji Watanabe},
-   title={{A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks}},
-   year=2023,
-   booktitle={Proc. INTERSPEECH},
- }
- @inproceedings{kim2023branchformer,
-   title={E-branchformer: Branchformer with enhanced merging for speech recognition},
-   author={Kim, Kwangyoun and Wu, Felix and Peng, Yifan and Pan, Jing and Sridhar, Prashant and Han, Kyu J and Watanabe, Shinji},
-   booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
-   year={2023},
- }
- @InProceedings{pmlr-v162-peng22a,
-   title = {Branchformer: Parallel {MLP}-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding},
-   author = {Peng, Yifan and Dalmia, Siddharth and Lane, Ian and Watanabe, Shinji},
-   booktitle = {Proceedings of the 39th International Conference on Machine Learning},
-   year = {2022},
- }
  ```

  </p>
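
The updated description in app.py lists OWSM v3.1's speech-to-text tasks. A short inference sketch using ESPnet's Speech2Text interface is given below; it follows the pattern on the espnet/owsm_v3.1_ebf model card rather than the Space's actual app.py, and argument names such as lang_sym and task_sym may differ across ESPnet releases.

```python
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Download and load OWSM v3.1 (E-Branchformer encoder, 1.02B parameters).
speech2text = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    lang_sym="<eng>",  # language token; assumed default for English ASR
    task_sym="<asr>",  # task token; the demo also supports translation tasks
    beam_size=5,
)

# OWSM models are trained on 16 kHz mono audio.
speech, rate = sf.read("sample_16k.wav")
assert rate == 16000, "resample the input to 16 kHz first"

# The model returns n-best hypotheses; each entry starts with the decoded text.
nbest = speech2text(speech)
print(nbest[0][0])
```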
requirements.txt CHANGED
@@ -1,4 +1,4 @@
- torch==2.1.0
- torchaudio
- espnet @ git+https://github.com/espnet/espnet@7bcb169291f5d4a9b1fd00f8bfe554de84e50024
- espnet_model_zoo
+ torch == 2.4.1
+ torchaudio == 2.4.1
+ espnet @ git+https://github.com/espnet/espnet@v.202409
+ espnet_model_zoo @ git+https://github.com/espnet/espnet_model_zoo@8b7301923c1c529a126c86c16fc73ce356b94d62
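
As a quick way to confirm that the pinned stack above resolved as intended inside the Space, a small (hypothetical) startup check could print the installed versions:

```python
# Hypothetical sanity check: the pins above expect torch/torchaudio 2.4.1
# and the espnet release installed from the v.202409 git tag.
import espnet
import torch
import torchaudio

print("torch:", torch.__version__)            # expected 2.4.1 (a +cpu/+cu suffix may appear)
print("torchaudio:", torchaudio.__version__)  # expected 2.4.1
print("espnet:", espnet.__version__)          # expected to match the 202409 release
```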