update version
- README.md +10 -8
- app.py +8 -23
- requirements.txt +4 -4
README.md
CHANGED
@@ -1,19 +1,21 @@
 ---
 title: OWSM Demo
-emoji:
+emoji: π
 colorFrom: green
 colorTo: blue
 sdk: gradio
-sdk_version: 4.
+sdk_version: 5.4.0
 app_file: app.py
 pinned: false
-python_version: 3.
+python_version: 3.11
+preload_from_hub:
+- espnet/owsm_v3.1_ebf
 models:
-- espnet/owsm_v1
-- espnet/owsm_v2
-- espnet/owsm_v2_ebranchformer
-- espnet/owsm_v3
-- espnet/owsm_v3.1_ebf
+- espnet/owsm_v1
+- espnet/owsm_v2
+- espnet/owsm_v2_ebranchformer
+- espnet/owsm_v3
+- espnet/owsm_v3.1_ebf
 ---

 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py
CHANGED
@@ -13,12 +13,9 @@ OWSM (pronounced as "awesome") is a series of Open Whisper-style Speech Models f
 We reproduce Whisper-style training using publicly available data and an open-source toolkit [ESPnet](https://github.com/espnet/espnet).
 For more details, please check our [website](https://www.wavlab.org/activities/2024/owsm/) or [paper](https://arxiv.org/abs/2309.13876) (Peng et al., ASRU 2023).

-The latest demo uses OWSM v3.1
-OWSM v3.1 outperforms OWSM v3 in almost all evaluation benchmarks while being faster during inference.
-Note that we do not use extra training data. Instead, we utilize a state-of-the-art speech encoder, [E-Branchformer](https://arxiv.org/abs/2210.00077), to enhance the speech modeling capability.
-
+The latest demo uses OWSM v3.1 based on [E-Branchformer](https://arxiv.org/abs/2210.00077).
 OWSM v3.1 has 1.02B parameters and is trained on 180k hours of labelled data. It supports various speech-to-text tasks:
-- Speech recognition
+- Speech recognition in 151 languages
 - Any-to-any language speech translation
 - Utterance-level timestamp prediction
 - Long-form transcription
@@ -35,30 +32,18 @@ Please consider citing the following related papers if you find our work helpful
 <p>

 ```
+@inproceedings{peng2024owsm31,
+title={OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer},
+author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
+booktitle={Proc. INTERSPEECH},
+year={2024}
+}
 @inproceedings{peng2023owsm,
 title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
 author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
 booktitle={Proc. ASRU},
 year={2023}
 }
-@inproceedings{peng23b_interspeech,
-author={Yifan Peng and Kwangyoun Kim and Felix Wu and Brian Yan and Siddhant Arora and William Chen and Jiyang Tang and Suwon Shon and Prashant Sridhar and Shinji Watanabe},
-title={{A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks}},
-year=2023,
-booktitle={Proc. INTERSPEECH},
-}
-@inproceedings{kim2023branchformer,
-title={E-branchformer: Branchformer with enhanced merging for speech recognition},
-author={Kim, Kwangyoun and Wu, Felix and Peng, Yifan and Pan, Jing and Sridhar, Prashant and Han, Kyu J and Watanabe, Shinji},
-booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
-year={2023},
-}
-@InProceedings{pmlr-v162-peng22a,
-title = {Branchformer: Parallel {MLP}-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding},
-author = {Peng, Yifan and Dalmia, Siddharth and Lane, Ian and Watanabe, Shinji},
-booktitle = {Proceedings of the 39th International Conference on Machine Learning},
-year = {2022},
-}
 ```

 </p>
requirements.txt
CHANGED
@@ -1,4 +1,4 @@
-torch==2.1
-torchaudio
-espnet @ git+https://github.com/espnet/espnet@
-espnet_model_zoo
+torch == 2.4.1
+torchaudio == 2.4.1
+espnet @ git+https://github.com/espnet/espnet@v.202409
+espnet_model_zoo @ git+https://github.com/espnet/espnet_model_zoo@8b7301923c1c529a126c86c16fc73ce356b94d62
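As context for the version bumps above, here is a minimal sketch of how the espnet/owsm_v3.1_ebf checkpoint listed in the README is typically run through ESPnet's `Speech2Text` inference interface. The decoding options, language/task tokens, and the file name `speech.wav` are illustrative assumptions, not values taken from this commit.

```python
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Download espnet/owsm_v3.1_ebf from the Hugging Face Hub and build the decoder.
# The decoding options below are illustrative, not the demo's exact settings.
s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    device="cpu",        # switch to "cuda" when a GPU is available
    beam_size=5,
    ctc_weight=0.0,
    lang_sym="<eng>",    # target language token
    task_sym="<asr>",    # speech recognition; other task tokens select translation
)

# OWSM models expect 16 kHz mono audio.
speech, rate = sf.read("speech.wav")
results = s2t(speech)
print(results[0][0])  # decoded text of the best hypothesis
```

Pinning torch/torchaudio to 2.4.1 and espnet to the v.202409 tag, as in requirements.txt above, keeps this inference path reproducible inside the Space.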