# WenetSpeech-Yue
A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
The structure of WSYue-ASR is organized as follows:
```
WSYue-ASR
├── sensevoice_small_yue/
│   ├── config.yaml
│   ├── configuration.json
│   └── model.pt
├── u2pp_conformer_yue/
│   ├── bpe.model
│   ├── lang_char.txt
│   ├── train.yaml
│   └── u2pp_conformer_yue.pt
├── whisper_medium_yue/
│   ├── train.yaml
│   └── whisper_medium_yue.py
├── .gitattributes
└── README.md
```
Dialogue and Reading are in-house test sets; yue, HK, MDCC, Daily_Use, and Commands are open-source test sets; Short and Long come from WSYue-eval.

| Model | #Params (M) | Dialogue | Reading | yue | HK | MDCC | Daily_Use | Commands | Short | Long |
|---|---|---|---|---|---|---|---|---|---|---|
| **w/o LLM** | | | | | | | | | | |
| Conformer-Yue† | 130 | 16.57 | 7.82 | 7.72 | 11.42 | 5.73 | 5.73 | 8.97 | 5.05 | 8.89 |
| Paraformer | 220 | 83.22 | 51.97 | 70.16 | 68.49 | 47.67 | 79.31 | 69.32 | 73.64 | 89.00 |
| SenseVoice-small | 234 | 21.08 | 6.52 | 8.05 | 7.34 | 6.34 | 5.74 | 6.65 | 6.69 | 9.95 |
| SenseVoice-s-Yue† | 234 | 19.19 | 6.71 | 6.87 | 8.68 | 5.43 | 5.24 | 6.93 | 5.23 | 8.63 |
| Dolphin-small | 372 | 59.20 | 7.38 | 39.69 | 51.29 | 26.39 | 7.21 | 9.68 | 32.32 | 58.20 |
| TeleASR | 700 | 37.18 | 7.27 | 7.02 | 7.88 | 6.25 | 8.02 | 5.98 | 6.23 | 11.33 |
| Whisper-medium | 769 | 75.50 | 68.69 | 59.44 | 62.50 | 62.31 | 64.41 | 80.41 | 80.82 | 50.96 |
| Whisper-m-Yue† | 769 | 18.69 | 6.86 | 6.86 | 11.03 | 5.49 | 4.70 | 8.51 | 5.05 | 8.05 |
| FireRedASR-AED-L | 1100 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 55.37 | 50.26 |
| Whisper-large-v3 | 1550 | 45.09 | 15.46 | 12.85 | 16.36 | 14.63 | 17.84 | 20.70 | 12.95 | 26.86 |
| **w/ LLM** | | | | | | | | | | |
| Qwen2.5-Omni-3B | 3000 | 72.01 | 7.49 | 12.59 | 11.75 | 38.91 | 10.59 | 25.78 | 67.95 | 88.46 |
| Kimi-Audio | 7000 | 68.65 | 24.34 | 40.90 | 38.72 | 30.72 | 44.29 | 45.54 | 50.86 | 33.49 |
| FireRedASR-LLM-L | 8300 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 49.87 | 45.92 |
| Conformer-LLM-Yue† | 4200 | 17.22 | 6.21 | 6.23 | 9.52 | 4.35 | 4.57 | 6.98 | 4.73 | 7.91 |
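The scores above are error rates in percent, presumably character error rate (CER), the usual metric for Cantonese ASR. As an illustration only (not the official scoring script), CER is the character-level Levenshtein distance between hypothesis and reference, normalized by reference length:

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic single-row dynamic-programming Levenshtein distance over characters.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit distance normalized by reference length.
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

A CER of 8.89, for example, means roughly 9 character edits per 100 reference characters.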
Decode with the U2++ Conformer model using WeNet's `recognize.py`:

```bash
dir=u2pp_conformer_yue
decode_checkpoint=$dir/u2pp_conformer_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention_rescoring \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --ctc_weight 0.5 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1
```
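The `--test_data` flag expects WeNet's `data.list` format: one JSON object per line with `key`, `wav`, and `txt` fields (this sketch assumes the raw, non-shard format):

```python
import json

def write_data_list(utts, path):
    # utts: iterable of (utt_id, wav_path, transcript) triples.
    # WeNet's raw data.list holds one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for key, wav, txt in utts:
            f.write(json.dumps({"key": key, "wav": wav, "txt": txt},
                               ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps the Cantonese transcripts as readable UTF-8 rather than `\uXXXX` escapes.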
Decode with the fine-tuned Whisper-medium model (pure attention decoding, so the CTC and reverse weights are zero):

```bash
dir=whisper_medium_yue
decode_checkpoint=$dir/whisper_medium_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --blank_penalty 0.0 \
  --ctc_weight 0.0 \
  --reverse_weight 0.0 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1
```
Run inference with the SenseVoice model via FunASR (the original snippet passed an undefined `model_path` and `wav_path`; both are defined below):

```python
from funasr import AutoModel

model_dir = "sensevoice_small_yue"
wav_path = "path/to/audio.wav"

model = AutoModel(
    model=model_dir,
    device="cuda:0",
)

res = model.generate(
    input=wav_path,
    cache={},
    language="yue",   # decode as Cantonese
    use_itn=True,     # apply inverse text normalization
    batch_size=64,
)
print(res[0]["text"])
```
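SenseVoice output interleaves special markers (language, emotion, and event tags) with the transcript; FunASR provides a `rich_transcription_postprocess` utility to clean them up. A minimal stdlib-only sketch of the same idea, assuming the `<|...|>` tag format:

```python
import re

def strip_special_tokens(text: str) -> str:
    # Remove SenseVoice-style markers such as <|yue|> or <|NEUTRAL|>,
    # leaving only the plain transcript.
    return re.sub(r"<\|[^|]*\|>", "", text).strip()
```

Unlike FunASR's utility, this sketch only deletes the tags; it does not map emotion or event tokens to emoji or merge repeated segments.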