Pre-trained checkpoints for speech representation in Japanese
The models in this repository were pre-trained for speech representation via self-supervised learning (SSL), using the fairseq toolkit.
wav2vec2_base_csj.pt
- fairseq checkpoint of a wav2vec 2.0 model with the Base architecture, pre-trained on 16 kHz-sampled speech data from the Corpus of Spontaneous Japanese (CSJ)

wav2vec2_base_csj_hf
- version of wav2vec2_base_csj.pt converted to be compatible with the Hugging Face interface by using this tool

hubert_base_csj.pt
- fairseq checkpoint of a HuBERT model with the Base architecture, pre-trained on 16 kHz-sampled speech data from the Corpus of Spontaneous Japanese (CSJ)

hubert_base_csj_hf
- version of hubert_base_csj.pt converted to be compatible with the Hugging Face interface by using this tool
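The checkpoints can be used either from fairseq directly or, after conversion, through the Hugging Face transformers library. The following is a minimal sketch of feature extraction with the fairseq wav2vec 2.0 checkpoint; it assumes fairseq and PyTorch are installed and that wav2vec2_base_csj.pt sits in the current directory. The path and the dummy waveform are placeholders, not part of this repository.

```python
# Minimal sketch: feature extraction with the fairseq wav2vec 2.0 checkpoint.
# Assumes fairseq and PyTorch are installed; adjust the path as needed.
import torch
import fairseq

models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec2_base_csj.pt"]
)
model = models[0]
model.eval()

# One second of 16 kHz audio as a stand-in for real speech.
waveform = torch.zeros(1, 16000)

with torch.no_grad():
    # features_only=True returns the contextual representations ("x")
    # rather than the quantization outputs used during pre-training.
    features = model(waveform, features_only=True, mask=False)["x"]

print(features.shape)  # (batch, frames, 768) for the Base architecture
```

For the converted checkpoints, the standard transformers loading interface applies. Below is a sketch assuming wav2vec2_base_csj_hf is a local directory produced by the conversion tool; for the HuBERT model, substitute HubertModel and hubert_base_csj_hf.

```python
# Minimal sketch: loading the Hugging Face conversion with transformers.
# "wav2vec2_base_csj_hf" is assumed to be the local directory produced by
# the conversion tool; for HuBERT, use HubertModel and hubert_base_csj_hf.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("wav2vec2_base_csj_hf")
model.eval()

# The models expect 16 kHz mono input; default extractor settings are
# assumed here, not taken from this repository.
feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)

waveform = [0.0] * 16000  # one second of silence as a stand-in
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size)
```

Either route yields frame-level speech representations (one frame per roughly 20 ms of audio) that can be fed to downstream tasks.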
If you find these models helpful, please consider citing the following paper:
@INPROCEEDINGS{ashihara_icassp23,
  author={Takanori Ashihara and Takafumi Moriya and Kohei Matsuura and Tomohiro Tanaka},
  title={Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023}
}