# Datasets Format

Amphion supports the following academic datasets (sorted alphabetically):

- [Datasets Format](#datasets-format)
  - [AudioCaps](#audiocaps)
  - [CSD](#csd)
  - [CustomSVCDataset](#customsvcdataset)
  - [Hi-Fi TTS](#hi-fi-tts)
  - [KiSing](#kising)
  - [LibriLight](#librilight)
  - [LibriTTS](#libritts)
  - [LJSpeech](#ljspeech)
  - [M4Singer](#m4singer)
  - [NUS-48E](#nus-48e)
  - [Opencpop](#opencpop)
  - [OpenSinger](#opensinger)
  - [Opera](#opera)
  - [PopBuTFy](#popbutfy)
  - [PopCS](#popcs)
  - [PJS](#pjs)
  - [SVCC](#svcc)
  - [VCTK](#vctk)

The download link and file structure of each dataset are given below.

> **Note:** When running Amphion in Docker, you must mount the dataset into the container after downloading it. See [Mount dataset in Docker container](./docker.md) for details.

## AudioCaps

AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip is paired with a caption that carries rich semantic information.

Download the AudioCaps dataset [here](https://github.com/cdjkim/audiocaps). The file structure is as follows:

```plaintext
[AudioCaps dataset path]
┣ AudioCpas
┃   ┣ wav
┃   ┃   ┣ ---1_cCGK4M_0_10000.wav
┃   ┃   ┣ ---lTs1dxhU_30000_40000.wav
┃   ┃   ┣ ...
```

## CSD

Download the official CSD dataset [here](https://zenodo.org/records/4785016). The file structure is as follows:

```plaintext
[CSD dataset path]
┣ english
┣ korean
┣ utterances
┃   ┣ en001a
┃   ┃   ┣ {UtteranceID}.wav
┃   ┣ en001b
┃   ┣ en002a
┃   ┣ en002b
┃   ┣ ...
┣ README
```

## CustomSVCDataset

Custom datasets are supported for Singing Voice Conversion. To build your own dataset, organize your data in the following structure:

```plaintext
[Your Custom Dataset Path]
┣ singer1
┃   ┣ song1
┃   ┃   ┣ utterance1.wav
┃   ┃   ┣ utterance2.wav
┃   ┃   ┣ ...
┃   ┣ song2
┃   ┣ ...
┣ singer2
┣ ...
```
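
Before training, it can help to confirm that a custom dataset really follows the `singer/song/utterance.wav` layout above. The sketch below is not part of Amphion: `list_utterances` is a hypothetical helper that collects every `.wav` file found exactly three levels below the dataset root and ignores anything placed elsewhere.

```python
from pathlib import Path


def list_utterances(dataset_root):
    """Collect (singer, song, utterance_name, path) for every .wav file
    located at <root>/<singer>/<song>/<utterance>.wav."""
    root = Path(dataset_root)
    records = []
    # "*/*/*.wav" only matches files exactly three levels below the root.
    for wav in sorted(root.glob("*/*/*.wav")):
        singer, song = wav.parts[-3], wav.parts[-2]
        records.append((singer, song, wav.stem, wav))
    return records
```

An empty result, or fewer records than expected, usually means that some files sit at the wrong directory depth.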

## Hi-Fi TTS

Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure is as follows:

```plaintext
[Hi-Fi TTS dataset path]
┣ audio
┃   ┣ 11614_other {Speaker_ID}_{SNR_subset}
┃   ┃   ┣ 10547 {Book_ID}
┃   ┃   ┃   ┣ thousandnights8_04_anonymous_0001.flac
┃   ┃   ┃   ┣ thousandnights8_04_anonymous_0003.flac
┃   ┃   ┃   ┣ thousandnights8_04_anonymous_0004.flac
┃   ┃   ┃   ┣ ...
┃   ┃   ┣ ...
┃   ┣ ...
┣ 92_manifest_clean_dev.json
┣ 92_manifest_clean_test.json
┣ 92_manifest_clean_train.json
┣ ...
┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
┣ ...
┣ books_bandwidth.tsv
┣ LICENSE.txt
┣ readers_books_clean.txt
┣ readers_books_other.txt
┣ README.txt
```
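
The manifest names follow the `{Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json` pattern shown in the tree. As an illustration, the fields can be recovered with a regular expression. This is a minimal sketch, not Amphion code: `parse_manifest_name` is a hypothetical helper, and it assumes that speaker IDs are numeric and that the subset and split are lowercase words, as in the file names listed above.

```python
import re

# Pattern for "{Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json".
# Assumption: numeric speaker IDs, lowercase subset and split names.
MANIFEST_RE = re.compile(
    r"^(?P<speaker_id>\d+)_manifest_(?P<snr_subset>[a-z]+)_(?P<dataset_split>[a-z]+)\.json$"
)


def parse_manifest_name(filename):
    """Return the fields encoded in a manifest file name,
    or None if the name does not follow the pattern."""
    match = MANIFEST_RE.match(filename)
    return match.groupdict() if match else None
```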

## KiSing

Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure is as follows:

```plaintext
[KiSing dataset path]
┣ clean
┃   ┣ 421
┃   ┣ 422
┃   ┣ ...
```

## LibriLight

Download the official LibriLight dataset [here](https://github.com/facebookresearch/libri-light). The file structure is as follows:

```plaintext
[LibriLight dataset path]
┣ small (Subset)
┃   ┣ 100 {Speaker_ID}
┃   ┃   ┣ sea_fairies_0812_librivox_64kb_mp3 {Chapter_ID}
┃   ┃   ┃   ┣ 01_baum_sea_fairies_64kb.flac
┃   ┃   ┃   ┣ 02_baum_sea_fairies_64kb.flac
┃   ┃   ┃   ┣ 03_baum_sea_fairies_64kb.flac
┃   ┃   ┃   ┣ 22_baum_sea_fairies_64kb.flac
┃   ┃   ┃   ┣ 01_baum_sea_fairies_64kb.json
┃   ┃   ┃   ┣ 02_baum_sea_fairies_64kb.json
┃   ┃   ┃   ┣ 03_baum_sea_fairies_64kb.json
┃   ┃   ┃   ┣ 22_baum_sea_fairies_64kb.json
┃   ┃   ┃   ┣ ...
┃   ┃   ┣ ...
┃   ┣ ...
┣ medium (Subset)
┣ ...
```

## LibriTTS

Download the official LibriTTS dataset [here](https://www.openslr.org/60/). The file structure is as follows:

```plaintext
[LibriTTS dataset path]
┣ BOOKS.txt
┣ CHAPTERS.txt
┣ eval_sentences10.tsv
┣ LICENSE.txt
┣ NOTE.txt
┣ reader_book.tsv
┣ README_librispeech.txt
┣ README_libritts.txt
┣ speakers.tsv
┣ SPEAKERS.txt
┣ dev-clean (Subset)
┃   ┣ 1272 {Speaker_ID}
┃   ┃   ┣ 128104 {Chapter_ID}
┃   ┃   ┃   ┣ 1272_128104_000001_000000.normalized.txt
┃   ┃   ┃   ┣ 1272_128104_000001_000000.original.txt
┃   ┃   ┃   ┣ 1272_128104_000001_000000.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ 1272_128104.book.tsv
┃   ┃   ┃   ┣ 1272_128104.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ dev-other (Subset)
┃   ┣ 116 {Speaker_ID}
┃   ┃   ┣ 288045 {Chapter_ID}
┃   ┃   ┃   ┣ 116_288045_000003_000000.normalized.txt
┃   ┃   ┃   ┣ 116_288045_000003_000000.original.txt
┃   ┃   ┃   ┣ 116_288045_000003_000000.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ 116_288045.book.tsv
┃   ┃   ┃   ┣ 116_288045.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ test-clean (Subset)
┃   ┣ {Speaker_ID}
┃   ┃   ┣ {Chapter_ID}
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ test-other
┃   ┣ {Speaker_ID}
┃   ┃   ┣ {Chapter_ID}
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ train-clean-100
┃   ┣ {Speaker_ID}
┃   ┃   ┣ {Chapter_ID}
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ train-clean-360
┃   ┣ {Speaker_ID}
┃   ┃   ┣ {Chapter_ID}
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
┣ train-other-500
┃   ┣ {Speaker_ID}
┃   ┃   ┣ {Chapter_ID}
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
┃   ┃   ┃   ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
┃   ┃   ┣ ...
┃   ┣ ...
```
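
Utterance-level files encode their speaker, chapter, and utterance IDs in the file name, for example `1272_128104_000001_000000.wav`. The sketch below shows one way to split such a name. It is an illustration only: `parse_libritts_name` is a hypothetical helper, and it assumes that everything after the second underscore is the utterance ID, which matches the examples in the tree.

```python
from pathlib import Path


def parse_libritts_name(path):
    """Split an utterance file name into (speaker_id, chapter_id, utterance_id).

    Raises ValueError for chapter-level files such as "1272_128104.book.tsv",
    which carry no utterance ID.
    """
    # Drop every extension: ".wav", ".normalized.txt", ".original.txt", ...
    stem = Path(path).name.split(".", 1)[0]
    # Assumption: the utterance ID is everything after the second underscore.
    parts = stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"not an utterance-level file: {path}")
    speaker_id, chapter_id, utterance_id = parts
    return speaker_id, chapter_id, utterance_id
```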

## LJSpeech

Download the official LJSpeech dataset [here](https://keithito.com/LJ-Speech-Dataset/). The file structure is as follows:

```plaintext
[LJSpeech dataset path]
┣ metadata.csv
┣ wavs
┃   ┣ LJ001-0001.wav
┃   ┣ LJ001-0002.wav
┃   ┣ ...
┣ README
```

## M4Singer

Download the official M4Singer dataset [here](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view). The file structure is as follows:

```plaintext
[M4Singer dataset path]
┣ {Singer_1}#{Song_1}
┃   ┣ 0000.mid
┃   ┣ 0000.TextGrid
┃   ┣ 0000.wav
┃   ┣ ...
┣ {Singer_1}#{Song_2}
┣ ...
┣ {Singer_2}#{Song_1}
┣ {Singer_2}#{Song_2}
┣ ...
┗ meta.json
```
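
Because each folder name combines the singer and the song as `{Singer}#{Song}`, grouping songs by singer only requires splitting on `#`. The following is a rough sketch rather than Amphion's own preprocessing: `songs_by_singer` is a hypothetical helper, and it assumes that the first `#` in a folder name separates the singer from the song.

```python
from collections import defaultdict
from pathlib import Path


def songs_by_singer(dataset_root):
    """Map each singer to the list of songs found under the dataset root."""
    groups = defaultdict(list)
    for entry in sorted(Path(dataset_root).iterdir()):
        # Skip plain files such as meta.json and folders without a "#".
        if entry.is_dir() and "#" in entry.name:
            singer, song = entry.name.split("#", 1)
            groups[singer].append(song)
    return dict(groups)
```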

## NUS-48E

Download the official NUS-48E dataset [here](https://drive.google.com/drive/folders/12pP9uUl0HTVANU3IPLnumTJiRjPtVUMx). The file structure is as follows:

```plaintext
[NUS-48E dataset path]
┣ {SpeakerID}
┃   ┣ read
┃   ┃   ┣ {SongID}.txt
┃   ┃   ┣ {SongID}.wav
┃   ┃   ┣ ...
┃   ┣ sing
┃   ┃   ┣ {SongID}.txt
┃   ┃   ┣ {SongID}.wav
┃   ┃   ┣ ...
┣ ...
┣ README.txt
```

## Opencpop

Download the official Opencpop dataset [here](https://wenet.org.cn/opencpop/). The file structure is as follows:

```plaintext
[Opencpop dataset path]
┣ midis
┃   ┣ 2001.midi
┃   ┣ 2002.midi
┃   ┣ 2003.midi
┃   ┣ ...
┣ segments
┃   ┣ wavs
┃   ┃   ┣ 2001000001.wav
┃   ┃   ┣ 2001000002.wav
┃   ┃   ┣ 2001000003.wav
┃   ┃   ┣ ...
┃   ┣ test.txt
┃   ┣ train.txt
┃   ┗ transcriptions.txt
┣ textgrids
┃   ┣ 2001.TextGrid
┃   ┣ 2002.TextGrid
┃   ┣ 2003.TextGrid
┃   ┣ ...
┣ wavs
┃   ┣ 2001.wav
┃   ┣ 2002.wav
┃   ┣ 2003.wav
┃   ┣ ...
┣ TERMS_OF_ACCESS
┗ readme.md
```

## OpenSinger

Download the official OpenSinger dataset [here](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view). The file structure is as follows:

```plaintext
[OpenSinger dataset path]
┣ ManRaw
┃   ┣ {Singer_1}_{Song_1}
┃   ┃   ┣ {Singer_1}_{Song_1}_0.lab
┃   ┃   ┣ {Singer_1}_{Song_1}_0.txt
┃   ┃   ┣ {Singer_1}_{Song_1}_0.wav
┃   ┃   ┣ ...
┃   ┣ {Singer_1}_{Song_2}
┃   ┣ ...
┣ WomanRaw
┣ LICENSE
┗ README.md
```

## Opera

Download the official Opera dataset [here](http://isophonics.net/SingingVoiceDataset). The file structure is as follows:

```plaintext
[Opera dataset path]
┣ monophonic
┃   ┣ chinese
┃   ┃   ┣ {Gender}_{SingerID}
┃   ┃   ┃   ┣ {Emotion}_{SongID}.wav
┃   ┃   ┃   ┣ ...
┃   ┃   ┣ ...
┃   ┣ western
┣ polyphonic
┃   ┣ chinese
┃   ┣ western
┣ CrossculturalDataSet.xlsx
```

## PopBuTFy

Download the official PopBuTFy dataset [here](https://github.com/MoonInTheRiver/NeuralSVB). The file structure is as follows:

```plaintext
[PopBuTFy dataset path]
┣ data
┃   ┣ {SingerID}#singing#{SongName}_Amateur
┃   ┃   ┣ {SingerID}#singing#{SongName}_Amateur_{UtteranceID}.mp3
┃   ┃   ┣ ...
┃   ┣ {SingerID}#singing#{SongName}_Professional
┃   ┃   ┣ {SingerID}#singing#{SongName}_Professional_{UtteranceID}.mp3
┃   ┃   ┣ ...
┣ text_labels
┗ TERMS_OF_ACCESS
```
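
The folder names under `data` carry the singer ID, the song name, and the skill level (`Amateur` or `Professional`). As a hedged illustration, the pattern can be decoded as shown below. `parse_popbutfy_dir` is a hypothetical helper, not an Amphion function, and it assumes that singer IDs never contain `#`.

```python
import re

# Pattern for "{SingerID}#singing#{SongName}_Amateur" and "..._Professional".
# Assumption: the singer ID itself contains no "#".
POPBUTFY_DIR_RE = re.compile(
    r"^(?P<singer_id>[^#]+)#singing#(?P<song_name>.+)_(?P<level>Amateur|Professional)$"
)


def parse_popbutfy_dir(dirname):
    """Return singer ID, song name, and skill level for a recording folder,
    or None if the name does not follow the pattern."""
    match = POPBUTFY_DIR_RE.match(dirname)
    return match.groupdict() if match else None
```

Song names that contain underscores are still handled, because only the final `_Amateur` or `_Professional` suffix is treated as the skill level.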

## PopCS

Download the official PopCS dataset [here](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md). The file structure is as follows:

```plaintext
[PopCS dataset path]
┣ popcs
┃   ┣ popcs-{SongName}
┃   ┃   ┣ {UtteranceID}_ph.txt
┃   ┃   ┣ {UtteranceID}_wf0.wav
┃   ┃   ┣ {UtteranceID}.TextGrid
┃   ┃   ┣ {UtteranceID}.txt
┃   ┃   ┣ ...
┃   ┣ ...
┗ TERMS_OF_ACCESS
```

## PJS

Download the official PJS dataset [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus). The file structure is as follows:

```plaintext
[PJS dataset path]
┣ PJS_corpus_ver1.1
┃   ┣ background_noise
┃   ┣ pjs{SongID}
┃   ┃   ┣ pjs{SongID}_song.wav
┃   ┃   ┣ pjs{SongID}_speech.wav
┃   ┃   ┣ pjs{SongID}.lab
┃   ┃   ┣ pjs{SongID}.mid
┃   ┃   ┣ pjs{SongID}.musicxml
┃   ┃   ┣ pjs{SongID}.txt
┃   ┣ ...
```

## SVCC

Download the official SVCC dataset [here](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset). The file structure is as follows:

```plaintext
[SVCC dataset path]
┣ Data
┃   ┣ CDF1
┃   ┃   ┣ 10001.wav
┃   ┃   ┣ 10002.wav
┃   ┃   ┣ ...
┃   ┣ CDM1
┃   ┣ IDF1
┃   ┣ IDM1
┗ README.md
```

## VCTK

Download the official VCTK dataset [here](https://datashare.ed.ac.uk/handle/10283/3443). The file structure is as follows:

```plaintext
[VCTK dataset path]
┣ txt
┃   ┣ {Speaker_1}
┃   ┃   ┣ {Speaker_1}_001.txt
┃   ┃   ┣ {Speaker_1}_002.txt
┃   ┃   ┣ ...
┃   ┣ {Speaker_2}
┃   ┣ ...
┣ wav48_silence_trimmed
┃   ┣ {Speaker_1}
┃   ┃   ┣ {Speaker_1}_001_mic1.flac
┃   ┃   ┣ {Speaker_1}_001_mic2.flac
┃   ┃   ┣ {Speaker_1}_002_mic1.flac
┃   ┃   ┣ ...
┃   ┣ {Speaker_2}
┃   ┣ ...
┣ speaker-info.txt
┗ update.txt
```
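
Transcripts and recordings live in parallel trees, and each utterance may have one recording per microphone. The sketch below pairs every transcript with its recording for a chosen microphone and skips utterances whose recording is missing. It is only an illustration of the layout above: `pair_vctk_files` and its `mic` parameter are hypothetical names, not part of Amphion.

```python
from pathlib import Path


def pair_vctk_files(dataset_root, mic="mic1"):
    """Return (transcript_path, audio_path) pairs for one microphone.

    Transcripts without a matching .flac file are skipped.
    """
    root = Path(dataset_root)
    pairs = []
    for txt in sorted(root.glob("txt/*/*.txt")):
        speaker = txt.parent.name
        # e.g. txt/{Speaker_1}/{Speaker_1}_001.txt
        #  ->  wav48_silence_trimmed/{Speaker_1}/{Speaker_1}_001_mic1.flac
        flac = root / "wav48_silence_trimmed" / speaker / f"{txt.stem}_{mic}.flac"
        if flac.exists():
            pairs.append((txt, flac))
    return pairs
```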