---
license: cc-by-nc-sa-4.0
---

## Description

These models separate reverb and delay effects from vocals. In addition, **these models can also remove most of the harmonies.** I added a random high cut after the reverberation and delay effects in the dataset, so these models' handling of high frequencies is not particularly aggressive.
You can listen to examples of these models' output [here](https://huggingface.co/Sucial/Dereverb-Echo_Mel_Band_Roformer/tree/main/example)!

## How to use the model?

Try it with [ZFTurbo's Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training).

## Models

### 1. Fused Models

I used [a model fusion script](https://huggingface.co/Sucial/Dereverb-Echo_Mel_Band_Roformer/blob/main/scripts/model_fusion.py) to fuse three models that share the same model structure. The three models and their corresponding fusion ratios are as follows:
**0.5 * dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt + 0.25 * de_big_reverb_mbr_ep_362.ckpt + 0.25 * de_super_big_reverb_mbr_ep_346.ckpt**
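
The weighted sum above can be sketched as a per-parameter average over checkpoints. This is only a minimal illustration, not the linked `model_fusion.py`: the function name `fuse_state_dicts` is hypothetical, and it assumes each checkpoint is a flat mapping from parameter name to a value that supports scalar multiplication and addition (plain floats here; tensors behave the same way).

```python
def fuse_state_dicts(state_dicts, weights):
    """Return the weighted sum of several checkpoints, parameter by parameter.

    Hypothetical helper for illustration only. Assumes all checkpoints share
    the same parameter names (same model structure) and that the weights
    sum to 1.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("fusion weights should sum to 1")

    reference_keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != reference_keys:
            raise ValueError("checkpoints do not share the same parameters")

    fused = {}
    for key in state_dicts[0]:
        # Start from the first checkpoint, then accumulate the others.
        acc = state_dicts[0][key] * weights[0]
        for sd, w in zip(state_dicts[1:], weights[1:]):
            acc = acc + sd[key] * w
        fused[key] = acc
    return fused


if __name__ == "__main__":
    # Toy stand-ins for the v2, big reverb and super big reverb checkpoints.
    v2 = {"layer.weight": 1.0, "layer.bias": 0.0}
    big = {"layer.weight": 2.0, "layer.bias": 4.0}
    super_big = {"layer.weight": 4.0, "layer.bias": 8.0}
    print(fuse_state_dicts([v2, big, super_big], [0.5, 0.25, 0.25]))
```

With real checkpoints, the three dictionaries would come from loading the `.ckpt` files, and the fused dictionary would be saved as a new checkpoint that uses the same config.
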
Therefore, the fused model can remove both small and large reverberations. However, I have not carefully tuned the fusion ratio of each model. If anyone with experience is willing to help me tune it, I would be very grateful!

config (the same as for the v2 and big reverb models): [config_dereverb_echo_mbr_v2.yaml](./config_dereverb_echo_mbr_v2.yaml)
fused_model: [dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt](./dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt)

### 2. Big reverb Models

There are two models for removing large reverberation: [de_big_reverb_mbr_ep_362.ckpt](./de_big_reverb_mbr_ep_362.ckpt) and [de_super_big_reverb_mbr_ep_346.ckpt](./de_super_big_reverb_mbr_ep_346.ckpt). In general, the `de_big_reverb_mbr` model is sufficient for large reverberations. The `de_super_big_reverb_mbr` model is trained for extremely large reverberations and is needed less often. Both models use the same configuration file as the v2 model, and both are finetuned from `dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt`.

config: [config_dereverb_echo_mbr_v2.yaml](./config_dereverb_echo_mbr_v2.yaml)
Model_de_big_reverb: [de_big_reverb_mbr_ep_362.ckpt](./de_big_reverb_mbr_ep_362.ckpt)
Model_de_super_big_reverb: [de_super_big_reverb_mbr_ep_346.ckpt](./de_super_big_reverb_mbr_ep_346.ckpt)

To better validate the models' performance, I added two metrics, `f0_fitness` and `uv_fitness`, as follows:
They calculate the F0 and voiced/unvoiced (UV) fitness between a reference and an estimated audio signal. These two metrics are only meaningful as a reference for vocals.
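
As a rough illustration of a correlation-based fitness of this kind, the sketch below computes the Pearson correlation between two F0 tracks and between their voiced/unvoiced flags. It assumes the F0 tracks have already been extracted by some pitch tracker, are frame-aligned, and mark unvoiced frames with 0 Hz. The helper names `pearson` and `f0_uv_fitness` are hypothetical, and the metric code actually used for validation may differ in its details.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, in [-1, 1]."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant sequence carries no correlation information.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def f0_uv_fitness(f0_ref, f0_est):
    """Return (f0_fitness, uv_fitness) for two frame-aligned F0 tracks.

    Assumption: unvoiced frames are encoded as 0 Hz in both tracks.
    """
    uv_ref = [1.0 if f > 0 else 0.0 for f in f0_ref]
    uv_est = [1.0 if f > 0 else 0.0 for f in f0_est]
    return pearson(f0_ref, f0_est), pearson(uv_ref, uv_est)


if __name__ == "__main__":
    reference = [0.0, 220.0, 230.0, 0.0, 440.0]
    # The estimate wrongly marks the fourth frame as voiced.
    estimate = [0.0, 220.0, 230.0, 100.0, 440.0]
    f0_fit, uv_fit = f0_uv_fitness(reference, estimate)
    print(f"f0_fitness={f0_fit:.4f} uv_fitness={uv_fit:.4f}")
```

In this toy case a single frame wrongly detected as voiced lowers both scores, which is the kind of leftover or missing pitch content the two metrics are meant to expose.
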
The F0 fitness measures how similar the fundamental frequency (F0) of the reference and estimated signals are, while the UV fitness evaluates the accuracy of voiced/unvoiced detection between the two signals. Both are computed by extracting F0 and UV information using pitch analysis and then calculating the Pearson correlation between the corresponding F0 and UV sequences. The F0 fitness can also be used to compare the completeness of the extracted F0 for human voice signals. Both metrics range from -1 to 1, and the closer the value is to 1, the better the fit.

For these two models, I used different validation sets (so the SDR values are not meaningful for comparison), and the validation results are as follows:

```
de_big_reverb_mbr_ep_362.ckpt
Num overlap: 2
Instr dry sdr: 14.0030 (Std: 2.9492)
Instr dry bleedless: 43.6501 (Std: 10.1362)
Instr dry fullness: 21.7776 (Std: 5.9445)
Instr dry f0_fitness: 0.8405 (Std: 0.1520)
Instr dry uv_fitness: 0.9759 (Std: 0.0162)

de_super_big_reverb_mbr_ep_346.ckpt
Num overlap: 2
Instr dry sdr: 11.3164 (Std: 2.4877)
Instr dry bleedless: 43.3989 (Std: 10.7918)
Instr dry fullness: 17.5554 (Std: 4.0178)
Instr dry f0_fitness: 0.7845 (Std: 0.1864)
Instr dry uv_fitness: 0.9662 (Std: 0.0172)
```

### 3. V2 Models

Config: [config_dereverb_echo_mbr_v2.yaml](./config_dereverb_echo_mbr_v2.yaml)
Model: [dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt](./dereverb_echo_mbr_v2_sdr_dry_13.4843.ckpt)
Instr dry sdr: 13.4843 (Std: 4.8675)
Finetuned from: `dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt`
Used 1000+ songs for finetuning.

### 4. V1 Models

Configs: [config_dereverb-echo_mel_band_roformer.yaml](./config_dereverb-echo_mel_band_roformer.yaml)
Model: [dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt](./dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt)
Instr dry sdr: 13.1507, Instr other sdr: 6.8830, Metric avg sdr: 10.0169
Instruments: [dry, other]
Finetuned from: `model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt`
Datasets:

- Training datasets: 270 songs from [opencpop](https://github.com/wenet-e2e/opencpop) and [GTSinger](https://github.com/GTSinger/GTSinger)
- Validation datasets: 30 songs from my own collection
- All random reverb and delay effects are generated by [this python script](./scripts/create_reverb_delay.py) and arranged in the MUSDB18 dataset format.

## Thanks

- Mel-Band-Roformer [[Paper](https://arxiv.org/abs/2310.01809), [Repository](https://github.com/lucidrains/BS-RoFormer)]
- [ZFTurbo](https://github.com/ZFTurbo)'s training code [[Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)]
- [CN17161](https://github.com/CN17161) provided GPUs.
- [Glucy-2](https://github.com/Glucy-2) provided technical assistance.