---
license: mit
language:
- ru
pipeline_tag: audio-to-audio
tags:
- VoiceConversion
---

# DINO-HuVITS

## Info

The development of this model was inspired by the [DINO-VITS](https://arxiv.org/abs/2311.09770) paper.

It is built on the VITS architecture, in which the original `PosteriorEncoder` was replaced with a [HuBERT Base](https://arxiv.org/abs/2106.07447) model and the `SpeakerEncoder` was trained with the [DINO](https://arxiv.org/abs/2304.05754) loss.
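
For orientation, the sketch below shows one way these components could fit together at inference time. The module names (`content_encoder`, `speaker_encoder`, `decoder`) and the call signatures are illustrative assumptions, not the actual layout of this repository.

```python
import torch
import torch.nn as nn


class VoiceConversionSketch(nn.Module):
    """Illustrative skeleton only; not the actual DINO-HuVITS module layout."""

    def __init__(self, content_encoder: nn.Module, speaker_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.content_encoder = content_encoder  # HuBERT Base, standing in for the VITS PosteriorEncoder
        self.speaker_encoder = speaker_encoder  # speaker-embedding network trained with the DINO loss
        self.decoder = decoder                  # VITS-style waveform decoder

    @torch.no_grad()
    def convert(self, content_wav: torch.Tensor, reference_wav: torch.Tensor) -> torch.Tensor:
        z = self.content_encoder(content_wav)    # linguistic content of the source utterance
        g = self.speaker_encoder(reference_wav)  # speaker embedding of the reference utterance
        return self.decoder(z, g)                # source content rendered in the reference voice
```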

## Quick start

```python
import librosa
import torch

from dino_huvits import DinoHuVits

# Load the pretrained model and switch it to inference mode
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Both the content and the reference utterance are loaded at 16 kHz
content, _ = librosa.load("<content-path>", sr=16000)
reference, _ = librosa.load("<reference-path>", sr=16000)

# Add a batch dimension and record the content length in samples
content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

# Convert the content utterance to the reference speaker's voice
with torch.no_grad():
    output, _ = model(content, lengths, reference)
```
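
To inspect the result, the converted waveform can be written to disk, for example with `soundfile`. The snippet below assumes `output` is a mono waveform at 16 kHz with extra leading dimensions; check the repository if the decoder's output format differs.

```python
import soundfile as sf

# Assumption: `output` is a mono waveform with leading batch/channel dimensions,
# produced at 16 kHz to match the 16 kHz inputs; verify against the repository.
waveform = output.squeeze().cpu().numpy()
sf.write("converted.wav", waveform, 16000)
```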

## Datasets
- [Common Voice](https://commonvoice.mozilla.org/ru)
- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [VoxTube](https://github.com/IDRnD/VoxTube)
- [Golos](https://github.com/sberdevices/golos)
- [OpenSTT](https://github.com/snakers4/open_stt)
- [Sova](https://github.com/sovaai/sova-dataset)