---
license: mit
language:
- ru
pipeline_tag: audio-to-audio
tags:
- VoiceConversion
---
# DINO-HuVITS
## Info
The development of this model was inspired by the [DINO-VITS](https://arxiv.org/abs/2311.09770) paper.
It is based on the VITS architecture, in which the original `PosteriorEncoder` is replaced with a [HuBERT Base](https://arxiv.org/abs/2106.07447) model and the `SpeakerEncoder` is trained with the [DINO](https://arxiv.org/abs/2304.05754) loss function.
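
For readers unfamiliar with DINO, the following is a minimal NumPy sketch of the general self-distillation objective: a cross-entropy between a centered, sharpened teacher distribution and a student distribution. It is not this model's training code. The function names, temperatures, and momentum value are illustrative assumptions, and the sketch covers a single student/teacher pair rather than the multi-view setup used in practice.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student), averaged over the batch.

    The teacher output is centered and sharpened with a lower temperature.
    All default values here are assumptions chosen for illustration.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    student_log_probs = log_softmax(student_logits, student_temp)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the teacher's batch-mean output."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)
```

In a DINO-style setup the teacher is typically an exponential moving average of the student, and the two networks see different views of the same utterance. Which views and hyperparameters were used for the `SpeakerEncoder` here is not stated in this card.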
## Quick start
```python
import librosa
import torch
from dino_huvits import DinoHuVits

# Load the pretrained model and switch to inference mode
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Load both recordings as 16 kHz mono waveforms
content, _ = librosa.load("<content-path>", sr=16000)
reference, _ = librosa.load("<reference-path>", sr=16000)

# Add a batch dimension: (samples,) -> (1, samples)
content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

# Convert the content speech to the reference speaker's voice
with torch.no_grad():
    output, _ = model(content, lengths, reference)
```
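
To listen to the result, the converted waveform can be written to a WAV file. The helper below is a sketch using only NumPy and the standard-library `wave` module. `save_wav` is a hypothetical name, and it assumes that the model output is a float waveform in the range [-1, 1] sampled at 16 kHz, like the inputs. Neither the output shape nor the output sample rate is stated in this card, so check both before relying on it.

```python
import wave

import numpy as np


def save_wav(path, waveform, sample_rate=16000):
    """Write a float waveform to a 16-bit mono PCM WAV file.

    `sample_rate=16000` is an assumption that mirrors the input rate.
    Returns the number of samples written.
    """
    # Flatten any leading batch/channel dimensions, e.g. (1, T) or (1, 1, T)
    samples = np.asarray(waveform, dtype=np.float32).reshape(-1)
    # Clip to [-1, 1] and scale to little-endian signed 16-bit integers
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

With the variables from the quick start, a call might look like `save_wav("converted.wav", output.squeeze().cpu().numpy())`, assuming `output` holds a single utterance.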
## Datasets
- [Common Voice](https://commonvoice.mozilla.org/ru)
- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [VoxTube](https://github.com/IDRnD/VoxTube)
- [Golos](https://github.com/sberdevices/golos)
- [OpenSTT](https://github.com/snakers4/open_stt)
- [Sova](https://github.com/sovaai/sova-dataset)