---
license: mit
language:
  - ru
pipeline_tag: audio-to-audio
tags:
  - VoiceConversion
---

# DINO-HuVITS

## Info
This model was inspired by the [DINO-VITS](https://arxiv.org/abs/2311.09770) paper.
It is based on the VITS architecture, in which the original `PosteriorEncoder` was replaced with the [HuBERT Base](https://arxiv.org/abs/2106.07447) model, and the `SpeakerEncoder` was trained with the [DINO](https://arxiv.org/abs/2304.05754) loss function.
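
For readers unfamiliar with DINO, the sketch below shows the generic self-distillation objective in NumPy: a cross-entropy between a centered, sharpened teacher distribution and the student distribution. It is an illustration only, not this model's training code. The function names, the temperatures (`0.1`, `0.04`) and the centering momentum (`0.9`) are assumptions chosen for the example.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the row-wise maximum for numerical stability.
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions (hypothetical helper)."""
    # Teacher: subtract the running center, then sharpen with a low temperature.
    teacher_probs = softmax((teacher_logits - center) / teacher_temp)
    # Student: numerically stable log-softmax at its own temperature.
    scaled = student_logits / student_temp
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    student_log_probs = scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))
    # Average the per-sample cross-entropy over the batch.
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


def update_center(center, teacher_logits, momentum=0.9):
    # Exponential moving average of the teacher's batch-mean output.
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)
```

In an actual training loop the teacher would typically be an exponential moving average of the student and would receive no gradients; that part is omitted here.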

## Quick start

```python
import librosa
import torch

from dino_huvits import DinoHuVits


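# Load the pretrained weights and switch to inference mode.
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Load both recordings as mono float32 waveforms resampled to 16 kHz.
content, _ = librosa.load("<content-path>", sr=16000)
reference, _ = librosa.load("<reference-path>", sr=16000)

# Add a batch dimension: (num_samples,) -> (1, num_samples).
content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

# Convert the content utterance to the reference speaker's voice.
with torch.no_grad():
    output, _ = model(content, lengths, reference)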
```
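
The `lengths` argument suggests the model may accept several zero-padded utterances at once. The sketch below pads a list of waveforms to a common length and returns their original lengths. `pad_batch` is a hypothetical helper, and batched inference with padded inputs is an assumption that the example above does not confirm.

```python
import numpy as np


def pad_batch(waveforms):
    """Right-pad 1-D waveforms with zeros to a common length (hypothetical helper)."""
    # Original sample counts, one per utterance.
    lengths = np.array([len(wav) for wav in waveforms], dtype=np.int64)
    # Zero-filled float32 matrix of shape (batch, longest_utterance).
    batch = np.zeros((len(waveforms), int(lengths.max())), dtype=np.float32)
    for row, wav in enumerate(waveforms):
        batch[row, : len(wav)] = wav
    return batch, lengths
```

The two arrays can then be converted with `torch.from_numpy(batch)` and `torch.from_numpy(lengths)` before being passed to the model in place of `content` and `lengths`.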

## Datasets
- [Common Voice](https://commonvoice.mozilla.org/ru)
- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [VoxTube](https://github.com/IDRnD/VoxTube)
- [Golos](https://github.com/sberdevices/golos)
- [OpenSTT](https://github.com/snakers4/open_stt)
- [Sova](https://github.com/sovaai/sova-dataset)