---
license: mit
language:
- ru
pipeline_tag: audio-to-audio
---

# DINO-HuVITS

## Info

The development of this model was inspired by the paper [DINO-VITS](https://arxiv.org/abs/2311.09770).
It is built on the VITS architecture, with the original `PosteriorEncoder` replaced by a [HuBERT Base](https://arxiv.org/abs/2106.07447) model, and the `SpeakerEncoder` trained with the [DINO](https://arxiv.org/abs/2304.05754) loss function.
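DINO refers to self-distillation between a student and an EMA teacher network fed different views of the same input (here, of the same speaker). As an illustration only, a minimal sketch of such a loss in PyTorch follows; the function name, temperatures, and embedding size are assumptions, not this repository's training code:

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_logits, teacher_logits,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's sharpened soft labels and the
    student's log-probabilities (the core of a DINO-style objective)."""
    # Teacher output is sharpened with a low temperature and never backpropagated
    teacher_probs = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Two augmented "views" of the same speaker would produce these embeddings
student = torch.randn(4, 256)
teacher = torch.randn(4, 256)
loss = dino_style_loss(student, teacher)
```

In the full scheme the teacher's weights are an exponential moving average of the student's, so no speaker labels are required.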

## Quick start

```python
import librosa
import torch

from dino_huvits import DinoHuVits

# Load the pretrained model from the Hugging Face Hub
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Both waveforms must be sampled at 16 kHz (the HuBERT input rate)
content, _ = librosa.load("<content-path>", sr=16000)
reference, _ = librosa.load("<reference-path>", sr=16000)

# Add a batch dimension and record the content length in samples
content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

with torch.no_grad():
    output, _ = model(content, lengths, reference)
```
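The tensor preparation above can be checked on a dummy signal, with no model download needed (the zero array simply stands in for `librosa.load` output):

```python
import numpy as np
import torch

# A fake 2-second mono signal at 16 kHz in place of a loaded file
content = np.zeros(32000, dtype=np.float32)

# Add a batch dimension: (num_samples,) -> (1, num_samples)
content = torch.from_numpy(content).unsqueeze(0)  # torch.Size([1, 32000])

# Per-item length in samples, as a long tensor
lengths = torch.tensor([content.shape[1]], dtype=torch.long)  # [32000]
```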

## Datasets

- [Common Voice](https://commonvoice.mozilla.org/ru)
- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [VoxTube](https://github.com/IDRnD/VoxTube)
- [Golos](https://github.com/sberdevices/golos)
- [OpenSTT](https://github.com/snakers4/open_stt)
- [Sova](https://github.com/sovaai/sova-dataset)