	---
	license: mit
	language:
	- ru
	pipeline_tag: audio-to-audio
	tags:
	- VoiceConversion
	---

	# DINO-HuVITS

	## Info
	This model was inspired by the [DINO-VITS](https://arxiv.org/abs/2311.09770) paper.
	It is based on the VITS architecture, in which the original `PosteriorEncoder` is replaced with a [HuBERT Base](https://arxiv.org/abs/2106.07447) model, and the `SpeakerEncoder` is trained with the [DINO](https://arxiv.org/abs/2304.05754) loss function.
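
For readers unfamiliar with the DINO objective, here is a minimal sketch of the generic loss for a single pair of views. It is not this model's training code: the function names, the plain-list inputs, and the temperature values are illustrative assumptions. The teacher output is centered and sharpened with a low temperature, and the loss is the cross-entropy between the teacher distribution and the student distribution.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions.

    The temperatures are illustrative defaults (an assumption),
    not values taken from this model.
    """
    # Teacher: subtract the running center, then sharpen.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    # Student: log-probabilities at a higher temperature.
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * q for p, q in zip(teacher_probs, student_log_probs))


# The loss is small when both networks agree and large when they disagree.
agree = dino_loss([1.0, 0.0], [1.0, 0.0], center=[0.0, 0.0])
disagree = dino_loss([0.0, 1.0], [1.0, 0.0], center=[0.0, 0.0])
```

In the general DINO setup the two inputs are different augmented views of the same sample, the teacher is an exponential moving average of the student, and gradients flow only through the student.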

	## Quick start

	```python
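	import librosa
	import torch

	from dino_huvits import DinoHuVits


	# Load the pretrained model and switch it to inference mode.
	model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
	model = model.eval()

	# Load both clips as mono waveforms resampled to 16 kHz.
	content, _ = librosa.load("<content-path>", sr=16000)
	reference, _ = librosa.load("<reference-path>", sr=16000)

	# Add a batch dimension: (samples,) -> (1, samples).
	content = torch.from_numpy(content).unsqueeze(0)
	# Number of samples in the content clip.
	lengths = torch.tensor([content.shape[1]], dtype=torch.long)
	reference = torch.from_numpy(reference).unsqueeze(0)

	# Run the conversion without tracking gradients.
	with torch.no_grad():
	    output, _ = model(content, lengths, reference)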
	```
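
To listen to the result, the converted waveform needs to be written to disk. The helper below is a minimal sketch that uses only the standard library. `save_wav` is a hypothetical name, and it assumes the samples are mono floats in the range [-1, 1]; the shape and sample rate of `output` are not documented here, so check both before relying on it.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        s = max(-1.0, min(1.0, float(s)))
        pcm += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(pcm))
```

A possible call, assuming `output` holds a single waveform and the model produces 16 kHz audio (both unverified assumptions), would be `save_wav("converted.wav", output.flatten().tolist(), 16000)`.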

	## Datasets
	- [Common Voice](https://commonvoice.mozilla.org/ru)
	- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
	- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
	- [VoxTube](https://github.com/IDRnD/VoxTube)
	- [Golos](https://github.com/sberdevices/golos)
	- [OpenSTT](https://github.com/snakers4/open_stt)
	- [Sova](https://github.com/sovaai/sova-dataset)