---
license: mit
language:
- ru
pipeline_tag: audio-to-audio
tags:
- VoiceConversion
---

# DINO-HuVITS

## Info

This model is inspired by the [DINO-VITS](https://arxiv.org/abs/2311.09770) paper. It builds on the VITS architecture: the original `PosteriorEncoder` is replaced with a [HuBERT Base](https://arxiv.org/abs/2106.07447) model, and the `SpeakerEncoder` is trained with the [DINO](https://arxiv.org/abs/2304.05754) loss.

## Quick start

```python
import librosa
import torch

from dino_huvits import DinoHuVits

# Load the pretrained model and switch it to evaluation mode
model = DinoHuVits.from_pretrained("SazerLife/DINO-HuVITS")
model = model.eval()

# Source (content) and reference (target-speaker) utterances, both at 16 kHz
content, _ = librosa.load("", sr=16000)
reference, _ = librosa.load("", sr=16000)

content = torch.from_numpy(content).unsqueeze(0)
lengths = torch.tensor([content.shape[1]], dtype=torch.long)
reference = torch.from_numpy(reference).unsqueeze(0)

with torch.no_grad():
    output, _ = model(content, lengths, reference)
```

## Datasets

- [Common Voice](https://commonvoice.mozilla.org/ru)
- [VoxForge](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [M-AILABS](https://github.com/vlomme/Multi-Tacotron-Voice-Cloning)
- [VoxTube](https://github.com/IDRnD/VoxTube)
- [Golos](https://github.com/sberdevices/golos)
- [OpenSTT](https://github.com/snakers4/open_stt)
- [Sova](https://github.com/sovaai/sova-dataset)
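
## Saving the output

A minimal sketch of writing the converted waveform to disk with `soundfile`. It assumes the model returns a batched waveform tensor and that the output sample rate matches the 16 kHz input rate; check the model configuration if your setup differs.

```python
import soundfile as sf

# `output` comes from the Quick start snippet above; squeeze batch/channel dims
waveform = output.squeeze().cpu().numpy()

# Assumption: the vocoder output is 16 kHz, matching the input sample rate
sf.write("converted.wav", waveform, 16000)
```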