Bark-voice-cloning
Bark-voice-cloning is a model that converts the outputs of a HuBERT model into semantic tokens compatible with Bark text-to-speech.
This can be used for many things, including speech transfer and voice cloning.
The models in this repo
- quantifier_hubert_base_ls960.pth (the model trained on literature for 4 epochs)
- quantifier_hubert_base_ls960_14.pth (the model trained on literature for 10 more epochs, based on the previous)
- quantifier_V1_hubert_base_ls960_23.pth (a larger model, trained for more epochs on the same dataset)
(Please use the model manager from the code repo to download the models easily.)
Voice cloning
Voice cloning is creating a new voice for text-to-speech.
Process:
- Load your WAV audio file into your PyTorch application.
- For the fine prompt, extract discrete representations (Bark uses these to learn about the voice). Make sure to .squeeze() the resulting codes.
- For the coarse prompt, take fine_prompt[:2, :] to get the coarse prompt from the fine prompt.
- For the semantics, load a HuBERT model without k-means. (I personally use the audiolm-pytorch implementation's HubertWithKmeans, but I edited it to skip k-means.)
- Next, to get the actual semantic tokens, run the HuBERT outputs through this model. The output will be compatible with Bark.
- Save these arrays in an npz file with numpy.savez(filepath, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse), where filepath is the output path of your choice. This is your speaker file containing your cloned voice.
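The last steps (deriving the coarse prompt and saving the speaker file) can be sketched with NumPy alone. This is a minimal sketch, not the code from the repo: the random arrays are placeholders standing in for real model outputs, their shapes and value ranges (8 fine codebooks with 1024 entries each, semantic token ids below 10,000) are assumptions about Bark's prompt layout, and build_speaker_file is a hypothetical helper name.

```python
import os
import tempfile

import numpy as np


def build_speaker_file(path, semantics, fine):
    """Save a speaker file from semantic tokens and fine codes (hypothetical helper)."""
    fine = np.asarray(fine)
    if fine.ndim != 2:
        # A leftover batch dimension usually means .squeeze() was skipped.
        raise ValueError("fine codes must be 2-D: (codebooks, frames)")
    # The coarse prompt is the first two codebooks of the fine prompt.
    coarse = fine[:2, :]
    np.savez(
        path,
        semantic_prompt=np.asarray(semantics),
        fine_prompt=fine,
        coarse_prompt=coarse,
    )
    return path


# Placeholder data standing in for real model outputs (assumed shapes and ranges).
rng = np.random.default_rng(0)
semantics = rng.integers(0, 10_000, size=500)
fine = rng.integers(0, 1024, size=(8, 750))

speaker_path = os.path.join(tempfile.mkdtemp(), "speaker.npz")
build_speaker_file(speaker_path, semantics, fine)

# Reload the file to confirm all three prompts were stored.
with np.load(speaker_path) as speaker:
    print({name: speaker[name].shape for name in speaker.files})
```

In real use, semantics would come from HuBERT plus this model, and fine from the discrete codes extracted from your audio.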
Voice masking
Voice masking is replacing a voice in an audio clip for speech-to-speech.
Random
Replacing a voice in an audio clip with a voice generated by Bark.
Process:
- Extract semantics from the audio clip using HuBERT and this model.
- Run semantic_to_waveform from bark.api with the extracted semantics.
- The previous step returns the generated audio.
Transfer
Replacing a voice with a voice from another audio clip.
Process:
- Create a speaker file using the steps under the voice cloning section.
- Extract the semantics from the clip with the text you want spoken.
- Run semantic_to_waveform from bark.api with the extracted semantics and the speaker prompt that you created in step 1.
- The previous step returns the generated audio.
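Both voice masking variants differ only in whether a speaker prompt is passed along with the semantics. The sketch below illustrates that under stated assumptions: mask_voice is a hypothetical wrapper, it assumes Bark's semantic_to_waveform takes the speaker prompt through a history_prompt keyword, and it assumes your installed Bark version accepts a path to an .npz speaker file there. The generate parameter exists only so the sketch can be tried without Bark installed.

```python
import numpy as np


def mask_voice(semantic_tokens, speaker_file=None, generate=None):
    """Turn extracted semantics into audio, optionally in a cloned voice.

    Hypothetical wrapper. `generate` defaults to bark.api.semantic_to_waveform
    and can be replaced by any callable with the same call shape.
    """
    if generate is None:
        # Imported lazily so the function can be defined without Bark installed.
        from bark.api import semantic_to_waveform as generate

    tokens = np.asarray(semantic_tokens)
    # speaker_file=None corresponds to the "Random" variant;
    # a speaker .npz path corresponds to the "Transfer" variant.
    # Assumption: the speaker prompt is passed as `history_prompt`.
    return generate(tokens, history_prompt=speaker_file)


# Usage (requires Bark and real semantics extracted with HuBERT and this model):
# audio = mask_voice(semantics)                 # Random
# audio = mask_voice(semantics, "speaker.npz")  # Transfer
```

Keeping the two variants behind one function makes it easy to compare a random Bark voice with a cloned one on the same semantics.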
Disclaimer
I am not responsible for any misuse of this model. I do not agree with cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so.