The voice sample used is the same as XTTS's. F5 has so far been unstable, sounding unemotional/monotone/depressed and mispronouncing words (_awestruck_).
If you have suggestions please give feedback in the following thread:
mrfakename/E2-F5-TTS#32
@Pendrokar I mean the sample itself is pretty unemotional, so a good voice cloning model would have to be unemotional as well.
The instability is an issue though, which even I don't like; it can be somewhat mitigated by the silence trimmer, but not fully.
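For anyone curious, here is a minimal sketch of the kind of trimming I mean, using librosa. The `top_db` threshold and file names are placeholders, not what the Arena actually runs:

```python
# Hypothetical sketch: trim leading/trailing silence from a reference clip.
# top_db=30 is an assumed threshold; the Arena's actual trimmer may differ.
import librosa
import soundfile as sf

audio, sr = librosa.load("reference.wav", sr=None)    # keep original sample rate
trimmed, _ = librosa.effects.trim(audio, top_db=30)   # drop quiet edges only
sf.write("reference_trimmed.wav", trimmed, sr)
```

Note this only removes silence at the edges; pauses in the middle of the utterance survive, which is why it can't fully fix the instability.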
True about the narration-style sample, but that still did not stop XTTS from surpassing F5. Both use the same sample.
This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS btw; Kokoro is equally guilty of this: the voice used in the Arena is also in-distribution.
It would not be surprising to me if voice cloning is simply "looking up" the most similar speaker, or interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks.
https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
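A rough way to probe that lookup hypothesis would be a nearest-neighbour check over speaker embeddings. Here is a sketch using Resemblyzer; the clip paths and the set of candidate training speakers are placeholders:

```python
# Hypothetical sketch: is the reference voice close to a known training speaker?
# Embed the reference clip and compare it against embeddings of candidate
# training-set speakers via cosine similarity.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
ref_embed = encoder.embed_utterance(preprocess_wav("arena_reference.wav"))

# Placeholder: one clip per suspected training speaker.
train_clips = {"speaker_a": "a.wav", "speaker_b": "b.wav"}
for name, path in train_clips.items():
    emb = encoder.embed_utterance(preprocess_wav(path))
    # Resemblyzer embeddings are L2-normalized, so the dot product is cosine similarity.
    print(name, float(np.dot(ref_embed, emb)))
```

A very high similarity to one training speaker would at least be consistent with the "lookup" story, though it wouldn't prove it.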
True, a sample from the original dataset would probably be best. My attempt to fetch one from the Emilia dataset was unsuccessful, as the HF dataset viewer only shows the German samples. Emilia's homepage does give an ASMR-y example prompt.
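Streaming the dataset directly might work around the viewer limitation. A sketch with the `datasets` library; the repo id and the `EN/*.tar` shard pattern are assumptions based on how Emilia appears on the Hub, so check the dataset card for the real layout:

```python
# Hypothetical sketch: stream a few English samples from Emilia instead of
# relying on the dataset viewer. Repo id and data-files pattern are assumed.
from datasets import load_dataset

ds = load_dataset(
    "amphion/Emilia-Dataset",
    data_files={"train": "EN/*.tar"},  # assumed language-specific shards
    split="train",
    streaming=True,  # avoid downloading the full corpus
)
for sample in ds.take(3):
    print(sample.keys())  # inspect fields; audio + transcript expected
```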