Warm-started from the "Chills" single-speaker male model (not available on HF at the time of writing), then trained for 25 (de facto 50) epochs. Batch size 16, learning rate (√2)e-3 (≈ 1.41e-3) for the first 15(?) epochs and (5√2)e-4 (≈ 7.07e-4) for the next 10.
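For anyone reproducing the schedule, the two learning rates can be written as a simple step function. This is only a sketch: the function name and the 0-based `epoch` argument are illustrative, and the switch point of 15 epochs is the uncertain figure given above, not a confirmed value.

```python
import math


def learning_rate(epoch, switch_epoch=15):
    """Step schedule from the notes above; epoch is 0-based.

    switch_epoch=15 is the approximate figure from the notes (assumption).
    """
    if epoch < switch_epoch:
        return math.sqrt(2) * 1e-3      # (√2)e-3
    return 5 * math.sqrt(2) * 1e-4      # (5√2)e-4, i.e. half the initial rate
```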
Dataset: NST Norwegian Speech Synthesis (CC0), augmented like so:
- Make a copy of the dataset.
- In the copy, join the two shortest clips with 100 ms of silence between them and replace both with the joined clip. Repeat until the shortest clip is at least 6 seconds long.
- Shuffle the original together with the copy.
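
The augmentation steps above could be sketched roughly as follows. This is not the script that was actually used. It assumes each clip is a `(samples, text)` pair with mono samples in a plain list, that the shorter clip goes first when joining, and that transcripts are joined with a single space; the function names, the `seed` parameter, and the sample rate passed in are all illustrative.

```python
import heapq
import itertools
import random


def join_shortest(clips, sample_rate, min_sec=6.0, gap_sec=0.1):
    """Repeatedly merge the two shortest clips until the shortest is >= min_sec.

    clips: list of (samples, text) pairs. Returns a new list; inputs are not mutated.
    """
    gap = [0.0] * round(gap_sec * sample_rate)   # 100 ms of silence
    tie = itertools.count()                      # tie-breaker so the heap never compares sample lists
    heap = [(len(s), next(tie), s, t) for s, t in clips]
    heapq.heapify(heap)
    min_len = min_sec * sample_rate
    while len(heap) >= 2 and heap[0][0] < min_len:
        _, _, s1, t1 = heapq.heappop(heap)       # shortest clip
        _, _, s2, t2 = heapq.heappop(heap)       # second-shortest clip
        joined = s1 + gap + s2
        # Joining order and space-separated transcripts are assumptions.
        heapq.heappush(heap, (len(joined), next(tie), joined, t1 + " " + t2))
    return [(s, t) for _, _, s, t in heap]


def augment(clips, sample_rate, seed=0):
    """Original clips plus the joined copy, shuffled together."""
    combined = list(clips) + join_shortest(clips, sample_rate)
    random.Random(seed).shuffle(combined)
    return combined
```

A min-heap keyed on clip length keeps each "find the two shortest" step cheap, which matters for a dataset with many short clips.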