---
language: sw
license: apache-2.0
tags:
  - tensorflowtts
  - audio
  - text-to-speech
  - mel-to-wav
inference: false
datasets:
  - bookbot/sw-TZ-Victoria
  - bookbot/sw-TZ-Victoria-syllables-word
  - bookbot/sw-TZ-Victoria-v2
  - bookbot/sw-TZ-VictoriaNeural-upsampled-48kHz
---

# MB-MelGAN HiFi PostNets SW v4

MB-MelGAN HiFi PostNets SW v4 is a mel-to-wav model based on the [MB-MelGAN](https://arxiv.org/abs/2005.05106) architecture with a [HiFi-GAN](https://arxiv.org/abs/2010.05646) discriminator. This model was trained from scratch on real and synthetic audio datasets. Instead of training on ground truth waveform spectrograms, this model was trained on the generated PostNet spectrograms of [LightSpeech MFA SW v4](https://huggingface.co/bookbot/lightspeech-mfa-sw-v4). The list of speakers includes:

- sw-TZ-Victoria
- sw-TZ-Victoria-syllables-word
- sw-TZ-Victoria-v2
- sw-TZ-VictoriaNeural-upsampled-48kHz

This model was trained using the [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) framework. All training was done on an RTX 4090 GPU. All scripts used for training can be found in this [GitHub fork](https://github.com/bookbot-hive/TensorFlowTTS), and the [training metrics](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/tensorboard) were logged via TensorBoard.

## Model

| Model                           | Config                                                                                    | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
| ------------------------------- | ----------------------------------------------------------------------------------------- | ------- | -------------- | -------------------- | ------ |
| `mb-melgan-hifi-postnets-sw-v4` | [Link](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v4/blob/main/config.yml) | 44.1K   | 20-11025       | 2048 / 512 / None    | 1M     |

## Training Procedure

    # Feature Extraction Setting
    sampling_rate: 44100
    hop_size: 512 # Hop size.
    format: "npy"

    # Generator Network Architecture Setting
    model_type: "multiband_melgan_generator"

    multiband_melgan_generator_params:
      out_channels: 4 # Number of output channels (number of subbands).
      kernel_size: 7 # Kernel size of initial and final conv layers.
      filters: 384 # Initial number of channels for conv layers.
      upsample_scales: [8, 4, 4] # List of upsampling scales.
      stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
      stacks: 4 # Number of stacks in a single residual stack module.
      is_weight_norm: false # Use weight-norm or not.
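
The generator settings above have to agree with `hop_size` from the feature extraction settings: each mel frame must expand into exactly one hop of waveform samples. In MB-MelGAN the generator upsamples each subband by the product of `upsample_scales`, and merging the subbands with PQMF synthesis multiplies the length again by the number of subbands. The check below is a minimal sketch of that arithmetic; the function name `total_upsampling` is hypothetical and not part of TensorFlowTTS.

```python
import math


def total_upsampling(upsample_scales, num_subbands):
    """Samples of full-band audio produced per input mel frame.

    Assumes the generator upsamples every subband by the product of
    `upsample_scales` and that subband merging multiplies the length
    by `num_subbands`.
    """
    return math.prod(upsample_scales) * num_subbands


hop_size = 512             # from the feature extraction settings
upsample_scales = [8, 4, 4]
out_channels = 4           # number of subbands

factor = total_upsampling(upsample_scales, out_channels)
print(factor)              # 512
assert factor == hop_size, "generator upsampling must match hop_size"
```

If `upsample_scales` or `out_channels` is changed, `hop_size` has to change with it, otherwise the generated waveform will not line up with the mel frames.
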

    # Discriminator Network Architecture Setting
    multiband_melgan_discriminator_params:
      out_channels: 1 # Number of output channels.
      scales: 3 # Number of multi-scales.
      downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
      downsample_pooling_params: # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
      kernel_sizes: [5, 3] # List of kernel size.
      filters: 16 # Number of channels of the initial conv layer.
      max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
      downsample_scales: [4, 4, 4] # List of downsampling scales.
      nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
      nonlinear_activation_params: # Parameters of nonlinear activation function.
        alpha: 0.2
      is_weight_norm: false # Use weight-norm or not.

    hifigan_discriminator_params:
      out_channels: 1 # Number of output channels (number of subbands).
      period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
      n_layers: 5 # Number of layers of each period discriminator.
      kernel_size: 5 # Kernel size.
      strides: 3 # Strides.
      filters: 8 # Input conv filters of each period discriminator.
      filter_scales: 4 # Filter scales.
      max_filters: 512 # Maximum filters of period discriminator's conv.
      is_weight_norm: false # Use weight-norm or not.
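
To make `period_scales` more concrete: a HiFi-GAN period discriminator views the 1-D waveform as a 2-D grid whose width is the period, so that samples one period apart end up in the same column. The sketch below only computes the resulting grid shapes for one training segment. It assumes the waveform is padded up to the next multiple of the period before reshaping, as the reference HiFi-GAN implementation does, and that the discriminator sees full-band segments of `batch_max_steps` samples. The helper `period_view_shape` is hypothetical and not a TensorFlowTTS function.

```python
def period_view_shape(num_samples, period):
    """Return (padding, (rows, period)) for the 2-D view of a waveform.

    Assumes the waveform is padded at the end so that its length
    becomes a multiple of `period`.
    """
    padding = (-num_samples) % period
    rows = (num_samples + padding) // period
    return padding, (rows, period)


period_scales = [3, 5, 7, 11, 17, 23, 37]
segment_length = 8192  # batch_max_steps from the data loader settings

for period in period_scales:
    padding, shape = period_view_shape(segment_length, period)
    print(f"period={period:2d} padding={padding:2d} shape={shape}")
```
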

    # STFT Loss Setting
    stft_loss_params:
      fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
      frame_steps: [120, 240, 50] # List of hop size for STFT-based loss.
      frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.

    subband_stft_loss_params:
      fft_lengths: [384, 683, 171] # List of FFT size for STFT-based loss.
      frame_steps: [30, 60, 10] # List of hop size for STFT-based loss.
      frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
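
Each entry in the STFT loss settings describes one resolution: the i-th `fft_lengths`, `frame_steps`, and `frame_lengths` values belong together. The sketch below shows what spectrogram shape each resolution yields for one training segment. It assumes an STFT without end padding (the default of `tf.signal.stft`), a full-band segment of 8192 samples, and subband segments of 8192 / 4 = 2048 samples; whether the training code pads the signal is not stated in this card. The helper `stft_shape` is hypothetical.

```python
def stft_shape(num_samples, fft_length, frame_step, frame_length):
    """Return (frames, frequency_bins) of a real STFT without end padding."""
    frames = 1 + (num_samples - frame_length) // frame_step
    bins = fft_length // 2 + 1
    return frames, bins


full_band = {
    "fft_lengths": [1024, 2048, 512],
    "frame_steps": [120, 240, 50],
    "frame_lengths": [600, 1200, 240],
}
subband = {
    "fft_lengths": [384, 683, 171],
    "frame_steps": [30, 60, 10],
    "frame_lengths": [150, 300, 60],
}

for name, params, num_samples in [("full", full_band, 8192), ("sub", subband, 2048)]:
    resolutions = zip(
        params["fft_lengths"], params["frame_steps"], params["frame_lengths"]
    )
    for fft_length, frame_step, frame_length in resolutions:
        shape = stft_shape(num_samples, fft_length, frame_step, frame_length)
        print(name, fft_length, frame_step, frame_length, shape)
```

The subband window lengths are exactly one quarter of the full-band ones, which matches the four-fold shorter subband signals.
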

    # Adversarial Loss Setting
    lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss.
    lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.

    # Data Loader Setting
    batch_size: 32 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
    eval_batch_size: 16
    batch_max_steps: 8192 # Length of each audio in batch for training. Must be divisible by hop_size.
    batch_max_steps_valid: 8192 # Length of each audio for validation. Must be divisible by hop_size.
    remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
    allow_cache: true # Whether to allow cache in dataset. If true, it requires CPU memory.
    is_shuffle: true # Shuffle dataset after each epoch.
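
The comments on `batch_max_steps` require the segment length to be divisible by `hop_size`, so that every audio segment pairs with a whole number of mel frames. A minimal sketch of that check, which also reports how long one segment is in seconds; `segment_frames` is a hypothetical helper, not part of TensorFlowTTS.

```python
def segment_frames(batch_max_steps, hop_size):
    """Number of mel frames that correspond to one audio segment."""
    if batch_max_steps % hop_size != 0:
        raise ValueError("batch_max_steps must be divisible by hop_size")
    return batch_max_steps // hop_size


sampling_rate = 44100
hop_size = 512
batch_max_steps = 8192

frames = segment_frames(batch_max_steps, hop_size)
duration = batch_max_steps / sampling_rate

print(frames)              # 16
print(round(duration, 4))  # 0.1858
```

With these settings each training example is 16 mel frames, or roughly 0.19 seconds of audio.
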

    # Optimizer & Scheduler Setting
    generator_optimizer_params:
      lr_fn: "PiecewiseConstantDecay"
      lr_params:
        boundaries: [100000, 150000, 400000, 500000, 600000, 700000]
        values: [0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
      amsgrad: false

    discriminator_optimizer_params:
      lr_fn: "PiecewiseConstantDecay"
      lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
      amsgrad: false

    gradient_accumulation_steps: 1
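
To read the learning-rate schedules, note that `values` has one more entry than `boundaries`. The sketch below reproduces the documented rule of `tf.keras.optimizers.schedules.PiecewiseConstantDecay` in plain Python, so it runs without TensorFlow: the first value applies while `step <= boundaries[0]`, and the last value applies once `step > boundaries[-1]`. It assumes `lr_fn` refers to the Keras schedule of the same name; `lr_at` is a hypothetical helper.

```python
from bisect import bisect_left

GENERATOR_LR = {
    "boundaries": [100000, 150000, 400000, 500000, 600000, 700000],
    "values": [0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001],
}
DISCRIMINATOR_LR = {
    "boundaries": [100000, 200000, 300000, 400000, 500000],
    "values": [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001],
}


def lr_at(step, boundaries, values):
    """Learning rate at `step` under a piecewise-constant schedule."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more entry than boundaries")
    # bisect_left counts the boundaries strictly below `step`, so a step
    # that equals a boundary still uses the earlier value.
    return values[bisect_left(boundaries, step)]


print(lr_at(100000, **GENERATOR_LR))       # 0.0005
print(lr_at(100001, **GENERATOR_LR))       # 0.00025
print(lr_at(1000000, **DISCRIMINATOR_LR))  # 1e-06
```
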

    # Interval Setting
    discriminator_train_start_steps: 200000 # Step at which discriminator training begins.
    train_max_steps: 1000000 # Number of training steps.
    save_interval_steps: 20000 # Interval steps to save checkpoint.
    eval_interval_steps: 5000 # Interval steps to evaluate the network.
    log_interval_steps: 200 # Interval steps to record the training log.

    # Other Setting
    num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.

## How to Use

```py
import soundfile as sf
import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel, AutoProcessor

# Acoustic model (text to mel), its text processor, and this vocoder (mel to wav).
lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v4")

text, speaker_name = "Hello World.", "sw-TZ-Victoria"
input_ids = processor.text_to_sequence(text)

# Generate the mel-spectrogram with LightSpeech.
mel, _, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Convert the mel-spectrogram to a waveform and drop the batch and channel axes.
audio = mb_melgan.inference(mel)[0, :, 0]

# Write 16-bit PCM audio at the model's 44.1 kHz sampling rate.
sf.write("./audio.wav", audio, 44100, "PCM_16")
```

## Disclaimer

Do consider the biases from the pre-training datasets that may be carried over into the results of this model.

## Authors

MB-MelGAN HiFi PostNets SW v4 was trained and evaluated by [David Samuel Setiawan](https://davidsamuell.github.io/) and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on local machines.

## Framework versions

- TensorFlowTTS 1.8
- TensorFlow 2.12.0