Space: projecte-aina/matxa-alvocat-tts-ca

martillopartbsc committed on Apr 19

Commit 1c00c50 • 1 Parent(s): c5e2c91

Alex changes (#9)

- Alex changes (0f00fe24acfdc5543db161d6c2e048e876ce3ca3)

Co-authored-by: Martí Llopart Font <martillopartbsc@users.noreply.huggingface.co>

Files changed (1): about.md

@@ -207,13 +207,17 @@ Together, these technologies form a comprehensive TTS solution tailored to the n
 ## The model in detail
-**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS.
-On the one hand, the encoder part is based on a text encoder and a phoneme duration prediction. Together, they predict averaged acoustic features.
-On the other hand, the decoder has essentially a U-Net backbone inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf), which is based on the Transformer architecture.
-In the latter, by replacing 2D CNNs by 1D CNNs, a large reduction in memory consumption and fast synthesis is achieved.
-**Matcha-TTS** is a non-autoregressive model trained with optimal-transport conditional flow matching (OT-CFM).
-This yields an ODE-based decoder capable of generating high output quality in fewer synthesis steps than models trained using score matching.
 ## Adaptation to Catalan

 ## The model in detail
+**Matcha-TTS** is a non-autoregressive encoder-decoder model designed for fast acoustic modelling in TTS.
+The encoder processes input phoneme sequences and, together with a phoneme duration predictor, outputs averaged acoustic features.
+The decoder, which is essentially a U-Net backbone based on the Transformer architecture, predicts the refined spectrogram.
+The model is trained with optimal-transport conditional flow matching (OT-CFM).
+This yields an ODE-based decoder capable of generating high-quality output in fewer synthesis steps than models trained using score matching.
+**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features.
+Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain.
+Instead, it generates spectral coefficients, which allows rapid audio reconstruction through the inverse Fourier transform.
+The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.
+This version is tailored for the Catalan language, as it was trained only on Catalan speech datasets.
 ## Adaptation to Catalan
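
Review note on the added "The model in detail" text: the sentence about optimal-transport conditional flow matching and the ODE-based decoder may be easier to follow with a small numerical sketch. The code below is not taken from this Space or from the Matcha-TTS code base. It follows the general OT-CFM formulation, where a noise sample `x0` is moved along a straight path towards a data sample `x1` and the network regresses the constant velocity of that path. The function names, the `predict_velocity` callable (a stand-in for the U-Net decoder) and the value of `SIGMA_MIN` are all assumptions made for illustration.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; not specified in about.md


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point x_t on the straight OT path and its target velocity u_t."""
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # constant along the path
    return x_t, u_t


def ot_cfm_loss(predict_velocity, x1, rng, sigma_min=SIGMA_MIN):
    """Mean squared error between predicted and target velocity for one batch."""
    x0 = rng.standard_normal(x1.shape)                # noise sample
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)   # one time value per item
    t = rng.uniform(size=t_shape)
    x_t, u_t = ot_cfm_pair(x0, x1, t, sigma_min)
    v = predict_velocity(x_t, t)
    return float(np.mean((v - u_t) ** 2))


def euler_sample(predict_velocity, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    t_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    for i in range(n_steps):
        t = np.full(t_shape, i * dt)
        x = x + dt * predict_velocity(x, t)
    return x
```

Because the target velocity is constant along each straight path, a well-trained velocity predictor can be integrated with few Euler steps, which is the intuition behind the "fewer synthesis steps" claim in the added text.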