|
--- |
|
library_name: transformers |
|
tags: |
|
- speech |
|
- text-to-speech |
|
datasets: |
|
- openslr/librispeech_asr |
|
- MushanW/GLOBE |
|
- MikhailT/hifi-tts |
|
--- |
|
|
|
# Model Card for Voicera
|
Voicera is an autoregressive (AR) text-to-speech model trained on ~1,000 hours of speech data.

Speech is converted to discrete tokens using the Multi-Scale Neural Audio Codec (SNAC) model.

**NB: This is not a SOTA model, and it is not accurate enough for production use cases.**
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
"Voicera" is a text-to-speech (TTS) model designed for generating speech from written text. |
|
It uses a GPT-2 type architecture, which helps in creating natural and expressive speech. |
|
The model converts audio into tokens using the "Multi-Scale Neural Audio Codec (SNAC)" model, allowing it to understand and produce speech sounds. |
|
Voicera aims to provide clear and understandable speech, focusing on natural pronunciation and intonation. |
|
It's a project to explore TTS technology and improve audio output quality. |
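
For a concrete picture of the tokenization step, here is a minimal sketch using the `snac` Python package, following the examples in the SNAC repository (the 24 kHz checkpoint and dummy waveform are illustrative; they are not taken from Voicera's training code):

```python
# Sketch of SNAC audio tokenization, following the snac package's README.
import torch
from snac import SNAC  # pip install snac

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
waveform = torch.randn(1, 1, 32000)  # (batch, channel, samples) dummy audio

with torch.inference_mode():
    codes = codec.encode(waveform)       # list of multi-scale code tensors
    reconstructed = codec.decode(codes)  # decode tokens back to a waveform
```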
|
|
|
|
|
|
|
- **Developed by:** Lwasinam Dilli

- **Funded by:** Lwasinam Dilli

- **Model type:** GPT-2-style Transformer architecture

- **License:** Free and open to use, I guess :)
|
|
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** [Github](https://github.com/Lwasinam/voicera) |
|
|
- **Demo:** [Demos](https://lwasinam.github.io/)
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
There are three models: the base model and two others fine-tuned on the Jenny and Expresso datasets.

The best of the three is currently the Jenny fine-tune.

Here are Colab links to all three, respectively (a minimal inference sketch follows the links):
|
1. [Base Model](https://colab.research.google.com/drive/10nPKliRs1C3ofv2J16_HGDlmzfd-yBtj#scrollTo=r17orAuZ45Q2) |
|
2. [Jenny-Finetune](https://colab.research.google.com/drive/1MSzGGqIhGYVCn76alsX9oBzwC4EtOQSR#scrollTo=Oz0DG-MtovBw) |
|
3. [Expresso-Finetune](https://colab.research.google.com/drive/1wzwSOtpT1CpEMvbcjvvgEKQZoQa5bX2p#scrollTo=YrBUwCNYmmUW&uniqifier=1) |
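
For orientation, the sketch below shows what inference roughly looks like. The checkpoint id, generation settings, and the token-unflattening step are assumptions, not taken from the repository; the Colab notebooks above contain the actual inference code.

```python
# Hypothetical inference sketch; checkpoint id and token layout are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

model_id = "Lwasinam/voicera"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Autoregressively generate audio tokens conditioned on the text prompt.
inputs = tokenizer("Hello there!", return_tensors="pt")
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024, do_sample=True)

# The flat generated sequence must be mapped back to SNAC's multi-scale code
# lists before decoding; that mapping is model-specific and omitted here.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
# codes = unflatten_to_snac_codes(generated)  # model-specific (hypothetical)
# waveform = codec.decode(codes)
```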
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
The training data consists of the clean subsets of the HiFi-TTS, LibriSpeech, LibriTTS, and GLOBE datasets.
|
|
|
### Training Procedure |
|
During training, audio tokens are generated by the SNAC model and concatenated with the text tokens, and the combined sequence is trained autoregressively.

Since we are only interested in the audio tokens, the loss on the text tokens is down-weighted by 0.1; a sketch of this weighting follows.
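
Here is a minimal sketch of that weighting, assuming "reduced by 0.1" means multiplying the text-token loss by 0.1 (the mask and function names are illustrative, not from the repository):

```python
# Sketch of a per-token weighted cross-entropy; `is_text_mask` marks which
# positions in the interleaved text+audio sequence are text tokens.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, is_text_mask, text_weight=0.1):
    # Per-token cross-entropy over the full interleaved sequence.
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    # Down-weight the positions that correspond to text tokens.
    weights = torch.ones_like(per_token)
    weights[is_text_mask.view(-1)] = text_weight
    return (per_token * weights).sum() / weights.sum()
```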
|
|
|
#### Preprocessing |
|
|
|
Hugging Face had pretty much all the datasets I needed; I just had to filter out audio clips longer than 10 seconds due to compute constraints.
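
With the Hugging Face `datasets` library, such a duration filter might look like the sketch below (the dataset id and split are chosen for illustration; the actual preprocessing lives in the repository):

```python
# Sketch of a <=10 s duration filter with Hugging Face `datasets`.
from datasets import load_dataset

ds = load_dataset("openslr/librispeech_asr", "clean", split="train.100")

def under_10s(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= 10.0

ds = ds.filter(under_10s)
```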
|
|
|
|
|
#### Training Hyperparameters |
|
- Weight decay: 0.1

- Batch size: 1, with gradient accumulation of 32 (effective batch size 32)

- Scheduler: CosineAnnealingWarmRestarts, with a minimum learning rate of 1e-7 and 500 steps per warm restart (a setup sketch follows this list)
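
A minimal PyTorch sketch of this setup is below. The optimizer choice (AdamW), the base learning rate (3e-4), and the placeholder model and loss are assumptions, since the card does not state them; the scheduler and accumulation settings match the list above.

```python
import torch

# Placeholder module standing in for the Voicera GPT-2-style network.
model = torch.nn.Linear(8, 8)

# AdamW and lr=3e-4 are assumed; weight_decay=0.1 matches the card.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=500, eta_min=1e-7  # 500-step warm restarts, min LR 1e-7
)

accum_steps = 32  # batch size 1 with gradient accumulation of 32
for step in range(1000):  # stand-in training loop
    x = torch.randn(1, 8)
    loss = model(x).pow(2).mean() / accum_steps  # placeholder loss
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```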
|
|
|
|
|
## Evaluation |
|
|
|
I should probably work on this; for now, the loss went down and the outputs got better :)
|
|
|
### Results |
|
Check out the demo page here -> [Demo](https://lwasinam.github.io/)
|
|
|
## Compute Infrastructure
|
|
|
|
|
|
|
- **Hardware Type:** Tesla P100 |
|
- **Hours used:** 300+
|
- **Cloud Provider:** Kaggle :) |
|
|
|
|
|
|
|
|
|
## Citation
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@software{Betker_TorToiSe_text-to-speech_2022,
  author  = {Betker, James},
  month   = apr,
  title   = {{TorToiSe text-to-speech}},
  url     = {https://github.com/neonbjb/tortoise-tts},
  version = {2.0},
  year    = {2022}
}

@software{Siuzdak_SNAC_Multi-Scale_Neural_2024,
  author = {Siuzdak, Hubert},
  month  = feb,
  title  = {{SNAC: Multi-Scale Neural Audio Codec}},
  url    = {https://github.com/hubertsiuzdak/snac},
  year   = {2024}
}
```
|
|
|
|
|
## Model Card Authors
|
|
|
Lwasinam Dilli |