metadata
language: ko
tags:
  - text-to-speech
license: other

Torchaudio_Tacotron2_kss

A torchaudio Tacotron2 model trained on the KSS dataset.

License

  • code: MIT License
  • pytorch_model.bin weights: CC BY-NC-SA 4.0 (license of the kss dataset)

Requirements

pip install torch torchaudio transformers phonemizer

You also need to install espeak-ng.

If you are using Windows, you need to set additional environment variables. See: https://github.com/bootphon/phonemizer/issues/44
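As a rough sketch of the setup check described above, the snippet below looks for an espeak-ng executable on the PATH and, on Windows, points phonemizer at the espeak-ng shared library through the `PHONEMIZER_ESPEAK_LIBRARY` environment variable. The helper names and the DLL path are assumptions made for illustration; adjust the path to wherever espeak-ng is installed on your machine, and check the linked issue for the variables your phonemizer version expects.

```python
import os
import shutil


def espeak_on_path():
    """Return True if an espeak-ng (or espeak) executable is found on PATH."""
    return any(shutil.which(name) for name in ("espeak-ng", "espeak"))


def point_phonemizer_at_espeak(library_path):
    """Tell phonemizer where the espeak-ng shared library lives.

    Hypothetical helper: it only sets the environment variable and must be
    called before phonemizer is imported or used.
    """
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = library_path


if os.name == "nt":
    # Assumed default install location; not taken from this README.
    point_phonemizer_at_espeak(r"C:\Program Files\eSpeak NG\libespeak-ng.dll")

if not espeak_on_path():
    print("espeak-ng was not found on PATH; install it before using the tokenizer.")
```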

Usage

import torch
from transformers import AutoModel, AutoTokenizer

repo = "Bingsu/torchaudio_tacotron2_kss"
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    revision="589d6557e8b4bb347f49de74270541063ba9c2bc"
    )
tokenizer = AutoTokenizer.from_pretrained(repo)
model.eval()
# Build the MelGAN vocoder architecture without downloading the original checkpoint.
vocoder = torch.hub.load("seungwonpark/melgan:aca59909f6dd028ec808f987b154535a7ca3400c", "melgan", trust_repo=True, pretrained=False)
# Load the vocoder weights hosted in this repository instead.
url = "https://huggingface.co/Bingsu/torchaudio_tacotron2_kss/resolve/main/melgan.pt"
state_dict = torch.hub.load_state_dict_from_url(url)
vocoder.load_state_dict(state_dict)

The vocoder is the same as the original seungwonpark/melgan, but the original weights were saved on CUDA, so I host a copy of them in this repository and load it separately.

text = "반갑습니다 타코트론2입니다."
inp = tokenizer(text, return_tensors="pt", return_length=True, return_attention_mask=False)
with torch.inference_mode():
    out = model(**inp)
    audio = vocoder(out[0])
import IPython.display as ipd

ipd.Audio(audio[0].numpy(), rate=22050)
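Outside a notebook, the waveform can be written to disk instead of played inline. The following is a minimal sketch using only the Python standard library; `save_wav` is a hypothetical helper, and it assumes the vocoder output is a mono float waveform with values roughly in [-1.0, 1.0] at the 22050 Hz rate used above.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file.

    Assumes samples lie in [-1.0, 1.0]; values outside that range are clipped.
    """
    # Clip each sample, then scale it to the signed 16-bit integer range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample (16-bit)
        f.setframerate(sample_rate)
        # WAV stores PCM data as little-endian integers.
        f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```

With the tensors from the usage example, a call could look like `save_wav("output.wav", audio[0].flatten().tolist())`.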