esnya committed
Commit 2ab9d6a • 1 Parent(s): d3913c7

Update README.md

Files changed (1)
  1. README.md +88 -1
README.md CHANGED
@@ -6,4 +6,91 @@ tags:
  - jvs
  - pyopenjtalk
  - speech-to-text
- ---
+ ---
+
+ # SpeechT5 (TTS task) for Japanese
+ SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on [JVS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus).
+ Trained from [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts).
+ The modified tokenizer is powered by [Open Jtalk](https://open-jtalk.sp.nitech.ac.jp/).
+
+ # Model description
+ See the [original model card](https://huggingface.co/microsoft/speecht5_tts#model-description).
+ My modified code is licensed under the MIT License.
+
+ # Usage
+ Install the requirements:
+ ```bash
+ pip install transformers sentencepiece pyopenjtalk torch soundfile accelerate  # or pyopenjtalk-prebuilt instead of pyopenjtalk
+ ```
+
+ Download the modified tokenizer code:
+ ```bash
+ curl -L -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
+ ```
+
+ (A text-to-speech pipeline is not released yet, so the model is called directly.)
+ ```py
+ import numpy as np
+ from transformers import (
+     SpeechT5ForTextToSpeech,
+     SpeechT5HifiGan,
+     SpeechT5FeatureExtractor,
+     SpeechT5Processor,
+ )
+ from speecht5_openjtalk_tokenizer import SpeechT5OpenjtalkTokenizer
+ import soundfile
+ import torch
+
+ model_name = "esnya/japanese_speecht5_tts"
+ with torch.no_grad():
+     # Load the fine-tuned acoustic model on the GPU in bfloat16.
+     model = SpeechT5ForTextToSpeech.from_pretrained(
+         model_name, device_map="cuda", torch_dtype=torch.bfloat16
+     )
+
+     # Build the processor from the Open Jtalk tokenizer and the feature extractor.
+     tokenizer = SpeechT5OpenjtalkTokenizer.from_pretrained(model_name)
+     feature_extractor = SpeechT5FeatureExtractor.from_pretrained(model_name)
+     processor = SpeechT5Processor(feature_extractor, tokenizer)
+     # The vocoder converts the generated spectrogram into a waveform.
+     vocoder = SpeechT5HifiGan.from_pretrained(
+         "microsoft/speecht5_hifigan", device_map="cuda", torch_dtype=torch.bfloat16
+     )
+
+     text = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
+     input_ids = processor(text=text, return_tensors="pt").input_ids.to(model.device)
+
+     speaker_embeddings = np.random.uniform(
+         -1, 1, (1, 16)
+     )  # (batch_size, speaker_embedding_dim = 16); the first element encodes male (-1.0) / female (1.0)
+     speaker_embeddings = torch.tensor(speaker_embeddings).to(
+         device=model.device, dtype=model.dtype
+     )
+
+     waveform = model.generate_speech(
+         input_ids,
+         speaker_embeddings,
+         vocoder=vocoder,
+     )
+
+     waveform = waveform / waveform.abs().max()  # normalize to peak amplitude 1.0
+     waveform = waveform.reshape(-1).cpu().float().numpy()
+
+     soundfile.write(
+         "output.wav",
+         waveform,
+         vocoder.config.sampling_rate,
+     )
+ ```
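The comment in the usage code says that the speaker embedding has 16 dimensions and that its first value encodes male (-1.0) or female (1.0). The following is a minimal sketch of building such a vector with an explicit gender value instead of a fully random one. `make_speaker_embedding` is a hypothetical helper, not part of the repository, and sampling the remaining values uniformly from [-1, 1] simply mirrors the usage code; what those values mean is not documented in the README.

```python
import random


def make_speaker_embedding(gender, dim=16, seed=None):
    """Return a (1, dim) nested list usable as a speaker embedding.

    gender: -1.0 (male) to 1.0 (female), following the README comment.
    The other dim - 1 values are sampled uniformly from [-1, 1] (assumption).
    """
    rng = random.Random(seed)
    values = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Clamp the gender value into the documented [-1, 1] range.
    values[0] = max(-1.0, min(1.0, float(gender)))
    return [values]  # leading batch dimension of size 1
```

The result can be converted with `torch.tensor(make_speaker_embedding(1.0)).to(device=model.device, dtype=model.dtype)` and passed to `generate_speech` in place of the random embedding.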
+
+ # Background
+
+ This model was developed because Japanese SpeechT5 TTS models were absent, or at best scarce. In addition, the g2p (grapheme-to-phoneme) functionality of Open Jtalk (pyopenjtalk) gives the model a vocabulary close to that of the English models. The modifications were applied mainly to the tokenizer. Unlike the default setup, the modified tokenizer separately extracts and retains the characters that are not part of the pronunciation, which makes the text-to-speech conversion more accurate.
+
+ # Limitations
+
+ One known issue is that when multiple sentences are fed to the model at once, the later parts of the output may contain extended silences. As a temporary workaround until this is fixed, split the text and generate each sentence individually.
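The workaround described in the Limitations section can be sketched as follows. This is an illustration under stated assumptions, not code from the repository: `split_sentences` and `synthesize_by_sentence` are hypothetical helpers, the set of sentence-ending punctuation is a guess, and `synthesize` stands for any callable that turns one sentence into a sequence of audio samples (for example, a wrapper around `model.generate_speech`).

```python
import re

# Sentence-ending punctuation (assumed set: Japanese and ASCII variants).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, keeping the trailing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def synthesize_by_sentence(text, synthesize):
    """Call synthesize(sentence) for each sentence and join the samples.

    synthesize is a hypothetical callable: str -> iterable of samples.
    """
    samples = []
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence))
    return samples
```

Generating sentence by sentence keeps each input short, which is what the workaround relies on; the joined samples can then be written to a single file.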
+
+ # License
+ The model inherits the license of the [JVS Corpus](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus).
+
+ # See also
+ - Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari, "JVS corpus: free Japanese multi-speaker voice corpus," arXiv preprint, 1908.06248, Aug. 2019.