frappuccino committed on
Commit 6d5dbc5
1 Parent(s): aca3797

Update README.md

Files changed (1): README.md +102 -0
README.md CHANGED
---
license: mit
language:
- ru
pipeline_tag: text-to-speech
---

# VITS2 Text-to-Speech on Natasha Dataset

## Model Details

### Model Description

This model is an implementation of VITS2, a single-stage text-to-speech system, trained on the Natasha dataset for the Russian language. VITS2 improves upon the previous VITS model by addressing issues such as unnaturalness, computational efficiency, and dependence on phoneme conversion. The model leverages adversarial learning and architecture design for enhanced quality and efficiency.

- **Developed by:** Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim
- **Shared by:** frappuccino
- **Model type:** Text-to-Speech
- **Language(s) (NLP):** Russian
- **License:** MIT
- **Finetuned from model:** VITS pretrained model

### Model Sources

- **Repository:** [VITS2 PyTorch Implementation](https://github.com/p0p4k/vits2_pytorch)
- **Paper:** [VITS2 paper](https://arxiv.org/abs/2307.16430)

## Uses

### Direct Use

The model can be used to convert text into speech directly: given a Russian text input, it produces the corresponding audio output.

### Downstream Use

Potential downstream applications include voice assistants, audiobook generation, voiceovers for animations or videos, and any other application that requires Russian text-to-speech.

### Out-of-Scope Use

The model is trained specifically for Russian and might not produce satisfactory results for other languages.

## Bias, Risks, and Limitations

The performance and bias of the model are influenced by the Natasha dataset it was trained on. If the dataset lacks diversity in dialects, accents, or speaking styles, the generated speech may reflect those limitations.

### Recommendations

Users should evaluate the model's performance in their specific application context and be aware of potential biases or limitations.

## How to Get Started with the Model

To use the model, users can follow the guidelines and scripts provided in the [VITS2 PyTorch Implementation repository](https://github.com/p0p4k/vits2_pytorch).

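A minimal inference sketch is shown below. It assumes a local clone of the [vits2_pytorch](https://github.com/p0p4k/vits2_pytorch) repository and that the repository keeps the original VITS-style inference API (`utils.get_hparams_from_file`, `SynthesizerTrn`, `text_to_sequence`); the config path, checkpoint name, and sampling parameters are placeholders to be adapted to this model's actual files.

```python
# Minimal inference sketch, run from a local clone of
# https://github.com/p0p4k/vits2_pytorch. Assumes the repo keeps the
# original VITS-style modules; config/checkpoint paths are placeholders.
import torch

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/config.json")   # placeholder path

net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
)
net_g.eval()
utils.load_checkpoint("G_latest.pth", net_g, None)          # placeholder checkpoint

text = "Привет! Это проверка синтеза речи."                 # Russian input text
sequence = text_to_sequence(text, hps.data.text_cleaners)
if hps.data.add_blank:
    sequence = commons.intersperse(sequence, 0)

x = torch.LongTensor(sequence).unsqueeze(0)                 # [1, T_text]
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    # infer() returns (audio, attention, mask, ...); keep the waveform
    audio = net_g.infer(
        x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0
    )[0][0, 0]

# `audio` is a float waveform at hps.data.sampling_rate; save it with
# soundfile.write or scipy.io.wavfile.write.
```

If the checkpoint is multi-speaker, `infer` also accepts a speaker-id tensor (`sid`); check the config shipped with the checkpoint.
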
## Training Details

### Training Data

The model was trained on the Natasha dataset, a collection of Russian speech recordings.

### Training Procedure

#### Preprocessing

Text and audio preprocessing followed the steps described in the repository README.

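As an illustration only, the sketch below shows a typical VITS-style preprocessing pass: resampling clips to a single target rate and writing `wav_path|transcript` filelists. The 22050 Hz target, directory layout, and metadata format are assumptions, not confirmed settings of the Natasha training run; follow the repository README for the authoritative steps.

```python
# Hypothetical preprocessing sketch: resample clips to one sampling rate and
# build a VITS-style "wav_path|transcript" filelist. The 22050 Hz target and
# all paths below are assumptions, not confirmed settings of this model.
from pathlib import Path

import torchaudio
import torchaudio.functional as F

TARGET_SR = 22050                       # assumed target sampling rate
SRC_DIR = Path("natasha/wavs_raw")      # hypothetical input directory
DST_DIR = Path("natasha/wavs")          # resampled output directory
DST_DIR.mkdir(parents=True, exist_ok=True)
Path("filelists").mkdir(exist_ok=True)

def load_transcripts(path: Path) -> dict[str, str]:
    """Read 'utterance_id|text' lines into a dict (hypothetical metadata file)."""
    entries = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        utt_id, text = line.split("|", maxsplit=1)
        entries[utt_id] = text.strip()
    return entries

transcripts = load_transcripts(Path("natasha/metadata.csv"))

with open("filelists/natasha_train.txt", "w", encoding="utf-8") as filelist:
    for wav_path in sorted(SRC_DIR.glob("*.wav")):
        waveform, sr = torchaudio.load(wav_path)
        waveform = waveform.mean(dim=0, keepdim=True)           # force mono
        if sr != TARGET_SR:
            waveform = F.resample(waveform, sr, TARGET_SR)
        out_path = DST_DIR / wav_path.name
        torchaudio.save(str(out_path), waveform, TARGET_SR)
        text = transcripts.get(wav_path.stem)
        if text:
            filelist.write(f"{out_path}|{text}\n")
```
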
#### Training Hyperparameters

- **Training regime:** Not documented in this card; see the training configuration in the repository for learning rate, batch size, optimizer, and related settings.

#### Summary

The VITS2 model demonstrates improved performance over previous TTS models, offering more natural and efficient speech synthesis.

## Environmental Impact

No emissions estimate is provided. Training was performed on a single NVIDIA RTX 4090 (see Compute Infrastructure below).

## Technical Specifications

### Model Architecture and Objective

The VITS2 architecture comprises several improvements over the original VITS, including a speaker-conditioned text encoder, a mel-spectrogram posterior encoder, and transformer blocks in the normalizing flow.

### Compute Infrastructure

#### Hardware

Single NVIDIA RTX 4090

#### Software

- Python >= 3.11
- PyTorch 2.0.0

## Citation

**APA:**

Kong, J., Park, J., Kim, B., Kim, J., Kong, D., & Kim, S. (2023). VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. arXiv preprint arXiv:2307.16430.

## Model Card Contact

https://t.me/voicestuff

https://t.me/frappuccino