File size: 4,394 Bytes
c690509
 
 
 
cb6f154
d99f46e
 
4bbb864
 
d99f46e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3097fd9
179ba80
d99f46e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3097fd9
179ba80
d99f46e
2447dfa
d99f46e
 
 
 
 
1648f5c
d99f46e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
---
license: mit
language:
- ar
pipeline_tag: text-to-speech
---

# ArTST
SpeechT5 for Arabic (TTS task)

Here we use the pretained weights from ArTST and fine-tuned using huggingface implementation of SpeechT5 on Classical Arabic ClArTTS for speech synthesis (text-to-speech).

ArTST was first released in [this repository](https://github.com/mbzuai-nlp/ArTST ), [pretrained weights](https://huggingface.co/MBZUAI/ArTST/blob/main/pretrain_checkpoint.pt).

# Uses
## ๐Ÿค— Transformers Usage

You can run ArTST TTS locally with the ๐Ÿค— Transformers library.

1. First install the ๐Ÿค— [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile and datasets(optional):

```
pip install --upgrade pip
pip install --upgrade transformers sentencepiece datasets[audio]
```
2. Run inference via the `Text-to-Speech` (TTS) pipeline. You can access the Arabic SPeechT5 model via the TTS pipeline in just a few lines of code!

```python
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf

synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")

embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.

speech = synthesiser("ู„ุฃู†ู‡ ู„ุง ูŠุฑู‰ ุฃู†ู‡ ุนู„ู‰ ุงู„ุณูู‡ ุซู… ู…ู† ุจุนุฏ ุฐู„ูƒ ุญุฏูŠุซ ู…ู†ุชุดุฑ", forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```
3. Run inference via the Transformers modelling code - You can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf
from datasets import load_dataset

processor = SpeechT5Processor.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="ู„ุฃู†ู‡ ู„ุง ูŠุฑู‰ ุฃู†ู‡ ุนู„ู‰ ุงู„ุณูู‡ ุซู… ู…ู† ุจุนุฏ ุฐู„ูƒ ุญุฏูŠุซ ู…ู†ุชุดุฑ", return_tensors="pt")

# load xvector containing speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

sf.write("speech.wav", speech.numpy(), samplerate=16000)
```


# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@inproceedings{toyin-etal-2023-artst,
    title = "{A}r{TST}: {A}rabic Text and Speech Transformer",
    author = "Toyin, Hawau  and
      Djanibekov, Amirbek  and
      Kulkarni, Ajinkya  and
      Aldarmaki, Hanan",
    editor = "Sawaf, Hassan  and
      El-Beltagy, Samhaa  and
      Zaghouani, Wajdi  and
      Magdy, Walid  and
      Abdelali, Ahmed  and
      Tomeh, Nadi  and
      Abu Farha, Ibrahim  and
      Habash, Nizar  and
      Khalifa, Salam  and
      Keleg, Amr  and
      Haddad, Hatem  and
      Zitouni, Imed  and
      Mrini, Khalil  and
      Almatham, Rawan",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.5",
    pages = "41--51"
}
@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages={5723--5738},
}
```