This model is a fine-tuned version of [MBZUAI/speecht5_tts_clartts_ar](https://huggingface.co/MBZUAI/speecht5_tts_clartts_ar) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3333

# Uses

## 🤗 Transformers Usage

You can run ArTST TTS locally with the 🤗 Transformers library.

1. First, install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile, and (optionally) datasets:

```
pip install --upgrade pip
pip install --upgrade transformers sentencepiece soundfile datasets[audio]
```

2. Run inference via the `"text-to-speech"` (TTS) pipeline. You can access the Arabic SpeechT5 model via the TTS pipeline in just a few lines of code!

```python
import torch
import soundfile as sf
from transformers import pipeline
from datasets import load_dataset

synthesiser = pipeline("text-to-speech", "Messam174/speecht5_finetuned_essam2_ar")

# Load an x-vector that encodes the speaker's voice characteristics
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.

speech = synthesiser("السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", forward_params={"speaker_embeddings": speaker_embedding})
# ArTST is trained without diacritics.

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```
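
If you are working in a notebook, you can also audition the result inline instead of opening the WAV file. A minimal sketch, assuming a Jupyter/IPython environment:

```python
# Optional: play the generated audio inline in a notebook.
from IPython.display import Audio

Audio(speech["audio"], rate=speech["sampling_rate"])
```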

3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the processor, model, and vocoder
processor = SpeechT5Processor.from_pretrained("Messam174/speecht5_finetuned_essam2_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("Messam174/speecht5_finetuned_essam2_ar").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# Prepare the text inputs
inputs = processor(
    text="السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", return_tensors="pt"
).to(device)

# Load an x-vector containing the speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0).to(device)

# Generate speech
with torch.no_grad():  # Disable gradient computation for inference
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the output as a 16 kHz mono WAV file
wav_file = "speech.wav"
sf.write(wav_file, speech.cpu().numpy(), samplerate=16000)
print(f"Speech saved to '{wav_file}'")
```
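
If you also need a compressed copy, here is a minimal sketch that converts the saved WAV to MP3 with pydub; this assumes pydub is installed and ffmpeg is available on your system:

```python
# Optional: convert the saved WAV to MP3 with pydub (requires ffmpeg).
from pydub import AudioSegment

AudioSegment.from_wav("speech.wav").export("speech.mp3", format="mp3")
```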

## Model description

More information needed