|
--- |
|
library_name: transformers |
|
license: mit |
|
base_model: MBZUAI/speecht5_tts_clartts_ar |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: speecht5_finetuned_essam2_ar |
|
results: [] |
|
--- |
|
|
|
|
|
|
# speecht5_finetuned_essam2_ar |
|
|
|
This model is a fine-tuned version of [MBZUAI/speecht5_tts_clartts_ar](https://huggingface.co/MBZUAI/speecht5_tts_clartts_ar) on an unknown dataset. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.3333 |
|
## Uses
|
### 🤗 Transformers Usage
|
|
|
You can run this fine-tuned ArTST TTS model locally with the 🤗 Transformers library.
|
|
|
1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), `sentencepiece`, `soundfile`, and optionally `datasets`:
|
|
|
``` |
|
pip install --upgrade pip |
|
pip install --upgrade transformers sentencepiece soundfile datasets[audio]
|
``` |
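
Optionally, verify the installation with a quick import check (a minimal sketch):

```python
import transformers, datasets, soundfile

# Print the installed versions to confirm everything imports cleanly
print(transformers.__version__, datasets.__version__, soundfile.__version__)
```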
|
2. Run inference via the `"text-to-speech"` (TTS) pipeline. You can access the Arabic SpeechT5 model through the TTS pipeline in just a few lines of code!
|
|
|
```python
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch

synthesiser = pipeline("text-to-speech", "Messam174/speecht5_finetuned_essam2_ar")

embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)
# You can replace this embedding with your own as well.

# Input text: "Peace be upon you and God's mercy and blessings; may God welcome you all."
# ArTST is trained without diacritics.
speech = synthesiser("السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", forward_params={"speaker_embeddings": speaker_embedding})

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```
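
If you are working in a notebook, you can also play the result inline instead of writing a file; a minimal sketch, assuming IPython is available:

```python
from IPython.display import Audio

# Play the generated waveform at the pipeline's reported sampling rate
Audio(speech["audio"], rate=speech["sampling_rate"])
```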
|
3. Run inference via the Transformers modelling code. For more fine-grained control, you can use the processor together with `generate_speech` to convert text into a mono 16 kHz speech waveform.
|
|
|
```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load processor, model, and vocoder
processor = SpeechT5Processor.from_pretrained("Messam174/speecht5_finetuned_essam2_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("Messam174/speecht5_finetuned_essam2_ar").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# Prepare inputs (ArTST is trained without diacritics)
inputs = processor(
    text="السلام عليكم ورحمة الله وبركاته حياكم الله جميعا", return_tensors="pt"
).to(device)

# Load an x-vector containing the speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0).to(device)

# Generate speech
with torch.no_grad():  # disable gradient computation for inference
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# Save the output as a mono 16 kHz WAV file
wav_file = "speech.wav"
sf.write(wav_file, speech.cpu().numpy(), samplerate=16000)
print(f"Speech saved to '{wav_file}'")
```
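
Optionally, the WAV output can be converted to a compressed format with `pydub` (which the original script imported); a minimal sketch, assuming `ffmpeg` is installed on your system:

```python
from pydub import AudioSegment

# pydub shells out to ffmpeg to encode MP3
AudioSegment.from_wav("speech.wav").export("speech.mp3", format="mp3")
```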
|
## Model description |
|
|
|
More information needed |
|
|
|
## Intended uses & limitations |
|
|
|
More information needed |
|
|
|
## Training and evaluation data |
|
|
|
More information needed |
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a configuration sketch follows the list):
|
- learning_rate: 0.0001 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 2 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 8 |
|
- total_train_batch_size: 32 |
|
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 100 |
|
- training_steps: 500 |
|
- mixed_precision_training: Native AMP |
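
As a rough guide, the settings above map onto `Seq2SeqTrainingArguments` as in the sketch below (the output directory is a placeholder, and the dataset, data collator, and `Trainer` wiring are omitted):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: reproduces the hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_essam2_ar",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    seed=42,
    gradient_accumulation_steps=8,  # 4 x 8 = effective batch size of 32
    lr_scheduler_type="linear",
    warmup_steps=100,
    max_steps=500,
    fp16=True,  # mixed precision via native AMP
    eval_strategy="steps",
    eval_steps=100,
)
```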
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.3806        | 0.3742 | 100  | 0.3452          |
| 0.3873        | 0.7484 | 200  | 0.3487          |
| 0.3788        | 1.1225 | 300  | 0.3441          |
| 0.3676        | 1.4967 | 400  | 0.3380          |
| 0.3668        | 1.8709 | 500  | 0.3333          |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.46.3 |
|
- Pytorch 2.5.1+cu121 |
|
- Datasets 3.2.0 |
|
- Tokenizers 0.20.3 |
|
|