---
base_model: unsloth/orpheus-3b-0.1-ft
model_type: llama
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- sanskrit
- audio-generation
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- fine-tuned
- devanagari
language:
- en
- sa
datasets:
- ai4bharat/Kathbath
metrics: null
widget:
- text: नमस्ते
  example_title: Greeting
- text: संस्कृत एक प्राचीन भाषा है।
  example_title: Ancient Language
- text: ॐ शान्ति शान्ति शान्तिः
  example_title: Peace Mantra
model-index:
- name: Sanskrit TTS Model
  results:
  - task:
      type: text-to-speech
      name: Text-to-Speech
    dataset:
      type: ai4bharat/Kathbath
      name: Kathbath
    metrics:
    - type: sota
      name: State-of-the-Art
      value: Achieved SOTA on Kathbath dataset
---

[Open in Colab](https://colab.research.google.com/github/rakshverma/SamskritaBharati/blob/main/Runing_In_Colab.ipynb)

# Sanskrit Text-to-Speech Model

## Model Overview

- **Model ID:** rverma0631/Sanskrit_TTS
- **Base Model:** unsloth/orpheus-3b-0.1-ft
- **License:** Apache 2.0
- **Languages:** Sanskrit (`sa`), English (`en`)
- **Primary Dataset:** ai4bharat/Kathbath

This model is a LLaMA-architecture language model fine-tuned from unsloth/orpheus-3b-0.1-ft for Sanskrit text-to-speech synthesis. It was trained with Unsloth and Hugging Face's TRL library for improved training efficiency.

## Performance Metrics

Our Sanskrit TTS model has achieved **state-of-the-art (SOTA)** performance on the **Kathbath dataset** developed by [AI4Bharat](https://ai4bharat.iitm.ac.in/), establishing new benchmarks for Sanskrit speech synthesis quality.

## Installation Requirements

### Environment Detection and Base Setup

```bash
# Environment detection
python3 -c "
import os
print('colab' if 'COLAB_' in ''.join(os.environ.keys()) else 'local')
"

# Install core dependencies
pip install snac
```

### Google Colab Installation

For Google Colab environments, execute the following installation sequence:

```bash
# Install Colab-specific dependencies
pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
pip install sentencepiece protobuf 'datasets>=3.4.1,<4.0.0' huggingface_hub hf_transfer
pip install --no-deps unsloth

# Environment cleanup (recommended for clean installation)
pip uninstall torch torchvision torchaudio unsloth unsloth_zoo transformers -y
pip cache purge

# Install PyTorch with CUDA 12.1 support
pip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# Install latest Unsloth from source
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Additional dependencies
pip install librosa
pip install -U datasets
pip install gradio  # needed by the demo interface below
```

## Implementation Guide

### Complete Implementation Code

```python
import gradio as gr
import torch
from unsloth import FastLanguageModel

# Global model variables
model = None
tokenizer = None
snac_model = None
device = None


def load_models():
    """Initialize and load all required models for Sanskrit TTS inference."""
    global model, tokenizer, snac_model, device

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading models on: {device}")

    # Load the fine-tuned Sanskrit TTS model
    model, tokenizer = FastLanguageModel.from_pretrained(
        "rverma0631/Sanskrit_TTS",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=False,
    )
    model = model.to(device)
    FastLanguageModel.for_inference(model)

    # Load SNAC model for audio generation.
    # Fail early if SNAC is missing instead of continuing with snac_model = None.
    try:
        from snac import SNAC
    except ImportError as exc:
        raise ImportError("SNAC is not installed. Run `pip install snac`.") from exc
    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    snac_model.to("cpu")  # audio decoding runs on the CPU

    print("Models loaded successfully!")


def redistribute_codes(code_list):
    """Redistribute generated codes into hierarchical layers for audio synthesis."""
    layer_1 = []
    layer_2 = []
    layer_3 = []
    # Each audio frame is 7 tokens: 1 for layer 1, 2 for layer 2, 4 for layer 3
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat


def sanskrit_tts_inference(sanskrit_text, chosen_voice=""):
    """
    Generate Sanskrit speech from input text using the fine-tuned model.

    Args:
        sanskrit_text (str): Input Sanskrit text in Devanagari script
        chosen_voice (str): Voice selection parameter (currently overridden
            by the hard-coded voice ID below)

    Returns:
        tuple: (audio_data, status_message)
    """
    if not sanskrit_text.strip():
        return None, "Please enter some Sanskrit text."

    try:
        prompts = [sanskrit_text]
        # The voice ID is hard-coded, so the chosen_voice argument is ignored
        chosen_voice = 1070

        # Prepare prompts with voice selection
        prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

        # Tokenize input prompts
        all_input_ids = []
        for prompt in prompts_:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            all_input_ids.append(input_ids)

        # Define special tokens
        start_token = torch.tensor([[128259]], dtype=torch.int64)
        end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

        # Construct modified input sequences
        all_modified_input_ids = []
        for input_ids in all_input_ids:
            modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
            all_modified_input_ids.append(modified_input_ids)

        # Apply left padding and create attention masks
        all_padded_tensors = []
        all_attention_masks = []
        max_length = max(ids.shape[1] for ids in all_modified_input_ids)

        for modified_input_ids in all_modified_input_ids:
            padding = max_length - modified_input_ids.shape[1]
            padded_tensor = torch.cat(
                [torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids],
                dim=1,
            )
            attention_mask = torch.cat(
                [
                    torch.zeros((1, padding), dtype=torch.int64),
                    torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64),
                ],
                dim=1,
            )
            all_padded_tensors.append(padded_tensor)
            all_attention_masks.append(attention_mask)

        # Batch tensors for inference
        all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
        all_attention_masks = torch.cat(all_attention_masks, dim=0)

        input_ids = all_padded_tensors.to(device)
        attention_mask = all_attention_masks.to(device)

        # Generate audio codes using the model
        generated_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=128258,
            use_cache=True,
        )

        # Post-process generated tokens: keep only what follows the last
        # occurrence of token 128257, then drop every 128258 (EOS) token
        token_to_find = 128257
        token_to_remove = 128258

        token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

        if len(token_indices[1]) > 0:
            last_occurrence_idx = token_indices[1][-1].item()
            cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
        else:
            cropped_tensor = generated_ids

        processed_rows = []
        for row in cropped_tensor:
            masked_row = row[row != token_to_remove]
            processed_rows.append(masked_row)

        # Convert tokens to audio codes
        code_lists = []
        for row in processed_rows:
            row_length = row.size(0)
            new_length = (row_length // 7) * 7  # keep whole 7-token frames only
            trimmed_row = row[:new_length]
            trimmed_row = [t - 128266 for t in trimmed_row]
            code_lists.append(trimmed_row)

        # Generate audio samples
        my_samples = []
        for code_list in code_lists:
            samples = redistribute_codes(code_list)
            my_samples.append(samples)

        if len(my_samples) > 0:
            audio_sample = my_samples[0].detach().squeeze().to("cpu").numpy()
            return (24000, audio_sample), f"✅ Generated audio for: {sanskrit_text}"
        else:
            return None, "❌ Failed to generate audio - no valid codes produced."

    except Exception as e:
        return None, f"❌ Error during inference: {str(e)}"


# Initialize models
print("Loading models... This may take a moment.")
load_models()

# Create Gradio interface
with gr.Blocks(title="Sanskrit Text-to-Speech") as demo:
    gr.Markdown("""
    # 🕉️ Sanskrit Text-to-Speech

    Enter Sanskrit text in Devanagari script and generate speech using your fine-tuned model.
    """)

    with gr.Row():
        with gr.Column():
            sanskrit_input = gr.Textbox(
                label="Sanskrit Text",
                placeholder="Enter Sanskrit text in Devanagari script...",
                lines=3,
                value="नमस्ते"
            )
            generate_btn = gr.Button("🎵 Generate Speech", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Sanskrit Speech",
                type="numpy"
            )
            status_output = gr.Textbox(
                label="Status",
                lines=2,
                interactive=False
            )

    # Example inputs for demonstration
    gr.Examples(
        examples=[
            ["नमस्ते"],
            ["संस्कृत एक प्राचीन भाषा है"],
            ["ॐ शान्ति शान्ति शान्तिः"],
            ["सर्वे भवन्तु सुखिनः"],
        ],
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output],
        fn=sanskrit_tts_inference,
        cache_examples=False
    )

    # Connect interface components
    generate_btn.click(
        fn=sanskrit_tts_inference,
        inputs=[sanskrit_input],
        outputs=[audio_output, status_output]
    )

# Launch the application
if __name__ == "__main__":
    demo.launch(
        share=True,
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
```

## 🔊 Demo Outputs

| Input text | Audio |
| --- | --- |
| 🔉 नमस्ते | |
| 📜 संस्कृत एक प्राचीन भाषा है | |
| 🕉️ ॐ शान्ति शान्ति शान्तिः | |
| 🌍 सर्वे भवन्तु सुखिनः | |

[Unsloth](https://github.com/unslothai/unsloth)

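
To keep a generated clip outside the Gradio interface, the audio returned by `sanskrit_tts_inference` can be written to a WAV file. The following is a minimal sketch using only the Python standard library. The helper name `save_wav` is hypothetical, and it assumes the samples are mono floats roughly in the range [-1.0, 1.0], which this card does not state explicitly.

```python
import struct
import wave

SAMPLE_RATE = 24000  # sample rate returned alongside the audio by the card's code


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM (hypothetical helper).

    Assumption: samples lie roughly in [-1.0, 1.0]; values outside are clipped.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With the models loaded, a call such as `audio, status = sanskrit_tts_inference("नमस्ते")` returns `audio` as a `(sample_rate, samples)` tuple on success, so `save_wav("namaste.wav", audio[1], audio[0])` would store the clip. On failure `audio` is `None`, so check it before saving.
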
## Technical Specifications
- **Model Type:** Fine-tuned Language Model for Text-to-Speech
- **Architecture:** LLaMA-based with LoRA adaptation
- **Audio Output:** 24kHz sampling rate
- **Maximum Sequence Length:** 2048 tokens
- **Supported Script:** Devanagari (Sanskrit)
- **Training Framework:** Unsloth + Hugging Face TRL
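
The `redistribute_codes` function in the implementation code maps each 7-token frame onto three SNAC layers, subtracting a per-slot offset of 4096. The sketch below restates that de-interleaving in plain Python so the slot-to-layer mapping is easier to see. The names `split_frames`, `CODEBOOK_SIZE`, and `FRAME_SIZE` are illustrative and not part of the model's code, and the sketch drops any incomplete trailing frame.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 4096  # per-slot offset used by redistribute_codes
FRAME_SIZE = 7        # tokens per audio frame


def split_frames(codes: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave flat codes into (layer_1, layer_2, layer_3)."""
    layer_1: List[int] = []
    layer_2: List[int] = []
    layer_3: List[int] = []
    # Slot position within a frame -> layer that receives the code
    slot_to_layer = [layer_1, layer_2, layer_3, layer_3, layer_2, layer_3, layer_3]

    usable = (len(codes) // FRAME_SIZE) * FRAME_SIZE  # ignore a partial last frame
    for index in range(usable):
        slot = index % FRAME_SIZE
        # Slot k was shifted by k * 4096 during generation; undo that shift
        slot_to_layer[slot].append(codes[index] - slot * CODEBOOK_SIZE)
    return layer_1, layer_2, layer_3


frame = [5, 4102, 8199, 12296, 16393, 20490, 24587]
print(split_frames(frame))  # ([5], [6, 9], [7, 8, 10, 11])
```

Each frame therefore contributes one code to layer 1, two to layer 2, and four to layer 3, which is the layout `snac_model.decode` receives in the implementation code.
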
## Usage Requirements
- **Hardware:** CUDA-compatible GPU
- **Dependencies:** PyTorch 2.4.1+, Transformers, Unsloth, SNAC audio codec, Gradio (for the demo interface)
- **Python Version:** 3.8+ (the minimum supported by PyTorch 2.4.1)
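
Before loading the model, it can help to confirm that the packages imported by the implementation code are installed. The following is a rough sketch rather than an official setup check: the function names are hypothetical, and the module list is taken from the imports in this card (`torch`, `transformers`, `unsloth`, `snac`, `gradio`).

```python
import importlib.util

# Modules imported directly or indirectly by the card's implementation code
REQUIRED_MODULES = ["torch", "transformers", "unsloth", "snac", "gradio"]


def check_environment(modules=REQUIRED_MODULES):
    """Return {module_name: True/False} without importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def cuda_available():
    """Report CUDA availability, or False when PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works without PyTorch
    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:15s} {'OK' if found else 'MISSING'}")
    print(f"{'CUDA':15s} {'available' if cuda_available() else 'not available'}")
```

Any module reported as `MISSING` can be installed with the commands in the installation section. The implementation code falls back to CPU when CUDA is not available.
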