Hindi/English Text-to-Speech Pipeline

Overview

This project delivers a bilingual text-to-speech (TTS) experience that accepts English or Hinglish text, detects Hindi tokens, transliterates them to Devanagari, and renders speech with a fine-tuned, voice-cloned XTTS model. The interactive Gradio UI defined in inference.py is the primary entry point for end users.

Key capabilities:

  • Language identification & transliteration – hing_bert_module wraps a fine-tuned Hing-BERT token classifier, dictionary overrides, and Devanagari transliteration helpers.
  • Speech synthesis – Coqui XTTS (fine-tuned checkpoint under xtts_Hindi_FineTuned/) generates audio from the processed text using reference speaker WAVs.
  • User interface – Gradio Blocks app exposes text input, language/voice choices, advanced sampling controls, and returns generated audio plus metadata.

Repository layout

├── inference.py                # Gradio UI and XTTS generation pipeline
├── hing_bert_module/           # Token classifier, transliteration utilities, and assets
│   ├── hing-bert-lid/          # Hugging Face model weights & tokenizer files (local)
│   └── dictionary.txt          # Mythology/Sanskrit dictionary overrides
├── xtts_Hindi_FineTuned/       # Fine-tuned XTTS checkpoint and reference voices
├── imp_scripts/
│   └── test_inference.py       # Console-driven TTS workflow (optional)
├── text_processor.py           # Standalone CLI for token tagging & transliteration (optional)
├── translitor.py               # Standalone CLI transliterator (optional)
├── requirements.txt            # Minimal dependency lock for runtime
└── README.md                   # Project documentation (this file)

Prerequisites

  • Windows 10/11 (project paths are Windows-oriented, though code is portable).
  • Python 3.10 (matching the fine-tuned environment used for XTTS + Hing-BERT).
  • CUDA-capable GPU recommended for low-latency inference (CPU is supported but slower).
  • Fine-tuned XTTS assets placed under xtts_Hindi_FineTuned/ (includes config.json, checkpoints, and speakers/Reference_*.wav).
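Before launching, it can help to confirm the expected assets are actually in place. A minimal sketch (the path list mirrors the repository layout above; adjust it to your checkout):

```python
import pathlib

# Asset paths mirror the repository layout above; adjust if yours differ.
REQUIRED_ASSETS = [
    "xtts_Hindi_FineTuned/config.json",
    "hing_bert_module/hing-bert-lid",
    "hing_bert_module/dictionary.txt",
]

def check_assets(root="."):
    """Return the list of required assets missing under `root`."""
    base = pathlib.Path(root)
    return [p for p in REQUIRED_ASSETS if not (base / p).exists()]

if __name__ == "__main__":
    missing = check_assets()
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        print("All required assets found.")
```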

Quickstart

  1. Create & activate a virtual environment

    python -m venv xtts_env_win
    .\xtts_env_win\Scripts\Activate.ps1
    
  2. Install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
    
  3. Launch the Gradio app

    python inference.py
    

    The app binds to 0.0.0.0:7860; open http://localhost:7860 in a browser (Gradio can also provide an optional public share URL). Enter text, choose voice/language, tweak advanced settings, and click Generate Speech.

How it works

  1. Text preprocessing (hing_bert_module.process_text)

    • Loads the Hing-BERT model from hing_bert_module/hing-bert-lid/.
    • Classifies tokens as Hindi or English and applies heuristics to boost Hindi detection.
    • Uses dictionary lookups + Hindi transliteration model to convert detected Hindi words into Devanagari.
    • Reconstructs the final text string for speech synthesis and logs outputs to final_output.txt.
  2. Speech synthesis (TTSGenerator in inference.py)

    • Initializes Coqui XTTS with the supplied fine-tuned checkpoint and reference speakers.
    • Generates audio using parameters from the UI (temperature, top-k/p, speed).
    • Writes audio to a temp WAV file and reports processing stats.
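The preprocessing order in step 1 (dictionary override first, then transliteration, else pass-through) can be illustrated with a self-contained toy. The real pipeline uses the Hing-BERT token classifier and a transliteration model; the lookup tables and function names below are invented for illustration only:

```python
# Toy illustration of the preprocessing flow -- the real pipeline uses the
# Hing-BERT classifier in hing_bert_module; these tables are stand-ins.
OVERRIDES = {"ram": "राम", "sita": "सीता"}          # mythology/Sanskrit overrides
TRANSLIT = {"namaste": "नमस्ते", "dost": "दोस्त"}   # stand-in for the translit model

def preprocess(text):
    out = []
    for tok in text.split():
        key = tok.lower()
        if key in OVERRIDES:       # dictionary override wins first
            out.append(OVERRIDES[key])
        elif key in TRANSLIT:      # detected-Hindi token -> Devanagari
            out.append(TRANSLIT[key])
        else:                      # treated as English, passed through
            out.append(tok)
    return " ".join(out)

print(preprocess("namaste dost how are you"))  # नमस्ते दोस्त how are you
```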

Optional tools

  • imp_scripts/test_inference.py: menu-driven CLI for batch experimentation and audio preview without Gradio.
  • text_processor.py / translitor.py: utility scripts for inspection or debugging of language detection & transliteration.

Maintenance tips

  • Keep requirements.txt in sync with the active environment (pip freeze and prune to essentials as needed).
  • Do not commit virtual environments (xtts_env_win/) or large checkpoints beyond repository policy.
  • Periodically review hing_bert_module/dictionary.txt for custom transliteration entries.
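When reviewing dictionary overrides, a small lint pass can catch malformed entries. The actual format of hing_bert_module/dictionary.txt may differ; one romanized<TAB>Devanagari pair per line is an assumption for illustration:

```python
# Hypothetical lint pass for dictionary override entries. The file format
# ("romanized<TAB>devanagari" per line) is an assumption, not the documented one.
def lint_dictionary(lines):
    problems = []
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 2 or not all(parts):
            problems.append((n, "expected two tab-separated fields"))
        elif not any("\u0900" <= ch <= "\u097F" for ch in parts[1]):
            problems.append((n, "second field contains no Devanagari"))
    return problems

sample = ["ram\tराम", "broken-line", "sita\tsita"]
print(lint_dictionary(sample))
```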

Troubleshooting

  • Model load errors – ensure xtts_Hindi_FineTuned/ contains the expected files and paths referenced in TTSGenerator.reference_voices.
  • Missing dependencies – rerun pip install -r requirements.txt; verify CUDA compatibility for torch/torchaudio builds.
  • Unicode output in terminals – scripts handle Windows UTF-8 console settings; if characters still render incorrectly, set PYTHONUTF8=1 or use UTF-8 capable shells.
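The last two checks can be scripted. A minimal diagnostic sketch, assuming only the standard library plus an optional torch install (the helper names are invented):

```python
import sys

def ensure_utf8_stdout():
    # Runtime fallback when PYTHONUTF8=1 was not set before launch.
    if hasattr(sys.stdout, "reconfigure"):
        try:
            sys.stdout.reconfigure(encoding="utf-8")
        except Exception:
            pass  # e.g. stdout replaced by a non-file stream
    return getattr(sys.stdout, "encoding", None)

def cuda_available():
    # Guarded import: torch may be missing or built CPU-only.
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print("stdout encoding:", ensure_utf8_stdout())
print("CUDA available:", cuda_available())
```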