# Hindi/English Text-to-Speech Pipeline

## Overview
This project delivers a bilingual text-to-speech (TTS) experience: it accepts English or Hinglish text, detects Hindi tokens, transliterates them to Devanagari, and renders speech with a fine-tuned XTTS voice-cloning model. The interactive Gradio UI defined in `inference.py` is the primary entry point for end users.
Key capabilities:
- Language identification & transliteration: `hing_bert_module` wraps a fine-tuned Hing-BERT token classifier, dictionary overrides, and Devanagari transliteration helpers (illustrated below).
- Speech synthesis: Coqui XTTS (fine-tuned checkpoint under `xtts_Hindi_FineTuned/`) generates audio from the processed text using reference speaker WAVs.
- User interface: a Gradio Blocks app exposes text input, language/voice choices, and advanced sampling controls, and returns the generated audio plus metadata.
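For example, a Hinglish input such as "Kal mandir mein Ramayan ki katha hogi at 6 pm" would ideally be rendered as "कल मंदिर में रामायण की कथा होगी at 6 pm" before synthesis. This pairing is purely illustrative, not output captured from the model; the actual tokenisation and spellings depend on the classifier and the dictionary overrides.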
## Repository layout

    ├── inference.py              # Gradio UI and XTTS generation pipeline
    ├── hing_bert_module/         # Token classifier, transliteration utilities, and assets
    │   ├── hing-bert-lid/        # Hugging Face model weights & tokenizer files (local)
    │   └── dictionary.txt        # Mythology/Sanskrit dictionary overrides
    ├── xtts_Hindi_FineTuned/     # Fine-tuned XTTS checkpoint and reference voices
    ├── imp_scripts/
    │   └── test_inference.py     # Console-driven TTS workflow (optional)
    ├── text_processor.py         # Standalone CLI for token tagging & transliteration (optional)
    ├── translitor.py             # Standalone CLI transliterator (optional)
    ├── requirements.txt          # Minimal dependency lock for runtime
    └── README.md                 # Project documentation (this file)
## Prerequisites
- Windows 10/11 (project paths are Windows-oriented, though code is portable).
- Python 3.10 (matching the fine-tuned environment used for XTTS + Hing-BERT).
- CUDA-capable GPU recommended for low-latency inference (CPU is supported but slower).
- Fine-tuned XTTS assets placed under `xtts_Hindi_FineTuned/` (includes `config.json`, checkpoints, and `speakers/Reference_*.wav`).
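Before launching the app, a quick sanity check along these lines can confirm the assets are where the pipeline expects them. This is an illustrative sketch; the exact file names inside the checkpoint directory depend on your fine-tuning run.

```python
from pathlib import Path

# Illustrative check that the fine-tuned XTTS assets listed above are present.
xtts_dir = Path("xtts_Hindi_FineTuned")

config_ok = (xtts_dir / "config.json").is_file()
ref_wavs = sorted((xtts_dir / "speakers").glob("Reference_*.wav"))

print("config.json found:", config_ok)
print("reference speaker WAVs found:", len(ref_wavs))
```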
## Quickstart

1. Create and activate a virtual environment:

       python -m venv xtts_env_win
       .\xtts_env_win\Scripts\Activate.ps1

2. Install dependencies:

       pip install --upgrade pip
       pip install -r requirements.txt

3. Launch the Gradio app:

       python inference.py

The UI will start at `http://0.0.0.0:7860` (Gradio also provides an optional public share URL). Enter text, choose a voice and language, tweak the advanced settings if needed, and click Generate Speech.
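The address and optional share link come from Gradio's standard launch options. A typical Blocks launch that would produce them looks roughly like this; the actual UI layout, callback, and launch arguments in `inference.py` may differ.

```python
import gradio as gr

# Minimal sketch of a Gradio Blocks app served on 0.0.0.0:7860 with a public share link.
# The callback here is a placeholder, not the project's real generation function.
def fake_generate(text):
    return None  # the real app returns a WAV plus metadata

with gr.Blocks() as demo:
    text_in = gr.Textbox(label="Text (English / Hinglish)")
    audio_out = gr.Audio(label="Generated speech")
    gr.Button("Generate Speech").click(fn=fake_generate, inputs=text_in, outputs=audio_out)

demo.launch(server_name="0.0.0.0", server_port=7860, share=True)
```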
## How it works

1. Text preprocessing (`hing_bert_module.process_text`)
   - Loads the Hing-BERT model from `hing_bert_module/hing-bert-lid/`.
   - Classifies tokens as Hindi or English and applies heuristics to boost Hindi detection.
   - Uses dictionary lookups plus a Hindi transliteration model to convert detected Hindi words into Devanagari.
   - Reconstructs the final text string for speech synthesis and logs outputs to `final_output.txt`.
2. Speech synthesis (`TTSGenerator` in `inference.py`)
   - Initializes Coqui XTTS with the supplied fine-tuned checkpoint and reference speakers.
   - Generates audio using parameters from the UI (temperature, top-k/top-p, speed).
   - Writes the audio to a temporary WAV file and reports processing stats.

See the sketch after this list for how the two stages fit together programmatically.
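This is a minimal sketch, not the project's exact code: it assumes `hing_bert_module.process_text(text)` returns the final Devanagari-converted string (the real import path and signature may differ) and drives synthesis through Coqui's documented XTTS inference API with the fine-tuned checkpoint and one of the bundled reference WAVs.

```python
from pathlib import Path

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Assumed entry point: the README names hing_bert_module.process_text, but the
# exact import path and return type are assumptions made for this sketch.
from hing_bert_module import process_text

# 1) Text preprocessing: tag Hindi tokens and transliterate them to Devanagari.
devanagari_text = process_text("Kal hum mandir jayenge for the evening aarti")

# 2) Speech synthesis with the fine-tuned checkpoint via Coqui's XTTS API.
config = XttsConfig()
config.load_json("xtts_Hindi_FineTuned/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_Hindi_FineTuned", use_deepspeed=False)
if torch.cuda.is_available():
    model.cuda()

# Condition on one of the bundled reference voices (speakers/Reference_*.wav).
ref_wav = sorted(Path("xtts_Hindi_FineTuned/speakers").glob("Reference_*.wav"))[0]
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[str(ref_wav)])

out = model.inference(
    devanagari_text,
    "hi",                # language code
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,     # the same sampling controls the UI exposes
    top_k=50,
    top_p=0.85,
    speed=1.0,
)
torchaudio.save("generated.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

For quick experiments without writing code, `imp_scripts/test_inference.py` already packages a similar console workflow.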
## Optional tools

- `imp_scripts/test_inference.py`: a menu-driven CLI for batch experimentation and audio preview without Gradio.
- `text_processor.py` / `translitor.py`: utility scripts for inspecting or debugging language detection and transliteration.
## Maintenance tips

- Keep `requirements.txt` in sync with the active environment (`pip freeze`, then prune to essentials as needed).
- Do not commit virtual environments (`xtts_env_win/`) or large checkpoints beyond repository policy.
- Periodically review `hing_bert_module/dictionary.txt` for custom transliteration entries.
## Troubleshooting

- Model load errors: ensure `xtts_Hindi_FineTuned/` contains the expected files and the paths referenced in `TTSGenerator.reference_voices`.
- Missing dependencies: rerun `pip install -r requirements.txt` and verify CUDA compatibility for the torch/torchaudio builds.
- Unicode output in terminals: the scripts configure Windows consoles for UTF-8; if characters still render incorrectly, set `PYTHONUTF8=1` or use a UTF-8 capable shell (see the diagnostic sketch below).
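When the last two issues are hard to pin down, a short check like the one below (illustrative only, using the standard library and torch) confirms whether the installed torch build can reach the GPU and whether the console can print Devanagari:

```python
import sys

import torch

# Report the installed torch build and whether it can reach a CUDA device.
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Force UTF-8 output on Windows consoles that default to a legacy code page;
# equivalent in spirit to launching Python with PYTHONUTF8=1.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
print("Devanagari test:", "नमस्ते")
```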