Sinhala Language Model Research - SmolLM2 Fine-tuning Attempt

⚠️ EXPERIMENTAL MODEL - NOT FOR PRODUCTION USE

Model Description

  • Base Model: HuggingFaceTB/SmolLM2-1.7B
  • Fine-tuning Method: QLoRA (4-bit quantization with LoRA)
  • Target Language: Sinhala (සිංහල)
  • Status: Research prototype with significant limitations

Research Context

This model represents an undergraduate research attempt to adapt SmolLM2-1.7B for Sinhala language generation. It is part of the thesis "Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability" (NSBM Green University, 2025).

Training Details

Dataset

  • Size: 427,000 raw examples → 406,532 after cleaning
  • Sources:
    • YouTube comments (32%)
    • Web scraped content (35%)
    • Translated instructions (23%)
    • Curated texts (10%)
  • Data Quality: Mixed (social media, news, translated content)
  • Processing: Custom cleaning pipeline removing URLs, emails, duplicates
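
The cleaning pipeline itself is not reproduced in this card; the sketch below illustrates the steps described above (URL and e-mail removal followed by exact-duplicate removal) in plain Python. The regular expressions and names are illustrative placeholders, not the code used to produce the 406,532-example dataset.

# Illustrative cleaning sketch; patterns and names are placeholders.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def clean_example(text: str) -> str:
    """Strip URLs and e-mail addresses, then normalise whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    return " ".join(text.split())

def deduplicate(texts):
    """Drop empty strings and exact duplicates, preserving order."""
    seen, kept = set(), []
    for raw in texts:
        cleaned = clean_example(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept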

Training Configuration

  • Hardware: NVIDIA RTX 4090 (24GB VRAM) via Vast.ai
  • Training Time: 48 hours
  • Compute Cost: $19.20 for GPU rental (budget-constrained research)
  • Framework: Unsloth for memory efficiency
  • LoRA Parameters:
    • Rank (r): 16
    • Alpha: 16
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • Trainable parameters: 8.4M of 1.7B (a 99.5% reduction in trainable weights)

Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 8 (gradient accumulation: 1)
  • Max sequence length: 2048 (reduced to 512 for memory)
  • Mixed precision: FP16
  • Optimizer: adamw_8bit
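
The original Unsloth training script is not included here; the sketch below approximates the same QLoRA configuration with the standard Hugging Face PEFT and bitsandbytes APIs. Values not listed above (the "nf4" quantization type, the output directory, and the "adamw_bnb_8bit" spelling of the 8-bit AdamW optimizer) are assumptions.

# Approximate QLoRA setup matching the listed configuration.
# Uses PEFT/bitsandbytes rather than the original Unsloth script;
# nf4 quantization and output_dir are assumptions.
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="smollm2-sinhala-qlora",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    fp16=True,
    optim="adamw_bnb_8bit",  # 8-bit AdamW, listed above as adamw_8bit
)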

Evaluation Results

Quantitative Metrics

  • Perplexity: 218,443 (target was <50) ❌
  • BLEU Score: 0.0000 ❌
  • Training Loss: 1.847 (converged)
  • Task Completion Rate:
    • General conversation: 0%
    • Mathematics: 100% (but output corrupted)
    • Cultural context: 0%
    • Instruction following: 33%
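
For reference, the perplexity figure above follows the usual loss-exponentiation recipe for causal language models. The snippet below is a generic sketch of that computation; the sample sentence and the "path/to/model" checkpoint path are placeholders, not the thesis evaluation set.

# Generic perplexity sketch; the text and model path are placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/model")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
model.eval()

texts = ["ශ්‍රී ලංකාව ලස්සන රටකි."]  # "Sri Lanka is a beautiful country."
losses = []
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        losses.append(model(**enc, labels=enc["input_ids"]).loss.item())

print("perplexity:", math.exp(sum(losses) / len(losses)))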

Critical Issues Discovered

⚠️ Tokenizer Incompatibility: The model exhibits catastrophic tokenizer-model mismatch, generating English vocabulary tokens ("Drum", "Chiefs", "RESP") instead of Sinhala text. This represents a fundamental architectural incompatibility between SmolLM2's tokenizer and Sinhala script.
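
A quick way to observe the mismatch is to tokenize Sinhala text with the unmodified SmolLM2 tokenizer: the script falls outside the learned vocabulary and fragments into long runs of byte-level pieces. The snippet below inspects only the tokenizer and assumes nothing beyond the base checkpoint.

# Inspect how the base SmolLM2 tokenizer splits Sinhala script;
# expect many byte-level fragments rather than word-like tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
tokens = tokenizer.tokenize("ඔබේ නම කුමක්ද?")  # "What is your name?"
print(len(tokens), tokens)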

Sample Outputs (Showing Failure Pattern)

Input: "ඔබේ නම කුමක්ද?"
Expected: "මගේ නම [name] වේ"
Actual: "Drum Chiefs RESP frontend(direction..."

Research Contributions

Despite technical failure, this research provides:

  1. Dataset: 427,000 raw / 406,532 cleaned Sinhala examples (largest publicly available)
  2. Pipeline: Reproducible training framework for low-resource languages
  3. Discovery: Documentation of critical tokenizer challenges for non-Latin scripts
  4. Methodology: Budget-conscious approach (roughly $30 total, including $19.20 of GPU rental) for LLM research

Limitations & Warnings

  • Does NOT generate coherent Sinhala text
  • Tokenizer fundamentally incompatible with Sinhala
  • Not suitable for any production use
  • Useful only as a research artifact and as documentation of a negative result

Intended Use

This model is shared for:

  • Academic transparency and reproducibility
  • Documentation of challenges in low-resource language AI
  • Foundation for future research improvements
  • Example of tokenizer-model compatibility issues

Recommendations for Future Work

  1. Use multilingual base models (mT5, XLM-R, BLOOM)
  2. Develop a Sinhala-specific tokenizer (see the sketch after this list)
  3. Increase dataset to 1M+ examples
  4. Consider character-level or byte-level models
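
Recommendation 2 can be prototyped with the Hugging Face fast-tokenizer training API, as sketched below: the base tokenizer's algorithm is retrained on a Sinhala corpus. The corpus file name and vocabulary size are assumptions, and swapping in a new tokenizer would still require resizing and retraining the model's embeddings.

# Train a Sinhala-specific tokenizer from a plain-text corpus.
# "sinhala_corpus.txt" and vocab_size=32000 are illustrative assumptions.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

def corpus_batches(path="sinhala_corpus.txt", batch_size=1000):
    """Yield batches of lines from the corpus file."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

sinhala_tokenizer = base.train_new_from_iterator(corpus_batches(), vocab_size=32000)
sinhala_tokenizer.save_pretrained("sinhala-tokenizer")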

How to Reproduce Issues

# This will demonstrate the tokenizer problem
from transformers import AutoTokenizer, AutoModelForCausalLM

# "path/to/model" is a placeholder for the local fine-tuned checkpoint
model = AutoModelForCausalLM.from_pretrained("path/to/model")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

input_text = "ශ්‍රී ලංකාව"  # "Sri Lanka"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output will be gibberish English tokens instead of Sinhala text

Citation

@thesis{dharmasiri2025sinhala,
  title={Developing a Fluent Sinhala Language Model: Enhancing AI's Cultural and Linguistic Adaptability},
  author={Dharmasiri, H.M.A.H.},
  year={2025},
  school={NSBM Green University},
  note={Undergraduate thesis documenting challenges in low-resource language AI}
}

Ethical Considerations

  • Model outputs are not reliable for Sinhala generation
  • Should not be used for any decision-making
  • Shared for research transparency only

License

MIT License - for research and educational purposes
