ViLegalBERT

Model Description

ViLegalBERT is a Vietnamese legal language model developed through continual pretraining of vinai/phobert-base-v2 on an extensive Vietnamese legal corpus. This model is specifically designed for Vietnamese legal text understanding and representation, offering enhanced performance in legal domain tasks while maintaining the robust foundational capabilities of the PhoBERT architecture.

⚠️ Important Notice: This is a base model that requires fine-tuning for optimal performance on downstream tasks. We strongly recommend applying task-specific supervised fine-tuning before production use.

Model Details

Architecture & Specifications

  • Base Model: vinai/phobert-base-v2
  • Model Type: Masked Language Model (RoBERTa-based)
  • Parameters: 135M total parameters
  • Architecture: BERT-base with RoBERTa optimizations
  • Attention: Multi-head attention with 12 attention heads
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Number of Layers: 12
  • Context Length: 256 tokens (training sequence length)
  • Vocabulary Size: 64,000 subword tokens
  • License: CC BY-NC-SA 4.0
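
As a quick sanity check, these values can be read from the released configuration; a minimal sketch (the attribute names are the standard Hugging Face RoBERTa config fields):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ntphuc149/ViLegalBERT")
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
print(config.intermediate_size)    # expected: 3072
print(config.vocab_size)           # expected: 64000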

Training Details

  • Training Method: Continual Pretraining with Masked Language Modeling
  • Training Data: 17GB Vietnamese legal corpus (same as ntphuc149/ViLegalQwen2.5-1.5B-Base)
  • Training Objective: Masked Language Modeling (MLM)
  • MLM Probability: 0.15
  • Training Steps: 500,000 steps
  • Text Preprocessing: Vietnamese word segmentation using pyvi/ViTokenizer before BPE tokenization
  • Optimization:
    • Optimizer: AdamW
    • Learning Rate: 1e-4 with linear scheduling
    • Batch Size: 16 per device
    • Gradient Accumulation Steps: 4 (effective batch size of 64 on a single GPU)
    • Warmup Steps: 10,000
    • Weight Decay: 0.015
  • Hardware: NVIDIA A100 GPU
  • Training Framework: Hugging Face Transformers + PyTorch
  • Training Duration: ~133 hours
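
A minimal sketch of how such a continual-pretraining run could be set up with the Hugging Face Trainer, using the hyperparameters listed above; this is illustrative rather than the authors' exact training script, and the corpus file name is an assumption:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the PhoBERT base checkpoint for continual pretraining
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base-v2")

# Hypothetical word-segmented Vietnamese legal corpus, one passage per line
dataset = load_dataset("text", data_files={"train": "legal_corpus_segmented.txt"})

def tokenize(batch):
    # 256-token training sequences, as listed above
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the stated 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="vilegalbert-cpt",
    max_steps=500_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    weight_decay=0.015,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()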

Legal Corpus Composition

The training corpus was compiled through systematic crawling and curation of Vietnamese legal documents from the same sources as ViLegalQwen2.5-1.5B-Base.

Corpus Statistics:

  • Training Corpus: 17GB of curated Vietnamese legal texts
  • Document Types: Laws, decrees, circulars, decisions, regulations, and legal interpretations
  • Coverage: Comprehensive Vietnamese legal framework from multiple authoritative sources
  • Preprocessing: Word segmentation applied before subword tokenization for better Vietnamese language understanding
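
The same segmentation step should be applied to any text fed to the model at inference time. A minimal sketch using pyvi (the example sentence is illustrative):

from pyvi import ViTokenizer

raw = "Phạm vi điều chỉnh của Bộ luật này"
# Multi-syllable Vietnamese words are joined with underscores, e.g. "điều_chỉnh"
segmented = ViTokenizer.tokenize(raw)
print(segmented)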

Performance & Capabilities

Strengths

  • Legal Domain Expertise: Enhanced understanding of Vietnamese legal terminology and concepts through continual pretraining
  • Word-Level Understanding: Benefits from Vietnamese word segmentation preprocessing, improving comprehension of Vietnamese legal text structure
  • Contextual Representations: Produces high-quality contextualized embeddings for Vietnamese legal text
  • Efficient Architecture: Compact 135M parameter model suitable for resource-constrained environments
  • Vietnamese Language: Native-level Vietnamese legal language processing with proper word boundary recognition

Evaluation

This model has been trained through continual pretraining on Vietnamese legal texts using masked language modeling.

Comprehensive evaluation results on downstream Vietnamese legal NLP tasks are coming soon.

Usage

Loading the Model

from pyvi import ViTokenizer
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "ntphuc149/ViLegalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage for getting embeddings
# Word-segment the input with pyvi first, to match the training preprocessing
text = ViTokenizer.tokenize("Điều 1. Phạm vi điều chỉnh của Bộ luật này")
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Extract embeddings
last_hidden_states = outputs.last_hidden_state  # shape: (batch, seq_len, 768)
# Note: for a checkpoint saved from MLM pretraining the pooling head may be
# freshly initialized, so last_hidden_state is usually the safer output to use
pooled_output = outputs.pooler_output           # shape: (batch, 768)
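
If a single sentence-level vector is needed (for example for the similarity use cases listed later), one common approach, not prescribed by the model authors, is attention-masked mean pooling over the token embeddings. Continuing the snippet above:

# Average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])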

Masked Language Modeling

from transformers import pipeline

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model="ntphuc149/ViLegalBERT", tokenizer="ntphuc149/ViLegalBERT")

# Example: predicting masked words in a legal context
# Inputs should be word-segmented (underscores joining multi-syllable words)
# to match the training preprocessing
text = "Theo quy_định của <mask> luật này, người vi_phạm sẽ bị xử_lý."
predictions = fill_mask(text)
print(predictions)

Fine-tuning for Downstream Tasks

This is a base model that requires fine-tuning for optimal performance on downstream tasks. Recommended applications include the following (a minimal fine-tuning sketch is shown after the list):

  • Legal Document Classification: Categorizing legal documents by type, domain, or jurisdiction
  • Legal Named Entity Recognition: Identifying legal entities, dates, regulations, and references
  • Legal Text Similarity: Computing semantic similarity between legal documents
  • Legal Information Retrieval: Enhancing search and retrieval of relevant legal information
  • Legal Question Answering: Building legal QA systems with proper fine-tuning
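
A minimal fine-tuning sketch for the document-classification case, using the Hugging Face Trainer; the dataset files, label count, and hyperparameters below are assumptions for illustration, not part of this release:

from datasets import load_dataset
from pyvi import ViTokenizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalBERT")
# Hypothetical 5-class legal document classification task
model = AutoModelForSequenceClassification.from_pretrained("ntphuc149/ViLegalBERT", num_labels=5)

# Hypothetical CSV files with "text" and integer "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def preprocess(batch):
    # Apply the same word segmentation used during pretraining
    segmented = [ViTokenizer.tokenize(t) for t in batch["text"]]
    return tokenizer(segmented, truncation=True, max_length=256)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="vilegalbert-doccls",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["validation"],
                  data_collator=DataCollatorWithPadding(tokenizer=tokenizer))
trainer.train()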

Model Limitations & Considerations

Limitations

  • Domain Specificity: Optimized for Vietnamese legal domain; may underperform on general Vietnamese text
  • Base Model Nature: Requires fine-tuning for optimal task-specific performance
  • Context Constraints: Trained with 256-token sequences; performance may degrade with longer contexts (see the truncation sketch after this list)
  • Training Data Bias: Performance may reflect biases present in Vietnamese legal corpus
  • Temporal Limitations: Training data has a temporal cutoff; may not reflect the most recent legal changes
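
To stay within the 256-token training length, long documents can be truncated or split into shorter passages before encoding; a minimal truncation sketch (the splitting strategy for longer documents is an application choice, not prescribed here):

from pyvi import ViTokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalBERT")

long_text = "Điều 5. Nguyên tắc áp dụng pháp luật ..."  # a long legal article in practice
inputs = tokenizer(ViTokenizer.tokenize(long_text),
                   truncation=True, max_length=256, return_tensors="pt")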

Ethical Considerations

  • Not Legal Advice: This model should NOT be used to provide actual legal advice
  • Professional Review Required: All model outputs should be reviewed by qualified legal professionals
  • Bias Awareness: Users should be aware of potential biases in legal interpretation
  • Responsible Use: Model should be used responsibly within appropriate legal and ethical frameworks

Safety Measures

  • Human Oversight: Always require human legal expert oversight
  • Output Verification: Verify all generated content against authoritative legal sources
  • Regulatory Compliance: Ensure usage complies with local AI and legal practice regulations

Citation

If you use ViLegalBERT in your research or applications, please cite:

@misc{vilegalbert-2025,
  title={ViLegalBERT: A Pretrained Language Model for Vietnamese Legal Texts},
  author={Tien-Manh Tran and Manh-Cuong Phan and Truong-Phuc Nguyen},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ntphuc149/ViLegalBERT}
}

@inproceedings{phobert,
  title={PhoBERT: Pre-trained language models for Vietnamese},
  author={Nguyen, Dat Quoc and Nguyen, Anh Tuan},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  pages={1037--1042},
  year={2020}
}

Contact & Support

For questions, suggestions, or collaboration opportunities:


Disclaimer: This model is provided for research and educational purposes. It should not replace professional legal advice or consultation. Users are responsible for ensuring compliance with applicable laws and regulations in their jurisdiction.
