ViLegalBERT

Model Description

ViLegalBERT is a Vietnamese legal language model developed through continual pretraining of vinai/phobert-base-v2 on an extensive Vietnamese legal corpus. This model is specifically designed for Vietnamese legal text understanding and representation, offering enhanced performance in legal domain tasks while maintaining the robust foundational capabilities of the PhoBERT architecture.

⚠️ Important Notice: This is a base model that requires fine-tuning for optimal performance on downstream tasks. We strongly recommend applying task-specific supervised fine-tuning before production use.

Model Details

Architecture & Specifications

  • Base Model: vinai/phobert-base-v2
  • Model Type: Masked Language Model (RoBERTa-based)
  • Parameters: 135M total parameters
  • Architecture: BERT-base with RoBERTa optimizations
  • Attention: Multi-head attention with 12 attention heads
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Number of Layers: 12
  • Context Length: 256 tokens (training sequence length)
  • Vocabulary Size: 64,000 subword tokens
  • License: CC BY-NC-SA 4.0
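
As a quick sanity check, these values can be read from the released configuration; a minimal sketch (the attribute names are the standard Hugging Face RoBERTa config fields):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ntphuc149/ViLegalBERT")
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
print(config.intermediate_size)    # expected: 3072
print(config.vocab_size)           # expected: 64000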

Training Details

  • Training Method: Continual Pretraining with Masked Language Modeling
  • Training Data: 17GB Vietnamese legal corpus (same as ntphuc149/ViLegalQwen2.5-1.5B-Base)
  • Training Objective: Masked Language Modeling (MLM)
  • MLM Probability: 0.15
  • Training Steps: 500,000 steps
  • Text Preprocessing: Vietnamese word segmentation using pyvi/ViTokenizer before BPE tokenization
  • Optimization:
    • Optimizer: AdamW
    • Learning Rate: 1e-4 with linear scheduling
    • Batch Size: 16 per device
    • Gradient Accumulation Steps: 4 (effective batch size of 64 on a single GPU)
    • Warmup Steps: 10,000
    • Weight Decay: 0.015
  • Hardware: NVIDIA A100 GPU
  • Training Framework: Hugging Face Transformers + PyTorch
  • Training Duration: ~133 hours
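
A minimal sketch of how such a continual-pretraining run could be set up with the Hugging Face Trainer, using the hyperparameters listed above; this is illustrative rather than the authors' exact training script, and the corpus file name is an assumption:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from the PhoBERT base checkpoint for continual pretraining
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base-v2")

# Hypothetical word-segmented Vietnamese legal corpus, one passage per line
dataset = load_dataset("text", data_files={"train": "legal_corpus_segmented.txt"})

def tokenize(batch):
    # 256-token training sequences, as listed above
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the stated 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="vilegalbert-cpt",
    max_steps=500_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=10_000,
    weight_decay=0.015,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()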

Legal Corpus Composition

The training corpus was compiled through systematic crawling and curation of Vietnamese legal documents from the same sources as ViLegalQwen2.5-1.5B-Base.

Corpus Statistics:

  • Training Corpus: 17GB of curated Vietnamese legal texts
  • Document Types: Laws, decrees, circulars, decisions, regulations, and legal interpretations
  • Coverage: Comprehensive Vietnamese legal framework from multiple authoritative sources
  • Preprocessing: Word segmentation applied before subword tokenization for better Vietnamese language understanding
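
The same segmentation step should be applied to any text fed to the model at inference time. A minimal sketch using pyvi (the example sentence is illustrative):

from pyvi import ViTokenizer

raw = "Phạm vi điều chỉnh của Bộ luật này"
# Multi-syllable Vietnamese words are joined with underscores, e.g. "điều_chỉnh"
segmented = ViTokenizer.tokenize(raw)
print(segmented)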

Performance & Capabilities

Strengths

  • Legal Domain Expertise: Enhanced understanding of Vietnamese legal terminology and concepts through continual pretraining
  • Word-Level Understanding: Benefits from Vietnamese word segmentation preprocessing, improving comprehension of Vietnamese legal text structure
  • Contextual Representations: Produces high-quality contextualized embeddings for Vietnamese legal text
  • Efficient Architecture: Compact 135M parameter model suitable for resource-constrained environments
  • Vietnamese Language: Native-level Vietnamese legal language processing with proper word boundary recognition

Evaluation

This model has been trained through continual pretraining on Vietnamese legal texts using masked language modeling.

Comprehensive evaluation results on downstream Vietnamese legal NLP tasks are coming soon.

Usage

Loading the Model

from pyvi import ViTokenizer
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "ntphuc149/ViLegalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage for getting embeddings
# Word-segment the input with pyvi first, to match the training preprocessing
text = ViTokenizer.tokenize("Điều 1. Phạm vi điều chỉnh của Bộ luật này")
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Extract embeddings
last_hidden_states = outputs.last_hidden_state  # shape: (batch, seq_len, 768)
# Note: for a checkpoint saved from MLM pretraining the pooling head may be
# freshly initialized, so last_hidden_state is usually the safer output to use
pooled_output = outputs.pooler_output           # shape: (batch, 768)
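
If a single sentence-level vector is needed (for example for the similarity use cases listed later), one common approach, not prescribed by the model authors, is attention-masked mean pooling over the token embeddings. Continuing the snippet above:

# Average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])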

Masked Language Modeling

from transformers import pipeline

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model="ntphuc149/ViLegalBERT", tokenizer="ntphuc149/ViLegalBERT")

# Example: predicting masked words in a legal context
# Inputs should be word-segmented (underscores joining multi-syllable words)
# to match the training preprocessing
text = "Theo quy_định của <mask> luật này, người vi_phạm sẽ bị xử_lý."
predictions = fill_mask(text)
print(predictions)

Fine-tuning for Downstream Tasks

This is a base model that requires fine-tuning for optimal performance on downstream tasks. Recommended applications include the following (a minimal fine-tuning sketch is shown after the list):

  • Legal Document Classification: Categorizing legal documents by type, domain, or jurisdiction
  • Legal Named Entity Recognition: Identifying legal entities, dates, regulations, and references
  • Legal Text Similarity: Computing semantic similarity between legal documents
  • Legal Information Retrieval: Enhancing search and retrieval of relevant legal information
  • Legal Question Answering: Building legal QA systems with proper fine-tuning
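
A minimal fine-tuning sketch for the document-classification case, using the Hugging Face Trainer; the dataset files, label count, and hyperparameters below are assumptions for illustration, not part of this release:

from datasets import load_dataset
from pyvi import ViTokenizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalBERT")
# Hypothetical 5-class legal document classification task
model = AutoModelForSequenceClassification.from_pretrained("ntphuc149/ViLegalBERT", num_labels=5)

# Hypothetical CSV files with "text" and integer "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def preprocess(batch):
    # Apply the same word segmentation used during pretraining
    segmented = [ViTokenizer.tokenize(t) for t in batch["text"]]
    return tokenizer(segmented, truncation=True, max_length=256)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="vilegalbert-doccls",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["validation"],
                  data_collator=DataCollatorWithPadding(tokenizer=tokenizer))
trainer.train()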

Model Limitations & Considerations

Limitations

  • Domain Specificity: Optimized for Vietnamese legal domain; may underperform on general Vietnamese text
  • Base Model Nature: Requires fine-tuning for optimal task-specific performance
  • Context Constraints: Trained with 256-token sequences; performance may degrade with longer contexts (see the truncation sketch after this list)
  • Training Data Bias: Performance may reflect biases present in Vietnamese legal corpus
  • Temporal Limitations: Training data has a temporal cutoff; may not reflect the most recent legal changes
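
To stay within the 256-token training length, long documents can be truncated or split into shorter passages before encoding; a minimal truncation sketch (the splitting strategy for longer documents is an application choice, not prescribed here):

from pyvi import ViTokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ntphuc149/ViLegalBERT")

long_text = "Điều 5. Nguyên tắc áp dụng pháp luật ..."  # a long legal article in practice
inputs = tokenizer(ViTokenizer.tokenize(long_text),
                   truncation=True, max_length=256, return_tensors="pt")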

Ethical Considerations

  • Not Legal Advice: This model should NOT be used to provide actual legal advice
  • Professional Review Required: All model outputs should be reviewed by qualified legal professionals
  • Bias Awareness: Users should be aware of potential biases in legal interpretation
  • Responsible Use: Model should be used responsibly within appropriate legal and ethical frameworks

Safety Measures

  • Human Oversight: Always require human legal expert oversight
  • Output Verification: Verify all generated content against authoritative legal sources
  • Regulatory Compliance: Ensure usage complies with local AI and legal practice regulations

Citation

If you use ViLegalBERT in your research or applications, please cite:

@misc{vilegalbert-2025,
  title={ViLegalBERT: A Pretrained Language Model for Vietnamese Legal Texts},
  author={Tien-Manh Tran and Manh-Cuong Phan and Truong-Phuc Nguyen},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ntphuc149/ViLegalBERT}
}

@inproceedings{phobert,
  title={PhoBERT: Pre-trained language models for Vietnamese},
  author={Nguyen, Dat Quoc and Nguyen, Anh Tuan},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  pages={1037--1042},
  year={2020}
}

Contact & Support

For questions, suggestions, or collaboration opportunities:


Disclaimer: This model is provided for research and educational purposes. It should not replace professional legal advice or consultation. Users are responsible for ensuring compliance with applicable laws and regulations in their jurisdiction.
