TELEN: Temporal Evolving Legal Embedding Network
Vietnamese legal text embedding with meta-learning for continuous adaptation to new laws.
Overview
TELEN is an embedding architecture designed for Vietnamese legal text retrieval in RAG (Retrieval-Augmented Generation) systems. Unlike conventional static embedding models, TELEN generates embeddings that adapt to the current state of the legal corpus, so new laws can be integrated without retraining.
Key Innovations
HyperNetwork-Driven Projection: Instead of using fixed projection weights, a HyperNetwork generates the embedding projection from the current state of the legal corpus. When new laws are published, the embedding space adapts without retraining.
Legal Concept Graph (LCG): An evolving knowledge graph whose nodes represent legal entities (laws, key terms) and whose edges encode cross-references, agency hierarchy, temporal sequence, and semantic similarity.
State-Adaptive Embeddings: Embeddings are not static vectors; they are modulated by a learned "legal state vector" that summarizes the legal corpus at a given point in time.
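To make the graph description concrete, here is a minimal sketch of a typed-edge concept graph in plain Python. The class name `ConceptGraph`, its methods, and the edge-type labels are illustrative assumptions; they are not taken from `concept_graph.py`, and the GNN that runs over the graph is omitted.

```python
from collections import defaultdict

# Edge types named after the four relations listed above (labels are assumed).
EDGE_TYPES = {
    "cross_reference",
    "agency_hierarchy",
    "temporal_sequence",
    "semantic_similarity",
}


class ConceptGraph:
    """Toy evolving graph: nodes are law or term IDs, edges carry a relation type."""

    def __init__(self):
        self.nodes = {}                 # node_id -> kind ("law" or "term")
        self.edges = defaultdict(set)   # node_id -> {(neighbor_id, edge_type)}

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, dst, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges[src].add((dst, edge_type))

    def neighbors(self, node_id, edge_type=None):
        """Return sorted neighbor IDs, optionally filtered by relation type."""
        return sorted(
            dst for dst, kind in self.edges[node_id]
            if edge_type is None or kind == edge_type
        )


# Hypothetical node IDs, used only for illustration.
graph = ConceptGraph()
graph.add_node("law-A", "law")
graph.add_node("law-B", "law")
graph.add_node("term-vat", "term")
graph.add_edge("law-B", "law-A", "cross_reference")
graph.add_edge("law-B", "term-vat", "semantic_similarity")
print(graph.neighbors("law-B", "cross_reference"))
```

Because a new law only adds nodes and edges, growing the graph is cheap compared with retraining an encoder.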
Architecture
Legal Text
    ↓
Bi-Encoder (bkai-foundation-models/vietnamese-bi-encoder)
    ↓
Raw Representation [768-dim]
    ↓
┌───────────────────────────────────────┐
│ HyperNetwork(state_vector) → ΔW, Δb   │ ← Generated, not learned!
│ Adapted Projection = Base + ΔW·x + Δb │
└───────────────────────────────────────┘
    ↑
Legal Concept Graph (GNN)
    ↑ state_vector
State Encoder ← current legal corpus
    ↓
L2-Normalized Embedding [768-dim]
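The adapted projection in the diagram can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the README gives only "Adaptation rank 64", so factoring ΔW as a low-rank product `U @ V` is a guess, as are the single-linear-layer hypernetwork, the reading of "Base" as `W_base @ x + b_base`, and all names and sizes below.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, RANK, STATE_DIM = 8, 2, 4  # toy sizes; the README lists 768-dim and rank 64

# Fixed base projection (stands in for the stage-1 projection head).
W_base = rng.normal(scale=0.1, size=(DIM, DIM))
b_base = np.zeros(DIM)

# Hypothetical hypernetwork: one linear map from the state vector to each
# generated quantity (the factors U and V, and the bias shift delta_b).
H_u = rng.normal(scale=0.1, size=(STATE_DIM, DIM * RANK))
H_v = rng.normal(scale=0.1, size=(STATE_DIM, RANK * DIM))
H_b = rng.normal(scale=0.1, size=(STATE_DIM, DIM))


def adapted_embedding(x, state):
    """Compute Base + delta_W @ x + delta_b, then L2-normalize."""
    U = (state @ H_u).reshape(DIM, RANK)
    V = (state @ H_v).reshape(RANK, DIM)
    delta_b = state @ H_b
    base = W_base @ x + b_base
    z = base + U @ (V @ x) + delta_b  # delta_W = U @ V is never materialized
    return z / np.linalg.norm(z)


x = rng.normal(size=DIM)           # raw bi-encoder representation (toy)
state_old = rng.normal(size=STATE_DIM)
state_new = state_old + 0.5        # corpus state after a hypothetical update
print(adapted_embedding(x, state_old) @ adapted_embedding(x, state_new))
```

The same input `x` yields a different embedding once the state vector changes, which is the behavior the diagram describes; no weights are updated in between.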
Benchmark Results
Test set: 1,406 Vietnamese legal articles from 2021 (held-out, unseen during training)
| Model | NDCG@3 | NDCG@5 | NDCG@10 | MRR@3 | MRR@5 | MRR@10 |
|---|---|---|---|---|---|---|
| BM25 (lexical) | 0.6753 | 0.7173 | 0.7250 | 0.6683 | 0.6928 | 0.6990 |
| PhoBERT-base-v2 (monolingual dense) | 0.5866 | 0.6360 | 0.6505 | 0.5657 | 0.5970 | 0.6059 |
| multilingual-E5-base (multilingual dense) | 0.4675 | 0.4888 | 0.5157 | 0.4327 | 0.4452 | 0.4573 |
| BAAI/bge-m3 (multilingual dense, 1024d) | 0.4668 | 0.5129 | 0.5452 | 0.4407 | 0.4657 | 0.4802 |
| DEk21 (legal dense) | 0.7900 | 0.8127 | 0.8344 | 0.7660 | 0.7785 | 0.7865 |
| TELEN (adaptive dense) | 0.9036 | 0.9138 | 0.9132 | 0.8830 | 0.8878 | 0.8878 |
| TELEN + CE re-rank (adaptive dense) | 0.9346 | 0.9339 | 0.9238 | 0.9199 | 0.9223 | 0.9223 |
Key insight: The multilingual dense models (multilingual-E5, BGE-M3) score below BM25 on this Vietnamese legal test set, as does the general-purpose PhoBERT-base-v2. This suggests that domain and language specialization matters more than generic pre-training for legal retrieval.
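For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute MRR@k and NDCG@k for a single query. It assumes binary relevance labels; the actual implementation in `evaluate.py` may differ, and the function names are illustrative.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the only relevant article is ranked second.
ranked = ["art-7", "art-3", "art-9"]
relevant = {"art-3"}
print(mrr_at_k(ranked, relevant, k=3), ndcg_at_k(ranked, relevant, k=3))
```

Benchmark scores are the mean of these per-query values over the test set.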
Relative Improvement
| Baseline | TELEN variant compared | NDCG@3 | NDCG@5 | NDCG@10 | MRR@10 |
|---|---|---|---|---|---|
| vs multilingual-E5 | TELEN | +93.3% | +86.9% | +77.1% | +94.1% |
| vs BGE-M3 | TELEN | +93.6% | +78.2% | +67.5% | +84.9% |
| vs PhoBERT | TELEN + CE re-rank | +59.3% | +46.8% | +42.0% | +52.2% |
| vs DEk21 | TELEN + CE re-rank | +18.3% | +14.9% | +10.7% | +17.3% |
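The percentages can be recomputed from the benchmark table as `(ours / baseline - 1) * 100`. Doing so for NDCG@3 shows that the multilingual-E5 and BGE-M3 rows match TELEN alone, while the PhoBERT and DEk21 rows match TELEN + CE re-rank. The helper name `relative_gain` is illustrative.

```python
# NDCG@3 values copied from the benchmark table above.
NDCG3 = {
    "multilingual-E5": 0.4675,
    "BGE-M3": 0.4668,
    "PhoBERT": 0.5866,
    "DEk21": 0.7900,
    "TELEN": 0.9036,
    "TELEN+CE": 0.9346,
}


def relative_gain(ours, baseline):
    """Percentage improvement of `ours` over `baseline`, rounded to one decimal."""
    return round((ours / baseline - 1.0) * 100, 1)


for name in ("multilingual-E5", "BGE-M3", "PhoBERT", "DEk21"):
    plain = relative_gain(NDCG3["TELEN"], NDCG3[name])
    reranked = relative_gain(NDCG3["TELEN+CE"], NDCG3[name])
    print(f"{name}: TELEN {plain:+.1f}%, TELEN+CE {reranked:+.1f}%")
```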
Quick Start
Installation
pip install -r requirements.txt
Inference
from inference import TELENInference
# Load model
model = TELENInference()
# Encode legal texts
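texts = [
    # "Article 1: This circular regulates value-added tax administration..."
    "Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng...",
    # "Article 2: This applies to organizations and individuals doing business..."
    "Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh...",
]
embeddings = model.encode(texts)  # -> [2, 768] normalized vectors
# Compute similarity
similarity = model.similarity(texts[0], texts[1])
print(f"Cosine similarity: {similarity:.4f}")
# Retrieve similar documents
corpus = texts  # placeholder (assumed to be a list of article strings); use your own corpus
results = model.retrieve(texts[0], corpus, top_k=10)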
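Conceptually, retrieval over L2-normalized embeddings reduces to ranking by dot product. The sketch below illustrates that idea in plain Python; it is not the `TELENInference.retrieve` implementation, and `cosine_top_k` is a hypothetical helper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_top_k(query_vec, corpus_vecs, top_k):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = l2_normalize(vec)
        # For unit vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings"; real TELEN vectors are 768-dimensional.
corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = cosine_top_k([1.0, 0.1], corpus_vecs, top_k=2)
print(hits)
```

At corpus scale the same ranking is usually done with a matrix product or an approximate nearest-neighbor index rather than a Python loop.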
Training
# Train TELEN from scratch
python train.py
# Train cross-encoder re-ranker (optional; raises MRR@10 from 0.8878 to 0.9223, about 4% relative)
python train_ce.py
Evaluation
# Full benchmark (TELEN vs BM25/PhoBERT/mE5/BGE-M3/DEk21)
python eval.py
# TELEN + Cross-encoder re-ranking (MRR-optimized)
python eval_rerank.py
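The "TELEN + CE re-rank" setting is a two-stage pipeline: a fast first-stage scorer shortlists candidates, and a slower cross-encoder reorders the shortlist. A minimal sketch follows, with toy scoring functions standing in for both models. The function names, the `candidates` parameter, and the scorers are assumptions and do not reflect `eval_rerank.py`.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score,
                         candidates=50, top_k=10):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:candidates]
    return sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )[:top_k]


def token_overlap(query, doc):
    """Toy first stage (stand-in for the bi-encoder): shared token count."""
    return len(set(query.split()) & set(doc.split()))


def overlap_density(query, doc):
    """Toy re-ranker (stand-in for the cross-encoder): overlap per document token."""
    return token_overlap(query, doc) / len(doc.split())


docs = ["tax law on value added tax", "value added tax", "labor code"]
ranked = retrieve_then_rerank(
    "value added tax", docs, token_overlap, overlap_density, candidates=2, top_k=2
)
print(ranked)
```

The re-ranker only sees the shortlist, so its cost scales with `candidates` rather than with corpus size.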
Training Details
Dataset
- Source: another-symato/VMTEB-Zalo-legel-retrieval-wseg on HuggingFace
- Content: 61,425 Vietnamese legal articles (Thông tư, Nghị định, Luật, Pháp lệnh; that is, circulars, decrees, laws, and ordinances)
- Period: 1999–2021
- Format: Word-segmented Vietnamese text (underscore-separated compound words)
Training Pipeline
| Stage | Description | Epochs | Trainable Params |
|---|---|---|---|
| 1. Contrastive Pretraining | Triplet + InfoNCE loss on same-law article pairs | 5 | ~1M (projection head) |
| 2. Meta-Training | HyperNetwork learns to adapt embedding space for future laws | 50 (early stop) | ~4M (HyperNetwork + State Encoder) |
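Meta-training is typically organized into episodes. The sketch below shows one plausible way to draw an N-way, K-shot episode, assuming that each "class" is a law and its examples are that law's articles. The README does not spell this out, so the grouping, the `n_query` parameter, and the function name are all assumptions.

```python
import random


def sample_episode(articles_by_law, n_way, k_shot, n_query, rng):
    """Pick n_way laws, then split sampled articles into support and query sets."""
    eligible = [
        law for law, arts in articles_by_law.items()
        if len(arts) >= k_shot + n_query
    ]
    laws = rng.sample(eligible, n_way)
    support, query = {}, {}
    for law in laws:
        picked = rng.sample(articles_by_law[law], k_shot + n_query)
        support[law] = picked[:k_shot]   # used to adapt the embedding space
        query[law] = picked[k_shot:]     # used to score the adapted space
    return support, query


# Toy corpus: 20 hypothetical laws with 8 articles each.
articles_by_law = {
    f"law-{i}": [f"law-{i}/article-{j}" for j in range(8)] for i in range(20)
}
support, query = sample_episode(
    articles_by_law, n_way=16, k_shot=5, n_query=2, rng=random.Random(0)
)
print(len(support), len(next(iter(support.values()))))
```

To simulate "future laws", episodes could also be split by publication date so that query laws postdate support laws; that variant is not shown here.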
Hyperparameters
| Parameter | Value |
|---|---|
| Backbone | bkai-foundation-models/vietnamese-bi-encoder |
| Embedding dimension | 768 |
| Adaptation rank | 64 |
| GNN layers | 3 |
| Meta N-way, K-shot | 16-way, 5-shot |
| Negatives per query | 256 (50% hard + 50% random) |
| Temperature | 0.05 |
| Optimizer | AdamW + CosineAnnealingWarmRestarts |
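To show what the temperature does, here is a minimal InfoNCE loss for a single query in plain Python, using the temperature of 0.05 from the table. It is a sketch of the standard formulation, not the training code; how TELEN combines it with the triplet loss is not specified here.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """-log softmax of the positive among [positive] + negatives (log-sum-exp form)."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# 256 negatives per query, as in the table; similarity values are made up.
easy = info_nce(pos_sim=0.9, neg_sims=[0.1] * 256)
hard = info_nce(pos_sim=0.9, neg_sims=[0.8] * 256)
print(f"easy negatives: {easy:.6f}, hard negatives: {hard:.6f}")
```

A low temperature sharpens the softmax, so negatives whose similarity is close to the positive dominate the loss. This is one reason hard negatives are commonly mixed into each batch.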
Hardware
- GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
- Training time: ~8 hours (5 contrastive + 50 meta epochs)
Continuous Adaptation
When a new law is published, TELEN adapts without retraining:
# New law arrives
new_articles = [
    # "Article 1: New law on artificial intelligence..."
    "Điều 1: Luật mới về trí tuệ nhân tạo...",
    # "Article 2: Principles for applying AI in adjudication..."
    "Điều 2: Các nguyên tắc áp dụng AI trong xét xử...",
]
# Update concept graph (milliseconds)
model.add_new_law("123/2025/l-ai", new_articles)
# The HyperNetwork adapts the embedding space automatically;
# all subsequent query embeddings reflect the new legal landscape
embeddings = model.encode(["Điều 1: ..."])
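The reason such an update can be cheap is that only a summary of the corpus changes, not the model weights. As a deliberately simplified stand-in for the State Encoder, the sketch below keeps a running mean of article vectors and updates it incrementally. TELEN's actual state vector comes from the Legal Concept Graph, so `RunningState` is purely illustrative.

```python
class RunningState:
    """Toy corpus summary: an incrementally updated mean of article vectors."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def add(self, vectors):
        """Fold new article vectors into the mean without revisiting old ones."""
        for vec in vectors:
            self.count += 1
            self.mean = [m + (v - m) / self.count for m, v in zip(self.mean, vec)]
        return self.mean


state = RunningState(dim=2)
state.add([[1.0, 0.0], [3.0, 0.0]])   # existing corpus (toy vectors)
updated = state.add([[2.0, 3.0]])     # articles of a newly published law
print(updated)
```

Each update costs time proportional to the number of new articles, independent of how large the existing corpus is.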
Project Structure
law-embedding/
├── dataset/
│   └── train-00000-of-00001.parquet   # Training data (61K legal articles)
├── src/
│   ├── data.py                        # Data loading utilities
│   └── telern/
│       ├── config.py                  # Configuration
│       ├── model.py                   # TELEN architecture
│       ├── concept_graph.py           # Legal Concept Graph + GNN
│       ├── hypernetwork.py            # HyperNetwork + StateEncoder
│       └── evaluate.py                # Evaluation metrics & baselines
├── data/checkpoints/telen/
│   └── telen_best.pt                  # Pretrained model weights
├── train.py                           # Training script
├── train_ce.py                        # Cross-encoder training (optional)
├── eval.py                            # Evaluation script
├── inference.py                       # Inference API
├── requirements.txt
└── README.md
Citation
@misc{telen2025,
title={TELEN: Temporal Evolving Legal Embedding Network for Vietnamese Law},
author={dangdinh},
year={2026},
publisher={Hugging Face},
}
License
MIT License; see the LICENSE file for details.
Acknowledgments
- bkai-foundation-models/vietnamese-bi-encoder: backbone bi-encoder
- huyydangg/DEk21_hcmute_embedding: baseline comparison
- vinai/phobert-base-v2: used in cross-encoder re-ranker