TELEN: Temporal Evolving Legal Embedding Network

Vietnamese legal text embedding with meta-learning for continuous adaptation to new laws.

Python 3.10+ | PyTorch | MIT License


Overview

TELEN is an embedding architecture for Vietnamese legal text retrieval in RAG (Retrieval-Augmented Generation) systems. Unlike conventional static embedding models, TELEN generates embeddings that adapt to the current state of the legal corpus, so new laws can be integrated without retraining the model.

Key Innovations

  1. HyperNetwork-Driven Projection: Instead of fixed projection weights, a HyperNetwork generates the embedding projection function from the current legal corpus state. When new laws are published, the embedding space adapts automatically.

  2. Legal Concept Graph (LCG): An evolving knowledge graph where nodes represent legal entities (laws, key terms) and edges encode cross-references, agency hierarchy, temporal sequences, and semantic similarity.

  3. State-Adaptive Embeddings: Embeddings are not static vectors; they are modulated by a learned "legal state vector" that summarizes the entire legal landscape at any point in time.
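
The Legal Concept Graph in item 2 can be pictured as a graph whose edges carry a type. The sketch below is only an illustration of that data structure: the class `ConceptGraph`, its methods, the edge-type strings, and the law ID `01/2020/tt-btc` are hypothetical and are not taken from TELEN's `concept_graph.py`.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Edge kinds listed in the README; the string labels are assumptions.
EDGE_TYPES = {"cross_reference", "agency_hierarchy", "temporal_sequence", "semantic_similarity"}


@dataclass
class ConceptGraph:
    """Hypothetical typed graph: nodes are law IDs or key terms, edges carry a type."""
    nodes: set = field(default_factory=set)
    # node -> set of (neighbor, edge_type) pairs
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_edge(self, src: str, dst: str, edge_type: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.nodes.update((src, dst))
        self.edges[src].add((dst, edge_type))

    def add_law(self, law_id: str, references: list, previous: Optional[str] = None) -> None:
        """Insert a law node, link it to the laws it cites and to its predecessor."""
        self.nodes.add(law_id)
        for ref in references:
            self.add_edge(law_id, ref, "cross_reference")
        if previous is not None:
            self.add_edge(previous, law_id, "temporal_sequence")

    def neighbors(self, node: str, edge_type: str) -> list:
        return sorted(dst for dst, kind in self.edges[node] if kind == edge_type)


graph = ConceptGraph()
graph.add_law("01/2020/tt-btc", references=[])  # made-up ID for illustration
graph.add_law("123/2025/l-ai", references=["01/2020/tt-btc"], previous="01/2020/tt-btc")
print(graph.neighbors("123/2025/l-ai", "cross_reference"))
```

Adding a law touches only its own node and edges, which is consistent with the claim that graph updates are cheap compared with retraining.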


Architecture

Legal Text
    ↓
Bi-Encoder (bkai-foundation-models/vietnamese-bi-encoder)
    ↓
Raw Representation [768-dim]
    ↓
┌─────────────────────────────────────────┐
│  HyperNetwork(state_vector) → ΔW, Δb    │  ← Generated, not learned!
│  Adapted Projection = Base + ΔW·x + Δb  │
└─────────────────────────────────────────┘
    ↓
Legal Concept Graph (GNN)
    ↓  state_vector
State Encoder ← current legal corpus
    ↓
L2-Normalized Embedding [768-dim]
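
The adapted projection in the diagram can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the code in `hypernetwork.py`: it assumes ΔW is built as a low-rank product A·B (one plausible reading of the "adaptation rank" hyperparameter), and the class name `LowRankHyperProjection`, the `state_dim` parameter, and the three linear heads are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankHyperProjection(nn.Module):
    """Sketch: a hypernetwork maps a state vector to a low-rank update of a base projection."""

    def __init__(self, dim: int = 768, state_dim: int = 128, rank: int = 64):
        super().__init__()
        self.dim, self.rank = dim, rank
        self.base = nn.Linear(dim, dim)  # base projection shared by all corpus states
        # Hypernetwork heads (assumed design): emit factors A, B and a bias shift.
        self.to_a = nn.Linear(state_dim, dim * rank)
        self.to_b = nn.Linear(state_dim, rank * dim)
        self.to_bias = nn.Linear(state_dim, dim)

    def forward(self, x: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # x: [batch, dim] raw encoder output; state: [state_dim] legal state vector
        a = self.to_a(state).view(self.dim, self.rank)
        b = self.to_b(state).view(self.rank, self.dim)
        delta_w = a @ b                                    # [dim, dim], rank at most self.rank
        delta_b = self.to_bias(state)                      # [dim]
        adapted = self.base(x) + x @ delta_w.T + delta_b   # Base + ΔW·x + Δb
        return F.normalize(adapted, p=2, dim=-1)           # L2-normalized embedding


torch.manual_seed(0)
proj = LowRankHyperProjection(dim=16, state_dim=8, rank=4)  # small sizes for the demo
x = torch.randn(2, 16)
emb_before = proj(x, torch.zeros(8))
emb_after = proj(x, torch.ones(8))  # a different corpus state gives different embeddings
print(emb_after.shape)
```

The same input `x` maps to different unit vectors under different state vectors, which is the behavior the diagram describes: the projection is a function of the corpus state rather than a single fixed matrix.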

Benchmark Results

Test set: 1,406 Vietnamese legal articles from 2021 (held-out, unseen during training)

Model                                       NDCG@3  NDCG@5  NDCG@10  MRR@3   MRR@5   MRR@10
BM25 (lexical)                              0.6753  0.7173  0.7250   0.6683  0.6928  0.6990
PhoBERT-base-v2 (monolingual dense)         0.5866  0.6360  0.6505   0.5657  0.5970  0.6059
multilingual-E5-base (multilingual dense)   0.4675  0.4888  0.5157   0.4327  0.4452  0.4573
BAAI/bge-m3 (multilingual dense, 1024d)     0.4668  0.5129  0.5452   0.4407  0.4657  0.4802
DEk21 (legal dense)                         0.7900  0.8127  0.8344   0.7660  0.7785  0.7865
TELEN (adaptive dense)                      0.9036  0.9138  0.9132   0.8830  0.8878  0.8878
TELEN + CE re-rank (adaptive dense)         0.9346  0.9339  0.9238   0.9199  0.9223  0.9223

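Key insight: The multilingual models (multilingual-E5, BGE-M3) score below BM25 on every metric, and so does the monolingual PhoBERT-base-v2. Only the legal-domain models (DEk21, TELEN) beat BM25, which suggests that domain specialization matters more than generic pre-training, multilingual or monolingual, for Vietnamese legal retrieval.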
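
For readers unfamiliar with the two metric families in the table, here is a small sketch of MRR@k and NDCG@k. It assumes binary relevance (an article is either a correct answer or not); the README does not state how `eval.py` computes them, so the function names and the toy IDs are illustrative.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a7", "a2", "a9", "a4"]   # system ranking for one query
relevant = {"a2", "a4"}             # ground-truth articles for that query
print(mrr_at_k(ranked, relevant, 3), ndcg_at_k(ranked, relevant, 3))
```

The reported scores would be the mean of these per-query values over the test queries. MRR only rewards the first correct article, while NDCG also credits additional correct articles further down the list.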

Relative Improvement

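Baseline             TELEN variant       NDCG@3  NDCG@5  NDCG@10  MRR@10
vs multilingual-E5   TELEN               +93.3%  +86.9%  +77.1%   +94.1%
vs BGE-M3            TELEN               +93.6%  +78.2%  +67.5%   +84.9%
vs PhoBERT           TELEN + CE re-rank  +59.3%  +46.8%  +42.0%   +52.2%
vs DEk21             TELEN + CE re-rank  +18.3%  +14.9%  +10.7%   +17.3%

Note: the rows are not all computed from the same TELEN variant. Checked against the benchmark table, the multilingual-E5 and BGE-M3 rows correspond to TELEN without re-ranking, while the PhoBERT and DEk21 rows correspond to TELEN + CE re-rank.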
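
The percentages are relative gains, (TELEN score − baseline score) / baseline score. The short check below recomputes a few NDCG@3 entries from the benchmark table; the helper name `relative_gain` is just for illustration.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# NDCG@3 values copied from the benchmark table.
telen, telen_ce = 0.9036, 0.9346
me5, dek21 = 0.4675, 0.7900

print(f"TELEN vs multilingual-E5: {relative_gain(telen, me5):+.1f}%")   # +93.3%
print(f"TELEN + CE vs DEk21:      {relative_gain(telen_ce, dek21):+.1f}%")  # +18.3%
print(f"TELEN vs DEk21:           {relative_gain(telen, dek21):+.1f}%")     # +14.4%
```

The +18.3% figure against DEk21 is reproduced only with the re-ranked score (0.9346); without re-ranking the gain is +14.4%, so it is worth stating which TELEN variant a given row refers to.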

Quick Start

Installation

pip install -r requirements.txt

Inference

from inference import TELENInference

# Load model
model = TELENInference()

# Encode legal texts
texts = [
    # "Article 1: This circular regulates the administration of value-added tax..."
    "Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng...",
    # "Article 2: This applies to organizations and individuals doing business..."
    "Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh...",
]
embeddings = model.encode(texts)  # → [2, 768] normalized vectors

# Compute similarity
similarity = model.similarity(texts[0], texts[1])
print(f"Cosine similarity: {similarity:.4f}")

# Retrieve similar documents. `corpus` is assumed to be a list of article texts;
# `texts` is reused here only so that the example runs as written.
corpus = texts
results = model.retrieve(texts[0], corpus, top_k=10)
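
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, and retrieval amounts to sorting the corpus by that score. The dependency-free sketch below shows the idea on toy 2-dimensional vectors; `top_k` and `l2_normalize` are illustrative helpers, not the implementation behind `model.retrieve`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def top_k(query_vec, corpus_vecs, k=10):
    """Return (index, cosine similarity) pairs for the k most similar corpus vectors."""
    q = l2_normalize(query_vec)
    scores = []
    for idx, vec in enumerate(corpus_vecs):
        d = l2_normalize(vec)
        scores.append((idx, sum(a * b for a, b in zip(q, d))))  # dot product of unit vectors
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = top_k([1.0, 0.1], corpus_vecs, k=2)
print(hits)
```

For a corpus of tens of thousands of articles, the same ranking is normally computed as one matrix product over precomputed corpus embeddings rather than a Python loop.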

Training

# Train TELEN from scratch
python train.py

# Train cross-encoder re-ranker (optional; MRR@10 0.8878 → 0.9223 in the benchmark, about +4% relative)
python train_ce.py

Evaluation

# Full benchmark (TELEN vs BM25/PhoBERT/mE5/BGE-M3/DEk21)
python eval.py

# TELEN + Cross-encoder re-ranking (MRR-optimized)
python eval_rerank.py

Training Details

Dataset

  • Source: another-symato/VMTEB-Zalo-legel-retrieval-wseg on HuggingFace
  • Content: 61,425 Vietnamese legal articles (Thông tư, Nghị định, Luật, Pháp lệnh; that is, circulars, decrees, laws, and ordinances)
  • Period: 1999–2021
  • Format: Word-segmented Vietnamese text (underscore-separated compound words)
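
In word-segmented text, the syllables of one compound word are joined by underscores, so whitespace splitting yields words rather than syllables. The snippet below illustrates the format on a hand-segmented sample sentence (an actual segmenter may split it differently); both helper functions are hypothetical.

```python
def split_compounds(segmented: str) -> list:
    """Tokens of a word-segmented sentence; each token may be a multi-syllable compound."""
    return segmented.split()


def to_plain_text(segmented: str) -> str:
    """Undo segmentation by turning the underscores inside compounds back into spaces."""
    return " ".join(token.replace("_", " ") for token in segmented.split())


# Hand-segmented for illustration: "This circular regulates value-added tax administration"
sample = "Thông_tư này quy_định về quản_lý thuế giá_trị gia_tăng"
tokens = split_compounds(sample)
print(tokens)
print(to_plain_text(sample))
```

Queries passed to a model trained on this dataset should presumably be segmented the same way, since the encoder saw underscore-joined compounds during training.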

Training Pipeline

Stage                       Description                                                   Epochs           Trainable Params
1. Contrastive Pretraining  Triplet + InfoNCE loss on same-law article pairs              5                ~1M (projection head)
2. Meta-Training            HyperNetwork learns to adapt embedding space for future laws  50 (early stop)  ~4M (HyperNetwork + State Encoder)
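
The two stage-1 losses can be written down for a single query as below. This is a sketch on precomputed cosine similarities: the temperature 0.05 comes from the hyperparameter table, while the triplet margin of 0.2 and the function names are assumptions, and the README does not say how the two losses are weighted or combined.

```python
import math


def info_nce(pos_sim: float, neg_sims: list, temperature: float = 0.05) -> float:
    """InfoNCE: negative log-softmax of the positive among positive + negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_denominator - logits[0]


def triplet(pos_sim: float, neg_sim: float, margin: float = 0.2) -> float:
    """Triplet margin loss on similarities: penalize a negative within `margin` of the positive."""
    return max(0.0, margin - pos_sim + neg_sim)


easy = info_nce(0.9, [0.1, 0.0, -0.2])   # negatives far from the positive
hard = info_nce(0.9, [0.85, 0.8, 0.1])   # negatives close to the positive
print(f"InfoNCE easy={easy:.6f} hard={hard:.6f}")
print(f"triplet={triplet(0.9, 0.85):.2f}")
```

With a low temperature such as 0.05, similarity gaps are magnified 20-fold before the softmax, so the loss is dominated by the hardest negatives.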

Hyperparameters

Parameter            Value
Backbone             bkai-foundation-models/vietnamese-bi-encoder
Embedding dimension  768
Adaptation rank      64
GNN layers           3
Meta N-way, K-shot   16-way, 5-shot
Negatives per query  256 (50% hard + 50% random)
Temperature          0.05
Optimizer            AdamW + CosineAnnealingWarmRestarts
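
The "256 (50% hard + 50% random)" setting can be read as drawing half of each query's negatives from a pool of hard candidates and the rest uniformly from the corpus. A minimal sketch of that mixing step follows; how TELEN mines its hard negatives is not described here, so the pools and the function `mix_negatives` are hypothetical.

```python
import random


def mix_negatives(hard_pool, random_pool, total=256, hard_fraction=0.5, seed=None):
    """Draw `total` negatives: `hard_fraction` from the hard pool, the rest at random."""
    rng = random.Random(seed)
    n_hard = min(int(total * hard_fraction), len(hard_pool))
    hard = rng.sample(hard_pool, n_hard)
    chosen = set(hard)
    candidates = [doc for doc in random_pool if doc not in chosen]  # avoid duplicates
    n_random = min(total - n_hard, len(candidates))
    return hard + rng.sample(candidates, n_random)


hard_pool = [f"hard_{i}" for i in range(500)]     # e.g. high-scoring but wrong articles
random_pool = [f"rand_{i}" for i in range(5000)]  # the rest of the corpus
negatives = mix_negatives(hard_pool, random_pool, seed=0)
print(len(negatives), sum(n.startswith("hard_") for n in negatives))
```

Mixing the two sources is a common compromise: hard negatives sharpen fine distinctions between similar articles, while random ones keep the embedding space calibrated across unrelated topics.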

Hardware

  • GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
  • Training time: ~8 hours (5 contrastive + 50 meta epochs)

Continuous Adaptation

When a new law is published, TELEN adapts without retraining:

# New law arrives
new_articles = [
    "Điều 1: Luật mới về trí tuệ nhân tạo...",  # "Article 1: New law on artificial intelligence..."
    "Điều 2: Các nguyên tắc áp dụng AI trong xét xử...",  # "Article 2: Principles for applying AI in adjudication..."
]

# Update concept graph (milliseconds)
model.add_new_law("123/2025/l-ai", new_articles)

# Embedding space automatically adapts via HyperNetwork
# All subsequent query embeddings reflect the new legal landscape
embeddings = model.encode(["Điều 1: ..."])

Project Structure

law-embedding/
├── dataset/
│   └── train-00000-of-00001.parquet   # Training data (61K legal articles)
├── src/
│   ├── data.py                    # Data loading utilities
│   └── telern/
│       ├── config.py              # Configuration
│       ├── model.py               # TELEN architecture
│       ├── concept_graph.py       # Legal Concept Graph + GNN
│       ├── hypernetwork.py        # HyperNetwork + StateEncoder
│       └── evaluate.py            # Evaluation metrics & baselines
├── data/checkpoints/telen/
│   └── telen_best.pt              # Pretrained model weights
├── train.py                       # Training script
├── train_ce.py                    # Cross-encoder training (optional)
├── eval.py                        # Evaluation script
├── inference.py                   # Inference API
├── requirements.txt
└── README.md

Citation

@misc{telen2025,
  title={TELEN: Temporal Evolving Legal Embedding Network for Vietnamese Law},
  author={dangdinh},
  year={2026},
  publisher={Hugging Face},
}

License

MIT License. See the LICENSE file for details.

Acknowledgments

  • bkai-foundation-models/vietnamese-bi-encoder: backbone bi-encoder
  • huyydangg/DEk21_hcmute_embedding: baseline comparison
  • vinai/phobert-base-v2: used in cross-encoder re-ranker