TELEN: Temporal Evolving Legal Embedding Network

Vietnamese legal text embedding with meta-learning for continuous adaptation to new laws.

Overview

TELEN is an embedding architecture for Vietnamese legal text retrieval in RAG (Retrieval-Augmented Generation) systems. Unlike conventional static embedding models, TELEN conditions its embeddings on the current state of the legal corpus, so new laws can be integrated without retraining the model.

Key Innovations

HyperNetwork-Driven Projection — Instead of fixed projection weights, a HyperNetwork generates the embedding projection function from the current legal corpus state. When new laws are published, the embedding space adapts automatically.
Legal Concept Graph (LCG) — An evolving knowledge graph where nodes represent legal entities (laws, key terms) and edges encode cross-references, agency hierarchy, temporal sequences, and semantic similarity.
State-Adaptive Embeddings — Embeddings are not static vectors but are modulated by a learned "legal state vector" that summarizes the entire legal landscape at any point in time.

Architecture

Legal Text
    ↓
Bi-Encoder (bkai-foundation-models/vietnamese-bi-encoder)
    ↓
Raw Representation [768-dim]
    ↓
┌─────────────────────────────────────┐
│  HyperNetwork(state_vector) → ΔW, Δb │  ← Generated per corpus state, not stored as fixed weights
│  Adapted Projection = Base + ΔW·x + Δb │
└─────────────────────────────────────┘
    ↓
Legal Concept Graph (GNN)
    ↓  state_vector
State Encoder ← current legal corpus
    ↓
L2-Normalized Embedding [768-dim]
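
The projection step in the diagram can be illustrated with a toy, dependency-free sketch. It assumes the generated update is low-rank (ΔW = A·B, suggested by the "Adaptation rank" hyperparameter) and that the HyperNetwork is a single linear map from the state vector; both are assumptions for illustration. `ToyHyperProjection` and its attributes are hypothetical names, not the classes in `hypernetwork.py`.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class ToyHyperProjection:
    """Hypothetical stand-in: base projection plus a state-generated low-rank update."""

    def __init__(self, dim, rank, state_dim, seed=0):
        rng = random.Random(seed)
        rand = lambda rows, cols: [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.dim, self.rank = dim, rank
        self.base = rand(dim, dim)                 # fixed base projection
        self.gen_a = rand(dim * rank, state_dim)   # state -> factor A (dim x rank)
        self.gen_b = rand(rank * dim, state_dim)   # state -> factor B (rank x dim)
        self.gen_bias = rand(dim, state_dim)       # state -> delta bias

    def project(self, x, state):
        flat_a = matvec(self.gen_a, state)
        flat_b = matvec(self.gen_b, state)
        a = [flat_a[i * self.rank:(i + 1) * self.rank] for i in range(self.dim)]
        b = [flat_b[i * self.dim:(i + 1) * self.dim] for i in range(self.rank)]
        delta_bias = matvec(self.gen_bias, state)
        low_rank = matvec(a, matvec(b, x))         # (A @ B) @ x, i.e. delta_W @ x
        base_out = matvec(self.base, x)
        return l2_normalize([p + q + r for p, q, r in zip(base_out, low_rank, delta_bias)])

proj = ToyHyperProjection(dim=8, rank=2, state_dim=4)
x = [1.0] * 8
emb_old = proj.project(x, state=[1.0, 0.0, 0.0, 0.0])
emb_new = proj.project(x, state=[0.0, 1.0, 0.0, 0.0])
```

The same input `x` maps to different unit-length embeddings under the two state vectors, which is the behaviour the diagram describes: the projection changes when the corpus state changes, while the stored parameters stay fixed.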

Benchmark Results

Test set: 1,406 Vietnamese legal articles from 2021 (held-out, unseen during training)

Model	NDCG@3	NDCG@5	NDCG@10	MRR@3	MRR@5	MRR@10
BM25 (lexical)	0.6753	0.7173	0.7250	0.6683	0.6928	0.6990
PhoBERT-base-v2 (monolingual dense)	0.5866	0.6360	0.6505	0.5657	0.5970	0.6059
multilingual-E5-base (multilingual dense)	0.4675	0.4888	0.5157	0.4327	0.4452	0.4573
BAAI/bge-m3 (multilingual dense, 1024d)	0.4668	0.5129	0.5452	0.4407	0.4657	0.4802
DEk21 (legal dense)	0.7900	0.8127	0.8344	0.7660	0.7785	0.7865
TELEN (adaptive dense)	0.9036	0.9138	0.9132	0.8830	0.8878	0.8878
TELEN + CE re-rank (adaptive dense)	0.9346	0.9339	0.9238	0.9199	0.9223	0.9223

Key insight: The general-purpose multilingual models (multilingual-E5, BGE-M3) score below BM25 on every metric in this benchmark, which suggests that domain and language specialization matters more than generic multilingual pre-training for Vietnamese legal retrieval.
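
For readers unfamiliar with the NDCG@k and MRR@k columns, the following is a minimal sketch of both metrics for a single query, assuming binary relevance. The exact definitions used by `eval.py` are not shown here and may differ (for example graded relevance); the function names are illustrative.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranking = ["a7", "a2", "a9", "a4"]   # retrieved article ids, best first
relevant = {"a2"}                    # ground-truth relevant ids
print(mrr_at_k(ranking, relevant, 3))   # 0.5
print(ndcg_at_k(ranking, relevant, 3))  # 1/log2(3), about 0.6309
```

Benchmark scores are then the mean of these per-query values over the test queries.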

Relative Improvement

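Relative gain of TELEN (without re-ranking) over each baseline, computed from the benchmark table above.

Baseline	NDCG@3	NDCG@5	NDCG@10	MRR@10
vs multilingual-E5	+93.3%	+86.9%	+77.1%	+94.1%
vs BGE-M3	+93.6%	+78.2%	+67.5%	+84.9%
vs PhoBERT	+54.0%	+43.7%	+40.4%	+46.5%
vs DEk21	+14.4%	+12.4%	+9.4%	+12.9%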
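
Each percentage is a plain relative difference between two scores from the benchmark table. As a worked check, this sketch reproduces the NDCG@3 entry for multilingual-E5; the helper name is illustrative.

```python
def relative_gain(system, baseline):
    """Percentage improvement of `system` over `baseline`."""
    return (system - baseline) / baseline * 100.0

telen_ndcg3 = 0.9036   # TELEN, NDCG@3 (benchmark table)
me5_ndcg3 = 0.4675     # multilingual-E5-base, NDCG@3 (benchmark table)
gain = relative_gain(telen_ndcg3, me5_ndcg3)
print(f"{gain:+.1f}%")  # +93.3%
```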

Quick Start

Installation

pip install -r requirements.txt

Inference

from inference import TELENInference

# Load model
model = TELENInference()

# Encode legal texts
texts = [
    "Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng...",
    "Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh...",
]
embeddings = model.encode(texts)  # → [2, 768] normalized vectors

# Compute similarity
similarity = model.similarity(texts[0], texts[1])
print(f"Cosine similarity: {similarity:.4f}")

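# Retrieve similar documents
# `corpus` is the list of candidate article strings to search (here: the texts above)
corpus = texts
results = model.retrieve(texts[0], corpus, top_k=10)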
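
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, and retrieval is a top-k search over those scores. The sketch below shows that idea on toy 2-dimensional vectors. It is an assumption about how such retrieval is typically done, not the implementation behind `model.retrieve`.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_k(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k highest dot products."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in corpus_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

corpus_vecs = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query_vec = l2_normalize([2.0, 0.1])
hits = top_k(query_vec, corpus_vecs, k=2)
print([i for i, _ in hits])  # [0, 1]
```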

Training

# Train TELEN from scratch
python train.py

# Train cross-encoder re-ranker (optional; MRR@10 0.8878 → 0.9223 in the benchmark above, about +4% relative)
python train_ce.py

Evaluation

# Full benchmark (TELEN vs BM25/PhoBERT/mE5/BGE-M3/DEk21)
python eval.py

# TELEN + Cross-encoder re-ranking (MRR-optimized)
python eval_rerank.py
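
Re-ranking is a second pass over the first-stage candidates: each (query, document) pair is scored jointly and the list is re-ordered. The sketch below shows only that control flow. `token_overlap` is a placeholder scorer chosen so the example runs without a model; it is not the trained cross-encoder, and all names are illustrative.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Re-order the first top_n candidates by a (query, document) relevance score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates[:top_n]]
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]

def token_overlap(query, doc):
    """Placeholder scorer; a real cross-encoder would score the concatenated pair."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

first_stage = ["thuế thu nhập cá nhân", "thuế giá trị gia tăng", "luật đất đai"]
reranked = rerank("thuế giá trị gia tăng", first_stage, token_overlap)
print(reranked[0])  # thuế giá trị gia tăng
```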

Training Details

Dataset

Source: another-symato/VMTEB-Zalo-legel-retrieval-wseg on Hugging Face
Content: 61,425 Vietnamese legal articles (Thông tư, Nghị định, Luật, Pháp lệnh)
Period: 1999–2021
Format: Word-segmented Vietnamese text (underscore-separated compound words)
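
In word-segmented text, the syllables of one compound word are joined by underscores and words are separated by spaces. A minimal sketch of handling that format follows; the sample sentence is illustrative, not taken from the dataset.

```python
def split_segmented(text):
    """Split word-segmented text into tokens; underscores join syllables of one word."""
    return text.split()

def restore_spaces(text):
    """Undo segmentation by turning underscores back into spaces."""
    return text.replace("_", " ")

segmented = "Thông_tư này quy_định về quản_lý thuế"
tokens = split_segmented(segmented)
print(len(tokens))                 # 6
print(restore_spaces(segmented))   # Thông tư này quy định về quản lý thuế
```

Raw (unsegmented) input would presumably need to pass through a Vietnamese word segmenter first so that it matches the training format.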

Training Pipeline

Stage	Description	Epochs	Trainable Params
1. Contrastive Pretraining	Triplet + InfoNCE loss on same-law article pairs	5	~1M (projection head)
2. Meta-Training	HyperNetwork learns to adapt embedding space for future laws	50 (early stop)	~4M (HyperNetwork + State Encoder)
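
As a reference for the InfoNCE term in Stage 1, here is a minimal single-query sketch using the temperature from the hyperparameter table. It takes precomputed cosine similarities as input and omits the triplet term; the way `train.py` combines the two losses is not shown here.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.05):
    """Negative log-softmax of the positive among positive + negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]

easy = info_nce(0.9, [0.1, 0.0, -0.2])    # negatives far from the query
hard = info_nce(0.9, [0.85, 0.0, -0.2])   # one negative almost as close as the positive
print(hard > easy)  # True
```

With a temperature as low as 0.05, a negative only slightly less similar than the positive already dominates the loss, which is why hard negatives matter at this setting.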

Hyperparameters

Parameter	Value
Backbone	`bkai-foundation-models/vietnamese-bi-encoder`
Embedding dimension	768
Adaptation rank	64
GNN layers	3
Meta N-way, K-shot	16-way, 5-shot
Negatives per query	256 (50% hard + 50% random)
Temperature	0.05
Optimizer	AdamW + CosineAnnealingWarmRestarts
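
The "256 (50% hard + 50% random)" setting can be read as the following sampling scheme. This is a sketch under that reading. How hard negatives are mined (for example, nearest neighbours under the current model) is an assumption left outside the example, and the pool contents are dummy ids.

```python
import random

def sample_negatives(hard_pool, random_pool, total=256, hard_fraction=0.5, seed=None):
    """Draw `total` negatives, `hard_fraction` of them from the hard pool."""
    rng = random.Random(seed)
    n_hard = min(int(total * hard_fraction), len(hard_pool))
    hard = rng.sample(hard_pool, n_hard)
    # Exclude anything already picked as a hard negative to avoid duplicates.
    chosen = set(hard)
    remaining = [doc for doc in random_pool if doc not in chosen]
    n_random = min(total - n_hard, len(remaining))
    return hard + rng.sample(remaining, n_random)

hard_pool = [f"hard_{i}" for i in range(300)]     # dummy ids of mined hard negatives
random_pool = [f"doc_{i}" for i in range(5000)]   # dummy ids of the rest of the corpus
negatives = sample_negatives(hard_pool, random_pool, seed=0)
print(len(negatives))  # 256
```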

Hardware

GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
Training time: ~8 hours (5 contrastive epochs + up to 50 meta epochs with early stopping)

Continuous Adaptation

When a new law is published, TELEN adapts without retraining:

# New law arrives
new_articles = [
    "Điều 1: Luật mới về trí tuệ nhân tạo...",
    "Điều 2: Các nguyên tắc áp dụng AI trong xét xử...",
]

# Update concept graph (milliseconds)
model.add_new_law("123/2025/l-ai", new_articles)

# Embedding space automatically adapts via HyperNetwork
# All subsequent query embeddings reflect the new legal landscape
embeddings = model.encode(["Điều 1: ..."])
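
To see why a state update can be cheap compared with retraining, consider a deliberately simplified state vector: the running mean of all article embeddings. TELEN's actual State Encoder is a learned module on top of the concept graph, so the class below is a hypothetical toy that illustrates only the incremental-update idea.

```python
class RunningCorpusState:
    """Toy state vector: the running mean of all article embeddings seen so far."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def add(self, embeddings):
        """Fold new article embeddings into the mean without revisiting old ones."""
        for emb in embeddings:
            self.count += 1
            self.mean = [m + (e - m) / self.count for m, e in zip(self.mean, emb)]
        return self.mean

state = RunningCorpusState(dim=2)
state.add([[1.0, 0.0], [0.0, 1.0]])   # existing corpus
print(state.mean)                     # [0.5, 0.5]
state.add([[1.0, 1.0]])               # articles of a newly published law
```

The update cost depends only on the number of new articles, not on the size of the existing corpus.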

Project Structure

law-embedding/
├── dataset/
│   └── train-00000-of-00001.parquet   # Training data (61K legal articles)
├── src/
│   ├── data.py                    # Data loading utilities
│   └── telern/
│       ├── config.py              # Configuration
│       ├── model.py               # TELEN architecture
│       ├── concept_graph.py       # Legal Concept Graph + GNN
│       ├── hypernetwork.py        # HyperNetwork + StateEncoder
│       └── evaluate.py            # Evaluation metrics & baselines
├── data/checkpoints/telen/
│   └── telen_best.pt              # Pretrained model weights
├── train.py                       # Training script
├── train_ce.py                    # Cross-encoder training (optional)
├── eval.py                        # Evaluation script
├── inference.py                   # Inference API
├── requirements.txt
└── README.md

Citation

@misc{telen2025,
  title={TELEN: Temporal Evolving Legal Embedding Network for Vietnamese Law},
  author={dangdinh},
  year={2026},
  publisher={Hugging Face},
}

License

MIT License — see LICENSE file for details.

Acknowledgments

bkai-foundation-models/vietnamese-bi-encoder: backbone bi-encoder
huyydangg/DEk21_hcmute_embedding: baseline comparison
vinai/phobert-base-v2: used in cross-encoder re-ranker
