first commit

Files changed (15)

.gitignore +3 -0
README.md +213 -1
eval.py +124 -0
inference.py +78 -0
requirements.txt +11 -0
src/__init__.py +0 -0
src/data.py +60 -0
src/telern/__init__.py +1 -0
src/telern/concept_graph.py +255 -0
src/telern/config.py +72 -0
src/telern/evaluate.py +454 -0
src/telern/hypernetwork.py +176 -0
src/telern/model.py +133 -0
train.py +293 -0
train_ce.py +123 -0

.gitignore ADDED

	@@ -0,0 +1,3 @@

+/data
+**/__pycache__
+/dataset

README.md CHANGED

@@ -1,3 +1,215 @@
 ---
-license: mit
 ---

+# TELEN: Temporal Evolving Legal Embedding Network
+> **Vietnamese legal text embedding with meta-learning for continuous adaptation to new laws.**
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)
+[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
+[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 ---
+## Overview
+TELEN is an embedding architecture for Vietnamese legal text retrieval in RAG (Retrieval-Augmented Generation) systems. Unlike static embedding models, TELEN generates embeddings that **adapt** to the current state of the legal corpus, so new laws can be integrated without retraining.
+### Key Innovations
+1. **HyperNetwork-Driven Projection** — Instead of fixed projection weights, a HyperNetwork generates the embedding projection function from the current legal corpus state. When new laws are published, the embedding space adapts automatically.
+2. **Legal Concept Graph (LCG)** — An evolving knowledge graph where nodes represent legal entities (laws, key terms) and edges encode cross-references, agency hierarchy, temporal sequences, and semantic similarity.
+3. **State-Adaptive Embeddings** — Embeddings are not static vectors but are modulated by a learned "legal state vector" that summarizes the entire legal landscape at any point in time.
 ---
+## Architecture
+```
+Legal Text
+    ↓
+Bi-Encoder (bkai-foundation-models/vietnamese-bi-encoder)
+    ↓
+Raw Representation [768-dim]
+    ↓
+┌─────────────────────────────────────┐
+│  HyperNetwork(state_vector) → ΔW, Δb │  ← Generated, not learned!
+│  Adapted Projection = Base + ΔW·x + Δb │
+└─────────────────────────────────────┘
+    ↓
+Legal Concept Graph (GNN)
+    ↓  state_vector
+State Encoder ← current legal corpus
+    ↓
+L2-Normalized Embedding [768-dim]
+```
+## Benchmark Results
+**Test set**: 1,406 Vietnamese legal articles from 2021 (held-out, unseen during training)
+| Model | NDCG@3 | NDCG@5 | NDCG@10 | MRR@3 | MRR@5 | MRR@10 |
+|---|---|---|---|---|---|---|
+| **BM25** (bm25) | 0.5164 | 0.5628 | 0.5718 | 0.5016 | 0.5290 | 0.5354 |
+| **PhoBERT-base-v2** (dense) | 0.4803 | 0.5305 | 0.5738 | 0.4503 | 0.4792 | 0.4961 |
+| **DEk21** (dense) | 0.6651 | 0.6907 | 0.7286 | 0.6394 | 0.6553 | 0.6734 |
+| **TELEN** (dense) | **0.8878** | **0.9097** | **0.9132** | **0.8686** | **0.8782** | **0.8782** |
+### Relative Improvement
+| Baseline | NDCG@3 | NDCG@10 | MRR@10 |
+|---|---|---|---|
+| vs PhoBERT (dense) | **+84.9%** | **+59.2%** | **+77.1%** |
+| vs DEk21 (dense) | **+33.5%** | **+25.3%** | **+30.4%** |
+---
+## Quick Start
+### Installation
+```bash
+pip install -r requirements.txt
+```
+### Inference
+```python
+from inference import TELENInference
+# Load model
+model = TELENInference()
+# Encode legal texts
+texts = [
+    "Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng...",
+    "Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh...",
+]
+embeddings = model.encode(texts)  # → [2, 768] normalized vectors
+# Compute similarity
+similarity = model.similarity(texts[0], texts[1])
+print(f"Cosine similarity: {similarity:.4f}")
+# Retrieve similar documents (here the corpus is just the example texts)
+results = model.retrieve(texts[0], texts, top_k=10)  # → list of (index, score) pairs
+```
+### Training
+```bash
+# Train TELEN from scratch
+python train.py
+# Train cross-encoder re-ranker (optional, for extra +2-3% gain)
+python train_ce.py
+```
+### Evaluation
+```bash
+python eval.py
+```
+---
+## Training Details
+### Dataset
+- **Source**: [another-symato/VMTEB-Zalo-legel-retrieval-wseg](https://huggingface.co/datasets/another-symato/VMTEB-Zalo-legel-retrieval-wseg) on Hugging Face
+- **Content**: 61,425 Vietnamese legal articles (Thông tư, Nghị định, Luật, Pháp lệnh)
+- **Period**: 1999–2021
+- **Format**: Word-segmented Vietnamese text (underscore-separated compound words)
+### Training Pipeline
+| Stage | Description | Epochs | Trainable Params |
+|---|---|---|---|
+| 1. Contrastive Pretraining | Triplet + InfoNCE loss on same-law article pairs | 5 | ~1M (projection head) |
+| 2. Meta-Training | HyperNetwork learns to adapt embedding space for future laws | 50 (early stop) | ~4M (HyperNetwork + State Encoder) |
+### Hyperparameters
+| Parameter | Value |
+|---|---|
+| Backbone | `bkai-foundation-models/vietnamese-bi-encoder` |
+| Embedding dimension | 768 |
+| Adaptation rank | 64 |
+| GNN layers | 3 |
+| Meta N-way, K-shot | 16-way, 5-shot |
+| Negatives per query | 256 (50% hard + 50% random) |
+| Temperature | 0.05 |
+| Optimizer | AdamW + CosineAnnealingWarmRestarts |
+### Hardware
+- GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
+- Training time: ~8 hours (5 contrastive + 50 meta epochs)
+---
+## Continuous Adaptation
+When a new law is published, TELEN adapts without retraining:
+```python
+# New law arrives
+new_articles = [
+    "Điều 1: Luật mới về trí tuệ nhân tạo...",
+    "Điều 2: Các nguyên tắc áp dụng AI trong xét xử...",
+]
+# Update concept graph (milliseconds)
+model.add_new_law("123/2025/l-ai", new_articles)
+# Embedding space automatically adapts via HyperNetwork
+# All subsequent query embeddings reflect the new legal landscape
+embeddings = model.encode(["Điều 1: ..."])
+```
+---
+## Project Structure
+```
+law-embedding/
+├── dataset/
+│   └── train-00000-of-00001.parquet   # Training data (61K legal articles)
+├── src/
+│   ├── data.py                    # Data loading utilities
+│   └── telern/
+│       ├── config.py              # Configuration
+│       ├── model.py               # TELEN architecture
+│       ├── concept_graph.py       # Legal Concept Graph + GNN
+│       ├── hypernetwork.py        # HyperNetwork + StateEncoder
+│       └── evaluate.py            # Evaluation metrics & baselines
+├── data/checkpoints/telen/
+│   └── telen_best.pt              # Pretrained model weights
+├── train.py                       # Training script
+├── train_ce.py                    # Cross-encoder training (optional)
+├── eval.py                        # Evaluation script
+├── inference.py                   # Inference API
+├── requirements.txt
+└── README.md
+```
+---
+## Citation
+```bibtex
+@misc{telen2025,
+  title={TELEN: Temporal Evolving Legal Embedding Network for Vietnamese Law},
+  author={dangdinh},
+  year={2026},
+  publisher={Hugging Face},
+}
+```
+## License
+MIT License — see [LICENSE](LICENSE) file for details.
+## Acknowledgments
+- `bkai-foundation-models/vietnamese-bi-encoder` — backbone bi-encoder
+- `huyydangg/DEk21_hcmute_embedding` — baseline comparison (previous SOTA)
+- `vinai/phobert-base-v2` — used in cross-encoder re-ranker
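
The relative-improvement figures in the README use the same ratio that `eval.py` prints, `(TELEN / baseline - 1) * 100`. The sketch below recomputes the DEk21 row from the rounded scores in the benchmark table. The helper name `relative_improvement` is made up for illustration, and because the inputs are already rounded to four decimals, a recomputed value can differ from a published one in the last digit.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage gain of `ours` over `baseline`, guarded against a zero baseline."""
    return (ours / max(baseline, 1e-6) - 1) * 100

# Rounded scores copied from the benchmark table (DEk21 vs TELEN).
dek21 = {"ndcg@3": 0.6651, "ndcg@10": 0.7286, "mrr@10": 0.6734}
telen = {"ndcg@3": 0.8878, "ndcg@10": 0.9132, "mrr@10": 0.8782}

gains = {m: round(relative_improvement(telen[m], dek21[m]), 1) for m in dek21}
for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}%")
```

These are relative gains, not absolute differences: the NDCG@10 gap between the two models is about 0.18 points on a 0 to 1 scale.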

eval.py ADDED

	@@ -0,0 +1,124 @@

+"""
+Evaluate TELEN with full benchmarks.
+Metrics: NDCG@3, NDCG@5, NDCG@10, MRR@3, MRR@5, MRR@10
+Baselines:
+  - BM25 (lexical retrieval)
+  - Frozen PhoBERT (vinai/phobert-base-v2)
+  - DEk21 (huyydangg/DEk21_hcmute_embedding)
+  - TELEN (ours)
+Usage:
+    python eval.py
+"""
+import sys; sys.path.insert(0, ".")
+sys.stdout.reconfigure(encoding='utf-8')
+import warnings; warnings.filterwarnings("ignore")
+import random, numpy as np, torch, torch.nn.functional as F
+from tqdm import tqdm
+from collections import defaultdict
+from sentence_transformers import SentenceTransformer
+from pyvi import ViTokenizer
+from src.telern.config import TELENConfig
+from src.telern.model import create_model
+from src.telern.evaluate import (
+    BM25Baseline, FrozenPhoBERT, prepare_test_data,
+    build_test_queries, build_test_corpus, compute_metrics, evaluate_bm25,
+)
+SEED = 42; random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+config = TELENConfig()
+def wseg(text):
+    return ViTokenizer.tokenize(text.replace("_", " "))
+def evaluate_model(name, encode_fn, queries, corpus, corpus_ids, corpus_law_ids):
+    """Generic evaluation for any embedding model."""
+    print(f"\n  [{name}] Encoding corpus ({len(corpus)} docs)...")
+    c_embs = []
+    for i in range(0, len(corpus), 64):
+        batch = [d["text"] for d in corpus[i:i+64]]
+        embs = encode_fn(batch)
+        if isinstance(embs, np.ndarray): embs = torch.tensor(embs)
+        c_embs.append(embs.cpu())
+    c_embs = torch.cat(c_embs, dim=0)
+    print(f"  [{name}] Evaluating {len(queries)} queries...")
+    all_m = defaultdict(list)
+    for q in tqdm(queries, desc=f"  {name}"):
+        q_emb = encode_fn([q["query_text"]])
+        if isinstance(q_emb, np.ndarray): q_emb = torch.tensor(q_emb)
+        sim = F.cosine_similarity(q_emb.cpu(), c_embs).numpy()
+        rel = np.array([1.0 if corpus_law_ids[j]==q["law_id"] else 0.0 for j in range(len(corpus))])
+        si = sim.argsort()[::-1]; sr = rel[si]
+        for j,cid in enumerate(corpus_ids):
+            if cid==q["query_id"]:
+                p=np.where(si==j)[0]; sr=np.delete(sr,p[0]) if len(p)>0 else sr; break
+        for k in [3,5,10]:
+            for mn,mv in compute_metrics(sr[:k],[k]).items(): all_m[mn].append(mv)
+    return {n: np.mean(v) for n,v in all_m.items()}
+# ── Data ──
+test_df = prepare_test_data(config)
+queries = build_test_queries(test_df, max_queries=300)
+corpus = build_test_corpus(test_df)
+corpus_ids = [d["article_id"] for d in corpus]
+corpus_law_ids = [d["law_id"] for d in corpus]
+train_df = test_df[test_df["year"] <= config.meta.train_split_year]
+print(f"Test: {len(queries)} queries, {len(corpus)} docs, {test_df['law_id'].nunique()} laws")
+results = {}
+# ── BM25 ──
+print("\n[1/4] BM25")
+results["BM25"] = evaluate_bm25(queries, corpus)
+# ── PhoBERT ──
+print("\n[2/4] Frozen PhoBERT")
+phobert = FrozenPhoBERT()
+results["PhoBERT"] = evaluate_model("PhoBERT", lambda texts: phobert.encode(texts, batch_size=64), queries, corpus, corpus_ids, corpus_law_ids)
+# ── DEk21 ──
+print("\n[3/4] DEk21 (SOTA)")
+dek21 = SentenceTransformer("huyydangg/DEk21_hcmute_embedding", device=device)
+results["DEk21"] = evaluate_model("DEk21", lambda texts: dek21.encode([wseg(t) for t in texts], batch_size=64, show_progress_bar=False, normalize_embeddings=True, convert_to_tensor=True), queries, corpus, corpus_ids, corpus_law_ids)
+# ── TELEN ──
+print("\n[4/4] TELEN (Ours)")
+telen = create_model(config).to(device)
+ckpt = torch.load(config.output_dir + "/telen_best.pt", map_location=device, weights_only=False)
+telen.hypernetwork.load_state_dict(ckpt["hypernetwork"])
+telen.state_encoder.load_state_dict(ckpt["state_encoder"])
+telen.base_projection.load_state_dict(ckpt["base_projection"])
+telen.attn_query.data.copy_(ckpt["attn_query"])
+if len(train_df) > 0: telen.build_graph(train_df)
+def telen_encode(texts):
+    with torch.no_grad():
+        return telen(texts, use_stochastic=False)["embeddings"].cpu()
+results["TELEN"] = evaluate_model("TELEN", telen_encode, queries, corpus, corpus_ids, corpus_law_ids)
+# ── Summary ──
+print("\n" + "=" * 75)
+print("BENCHMARK RESULTS")
+print("=" * 75)
+h = f"{'Method':<15}"
+for m in [3,5,10]: h += f" {'NDCG@'+str(m):>10} {'MRR@'+str(m):>10}"
+print(h); print("-"*len(h))
+for name in ["BM25", "PhoBERT", "DEk21", "TELEN"]:
+    r = f"{name:<15}"
+    for m in [3,5,10]: r += f" {results[name][f'ndcg@{m}']:>10.4f} {results[name][f'mrr@{m}']:>10.4f}"
+    print(r)
+print("\n--- Relative Improvement over Baselines ---")
+for baseline in ["PhoBERT", "DEk21"]:
+    print(f"  TELEN vs {baseline}:")
+    for m in [3,5,10]:
+        ni = (results["TELEN"][f"ndcg@{m}"] / max(results[baseline][f"ndcg@{m}"], 1e-6) - 1) * 100
+        mi = (results["TELEN"][f"mrr@{m}"] / max(results[baseline][f"mrr@{m}"], 1e-6) - 1) * 100
+        print(f"    NDCG@{m}: {ni:+.1f}%  MRR@{m}: {mi:+.1f}%")
+print("Done!")
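
To make the reported numbers easier to read, here is a small standalone sketch of the two metrics on a binary relevance list. It uses the same gain and discount as `dcg_at_k` in `src/telern/evaluate.py` (gain `2^rel - 1`, discount `log2(rank + 1)`). The ranked list is a made-up example, not data from the benchmark, and the function names are chosen for this sketch.

```python
import math

def dcg_at_k(rels, k):
    """DCG with gain 2^rel - 1 and log2(rank + 1) discount."""
    return sum((2.0 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the same list sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def rr_at_k(rels, k):
    """Reciprocal rank of the first relevant item in the top k, else 0."""
    for rank, r in enumerate(rels[:k], start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

# Hypothetical ranked list: 1.0 = same law as the query, 0.0 = different law.
ranked = [0.0, 1.0, 1.0, 0.0, 0.0]
ndcg3 = ndcg_at_k(ranked, 3)
rr3 = rr_at_k(ranked, 3)
print(f"NDCG@3 = {ndcg3:.4f}, RR@3 = {rr3:.4f}")
```

Note that `evaluate_model` passes only the top-k slice (`sr[:k]`) to `compute_metrics`, so the ideal ranking is built from those k retrieved items rather than from every relevant article in the corpus. A query whose top-k contains a single relevant article in first place therefore scores 1.0 even if other relevant articles were missed.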

inference.py ADDED

	@@ -0,0 +1,78 @@

+"""
+TELEN Inference — encode legal texts to 768-dim embeddings.
+Usage:
+    from inference import TELENInference
+    model = TELENInference()
+    embeddings = model.encode(["Điều 1: Thông tư này quy định về..."])
+    similarity = model.similarity(text1, text2)
+"""
+import sys; sys.path.insert(0, ".")
+import torch
+import torch.nn.functional as F
+from pyvi import ViTokenizer
+from src.telern.config import TELENConfig
+from src.telern.model import create_model
+class TELENInference:
+    def __init__(self, checkpoint_path: str = None):
+        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        self.config = TELENConfig()
+        self.model = create_model(self.config).to(self.device)
+        if checkpoint_path is None:
+            checkpoint_path = self.config.output_dir + "/telen_best.pt"
+        ckpt = torch.load(checkpoint_path, map_location=self.device, weights_only=False)
+        self.model.hypernetwork.load_state_dict(ckpt["hypernetwork"])
+        self.model.state_encoder.load_state_dict(ckpt["state_encoder"])
+        self.model.base_projection.load_state_dict(ckpt["base_projection"])
+        self.model.attn_query.data.copy_(ckpt["attn_query"])
+        self.model.eval()
+        print(f"TELEN loaded on {self.device}")
+        print(f"  HyperNetwork: {sum(p.numel() for p in self.model.hypernetwork.parameters()):,} params")
+        print(f"  Ready for inference.")
+    def build_graph(self, df):
+        """Build concept graph from a DataFrame with [id, title, text, law_id, law_type, year] columns."""
+        self.model.build_graph(df)
+    def encode(self, texts: list, batch_size: int = 64) -> torch.Tensor:
+        """Encode a list of legal texts to 768-dim normalized embeddings."""
+        embeddings = []
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i:i + batch_size]
+            with torch.no_grad():
+                result = self.model(batch, use_stochastic=False)
+                embeddings.append(result["embeddings"].cpu())
+        return torch.cat(embeddings, dim=0)
+    def similarity(self, text1: str, text2: str) -> float:
+        """Compute cosine similarity between two texts."""
+        emb = self.encode([text1, text2])
+        return F.cosine_similarity(emb[0:1], emb[1:2]).item()
+    def retrieve(self, query: str, corpus: list, top_k: int = 10) -> list:
+        """Retrieve top-k most similar documents from a corpus."""
+        query_emb = self.encode([query])
+        corpus_embs = self.encode(corpus)
+        sim = F.cosine_similarity(query_emb, corpus_embs).numpy()
+        top_indices = sim.argsort()[::-1][:top_k]
+        return [(int(i), float(sim[i])) for i in top_indices]
+# ── Demo ──
+if __name__ == "__main__":
+    model = TELENInference()
+    # Example queries
+    q1 = "Điều 1: Thông tư này quy định về quản lý thuế giá trị gia tăng đối với hàng hóa nhập khẩu"
+    q2 = "Điều 2: Đối tượng áp dụng là các tổ chức, cá nhân kinh doanh hàng hóa nhập khẩu"
+    q3 = "Điều 1: Nghị định này quy định về xử phạt vi phạm hành chính trong lĩnh vực giao thông"
+    print(f"\nSimilarity test:")
+    print(f"  q1 vs q2 (same law): {model.similarity(q1, q2):.4f}")
+    print(f"  q1 vs q3 (diff law):  {model.similarity(q1, q3):.4f}")
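
`TELENInference.retrieve` returns `(index, score)` pairs rather than the documents themselves, so the caller has to map indices back to the corpus. The sketch below shows that pattern with plain Python and toy 3-dimensional vectors standing in for the 768-dimensional embeddings. `top_k_cosine` and the sample vectors are illustrative assumptions, not part of the repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_cosine(query_vec, corpus_vecs, top_k=10):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy corpus: texts and their (made-up) embeddings share the same order.
corpus_texts = ["doc A", "doc B", "doc C"]
corpus_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.0, 0.0]

hits = top_k_cosine(query_vec, corpus_vecs, top_k=2)
for idx, score in hits:
    print(f"{corpus_texts[idx]}  score={score:.2f}")
```

As written in `inference.py`, `retrieve` re-encodes the whole corpus on every call. For repeated queries against a fixed corpus it is likely cheaper to call `encode` once and reuse the resulting tensor.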

requirements.txt ADDED

	@@ -0,0 +1,11 @@

+torch>=2.0.0
+transformers>=4.40.0
+sentence-transformers>=3.0.0
+peft>=0.10.0
+pandas>=2.0.0
+pyarrow>=14.0.0
+scikit-learn>=1.3.0
+tqdm>=4.65.0
+numpy>=1.24.0
+pyvi>=0.1.0
+accelerate>=0.24.0

src/__init__.py ADDED

File without changes

src/data.py ADDED

	@@ -0,0 +1,60 @@

+"""Shared data utilities used by TELEN modules."""
+import unicodedata
+import pandas as pd
+def load_raw_data(parquet_path: str) -> pd.DataFrame:
+    """Load the raw parquet file."""
+    return pd.read_parquet(parquet_path)
+def extract_metadata(df: pd.DataFrame) -> pd.DataFrame:
+    """Extract law_id, article_num, law_type, year from id column."""
+    df = df.copy()
+    def parse_id(id_str):
+        if "#" in id_str:
+            parts = id_str.split("#")
+            law_id = parts[0]
+            article_part = parts[1]
+            article_num = int(article_part.split("-")[0])
+        else:
+            law_id = id_str
+            article_num = 0
+        return law_id, article_num
+    parsed = df["id"].apply(parse_id)
+    df["law_id"] = parsed.apply(lambda x: x[0])
+    df["article_num"] = parsed.apply(lambda x: x[1])
+    def extract_law_type(law_id):
+        parts = law_id.split("/")
+        if len(parts) >= 3:
+            return parts[2].split("-")[-1] if "-" in parts[2] else parts[2]
+        return "unknown"
+    df["law_type"] = df["law_id"].apply(extract_law_type)
+    def extract_year(law_id):
+        parts = law_id.split("/")
+        if len(parts) >= 2:
+            year_str = parts[1]
+            try:
+                year = int(year_str)
+                return year if year >= 100 else year + 1900
+            except ValueError:
+                pass
+        return 1999
+    df["year"] = df["law_id"].apply(extract_year)
+    return df
+def clean_data(df: pd.DataFrame, min_text_len: int = 10) -> pd.DataFrame:
+    """Remove short/empty texts, NFC-normalize title and text, and drop duplicate texts."""
+    df = df.copy()
+    df = df[df["text"].str.len() >= min_text_len].reset_index(drop=True)
+    df["title"] = df["title"].apply(lambda x: unicodedata.normalize("NFC", str(x)))
+    df["text"] = df["text"].apply(lambda x: unicodedata.normalize("NFC", str(x)))
+    df = df.drop_duplicates(subset=["text"], keep="first").reset_index(drop=True)
+    return df
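
The parsing rules in `extract_metadata` are easier to follow when traced on a single ID. The sketch below condenses them into one function and applies it to made-up IDs. The strings `01/2021/tt-btc#3`, `55/99/nd-cp`, and `12/05/qh11` are assumptions about the ID shape, not values taken from the dataset, and `parse_article_id` is a hypothetical name.

```python
def parse_article_id(id_str: str) -> dict:
    """Condensed version of the rules in extract_metadata for one id string."""
    # "<law_id>#<article>-..." : everything before '#' identifies the law.
    law_id, _, article_part = id_str.partition("#")
    article_num = int(article_part.split("-")[0]) if article_part else 0

    parts = law_id.split("/")
    # Third segment, last hyphen-separated token (e.g. "tt-btc" -> "btc").
    law_type = parts[2].split("-")[-1] if len(parts) >= 3 else "unknown"

    year = 1999  # fallback used when no year can be parsed
    if len(parts) >= 2:
        try:
            y = int(parts[1])
            year = y if y >= 100 else y + 1900  # two-digit years map to 19xx
        except ValueError:
            pass
    return {"law_id": law_id, "article_num": article_num,
            "law_type": law_type, "year": year}

print(parse_article_id("01/2021/tt-btc#3"))
print(parse_article_id("55/99/nd-cp"))
print(parse_article_id("12/05/qh11"))
```

Two consequences follow from these rules. For a hyphenated third segment, `law_type` is the trailing token (`btc`, `cp`), which is what the graph later groups on for agency edges. Any two-digit year is read as 19xx, so a hypothetical `05` becomes 1905 rather than 2005.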

src/telern/__init__.py ADDED

	@@ -0,0 +1 @@


+"""TELEN: Temporal Evolving Legal Embedding Network."""

src/telern/concept_graph.py ADDED

	@@ -0,0 +1,255 @@

+"""
+Legal Concept Graph — evolving knowledge backbone of TELEN.
+Nodes: law entities + key terms extracted via TF-IDF
+Edges: agency, temporal, semantic, cross-reference, term-document
+GNN: Multi-layer graph convolution over a dense, symmetrically normalized adjacency matrix
+"""
+import re
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from sklearn.feature_extraction.text import TfidfVectorizer
+# ═══════════════════════════════════════════════
+# GNN Layers
+# ═══════════════════════════════════════════════
+class GCNLayer(nn.Module):
+    def __init__(self, in_dim, out_dim, dropout=0.1):
+        super().__init__()
+        self.linear = nn.Linear(in_dim, out_dim)
+        self.dropout = nn.Dropout(dropout)
+        self.norm = nn.LayerNorm(out_dim)
+    def forward(self, x, adj):
+        deg = adj.sum(dim=1).clamp(min=1)
+        deg_inv_sqrt = deg.pow(-0.5)
+        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
+        x = norm_adj @ x
+        x = self.linear(x)
+        x = F.relu(x)
+        x = self.dropout(x)
+        x = self.norm(x)
+        return x
+class GNNEncoder(nn.Module):
+    def __init__(self, dim, n_layers=3, dropout=0.1):
+        super().__init__()
+        self.layers = nn.ModuleList([GCNLayer(dim, dim, dropout) for _ in range(n_layers)])
+    def forward(self, x, adj):
+        for layer in self.layers:
+            x = layer(x, adj) + x  # residual
+        return x
+# ═══════════════════════════════════════════════
+# Legal Concept Graph
+# ═══════════════════════════════════════════════
+class LegalConceptGraph(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_dim = config.graph.hidden_dim
+        self.node_ids = []
+        self.node_embeddings = None
+        self.edges = {"cross_ref": [], "agency": [], "temporal": [], "semantic": []}
+        self._adj_cached = None
+        self._adj_dirty = True
+        self.gnn = GNNEncoder(config.graph.hidden_dim, config.graph.gnn_layers, config.graph.gnn_dropout)
+    @property
+    def num_nodes(self):
+        return len(self.node_ids)
+    @property
+    def device(self):
+        return self.gnn.layers[0].linear.weight.device
+    def add_nodes(self, node_ids, embeddings):
+        if self.node_embeddings is None:
+            self.node_embeddings = embeddings
+        else:
+            self.node_embeddings = torch.cat([self.node_embeddings, embeddings], dim=0)
+        self.node_ids.extend(node_ids)
+        self._adj_dirty = True
+    def add_edges(self, edge_type, edges):
+        self.edges[edge_type].extend(edges)
+        self._adj_dirty = True
+    def build_adjacency(self):
+        if not self._adj_dirty and self._adj_cached is not None:
+            return self._adj_cached
+        N = self.num_nodes
+        adj = torch.zeros(N, N, device=self.device)
+        for edge_type, use in [("cross_ref", self.config.graph.use_cross_ref_edges),
+                                ("agency", self.config.graph.use_agency_edges),
+                                ("temporal", self.config.graph.use_temporal_edges),
+                                ("semantic", self.config.graph.use_semantic_edges)]:
+            if not use or not self.edges[edge_type]:
+                continue
+            valid = [(s, d, w) for s, d, w in self.edges[edge_type] if s < N and d < N]
+            if not valid:
+                continue
+            src = torch.tensor([e[0] for e in valid], device=self.device, dtype=torch.long)
+            dst = torch.tensor([e[1] for e in valid], device=self.device, dtype=torch.long)
+            wgt = torch.tensor([e[2] for e in valid], device=self.device, dtype=torch.float)
+            adj.index_put_((src, dst), wgt, accumulate=True)
+            adj.index_put_((dst, src), wgt, accumulate=True)
+        adj = adj + torch.eye(N, device=self.device)
+        self._adj_cached = adj
+        self._adj_dirty = False
+        return adj
+    def forward(self):
+        dev = self.device
+        if self.node_embeddings.device != dev:
+            self.node_embeddings = self.node_embeddings.to(dev)
+        adj = self.build_adjacency()
+        return self.gnn(self.node_embeddings, adj)
+    def to(self, device):
+        super().to(device)
+        if self.node_embeddings is not None:
+            self.node_embeddings = self.node_embeddings.to(device)
+        return self
+# ═══════════════════════════════════════════════
+# Cross-reference extraction
+# ═══════════════════════════════════════════════
+CROSS_REF_PATTERNS = [
+    (re.compile(r"(?:theo|theo quy định tại|căn cứ vào|căn cứ)\s+Điều\s+(\d+)\s+(?:của\s+)?(Luật|Bộ luật|Nghị định|Thông tư|Pháp lệnh)\s+([^,.;]+)"), "citation"),
+    (re.compile(r"(Luật|Bộ luật|Nghị định|Thông tư|Pháp lệnh|Quyết định)\s+(?:số\s+)?([\d]+/[\d]+/[\w-]+)"), "reference"),
+    (re.compile(r"sửa đổi[,，]\s*bổ sung\s+(?:một số điều của\s+)?(Luật|Nghị định|Thông tư)\s+([^,.;]+)"), "amendment"),
+    (re.compile(r"(?:thay thế|bãi bỏ)\s+(?:Điều\s+(\d+)\s+(?:của\s+)?)?(Luật|Nghị định|Thông tư)\s+([^,.;]+)"), "replacement"),
+]
+def extract_key_terms(df, max_terms=200):
+    texts = [(f"{row['title']} {row['text'][:500]}").replace("_", " ")
+             for _, row in df.iterrows()]
+    vectorizer = TfidfVectorizer(max_features=max_terms, ngram_range=(1, 2),
+                                 min_df=3, max_df=0.8, token_pattern=r'(?u)\b\w+\b')
+    tfidf = vectorizer.fit_transform(texts)
+    scores = tfidf.max(axis=0).toarray().flatten()
+    return list(vectorizer.get_feature_names_out()[scores.argsort()[::-1][:max_terms]])
+def _law_matches_ref(law_id, ref_text):
+    law_lower = law_id.lower().replace("_", " ").replace("-", " ")
+    ref_lower = ref_text.lower().replace("_", " ").replace("-", " ")
+    parts = law_id.split("/")
+    if len(parts) >= 3:
+        if parts[2].replace("_", " ") in ref_lower: return True
+        if len(parts) >= 2 and parts[1] in ref_lower: return True
+    return False
+def build_concept_graph(df, encode_fn, config):
+    """Build enhanced concept graph from training data."""
+    graph = LegalConceptGraph(config)
+    law_groups = df.groupby("law_id")
+    law_ids = sorted(law_groups.groups.keys())
+    N_laws = len(law_ids)
+    print(f"  Building graph: {N_laws} law nodes...")
+    # Law embeddings
+    embs = []
+    for lid in law_ids:
+        group = law_groups.get_group(lid)
+        texts = [f"{t}: {txt[:300]}" for t, txt in zip(group["title"], group["text"])]
+        embs.append(torch.stack([encode_fn(t) for t in texts[:5]]).mean(dim=0))
+    law_embs = torch.stack(embs)
+    graph.add_nodes(law_ids, law_embs)
+    law_id_to_idx = {lid: i for i, lid in enumerate(law_ids)}
+    # Key term nodes
+    print("  Extracting key terms...")
+    key_terms = extract_key_terms(df, max_terms=200)
+    term_embs = torch.stack([encode_fn(t) for t in key_terms])
+    graph.add_nodes([f"TERM:{t}" for t in key_terms], term_embs)
+    print(f"    {len(key_terms)} key terms")
+    # Agency edges
+    agency_edges = []
+    for _, group in df.groupby("law_type"):
+        same = group["law_id"].unique()
+        for i in range(len(same)):
+            for j in range(i + 1, len(same)):
+                if same[i] in law_id_to_idx and same[j] in law_id_to_idx:
+                    agency_edges.append((law_id_to_idx[same[i]], law_id_to_idx[same[j]], 0.3))
+    graph.add_edges("agency", agency_edges)
+    print(f"    Agency edges: {len(agency_edges)}")
+    # Temporal edges
+    temporal_edges = []
+    for _, group in df.groupby("law_type"):
+        yl = group.groupby("year")["law_id"].unique()
+        for y1, y2 in zip(sorted(yl.keys()), sorted(yl.keys())[1:]):
+            for l1 in yl[y1]:
+                for l2 in yl[y2]:
+                    if l1 in law_id_to_idx and l2 in law_id_to_idx:
+                        temporal_edges.append((law_id_to_idx[l1], law_id_to_idx[l2], 0.2))
+    graph.add_edges("temporal", temporal_edges)
+    print(f"    Temporal edges: {len(temporal_edges)}")
+    # Semantic edges (chunked k-NN)
+    semantic_k = min(config.graph.semantic_knn, N_laws - 1)
+    semantic_edges = []
+    if N_laws > 1:
+        chunk = 64
+        for i in range(0, N_laws, chunk):
+            end = min(i + chunk, N_laws)
+            sim = F.cosine_similarity(law_embs[i:end].unsqueeze(1), law_embs.unsqueeze(0), dim=2)
+            for j in range(sim.shape[0]):
+                sim[j, i + j] = float("-inf")
+            vals, idx = sim.topk(k=semantic_k, dim=1)
+            for j in range(sim.shape[0]):
+                for kk in range(semantic_k):
+                    semantic_edges.append((i + j, idx[j, kk].item(), vals[j, kk].item()))
+    graph.add_edges("semantic", semantic_edges)
+    print(f"    Semantic edges: {len(semantic_edges)}")
+    # Cross-reference edges
+    cross_ref_edges = []
+    for _, row in df.iterrows():
+        src = row["law_id"]
+        if src not in law_id_to_idx: continue
+        for pattern, etype in CROSS_REF_PATTERNS:
+            for match in pattern.findall(row["text"]):
+                match_str = " ".join(match).lower() if isinstance(match, tuple) else str(match).lower()
+                for tgt in law_ids:
+                    if tgt != src and _law_matches_ref(tgt, match_str):
+                        cross_ref_edges.append((law_id_to_idx[src], law_id_to_idx[tgt], 0.5))
+                        break
+    graph.add_edges("cross_ref", cross_ref_edges)
+    print(f"    Cross-ref edges: {len(cross_ref_edges)}")
+    # Term-document edges
+    term_doc_edges = []
+    law_texts = [(f"{row['title']} {row['text'][:300]}").replace("_", " ")
+                 for _, row in df.iterrows()]
+    vec = TfidfVectorizer(vocabulary=key_terms if key_terms else None)
+    try:
+        tfidf = vec.fit_transform(law_texts)
+        for ti, term in enumerate(key_terms):
+            if ti < tfidf.shape[1]:
+                col = tfidf[:, ti].toarray().flatten()
+                for lp in col.argsort()[::-1][:10]:
+                    if col[lp] > 0.1 and lp < N_laws:
+                        term_doc_edges.append((N_laws + ti, lp, float(col[lp])))
+    except ValueError:
+        pass
+    graph.add_edges("semantic", term_doc_edges)
+    print(f"    Term-doc edges: {len(term_doc_edges)}")
+    print(f"  Total: {graph.num_nodes} nodes ({N_laws} laws + {len(key_terms)} terms)")
+    return graph, law_id_to_idx
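
`GCNLayer.forward` applies the symmetric normalization D^(-1/2) A D^(-1/2) to the adjacency matrix before aggregating neighbors. The sketch below reproduces that one step in plain Python on a made-up 3-node graph, with self-loops already added as `build_adjacency` does. It illustrates the arithmetic only and is not a replacement for the PyTorch code.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, with row sums clamped to at least 1."""
    deg = [max(sum(row), 1.0) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

# Hypothetical path graph 0 - 1 - 2 with unit edge weights and self-loops.
adj = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
norm = normalize_adjacency(adj)
for row in norm:
    print([round(v, 3) for v in row])
```

Each entry is scaled by the degrees of both endpoints, so well-connected nodes (node 1 here) contribute less per edge. In the real graph the edge weights differ by type (0.3 agency, 0.2 temporal, 0.5 cross-reference, cosine values for semantic edges), and the degree is the weighted row sum.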

src/telern/config.py ADDED

	@@ -0,0 +1,72 @@

+"""TELEN configuration."""
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import List
+ROOT = Path("E:/law-embedding")
+DATA_DIR = ROOT / "dataset"
+CHECKPOINT_DIR = ROOT / "data" / "checkpoints" / "telen"
+@dataclass
+class GraphConfig:
+    """Legal Concept Graph configuration."""
+    hidden_dim: int = 768
+    gnn_layers: int = 3
+    gnn_dropout: float = 0.1
+    # Edge types
+    use_cross_ref_edges: bool = True
+    use_agency_edges: bool = True
+    use_temporal_edges: bool = True
+    use_semantic_edges: bool = True
+    semantic_knn: int = 10
+    # Concept extraction
+    max_concepts_per_article: int = 8
+    min_tfidf_score: float = 0.05
+@dataclass
+class HyperNetworkConfig:
+    """HyperNetwork that generates projection weights from legal state."""
+    adaptation_rank: int = 64  # Low-rank adaptation
+    hn_hidden_dim: int = 512
+    hn_layers: int = 3
+    dropout: float = 0.1
+    # What the HyperNetwork outputs
+    output_shift: bool = True       # ΔW for projection
+    output_bias: bool = True        # Δb for projection
+    output_variance: bool = True    # log σ² for stochastic embedding
+    min_variance: float = 0.01      # minimum variance
+@dataclass
+class MetaTrainingConfig:
+    """Meta-learning training configuration."""
+    meta_lr: float = 3e-4
+    inner_lr: float = 5e-3
+    meta_batch_size: int = 4        # episodes per meta-update
+    n_query: int = 32               # query articles per episode
+    n_negatives: int = 256          # negative articles per query
+    meta_epochs: int = 50
+    temperature: float = 0.05
+    # Temporal splits for meta-training
+    train_split_year: int = 2018
+    val_split_year: int = 2020
+    # State construction
+    max_state_articles: int = 500   # max articles to include in state
+    # Stochastic embedding
+    kl_weight: float = 0.001        # weight for KL regularization
+    n_mc_samples: int = 1           # Monte Carlo samples during training
+@dataclass
+class TELENConfig:
+    """Full TELEN configuration."""
+    backbone: str = "vinai/phobert-base-v2"
+    hidden_dim: int = 768
+    max_seq_length: int = 480
+    graph: GraphConfig = field(default_factory=GraphConfig)
+    hypernetwork: HyperNetworkConfig = field(default_factory=HyperNetworkConfig)
+    meta: MetaTrainingConfig = field(default_factory=MetaTrainingConfig)
+    output_dir: str = str(CHECKPOINT_DIR)
+    seed: int = 42
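
The two split years in `MetaTrainingConfig` define a temporal partition of the corpus. The sketch below shows how a publication year would map to a split under the defaults. The train and test boundaries follow `eval.py` and `prepare_test_data` (`year <= train_split_year` and `range(val_split_year + 1, 2025)`). Treating the years in between as validation is an assumption, since `train.py` is not shown here, and `split_for_year` is a hypothetical helper.

```python
TRAIN_SPLIT_YEAR = 2018  # MetaTrainingConfig.train_split_year
VAL_SPLIT_YEAR = 2020    # MetaTrainingConfig.val_split_year
TEST_END_YEAR = 2025     # exclusive upper bound used in prepare_test_data

def split_for_year(year: int) -> str:
    """Map a law's publication year to a split name."""
    if year <= TRAIN_SPLIT_YEAR:
        return "train"
    if year <= VAL_SPLIT_YEAR:
        return "val"  # assumption: years between the two cut-offs are validation
    if year < TEST_END_YEAR:
        return "test"
    return "out_of_range"

for y in (2015, 2019, 2021, 2030):
    print(y, split_for_year(y))
```

With a dataset covering 1999 to 2021, only 2021 articles fall into the test range, which matches the README's description of the test set.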

src/telern/evaluate.py ADDED

	@@ -0,0 +1,454 @@

+"""
+Evaluation for TELEN: NDCG@k and MRR@k.
+Metrics:
+  - NDCG@3, NDCG@5, NDCG@10
+  - MRR@3, MRR@5, MRR@10
+Baselines:
+  - BM25 (lexical)
+  - Frozen PhoBERT + mean pooling
+  - TELEN (ours)
+Evaluation setup:
+  - Query = article title + first 100 chars of text
+  - Relevant = other articles from the SAME law
+  - Corpus = all articles from test years (held-out)
+"""
+import math
+import random
+from collections import defaultdict
+from pathlib import Path
+from typing import List, Dict, Tuple
+import numpy as np
+import pandas as pd
+import torch
+import torch.nn.functional as F
+from tqdm import tqdm
+from sklearn.feature_extraction.text import TfidfVectorizer
+from transformers import AutoModel, AutoTokenizer
+from .config import TELENConfig, DATA_DIR
+from .model import TELEN, create_telen
+from ..data import load_raw_data, extract_metadata, clean_data
+# ═══════════════════════════════════════════════════════════
+# Metrics
+# ═══════════════════════════════════════════════════════════
+def dcg_at_k(scores: np.ndarray, k: int) -> float:
+    """Discounted Cumulative Gain at k."""
+    scores = np.asarray(scores)[:k]
+    if len(scores) == 0:
+        return 0.0
+    discounts = np.log2(np.arange(2, len(scores) + 2))
+    return np.sum((2.0**scores - 1) / discounts)
+def ndcg_at_k(scores: np.ndarray, k: int) -> float:
+    """Normalized DCG at k."""
+    ideal = np.sort(scores)[::-1]
+    dcg_val = dcg_at_k(scores, k)
+    idcg_val = dcg_at_k(ideal, k)
+    return dcg_val / idcg_val if idcg_val > 0 else 0.0
+def mrr_at_k(scores: np.ndarray, k: int) -> float:
+    """Reciprocal rank of the first relevant result in the top k (the caller averages over queries to get MRR@k)."""
+    scores = np.asarray(scores)[:k]
+    for rank, s in enumerate(scores, start=1):
+        if s > 0:
+            return 1.0 / rank
+    return 0.0
+def compute_metrics(
+    relevance_scores: np.ndarray, k_values: List[int] = [3, 5, 10]
+) -> Dict[str, float]:
+    """Compute NDCG@k and MRR@k from relevance scores."""
+    metrics = {}
+    for k in k_values:
+        metrics[f"ndcg@{k}"] = ndcg_at_k(relevance_scores, k)
+        metrics[f"mrr@{k}"] = mrr_at_k(relevance_scores, k)
+    return metrics
+# ═══════════════════════════════════════════════════════════
+# Evaluation
+# ═══════════════════════════════════════════════════════════
+def prepare_test_data(config: TELENConfig):
+    """Prepare test data from held-out years."""
+    print("Loading data...")
+    df = load_raw_data(str(DATA_DIR / "train-00000-of-00001.parquet"))
+    df = extract_metadata(df)
+    df = clean_data(df, min_text_len=10)
+    # Test split: every year after the validation cutoff (same rule as train.py)
+    test_df = df[df["year"] > config.meta.val_split_year].reset_index(drop=True)
+    print(f"  Test set: {len(test_df)} articles from {test_df['law_id'].nunique()} laws")
+    return test_df
+def build_test_queries(test_df: pd.DataFrame, max_queries: int = 500) -> List[Dict]:
+    """Build query set from test articles."""
+    # Group by law_id
+    law_groups = test_df.groupby("law_id")
+    queries = []
+    for law_id, group in law_groups:
+        articles = group.to_dict("records")
+        if len(articles) < 3:  # Need at least 1 query + 2 relevant
+            continue
+        # Use each article as a potential query
+        for article in articles[:2]:  # Max 2 queries per law
+            queries.append({
+                "query_id": article["id"],
+                "query_text": f"{article['title']}: {article['text'][:500]}",
+                "query_full": article["text"],
+                "law_id": law_id,
+            })
+    if len(queries) > max_queries:
+        queries = random.sample(queries, max_queries)
+    print(f"  Queries: {len(queries)}")
+    return queries
+def build_test_corpus(test_df: pd.DataFrame) -> List[Dict]:
+    """Build corpus of all test articles for retrieval."""
+    corpus = []
+    for _, row in test_df.iterrows():
+        corpus.append({
+            "article_id": row["id"],
+            "text": f"{row['title']}: {row['text'][:500]}",
+            "law_id": row["law_id"],
+        })
+    print(f"  Corpus: {len(corpus)} documents")
+    return corpus
+def evaluate_telen(
+    model: TELEN,
+    queries: List[Dict],
+    corpus: List[Dict],
+    batch_size: int = 64,
+) -> Dict[str, float]:
+    """
+    Evaluate TELEN on retrieval metrics.
+    For each query, rank all corpus documents by cosine similarity.
+    Relevance = article is from the same law.
+    """
+    device = next(model.parameters()).device
+    model.eval()
+    # Encode corpus
+    print("  Encoding corpus...")
+    corpus_embeddings = []
+    corpus_ids = [doc["article_id"] for doc in corpus]
+    corpus_law_ids = [doc["law_id"] for doc in corpus]
+    for i in tqdm(range(0, len(corpus), batch_size), desc="  Corpus"):
+        batch = corpus[i:i + batch_size]
+        texts = [doc["text"] for doc in batch]
+        with torch.no_grad():
+            result = model(texts, use_stochastic=False)
+            corpus_embeddings.append(result["embeddings"].cpu())
+    corpus_embeddings = torch.cat(corpus_embeddings, dim=0)  # [N_corpus, d]
+    print(f"  Corpus embeddings: {corpus_embeddings.shape}")
+    # Evaluate each query
+    all_metrics = defaultdict(list)
+    print("  Evaluating queries...")
+    for query in tqdm(queries, desc="  Queries"):
+        # Encode query
+        with torch.no_grad():
+            result = model([query["query_text"]], use_stochastic=False)
+            query_emb = result["embeddings"].cpu()  # [1, d]
+        # Cosine similarity with all corpus
+        sim = F.cosine_similarity(
+            query_emb, corpus_embeddings
+        ).numpy()  # [N_corpus]
+        # Build relevance scores (1.0 if same law, 0.0 otherwise)
+        relevance = np.array([
+            1.0 if corpus_law_ids[i] == query["law_id"] else 0.0
+            for i in range(len(corpus))
+        ])
+        # Rank by similarity and compute metrics
+        sorted_idx = sim.argsort()[::-1]
+        sorted_relevance = relevance[sorted_idx]
+        # Remove the query itself from results
+        query_idx_in_corpus = None
+        for i, cid in enumerate(corpus_ids):
+            if cid == query["query_id"]:
+                query_idx_in_corpus = i
+                break
+        if query_idx_in_corpus is not None:
+            # Remove self-match
+            mask = sorted_idx != query_idx_in_corpus
+            sorted_relevance = sorted_relevance[mask]
+        # Compute metrics on the full ranking so the ideal DCG reflects every
+        # relevant document, not only those that reached the top k
+        metrics = compute_metrics(sorted_relevance, [3, 5, 10])
+        for metric_name, value in metrics.items():
+            all_metrics[metric_name].append(value)
+    # Average over queries
+    results = {name: np.mean(scores) for name, scores in all_metrics.items()}
+    return results
+# ═══════════════════════════════════════════════════════════
+# Baselines
+# ═══════════════════════════════════════════════════════════
+class BM25Baseline:
+    """Lexical baseline: TF-IDF (unigrams + bigrams, sublinear TF) used as an approximation of BM25."""
+    def __init__(self):
+        self.vectorizer = TfidfVectorizer(
+            max_features=10000,
+            ngram_range=(1, 2),
+            sublinear_tf=True,
+        )
+    def fit(self, corpus: List[Dict]):
+        self.corpus = corpus
+        self.doc_texts = [doc["text"] for doc in corpus]
+        self.doc_ids = [doc["article_id"] for doc in corpus]
+        self.doc_law_ids = [doc["law_id"] for doc in corpus]
+        self.tfidf_matrix = self.vectorizer.fit_transform(self.doc_texts)
+    def search(self, query_text: str, k: int = 100) -> np.ndarray:
+        """Return ALL corpus indices sorted by descending score (k is currently unused)."""
+        query_vec = self.vectorizer.transform([query_text])
+        scores = (self.tfidf_matrix @ query_vec.T).toarray().flatten()
+        sorted_idx = scores.argsort()[::-1]
+        return sorted_idx
+def evaluate_bm25(queries: List[Dict], corpus: List[Dict]) -> Dict[str, float]:
+    """Evaluate BM25 baseline."""
+    print("  Building BM25 index...")
+    bm25 = BM25Baseline()
+    bm25.fit(corpus)
+    all_metrics = defaultdict(list)
+    print("  Evaluating queries...")
+    for query in tqdm(queries, desc="  Queries"):
+        sorted_idx = bm25.search(query["query_text"], k=100)
+        # Remove self
+        doc_ids = bm25.doc_ids
+        query_idx = None
+        for i, did in enumerate(doc_ids):
+            if did == query["query_id"]:
+                query_idx = i
+                break
+        relevance = np.array([
+            1.0 if bm25.doc_law_ids[i] == query["law_id"] else 0.0
+            for i in sorted_idx
+        ])
+        if query_idx is not None:
+            pos = np.where(sorted_idx == query_idx)[0]
+            if len(pos) > 0:
+                relevance = np.delete(relevance, pos[0])
+        # Pass the full ranking so NDCG is normalized against every relevant document
+        metrics = compute_metrics(relevance, [3, 5, 10])
+        for name, val in metrics.items():
+            all_metrics[name].append(val)
+    return {name: np.mean(scores) for name, scores in all_metrics.items()}
+class FrozenPhoBERT:
+    """Frozen PhoBERT with mean pooling baseline."""
+    def __init__(self, model_name: str = "vinai/phobert-base-v2"):
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+        self.model = AutoModel.from_pretrained(model_name)
+        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        self.model = self.model.to(self.device)
+        self.model.eval()
+    def encode(self, texts: List[str], batch_size: int = 64) -> torch.Tensor:
+        embeddings = []
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i:i + batch_size]
+            encoded = self.tokenizer(
+                batch, padding=True, truncation=True,
+                max_length=256, return_tensors="pt",  # PhoBERT accepts at most 256 tokens
+            )
+            input_ids = encoded["input_ids"].to(self.device)
+            attention_mask = encoded["attention_mask"].to(self.device)
+            with torch.no_grad():
+                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
+                hidden = outputs.last_hidden_state
+                # Mean pooling
+                mask = attention_mask.unsqueeze(-1).float()
+                pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
+                pooled = F.normalize(pooled, p=2, dim=1)
+                embeddings.append(pooled.cpu())
+        return torch.cat(embeddings, dim=0)
+def evaluate_frozen_phobert(
+    queries: List[Dict], corpus: List[Dict]
+) -> Dict[str, float]:
+    """Evaluate frozen PhoBERT baseline."""
+    print("  Loading frozen PhoBERT...")
+    encoder = FrozenPhoBERT()
+    print("  Encoding corpus...")
+    corpus_texts = [doc["text"] for doc in corpus]
+    corpus_embeddings = encoder.encode(corpus_texts)
+    corpus_ids = [doc["article_id"] for doc in corpus]
+    corpus_law_ids = [doc["law_id"] for doc in corpus]
+    all_metrics = defaultdict(list)
+    print("  Evaluating queries...")
+    query_texts = [q["query_text"] for q in queries]
+    query_embeddings = encoder.encode(query_texts)
+    for i, query in enumerate(tqdm(queries, desc="  Queries")):
+        query_emb = query_embeddings[i:i+1]
+        sim = F.cosine_similarity(query_emb, corpus_embeddings).numpy()
+        relevance = np.array([
+            1.0 if corpus_law_ids[j] == query["law_id"] else 0.0
+            for j in range(len(corpus))
+        ])
+        sorted_idx = sim.argsort()[::-1]
+        sorted_relevance = relevance[sorted_idx]
+        # Remove self
+        for j, cid in enumerate(corpus_ids):
+            if cid == query["query_id"]:
+                mask = sorted_idx != j
+                sorted_relevance = sorted_relevance[mask]
+                break
+        # Pass the full ranking so NDCG is normalized against every relevant document
+        metrics = compute_metrics(sorted_relevance, [3, 5, 10])
+        for name, val in metrics.items():
+            all_metrics[name].append(val)
+    return {name: np.mean(scores) for name, scores in all_metrics.items()}
+# ═══════════════════════════════════════════════════════════
+# Main evaluation entry point
+# ═══════════════════════════════════════════════════════════
+def run_full_evaluation(
+    config: TELENConfig = None,
+    checkpoint_path: str = None,
+):
+    """Run complete evaluation with all baselines and TELEN."""
+    if config is None:
+        config = TELENConfig()
+    random.seed(config.seed)
+    np.random.seed(config.seed)
+    print("=" * 60)
+    print("TELEN Evaluation")
+    print("=" * 60)
+    # Prepare test data
+    test_df = prepare_test_data(config)
+    queries = build_test_queries(test_df, max_queries=300)
+    corpus = build_test_corpus(test_df)
+    k_values = [3, 5, 10]
+    results = {}
+    # --- Baseline 1: BM25 ---
+    print("\n" + "=" * 40)
+    print("[1/3] BM25 Baseline")
+    print("=" * 40)
+    results["BM25"] = evaluate_bm25(queries, corpus)
+    for m in k_values:
+        print(f"  NDCG@{m}: {results['BM25'][f'ndcg@{m}']:.4f}  |  MRR@{m}: {results['BM25'][f'mrr@{m}']:.4f}")
+    # --- Baseline 2: Frozen PhoBERT ---
+    print("\n" + "=" * 40)
+    print("[2/3] Frozen PhoBERT Baseline")
+    print("=" * 40)
+    results["PhoBERT"] = evaluate_frozen_phobert(queries, corpus)
+    for m in k_values:
+        print(f"  NDCG@{m}: {results['PhoBERT'][f'ndcg@{m}']:.4f}  |  MRR@{m}: {results['PhoBERT'][f'mrr@{m}']:.4f}")
+    # --- TELEN ---
+    print("\n" + "=" * 40)
+    print("[3/3] TELEN (Ours)")
+    print("=" * 40)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    model = create_telen(config)
+    model = model.to(device)
+    # Load checkpoint if provided
+    if checkpoint_path and Path(checkpoint_path).exists():
+        print(f"  Loading checkpoint: {checkpoint_path}")
+        ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)
+        model.hypernetwork.load_state_dict(ckpt["hypernetwork"])
+        model.state_encoder.load_state_dict(ckpt["state_encoder"])
+        model.base_projection.load_state_dict(ckpt["base_projection"])
+        model.attn_query.data.copy_(ckpt["attn_query"])
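+        # Rebuild graph from training-era laws. test_df only holds years after
+        # val_split_year, so the full dataset is reloaded for this filter.
+        full_df = load_raw_data(str(DATA_DIR / "train-00000-of-00001.parquet"))
+        full_df = clean_data(extract_metadata(full_df), min_text_len=10)
+        model.build_graph(full_df[full_df["year"] <= config.meta.train_split_year])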
+    results["TELEN"] = evaluate_telen(model, queries, corpus)
+    for m in k_values:
+        print(f"  NDCG@{m}: {results['TELEN'][f'ndcg@{m}']:.4f}  |  MRR@{m}: {results['TELEN'][f'mrr@{m}']:.4f}")
+    # --- Summary ---
+    print("\n" + "=" * 60)
+    print("SUMMARY")
+    print("=" * 60)
+    header = f"{'Method':<20}"
+    for m in k_values:
+        header += f" {'NDCG@'+str(m):>12} {'MRR@'+str(m):>12}"
+    print(header)
+    print("-" * len(header))
+    for method in ["BM25", "PhoBERT", "TELEN"]:
+        row = f"{method:<20}"
+        for m in k_values:
+            row += f" {results[method][f'ndcg@{m}']:>12.4f} {results[method][f'mrr@{m}']:>12.4f}"
+        print(row)
+    # Relative improvement
+    print("\n--- Improvement over PhoBERT ---")
+    for m in k_values:
+        ndcg_imp = (results["TELEN"][f"ndcg@{m}"] / max(results["PhoBERT"][f"ndcg@{m}"], 1e-6) - 1) * 100
+        mrr_imp = (results["TELEN"][f"mrr@{m}"] / max(results["PhoBERT"][f"mrr@{m}"], 1e-6) - 1) * 100
+        print(f"  NDCG@{m}: {ndcg_imp:+.1f}%  |  MRR@{m}: {mrr_imp:+.1f}%")
+    return results
+if __name__ == "__main__":
+    run_full_evaluation()
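
The metric helpers in this file can be checked by hand on a toy ranking. The sketch below re-implements them in plain Python (an illustrative rewrite, not the module's own functions), with the ideal ordering taken from the full ranked list before cutting at k.

```python
import math


def dcg(rels, k):
    # Exponential-gain DCG over the first k relevance grades.
    return sum((2.0 ** r - 1.0) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg(rels, k):
    # Ideal ordering is taken from the FULL ranked list, then cut at k.
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(rels, k):
    # 1 / rank of the first relevant item within the top k, else 0.
    for rank, r in enumerate(rels[:k], start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


# Toy ranking: relevant documents sit at ranks 2 and 5.
ranked = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
ndcg3 = ndcg(ranked, 3)
rr3 = reciprocal_rank(ranked, 3)
print(round(ndcg3, 4), rr3)
```

If the list were sliced to its top 3 before normalizing, the second relevant document would drop out of the ideal ranking and the same result list would score higher, so the choice of what is passed to the NDCG helper matters.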

src/telern/hypernetwork.py ADDED Viewed

	@@ -0,0 +1,176 @@

+"""
+HyperNetwork for TELEN.
+Core innovation: Instead of learning fixed projection weights, the HyperNetwork
+GENERATES the projection function from the current legal corpus state.
+When new laws arrive → state vector changes → HyperNetwork produces new weights
+→ embedding space adapts WITHOUT retraining.
+Additionally outputs variance for stochastic embeddings (uncertainty-aware retrieval).
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class HyperNetwork(nn.Module):
+    """
+    Generates embedding projection parameters from a legal state vector.
+    Given state vector s ∈ R^d, produces:
+      - ΔW: low-rank projection shift (weighted sum of learned rank-1 bases)
+      - Δb: bias shift (weighted sum of learned bias bases)
+      - log_σ²: per-dimension log-variance for stochastic embedding
+    Architecture: Instead of generating giant parameter matrices directly,
+    we store a compact set of learned basis vectors and use the HyperNetwork
+    to generate ONLY the combination weights. This is parameter-efficient
+    and forces generalization.
+    """
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        hn = config.hypernetwork
+        d = config.hidden_dim
+        r = hn.adaptation_rank
+        hidden = hn.hn_hidden_dim
+        # Shared trunk: state → latent code
+        self.trunk = nn.Sequential(
+            nn.Linear(d, hidden),
+            nn.ReLU(),
+            nn.Dropout(hn.dropout),
+            nn.Linear(hidden, hidden),
+            nn.ReLU(),
+            nn.Dropout(hn.dropout),
+            nn.Linear(hidden, hidden),
+            nn.LayerNorm(hidden),
+        )
+        # Modulator: latent → combination weights for all outputs
+        self.modulator = nn.Linear(hidden, 2 * r + r + 1)  # A_weights + B_weights + bias_weights + var_context
+        # === Learned basis vectors (stored, not generated) ===
+        # For ΔW = Σ_i (w^A_i * u_i) ⊗ (w^B_i * v_i) where u_i, v_i ∈ R^d
+        self.basis_u = nn.Parameter(torch.randn(r, d) * 0.01)  # [r, d]
+        self.basis_v = nn.Parameter(torch.randn(r, d) * 0.01)  # [r, d]
+        # For Δb = Σ_i w^b_i * b_i where b_i ∈ R^d
+        self.basis_b = nn.Parameter(torch.randn(r, d) * 0.01)  # [r, d]
+        # Variance head
+        if hn.output_variance:
+            self.head_logvar = nn.Sequential(
+                nn.Linear(hidden, hidden),
+                nn.Tanh(),
+                nn.Linear(hidden, d),
+            )
+        else:
+            self.head_logvar = None
+    def forward(self, state_vector: torch.Tensor) -> dict:
+        """
+        Args:
+            state_vector: [d] or [B, d] summarizing current legal landscape
+        Returns dict with keys:
+            "shift_matrix": [d, d] or [B, d, d] rank-r projection shift
+            "bias": [d] or [B, d] bias shift
+            "log_variance": [d] or [B, d] log variance for stochastic embedding
+        """
+        squeeze = state_vector.dim() == 1
+        if squeeze:
+            state_vector = state_vector.unsqueeze(0)  # [1, d]
+        B, d = state_vector.shape
+        r = self.config.hypernetwork.adaptation_rank
+        # Shared representation
+        h = self.trunk(state_vector)  # [B, hidden]
+        modulated = self.modulator(h)  # [B, 2r + r + 1]
+        # Split modulation weights
+        w_A = modulated[:, :r]          # [B, r]
+        w_B = modulated[:, r:2*r]       # [B, r]
+        w_bias = modulated[:, 2*r:3*r]  # [B, r]
+        # Build the rank-r shift matrix:
+        #   ΔW = Σ_i (w_A[:, i] * basis_u[i]) ⊗ (w_B[:, i] * basis_v[i])
+        u_weighted = w_A.unsqueeze(2) * self.basis_u.unsqueeze(0)  # [B, r, d]
+        v_weighted = w_B.unsqueeze(2) * self.basis_v.unsqueeze(0)  # [B, r, d]
+        # Contract over the rank index directly to avoid a [B, r, d, d] intermediate
+        shift = torch.einsum("brd,bre->bde", u_weighted, v_weighted)  # [B, d, d]
+        # Bias
+        bias = (w_bias.unsqueeze(2) * self.basis_b.unsqueeze(0)).sum(dim=1)  # [B, d]
+        result = {"shift_matrix": shift, "bias": bias}
+        # Log variance
+        if self.head_logvar is not None:
+            logvar = self.head_logvar(h)
+            logvar = torch.clamp(logvar, min=-5.0, max=2.0)
+            result["log_variance"] = logvar
+        else:
+            result["log_variance"] = torch.full((B, d), -3.0, device=h.device)
+        if squeeze:
+            result = {k: v.squeeze(0) for k, v in result.items()}
+        return result
+class StateEncoder(nn.Module):
+    """
+    Encodes the legal concept graph into a compact state vector.
+    This is separate from the HyperNetwork so the graph computation
+    can be cached and only updated when the graph changes.
+    """
+    def __init__(self, dim: int):
+        super().__init__()
+        self.state_proj = nn.Sequential(
+            nn.Linear(dim, dim * 2),
+            nn.ReLU(),
+            nn.Dropout(0.1),
+            nn.Linear(dim * 2, dim),
+            nn.LayerNorm(dim),
+        )
+    def forward(self, node_embeddings: torch.Tensor, node_weights: torch.Tensor = None) -> torch.Tensor:
+        """
+        Args:
+            node_embeddings: [N, d] refined node embeddings from GNN
+            node_weights: [N] optional attention weights
+        Returns:
+            state_vector: [d] summarizing the legal landscape
+        """
+        if node_weights is None:
+            # Equal weight if none provided
+            node_weights = torch.ones(
+                node_embeddings.shape[0], device=node_embeddings.device
+            )
+        node_weights = F.softmax(node_weights, dim=0)
+        pooled = (node_embeddings * node_weights.unsqueeze(1)).sum(dim=0)
+        return self.state_proj(pooled)
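
The low-rank shift built by `HyperNetwork.forward` is a sum of r weighted outer products. A small NumPy sketch, using toy sizes and random values that are assumptions for illustration only, checks that the explicit sum equals a single contraction over the rank index and that the result has rank at most r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2                      # toy sizes, not the configured values
basis_u = rng.normal(size=(r, d))
basis_v = rng.normal(size=(r, d))
w_a = rng.normal(size=r)         # combination weights for one state vector
w_b = rng.normal(size=r)

# Explicit sum of r weighted outer products.
explicit = sum(np.outer(w_a[i] * basis_u[i], w_b[i] * basis_v[i]) for i in range(r))

# Same matrix as one contraction over the rank index, with no [r, d, d] intermediate.
fused = np.einsum("rd,re->de", w_a[:, None] * basis_u, w_b[:, None] * basis_v)

print(np.allclose(explicit, fused), np.linalg.matrix_rank(fused))
```

The fused form stores only a d x d result, which becomes relevant when d is in the hundreds and the shift is computed per batch element.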

src/telern/model.py ADDED Viewed

	@@ -0,0 +1,133 @@

+"""
+TELEN: Temporal Evolving Legal Embedding Network.
+Bi-encoder backbone + Legal Concept Graph + HyperNetwork projection.
+Embedding space adapts dynamically to the legal corpus state.
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import AutoModel, AutoTokenizer
+from pyvi import ViTokenizer
+from .config import TELENConfig
+from .hypernetwork import StateEncoder, HyperNetwork
+from .concept_graph import build_concept_graph
+def wseg(text):
+    return ViTokenizer.tokenize(text.replace("_", " "))
+class BiEncoder(nn.Module):
+    """Vietnamese bi-encoder backbone with attention pooling."""
+    def __init__(self, model_name="bkai-foundation-models/vietnamese-bi-encoder"):
+        super().__init__()
+        self.model = AutoModel.from_pretrained(model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+        self.dim = self.model.config.hidden_size
+        self.attn_query = nn.Parameter(torch.randn(self.dim))
+        self.scale = self.dim ** 0.5
+    def forward(self, texts, max_len=480):
+        segmented = [wseg(t) for t in texts]
+        enc = self.tokenizer(segmented, padding=True, truncation=True,
+                             max_length=max_len, return_tensors="pt")
+        input_ids = enc["input_ids"].to(self.attn_query.device)
+        mask = enc["attention_mask"].to(self.attn_query.device)
+        hidden = self.model(input_ids=input_ids, attention_mask=mask).last_hidden_state
+        scores = torch.einsum("bsd,d->bs", hidden, self.attn_query) / self.scale
+        scores = scores.masked_fill(mask == 0, float("-1e9"))
+        weights = F.softmax(scores, dim=1)
+        return torch.einsum("bsd,bs->bd", hidden, weights)
+class TELEN(nn.Module):
+    """Temporal Evolving Legal Embedding Network."""
+    def __init__(self, config: TELENConfig):
+        super().__init__()
+        self.config = config
+        d = config.hidden_dim
+        # Bi-encoder backbone (frozen)
+        self.encoder = BiEncoder()
+        for p in self.encoder.parameters():
+            p.requires_grad = False
+        # Projection (named base_projection to match train.py and the checkpoint keys)
+        self.base_projection = nn.Sequential(nn.Linear(d, d), nn.Tanh())
+        self.proj_norm = nn.LayerNorm(d)
+        self.attn_query = nn.Parameter(torch.randn(d))
+        # Graph
+        self.concept_graph = None
+        self.law_id_to_idx = None
+        # HyperNetwork
+        self.state_encoder = StateEncoder(d)
+        self.hypernetwork = HyperNetwork(config)
+    def _pool(self, hidden, mask):
+        """Attention-weighted pooling (for pre-tokenized inputs)."""
+        scores = torch.einsum("bsd,d->bs", hidden, self.attn_query) / (self.config.hidden_dim ** 0.5)
+        scores = scores.masked_fill(mask == 0, float("-1e9"))
+        weights = F.softmax(scores, dim=1)
+        return torch.einsum("bsd,bs->bd", hidden, weights)
+    def encode_text(self, texts):
+        return self.encoder(texts, max_len=self.config.max_seq_length)
+    def get_state_vector(self):
+        if self.concept_graph is None or self.concept_graph.num_nodes == 0:
+            return torch.zeros(self.config.hidden_dim, device=self.attn_query.device)
+        refined = self.concept_graph.forward()
+        return self.state_encoder(refined)
+    def adapt_embedding(self, raw, state_vec):
+        base = self.base_projection(raw)
+        hn = self.hypernetwork(state_vec)
+        shift = raw @ hn["shift_matrix"].T + hn["bias"]
+        mean = F.normalize(self.proj_norm(base + shift), p=2, dim=1)
+        result = {"mean": mean, "log_variance": hn.get("log_variance")}
+        if self.config.hypernetwork.output_variance:
+            noise = 0.1 * hn["log_variance"].exp().clamp(min=0.001, max=0.25).sqrt().clamp(max=0.5)
+            result["sample"] = F.normalize(mean + torch.randn_like(mean) * noise, p=2, dim=1)
+        else:
+            result["sample"] = mean
+        return result
+    def forward(self, texts, use_stochastic=False):
+        raw = self.encode_text(texts)
+        state = self.get_state_vector()
+        adapted = self.adapt_embedding(raw, state)
+        return {
+            "embeddings": adapted["sample"] if use_stochastic else adapted["mean"],
+            "mean": adapted["mean"],
+            "log_variance": adapted.get("log_variance"),
+            "state_vector": state,
+        }
+    def build_graph(self, df):
+        self.concept_graph, self.law_id_to_idx = build_concept_graph(
+            df, lambda t: self.encode_text([t])[0].detach(), self.config,
+        )
+        self.concept_graph = self.concept_graph.to(self.attn_query.device)
+    def add_law(self, law_id, articles):
+        if self.concept_graph is None: return
+        if articles:
+            emb = self.encode_text(articles[:5]).mean(dim=0)
+            new_idx = self.concept_graph.num_nodes
+            self.concept_graph.add_nodes([law_id], emb.unsqueeze(0))
+            existing = self.concept_graph.node_embeddings[:-1]
+            if len(existing) > 0:
+                sim = F.cosine_similarity(emb.unsqueeze(0), existing)
+                _, top = sim.topk(k=min(10, len(existing)))
+                self.concept_graph.add_edges("semantic",
+                    [(new_idx, i.item(), sim[i].item()) for i in top])
+def create_model(config: TELENConfig) -> TELEN:
+    return TELEN(config)
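
Both `BiEncoder.forward` and `TELEN._pool` use the same attention pooling: score each token against a learned query, mask padding, softmax over the sequence, then take the weighted sum. The NumPy sketch below mirrors those steps on random data; the function name and shapes are illustrative assumptions, not part of the module.

```python
import numpy as np


def attention_pool(hidden, mask, query):
    # hidden: [B, S, d], mask: [B, S] with 1 for real tokens, query: [d]
    scores = hidden @ query / np.sqrt(hidden.shape[-1])       # [B, S]
    scores = np.where(mask == 0, -1e9, scores)                # padding gets ~zero weight
    scores = scores - scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)    # softmax over tokens
    return np.einsum("bsd,bs->bd", hidden, weights), weights


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
pooled, weights = attention_pool(hidden, mask, rng.normal(size=8))
print(pooled.shape, weights.round(3))
```

Padded positions end up with effectively zero weight, so the pooled vector depends only on real tokens regardless of how much padding a batch contains.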

train.py ADDED Viewed

	@@ -0,0 +1,293 @@

+"""
+TELEN: Temporal Evolving Legal Embedding Network — Training Script.
+Stages:
+  1. Contrastive pretraining (5 epochs) — train projection head
+  2. Meta-training (50 epochs) — train HyperNetwork + State Encoder
+Usage:
+    python train.py
+"""
+import sys, os, math, random
+from pathlib import Path
+from collections import defaultdict
+import numpy as np
+import pandas as pd
+import torch, torch.nn as nn, torch.nn.functional as F
+from torch.utils.data import DataLoader, Dataset
+from tqdm import tqdm
+from transformers import AutoTokenizer
+from pyvi import ViTokenizer
+sys.path.insert(0, ".")
+from src.telern.config import TELENConfig, DATA_DIR
+from src.telern.model import TELEN, create_model
+from src.data import load_raw_data, extract_metadata, clean_data
+SEED = 42
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# ═══════════════════════════════════════════════════════════
+# Data
+# ═══════════════════════════════════════════════════════════
+def prepare_data(config):
+    df = load_raw_data(str(DATA_DIR / "train-00000-of-00001.parquet"))
+    df = extract_metadata(df); df = clean_data(df, min_text_len=10)
+    articles_by_law = defaultdict(list)
+    laws_by_year = defaultdict(list)
+    for _, row in df.iterrows():
+        articles_by_law[row["law_id"]].append({
+            "id": row["id"], "title": row["title"], "text": row["text"],
+            "law_type": row["law_type"], "year": row["year"],
+        })
+    for law_id in articles_by_law:
+        laws_by_year[articles_by_law[law_id][0]["year"]].append(law_id)
+    all_years = sorted(laws_by_year.keys())
+    train_years = [y for y in all_years if y <= config.meta.train_split_year]
+    val_years = [y for y in all_years if config.meta.train_split_year < y <= config.meta.val_split_year]
+    test_years = [y for y in all_years if y > config.meta.val_split_year]
+    return articles_by_law, laws_by_year, train_years, val_years, test_years, df
+# ═══════════════════════════════════════════════════════════
+# Contrastive Dataset
+# ═══════════════════════════════════════════════════════════
+class ContrastiveDataset(Dataset):
+    def __init__(self, df, tokenizer, max_len=480):
+        self.df = df.reset_index(drop=True)
+        self.tokenizer = tokenizer
+        self.max_len = max_len
+        self.law_groups = self.df.groupby("law_id")
+        self.law_ids = list(self.law_groups.groups.keys())
+    def __len__(self): return len(self.df)
+    def __getitem__(self, idx):
+        row = self.df.iloc[idx]; law_id = row["law_id"]
+        wseg = lambda t: ViTokenizer.tokenize(t.replace("_", " "))
+        anchor = wseg(f"{row['title']}: {row['text'][:400]}")
+        group_idx = self.law_groups.groups[law_id]
+        others = [i for i in group_idx if i != idx]
+        pos_row = self.df.iloc[random.choice(others)] if others else row
+        positive = wseg(f"{pos_row['title']}: {pos_row['text'][:400]}")
+        neg_law = random.choice([l for l in self.law_ids if l != law_id])
+        neg_row = self.df.iloc[random.choice(list(self.law_groups.groups[neg_law]))]
+        negative = wseg(f"{neg_row['title']}: {neg_row['text'][:400]}")
+        # Tokenize each text once and reuse the encoding for both tensors
+        item = {}
+        for t, s in [(anchor, "a"), (positive, "p"), (negative, "n")]:
+            enc = self.tokenizer(t, truncation=True, max_length=self.max_len,
+                                 padding="max_length", return_tensors="pt")
+            for k in ["input_ids", "attention_mask"]:
+                item[f"{k}_{s}"] = enc[k].squeeze(0)
+        return item
+# ═══════════════════════════════════════════════════════════
+# Stage 1: Contrastive Pretraining
+# ═══════════════════════════════════════════════════════════
+def contrastive_pretrain(model, df, config, epochs=5, batch_size=24, lr=3e-5):
+    tokenizer = model.encoder.tokenizer
+    dataset = ContrastiveDataset(df, tokenizer, config.max_seq_length)
+    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
+    trainable = list(model.base_projection.parameters()) + [model.attn_query]
+    opt = torch.optim.AdamW(trainable, lr=lr, weight_decay=0.01)
+    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))
+    print(f"  Contrastive pretraining: {epochs} epochs, {len(loader)} batches")
+    model.train(); model.encoder.model.eval()
+    for epoch in range(epochs):
+        total = 0.0
+        for batch in tqdm(loader, desc=f"  Epoch {epoch+1}/{epochs}"):
+            a_ids=batch["input_ids_a"].to(device); a_mask=batch["attention_mask_a"].to(device)
+            p_ids=batch["input_ids_p"].to(device); p_mask=batch["attention_mask_p"].to(device)
+            n_ids=batch["input_ids_n"].to(device); n_mask=batch["attention_mask_n"].to(device)
+            with torch.no_grad():
+                ah=model._pool(model.encoder.model(input_ids=a_ids,attention_mask=a_mask).last_hidden_state,a_mask)
+                ph=model._pool(model.encoder.model(input_ids=p_ids,attention_mask=p_mask).last_hidden_state,p_mask)
+                nh=model._pool(model.encoder.model(input_ids=n_ids,attention_mask=n_mask).last_hidden_state,n_mask)
+            ae=F.normalize(model.base_projection(ah),p=2,dim=1)
+            pe=F.normalize(model.base_projection(ph),p=2,dim=1)
+            ne=F.normalize(model.base_projection(nh),p=2,dim=1)
+            trip=F.relu(0.3-(ae*pe).sum(1)+(ae*ne).sum(1)).mean()
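+            # InfoNCE over [anchors | positives | negatives]. Mask each anchor's similarity
+            # to itself (always 1/0.05 after L2 normalization) so it cannot dominate the softmax.
+            sim=ae@torch.cat([ae,pe,ne],dim=0).T/0.05
+            sim=sim.masked_fill(torch.eye(sim.size(0),sim.size(1),device=device).bool(),float("-inf"))
+            infonce=F.cross_entropy(sim,torch.arange(len(a_ids),device=device)+len(a_ids))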
+            loss=trip+0.5*infonce
+            opt.zero_grad(); loss.backward()
+            torch.nn.utils.clip_grad_norm_(trainable,1.0); opt.step(); sched.step()
+            total+=loss.item()
+        print(f"    Epoch {epoch+1} avg loss: {total/len(loader):.4f}")
+    print("  Contrastive pretraining complete!")
+    return model
+# ═══════════════════════════════════════════════════════════
+# Episode building
+# ═══════════════════════════════════════════════════════════
+def build_episode(articles_by_law, laws_by_year, state_years, query_year, config):
+    mc = config.meta
+    q_laws = laws_by_year.get(query_year, [])
+    if len(q_laws) < 5: return None
+    sampled = random.sample(q_laws, min(mc.n_query // 4, len(q_laws)))
+    queries, positives, q_types = [], [], set()
+    for lid in sampled:
+        arts = articles_by_law[lid]
+        if len(arts) < 2: continue
+        qi, pi = random.sample(range(len(arts)), 2)
+        queries.append(arts[qi]); positives.append(arts[pi])
+        q_types.add(arts[qi]["law_type"])
+    if len(queries) < 4: return None
+    hard_neg, rand_neg = [], []
+    for lid in q_laws:
+        if lid in sampled: continue
+        for a in articles_by_law[lid]:
+            if a["law_type"] in q_types: hard_neg.append(a)
+            else: rand_neg.append(a)
+    nh = min(mc.n_negatives // 2, len(hard_neg))
+    nr = min(mc.n_negatives - nh, len(rand_neg))
+    negatives = (random.sample(hard_neg, nh) if nh > 0 else []) + (random.sample(rand_neg, nr) if nr > 0 else [])
+    if len(negatives) < 4: return None
+    return {"queries": queries, "positives": positives, "negatives": negatives}
+# ═══════════════════════════════════════════════════════════
+# Stage 2: Meta-Training
+# ═══════════════════════════════════════════════════════════
+def compute_loss(model, q_texts, p_texts, n_texts, state_vec, temp=0.05):
+    n_q, n_p = len(q_texts), len(p_texts)
+    if n_q == 0 or n_p == 0:
+        return torch.tensor(0.0, device=device, requires_grad=True)
+    all_t = q_texts + p_texts + n_texts
+    raw = model.encode_text(all_t)
+    adapted = model.adapt_embedding(raw, state_vec)
+    emb = adapted["mean"]
+    qe = emb[:n_q]; pe = emb[n_q:n_q+n_p]; ne = emb[n_q+n_p:]
+    if n_q == n_p:
+        sim = torch.cat([(qe*pe).sum(1).unsqueeze(1)/temp, qe@ne.T/temp], dim=1)
+        loss = F.cross_entropy(sim, torch.zeros(n_q, dtype=torch.long, device=device))
+    else:
+        loss = F.cross_entropy(qe @ torch.cat([pe, ne], dim=0).T / temp,
+                               torch.arange(n_q, device=device).clamp(max=len(pe)-1))
+    if model.config.hypernetwork.output_variance:
+        lv = adapted.get("log_variance")
+        if lv is not None: loss = loss + (lv.exp() - lv - 1).mean() * model.config.meta.kl_weight
+    return loss
+def validate(model, articles_by_law, laws_by_year, val_years, config):
+    model.eval(); losses = []
+    with torch.no_grad():
+        for _ in range(30):
+            qy = random.choice(val_years)
+            if qy not in laws_by_year: continue
+            sy = [y for y in sorted(laws_by_year.keys()) if y < qy]
+            if len(sy) < 3: sy = [y for y in sorted(laws_by_year.keys()) if y <= qy]
+            ep = build_episode(articles_by_law, laws_by_year, sy, qy, config)
+            if ep is None: continue
+            sv = model.get_state_vector()
+            losses.append(compute_loss(model,
+                [f"{q['title']}: {q['text'][:200]}" for q in ep["queries"]],
+                [f"{p['title']}: {p['text'][:200]}" for p in ep["positives"]],
+                [f"{n['title']}: {n['text'][:200]}" for n in ep["negatives"]],
+                sv, config.meta.temperature).item())
+    return sum(losses)/max(len(losses),1)
+def meta_train(model, articles_by_law, laws_by_year, train_years, val_years, config, epochs=50, patience=10):
+    trainable = (list(model.hypernetwork.parameters()) + list(model.state_encoder.parameters()) +
+                 list(model.base_projection.parameters()) + [model.attn_query])
+    opt = torch.optim.AdamW(trainable, lr=config.meta.meta_lr, weight_decay=1e-4)
+    sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=10, T_mult=2)
+    os.makedirs(config.output_dir, exist_ok=True)
+    best_val, patience_ctr = float("inf"), 0
+    for epoch in range(epochs):
+        model.train(); total_loss, n_updates = 0.0, 0
+        steps = config.meta.meta_batch_size * 100
+        progress = tqdm(range(steps), desc=f"Meta Epoch {epoch+1}/{epochs}")
+        for _ in progress:
+            if len(train_years) < 3: break
+            si = random.randint(2, len(train_years)-1)
+            sy, qy = train_years[:si], train_years[si]
+            if qy not in laws_by_year: continue
+            ep = build_episode(articles_by_law, laws_by_year, sy, qy, config)
+            if ep is None: continue
+            sv = model.get_state_vector()
+            loss = compute_loss(model,
+                [f"{q['title']}: {q['text'][:200]}" for q in ep["queries"]],
+                [f"{p['title']}: {p['text'][:200]}" for p in ep["positives"]],
+                [f"{n['title']}: {n['text'][:200]}" for n in ep["negatives"]],
+                sv, config.meta.temperature)
+            opt.zero_grad(); loss.backward()
+            torch.nn.utils.clip_grad_norm_(trainable, 1.0); opt.step()
+            total_loss += loss.item(); n_updates += 1
+            progress.set_postfix({"loss": f"{loss.item():.4f}"})
+        # Average over the episodes actually trained on; skipped episodes add no loss.
+        avg = total_loss / max(n_updates, 1)
+        print(f"  Epoch {epoch+1} avg loss: {avg:.4f}")
+        sched.step()
+        vl = validate(model, articles_by_law, laws_by_year, val_years, config)
+        print(f"  Val loss: {vl:.4f}")
+        if vl < best_val:
+            best_val, patience_ctr = vl, 0
+            torch.save({
+                "hypernetwork": model.hypernetwork.state_dict(),
+                "state_encoder": model.state_encoder.state_dict(),
+                "base_projection": model.base_projection.state_dict(),
+                "attn_query": model.attn_query,
+                "epoch": epoch, "val_loss": vl,
+            }, Path(config.output_dir) / "telen_best.pt")
+            print(f"  Saved (val_loss={vl:.4f})")
+        else:
+            patience_ctr += 1
+            if patience_ctr >= patience:
+                print(f"  Early stopping at epoch {epoch+1}"); break
+    print("Meta-training complete!")
+    return model
+# ═══════════════════════════════════════════════════════════
+# Main
+# ═══════════════════════════════════════════════════════════
+def main():
+    config = TELENConfig()
+    random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
+    print(f"Device: {device}")
+    # Data
+    print("\nLoading data...")
+    articles_by_law, laws_by_year, train_years, val_years, test_years, df = prepare_data(config)
+    print(f"  Train: {train_years[0]}-{train_years[-1]} ({len(train_years)}y)")
+    print(f"  Val:   {val_years[0]}-{val_years[-1]} ({len(val_years)}y)")
+    print(f"  Test:  {len(test_years)}y")
+    # Model
+    print("\nCreating TELEN...")
+    model = create_model(config).to(device)
+    print(f"  HyperNetwork: {sum(p.numel() for p in model.hypernetwork.parameters()):,} params")
+    # Build graph
+    print("\nBuilding concept graph...")
+    train_df = df[df["year"].isin(train_years)]
+    model.build_graph(train_df)
+    print(f"  Graph: {model.concept_graph.num_nodes} nodes")
+    # Stage 1
+    print("\n" + "=" * 60)
+    print("Stage 1: Contrastive Pretraining")
+    print("=" * 60)
+    model = contrastive_pretrain(model, train_df, config, epochs=5, batch_size=24, lr=3e-5)
+    # Stage 2
+    print("\n" + "=" * 60)
+    print("Stage 2: Meta-Training")
+    print("=" * 60)
+    model = meta_train(model, articles_by_law, laws_by_year, train_years, val_years, config, epochs=50, patience=10)
+    print(f"\nDone! Model saved to: {config.output_dir}/telen_best.pt")
+if __name__ == "__main__":
+    main()
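
The temporal split that `meta_train` uses to pick a query year can be sketched in isolation. The snippet below is a minimal standalone illustration, not the repository's code: `sample_year_split` is a hypothetical helper name and the year list is made-up example data. It mirrors the `random.randint(2, len(train_years)-1)` draw, so at least two earlier years always precede the query year.

```python
import random


def sample_year_split(train_years, rng=random):
    """Pick a query year and the list of years that precede it.

    Hypothetical helper mirroring the split in meta_train: the index is
    drawn from [2, len(train_years) - 1], so at least two years of
    history come before the query year.
    """
    if len(train_years) < 3:
        return None  # not enough history to form an episode
    si = rng.randint(2, len(train_years) - 1)  # randint is inclusive on both ends
    return train_years[:si], train_years[si]


if __name__ == "__main__":
    years = [2010, 2011, 2012, 2013, 2014]  # illustrative data only
    rng = random.Random(0)
    state_years, query_year = sample_year_split(years, rng)
    print(state_years, query_year)
```

As far as this diff shows, `build_episode` accepts the state years but does not read them, and `get_state_vector()` takes no arguments, so the split currently affects only which year the queries come from.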

train_ce.py ADDED

	@@ -0,0 +1,123 @@

+"""
+Train the cross-encoder re-ranker for legal text.
+Usage:
+    python train_ce.py
+Trains a PhoBERT-based cross-encoder on legal article pairs
+with margin ranking loss for re-ranking TELEN retrieval results.
+Output: data/checkpoints/telen/cross_encoder_best.pt
+"""
+import sys; sys.path.insert(0, ".")
+sys.stdout.reconfigure(encoding='utf-8')
+import warnings; warnings.filterwarnings("ignore")
+import random, numpy as np, torch, torch.nn as nn, torch.nn.functional as F
+from tqdm import tqdm
+from collections import defaultdict
+from transformers import AutoModel, AutoTokenizer
+from src.telern.config import DATA_DIR
+from src.data import load_raw_data, extract_metadata, clean_data
+SEED = 42; random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
+# ── Data ──
+print("Loading data...")
+df = load_raw_data(str(DATA_DIR / "train-00000-of-00001.parquet"))
+df = extract_metadata(df); df = clean_data(df, min_text_len=10)
+train_df = df[df["year"] <= 2018]
+print(f"  {len(train_df)} articles, {train_df['law_id'].nunique()} laws")
+# ── Build pairs ──
+print("Building pairs...")
+law_groups = train_df.groupby("law_id")
+law_ids = list(law_groups.groups.keys())
+law_type_to_laws = defaultdict(list)
+for lid in law_ids:
+    lt = law_groups.get_group(lid).iloc[0]["law_type"]
+    law_type_to_laws[lt].append(lid)
+pairs = []
+for law_id in tqdm(law_ids, desc="  Pairs"):
+    group = law_groups.get_group(law_id)
+    articles = group.to_dict("records")
+    if len(articles) < 2: continue
+    law_type = articles[0]["law_type"]
+    same_type_laws = [l for l in law_type_to_laws.get(law_type, []) if l != law_id]
+    # Laws of a different type. This is the same for every article of the law, so compute it once.
+    same_type_set = set(same_type_laws)
+    diff = [l for l in law_ids if l != law_id and l not in same_type_set]
+    for art in articles:
+        q = f"{art['title']}: {art['text'][:400]}"
+        pos = [a for a in articles if a["id"] != art["id"]]
+        if pos:
+            # Draw the positive once so its title and text come from the same article.
+            pos_art = random.choice(pos)
+            pairs.append((q, f"{pos_art['title']}: {pos_art['text'][:400]}", 1.0))
+        if same_type_laws:
+            neg_art = law_groups.get_group(random.choice(same_type_laws)).iloc[0]
+            pairs.append((q, f"{neg_art['title']}: {neg_art['text'][:400]}", 0.0))
+        if diff:
+            neg_art2 = law_groups.get_group(random.choice(diff)).iloc[0]
+            pairs.append((q, f"{neg_art2['title']}: {neg_art2['text'][:400]}", 0.0))
+n_pos = sum(1 for p in pairs if p[2] == 1.0)
+if len(pairs) > 60000:
+    pos_pairs = [p for p in pairs if p[2] == 1.0]
+    neg_pairs = [p for p in pairs if p[2] == 0.0]
+    pairs = random.sample(pos_pairs, min(30000, len(pos_pairs))) + random.sample(neg_pairs, min(30000, len(neg_pairs)))
+print(f"  {len(pairs)} pairs ({sum(1 for p in pairs if p[2]==1.0)} pos)")
+# ── Model ──
+print("Loading PhoBERT...")
+tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
+encoder = AutoModel.from_pretrained("vinai/phobert-base-v2").to(device)
+head = nn.Sequential(
+    nn.Linear(encoder.config.hidden_size, 512), nn.ReLU(), nn.Dropout(0.1),
+    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.1),
+    nn.Linear(256, 1),
+).to(device)
+opt = torch.optim.AdamW(list(encoder.parameters())+list(head.parameters()), lr=1e-5, weight_decay=0.01)
+# ── Train ──
+B, epochs = 16, 10
+steps_per_epoch = len(pairs) // B
+print(f"\nTraining: {epochs} epochs, {steps_per_epoch} steps/epoch")
+best_loss = float("inf")
+for epoch in range(epochs):
+    random.shuffle(pairs)
+    epoch_loss = 0.0
+    progress = tqdm(range(steps_per_epoch), desc=f"  Epoch {epoch+1}/{epochs}")
+    for step in progress:
+        start = step * B  # steps_per_epoch = len(pairs) // B, so start + B never exceeds len(pairs)
+        batch = pairs[start:start + B]
+        queries = [p[0] for p in batch]; docs = [p[1] for p in batch]
+        labels = torch.tensor([p[2] for p in batch], dtype=torch.float, device=device)
+        enc = tokenizer(queries, docs, padding=True, truncation=True, max_length=256, return_tensors="pt")
+        input_ids = enc["input_ids"].to(device); attention_mask = enc["attention_mask"].to(device)
+        out = encoder(input_ids=input_ids, attention_mask=attention_mask)
+        scores = head(out.last_hidden_state[:, 0, :]).squeeze(-1)
+        pos_mask = labels == 1; neg_mask = labels == 0
+        if pos_mask.any() and neg_mask.any():
+            pos_scores = scores[pos_mask]; neg_scores = scores[neg_mask]
+            loss = F.relu(0.3 - pos_scores.unsqueeze(1) + neg_scores.unsqueeze(0)).mean()
+        else:
+            loss = F.binary_cross_entropy_with_logits(scores, labels)
+        opt.zero_grad(); loss.backward()
+        torch.nn.utils.clip_grad_norm_(list(encoder.parameters())+list(head.parameters()), 1.0)
+        opt.step()
+        epoch_loss += loss.item()
+        progress.set_postfix({"loss": f"{loss.item():.4f}"})
+    avg_loss = epoch_loss / steps_per_epoch
+    print(f"    Epoch {epoch+1} avg loss: {avg_loss:.4f}")
+    if avg_loss < best_loss:
+        best_loss = avg_loss
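+        import os; os.makedirs("data/checkpoints/telen", exist_ok=True)  # make sure the output directory exists
+        torch.save({"encoder": encoder.state_dict(), "head": head.state_dict()},
+                   "data/checkpoints/telen/cross_encoder_best.pt")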
+        print(f"    Saved (loss={avg_loss:.4f})")
+print("\nDone! Model saved to: data/checkpoints/telen/cross_encoder_best.pt")
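
The margin branch of the training loop scores every positive in the batch against every negative. A plain-Python analogue of `F.relu(0.3 - pos_scores.unsqueeze(1) + neg_scores.unsqueeze(0)).mean()` is sketched below to make the all-pairs broadcasting explicit. `pairwise_margin_loss` is a hypothetical name used only for this illustration, and the scores in the demo are made up.

```python
def pairwise_margin_loss(pos_scores, neg_scores, margin=0.3):
    """Mean hinge loss over every (positive, negative) score pair.

    Each pair contributes max(0, margin - pos + neg): zero once the
    positive outscores the negative by at least `margin`.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - p + n)
    return total / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Illustrative scores only: one well-separated positive, one borderline positive.
    print(pairwise_margin_loss([2.0, 0.1], [0.0, -1.0]))
```

Only pairs that violate the margin contribute, so a batch whose positives all clear every negative by 0.3 yields zero loss and no gradient. Batches that contain only positives or only negatives take the binary cross-entropy fallback in the loop above, which this sketch does not cover.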