NER — Vietnamese Legal Domain (Electra-base)

Mô hình Named Entity Recognition (NER) cho văn bản pháp lý tiếng Việt, fine-tune từ NlpHUST/ner-vietnamese-electra-base trên bộ dữ liệu pháp lý tự xây dựng với BIO tagging scheme, 63 loại thực thể và 125 nhãn (1 O + các tag B-/I-; riêng ARTICLE và LOCATION chỉ xuất hiện dạng B-).

Phạm vi: 3 luật — Luật Viễn thông, Luật An ninh mạng, Luật Công nghệ thông tin (cùng các văn bản hướng dẫn liên quan).

Các loại thực thể nhận dạng

125 nhãn BIO (1 O + 62 loại × 2 prefix B/I) thuộc 7 nhóm: văn bản pháp lý, chủ thể, khái niệm pháp lý, an ninh mạng, hạ tầng & kỹ thuật, tiêu chuẩn & quy định, và các thực thể khác (địa điểm, thời gian, mức độ...).

Kỹ thuật fine-tune

Subword label alignment — chỉ gán nhãn cho first subtoken, các subtoken còn lại dùng -100
Mixed precision (bf16) + TF32 — tối ưu tốc độ trên GPU Ampere/Hopper
Layer-wise Learning Rate Decay (LLRD) với decay factor 0.9 — backbone LR 3e-5, classifier head LR 1e-4
Linear warmup + decay (warmup_ratio=0.1), weight decay 0.01, gradient clipping 1.0
Label smoothing 0.1
Early stopping theo eval_f1 (seqeval entity-level), patience 5
Dynamic padding với DataCollatorForTokenClassification
Reproducible seed 42

Dữ liệu huấn luyện

Split	Số câu
Train	6 403
Validation	1 279
Test	1 288

Dataset: JSONL format, mỗi dòng gồm tokens (list) và labels (list BIO labels) — ngoài ra có thêm id và sentence. Lưu ý tokens đã được word-tokenize sẵn (token dạng Bộ_Công_an, an_ninh_mạng).

Kết quả đánh giá

Train 30 epoch (early stopping theo eval_f1), GPU NVIDIA RTX PRO 6000 Blackwell, ~262 s — train_loss ≈ 1.09 (đã gồm label smoothing 0.1). Metric seqeval entity-level:

Split	Precision	Recall	F1
Validation	0.9965	0.9970	0.9968
Test	0.9960	0.9976	0.9968

Trên tập test (7 804 thực thể), theo classification_report:

	Precision	Recall	F1
micro avg	0.9960	0.9976	0.9968
macro avg	0.9896	0.9963	0.9926
weighted avg	0.9961	0.9976	0.9968

Phần lớn entity type đạt F1 = 1.0. Các type có support lớn: OBLIGATION (1079, F1 1.00), LEGAL_ACTOR (1025, F1 0.996), LEGAL_ACTION (940, F1 0.997), LAW (690, F1 1.00), TIME (556, F1 0.998), SANCTION (485, F1 0.998), SYSTEM (327, F1 0.989). Các type F1 thấp nhất đều thuộc nhóm support rất nhỏ: AUTHORIZATION (0.857, support 3), AUDIT (0.933, 7), LICENSE (0.947, 19), LEGAL_CONCEPT (0.952, 31), PERMISSION (0.964, 28).

⚠️ F1 ≈ 0.997 phản ánh dữ liệu được sinh khá theo khuôn mẫu (template) — số liệu trên domain văn bản pháp lý thực tế, đa dạng hơn, sẽ thấp hơn đáng kể.

Cách sử dụng

⚠️ Quan trọng: model được train trên dữ liệu đã word-tokenize (token dạng Bộ_Công_an, an_ninh_mạng, ...). Khi inference bắt buộc phải word-tokenize trước rồi truyền với is_split_into_words=True — nếu truyền raw text thẳng vào tokenizer/pipeline, entity sẽ lệch hoàn toàn.

import torch
from pyvi import ViTokenizer
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_DIR = "NTA1802/NER-Completing-With-Legal-Dataset"
tok = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
mdl = AutoModelForTokenClassification.from_pretrained(MODEL_DIR).eval()
id2label = mdl.config.id2label

@torch.no_grad()
def extract_entities(text: str):
    words = ViTokenizer.tokenize(text).split()
    enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    pred_ids = mdl(**enc).logits[0].argmax(-1).tolist()
    word_ids = enc.word_ids(0)

    word_label = [None] * len(words)
    seen = set()
    for j, wid in enumerate(word_ids):
        if wid is None or wid in seen:
            continue
        seen.add(wid)
        word_label[wid] = id2label[pred_ids[j]]

    entities, cur = [], None
    for w, lab in zip(words, word_label):
        if lab in (None, "O"):
            if cur: entities.append(cur); cur = None
            continue
        prefix, _, etype = lab.partition("-")
        if prefix == "B" or cur is None or cur["type"] != etype:
            if cur: entities.append(cur)
            cur = {"type": etype, "text": w.replace("_", " ")}
        else:
            cur["text"] += " " + w.replace("_", " ")
    if cur: entities.append(cur)
    return entities

print(extract_entities("Khoản 1 Điều 46 Luật An ninh mạng quy định về bảo vệ dữ liệu cá nhân."))

Thông tin mô hình

Base model: NlpHUST/ner-vietnamese-electra-base (ELECTRA-base, 12 layers, hidden 768)
Số nhãn: 125 (BIO scheme — 63 entity types)
Max sequence length: 512
Framework: PyTorch + HuggingFace Transformers ≥ 4.41
Ngôn ngữ: Tiếng Việt (văn bản pháp lý — viễn thông, an ninh mạng, công nghệ thông tin)

Downloads last month: 101

Safetensors

Model size

0.1B params

Tensor type

F32

Model tree for NTA1802/NER-Completing-With-Legal-Dataset

Base model

NlpHUST/electra-base-vn

Finetuned

NlpHUST/ner-vietnamese-electra-base

Finetuned

(6)

this model