GLiNER2 Data Mention Extractor: datause-extraction

Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from development economics and humanitarian research documents.

Architecture: Two-Pass Hybrid

  • Pass 1 (extract_entities): detects all data-mention spans using three entity types (named_mention, descriptive_mention, vague_mention), bypassing count prediction (count_pred) entirely.
  • Pass 2 (extract_json): classifies each detected span individually (count=1).
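
The two-pass flow can be sketched as follows. This is a minimal illustration with hypothetical stub functions, not the library API; the real GLiNER2 calls are shown under Usage below.

```python
# Sketch of the two-pass hybrid (stub functions stand in for the model calls).
def two_pass(text, detect, classify):
    records = []
    for etype, spans in detect(text).items():  # Pass 1: span detection per entity type
        for span in spans:
            # Pass 2: classify each span on its own (count=1)
            records.append({"mention": span, "type": etype, **classify(span)})
    return records

# Toy stubs to illustrate the contract between the two passes:
detect = lambda t: {"named_mention": ["DHS 2018"]}
classify = lambda s: {"typology_tag": "survey"}
print(two_pass("The analysis draws on the DHS 2018.", detect, classify))
```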

Entity Types

  • named_mention: Proper names and acronyms (DHS, LSMS, FAOSTAT)
  • descriptive_mention: Described data with identifying detail but no formal name
  • vague_mention: Generic data references with minimal identifying detail
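
The distinction between the three types can be illustrated with example phrases. These are hypothetical examples for intuition, not model output:

```python
# Illustrative phrases for each entity type (hypothetical examples, not model output).
MENTION_EXAMPLES = {
    "named_mention": ["DHS 2018", "LSMS-ISA", "FAOSTAT"],
    "descriptive_mention": ["a 2020 phone survey of 3,000 refugee households"],
    "vague_mention": ["the data", "existing statistics"],
}

for etype, phrases in MENTION_EXAMPLES.items():
    print(f"{etype}: {', '.join(phrases)}")
```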

Classification Fields

  • typology_tag: survey / census / administrative / database / indicator / geospatial / microdata / report / other
  • is_used: True / False
  • usage_context: primary / supporting / background
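
Combining the Pass 1 entity type with the Pass 2 fields yields one flat record per mention. The values below are hypothetical examples of the record shape, not actual model output:

```python
# Shape of one fully classified mention record (hypothetical values).
example_record = {
    "mention_name": "DHS 2018",
    "specificity": "named",      # Pass 1 entity type, minus the "_mention" suffix
    "typology": "survey",        # one of the typology_tag choices
    "is_used": "True",           # one of the is_used choices
    "usage_context": "primary",  # one of the usage_context choices
}
print(example_record)
```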

Installation

pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror

Usage

from gliner2 import GLiNER2
import re

extractor = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
extractor.load_adapter("ai4data/datause-extraction")

ENTITY_SCHEMA = {
    "entities": ["named_mention", "descriptive_mention", "vague_mention"],
    "entity_descriptions": {
        "named_mention": "A proper name or well-known acronym for a data source (DHS, LSMS, FAOSTAT).",
        "descriptive_mention": "A described data reference with identifying detail but no formal name.",
        "vague_mention": "A generic or loosely specified reference to data.",
    },
}

def extract_sentence_context(text, char_start, char_end, margin=1):
    """Return the sentence containing the span plus `margin` sentences on each side."""
    # Sentence boundaries: start of text, the end of each ".", "!", or "?"
    # followed by whitespace, and the end of text.
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= char_start < boundaries[i + 1]:
            s = max(0, i - margin)
            e = min(len(boundaries) - 1, i + margin + 1)
            return text[boundaries[s]:boundaries[e]].strip()
    return text  # Fallback: span not located in any sentence; use the full text.

json_schema = (
    extractor.create_schema()
    .structure("data_mention")
    .field("mention_name", dtype="str")
    .field("typology_tag", dtype="str", choices=[
        "survey", "census", "administrative", "database", "indicator",
        "geospatial", "microdata", "report", "other",
    ])
    .field("is_used", dtype="str", choices=["True", "False"])
    .field("usage_context", dtype="str", choices=["primary", "supporting", "background"])
)

text = "The analysis draws on the DHS 2018 and administrative records from the National Statistics Office."

# Pass 1 β€” span detection
pass1 = extractor.extract(text, ENTITY_SCHEMA, threshold=0.3, include_confidence=True, include_spans=True)
entities = pass1.get("entities", {})

# Pass 2 β€” classification per span
results = []
for etype in ["named_mention", "descriptive_mention", "vague_mention"]:
    for span in entities.get(etype, []):
        # Spans may be returned as dicts with offsets or as bare strings.
        if isinstance(span, dict):
            mention_text = span.get("text", "")
            char_start = span.get("start", text.find(mention_text))
            char_end = span.get("end", char_start + len(mention_text))
        else:
            mention_text = span
            char_start = text.find(mention_text)
            char_end = char_start + len(mention_text)
        context = extract_sentence_context(text, char_start, char_end)
        tags = extractor.extract(context, json_schema)
        tag = (tags.get("data_mention") or [{}])[0]
        results.append({
            "mention_name": mention_text,
            "specificity": etype.replace("_mention", ""),
            "typology": tag.get("typology_tag"),
            "is_used": tag.get("is_used"),
            "usage_context": tag.get("usage_context"),
        })

for r in results:
    print(r)