Extraction of Formal Educational Requirements from Online Job Advertisements

Pipeline

  1. Localization (NER)
    • Task: find the span in the job ad that contains the educational requirements
    • Trained and evaluated using annotated data (see table below)
  2. Education Level Extraction (Rule-based NER)
    • Task: determine the education level(s) requested by the employer
    • Rules refined and evaluated using annotated data (see table below)
    • Entities correspond to a granular set of educational qualifications in the German system and are assigned to one or more ISCED codes by mapping rules
  3. Educational Subject Localization (NER)
    • Task: find entities within the span retrieved from step 1 matching an educational subject
    • Performed separately for academic and vocational subjects
    • Trained using annotated data
  4. Educational Subject Classification (few-shot SBERT)
    • Task: classify each entity retrieved in step 3 according to the educational subject taxonomy
    • Performed separately for academic and vocational subjects
    • Training examples are taken from
      • academic: official classifications of various study programs at German universities
      • vocational: synonym lists of vocational professions provided by the Bundesagentur für Arbeit (available at https://download-portal.arbeitsagentur.de/); additional examples for higher-order clusters of vocational subjects were generated using ChatGPT 4
    • All examples and the candidate entity are converted to embeddings using a fine-tuned SBERT model and classified using radius nearest neighbors with cosine distance
    • Combination of steps 3 and 4 evaluated using annotated data (see table below)
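
The classification in step 4 can be illustrated with a minimal sketch of radius nearest neighbors under cosine distance. This is not the pipeline's actual code: the toy 2-D vectors stand in for SBERT embeddings, the labels and the radius of 0.1 are hypothetical, and both the majority vote and the `None` result for candidates without any neighbor inside the radius are assumptions about how such a classifier is typically set up.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def radius_neighbors_classify(query, examples, radius):
    """Majority vote among labeled examples within `radius` of `query`.

    Returns None when no example lies inside the radius.
    """
    votes = {}
    for vector, label in examples:
        if cosine_distance(query, vector) <= radius:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy vectors standing in for SBERT embeddings (hypothetical values and labels).
examples = [
    ((1.0, 0.0), "Mathematik"),
    ((0.9, 0.1), "Mathematik"),
    ((0.0, 1.0), "Politikwissenschaft"),
]

near_math = radius_neighbors_classify((0.95, 0.05), examples, radius=0.1)
far_from_all = radius_neighbors_classify((-1.0, -1.0), examples, radius=0.1)
```

Unlike k-nearest neighbors, a radius-based rule can decline to assign a label when a candidate is far from every training example, which fits spans that do not name a known subject.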

Usage

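import os
import sys

from huggingface_hub import snapshot_download

# Access token for the gated repository; reading it from an environment
# variable named HF_TOKEN is an assumption, adapt to your setup
HF_TOKEN = os.environ.get("HF_TOKEN")

# Download a snapshot of the model repository
path = snapshot_download(
    cache_dir="tmp/",
    repo_id="bertelsmannstift/oja_education_extraction",
    revision="main",
    token=HF_TOKEN,
)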

# Add pipeline module to path and import 
sys.path.append(path)
from pipeline import PipelineWrapper

# Init model
pipeline = PipelineWrapper(path=path)

# Predictions
queries = [
    {
        "posting_id": "123",
        "full_text": "Wir sind Firma XYZ. Wir suchen einen Data Scientist. Sie haben Mathematik, Politikwissenschaften oder ein vergleichbares Fach studiert.",
        "candidate_description": None,
        "job_description": None,
    }
]

result = pipeline(queries)

Output

[{'posting_id': '123',
  'education_level_raw_id': [...],
  'education_level_isced_id': [...],
  'education_studies_label': [...],
  'education_vocational_label': [...]},
 ...
]
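
For downstream analysis it can be convenient to flatten this nested output into one row per extracted value. The sketch below assumes that every field other than `posting_id` holds a list, as the schematic output suggests. The `result` shown here is a hypothetical stand-in with made-up label values, not actual pipeline output, and `to_long_rows` is an illustrative helper that is not part of the repository.

```python
def to_long_rows(result):
    """Flatten pipeline output into (posting_id, field, value) tuples."""
    rows = []
    for record in result:
        for field, values in record.items():
            if field == "posting_id":
                continue
            for value in values:
                rows.append((record["posting_id"], field, value))
    return rows


# Hypothetical result in the documented shape; label values are placeholders.
result = [
    {
        "posting_id": "123",
        "education_level_raw_id": [],
        "education_level_isced_id": [],
        "education_studies_label": ["subject_a", "subject_b"],
        "education_vocational_label": [],
    }
]

rows = to_long_rows(result)
```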

Variables and Taxonomy

Performance and Datasets

Training data are sampled from the Textkernel German OJA dataset (https://www.textkernel.com/).

| Task | n | Data Sampling | Annotators | Annotator Overlap | Inter Annotator Agreement | Test-Split | Evaluation Mode | Precision (Micro) | Recall (Micro) |
|---|---|---|---|---|---|---|---|---|---|
| Localization (step 1) | 1500 | Stratified by first number ISCO, equally distributed | 4 | 10 % | | 0.2 | NER partial | 0.95 | 0.92 |
| Education Level (step 2) | 1200 | Stratified by first number ISCO, equally distributed | 3 | 20 % | 0.84 (Krippendorff) | 0.5 | Multilabel classification | 0.92 | 0.87 |
| Academic Subject | 1200 | Stratified by first number ISCO, optimized | 5 | 20 % | 0.77 (Krippendorff) | 0.2 | NER strict | 0.87 | 0.88 |
| Vocational Subject | 1200 | Stratified by first number ISCO, optimized | 2 | 20 % | 0.72 (Krippendorff) | 0.2 | NER strict | 0.86 | 0.83 |
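
To make the "NER strict", "NER partial" and "Micro" terms in the table concrete, here is a minimal sketch of micro-averaged precision and recall over character spans. It rests on common conventions that the source does not spell out: "strict" is taken to mean identical span boundaries, "partial" is taken to mean any character overlap, and entity types are ignored. The spans are invented for illustration.

```python
def overlaps(a, b):
    """True if half-open spans (start, end) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def micro_precision_recall(gold_docs, pred_docs, mode="strict"):
    """Pool matches over all documents, then compute precision and recall."""
    match = (lambda g, p: g == p) if mode == "strict" else overlaps
    matched_pred = matched_gold = n_pred = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        n_gold += len(gold)
        n_pred += len(pred)
        # A predicted span is correct if it matches some gold span.
        matched_pred += sum(any(match(g, p) for g in gold) for p in pred)
        # A gold span is found if some predicted span matches it.
        matched_gold += sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    return precision, recall


# Invented spans: two documents, three gold spans, two predictions.
gold_docs = [[(10, 20)], [(5, 15), (30, 40)]]
pred_docs = [[(10, 20)], [(6, 15)]]

strict = micro_precision_recall(gold_docs, pred_docs, mode="strict")
partial = micro_precision_recall(gold_docs, pred_docs, mode="partial")
```

In this toy case the off-by-one prediction `(6, 15)` counts as an error under strict matching but as a hit under partial matching, so partial scores are never lower than strict ones on the same predictions.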

Notes:

  • Performance estimates for step 2 refer to the granular taxonomy (education_level_raw_id), not to ISCED; performance on ISCED codes will be marginally better
  • False negatives accumulate along the pipeline, so total recall might be lower than the per-step values
  • However, since IAA < 1, some of the false negatives and false positives are due to annotator misclassification, so effective performance (mostly precision) will be higher than reported
  • Optimized sampling applies equally distributed sampling to a subsample of 0.1 %, in which not all strata can be completely filled, thus suppressing rare subgroups
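
The note on accumulating false negatives can be illustrated with a rough back-of-the-envelope calculation. It assumes that misses at successive steps are independent and that the subject recall values in the table were measured on correctly localized spans. Neither assumption is stated in the source, so the products below are an illustration, not a reported result.

```python
import math


def chained_recall(*step_recalls):
    """Recall of a pipeline whose steps miss items independently."""
    return math.prod(step_recalls)


# Recall values taken from the table above.
localization = 0.92
academic_subject = 0.88
vocational_subject = 0.83

academic_end_to_end = chained_recall(localization, academic_subject)
vocational_end_to_end = chained_recall(localization, vocational_subject)
```

Under these assumptions, end-to-end recall would be about 0.81 for academic subjects and about 0.76 for vocational subjects.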