---
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---
# Embedding model for Labor Space
This repository contains a fine-tuned BERT model for the paper *Labor Space: A Unifying Representation of the Labor Market via Large Language Models*.
## Model description
LABERT (Labor market + BERT) is a BERT-based sentence-transformers model fine-tuned on a domain-specific corpus of labor market text. To capture the latent structure of the labor market, the original BERT model is fine-tuned with two objectives:
**Context learning**: We use Hugging Face's fill-mask pipeline with the description of each entity to capture contextual information about the labor market at the word-token level. We concatenate (1) 308 NAICS 4-digit industry descriptions, (2) O*NET's descriptions of 36 skills, 25 knowledge domains, 46 abilities, and 1,016 occupations, (3) ESCO's descriptions of 15,000 skills and 3,000 occupations, and (4) 489 Crunchbase descriptions of S&P 500 firms, excluding their labels.
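Below is a minimal sketch of what such a masked-language-modeling (fill-mask) fine-tuning step can look like with the `transformers` library. The corpus path `descriptions.txt` and the hyperparameters are illustrative assumptions, not the exact setup used in the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus file: one entity description per line
# (NAICS, O*NET, ESCO, and Crunchbase descriptions concatenated).
dataset = load_dataset("text", data_files={"train": "descriptions.txt"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens for the fill-mask objective (15% is the standard BERT rate).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="labert-mlm",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```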
**Relation learning**: We add a further fine-tuning step to incorporate inter-entity relatedness. Entities of different types in the labor market are intertwined with one another. For example, industry-specific occupational employment quantifies the relatedness between industries and occupations and tells us which occupations are conceptually close to a given industry. Relation learning makes the embedding space capture this inter-entity relatedness, so that each entity's embedding lies closer to the entities it is highly associated with than to those it is not. For more detail, see Section 3.4, *Fine-tuning for relation learning*, in the paper.
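The paper defines its own relation-learning procedure in Section 3.4; the sketch below only illustrates the general idea with `sentence-transformers`, pulling pairs of related entity descriptions closer together. The example pairs and the choice of `MultipleNegativesRankingLoss` are assumptions for illustration, not the paper's exact loss or data.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a base or context-learned checkpoint (placeholder shown here).
model = SentenceTransformer("bert-base-uncased")

# Hypothetical relatedness pairs, e.g. industry <-> occupation descriptions
# selected from employment statistics.
train_examples = [
    InputExample(texts=[
        "Hospitals provide inpatient medical and surgical care.",
        "Registered nurses assess patient health problems and needs.",
    ]),
    InputExample(texts=[
        "Software publishers design, develop, and distribute software.",
        "Software developers create applications and systems software.",
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Pulls each paired entity together and pushes apart the other in-batch entities.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```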
## How to use
Using this model becomes easy when you have sentence-transformers installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer, models

base_model = "seongwoon/LAbert"

# Step 1: load the fine-tuned language model
embedding_model = models.Transformer(base_model)

# Step 2: pool the token embeddings into a single sentence embedding (mean pooling)
pooling_model = models.Pooling(embedding_model.get_word_embedding_dimension())
pooling_model.pooling_mode_mean_tokens = True
pooling_model.pooling_mode_cls_token = False
pooling_model.pooling_mode_max_tokens = False

# Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[embedding_model, pooling_model])

dancer_description = "Perform dances. May perform on stage, for broadcasting, or for video recording."
embedding_of_dancer_description = model.encode(dancer_description, convert_to_tensor=True)
print(embedding_of_dancer_description)
```
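To compare entities in the embedding space, you can, for example, compute the cosine similarity between two occupation descriptions. The choreographer description below is just an illustrative second input; the snippet reuses `model` and `embedding_of_dancer_description` from the example above.

```python
from sentence_transformers import util

choreographer_description = "Create new dance routines. Rehearse performance of routines."
embedding_of_choreographer = model.encode(choreographer_description, convert_to_tensor=True)

# Cosine similarity between the two occupation embeddings
similarity = util.cos_sim(embedding_of_dancer_description, embedding_of_choreographer)
print(similarity)
```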
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
## Citing & Authors
```bibtex
@inproceedings{kim2024labor,
  title={Labor Space: A Unifying Representation of the Labor Market via Large Language Models},
  author={Kim, Seongwoon and Ahn, Yong-Yeol and Park, Jaehyuk},
  booktitle={Proceedings of the ACM on Web Conference 2024},
  pages={2441--2451},
  year={2024}
}
```