PubMedNCL
A pretrained language model for document-level representations of biomedical papers. PubMedNCL is based on PubMedBERT, a BERT model pretrained on abstracts and full texts from PubMedCentral, and is fine-tuned with citation neighborhood contrastive learning, the method introduced by SciNCL.
How to use the pretrained model
from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
model = AutoModel.from_pretrained('malteos/PubMedNCL')
papers = [
    {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
    {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'},
]
# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
# inference
result = model(**inputs)
# use the first token ([CLS]) of each sequence as the document embedding
embeddings = result.last_hidden_state[:, 0, :]
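Document embeddings like these are typically compared with cosine similarity, for example to rank candidate papers against a query paper. The following is a minimal, dependency-free sketch of that step, not part of the original model card: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy vectors merely stand in for real embeddings, which could be converted to plain lists with `embeddings.detach().tolist()`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as dissimilar to everything instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs, most similar candidate first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings (assumption).
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = rank_by_similarity(toy_vectors[0], toy_vectors[1:])
```

When working directly with tensors, `torch.nn.functional.cosine_similarity` computes the same quantity without leaving PyTorch.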
Citation
- Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper).
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.
License
MIT