UniHGKR-base-beir

Our paper: UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers.

The UniHGKR-base-beir model is derived from the UniHGKR-base model and further fine-tuned on MS MARCO for evaluation on the BEIR benchmark. We recommend using the sentence-transformers package to load the model and to embed paragraphs and sentences.

The model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Evaluation on BEIR

The evaluation code can be found at https://github.com/ZhishanQ/UniHGKR.

Model Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
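
The architecture above ends with CLS-token pooling followed by L2 normalization. As a rough illustration of what those two stages compute, here is a minimal pure-Python sketch on toy data. The helper names `cls_pool` and `l2_normalize` and the 4-dimensional vectors are assumptions made for this example; they are not part of the sentence-transformers API, and the real model produces 768-dimensional token states.

```python
import math
from typing import List


def cls_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Return the first token's vector (the [CLS] position in BERT-style inputs)."""
    return list(token_embeddings[0])


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Toy sequence of 3 tokens with 4-dimensional hidden states (illustrative values).
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.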

Usage

Use the instructions to achieve the best performance from the model:

general_ins = "Given a question, retrieve relevant evidence that can answer the question from all knowledge sources:"
single_source_inst = "Given a question, retrieve relevant evidence that can answer the question from Text sources:"

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ZhishanQ/UniHGKR-base-beir")
# Run inference

general_ins = "Given a question, retrieve relevant evidence that can answer the question from all knowledge sources:"
single_source_inst = "Given a question, retrieve relevant evidence that can answer the question from Text sources:"

sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]

# Prepend the instruction to each sentence
updated_sentences = [f"{single_source_inst} {sentence}" for sentence in sentences]

# Encode the instruction-prefixed sentences (not the raw ones)
embeddings = model.encode(updated_sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
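
For retrieval, the similarity scores are typically used to rank candidate passages against a query. The sketch below shows that ranking step on hand-written unit vectors, so it runs without downloading the model. The `rank_by_similarity` helper and the 2-dimensional vectors are hypothetical stand-ins for real `model.encode` outputs; since the model's final layer normalizes its embeddings, a plain dot product is assumed to act as cosine similarity here.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query_vec: List[float], doc_vecs: List[List[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit-length vectors standing in for real embeddings (illustrative values).
query = [1.0, 0.0]
docs = [
    [0.0, 1.0],  # orthogonal to the query
    [0.6, 0.8],  # partially aligned
    [1.0, 0.0],  # identical direction
]

ranking = rank_by_similarity(query, docs)
print([index for index, _ in ranking])  # [2, 1, 0]
```

With real embeddings, `query` and `docs` would be replaced by rows of the array returned by `model.encode`.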

Training Details

Framework Versions

  • Python: 3.8.10
  • Sentence Transformers: 3.0.1
  • Transformers: 4.44.2
  • PyTorch: 2.0.0+cu118
  • Accelerate: 0.34.0
  • Datasets: 2.21.0
  • Tokenizers: 0.19.1

Citation

If you find this resource useful in your research, please consider liking the model and citing our paper.

@article{min2024unihgkr,
  title={UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers},
  author={Min, Dehai and Xu, Zhiyang and Qi, Guilin and Huang, Lifu and You, Chenyu},
  journal={arXiv preprint arXiv:2410.20163},
  year={2024}
}