BioGraphFusion Dataset

Dataset Description

This dataset contains the benchmark data used in the paper "BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning", published in Bioinformatics.

Dataset Structure

The dataset includes three biomedical knowledge graph completion tasks with background knowledge integration:

1. Disease-Gene Prediction (DisGeNet_cv)

Task: Disease-gene association prediction
Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Protein-Chemical relationships from STITCH (277,745 triples)
Main Dataset: DisGeNet (130,820 triples) focusing on gene targets
Description: Predicts disease-gene associations using multi-source biological knowledge

2. Protein-Chemical Interaction (STITCH)

Task: Protein-chemical interaction prediction
Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Disease-Gene relationships from DisGeNet (130,820 triples)
Main Dataset: STITCH (23,074 triples) focusing on chemical targets
Description: Predicts protein-chemical interactions with integrated disease and gene knowledge

3. Medical Ontology Reasoning (UMLS)

Task: Medical concept reasoning
Background Knowledge: Various medical relationships from UMLS (4,006 triples)
Main Dataset: UMLS (2,523 triples) with multi-domain entities
Description: Reasons about medical concepts and their hierarchical relationships
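
The three tasks share the same shape: background triples supply context, and main dataset triples are the ones to be completed. The sketch below shows one common way to combine them. It assumes each record is a (head, relation, tail) triple and that background triples are merged into the training graph while held-out main triples stay unseen. The entity and relation names are hypothetical and are not taken from the dataset files.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_adjacency(triples):
    """Index triples by head entity: head -> {(relation, tail), ...}."""
    adjacency = defaultdict(set)
    for t in triples:
        adjacency[t.head].add((t.relation, t.tail))
    return adjacency


def known_tails(graph, head, relation):
    """Return the tails already linked to `head` via `relation` in the graph."""
    return {tail for rel, tail in graph.get(head, set()) if rel == relation}


# Hypothetical toy data (illustrative identifiers only).
background = [
    Triple("drug:D1", "treats", "disease:X"),
    Triple("protein:P1", "interacts_with", "chemical:C1"),
]
main_train = [Triple("disease:X", "associated_with", "gene:G1")]
main_test = [Triple("disease:X", "associated_with", "gene:G2")]

# Background knowledge and main training triples form one graph;
# the held-out main triples are what a model would be asked to predict.
graph = build_adjacency(background + main_train)
print(known_tails(graph, "disease:X", "associated_with"))
```

Keeping the held-out triples out of the merged graph is what makes the task a completion problem: the model only sees gene:G1 for disease:X and is scored on whether it ranks gene:G2 highly.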

Dataset Statistics

Dataset	Task	Background Knowledge Sources (triples)	Main Dataset (triples) and Targets	Total Triples
Disease-Gene Prediction	Disease-gene association prediction	SIDER drug-disease (14,631) + STITCH protein-chemical (277,745)	DisGeNet (130,820), gene targets	~423K
Protein-Chemical Interaction	Protein-chemical interaction prediction	SIDER drug-disease (14,631) + DisGeNet disease-gene (130,820)	STITCH (23,074), chemical targets	~168K
Medical Ontology Reasoning	Medical concept reasoning	UMLS medical relationships (4,006)	UMLS (2,523), multi-domain entities	~6.5K
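
The totals in the last column are the sum of the background and main dataset triple counts for each task. A quick check using only the numbers listed above (the dictionary layout is just for illustration):

```python
# Triple counts per task, copied from the statistics above.
counts = {
    "Disease-Gene Prediction": {"SIDER": 14_631, "STITCH": 277_745, "DisGeNet": 130_820},
    "Protein-Chemical Interaction": {"SIDER": 14_631, "DisGeNet": 130_820, "STITCH": 23_074},
    "Medical Ontology Reasoning": {"UMLS background": 4_006, "UMLS main": 2_523},
}

# Sum background and main counts to get the total per task.
totals = {task: sum(parts.values()) for task, parts in counts.items()}

for task, total in totals.items():
    print(f"{task}: {total:,} triples")
```

The exact sums are 423,196, 168,525, and 6,529 triples, which match the rounded figures in the table.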

Usage

Loading the Dataset

from datasets import load_dataset

# Load the complete dataset
dataset = load_dataset("Y-TARL/BioGraphFusion")

# Load a specific task configuration
disgenet_data = load_dataset("Y-TARL/BioGraphFusion", "Disease-Gene")
stitch_data = load_dataset("Y-TARL/BioGraphFusion", "Protein-Chemical")
umls_data = load_dataset("Y-TARL/BioGraphFusion", "umls")
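
After loading, it can help to confirm which configurations and splits are actually available before relying on them. The helpers below are a minimal sketch, not part of the dataset's own tooling: `summarize_splits` and `list_configs` are hypothetical names, and the split names a given configuration exposes are not stated here, so the code makes no assumption about them. `get_dataset_config_names` is a function of the `datasets` library.

```python
def summarize_splits(dataset):
    """Return {split_name: number_of_rows} for a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset.items()}


def list_configs(repo_id="Y-TARL/BioGraphFusion"):
    """List the configuration names published for the repository (needs network access)."""
    # Imported lazily so summarize_splits works even without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


# Example usage, assuming `umls_data` was loaded as shown above:
#   print(list_configs())
#   print(summarize_splits(umls_data))
```

`summarize_splits` relies only on `.items()` and `len()`, so it accepts the object returned by `load_dataset` as well as a plain dictionary of lists.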

Citation

If you use this dataset in your research, please cite our paper:

@article{lin2025biographfusion,
  title={BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning},
  author={Lin, Yitong and He, Jiaying and Chen, Jiahe and Zhu, Xinnan and Zheng, Jianwei and Tao, Bo},
  journal={Bioinformatics},
  pages={btaf408},
  year={2025},
  publisher={Oxford University Press}
}

Related Resources

Paper: Bioinformatics
Preprint: arXiv:2507.14468
Code: GitHub Repository

License

This dataset is released under the Apache 2.0 License.

Acknowledgements

We thank the original data providers:

DisGeNet for disease-gene associations
STITCH for protein-chemical interactions
UMLS for medical ontology data

Contact

For questions about the dataset, please open an issue in the GitHub repository.
