BioGraphFusion Dataset

Dataset Description

This dataset contains the benchmark data used in the paper "BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning", published in Bioinformatics.

Dataset Structure

The dataset includes three biomedical knowledge graph completion tasks with background knowledge integration:

1. Disease-Gene Prediction (DisGeNet_cv)

  • Task: Disease-gene association prediction
  • Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Protein-Chemical relationships from STITCH (277,745 triples)
  • Main Dataset: DisGeNet (130,820 triples) focusing on gene targets
  • Description: Predicts disease-gene associations using multi-source biological knowledge

2. Protein-Chemical Interaction (STITCH)

  • Task: Protein-chemical interaction prediction
  • Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Disease-Gene relationships from DisGeNet (130,820 triples)
  • Main Dataset: STITCH (23,074 triples) focusing on chemical targets
  • Description: Predicts protein-chemical interactions with integrated disease and gene knowledge

3. Medical Ontology Reasoning (UMLS)

  • Task: Medical concept reasoning
  • Background Knowledge: Various medical relationships from UMLS (4,006 triples)
  • Main Dataset: UMLS (2,523 triples) with multi-domain entities
  • Description: Reasons about medical concepts and their hierarchical relationships
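
To make the task layout concrete, here is a minimal sketch of how background triples and main-dataset triples could be pooled into one graph index. The entity names, relation names, and the `build_index` helper are invented for illustration and are not taken from the dataset; the integration procedure used in the paper may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical toy triples; NOT drawn from the actual dataset.
background: List[Triple] = [
    ("drug_A", "treats", "disease_X"),
    ("protein_P", "binds", "chemical_C"),
]
main: List[Triple] = [
    ("disease_X", "associated_with", "gene_G"),
]


def build_index(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head entity to its outgoing (relation, tail) edges."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return dict(index)


# Pool background knowledge and the main task triples into one graph.
graph = build_index(background + main)
```

In this toy graph, the background edge ending at `disease_X` and the main-dataset edge starting from it share an entity, which is the kind of overlap that lets background knowledge inform a completion task.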

Dataset Statistics

| Dataset | Task | Background Knowledge Sources | Main Dataset | Targets | Total Triples |
|---|---|---|---|---|---|
| Disease-Gene Prediction | Disease-gene association prediction | Drug-Disease Relationships: SIDER (14,631) + Protein-Chemical Relationships: STITCH (277,745) | DisGeNet (130,820) | Gene | ~423K |
| Protein-Chemical Interaction | Protein-chemical interaction prediction | Drug-Disease Relationships: SIDER (14,631) + Disease-Gene Relationships: DisGeNet (130,820) | STITCH (23,074) | Chemical | ~168K |
| Medical Ontology Reasoning | Medical concept reasoning | Various Medical Relationships: UMLS (4,006) | UMLS (2,523) | Multi-domain Entities | ~6.5K |
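
The approximate totals are the sum of each task's background and main-dataset triple counts. The short check below reproduces that arithmetic from the figures stated above; the dictionary layout and key names are chosen here for illustration only.

```python
# Triple counts copied from the statistics above.
counts = {
    "Disease-Gene Prediction": {"SIDER": 14_631, "STITCH": 277_745, "DisGeNet": 130_820},
    "Protein-Chemical Interaction": {"SIDER": 14_631, "DisGeNet": 130_820, "STITCH": 23_074},
    "Medical Ontology Reasoning": {"UMLS background": 4_006, "UMLS main": 2_523},
}

# Total per task = background triples + main-dataset triples.
totals = {task: sum(sources.values()) for task, sources in counts.items()}

for task, total in totals.items():
    print(f"{task}: {total:,} triples")
```

The exact sums are 423,196, 168,525, and 6,529 triples, matching the rounded figures of ~423K, ~168K, and ~6.5K.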

Usage

Loading the Dataset

from datasets import load_dataset

# Load the complete dataset
dataset = load_dataset("Y-TARL/BioGraphFusion")

# Load a specific task by its configuration name
disgenet_data = load_dataset("Y-TARL/BioGraphFusion", "Disease-Gene")
stitch_data = load_dataset("Y-TARL/BioGraphFusion", "Protein-Chemical")
umls_data = load_dataset("Y-TARL/BioGraphFusion", "umls")
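
After loading, it can help to check which splits a configuration provides and how many rows each holds. The helper below is a minimal sketch: `summarize_splits` is a hypothetical name, and it relies only on the returned object behaving like a mapping from split names to objects that support `len()`, as a `DatasetDict` does. This card does not document the split names or column names, so none are assumed.

```python
from typing import Dict, Mapping, Sized


def summarize_splits(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows in each split of a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Stand-in data so the sketch runs without downloading anything.
example = {"train": [("h1", "r", "t1"), ("h2", "r", "t2")], "test": [("h3", "r", "t3")]}
print(summarize_splits(example))
```

Passing one of the loaded objects, for example `summarize_splits(umls_data)`, would report whichever splits that configuration actually defines.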

Citation

If you use this dataset in your research, please cite our paper:

@article{lin2025biographfusion,
  title={BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning},
  author={Lin, Yitong and He, Jiaying and Chen, Jiahe and Zhu, Xinnan and Zheng, Jianwei and Tao, Bo},
  journal={Bioinformatics},
  pages={btaf408},
  year={2025},
  publisher={Oxford University Press}
}

License

This dataset is released under the Apache 2.0 License.

Acknowledgements

We thank the original data providers:

  • DisGeNet for disease-gene associations
  • STITCH for protein-chemical interactions
  • UMLS for medical ontology data

Contact

For questions about the dataset, please open an issue in the GitHub repository.
