BioGraphFusion Dataset
Dataset Description
This dataset contains the benchmark data used in the paper "BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning" published in Bioinformatics.
Dataset Structure
The dataset includes three biomedical knowledge graph completion tasks with background knowledge integration:
1. Disease-Gene Prediction (DisGeNet_cv)
- Task: Disease-gene association prediction
- Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Protein-Chemical relationships from STITCH (277,745 triples)
- Main Dataset: DisGeNet (130,820 triples) focusing on gene targets
- Description: Predicts disease-gene associations using multi-source biological knowledge
2. Protein-Chemical Interaction (STITCH)
- Task: Protein-chemical interaction prediction
- Background Knowledge: Drug-Disease relationships from SIDER (14,631 triples) + Disease-Gene relationships from DisGeNet (130,820 triples)
- Main Dataset: STITCH (23,074 triples) focusing on chemical targets
- Description: Predicts protein-chemical interactions with integrated disease and gene knowledge
3. Medical Ontology Reasoning (UMLS)
- Task: Medical concept reasoning
- Background Knowledge: Various medical relationships from UMLS (4,006 triples)
- Main Dataset: UMLS (2,523 triples) with multi-domain entities
- Description: Reasons about medical concepts and their hierarchical relationships
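As a rough illustration of what "background knowledge integration" can look like in practice, the sketch below merges background triples with main-dataset training triples into a single graph and assigns integer IDs to entities and relations. The triple contents, the `build_graph` helper, and the choice to drop exact duplicates are all assumptions made for illustration; they are not taken from the dataset files or from the paper's preprocessing.

```python
# Toy (head, relation, tail) triples. All names are hypothetical and
# do not come from the actual dataset files.
background = [
    ("drug_A", "treats", "disease_X"),
    ("protein_P", "binds", "chemical_C"),
]
main_train = [
    ("disease_X", "associated_with", "gene_G"),
    ("protein_P", "binds", "chemical_C"),  # exact duplicate of a background triple
]


def build_graph(background, main_train):
    """Merge two triple lists, drop exact duplicates, and index entities/relations."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    merged = list(dict.fromkeys(background + main_train))
    entities = sorted({e for h, _, t in merged for e in (h, t)})
    relations = sorted({r for _, r, _ in merged})
    entity_to_id = {e: i for i, e in enumerate(entities)}
    relation_to_id = {r: i for i, r in enumerate(relations)}
    return merged, entity_to_id, relation_to_id


merged, entity_to_id, relation_to_id = build_graph(background, main_train)
```

Here the merged graph holds three unique triples over five entities and three relation types. Whether duplicates across sources should be removed is a design choice that depends on the training setup.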
Dataset Statistics
Dataset | Task | Background Knowledge Sources | Main Dataset Targets | Total Triples |
---|---|---|---|---|
Disease-Gene Prediction | Disease-gene association prediction | Drug-Disease Relationships SIDER (14,631) + Protein-Chemical Relationships STITCH (277,745) | DisGeNet (130,820) Gene | ~423K |
Protein-Chemical Interaction | Protein-chemical interaction prediction | Drug-Disease Relationships SIDER (14,631) + Disease-Gene Relationships DisGeNet (130,820) | STITCH (23,074) Chemical | ~168K |
Medical Ontology Reasoning | Medical concept reasoning | Various Medical Relationships UMLS (4,006) | UMLS (2,523) Multi-domain Entities | ~6.5K |
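The approximate totals in the last column can be reproduced by summing the per-source triple counts listed in each row. The short check below does that arithmetic; the dictionary keys simply reuse the row names from the table.

```python
# Per-source triple counts copied from the table above
# (background sources first, main dataset last).
counts = {
    "Disease-Gene Prediction": [14_631, 277_745, 130_820],
    "Protein-Chemical Interaction": [14_631, 130_820, 23_074],
    "Medical Ontology Reasoning": [4_006, 2_523],
}

# Sum the sources for each task to get the exact total.
totals = {task: sum(parts) for task, parts in counts.items()}

for task, total in totals.items():
    print(f"{task}: {total:,} triples")
```

The exact sums are 423,196, 168,525, and 6,529, which round to the ~423K, ~168K, and ~6.5K shown in the table.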
Usage
Loading the Dataset
from datasets import load_dataset
# Load the complete dataset
dataset = load_dataset("Y-TARL/BioGraphFusion")
# Load a specific task configuration
disgenet_data = load_dataset("Y-TARL/BioGraphFusion", "Disease-Gene")
stitch_data = load_dataset("Y-TARL/BioGraphFusion", "Protein-Chemical")
umls_data = load_dataset("Y-TARL/BioGraphFusion", "umls")
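After loading, it can be useful to check which splits are present and how many rows each one holds. The helper below is a minimal sketch: `summarize_splits` is a hypothetical name, and it assumes only that the loaded object behaves like a mapping from split names to sized collections, as a non-streaming `DatasetDict` does. The split names in this repository are not stated above, so none are assumed here.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: row_count} for a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Intended use with the objects loaded above (requires network access):
#   summary = summarize_splits(dataset)
#   print(summary)

# Stand-in example with plain lists so the helper can be tried offline.
example = summarize_splits({"split_a": [1, 2, 3], "split_b": [4]})
```

With streaming datasets the splits have no length, so this helper applies only to the default, fully downloaded form.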
Citation
If you use this dataset in your research, please cite our paper:
@article{lin2025biographfusion,
title={BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning},
author={Lin, Yitong and He, Jiaying and Chen, Jiahe and Zhu, Xinnan and Zheng, Jianwei and Tao, Bo},
journal={Bioinformatics},
pages={btaf408},
year={2025},
publisher={Oxford University Press}
}
Related Resources
- Paper: Bioinformatics
- Preprint: arXiv:2507.14468
- Code: GitHub Repository
License
This dataset is released under the Apache 2.0 License.
Acknowledgements
We thank the original data providers:
- DisGeNet for disease-gene associations
- STITCH for protein-chemical interactions
- UMLS for medical ontology data
Contact
For questions about the dataset, please open an issue in the GitHub repository.