Dataset description
The cytochrome P450 (CYP) genes are involved in the formation and breakdown (metabolism) of various molecules and chemicals within cells. CYP3A4 in particular is an important enzyme, found mainly in the liver and the intestine. It oxidizes small foreign organic molecules (xenobiotics), such as toxins or drugs, so that they can be removed from the body.
Task description
Binary classification. Given a drug's SMILES string, predict whether the drug inhibits CYP3A4.
Dataset statistics
Total: 12,328 drugs
Prerequisites
Install the following packages:
pip install PyTDC
pip install DeepPurpose
pip install git+https://github.com/bp-kelley/descriptastorus
pip install dgl torch torchvision
You can also refer to the Colab notebook here.
Dataset split
Random split: 70% training, 10% validation, and 20% testing.
To load the dataset in TDC, run:
from tdc.single_pred import ADME
data = ADME(name = 'CYP3A4_Veith')
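The 70/10/20 random split described in this section can be illustrated with a minimal, standard-library sketch. The `random_split` helper, its default seed, and the toy integer IDs are hypothetical and used only for illustration; this is not TDC's own splitting code. In TDC itself, the loaded `data` object provides a `get_split` method for this purpose, which, as far as we know, accepts `method`, `seed`, and `frac` arguments.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items and cut them into train/valid/test lists.

    Hypothetical helper for illustration; not TDC's implementation.
    """
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * frac[0])
    n_valid = round(len(shuffled) * frac[1])
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # the remainder goes to test
    }


# Toy stand-in for the dataset: 1,000 integer IDs instead of SMILES strings.
split = random_split(range(1000))
print({name: len(part) for name, part in split.items()})
```

Fixing the seed keeps the three subsets identical across runs, which matters when comparing models trained at different times.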
Model description
The model encodes each molecule as a Morgan chemical fingerprint and feeds it to an MLP decoder. The model was tuned over 100 runs using the Ax platform.
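To make the architecture concrete, the sketch below runs a forward pass of a small MLP over a fixed-length bit vector, which is the form a Morgan fingerprint takes. Everything in it is an assumption for illustration: the fingerprint is random bits rather than one computed from a real molecule (a cheminformatics toolkit such as RDKit would normally produce it), the weights are untrained, and the sizes `N_BITS` and `N_HIDDEN` are arbitrary, not the tuned model's values.

```python
import math
import random


def mlp_forward(bits, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, bits)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Assumed sizes for illustration only; not taken from the tuned model.
N_BITS, N_HIDDEN = 1024, 8
rng = random.Random(0)

# Random stand-ins: a fake fingerprint and untrained weights.
fingerprint = [rng.randint(0, 1) for _ in range(N_BITS)]
w1 = [[rng.gauss(0.0, 0.05) for _ in range(N_BITS)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
w2 = [rng.gauss(0.0, 0.05) for _ in range(N_HIDDEN)]
b2 = 0.0

prob = mlp_forward(fingerprint, w1, b1, w2, b2)
print(f"score between 0 and 1: {prob:.3f}")
```

In a trained binary classifier of this shape, the sigmoid output would be read as the predicted probability that the molecule inhibits CYP3A4.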
from tdc import tdc_hf_interface
tdc_hf = tdc_hf_interface("CYP3A4_Veith-Morgan")
# Load the DeepPurpose model from this repo
dp_model = tdc_hf.load_deeppurpose('./data')
tdc_hf.predict_deeppurpose(dp_model, ['YOUR SMILES STRING'])
References
- Dataset entry in Therapeutics Data Commons, https://tdcommons.ai/single_pred_tasks/adme/#cyp-p450-3a4-inhibition-veith-et-al
- Veith, Henrike et al. “Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries.” Nature Biotechnology vol. 27,11 (2009): 1050-5.