AlphaMWang committed on
Commit
8b29ec4
·
verified ·
1 Parent(s): 32d3796

Update README.md

Files changed (1)
  1. README.md +337 -162
README.md CHANGED
@@ -4,33 +4,67 @@ license: afl-3.0
4
 
5
  Refer to [GitHub](https://github.com/Wang-Lin-boop/GeminiMol).
6
 
7
- This repository provides the official implementation of the GeminiMol model, training data, and utitiles.
8
 
9
- We also provide:
10
 
11
- 1. scripts for data collection, partitioning and analysis.
12
- 2. scripts for features analysis, visualisation and similarity calculation.
13
- 3. scripts, datasets and results for benchmarking molecular fingerprints and GeminiMol models on virtual screening, target identification, and QSAR (drug-target binding affinity, cellar activity, ADME, and toxicity).
14
 
15
- Please also refer to our paper for a detailed description of GeminiMol.
16
 
17
  ## 💡 Highlight
18
 
19
- * By capturing the intricate interplay between molecular structure and conformational space, our training strategy enhances the representational capacity of GeminiMol.
20
- * GeminiMol was pre-trained on only 37,336 molecular structures, yet it can generalize to zero-shot and QSAR tasks involving millions of molecules.
21
- * GeminiMol exhibits the capability to identify pairs of molecules with similar 3D active conformation, even in scenarios where their 2D structures exhibit significant differences.
22
-
23
- ## 💗 Motivation
24
-
25
- The **molecular representation model** is an emerging artificial intelligence technology for extracting features of small molecules. It has been **widely applied in drug discovery scenarios**, such as **virtual screening**, Quantitative Structure-Activity Relationship (**QSAR**) analysis, and **ADMET propteries prediction**. In previous work, molecular representation models were mostly trained on the static structure of molecules, however, the small molecules in solution are highly dynamic, and their flexible conformational changes endow them with the potential to bind to drug targets. Therefore, introducing information on small molecule conformational space into molecular representation models is a promising aim. In this work, a training strategy, named GeminiMol, was proposed to **incorporate the comprehension of conformational space into the molecular representation model**.
26
 
27
  ## 🔔 News
28
 
29
  * 2023-12, our paper has been uploaded to BioRxiv, you can find it [here](https://www.biorxiv.org/content/10.1101/2023.12.14.571629).
30
 
31
  ## 📕 Installation
32
 
33
- GeminiMol is a pytorch-based AI model. To set up the GeminiMol model, we recommend using conda for Python environment configuration.
34
 
35
  > Installing MiniConda (skip if conda was installed)
36
 
@@ -51,33 +85,44 @@ GeminiMol is a pytorch-based AI model. To set up the GeminiMol model, we recomme
51
  ``` shell
52
  git clone https://github.com/Wang-Lin-boop/GeminiMol
53
  cd GeminiMol/
54
- export PATH=${PWD}:\${PATH}" >> ~/.bashrc # optional, not required in the current version
55
- export GeminiMol=${PWD}" >> ~/.bashrc
56
- cd geminimol/
57
- export geminimol_app=${PWD}" >> ~/.bashrc
58
- cd ../model/
59
- export geminimol_lib=${PWD}" >> ~/.bashrc
60
- cd ../data/
61
- export geminimol_data=${PWD}" >> ~/.bashrc
62
  source ~/.bashrc
63
  ```
64
 
65
- ### Download datasets and models
66
 
67
- In this repository, we provide all the training, validation, and testing datasets used in our paper, as well as an optimal GeminiMol binary-encoder model, a series of CSS similarity decoder models, a molecular structure decoder model, and a variety of decoder models of basic ADMET properties.
68
 
69
- > Download all datasets, model parameters and weights via Zenodo
70
 
71
- ``` shell
72
- cd ${GeminiMol}/data
73
- wget https://zenodo.org/api/records/10273480/files-archive
74
- unzip *
75
- ```
76
 
77
- > Download model parameters and weights via [Google Driver](https://drive.google.com/drive/folders/183WGytS-zy_POlLxEvijEtarow56zmnz?usp=drive_link)
 
 
78
 
79
  Then, place the models in `${GeminiMol}/models`.
80
 
81
  The expected structure of GeminiMol path is:
82
 
83
  ```
@@ -98,27 +143,27 @@ GeminiMol
98
  │ ├── CrossEncoder_Training.py # scripts for training the CrossEncoders.
99
  │ ├── GeminiMol_Training.py # scripts for training the GeminiMol models.
100
  │ ├── benchmark.py # benchmarking representation methods on provided datasets
101
- ├── data # training and benchmark data in this work
102
- │ ├── Benchmark_DUD-E # virtual screeening
103
- │ ├── Benchmark_LIT-PCBA # virtual screeening
104
- │ ├── Benchmark_TIBD # target identification
105
- │ ├── Benchmark_QSAR # QSAR and ADMET
106
- │ ├── Chem_SmELECTRA # text backbone of chemical language
107
- │ ├── css_library # CSS training data
108
- │ ├── benchmark.json # dataset index for benchmark tasks
109
- │ ├── database.csv # molecular datasets in this work
110
- │ ├── BindingDB_DATA.csv # dataset used in target identification
111
- │ ├── ChemDiv.csv # library of common commercial compounds
112
- │ ├── Specs.csv # library of common commercial compounds
 
113
  ├── models # CrossEncoder and GeminiMol models
114
- │ ├── CrossEncoder # CrossEncoder
115
  │ ├── GeminiMol # GeminiMol, recommended for zero-shot tasks
116
- │ ├── GeminiMol-MOD # GeminiMol-MOD, recommended for QSAR tasks
117
  ```
118
 
119
- ### Installing the dependency packages
120
 
121
- If you intend to utilize molecular fingerprint baseline methods or conduct QSAR benchmarking, it is required to install RDKit and AutoGluon in advance.
122
 
123
  > Installing RDKit for generating fingerprints
124
 
@@ -126,146 +171,110 @@ If you intend to utilize molecular fingerprint baseline methods or conduct QSAR
126
  pip install rdkit
127
  ```
128
 
129
- > Installing the AutoGluon for performing AutoQSAR
130
-
131
- ``` shell
132
- pip3 install -U pip
133
- pip3 install -U setuptools wheel
134
- pip3 install torch==1.13.1+cu116 torchvision==0.14.1+cu116 \
135
- --extra-index-url https://download.pytorch.org/whl/cu116
136
- pip3 install autogluon==0.8.1
137
- ```
138
-
139
  > Installing the statistics and plotting packages
140
 
141
  ``` shell
142
- pip install oddt scikit-learn matplotlib
 
143
  ```
144
 
145
- To re-train the model or make predictions using the models we provide, follow the steps below to install the dependencies in advance.
146
-
147
  > Installing the dependency packages of GeminiMol
148
 
149
  ``` shell
150
- pip install scipy dgllife
151
  pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 \
152
  --extra-index-url https://download.pytorch.org/whl/cu116
153
- pip install dgl -f https://data.dgl.ai/wheels/cu116/repo.html
154
- pip install dglgo -f https://data.dgl.ai/wheels-test/repo.html
 
155
  ```
156
 
157
- ## 👐 Reproducing
158
-
159
- Here, we present the reproducible code for training the Cross-Encoder and GeminiMol models based on the conformational space similarity descriptors of 39,290 molecules described in the paper.
160
- Additionally, benchmark test scripts were provided. With this code, the community can reproduce the results reported in the paper, explore different model architectures, or incorporate additional molecular similarity data to further enhance the performance of the models.
161
-
162
- > Training the Cross-Encoder
163
 
164
  ``` shell
165
- conda activate GeminiMol
166
- export model_name="CrossEncoder"
167
- export batch_size_per_gpu=200 # batch size = 200 (batch_size_per_gpu) * 4 (gpu number)
168
- export epoch=20 # max epochs
169
- export lr="1.0e-3" # learning rate
170
- export label_list="MCMM1AM_MAX:LCMS2A1Q_MAX:MCMM1AM_MIN:LCMS2A1Q_MIN" # ShapeScore:ShapeAggregation:ShapeOverlap:CrossSim:CrossAggregation:CrossOverlap
171
- CUDA_VISIBLE_DEVICES=0,1,2,3 python ${geminimol_app}/CrossEncoder_Training.py "${geminimol_data}/css_library/" "${geminimol_data}/Chem_SmELECTRA" "${epoch}" "${lr}" "${batch_size_per_gpu}" "${model_name}" "${geminimol_data}/benchmark.json" "${label_list}"
172
  ```
173
 
174
- > Training the GeminiMol Encoder
175
 
176
- ``` shell
177
- conda activate GeminiMol
178
- export model_name="GeminiMol"
179
- export batch_size=512
180
- export epoch=20 # max epochs
181
- export patience=50 # for early stoping
182
- export GNN='WLN' # Weisfeiler-Lehman Network (WLN)
183
- export network="MeanMLP:2048:4:2048:None:0:5:0" # "Weighted:1024:12:2048:None:0:5:0" for GeminiMol-MOD
184
- export label_dict="ShapeScore:0.2,ShapeAggregation:0.2,ShapeOverlap:0.05,ShapeDistance:0.05,CrossSim:0.15,CrossAggregation:0.15,CrossDist:0.05,CrossOverlap:0.05,MCS:0.1"
185
- CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/GeminiMol_Training.py "${geminimol_data}/css_library/" "${epoch}" "${batch_size}" "${GNN}" "${network}" "${label_dict}" "${model_name}" "${patience}" "${geminimol_data}/benchmark.json"
186
- ```
187
 
188
- > Benchmarking molecular fingerprints and our models
189
 
190
- ``` shell
191
- conda activate GeminiMol
192
- # benchmarking Fixed GeminiMol models and Fingerprints
193
- for task in "DUDE" "LIT-PCBA" "TIBD" \
194
- "ADMET-C" "ADMET-R" \
195
- "LIT-QSAR" "CELLS-QSAR" "ST-QSAR" "PW-QSAR" \
196
- "PropDecoder-ADMET" "PropDecoder-QSAR"
197
- do
198
- for model_name in "CombineFP" \
199
- "FCFP6" "MACCS" "RDK" "ECFP6" "FCFP4" "TopologicalTorsion" "AtomPairs" "ECFP4" \
200
- "${geminimol_lib}/GeminiMol" "${geminimol_lib}/GeminiMol-MOD"
201
- do
202
- mkdir -p ${model_name}
203
- CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/benchmark.py "${model_name}" "${geminimol_data}/benchmark.json" "${task}"
204
- done
205
- done
206
- # benchmarking FineTuning GeminiMol models
207
- for task in "FineTuning-ADMET" "FineTuning-QSAR"; do
208
- for model_name in "${geminimol_lib}/GeminiMol" "${geminimol_lib}/GeminiMol-MOD"; do
209
- CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/benchmark.py "${model_name}" "${geminimol_data}/benchmark.json" "${task}"
210
- done
211
- done
212
- ```
213
 
214
- ## 📓 Application
215
 
216
- As a molecular representation model, GeminiMol finds applications in **ligand-based virtual screening, target identification, and quantitative structure-activity relationship (QSAR)** modeling of small molecular drugs.
217
 
218
- ![benchmark](imgs/benchmark.png)
219
 
220
- We have provided Cross-Encoder and GeminiMol models that can be used directly for inference. Here, we demonstrate the utilization of GeminiMol for virtual screening, target identification, and molecular property modeling.
221
 
222
- ### Virtual Screening
223
 
224
- In order to conduct virtual screening, it is essential to preassemble a collection of molecules that represent the pharmacological profile, including both active and non-active (optional) compounds, along with a library of compounds. These datasets should be saved in CSV format with specific column names.
225
 
226
- Note that the "**Label**" column is used to input the pharmacological profile. Ideally, you can input some **active** molecules and some **inactive** molecules that are similar to the active ones but lack activity. This will ensure that the selected molecules are as close as possible to the active molecules and simultaneously far from the inactive ones.
227
 
228
- Please note that the inactive molecules can refer to those lacking activity or those with **side effects** or **lower activity**.
229
 
230
- We have provided a processed version of the commercial Specs and ChemDiv compound library at the `${geminimol_data}/specs.csv` and `${geminimol_data}/ChemDiv.csv`, which contained 335,212 and 1,755,930 purchasable compounds. If you intend to utilize your own prepared compound library, please enable the "prepare" switch in the line 85 at the `Screener.py` script.
231
 
232
  ``` shell
233
  export job_name="Virtual_Screening"
234
- export decoy_set="decoys.csv" # SMILES, Title, and Label (optional)
235
- export compound_library="${geminimol_data}/ChemDiv.csv"
236
  export smiles_column="SMILES" # Specify the column name in the compound_library
237
- export id_column="Title" # Specify the column name in the compound_library
238
  export keep_top=1000
239
- CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/Screener.py "${geminimol_lib}/GeminiMol" "${job_name}" "${decoy_set}" "${compound_library}" "${keep_top}" "${smiles_column}" "${id_column}"
 
240
  ```
241
 
242
- The column denoting the SMILES representation of the compounds should be labeled as "**SMILES**", while the column indicating the activity label should be named "**Label**". Please assign the label "active" to the active molecules and "inactive" to the non-active molecules. Lastly, the column representing the molecule ID should be titled "**Title**".
243
-
244
- We restrict the use of column names to those specified in the designated compound library. This is primarily done to avoid confusion for novice users when modifying column names in large files. As for the decoy set, please ensure that the input CSV file contains at least two columns: SMILES and Title.
245
-
246
- ### Target Identification
247
-
248
- To conduct reverse virtual screening for target identification, it is essential to utilize a database that encompasses ligand-target relationships. This database should be structured with three columns: SMILES, Title, and **Targets**. The Targets column should specify the potential targets with which the drugs may interact.
249
-
250
- We have provided a processed version of the BindingDB database at the `${geminimol_data}/BindingDB_DATA.csv`, which contains 2,159,221 target-ligand paris.
251
 
252
  ``` shell
253
  export job_name="Target_Identification"
254
- export decoy_set="decoys.csv" # SMILES, Title, and Label (optional)
255
- export compound_library="${geminimol_data}/BindingDB_DATA.csv"
256
- export smiles_column="Ligand_SMILES" # Specify the column name in the compound_library
257
- export id_column="Monomer_ID" # Specify the column name in the compound_library
258
- export keep_top=100
259
- CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/Screener.py "${geminimol_lib}/GeminiMol" "${job_name}" "${decoy_set}" "${compound_library}" "${keep_top}" "${smiles_column}" "${id_column}"
 
260
  ```
261
 
262
- ### Molecular Proptery Modeling (QSAR and ADMET)
 
 
263
 
264
- > Prepare your datasets
265
 
266
  Before conducting molecular property modeling, it is crucial to carefully prepare your data, which includes compound structure pre-processing and dataset splitting.
267
 
268
- Firstly, you need to clarify the chirality and protonation states of molecules in the dataset, which can be done using chemical informatics tools such as RDKit or Schrödinger software package. The processed data should be saved in CSV file format, containing at least one column for **SMILES** and one column for **Labels**. Subsequently, utilize the following command for skeleton splitting. You can modify the script to change the splitting ratio, where by default, 70% of the dataset is used for training and 30% for validation and testing.
 
 
269
 
270
  ``` shell
271
  export dataset_path="data.csv"
@@ -278,42 +287,210 @@ mv ${dataset_name}_scaffold_*.csv ${dataset_name}/
278
  export task=${dataset_name}
279
  ```
280
 
 
 
281
  We have presented three approaches for molecular property modeling, namely AutoQSAR (broad applicability, slow speed), PropDecoder (fast speed), and FineTuning (optimal performance, moderate speed).
282
 
283
- In the majority of instances, the attainment of optimal performance can be accomplished through the utilization of the FineTuning script to invoke GeminiMol.
284
 
285
- > Fine-Tuning on downstream task
286
 
287
  ``` shell
288
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
289
  export smiles_column="SMILES" # Specify the column name in datasets
290
  export label_column="Label" # Specify the column name in datasets
291
- CUDA_VISIBLE_DEVICES=${gpu_id} python -u ${geminimol_app}/FineTuning.py "${task}" "${geminimol_lib}/GeminiMol" "${smiles_column}" "${label_column}" "${task}_GeminiMol"
292
  ```
293
 
294
- If the integration of molecular fingerprints and a pre-trained GeminiMol model is desired for training a molecular property prediction model, either PropDecoder or AutoQSAR can be employed.
295
 
296
- > PropDecoder
297
 
298
  ``` shell
299
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
300
- export fingerprints="ECFP4:AtomPairs:TopologicalTorsion:FCFP6:MACCS"
301
  export smiles_column="SMILES" # Specify the column name in datasets
302
  export label_column="Label" # Specify the column name in datasets
303
- CUDA_VISIBLE_DEVICES=${gpu_id} python -u ${geminimol_app}/PropDecoder.py "${task}" "${geminimol_lib}/GeminiMol:${fingerprints}" "${smiles_column}" "${label_column}" "${task}_GeminiMol"
304
  ```
305
 
306
- > AutoQSAR (AutoGluon)
307
 
308
  ``` shell
 
 
309
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
310
- export fingerprints="ECFP4:AtomPairs:TopologicalTorsion:FCFP6:MACCS"
311
  export smiles_column="SMILES" # Specify the column name in datasets
312
  export label_column="Label" # Specify the column name in datasets
313
- CUDA_VISIBLE_DEVICES=${gpu_id} python -u ${geminimol_app}/AutoQSAR.py "${task}" "${geminimol_lib}/GeminiMol:${fingerprints}" "${smiles_column}" "${label_column}" "" "${task}_GeminiMol"
314
  ```
315
 
316
- ## Citing this work
317
 
318
  **Conformational Space Profile Enhances Generic Molecular Representation Learning**
319
  Lin Wang, Shihang Wang, Hao Yang, Shiwei Li, Xinyu Wang, Yongqi Zhou, Siyuan Tian, Lu Liu, Fang Bai
@@ -321,21 +498,19 @@ bioRxiv 2023.12.14.571629; doi: https://doi.org/10.1101/2023.12.14.571629
321
 
322
  ## ✅ License
323
 
324
- GeminiMol is released under the Academic Free Licence, which permits academic use, modification and distribution free of charge, but prohibits unauthorised commercial use, including commercial training and as part of a paid computational platform. However, communication and authorization with [our supervisor](baifang@shanghaitech.edu.cn) is permitted for its application in pipeline development and research activities within pharmaceutical R&D.
325
 
326
- ## 💌 Get in Touch
327
 
328
- We welcome community contributions of extension tools based on the GeminiMol model, etc. If you have any questions not covered in this overview, please contact the [GeminiMol Developer Team](wanglin3@shanghaitech.edu.cn). We would love to hear your feedback and understand how GeminiMol has been useful in your research. Share your stories with us at wanglin3@shanghaitech.edu.cn or baifang@shanghaitech.edu.cn.
329
 
330
- In addition to GitHub, we offer a WeChat community to provide a forum for discussion between users. You can access the community's QR code by following the "蛋白矿工" on WeChat.
331
 
332
  ## 😃 Acknowledgements
333
 
334
- We appreciate the technical support provided by the engineers of the high-performance computing cluster of ShanghaiTech University. Lin Wang also thanks Jianxin Duan, Gaokeng Xiao, Quanwei Yu, Zheyuan Shen, Shenghao Dong, Huiqiong Li, Zongquan Li, and Fenglei Li for providing technical support, inspiration and help for this work.
335
-
336
- We appreciate the developers of AutoGluon and Deep Graph Library (DGL). We also thank the developers and maintainers of MarcoModel and PhaseShape modules in the Schrödinger package.
337
 
338
- Besides, GeminiMol communicates with and/or references the following separate libraries and packages, we thank all their contributors and maintainers!
339
 
340
  * [_RDKit_](https://www.rdkit.org/)
341
  * [_PyTorch_](https://pytorch.org/)
 
4
 
5
  Refer to [GitHub](https://github.com/Wang-Lin-boop/GeminiMol).
6
 
7
+ <h1 align="left"> GeminiMol </h1>
8
+ <h3 align="left"> Molecular Representation Model Enhanced by Conformational Space Profile </h3>
9
+ <p align="left">
10
+ 📃 <a href="https://www.biorxiv.org/content/10.1101/2023.12.14.571629" target="_blank">Paper</a> · 🤗 <a href="https://huggingface.co/AlphaMWang/GeminiMol" target="_blank">Model</a> · 📕 <a href="https://zenodo.org/records/10450788" target="_blank">Data</a><br>
11
+ </p>
12
+
13
+ <p align="right">
14
+ <img style="float: right" src="imgs/geminimol.png" alt="alt text" width="550px" align="right"/>
15
+ </p>
16
+
17
+ This repository provides the official implementation of the GeminiMol model, training data, and utilities. In this work, we propose a hybrid contrastive learning framework, which conducts **inter-molecular contrastive learning** by multiple projection heads of **conformational space similarities (CSS)**. Please also refer to our [paper](https://doi.org/10.1101/2023.12.14.571629) for a detailed description of GeminiMol.
18
+
19
+ ## Table of Contents
20
+ - [Table of Contents](#table-of-contents)
21
+ - [💗 Motivation](#-motivation)
22
+ - [💡 Highlight](#-highlight)
23
+ - [🔔 News](#-news)
24
+ - [😫 Limitations](#-limitations)
25
+ - [📕 Installation](#-installation)
26
+ - [Download datasets and models](#download-datasets-and-models)
27
+ - [Installing the dependency packages](#installing-the-dependency-packages)
28
+ - [📓 Application](#-application)
29
+ - [Virtual Screening and Target Identification](#virtual-screening-and-target-identification)
30
+ - [Molecular Proptery Modeling (QSAR and ADMET)](#molecular-proptery-modeling-qsar-and-admet)
31
+ - [Molecular Clustering](#molecular-clustering)
32
+ - [Extract Molecular Features (GeminiMol Encoding)](#extract-molecular-features-geminimol-encoding)
33
+ - [👐 Reproducing](#-reproducing)
34
+ - [Download Training and Benchmark Datasets](#download-training-and-benchmark-datasets)
35
+ - [Re-training our models](#re-training-our-models)
36
+ - [Benchmarking the fingerprints and our models](#benchmarking-the-fingerprints-and-our-models)
37
+ - [⭐ Citing This Work](#-citing-this-work)
38
+ - [✅ License](#-license)
39
+ - [💌 Get in Touch](#-get-in-touch)
40
+ - [😃 Acknowledgements](#-acknowledgements)
41
 
 
42
 
43
+ ## 💗 Motivation
 
 
44
 
45
+ The **molecular representation model** is an emerging artificial intelligence technology for extracting features of small molecules. Inspired by the dynamics of small molecules in solution, introducing the **conformational space profile** into molecular representation models is a promising aim. The conformational space profile covers the heterogeneity of molecular properties, such as the multi-target mechanism of drug action, recognition of different biomolecules, and dynamics in the cytoplasm and membrane, which may improve the downstream applicability and generalization capability of molecular representation models.
46
 
47
  ## 💡 Highlight
48
 
49
+ * GeminiMol exhibits the capability to **identify molecular pairs with similar 3D active conformers**, even in scenarios where their 2D structures exhibit significant differences.
50
+ * GeminiMol was pre-trained on only 37,336 molecular structures, yet it can **generalize** to zero-shot and QSAR tasks involving millions of molecules.
51
+ * GeminiMol shows **balanced performance** across various applications, including virtual screening, target identification, and cellular phenotype-based property modeling.
 
 
 
 
52
 
53
  ## 🔔 News
54
 
55
  * 2023-12, our paper has been uploaded to BioRxiv, you can find it [here](https://www.biorxiv.org/content/10.1101/2023.12.14.571629).
56
+ * 2024-01, we have released `PharmProfiler.py`, which facilitates virtual screening and target identification.
57
+ * 2024-03, we have released `PropPredictor.py`, which facilitates the deployment and repurposing of QSAR and ADMET prediction models.
58
+
59
+ ## 😫 Limitations
60
+
61
+ * Note that the conformational space profile is **not a panacea** for drug discovery. For some tasks, the 2D structure of a compound already contains sufficient information to establish structure-activity relationships, so introducing the conformational space profile adds little for these tasks.
62
+ * The evaluation of intermolecular similarity is not limited to pharmacophore similarity in 3D conformational space and maximum common substructure similarity in 2D structures. By incorporating **additional intermolecular similarity metrics** during pre-training, we can further enrich the knowledge that the model can learn, such as molecular fingerprints and molecular surface potentials.
63
+ * Due to computational resource limitations, we only included 39,290 molecules in our pre-training. It is foreseeable that incorporating **more molecular structures** during pre-training could further enhance the performance of GeminiMol, particularly when guided by drug-target relationships to obtain high-quality data.
64
 
65
  ## 📕 Installation
66
 
67
+ GeminiMol is a PyTorch-based AI model. To set up the GeminiMol model, we recommend using conda for Python environment configuration. If you encounter any problems with the installation, please feel free to open an issue or start a discussion.
68
 
69
  > Installing MiniConda (skip if conda was installed)
70
 
 
85
  ``` shell
86
  git clone https://github.com/Wang-Lin-boop/GeminiMol
87
  cd GeminiMol/
88
+ echo "# GeminiMol" >> ~/.bashrc
89
+ echo "export PATH=\"${PWD}:\${PATH}\"" >> ~/.bashrc # optional, not required in the current version
90
+ echo "export GeminiMol=\"${PWD}\"" >> ~/.bashrc
91
+ source ~/.bashrc
92
+ echo "export geminimol_app=\"${GeminiMol}/geminimol\"" >> ~/.bashrc # geminimol applications
93
+ echo "export geminimol_lib=\"${GeminiMol}/models\"" >> ~/.bashrc # geminimol models
94
+ echo "export geminimol_data=\"${GeminiMol}/data\"" >> ~/.bashrc # compound library
 
95
  source ~/.bashrc
96
  ```
97
 
98
+ #### Download datasets and models
99
 
100
+ In this repository, we provide the pre-trained GeminiMol and CrossEncoder models.
101
 
102
+ > Download model parameters and weights via [Google Drive](https://drive.google.com/drive/folders/183WGytS-zy_POlLxEvijEtarow56zmnz?usp=drive_link) and [HuggingFace](https://huggingface.co/AlphaMWang)
103
 
104
+ Here is an example of how to download a model from Hugging Face. Besides `git clone`, you can also download the model directly from Google Drive or Hugging Face using your browser.
 
 
 
 
105
 
106
+ ``` bash
107
+ git clone https://huggingface.co/AlphaMWang/GeminiMol
108
+ ```
109
 
110
  Then, place the models in `${GeminiMol}/models`.
111
 
112
+ > Download all chemical datasets via [Zenodo](https://zenodo.org/records/10450788) for applications
113
+
114
+ ``` shell
115
+ cd ${geminimol_data}
116
+ wget https://zenodo.org/records/10450788/files/ChemDiv.zip # compound library for virtual screening
117
+ wget https://zenodo.org/records/10450788/files/DTIDB.zip # DTI database for target identification
118
+ for i in Benchmark*.zip css*.zip Chem*.zip;do
119
+ mkdir ${i%%.zip}
120
+ unzip -d ${i%%.zip}/ $i
121
+ done
122
+ unzip -d compound_library/ ChemDiv.zip
123
+ unzip -d compound_library/ DTIDB.zip
124
+ ```
125
+
126
  The expected structure of GeminiMol path is:
127
 
128
  ```
 
143
  │ ├── CrossEncoder_Training.py # scripts for training the CrossEncoders.
144
  │ ├── GeminiMol_Training.py # scripts for training the GeminiMol models.
145
  │ ├── benchmark.py # benchmarking representation methods on provided datasets
146
+ ├── data # training and benchmark datasets in this work
147
+ │ ├── Benchmark_DUD-E # virtual screening benchmark, optional
148
+ │ ├── Benchmark_LIT-PCBA # virtual screening benchmark, optional
149
+ │ ├── Benchmark_TIBD # target identification benchmark, optional
150
+ │ ├── Benchmark_QSAR # QSAR and ADMET benchmarks, optional
151
+ │ ├── Chem_SmELECTRA # text backbone of chemical language, optional
152
+ │ ├── css_library # CSS training data, optional
153
+ │ ├── benchmark.json # dataset index for benchmark tasks, optional
154
+ │ ├── database.csv # molecular datasets in this work, optional
155
+ │ ├── compound_library # the compound libraries
156
+ │ │ ├── DTIDB.csv # dataset used in target identification
157
+ │ │ ├── ChemDiv.csv # library of common commercial compounds
158
+ │ │ ├── Specs.csv # library of common commercial compounds
159
  ├── models # CrossEncoder and GeminiMol models
160
+ │ ├── CrossEncoder # CrossEncoder, optional
161
  │ ├── GeminiMol # GeminiMol, recommended for zero-shot tasks
 
162
  ```
163
 
164
+ #### Installing the dependency packages
165
 
166
+ Before running GeminiMol, you need to install the basic dependency packages.
167
 
168
  > Installing RDKit for generating fingerprints
169
 
 
171
  pip install rdkit
172
  ```
173
 
174
  > Installing the statistics and plotting packages
175
 
176
  ``` shell
177
+ pip install six
178
+ pip install oddt scikit-learn matplotlib scipy==1.10.1
179
  ```
180
 
 
 
181
  > Installing the dependency packages of GeminiMol
182
 
183
  ``` shell
 
184
  pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 \
185
  --extra-index-url https://download.pytorch.org/whl/cu116
186
+ pip install dgl==1.1.1+cu116 -f https://data.dgl.ai/wheels/cu116/repo.html
187
+ pip install dglgo==0.0.2 -f https://data.dgl.ai/wheels-test/repo.html
188
+ pip install dgllife==0.3.2
189
  ```
190
 
191
+ If you intend to reproduce the benchmark results in our work, you also need to install AutoGluon.
 
 
 
 
 
192
 
193
  ``` shell
194
+ pip install autogluon==0.8.1 # required for AutoQSAR
 
 
 
 
 
 
195
  ```
196
 
197
+ ## 📓 Application
198
 
199
+ As a molecular representation model, GeminiMol finds applications in **ligand-based virtual screening, target identification, and quantitative structure-activity relationship (QSAR)** modeling of small molecular drugs.
200
 
201
+ ![benchmark](imgs/benchmark.png)
202
 
203
+ We have provided Cross-Encoder and GeminiMol models that can be used directly for inference. Here, we demonstrate the utilization of GeminiMol for virtual screening, target identification, and molecular property modeling.
204
 
205
+ Please note that while molecular fingerprints are considered simple molecular representation methods, they are an indispensable baseline (see our [paper](https://www.biorxiv.org/content/10.1101/2023.12.14.571629)). When conducting your drug development project, we recommend exploring ECFP4, CombineFP, and GeminiMol, all of which are available in `PharmProfiler.py` and in the various molecular property modeling scripts.
206
 
207
+ #### Virtual Screening and Target Identification
208
 
209
+ Conceptually, molecules that share a similar conformational space also tend to share similar biological activities, allowing us to predict the similarity of biological activities between molecules by comparing the similarity of their GeminiMol encodings.
210
+
211
+ Here, we introduce the ``PharmProfiler.py``, a novel approach that employs the GeminiMol encoding to establish pharmacological profiles and facilitate the search for molecules with specific properties in chemical space.
212
+
213
+ ``PharmProfiler.py`` offers the capability to conduct ligand-based virtual screening using commercially available compound libraries. Furthermore, it enables target identification through ligand similarity analysis by leveraging comprehensive drug-target relationship databases.
214
+
215
+ To support experimentation, we have included a collection of diverse commercial compound libraries and drug-target relationship databases, conveniently located in the `${geminimol_data}/compound_library/` directory.
216
+
217
+ > 1. Prepare the pharmacological profile and compound libraries
218
+
219
+ To define a pharmacological profile, you will need to input a `profile.csv` file, which should have the following format:
220
+
221
+ ```
222
+ SMILES,Label
223
+ C=CC(=O)N[C@@H]1CN(c2nc(Nc3cn(C)nc3OC)c3ncn(C)c3n2)C[C@H]1F,1.0
224
+ C=CC(=O)Nc1cccc(Nc2nc(Nc3ccc(N4CCN(C(C)=O)CC4)cc3OC)ncc2C(F)(F)F)c1,1.0
225
+ C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1,1.0
226
+ COC(=O)CCC/N=C1\SCCN1Cc1ccccc1,0.4
227
+ C=C(C)[C@@H]1C[C@@H](CC2(CC=C(C)C)C(=O)C(C(CC(=O)O)c3ccccc3)=C3O[C@@H](C)[C@@H](C)C(=O)C3=C2O)C1(C)C,-0.8
228
+ C/C(=C\c1ncccc1C)[C@@H]1C[C@@H]2O[C@]2(C)CCC[C@H](C)[C@H](O)[C@@H](C)C(=O)C(C)(C)[C@@H](O)CC(=O)O1,-0.5
229
+ ```
230
 
231
+ The "Label" column signifies the weight assigned to the reference compound. Positive values indicate that the selected compounds should bear resemblance to the reference compound, while negative values imply that the selected compounds should be dissimilar to the reference compound. Typically, positive values are assigned to **active** compounds, whereas negative values are assigned to **inactive** compounds or those causing **side effects**.
232
 
233
+ The compound libraries are also stored in CSV format in the `${geminimol_data}/compound_library/` directory. The SMILES column name in the `profile.csv` file is required to match the one in the compound library.
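
As a minimal sketch of this preparation step, the Python snippet below writes a profile in the `SMILES,Label` layout shown above and checks that the same SMILES column name appears in a compound library header. The helpers `write_profile` and `shares_smiles_column` are hypothetical and not part of GeminiMol, and the library content and the `CCO` row are made-up stand-ins.

```python
import csv
import io


def write_profile(handle, references, smiles_column="SMILES", weight_column="Label"):
    """Write (smiles, weight) pairs in the profile.csv layout."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([smiles_column, weight_column])
    for smiles, weight in references:
        # Positive weights favor similarity to a reference, negative ones penalize it.
        writer.writerow([smiles, float(weight)])


def shares_smiles_column(profile_handle, library_handle, smiles_column="SMILES"):
    """Return True if both CSV headers contain the same SMILES column name."""
    profile_header = next(csv.reader(profile_handle))
    library_header = next(csv.reader(library_handle))
    return smiles_column in profile_header and smiles_column in library_header


# In-memory stand-ins for profile.csv and a compound library file.
profile = io.StringIO()
write_profile(profile, [
    ("C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1", 1.0),  # active reference from the example above
    ("CCO", -0.5),                                         # hypothetical compound to avoid
])
profile.seek(0)
library = io.StringIO("ID,SMILES\nCPD-1,CCO\n")  # hypothetical library content
print(shares_smiles_column(profile, library))
```

Running such a check before a long screening job is a cheap way to catch a mismatched column name early.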
234
 
235
+ > 2. Run PharmProfiler
236
 
237
+ To perform virtual screening, the following command can be used.
238
 
239
+ Here, `profile_set` is the pharmacological profile provided by the user, `keep_top` is the number of compounds to output, and `probe_cluster` determines whether reference compounds with the same weight are treated as a cluster. Each compound in a cluster is compared individually with the query molecule, and the highest similarity score is taken as the score of the query molecule.
240
 
241
+ We provide processed versions of the commercial Specs and ChemDiv compound libraries at `${geminimol_data}/compound_library/specs.csv` and `${geminimol_data}/compound_library/ChemDiv.csv`, which contain 335,212 and 1,755,930 purchasable compounds, respectively.
242
 
243
  ``` shell
244
  export job_name="Virtual_Screening"
245
+ export profile_set="profile.csv" # SMILES (same as compound library) and Label (required)
246
+ export compound_library="${geminimol_data}/compound_library/ChemDiv.csv"
247
  export smiles_column="SMILES" # Specify the column name in the compound_library
248
+ export weight_column="Label" # weights for profiles
249
  export keep_top=1000
250
+ export probe_cluster="Yes"
251
+ python -u ${geminimol_app}/PharmProfiler.py "${geminimol_lib}/GeminiMol" "${job_name}" "${smiles_column}" "${compound_library}" "${profile_set}:${weight_column}" "${keep_top}" "${probe_cluster}"
252
  ```
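
The role of `probe_cluster` described above can be illustrated with a small sketch. It assumes that similarity is the cosine similarity between encoding vectors and that the final score is the weight-scaled sum over reference compounds or clusters. Both are assumptions made for illustration, and `profile_score` is a hypothetical helper rather than the `PharmProfiler.py` implementation.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def profile_score(query, probes, probe_cluster=True):
    """Score one query encoding against (encoding, weight) reference probes.

    With probe_cluster=True, probes sharing a weight form a cluster and only
    the best match in each cluster counts; otherwise every probe counts.
    """
    if not probe_cluster:
        return sum(w * cosine(query, enc) for enc, w in probes)
    best = defaultdict(lambda: -1.0)  # cosine similarity is never below -1
    for enc, w in probes:
        best[w] = max(best[w], cosine(query, enc))
    return sum(w * sim for w, sim in best.items())


# Toy 2-D encodings; real encodings have many more dimensions.
probes = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([-1.0, 0.0], -0.5)]
print(profile_score([1.0, 1.0], probes, probe_cluster=True))
print(profile_score([1.0, 1.0], probes, probe_cluster=False))
```

In this toy setup, clustering keeps a group of several similar actives from dominating the score simply because the group is large.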
253
 
254
+ To perform target identification, replace the compound library with `${geminimol_data}/compound_library/DTIDB.csv`, which contains drug-target relationships. This is a processed version of the BindingDB database and contains 2,159,221 target-ligand pairs.
 
255
 
256
  ``` shell
257
  export job_name="Target_Identification"
258
+ export profile_set="profile.csv" # Ligand_SMILES (same as compound library), and Label (required)
259
+ export compound_library="${geminimol_data}/compound_library/DTIDB.csv"
260
+ export smiles_column="SMILES" # Specify the column name in the compound_library
261
+ export weight_column="Label" # weights for profiles
262
+ export keep_top=2000
263
+ export probe_cluster="No"
264
+ python -u ${geminimol_app}/PharmProfiler.py "${geminimol_lib}/GeminiMol" "${job_name}" "${smiles_column}" "${compound_library}" "${profile_set}:${weight_column}" "${keep_top}" "${probe_cluster}"
265
  ```
266
 
267
+ After the initial run of PharmProfiler, an extracted GeminiMol feature file is generated in `${geminimol_data}/compound_library/`. Subsequent screening tasks on the same compound library run faster because PharmProfiler automatically reads this feature file.
268
+
269
+ #### Molecular Property Modeling (QSAR and ADMET)
270
 
271
+ > 1. Prepare your datasets
272
 
273
  Before conducting molecular property modeling, it is crucial to carefully prepare your data, which includes compound structure pre-processing and dataset splitting.
274
 
275
+ First, clarify the chirality and protonation states of the molecules in the dataset, which can be done using cheminformatics tools such as RDKit or the Schrödinger software package. Omitting this pre-processing typically does not raise an error, but it may impair the performance of GeminiMol.
276
+
277
+ The processed data should be saved as a CSV file containing at least one **`SMILES`** column and one **`Label`** column. Then use the following command for skeleton splitting. You can modify the script to change the splitting ratio; by default, 70% of the dataset is used for training and 30% for validation and testing.
278
 
279
  ``` shell
280
  export dataset_path="data.csv"
 
287
  export task=${dataset_name}
288
  ```
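
For readers who want to see the idea behind the split, here is a minimal sketch of group-based splitting. It assumes that skeleton splitting means molecules with the same skeleton (scaffold) are never divided between the training set and the held-out sets. The function `skeleton_split` and the `skeleton_of` callable are hypothetical; in practice the callable could wrap a cheminformatics toolkit such as RDKit, and the script shipped with GeminiMol may differ in its details.

```python
from collections import defaultdict


def skeleton_split(smiles_list, skeleton_of, train_fraction=0.7):
    """Split molecules so that each skeleton group lands in exactly one subset.

    skeleton_of: callable mapping a SMILES string to its skeleton key.
    Returns (train, heldout) lists of SMILES.
    """
    groups = defaultdict(list)
    for smiles in smiles_list:
        groups[skeleton_of(smiles)].append(smiles)

    # Fill the training set with the largest groups first; a group that would
    # overshoot the target size goes to the held-out set as a whole.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train, heldout = [], []
    target = train_fraction * len(smiles_list)
    for group in ordered:
        if len(train) + len(group) <= target:
            train.extend(group)
        else:
            heldout.extend(group)
    return train, heldout


# Toy skeleton key: treat the first character as the "skeleton".
molecules = ["c1ccccc1O", "c1ccccc1N", "c1ccccc1C", "CCO", "CCN", "NCC", "OCC",
             "c1ccccc1F", "CCC", "c1ccccc1Cl"]
train, heldout = skeleton_split(molecules, skeleton_of=lambda s: s[0])
print(len(train), len(heldout))
```

The held-out portion can then be divided again in the same way to obtain validation and test sets.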
289
 
290
+ > 2. Training the molecular property prediction model
291
+
292
  We have presented three approaches for molecular property modeling, namely AutoQSAR (broad applicability, slow speed), PropDecoder (fast speed), and FineTuning (optimal performance, moderate speed).
293
 
294
+ If you have experience with hyperparameter tuning, the FineTuning script can achieve optimal performance with GeminiMol. Otherwise, AutoQSAR is recommended.
295
 
296
+ > 2.1 Fine-Tuning on downstream task
297
 
298
  ``` shell
299
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
300
  export smiles_column="SMILES" # Specify the column name in datasets
301
  export label_column="Label" # Specify the column name in datasets
302
+ python -u ${geminimol_app}/FineTuning.py "${task}" "${geminimol_lib}/GeminiMol" "${smiles_column}" "${label_column}" "${task}_GeminiMol"
303
  ```
304
 
305
+ > 2.2 AutoQSAR (AutoGluon)
306
 
307
+ If you lack deep learning experience, we recommend using AutoQSAR with CombineFP or GeminiMol, which usually produces a model with good performance.
308
+
309
+ ``` shell
310
+ export encoder_method="${geminimol_lib}/GeminiMol" # only GeminiMol
311
+ ```
312
+
313
+ In our paper, we introduced a powerful joint molecular fingerprint baseline named CombineFP. In our experiments, CombineFP performed very well in molecular property modeling, and we highly recommend trying CombineFP along with GeminiMol.
314
+
315
+ ``` shell
316
+ export encoder_method="ECFP4:AtomPairs:TopologicalTorsion:FCFP6" # CombineFP
317
+ ```
318
+
319
+ Having defined the encoder, you can use AutoQSAR to train a model that converts the molecular encoding into properties. A potential advantage of this approach over FineTuning is that it can decode diverse molecular properties from the same fixed encoding, which speeds up chemical space searching.
320
 
321
  ``` shell
322
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
 
323
  export smiles_column="SMILES" # Specify the column name in datasets
324
  export label_column="Label" # Specify the column name in datasets
325
+ python -u ${geminimol_app}/AutoQSAR.py "${task}" "${encoder_method}" "${smiles_column}" "${label_column}" "" "${task}_GeminiMol"
326
  ```
327
 
328
+ To combine molecular fingerprints with a pre-trained GeminiMol model when training a molecular property prediction model, use either PropDecoder or AutoQSAR.
329
 
330
  ``` shell
331
+ export fingerprints="ECFP4:AtomPairs:TopologicalTorsion:FCFP6:MACCS" # CombineFP+MACCS
332
+ export encoder_method="${geminimol_lib}/GeminiMol:${fingerprints}" # CombineFP+MACCS+GeminiMol
333
  export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
 
334
  export smiles_column="SMILES" # Specify the column name in datasets
335
  export label_column="Label" # Specify the column name in datasets
336
+ python -u ${geminimol_app}/AutoQSAR.py "${task}" "${encoder_method}" "${smiles_column}" "${label_column}" "" "${task}_GMFP"
337
  ```
338
 
339
+ > 2.3 PropDecoder
340
+
341
+ For most tasks, fine-tuning or AutoQSAR gives good performance in molecular property modeling, so you do not need to try PropDecoder unless the first two perform poorly.
342
+
343
+ ``` shell
344
+ export task="Your_Dataset" # Specify a path to your datasets (train, valid, and test)
345
+ export smiles_column="SMILES" # Specify the column name in datasets
346
+ export label_column="Label" # Specify the column name in datasets
347
+ python -u ${geminimol_app}/PropDecoder.py "${task}" "${encoder_method}" "${smiles_column}" "${label_column}" "${task}_GeminiMol"
348
+ ```
349
+
350
+ > 3. Make predictions (only for AutoQSAR or fine-tuned models)
351
+
352
+ Next, we can load a model trained with `AutoQSAR` or `FineTuning` to predict molecular properties in a new dataset.
353
+
354
+ ``` shell
355
+ export model_path="QSAR_GeminiMol" # ${task}_GeminiMol from when you built the QSAR model
356
+ export encoder_method="${geminimol_lib}/GeminiMol" # Match to the encoders selected during QSAR model training
357
+ export extrnal_data="dataset.csv" # must contain the ${smiles_column}
358
+ export smiles_column="SMILES" # Specify the column name in datasets
359
+ export model_type="FineTuning" # FineTuning, PropDecoder, ['LightGBM', 'LightGBMLarge', 'LightGBMXT', 'NeuralNetTorch'] for AutoQSAR
360
+ python -u ${geminimol_app}/PropPredictor.py "${model_path}" "${encoder_method}" "${extrnal_data}" "${smiles_column}" "${model_type}"
361
+ ```
362
+
363
+ If you have constructed a regression model using AutoQSAR, refer to the following command.
364
+
365
+ ``` shell
366
+ export model_path="QSAR_GeminiMol" # ${task}_GeminiMol from when you built the QSAR model
367
+ export encoder_method="${geminimol_lib}/GeminiMol" # Match to the encoders selected during QSAR model training
368
+ export extrnal_data="dataset.csv" # must contain the ${smiles_column}
369
+ export smiles_column="SMILES" # Specify the column name in datasets
370
+ export model_type="NeuralNetTorch" # ['LightGBM', 'LightGBMLarge', 'LightGBMXT', 'NeuralNetTorch'] for AutoQSAR
371
+ export task_type="regression"
372
+ python -u ${geminimol_app}/PropPredictor.py "${model_path}" "${encoder_method}" "${extrnal_data}" "${smiles_column}" "${model_type}" "${task_type}"
373
+ ```
374
+
375
+ #### Molecular Clustering
376
+
377
+ You can use GeminiMol to cluster molecules just like molecular fingerprints!
378
+
379
+ ``` shell
380
+ export encoder_method="${geminimol_lib}/GeminiMol" # Match to the encoders selected during QSAR model training
381
+ export data_table="dataset.csv" # must contain the ${smiles_column}
382
+ export smiles_column="SMILES" # Specify the column name in datasets
383
+ export output_fn="Cluster"
384
+ export cluster_num=10 # only for clustering algorithms that need a preset number of clusters, such as K-Means
385
+ python -u ${geminimol_app}/Analyzer.py "${data_table}" "${encoder_method}" "${smiles_column}" "${output_fn}" "cluster:${cluster_num}"
386
+ ```
387
+
388
+ #### Extract Molecular Features (GeminiMol Encoding)
389
+
390
+ You can use GeminiMol or molecular fingerprints to extract molecular features for further analysis.
391
+
392
+ ``` shell
393
+ export encoder_method="${geminimol_lib}/GeminiMol" # Match to the encoders selected during QSAR model training
394
+ export data_table="dataset.csv" # must contain the ${smiles_column}
395
+ export smiles_column="SMILES" # Specify the column name in datasets
396
+ export output_fn="${data_table%%.*}_Encoding"
397
+ python -u ${geminimol_app}/Analyzer.py "${data_table}" "${encoder_method}" "${smiles_column}" "${output_fn}" "encode"
398
+ ```
399
+
400
+ ## 👐 Reproducing
401
+
402
+ Here, we present the reproducible code for training the Cross-Encoder and GeminiMol models based on the CSS descriptors of 39,290 molecules described in the paper.
403
+
404
+ #### Download Training and Benchmark Datasets
405
+
406
+ > Download all datasets via [Zenodo](https://zenodo.org/records/10450788) for training, testing, and benchmarking
407
+
408
+ ``` shell
409
+ cd ${geminimol_data}
410
+ wget https://zenodo.org/records/10450788/files/css_library.zip # only for reproducing GeminiMol training
411
+ wget https://zenodo.org/records/10450788/files/Benchmark_DUD-E.zip # only for reproducing benchmark
412
+ wget https://zenodo.org/records/10450788/files/Benchmark_LIT-PCBA.zip # only for reproducing benchmark
413
+ wget https://zenodo.org/records/10450788/files/Benchmark_QSAR.zip # only for reproducing benchmark
414
+ wget https://zenodo.org/records/10450788/files/Benchmark_TIBD.zip # only for reproducing benchmark
415
+ wget https://zenodo.org/records/10450788/files/Chem_SmELECTRA.zip # only for reproducing cross-encoder baseline
416
+ ```
417
+
418
+ #### Re-training our models
419
+
420
+ > Training the Cross-Encoder
421
+
422
+ ``` shell
423
+ conda activate GeminiMol
424
+ export model_name="CrossEncoder"
425
+ export batch_size_per_gpu=200 # batch size = 200 (batch_size_per_gpu) * 4 (gpu number)
426
+ export epoch=20 # max epochs
427
+ export lr="1.0e-3" # learning rate
428
+ export label_list="MCMM1AM_MAX:LCMS2A1Q_MAX:MCMM1AM_MIN:LCMS2A1Q_MIN" # ShapeScore:ShapeAggregation:ShapeOverlap:CrossSim:CrossAggregation:CrossOverlap
429
+ CUDA_VISIBLE_DEVICES=0,1,2,3 python ${geminimol_app}/CrossEncoder_Training.py "${geminimol_data}/css_library/" "${geminimol_data}/Chem_SmELECTRA" "${epoch}" "${lr}" "${batch_size_per_gpu}" "${model_name}" "${geminimol_data}/benchmark.json" "${label_list}"
430
+ ```
431
+
432
+ > Training the GeminiMol Encoder and Decoder of CSS descriptors
433
+
434
+ ``` shell
435
+ conda activate GeminiMol
436
+ export model_name="GeminiMol"
437
+ export batch_size=512
438
+ export epoch=20 # max epochs
439
+ export patience=50 # for early stopping
440
+ export GNN='WLN' # Weisfeiler-Lehman Network (WLN)
441
+ export network="MeanMLP:2048:4:2048:None:0:5:0"
442
+ export label_dict="ShapeScore:0.2,ShapeAggregation:0.2,ShapeOverlap:0.05,ShapeDistance:0.05,CrossSim:0.15,CrossAggregation:0.15,CrossDist:0.05,CrossOverlap:0.05,MCS:0.1"
443
+ CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/GeminiMol_Training.py "${geminimol_data}/css_library/" "${epoch}" "${batch_size}" "${GNN}" "${network}" "${label_dict}" "${model_name}" "${patience}" "${geminimol_data}/benchmark.json"
444
+ ```
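
The `label_dict` string above pairs each CSS descriptor with a numeric weight. As a quick sanity check before launching a long training run, the sketch below parses the string and sums the weights. `parse_label_dict` is a hypothetical helper, and the training script may parse this argument differently.

```python
import math


def parse_label_dict(text):
    """Parse 'name:weight,name:weight,...' into a dict of floats."""
    weights = {}
    for item in text.split(","):
        name, value = item.split(":")
        weights[name.strip()] = float(value)
    return weights


label_dict = ("ShapeScore:0.2,ShapeAggregation:0.2,ShapeOverlap:0.05,"
              "ShapeDistance:0.05,CrossSim:0.15,CrossAggregation:0.15,"
              "CrossDist:0.05,CrossOverlap:0.05,MCS:0.1")
weights = parse_label_dict(label_dict)
total = sum(weights.values())
print(len(weights), "labels; weights sum to 1:", math.isclose(total, 1.0))
```

A check like this makes it easier to notice a dropped or mistyped label when the weights are edited by hand.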
445
+
446
+ #### Benchmarking the fingerprints and our models
447
+
448
+ Additionally, we provide benchmark test scripts. With this code, the community can reproduce the results reported in the paper, explore different model architectures, or even incorporate additional molecular similarity data to further enhance the performance of the models.
449
+
450
+ > Benchmarking molecular fingerprints and GeminiMol on virtual screening and target identification
451
+
452
+ For each molecular fingerprint, we used all supported similarity metrics, including Tanimoto, Cosine, and Tversky. For the GeminiMol model, in addition to the projected heads used in pre-training, we introduced similarities between molecular representation vectors, including Cosine and Pearson. It is worth noting that in practice we cannot be sure which combination of molecular fingerprints and similarity metrics is optimal, and therefore each combination is considered an independent method in benchmarking.
453
+
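For reference, the similarity metrics named above can be sketched in plain Python. Tanimoto and Tversky are shown on sets of on-bit indices (as for binary fingerprints), while Cosine and Pearson operate on dense vectors. The Tversky parameters `alpha` and `beta` are placeholders, since the values used in the benchmark are not stated here, and the toy inputs are made up.

```python
import math


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index; alpha = beta = 1 recovers Tanimoto."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation: cosine similarity of mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([x - mu for x in u], [y - mv for y in v])


fp_a, fp_b = {1, 4, 7, 9}, {1, 4, 8}
vec_a, vec_b = [0.2, 0.9, 0.4], [0.1, 0.8, 0.7]
print(tanimoto(fp_a, fp_b), tversky(fp_a, fp_b))
print(cosine(vec_a, vec_b), pearson(vec_a, vec_b))
```
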
454
+ ``` shell
455
+ conda activate GeminiMol
456
+ # benchmarking Fixed GeminiMol models and Fingerprints
457
+ for task in "DUDE" "LIT-PCBA" "TIBD" # zero-shot tasks
458
+ do
459
+ for model_name in "FCFP6" "MACCS" "RDK" "ECFP6" "FCFP4" \
460
+ "TopologicalTorsion" "AtomPairs" "ECFP4" \
461
+ "${geminimol_lib}/GeminiMol"
462
+ do
463
+ mkdir -p ${model_name}
464
+ CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/benchmark.py "${model_name}" "${geminimol_data}/benchmark.json" "${task}"
465
+ done
466
+ done
467
+ ```
468
+
469
+ > Benchmarking molecular fingerprints and GeminiMol on molecular property modeling
470
+
471
+ It is worth noting that different decoders exhibit varying performance on different tasks and encodings. Therefore, it is essential to select the appropriate decoder for each specific molecular encoder and task. In practice, we can determine when the model should stop training and choose the optimal decoder architecture by splitting the data into training, validation and test sets. Consequently, all results should be merged using a pivot table to identify the optimal decoder for each encoder-task combination. In our work, the hyperparameters of the PropDecoder were chosen based on empirical experience and were not tuned. Further hyperparameter tuning for each task may yield improved performance.
472
+
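The merging step can be sketched as follows, assuming each benchmark run yields one record with an encoder, a task, a decoder and a validation score. The record layout and the scores are made up for illustration; they are not results from the paper.

```python
def best_decoders(records):
    """Pick the decoder with the highest validation score per (encoder, task)."""
    best = {}
    for rec in records:
        key = (rec["encoder"], rec["task"])
        if key not in best or rec["valid_score"] > best[key]["valid_score"]:
            best[key] = rec
    return {key: rec["decoder"] for key, rec in best.items()}


# Hypothetical records; real ones would be collected from the benchmark outputs.
records = [
    {"encoder": "GeminiMol", "task": "ADMET-C", "decoder": "LightGBM", "valid_score": 0.81},
    {"encoder": "GeminiMol", "task": "ADMET-C", "decoder": "NeuralNetTorch", "valid_score": 0.84},
    {"encoder": "ECFP4", "task": "ADMET-C", "decoder": "LightGBM", "valid_score": 0.79},
    {"encoder": "ECFP4", "task": "ADMET-C", "decoder": "NeuralNetTorch", "valid_score": 0.77},
]
for (encoder, task), decoder in sorted(best_decoders(records).items()):
    print(f"{encoder:10s} {task:8s} -> {decoder}")
```

Selecting on the validation score rather than the test score keeps the test set untouched for the final comparison.
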
473
+ ``` shell
474
+ for task in "ADMET-C" "ADMET-R" \
475
+ "LIT-QSAR" "CELLS-QSAR" "ST-QSAR" "PW-QSAR" \
476
+ "PropDecoder-ADMET" "PropDecoder-QSAR" # fixed the molecular encoder
477
+ do
478
+ for model_name in "CombineFP" \
479
+ "FCFP6" "MACCS" "RDK" "ECFP6" "FCFP4" "TopologicalTorsion" "AtomPairs" "ECFP4" \
480
+ "${geminimol_lib}/GeminiMol"
481
+ do
482
+ mkdir -p ${model_name}
483
+ CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/benchmark.py "${model_name}" "${geminimol_data}/benchmark.json" "${task}"
484
+ done
485
+ done
486
+ for task in "FineTuning-ADMET" "FineTuning-QSAR"; do # benchmarking with FineTuning GeminiMol models
487
+ for model_name in "${geminimol_lib}/GeminiMol"; do
488
+ CUDA_VISIBLE_DEVICES=0 python -u ${geminimol_app}/benchmark.py "${model_name}" "${geminimol_data}/benchmark.json" "${task}"
489
+ done
490
+ done
491
+ ```
492
+
493
+ ## ⭐ Citing This Work
494
 
495
  **Conformational Space Profile Enhances Generic Molecular Representation Learning**
496
  Lin Wang, Shihang Wang, Hao Yang, Shiwei Li, Xinyu Wang, Yongqi Zhou, Siyuan Tian, Lu Liu, Fang Bai
 
498
 
499
  ## ✅ License
500
 
501
+ GeminiMol is released under the Academic Free Licence, which permits academic use, modification and distribution free of charge. GeminiMol can be utilized in academic publications, open-source software projects, and open-source competitions (e.g. Kaggle competitions under the MIT Open Source license).
502
 
503
+ GeminiMol prohibits unauthorised commercial use, including commercial training and use as part of a paid computational platform; this restriction is intended to prevent speculators from exploiting informational asymmetry for profit. To apply GeminiMol in pipeline development and research activities within pharmaceutical R&D, please contact [our supervisor](mailto:baifang@shanghaitech.edu.cn) for authorization.
504
 
505
+ ## 💌 Get in Touch
506
 
507
+ We welcome community contributions of extension tools based on the GeminiMol model, etc. If you have any questions not covered in this overview, please contact the [GeminiMol Developer Team](Wanglin1102@outlook.com). We would like to hear your feedback and understand how GeminiMol has been useful in your research. Share your stories with [us](Wanglin1102@outlook.com).
508
 
509
  ## 😃 Acknowledgements
510
 
511
+ We appreciate the technical support provided by the engineers of the high-performance computing cluster of ShanghaiTech University. Lin Wang also thanks Jianxin Duan, Gaokeng Xiao, Quanwei Yu, Zheyuan Shen, Shenghao Dong, Huiqiong Li, Zongquan Li, and Fenglei Li for providing technical support, inspiration and help for this work. We express our gratitude to Dr. Zhongji Pu and Dr. Quanwei Yu for their invaluable assistance in third-party testing of model installation, reproducibility and application.
512
 
513
+ We also thank the developers and maintainers of the MacroModel and PhaseShape modules in the Schrödinger package. In addition, GeminiMol communicates with and/or references the following separate libraries and packages. We thank all their contributors and maintainers!
514
 
515
  * [_RDKit_](https://www.rdkit.org/)
516
  * [_PyTorch_](https://pytorch.org/)