eduardosoares99 committed
Commit 449f87e • 1 Parent(s): 7bf11c0

Update README.md

Files changed (1):
  1. README.md +136 -3
README.md CHANGED
---
license: apache-2.0
---
# SMILES-based Transformer Encoder-Decoder (SMI-TED)

This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".

Paper: [arXiv link](paper/smi_ted_preprint.pdf)

For model weights, contact eduardo.soares@ibm.com or evital@br.ibm.com.

## Introduction

We present a large encoder-decoder chemical foundation model, the SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, and comes in two main variants ($289M$ and $8 \times 289M$ parameters). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance on a range of tasks. For model weights, contact eduardo.soares@ibm.com or evital@br.ibm.com.

## Table of Contents

1. [Getting Started](#getting-started)
    1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
    2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)

## Getting Started

**This code and environment have been tested on NVIDIA V100 and A100 GPUs.**

### Pretrained Models and Training Logs

We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet. For model weights, contact eduardo.soares@ibm.com or evital@br.ibm.com.

Add the SMI-TED pre-trained weights (e.g., `smi_ted_light.pt`) to the `inference/` or `finetune/` directory, depending on your use case. The directory structure should look like the following:

```
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

and/or:

```
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

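To confirm the files are in place before loading the model, a minimal check such as the following (paths assume the `inference/` layout above) can be used:

```python
from pathlib import Path

# Paths assume the inference/ layout shown above; adjust if you use finetune/.
base = Path("inference/smi_ted_light")
for name in ("smi_ted_light.pt", "bert_vocab_curated.txt", "load.py"):
    print(name, "found" if (base / name).exists() else "MISSING")
```
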
### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

#### Create and Activate Conda Environment

```
conda create --name smi-ted-env python=3.8.18
conda activate smi-ted-env
```

#### Install Packages with Conda

```
conda install pytorch=1.13.1 cudatoolkit=11.4 -c pytorch
conda install numpy=1.23.5 pandas=2.0.3
conda install rdkit=2021.03.5 -c conda-forge
```

#### Install Packages with Pip

```
pip install transformers==4.6.0 pytorch-fast-transformers==0.4.0 torch-optimizer==0.3.0 datasets==1.6.2 scikit-learn==1.3.2 scipy==1.12.0 tqdm==4.66.1
```

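To confirm that the pinned versions resolved correctly, an optional sanity check along these lines (not part of the original setup steps) can be run inside the activated environment:

```python
# Optional sanity check for the smi-ted-env environment.
import numpy, pandas, rdkit, sklearn, torch, transformers

print("torch        :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers :", transformers.__version__)
print("rdkit        :", rdkit.__version__)
print("numpy/pandas :", numpy.__version__, "/", pandas.__version__)
print("scikit-learn :", sklearn.__version__)
```
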
## Pretraining

For pretraining, we use two strategies: a masked-language-modeling objective to train the encoder and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.

SMI-TED is pre-trained on 91M canonicalized and curated SMILES from PubChem with the following constraints (an illustrative preprocessing sketch follows the list):

- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data reserved for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.

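The repository's own preprocessing scripts are the authoritative reference; purely as a rough sketch of the canonicalization and 202-token length filter described above, assuming a generic regex-based SMILES tokenizer (the actual SMI-TED tokenizer and the `bert_vocab_curated.txt` vocabulary may split strings differently), one could write:

```python
import re
from rdkit import Chem

# A widely used regex for splitting SMILES into tokens; assumed here for illustration,
# not necessarily identical to the SMI-TED tokenizer.
SMILES_TOKENS = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def canonicalize(smiles):
    """Return the RDKit-canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def keep(smiles, max_tokens=202):
    """Keep a molecule only if its canonical SMILES has at most `max_tokens` tokens."""
    canonical = canonicalize(smiles)
    return canonical is not None and len(SMILES_TOKENS.findall(canonical)) <= max_tokens

print([s for s in ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"] if keep(s)])
```
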
The pretraining code provides examples of data processing and model training on a smaller dataset, and requires 8 A100 GPUs.

To pre-train the two variants of the SMI-TED model, run:

```
bash training/run_model_light_training.sh
```
or
```
bash training/run_model_large_training.sh
```

Use `train_model_D.py` to train only the decoder, or `train_model_ED.py` to train both the encoder and decoder.

## Finetuning

The finetuning datasets and environment can be found in the [finetune](finetune/) directory. After setting up the environment, you can run a finetuning task with:

```
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```

Finetuning checkpoints and related training artifacts will be saved in directories named `checkpoint_<measure_name>`.

## Feature Extraction

The example notebook [smi_ted_encoder_decoder_example.ipynb](notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks. For model weights, contact eduardo.soares@ibm.com or evital@br.ibm.com.

To load SMI-TED, you can simply use:

```python
# load_smi_ted comes from the load.py shipped with the checkpoint directory (see layout above)
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)
```

To encode SMILES into embeddings, you can use:

```python
import torch  # needed for the no_grad context
with torch.no_grad():
    # df is a pandas DataFrame with a 'SMILES' column
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```

To decode embeddings back into SMILES strings, you can use:

```python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
```
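
Since scikit-learn is already part of the environment, the extracted embeddings can also serve as features for downstream models; the following is only an illustrative sketch with placeholder targets, not the workflow from the notebook:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# encoded_embeddings is the torch tensor produced by model.encode(...) above.
X = encoded_embeddings.cpu().numpy()

# Placeholder targets for illustration only; use real property labels in practice.
y = np.random.rand(X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out split:", reg.score(X_test, y_test))
```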