nferruz committed on
Commit
f604bb0
1 Parent(s): 59560b2

Upload 9 files

README.md ADDED
@@ -0,0 +1,244 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: translation
+ tags:
+ - chemistry
+ - biology
+ inference: false
+ ---
+
+ # **Contributors**
+
+ - Sebastian Lindner (GitHub [@Bienenwolf655](https://github.com/Bienenwolf655); Twitter [@lindner_seb](https://twitter.com/lindner_seb))
+ - Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
+ - Noelia Ferruz (GitHub [@noeliaferruz](https://github.com/noeliaferruz); Twitter [@ferruz_noelia](https://twitter.com/ferruz_noelia); Webpage: [www.aiproteindesign.com](https://www.aiproteindesign.com))
+
+
+ # **REXzyme: A Translation Machine for the Generation of New-to-Nature Enzymes**
+ **Work in Progress**
+
+ REXzyme (Reaction to Enzyme; manuscript in preparation) is a translation machine, similar to Google Translate,
+ for the generation of enzymes that catalyze user-defined reactions.
+
+ ![Inference of REXzyme](./rexzyme3.png)
+
+ It is possible to provide fine-grained input at the substrate level.
+ Translation machines have learned to translate with great success between complex language pairs
+ that often diverge in their representation at the character level (e.g., Japanese and English). By analogy, we posit that an advanced architecture will
+ be able to translate between the chemical and sequence spaces. REXzyme was trained on a set of 2480 reactions
+ and ~32M reaction-enzyme pairs, and it produces sequences that are predicted to perform their intended reactions. A second
+ version of the model with 14k more reactions will be uploaded to this repository shortly.
+
+ To use the model, you will need to provide a reaction in the SMILES format
+ (Simplified Molecular-Input Line-Entry System). A useful online server to convert from molecules to SMILES
+ can be found here: https://cactus.nci.nih.gov/chemical/structure.
+
+ After converting each of the reaction components, combine them according to the following scheme: ```ReactantA.ReactantB>AgentA>ProductA.ProductB```<br/>
+ Additionally, prepend the task prefix ```r2s``` and append the EOS token ```</s>```,
+ e.g., for the carbonic anhydrase reaction: ```r2sO.O=C=O>>OC(=O)[O-].[H+]</s>```
+
+ We provide the following Python script to convert reactant names to the required reaction format, but
+ we always recommend drawing and double-checking the structures in a server like [cactus](https://cactus.nci.nih.gov/chemical/structure).
+
+ ```python
+ # Adapted from https://stackoverflow.com/questions/54930121/converting-molecule-name-to-smiles
+ from urllib.request import urlopen
+ from urllib.parse import quote
+
+ # Reactants (separated by '.') on the left of the equals sign, products (also separated by '.') on the right
+ reactions = "CO2 . H2O = carbonic acid . H+"
+ # Agents (separated by '.')
+ agent = ""
+
+ def CIRconvert(ids):
+     """Look up the SMILES string of a molecule name on the cactus server."""
+     try:
+         url = 'https://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
+         ans = urlopen(url).read().decode('utf8')
+         return ans
+     except Exception:
+         # Unknown name or network error: check the final string for this marker
+         return 'Did not work'
+
+ # Strip whitespace around each name only, so that multi-word names such as "carbonic acid" stay intact
+ left, right = reactions.split('=')
+ reagent = [CIRconvert(i.strip()) for i in left.split('.') if i.strip() != ""]
+ agent = [CIRconvert(i.strip()) for i in agent.split('.') if i.strip() != ""]
+ product = [CIRconvert(i.strip()) for i in right.split('.') if i.strip() != ""]
+ print(f"r2s{'.'.join(reagent)}>{'.'.join(agent)}>{'.'.join(product)}</s>")
+ ```
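
A failed lookup in the script above returns the string `Did not work`, which would end up inside the model input unnoticed. As a minimal sketch, the assembled string can be checked against the `r2s…</s>` scheme before it is passed to the model. The helper name `check_reaction_input` is hypothetical and not part of REXzyme, and the check only covers the layout of the string, not whether each SMILES is chemically valid.

```python
def check_reaction_input(text):
    """Split an 'r2s<reactants>><agents>><products></s>' string into its parts.

    Raises ValueError if the layout does not match the scheme described above.
    """
    prefix, suffix = "r2s", "</s>"
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("input must start with 'r2s' and end with '</s>'")
    body = text[len(prefix):-len(suffix)]
    # SMILES contain no whitespace, so any whitespace hints at a failed name lookup
    if any(ch.isspace() for ch in body):
        raise ValueError("whitespace found; a name lookup may have failed")
    parts = body.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Molecules within each part are separated by '.'; the agents part may be empty
    reactants, agents, products = ([m for m in p.split(".") if m] for p in parts)
    if not reactants or not products:
        raise ValueError("at least one reactant and one product are required")
    return reactants, agents, products
```

Calling `check_reaction_input("r2sO.O=C=O>>OC(=O)[O-].[H+]</s>")` returns the reactant, agent and product lists, whereas a string containing `Did not work` raises a `ValueError` because of the embedded spaces.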
+
+ We are still working on the analysis of the model for different tasks, including experimental testing.
+ See below for information about the model's performance in different in-silico tasks and how to generate your own enzymes.
+
+ ## **Model description**
+
+ REXzyme is based on the [Efficient T5 Large Transformer](https://huggingface.co/google/t5-efficient-large) architecture (which in turn is very similar to the current version of Google Translate)
+ and contains 48 (24 encoder / 24 decoder) layers with a model dimensionality of 1024, totaling 770 million parameters.
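
As a rough cross-check of the parameter count, the arithmetic can be sketched from the values in the `config.json` added in this commit. The function `count_t5_params` is a hypothetical helper, and it assumes the standard T5 layout: bias-free linear layers, scale-only layer norms, a non-gated feed-forward block, and one relative attention bias table per stack.

```python
def count_t5_params(cfg, tied_embeddings=True):
    """Estimate the number of weights of a T5 model from its config values."""
    d_model, d_ff = cfg["d_model"], cfg["d_ff"]
    inner = cfg["num_heads"] * cfg["d_kv"]
    attention = 4 * d_model * inner       # q, k, v and output projections
    feed_forward = 2 * d_model * d_ff     # wi and wo
    rel_bias = cfg["relative_attention_num_buckets"] * cfg["num_heads"]
    # Encoder block: self-attention + feed-forward, each with one layer norm
    encoder = cfg["num_layers"] * (attention + feed_forward + 2 * d_model)
    # Decoder block: self-attention + cross-attention + feed-forward, three layer norms
    decoder = cfg["num_decoder_layers"] * (2 * attention + feed_forward + 3 * d_model)
    # Each stack adds one relative attention bias table and a final layer norm
    stacks = encoder + decoder + 2 * (rel_bias + d_model)
    embedding = cfg["vocab_size"] * d_model
    total = embedding + stacks
    if not tied_embeddings:
        total += embedding                # separate output projection
    return total

config = {
    "d_model": 1024, "d_ff": 4096, "d_kv": 64, "num_heads": 16,
    "num_layers": 24, "num_decoder_layers": 24,
    "relative_attention_num_buckets": 32, "vocab_size": 32128,
}
print(count_t5_params(config))                         # shared input/output embeddings
print(count_t5_params(config, tied_embeddings=False))  # separate output projection
```

Under these assumptions the count is about 738 million with shared embeddings and about 771 million with a separate output projection, the latter being close to the figure quoted above. The float32 `pytorch_model.bin` in this commit (2,950,844,807 bytes, i.e. roughly 738 million 4-byte values) would be consistent with shared embeddings, although this is only inferred from the file size.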
+
+ REXzyme is a translation machine trained on a portion of the [RHEA database](https://www.rhea-db.org/) containing 31,970,152 reaction-enzyme pairs.
+ A second model, trained on a dataset with >14k reactions, is in preparation and will be uploaded soon.
+ The pre-training was done on pairs of SMILES and amino acid sequences, tokenized with a char-level
+ SentencePiece tokenizer. Note that two separate tokenizers were used for the input (./tokenizer_smiles) and the labels (./tokenizer_aa).
+
+ REXzyme was pre-trained with a supervised translation objective, i.e., the model learned to process the continuous
+ representation of the reaction from the encoder to autoregressively (causal language modeling) produce the output.
+ The output tokens (amino acids) are generated one at a time, from left to right, and the model learns to match the original enzyme sequence.
+ Hence, the model learns the dependencies among protein sequence features that enable a specific enzymatic reaction.
+
+ There are stark differences in the number of members among reaction classes.
+ However, since we tokenize the reaction SMILES at the character level,
+ the model has learned dependencies among molecules and enzyme sequence features, and it can transfer learning from more populated to less populated
+ reaction classes.
+
+ ## **Model Performance**
+
+ - **Dataset curation**
+ We converted the reactions from RXN format to SMILES strings, including only left-to-right reactions.
+ The enzyme sequences were truncated to a length of 1024.
+ Enzymes catalyzing more than one reaction appear in multiple enzyme-reaction pairs.
+ <br/><br/>
+ - **General descriptors**
+
+ | Method | Natural | Generated <sup>[1]</sup> |
+ | :--- | :----: | ---: |
+ | **IUPred3 (ordered)** | 99.9% | 99.9% |
+ | **ESMFold (avg. pLDDT)** | 85.03 | 79.82 |
+ | **flDPnn** | 0.0878 | 0.0929 |
+ <sup>[1]</sup> We excluded sequences with sequence identities ≥ 70% and pLDDT < 60.
+ <br/><br/>
+
+
+ - **Functional classification**
+ <br/><br/>
+ <table>
+ <tr>
+ <td><b>Method</b></td>
+ <td colspan="2"><a href="https://google-research.github.io/proteinfer/">ProteInfer</a></td>
+ <td colspan="2"><a href="https://www.science.org/doi/10.1126/science.adf2465">CLEAN</a></td>
+ </tr>
+ <tr>
+ <td><b>Dataset</b></td>
+ <td>Natural (%)</td>
+ <td>Generated (%)</td>
+ <td>Natural (%)</td>
+ <td>Generated (%)</td>
+ </tr>
+ <tr>
+ <td><b>EC: Level 1</b></td>
+ <td>81</td>
+ <td>80</td>
+ <td>80</td>
+ <td>79</td>
+ </tr>
+ <tr>
+ <td><b>EC: Level 2</b></td>
+ <td>78</td>
+ <td>77</td>
+ <td>79</td>
+ <td>78</td>
+ </tr>
+ <tr>
+ <td><b>EC: Level 3</b></td>
+ <td>76</td>
+ <td>75</td>
+ <td>78</td>
+ <td>77</td>
+ </tr>
+ <tr>
+ <td><b>EC: Level 4</b></td>
+ <td>62</td>
+ <td>58</td>
+ <td>70</td>
+ <td>65</td>
+ </tr>
+ <tr>
+ <td><b>No EC predicted</b></td>
+ <td>10</td>
+ <td>7</td>
+ <td>0</td>
+ <td>0</td>
+ </tr>
+ <tr>
+ <td><b>GO-Terms</b></td>
+ <td>41</td>
+ <td>39</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ <tr>
+ <td><b>No GO predicted</b></td>
+ <td>1</td>
+ <td>1</td>
+ <td>-</td>
+ <td>-</td>
+ </tr>
+ </table>
+ <br/><br/>
+ - **PGP pipeline** [(see GitHub)](https://github.com/hefeda/PGP)
+
+ | Method | Natural | Generated |
+ | :--- | :---- | :--- |
+ | **Disorder** | 11.473 | 11.467 |
+ | **DSSP3** | L: 42%, H: 41%, E: 18% | L: 45%, H: 39%, E: 16% |
+ | **DSSP8** | C: 25%, H: 38%, T: 10%, S: 5%, I: 0%, E: 19%, G: 2%, B: 0% | C: 29%, H: 38%, T: 10%, S: 4%, I: 0%, E: 17%, G: 3%, B: 0% |
+ | **CATH Classes** | Mainly Beta: 6%, Alpha Beta: 78%, Mainly Alpha: 16%, Special: 0%, Few Secondary Structures: 0% | Mainly Beta: 4%, Alpha Beta: 87%, Mainly Alpha: 9%, Special: 0%, Few Secondary Structures: 0% |
+ | **Transmembrane Prediction** | Membrane: 9%, Soluble: 91% | Membrane: 9%, Soluble: 91% |
+ | **Conservation** | High: 37%, Low: 33% | High: 38%, Low: 33% |
+ | **Localization** | Cytoplasm: 66%, Nucleus: 4%, Extracellular: 6%, PM: 4%, ER: 11%, Lysosome/Vacuole: 1%, Mito.: 6%, Plastid: 1%, Golgi: 1%, Perox.: 1% | Cytoplasm: 85%, Nucleus: 2%, Extracellular: 6%, PM: 1%, ER: 6%, Lysosome/Vacuole: 0%, Mito.: 4%, Plastid: 0%, Golgi: 0%, Perox.: 0% |
+ <br/><br/>
+
+
+ ## **How to generate from REXzyme**
+ REXzyme can be used with the Hugging Face `transformers` Python package.
+ Detailed installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).
+
+ Since REXzyme was trained with a machine translation objective, users have to specify a chemical reaction in the SMILES format.
+
+ Disclaimer: Although the perplexity is computed here, it is not the best selection criterion.
+ The BLEU score is usually used to evaluate translations,
+ but it would enforce a high sequence similarity (and thus not *de novo* design, which is what we aim for).
+ We recommend generating many sequences and selecting them by pLDDT, as well as other metrics.
+
+ ```python
+ import math
+
+ import torch
+ from tqdm import tqdm
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ # Separate tokenizers for the labels (amino acids) and the input (reaction SMILES)
+ tokenizer_aa = AutoTokenizer.from_pretrained('/path/to/tokenizer_aa')
+ tokenizer_smiles = AutoTokenizer.from_pretrained('/path/to/tokenizer_smiles')
+
+ model = T5ForConditionalGeneration.from_pretrained("/path/to/REXzyme").cuda()
+ print(model.generation_config)
+ reactions = ["NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.*N[C@@H](CO)C(*)=O>>NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.[H+].*N[C@@H](COP(=O)([O-])[O-])C(*)=O"]
+
+ def calculatePerplexity(inputs, model):
+     '''Compute the perplexity of one generated sequence (a 1-D tensor of token ids).'''
+     # Decode and re-tokenize the sequence, then drop the padding tokens
+     a = tokenizer_aa.decode(inputs)
+     b = tokenizer_aa(a, return_tensors="pt").input_ids.to(device='cuda')
+     b = b[b != tokenizer_aa.pad_token_id].unsqueeze(0)
+     with torch.no_grad():
+         # Note: the amino acid sequence is used as both the model input and the labels
+         outputs = model(b, labels=b)
+     # Perplexity is the exponential of the mean token-level cross-entropy loss
+     return math.exp(outputs.loss.item())
+
+
+ for idx, i in tqdm(enumerate(reactions)):
+     input_ids = tokenizer_smiles(f"r2s{i}</s>", return_tensors="pt").input_ids.to(device='cuda')
+     print(f'Generating for {i}')
+     ppls_total = []
+     # 4 rounds of 25 sampled sequences each, i.e. 100 sequences per reaction
+     for _ in range(4):
+         outputs = model.generate(input_ids,
+                                  top_k=15,
+                                  top_p=0.92,
+                                  repetition_penalty=1.2,
+                                  max_length=1024,
+                                  do_sample=True,
+                                  num_return_sequences=25)
+         # Collect (sequence, perplexity, decoded length) for every sampled sequence
+         ppls = [(tokenizer_aa.decode(output, skip_special_tokens=True), calculatePerplexity(output, model), len(tokenizer_aa.decode(output))) for output in tqdm(outputs)]
+         ppls_total.extend(ppls)
+ ```
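
The loop above only collects `(sequence, perplexity, length)` tuples in `ppls_total`. As a minimal sketch of a possible next step, the tuples can be deduplicated, sorted by perplexity and written out in FASTA format, for example as input for a structure predictor. The helper names `rank_by_perplexity` and `to_fasta` and the `rexzyme` header prefix are hypothetical, and perplexity remains only one of several possible selection criteria.

```python
def rank_by_perplexity(records, top_n=10):
    """Keep each distinct sequence once (with its lowest perplexity) and sort ascending."""
    best = {}
    for seq, ppl, _length in records:
        # Skip empty generations; keep the lowest perplexity seen for duplicates
        if seq and (seq not in best or ppl < best[seq]):
            best[seq] = ppl
    return sorted(best.items(), key=lambda item: item[1])[:top_n]


def to_fasta(ranked, prefix="rexzyme"):
    """Format (sequence, perplexity) pairs as FASTA text, one record per sequence."""
    lines = []
    for rank, (seq, ppl) in enumerate(ranked, start=1):
        lines.append(f">{prefix}_{rank} ppl={ppl:.2f}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy records standing in for the content of ppls_total
example = [("MKT", 3.5, 3), ("MAA", 1.25, 3), ("MKT", 2.0, 3), ("", 0.5, 0)]
print(to_fasta(rank_by_perplexity(example)))
```

With the real output, `rank_by_perplexity(ppls_total)` would take the place of the toy records.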
+
+ ## **A word of caution**
+
+ - We have not yet fully tested the ability of the model to generate new-to-nature enzymes, i.e.,
+ enzymes for chemical reactions that do not appear in nature (and hence not in the training set either). While this is the intended objective of our work, it is very much a work in progress. We will update the model and documentation shortly.
config (1).json ADDED
@@ -0,0 +1,30 @@
+ {
+   "_name_or_path": "google/t5-efficient-large",
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "d_ff": 4096,
+   "d_kv": 64,
+   "d_model": 1024,
+   "decoder_start_token_id": 0,
+   "dense_act_fn": "relu",
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "relu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "is_gated_act": false,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "n_positions": 512,
+   "num_decoder_layers": 24,
+   "num_heads": 16,
+   "num_layers": 24,
+   "pad_token_id": 0,
+   "relative_attention_max_distance": 128,
+   "relative_attention_num_buckets": 32,
+   "torch_dtype": "float32",
+   "transformers_version": "4.26.1",
+   "use_cache": true,
+   "vocab_size": 32128
+ }
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "decoder_start_token_id": 0,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.26.1"
+ }
gitattributes ADDED
@@ -0,0 +1,34 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e10483a311f320786c5f3d89ffefda83a64c8ad4e6eabe7f4762c9a3f35a239f
+ size 5901651283
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b721a7612cdbc471803a18f3adbd9a6d308aa33729a1c0ae66442c741f2ce2d
+ size 2950844807
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb61b174f3ee835840cc48957d94b11aef3155be8597b70a8ffe7d8d14a428f2
+ size 623
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd605798bf354f11dfa3f393e3276acf6c745c46fd47ac5be9878866f43c65e8
+ size 3567