---
library_name: transformers
tags:
- CodonTransformer
- Computational Biology
- Machine Learning
- Bioinformatics
- Synthetic Biology
license: apache-2.0
pipeline_tag: token-classification
---

![image/png](https://github.com/Adibvafa/CodonTransformer/raw/main/src/banner_final.png)

**CodonTransformer** is a tool for codon optimization: it transforms protein sequences into DNA sequences optimized for your target organism. Whether you are a researcher or a practitioner in genetic engineering, CodonTransformer provides a suite of features to facilitate your work. By combining the Transformer architecture with a user-friendly Jupyter notebook, it reduces the complexity of codon optimization, saving you time and effort.
<br>

**This is the pretrained model. For best results, please use the [finetuned model](https://huggingface.co/adibvafa/CodonTransformer).**

## Authors
Adibvafa Fallahpour<sup>1,2</sup>\*, Vincent Gureghian<sup>3</sup>\*, Guillaume J. Filion<sup>2</sup>‡, Ariel B. Lindner<sup>3</sup>‡, Amir Pandi<sup>3</sup>

<sup>1</sup> Vector Institute for Artificial Intelligence, Toronto ON, Canada  
<sup>2</sup> University of Toronto Scarborough; Department of Biological Science; Scarborough ON, Canada  
<sup>3</sup> Université Paris Cité, INSERM U1284, Center for Research and Interdisciplinarity, F-75006 Paris, France  
\* These authors contributed equally to this work.  
‡ To whom correspondence should be addressed: <br>
guillaume.filion@utoronto.ca, ariel.lindner@inserm.fr, amir.pandi@cri-paris.org
<br>

## Use Case
**For a guide on finetuning CodonTransformer, check out our [GitHub](https://github.com/Adibvafa/CodonTransformer/tree/main?tab=readme-ov-file#finetuning-codontransformer).**
<br>**For an interactive demo, check out our [Google Colab Notebook](https://adibvafa.github.io/CodonTransformer/GoogleColab).**
<br></br>
After installing CodonTransformer, you can use:
```python
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer-base").to(DEVICE)


# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"


# Predict with CodonTransformer
output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=DEVICE,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
)
print(format_model_output(output))
```
The output is:
<br>


```text
-----------------------------
|          Organism         |
-----------------------------
Escherichia coli general

-----------------------------
|       Input Protein       |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG

-----------------------------
|      Processed Input      |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK

-----------------------------
|       Predicted DNA       |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA
```
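
A quick sanity check on any predicted sequence is to translate it back and confirm that it still encodes the input protein. The sketch below is not part of the CodonTransformer package: `translate_dna` and `encodes_protein` are hypothetical helper names, and the code assumes the standard genetic code (NCBI translation table 1) and a sequence that may end with a single stop codon, as the predicted DNA above does.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with bases ordered T, C, A, G.
# This table is an assumption of the sketch; it is not taken from CodonTransformer.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acids ('*' marks a stop)."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encodes_protein(dna: str, protein: str) -> bool:
    """Return True if `dna` translates to `protein`, ignoring one trailing stop codon."""
    translated = translate_dna(dna)
    if translated.endswith("*"):
        translated = translated[:-1]
    return translated == protein


# First four codons of the predicted DNA above, then the same prefix with a stop codon.
print(translate_dna("ATGGCTTTATGG"))               # MALW
print(encodes_protein("ATGGCTTTATGGTAA", "MALW"))  # True
```

To check a full prediction, pass the predicted DNA string and the original `protein` to `encodes_protein`. A `False` result means the sequence no longer encodes the input protein.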


## Additional Resources
- **Project Website** <br>
  https://adibvafa.github.io/CodonTransformer/

- **GitHub Repository** <br>
  https://github.com/Adibvafa/CodonTransformer

- **Google Colab Demo** <br>
  https://adibvafa.github.io/CodonTransformer/GoogleColab

- **PyPI Package** <br>
  https://pypi.org/project/CodonTransformer/

- **Paper** <br>
  https://www.biorxiv.org/content/10.1101/2024.09.13.612903