habdine committed on
Commit
0372f20
1 Parent(s): d49dad6

Upload README.md

  1. README.md +119 -3
README.md CHANGED
@@ -1,3 +1,119 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ tags:
+ - Causal Language Modeling
+ - GPT2
+ - ESM2
+ - Proteins
+ - GNN
+ library_name: transformers
+ pipeline_tag: text-generation
+ language:
+ - en
+
+ license: cc-by-nc-4.0
+ ---
+ # Prot2Text Model Card
+
+ ## Model Information
+
+ **Model Page:** [Prot2Text](http://nlp.polytechnique.fr/prot2text#proteins) <br>
+ **Paper:** [https://arxiv.org/abs/2307.14367](https://arxiv.org/abs/2307.14367) <br>
+ **GitHub:** [https://github.com/hadi-abdine/Prot2Text](https://github.com/hadi-abdine/Prot2Text) <br>
+ **Authors:** Hadi Abdine<sup>(1)</sup>, Michail Chatzianastasis<sup>(1)</sup>, Costas Bouyioukos<sup>(2, 3)</sup>, Michalis Vazirgiannis<sup>(1)</sup><br>
+ <sup>**(1)**</sup>DaSciM, LIX, École Polytechnique, Institut Polytechnique de Paris, France.<br>
+ <sup>**(2)**</sup>Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, Paris, France.<br>
+ <sup>**(3)**</sup>Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus.<br>
+
+ The **Prot2Text** paper was published at **AAAI 2024**. Preliminary versions of the paper were accepted as spotlights at [DGM4H@NeurIPS 2023](https://sites.google.com/ethz.ch/dgm4h-neurips2023/home?authuser=0) and [AI4Science@NeurIPS 2023](https://ai4sciencecommunity.github.io/neurips23.html).
+
+ ```
+ @inproceedings{abdine2024prot2text,
+   title={Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers},
+   author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
+   booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
+   volume={38},
+   pages={10757--10765},
+   year={2024}
+ }
+ ```
+
+ ### Description
+
+ Prot2Text is a family of models that predict a protein's function as free text, moving beyond conventional binary or categorical classification. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, Prot2Text integrates diverse data types, including protein sequence, structure, and textual annotations and descriptions. This multimodal approach yields a holistic representation of a protein's function and enables the generation of detailed and accurate functional descriptions.
+
+ Prot2Text is trained on a [multimodal dataset](https://huggingface.co/datasets/habdine/Prot2Text-Data) of 256,690 proteins. Each protein comes with three pieces of information: its sequence, its AlphaFold accession ID, and its textual description. The dataset was built from SwissProt, the only curated protein knowledge base with full textual descriptions of proteins, which is included in UniProtKB (Consortium, 2016), Release 2022_04.
+
+ ### Models and Results
+
+ | Model | #params | BLEU Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT Score | Link |
+ |:--------------------------:|:--------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:------------:|
+ | Prot2Text<sub>SMALL</sub> | 256M | 30.01 | 45.78 | 38.08 | 43.97 | 82.60 | [v1.0](https://huggingface.co/habdine/Prot2Text-Small-v1-0), [v1.1](https://huggingface.co/habdine/Prot2Text-Small-v1-1) |
+ | Prot2Text<sub>BASE</sub> | 283M | 35.11 | 50.59 | 42.71 | 48.49 | 84.30 | [v1.0](https://huggingface.co/habdine/Prot2Text-Base-v1-0), [v1.1](https://huggingface.co/habdine/Prot2Text-Base-v1-1) |
+ | Prot2Text<sub>MEDIUM</sub> | 398M | 36.51 | 52.13 | 44.17 | 50.04 | 84.83 | [v1.0](https://huggingface.co/habdine/Prot2Text-Medium-v1-0), [v1.1](https://huggingface.co/habdine/Prot2Text-Medium-v1-1) |
+ | Prot2Text<sub>LARGE</sub> | 898M | 36.29 | 53.68 | 45.60 | 51.40 | 85.20 | [v1.0](https://huggingface.co/habdine/Prot2Text-Large-v1-0), [v1.1](https://huggingface.co/habdine/Prot2Text-Large-v1-1) |
+ | Esm2Text<sub>BASE</sub> | 225M | 32.11 | 47.46 | 39.18 | 45.31 | 83.21 | [v1.0](https://huggingface.co/habdine/Esm2Text-Base-v1-0), [v1.1](https://huggingface.co/habdine/Esm2Text-Base-v1-1) |
+
+ The reported results are computed using v1.0.
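
When scripting over several checkpoints, it can help to keep the repository IDs from the table in one place. The sketch below is only a convenience: `PROT2TEXT_REPOS` and `repo_id` are hypothetical names that are not part of the released code, and the repository prefixes are copied from the links in the table.

```python
# Hypothetical lookup table; prefixes copied from the links in the table above.
PROT2TEXT_REPOS = {
    "Prot2Text-Small": "habdine/Prot2Text-Small",
    "Prot2Text-Base": "habdine/Prot2Text-Base",
    "Prot2Text-Medium": "habdine/Prot2Text-Medium",
    "Prot2Text-Large": "habdine/Prot2Text-Large",
    "Esm2Text-Base": "habdine/Esm2Text-Base",
}

# Versions listed in the table (the reported scores use v1.0).
KNOWN_VERSIONS = ("v1.0", "v1.1")


def repo_id(model: str, version: str = "v1.1") -> str:
    """Build the Hugging Face repository ID for a model family and version."""
    if model not in PROT2TEXT_REPOS:
        raise KeyError(f"unknown model: {model}")
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unknown version: {version}")
    # Repository names spell the version with dashes, e.g. v1.1 -> v1-1.
    return f"{PROT2TEXT_REPOS[model]}-{version.replace('.', '-')}"
```

The returned string can be passed to `from_pretrained` in place of a hard-coded repository name.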
+
+ ### Usage
+
+ Below are some code snippets to get started quickly with running the models. First, install Transformers, Graphein, DSSP, PyTorch, and PyTorch Geometric:
+ ```sh
+ pip install -U transformers
+ # install Graphein from source in editable mode
+ git clone https://github.com/a-r-j/graphein.git
+ pip install -e graphein/
+ pip install torch
+ pip install torch_geometric
+ # PyTorch Geometric extension libraries
+ pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv
+ # install DSSP and expose the mkdssp binary under the name dssp
+ sudo apt-get install dssp
+ sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
+ ```
+ You might need to install different versions or variants depending on your environment.
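
Before loading a model, it may be worth checking that the installation succeeded. The following is a minimal sketch, not part of the Prot2Text code: `missing_dependencies` is a hypothetical helper, and the import names are assumed to match the packages installed above. It looks up modules without importing them and checks that a `dssp` executable is on the `PATH`.

```python
import importlib.util
import shutil

# Import names assumed for the packages installed above.
REQUIRED_MODULES = ("transformers", "graphein", "torch", "torch_geometric")
# The symlink created above exposes mkdssp under the name dssp.
REQUIRED_BINARIES = ("dssp",)


def missing_dependencies() -> list[str]:
    """Return the names of modules and executables that cannot be found."""
    missing = [m for m in REQUIRED_MODULES if importlib.util.find_spec(m) is None]
    missing += [b for b in REQUIRED_BINARIES if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All dependencies found.")
```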
+
+ Then copy the snippet from the section that matches your use case.
+
+ #### Running Prot2Text to generate a protein's function using both its structure and sequence
+ To generate a protein's function from both its structure and its amino-acid sequence, load one of the Prot2Text models and pass the protein's AlphaFold database ID.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # trust_remote_code=True lets Transformers run the custom model code shipped in the repository
+ tokenizer = AutoTokenizer.from_pretrained('habdine/Prot2Text-Base-v1-1',
+                                           trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained('habdine/Prot2Text-Base-v1-1',
+                                              trust_remote_code=True)
+
+ # protein_pdbID is the AlphaFold database ID of the protein
+ function = model.generate_protein_description(protein_pdbID='Q10MK9',
+                                               tokenizer=tokenizer,
+                                               device='cuda'  # replace with 'mps' to run on a Mac device
+                                               )
+
+ print(function)
+ # 'Carboxylate--CoA ligase that may use 4-coumarate as substrate. Follows a two-step reaction mechanism, wherein the carboxylate substrate first undergoes adenylation by ATP, followed by a thioesterification in the presence of CoA to yield the final CoA thioester.'
+ ```
+ <br>
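
The `device` argument above is hard-coded. A small helper can choose it from what the machine offers. This is a sketch under assumptions: `pick_device` is a hypothetical name, and the `'cpu'` fallback is an assumption, since the snippets here only mention `'cuda'` and `'mps'`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    # Assumption: CPU inference is not documented in this README.
    return "cpu"
```

With PyTorch installed, the two flags can be obtained from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result passed as `device=...` to `generate_protein_description`.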
+
+ #### Running Esm2Text to generate a protein's function using only its sequence
+ To generate a protein's function from its amino-acid sequence alone, load the Esm2Text-Base model and pass the amino-acid sequence.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # trust_remote_code=True lets Transformers run the custom model code shipped in the repository
+ tokenizer = AutoTokenizer.from_pretrained('habdine/Esm2Text-Base-v1-1',
+                                           trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained('habdine/Esm2Text-Base-v1-1',
+                                              trust_remote_code=True)
+
+ # protein_sequence is the amino-acid sequence in one-letter codes
+ function = model.generate_protein_description(protein_sequence='AEQAERYEEMVEFMEKL',
+                                               tokenizer=tokenizer,
+                                               device='cuda'  # replace with 'mps' to run on a Mac device
+                                               )
+
+ print(function)
+ # 'A cytochrome b6-f complex catalyzes the calcium-dependent hydrolysis of the 2-acyl groups in 3-sn-phosphoglycerides. Its physiological function is not known.'
+ ```
+ <br>
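
Sequences pasted from other sources often contain whitespace, line breaks, or lower-case letters. The sketch below normalizes a sequence before it is passed as `protein_sequence`. `clean_sequence` is a hypothetical helper, and restricting input to the 20 standard one-letter codes is an assumption: this README does not say how the model handles ambiguous codes such as `X`.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError("unexpected residue letters: " + ", ".join(invalid))
    return sequence
```

For example, `clean_sequence("aeqaeryeem vefmekl")` yields the sequence used in the snippet above.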
+
+ ## Notice
+ THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE.