---
language: en
tags:
- transformers
- protein
- peptide-receptor
license: apache-2.0
datasets:
- custom
---

## Model Description

This model predicts receptor classes, identified by their PDB IDs, from peptide sequences using the [ESM2](https://huggingface.co/docs/transformers/model_doc/esm) (Evolutionary Scale Modeling) protein language model with esm2_t6_8M_UR50D pre-trained weights. The model is fine-tuned for receptor prediction using datasets from [PROPEDIA](http://bioinfo.dcc.ufmg.br/propedia2/) and [PepNN](https://www.nature.com/articles/s42003-022-03445-2), as well as novel peptides experimentally validated to bind to their target proteins, with binding conformations determined using ClusPro, a protein-protein docking tool. The name `pep2rec_cppp` reflects the model's ability to predict peptide-to-receptor relationships, leveraging training data from ClusPro, PROPEDIA, and PepNN.

It is particularly useful for researchers and practitioners in bioinformatics, drug discovery, and related fields who aim to understand or predict peptide-receptor interactions.

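For readers who want to adapt this setup to their own data, the sketch below shows one way to attach a sequence-classification head to an ESM2 checkpoint with the `transformers` API. It is not the training script used for this model; the backbone checkpoint, example data, and label encoding are illustrative assumptions.

```python
# Minimal fine-tuning sketch (illustrative only, not the original training script).
from sklearn.preprocessing import LabelEncoder
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical training data: peptide sequences paired with receptor PDB IDs.
peptides = ["GNLIVVGRVIMS", "ACDEFGHIKLMN"]
receptor_ids = ["1JXP", "3KEE"]

# Encode receptor PDB IDs as integer class labels.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(receptor_ids)

# Load an ESM2 backbone (assumed here; see Model Description) with a fresh
# classification head sized to the number of receptor classes.
base_checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    base_checkpoint, num_labels=len(label_encoder.classes_)
)

# From here, tokenized peptides and integer labels can be passed to a standard
# transformers Trainer (or a plain PyTorch loop) for fine-tuning.
```
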
## How to Use

Here is how to predict the receptor class for a peptide sequence using this model:

```python
import torch
from huggingface_hub import hf_hub_download
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
MODEL_PATH = "littleworth/esm2_t12_35M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Fetch the label encoder that maps class indices back to receptor PDB IDs.
label_encoder_path = hf_hub_download(repo_id=MODEL_PATH, filename="label_encoder.joblib")
label_encoder = load(label_encoder_path)

# Peptide sequence to classify.
input_sequence = "GNLIVVGRVIMS"

# Tokenize the peptide and run a forward pass without tracking gradients.
inputs = tokenizer(input_sequence, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and take the most likely receptor class.
probabilities = torch.softmax(outputs.logits, dim=1)
predicted_class_idx = probabilities.argmax(dim=1).item()
predicted_class = label_encoder.inverse_transform([predicted_class_idx])[0]

# Recover the PDB ID for every class and sort the classes by probability.
class_probabilities = probabilities.squeeze().tolist()
class_labels = label_encoder.inverse_transform(range(len(class_probabilities)))
sorted_indices = torch.argsort(probabilities, descending=True).squeeze()
sorted_class_labels = [class_labels[i] for i in sorted_indices.tolist()]
sorted_class_probabilities = probabilities.squeeze()[sorted_indices].tolist()

print(f"Predicted Receptor Class: {predicted_class}")
print("Top 10 Class Probabilities:")
for label, prob in zip(sorted_class_labels[:10], sorted_class_probabilities[:10]):
    print(f"{label}: {prob:.4f}")
```

Running this script produces the following output:

```
Predicted Receptor Class: 1JXP
Top 10 Class Probabilities:
1JXP: 0.9839
3KEE: 0.0001
5EAY: 0.0001
1Z9O: 0.0001
2KBM: 0.0001
2FES: 0.0001
1MWN: 0.0001
5CFC: 0.0001
6O09: 0.0001
1DKD: 0.0001
```

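The model also handles several peptides at once: the tokenizer pads a list of sequences into a single batch, and one receptor class is read off per row. The snippet below is a small usage sketch continuing from the code above (it reuses `tokenizer`, `model`, and `label_encoder`); the peptide sequences are made up for illustration.

```python
# Batch prediction sketch: classify several (illustrative) peptide sequences in one forward pass.
batch_sequences = ["GNLIVVGRVIMS", "KLLLLLKLLLLK"]
batch_inputs = tokenizer(batch_sequences, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    batch_logits = model(**batch_inputs).logits

# One predicted class index per sequence, mapped back to receptor PDB IDs.
batch_indices = batch_logits.argmax(dim=1).tolist()
batch_labels = label_encoder.inverse_transform(batch_indices)
for seq, pdb_id in zip(batch_sequences, batch_labels):
    print(f"{seq} -> {pdb_id}")
```
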
## Evaluation Results

The model was evaluated on a held-out test set. The run summary below (logged with Weights & Biases) combines training statistics with the held-out metrics; `eval/loss` and `eval/accuracy` are the test-set results:

```json
{
  "train/loss": 0.727,
  "train/grad_norm": 4.4672017097473145,
  "train/learning_rate": 2.3235385792411667e-8,
  "train/epoch": 10,
  "train/global_step": 352910,
  "_timestamp": 1712189024.5060718,
  "_runtime": 503183.0418128967,
  "_step": 716,
  "eval/loss": 0.7138708829879761,
  "eval/accuracy": 0.7794731752930051,
  "eval/runtime": 5914.5446,
  "eval/samples_per_second": 15.912,
  "eval/steps_per_second": 15.912,
  "train/train_runtime": 497231.6027,
  "train/train_samples_per_second": 5.678,
  "train/train_steps_per_second": 0.71,
  "train/total_flos": 600463318555361300,
  "train/train_loss": 0.9245198557043193,
  "_wandb": {
    "runtime": 503182
  }
}
```
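
For reference, an `eval/accuracy` value like the one above is typically produced by a `compute_metrics` callback passed to the `transformers` `Trainer`. The function below is a generic sketch of that pattern, not the exact evaluation code used for this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    """Compute classification accuracy from a Trainer EvalPrediction."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely receptor class per example
    return {"accuracy": accuracy_score(labels, predictions)}
```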