|
---
language: en
tags:
- transformers
- protein
- peptide-receptor
license: apache-2.0
datasets:
- custom
---
|
## Model Description |
|
|
|
|
|
This model predicts receptor classes, identified by their PDB IDs, from peptide sequences. It is built on the [ESM2](https://huggingface.co/docs/transformers/model_doc/esm) (Evolutionary Scale Modeling) protein language model, starting from the esm2_t12_35M_UR50D pre-trained weights, and fine-tuned for receptor prediction using datasets from [PROPEDIA](http://bioinfo.dcc.ufmg.br/propedia2/) and [PepNN](https://www.nature.com/articles/s42003-022-03445-2), together with novel peptides experimentally validated to bind their target proteins, whose binding conformations were determined with ClusPro, a protein-protein docking tool. The name `pep2rec_cppp` reflects the model's purpose, predicting peptide-to-receptor relationships, and its training data sources: ClusPro, PROPEDIA, and PepNN.
|
It is particularly useful for researchers and practitioners in bioinformatics, drug discovery, and related fields who aim to understand or predict peptide-receptor interactions.
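
Fine-tuning follows the standard sequence-classification recipe: peptide sequences are tokenized with the ESM2 tokenizer, and receptor PDB IDs are encoded as integer class labels. The sketch below illustrates this setup only; the file name, column names, and hyperparameters are illustrative assumptions, not the exact configuration used to train this model.

```python
# Hypothetical fine-tuning sketch; dataset file, column names, and
# hyperparameters are illustrative, not the released training configuration.
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "facebook/esm2_t12_35M_UR50D"

# Assumed layout: one peptide sequence and one receptor PDB ID per row.
df = pd.read_csv("peptide_receptor_pairs.csv")  # columns: "peptide", "receptor_pdb"

# Encode receptor PDB IDs as integer class labels.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df["receptor_pdb"])

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

class PeptideDataset(Dataset):
    """Tokenized peptides paired with integer receptor-class labels."""

    def __init__(self, sequences, labels):
        self.encodings = tokenizer(list(sequences), truncation=True,
                                   padding=True, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

train_dataset = PeptideDataset(df["peptide"], labels)

# One output class per receptor PDB ID in the training data.
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=len(label_encoder.classes_)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pep2rec_out", num_train_epochs=10,
                           per_device_train_batch_size=8),
    train_dataset=train_dataset,
)
trainer.train()
```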
|
|
|
## How to Use |
|
|
|
Here is how to predict the receptor class for a peptide sequence using this model: |
|
|
|
```python
import torch
from huggingface_hub import hf_hub_download
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "littleworth/esm2_t12_35M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# The fitted LabelEncoder maps class indices back to receptor PDB IDs.
label_encoder_path = hf_hub_download(repo_id=MODEL_PATH, filename="label_encoder.joblib")
label_encoder = load(label_encoder_path)

input_sequence = "GNLIVVGRVIMS"
inputs = tokenizer(input_sequence, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.softmax(outputs.logits, dim=1)
predicted_class_idx = probabilities.argmax(dim=1).item()
predicted_class = label_encoder.inverse_transform([predicted_class_idx])[0]

# Rank all receptor classes by predicted probability.
class_probabilities = probabilities.squeeze().tolist()
class_labels = label_encoder.inverse_transform(list(range(len(class_probabilities))))
sorted_indices = torch.argsort(probabilities, descending=True).squeeze()
sorted_class_labels = [class_labels[i] for i in sorted_indices.tolist()]
sorted_class_probabilities = probabilities.squeeze()[sorted_indices].tolist()

print(f"Predicted Receptor Class: {predicted_class}")
print("Top 10 Class Probabilities:")
for label, prob in zip(sorted_class_labels[:10], sorted_class_probabilities[:10]):
    print(f"{label}: {prob:.4f}")
```
|
|
|
This prints the following output:
|
|
|
```
Predicted Receptor Class: 1JXP
Top 10 Class Probabilities:
1JXP: 0.9839
3KEE: 0.0001
5EAY: 0.0001
1Z9O: 0.0001
2KBM: 0.0001
2FES: 0.0001
1MWN: 0.0001
5CFC: 0.0001
6O09: 0.0001
1DKD: 0.0001
```
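
The same components can also score several peptides in one forward pass. The helper below is a convenience sketch, not part of the model repository; the function name and the second example peptide are illustrative, and `model`, `tokenizer`, and `label_encoder` are assumed to be loaded as in the snippet above.

```python
# Hypothetical batch-prediction helper; reuses model, tokenizer, and
# label_encoder from the usage snippet above.
def predict_receptors(sequences, top_k=3):
    inputs = tokenizer(sequences, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1)
    # Top-k receptor classes and probabilities per input peptide.
    top_probs, top_idx = probs.topk(top_k, dim=1)
    predictions = []
    for seq_probs, seq_idx in zip(top_probs, top_idx):
        labels = label_encoder.inverse_transform(seq_idx.tolist())
        predictions.append(list(zip(labels, seq_probs.tolist())))
    return predictions

batch = ["GNLIVVGRVIMS", "AAAKAAAK"]  # second sequence is illustrative only
for peptide, ranked in zip(batch, predict_receptors(batch)):
    print(peptide, ranked)
```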
|
|
|
## Evaluation Results |
|
|
|
The model was evaluated on a held-out test set. The summary metrics logged for the final training run, including evaluation on the held-out set, are:
|
|
|
```
{
  "train/loss": 0.727,
  "train/grad_norm": 4.4672017097473145,
  "train/learning_rate": 2.3235385792411667e-8,
  "train/epoch": 10,
  "train/global_step": 352910,
  "_timestamp": 1712189024.5060718,
  "_runtime": 503183.0418128967,
  "_step": 716,
  "eval/loss": 0.7138708829879761,
  "eval/accuracy": 0.7794731752930051,
  "eval/runtime": 5914.5446,
  "eval/samples_per_second": 15.912,
  "eval/steps_per_second": 15.912,
  "train/train_runtime": 497231.6027,
  "train/train_samples_per_second": 5.678,
  "train/train_steps_per_second": 0.71,
  "train/total_flos": 600463318555361300,
  "train/train_loss": 0.9245198557043193,
  "_wandb": {
    "runtime": 503182
  }
}
```
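
The reported `eval/accuracy` corresponds to top-1 (argmax) accuracy over held-out peptide-receptor pairs. A minimal way to compute the same metric for any labeled set of sequences is sketched below; the helper and the example pair are illustrative, and `model`, `tokenizer`, and `label_encoder` are assumed to be loaded as in the usage snippet.

```python
# Hypothetical top-1 accuracy check over labeled (peptide, receptor_pdb) pairs;
# the helper and the example pair are illustrative, not part of the released code.
def top1_accuracy(pairs):
    correct = 0
    for peptide, receptor_pdb in pairs:
        inputs = tokenizer(peptide, return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred_idx = model(**inputs).logits.argmax(dim=1).item()
        if label_encoder.inverse_transform([pred_idx])[0] == receptor_pdb:
            correct += 1
    return correct / len(pairs)

print(top1_accuracy([("GNLIVVGRVIMS", "1JXP")]))
```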
|
|
|
|