Peptide Wasserstein Autoencoder (WAE)
A generative model for peptide sequences based on a Wasserstein Autoencoder (WAE) trained with a Maximum Mean Discrepancy (MMD) penalty.
Model Description
- Architecture: GRU encoder-decoder with Wasserstein loss (MMD)
- Latent dimension: 100
- Condition dimension: 2
- Max sequence length: 25 amino acids
- Vocabulary: 22 amino acids + 4 special tokens (26 total)
- Encoder: Bidirectional GRU (h_dim=80)
- Decoder: GRU with word dropout (0.3)
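For orientation, here is a minimal PyTorch sketch consistent with the dimensions listed above. The embedding size and the exact wiring are assumptions; the released implementation in deepchemography may differ.

```python
import torch.nn as nn

class PeptideWAESketch(nn.Module):
    """Hypothetical reconstruction of the listed architecture (forward pass omitted)."""
    def __init__(self, vocab_size=26, emb_dim=64, h_dim=80, z_dim=100, c_dim=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=1)  # <pad> = index 1
        # Bidirectional GRU encoder; final states project to the 100-dim latent code
        self.encoder = nn.GRU(emb_dim, h_dim, batch_first=True, bidirectional=True)
        self.to_z = nn.Linear(2 * h_dim, z_dim)
        # Decoder GRU conditioned on (z, condition); word dropout (p=0.3)
        # is applied to decoder inputs during training
        self.decoder = nn.GRU(emb_dim, z_dim + c_dim, batch_first=True)
        self.out = nn.Linear(z_dim + c_dim, vocab_size)
        self.word_dropout = 0.3
```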
Installation
```bash
pip install deepchemography
```
Usage
Basic Usage
```python
from deepchemography.peptides import load_peptide_model, sample_peptides

# Load model
model, vocab = load_peptide_model("path/to/model.pt", "path/to/vocab.dict")

# Sample new peptides
peptides = sample_peptides(model, vocab, n_samples=10)
for p in peptides:
    print(p)  # e.g., "M L L L L L A L A L L A L L L"
```
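Because the MMD term pushes the aggregate posterior toward a standard normal prior, sampling presumably amounts to decoding draws from N(0, I). A sketch of that idea (not necessarily what `sample_peptides` does internally):

```python
import torch
from deepchemography.peptides import decode_latent

# Draws from the N(0, I) prior over the 100-dim latent space
z = torch.randn(10, 100)
peptides = decode_latent(model, vocab, z, sample_mode='greedy')
```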
Encoding and Decoding
```python
from deepchemography.peptides import encode_peptide, decode_latent

# Encode a peptide to latent space
sequence = "M L L L L L A L A L L A L L L A L L L"
z = encode_peptide(model, vocab, sequence)
print(f"Latent shape: {z.shape}")  # (1, 100)

# Decode latent vectors back to sequences
reconstructed = decode_latent(model, vocab, z, sample_mode='greedy')
print(f"Reconstructed: {reconstructed[0]}")
```
Interpolation
```python
from deepchemography.peptides import interpolate_peptides

seq1 = "M L L L L L A L A L L A L L L A L L L"
seq2 = "M D K L I V L K M L N S K L P Y G Q R K"

# Linear interpolation in latent space
sequences, weights = interpolate_peptides(
    model, vocab, seq1, seq2,
    n_steps=5,
    method='linear'
)
for w, seq in zip(weights, sequences):
    print(f"w={w:.2f}: {seq}")
Exploring Neighborhoods
```python
from deepchemography.peptides import explore_neighborhood

base_sequence = "M L L L L L A L A L L A L L L A L L L"

# Generate similar peptides
neighbors = explore_neighborhood(
    model, vocab, base_sequence,
    noise_scale=0.1,  # Low noise = high similarity
    n_neighbors=10
)
for neighbor in neighbors:
    print(neighbor)
```
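Neighborhood exploration presumably perturbs the base latent code with Gaussian noise scaled by `noise_scale` and decodes the results; a sketch under that assumption:

```python
import torch
from deepchemography.peptides import encode_peptide, decode_latent

z = encode_peptide(model, vocab, base_sequence)   # shape (1, 100)
z_noisy = z + 0.1 * torch.randn(10, z.shape[-1])  # 10 random offsets around z
neighbors = decode_latent(model, vocab, z_noisy, sample_mode='greedy')
```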
Input Format
Peptide sequences are represented as space-separated single-letter amino acid codes:
```
M L L L L L A L A L L A L L L A L L L
```
Supported Amino Acids
The 20 standard amino acids plus U (selenocysteine) and Z (glutamic acid or glutamine):
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, U, V, W, Y, Z
Special Tokens
- `<unk>` (index 0): Unknown token
- `<pad>` (index 1): Padding
- `<start>` (index 2): Start of sequence
- `<eos>` (index 3): End of sequence
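For reference, a hypothetical reconstruction of the 26-token mapping implied above; the ordering of the amino-acid tokens is an assumption, and the authoritative mapping ships in vocab.dict:

```python
SPECIALS = ['<unk>', '<pad>', '<start>', '<eos>']  # indices 0-3 per the list above
AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTUVWYZ')       # 22 residues (order assumed)
token_to_idx = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}

def tokenize(sequence):
    # "M L L A" -> indices for [<start>, M, L, L, A, <eos>]
    tokens = ['<start>'] + sequence.split() + ['<eos>']
    return [token_to_idx.get(t, token_to_idx['<unk>']) for t in tokens]
```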
Training Details
- Training data: Antimicrobial peptide sequences
- Loss function: WAE with MMD (Gaussian kernel, sigma=7.0)
- Training iterations: 344,000
- Batch size: 32
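The MMD penalty compares the encoded latent codes against samples from the N(0, I) prior: MMD^2 = E[k(z, z')] + E[k(p, p')] - 2 E[k(z, p)]. A minimal Gaussian-kernel sketch in PyTorch; the exact kernel parameterization of the released code is an assumption:

```python
import torch

def gaussian_kernel(x, y, sigma=7.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), evaluated pairwise
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd_loss(z_post, z_prior, sigma=7.0):
    # Biased estimate of MMD^2 between aggregate posterior and prior samples
    return (gaussian_kernel(z_post, z_post, sigma).mean()
            + gaussian_kernel(z_prior, z_prior, sigma).mean()
            - 2 * gaussian_kernel(z_post, z_prior, sigma).mean())
```

During training this penalty is typically added to the token-level reconstruction loss, with the prior samples drawn as `torch.randn_like(z_post)`.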
Citation
If you use this model, please cite:
```bibtex
@software{wae_peptides,
  title={Peptide Wasserstein Autoencoder},
  author={Orlov, Alexander},
  year={2025},
  url={https://huggingface.co/axelrolov/wae_peptides}
}
```
License
MIT License