Peptide Wasserstein Autoencoder (WAE)

A generative model for peptide sequences built on a Wasserstein Autoencoder (WAE) trained with a Maximum Mean Discrepancy (MMD) loss.

Model Description

  • Architecture: GRU encoder-decoder with Wasserstein (MMD) loss; see the sketch after this list
  • Latent dimension: 100
  • Condition dimension: 2
  • Max sequence length: 25 amino acids
  • Vocabulary: 22 amino acids + 4 special tokens (26 total)
  • Encoder: Bidirectional GRU (h_dim=80)
  • Decoder: GRU with word dropout (0.3)
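
To make the layout above concrete, here is a minimal PyTorch sketch of an encoder-decoder with these dimensions. The class, method, and attribute names are assumptions for illustration; the packaged model may be organized differently.

import torch
import torch.nn as nn

VOCAB_SIZE = 26          # 22 amino acids + 4 special tokens
Z_DIM, C_DIM, H_DIM = 100, 2, 80

class PeptideWAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, H_DIM, padding_idx=1)
        # Bidirectional GRU encoder; the two final hidden states are
        # concatenated and projected to the 100-dimensional latent code z.
        self.encoder = nn.GRU(H_DIM, H_DIM, bidirectional=True, batch_first=True)
        self.to_z = nn.Linear(2 * H_DIM, Z_DIM)
        # Decoder GRU conditioned on [z, c]; word dropout (p=0.3) is applied
        # to the decoder inputs during training.
        self.word_dropout = nn.Dropout(0.3)
        self.decoder = nn.GRU(H_DIM + Z_DIM + C_DIM, H_DIM, batch_first=True)
        self.out = nn.Linear(H_DIM, VOCAB_SIZE)

    def encode(self, tokens):                       # tokens: (B, T) int64
        _, h = self.encoder(self.embed(tokens))     # h: (2, B, H_DIM)
        return self.to_z(torch.cat([h[0], h[1]], dim=-1))    # (B, Z_DIM)

    def decode(self, tokens, z, c):                 # z: (B, Z_DIM), c: (B, C_DIM)
        emb = self.word_dropout(self.embed(tokens))
        zc = torch.cat([z, c], dim=-1).unsqueeze(1).expand(-1, emb.size(1), -1)
        out, _ = self.decoder(torch.cat([emb, zc], dim=-1))
        return self.out(out)                        # logits: (B, T, VOCAB_SIZE)

During training, the MMD term (see Training Details) pulls the distribution of encoded z vectors toward the prior.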

Installation

pip install deepchemography

Usage

Basic Usage

from deepchemography.peptides import load_peptide_model, sample_peptides

# Load model
model, vocab = load_peptide_model("path/to/model.pt", "path/to/vocab.dict")

# Sample new peptides
peptides = sample_peptides(model, vocab, n_samples=10)
for p in peptides:
    print(p)  # e.g., "M L L L L L A L A L L A L L L"

Encoding and Decoding

from deepchemography.peptides import encode_peptide, decode_latent

# Encode a peptide to latent space
sequence = "M L L L L L A L A L L A L L L A L L L"
z = encode_peptide(model, vocab, sequence)
print(f"Latent shape: {z.shape}")  # (1, 100)

# Decode latent vectors back to sequences
reconstructed = decode_latent(model, vocab, z, sample_mode='greedy')
print(f"Reconstructed: {reconstructed[0]}")

Interpolation

from deepchemography.peptides import interpolate_peptides

seq1 = "M L L L L L A L A L L A L L L A L L L"
seq2 = "M D K L I V L K M L N S K L P Y G Q R K"

# Linear interpolation in latent space
sequences, weights = interpolate_peptides(
    model, vocab, seq1, seq2,
    n_steps=5,
    method='linear'
)

for w, seq in zip(weights, sequences):
    print(f"w={w:.2f}: {seq}")

Exploring Neighborhoods

from deepchemography.peptides import explore_neighborhood

base_sequence = "M L L L L L A L A L L A L L L A L L L"

# Generate similar peptides
neighbors = explore_neighborhood(
    model, vocab, base_sequence,
    noise_scale=0.1,  # Low noise = high similarity
    n_neighbors=10
)

for neighbor in neighbors:
    print(neighbor)
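
The idea behind neighborhood exploration is to perturb the encoded latent vector with Gaussian noise and decode each perturbation. A hand-rolled sketch of that behaviour (the exact noise model used by explore_neighborhood is an assumption):

import torch

z = encode_peptide(model, vocab, base_sequence)   # (1, 100)

# Add isotropic Gaussian noise in latent space and decode; smaller
# noise_scale keeps the decoded peptides closer to the original.
noise_scale = 0.1
for _ in range(10):
    z_noisy = z + noise_scale * torch.randn_like(z)
    print(decode_latent(model, vocab, z_noisy, sample_mode='greedy')[0])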

Input Format

Peptide sequences are represented as space-separated single-letter amino acid codes:

M L L L L L A L A L L A L L L A L L L

Supported Amino Acids

Standard 20 amino acids plus U (Selenocysteine) and Z (Glutamic acid/Glutamine): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, U, V, W, Y, Z

Special Tokens

  • <unk> (index 0): Unknown token
  • <pad> (index 1): Padding
  • <start> (index 2): Start of sequence
  • <eos> (index 3): End of sequence

Training Details

  • Training data: Antimicrobial peptide sequences
  • Loss function: WAE with MMD (Gaussian kernel, sigma=7.0); sketched after this list
  • Training iterations: 344,000
  • Batch size: 32
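
The MMD term measures how far a batch of encoded latent codes is from samples drawn from the prior. Below is a minimal sketch of a biased MMD estimator with a Gaussian kernel at sigma=7.0, assuming a standard normal prior; the training script's exact estimator may differ.

import torch

def gaussian_kernel(x, y, sigma=7.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for every pair of rows.
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_penalty(z_encoded, sigma=7.0):
    # Biased MMD^2 estimate between encoded codes and draws from N(0, I).
    z_prior = torch.randn_like(z_encoded)
    k_xx = gaussian_kernel(z_encoded, z_encoded, sigma).mean()
    k_yy = gaussian_kernel(z_prior, z_prior, sigma).mean()
    k_xy = gaussian_kernel(z_encoded, z_prior, sigma).mean()
    return k_xx + k_yy - 2 * k_xy

# Example: penalty for a batch of 32 latent codes of dimension 100.
print(mmd_penalty(torch.randn(32, 100)))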

Citation

If you use this model, please cite:

@software{wae_peptides,
  title={Peptide Wasserstein Autoencoder},
  author={Orlov, Alexander},
  year={2025},
  url={https://huggingface.co/axelrolov/wae_peptides}
}

License

MIT License
