license: apache-2.0
SilkomeGPT: Generative strategies for modeling, design and analysis of spider silk protein sequences for enhanced mechanical properties
Generative strategies for modeling, design and analysis of silk protein sequences for enhanced mechanical properties
Wei Lu, David L. Kaplan, Markus J. Buehler
Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA
Contact email: mbuehler@mit.edu
Abstract: Spider silks are remarkable materials characterized by superb mechanical properties such as strength, extensibility and light weight. Yet, to date, few models are available to fully explore sequence-property relationships for analysis and design. Here a custom generative large language model is proposed to enable the design of novel spider silk protein sequences that meet complex combinations of target mechanical properties. The model, pretrained on a large set of protein sequences, is fine-tuned on ~1,000 major ampullate spidroin (MaSp) sequences for which associated fiber-level mechanical properties exist, to yield an end-to-end forward and inverse generative approach that is applied in a multi-agent strategy. Performance is assessed through: (1) a novelty analysis and protein type classification for generated spidroin sequences through Basic Local Alignment Search Tool (BLAST) searches, (2) property evaluation and comparison with similar sequences, (3) comparison of molecular structures, and (4) detailed sequence motif analyses. This work generates silk sequences with property combinations that do not exist in nature, and develops a deep understanding of the mechanistic roles of sequence patterns in achieving overarching key mechanical properties (elastic modulus, strength, toughness, failure strain). The model provides an efficient approach to expand the silkome dataset, facilitating further sequence-structure analyses of silks, and establishes a foundation for synthetic silk design and optimization. This work not only shows the capacity of generative transformer models to design complex materials, but also illustrates an effective use of agentic modeling for self-improving design solutions.
Keywords: biomaterials; deep learning; generative autoregressive transformer; hierarchical; multiscale modeling; spider silk; spidroin
GitHub (more code, notebooks, etc.): https://github.com/lamm-mit/SilkomeGPT
Trained model and inference
This model is a GPT-style autoregressive transformer, pretrained on a large number of silk and other protein sequences. The pretraining task is defined as "Sequence<...>", where ... is an amino acid sequence.
Load pretrained model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Use a GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
trained_model_name = 'lamm-mit/SilkomeGPT'
tokenizer = AutoTokenizer.from_pretrained(trained_model_name, trust_remote_code=True)
# Reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
model_name = trained_model_name
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
).to(device)
model.config.use_cache = False
Sample inference using the "GenerateSilkContent<...>" task, in which the model produces a silk sequence intended to meet the requested list of properties:
prompt = "GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515>"
# Encode the prompt and add a batch dimension: shape (1, prompt_length)
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)
# Sample three candidate sequences with top-k / top-p (nucleus) sampling
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=3,
    temperature=1,
)
# Decode each candidate back to text (prompt followed by the generated sequence)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
Output (here, three candidate sequences):
torch.Size([1, 66]) tensor([[ 43, 299, 73, 86, 69, 88, 73, 55, 77, 80, 79, 39, 83, 82,
88, 299, 88, 32, 20, 18, 21, 27, 27, 16, 20, 18, 22, 22,
22, 16, 20, 18, 20, 28, 22, 16, 20, 18, 20, 26, 25, 16,
20, 18, 22, 22, 25, 16, 20, 18, 22, 24, 21, 16, 20, 18,
22, 26, 26, 16, 20, 18, 25, 21, 25, 34]], device='cuda:0')
0: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [AAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGAGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGLSGSGDAAAAAAAAAGGSGGSEGYGPGGYGPGGSGDAAAAAAAAAGGSGGPGGYGPGGYGPGGYGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGAAVAAASAAGGSGGSGGYGPGGYGPGGSGAAAASAAASAISSPASTSRISFVASRLVSGGTANVSNLSNTIGTVMSQVRAGNPGASECEVVIQTLIELLAALIHILGSASIGNVNYGSTAQSAAVVSESFQSAFQ]
1: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MTLTIRLALSLLVAICTQSMFALGQSVSPWSSPDMAENFMSVFTDSLSQSGAFSYDQMDDISSIGDSIRSGVEKMARSGKTSANKLQAMNMAFASAVAEIAISEGGGQSAQVKTNAVADALSTAFLQTTGVVNTQFVNEIRSLISMFAQANSVSSSSASVSASAGGAGGYGPQAQGAAAVVAGGYGPGSQGPQSYGPGPQAQSSAVAVSAGSQGPQSYGPGPQGPGPQGPGPQGSGPQGPGPQGPGSQGPQSYGPGPQGPSSPGQSSYQYSVSITSQSGSQGTSGGLGSQGAGGADQGGYGNGQGGSGSAAAAAAAGGAGGAGQGGLGAGGAGQGYGAGLGRQGGSGQGGAAAAAAAAGGLGGQGGYGGQDSQGAGQGGYGSGQGGSGAAAAAAAAGGAGRGGLGSGGAGQGYGAGLGGQGGSGQGGQGGQQPGQSGYGRQGQGSGGAGQGGLGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGRQGPGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGGAGRG]
2: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MNWSIRLALLGLVVLSTQTTFAFGQAATPWENTALAEAFINSFLDSIGRTGAFSLSQQDDMSTIGDTLKSAMEKMAQSRKSSKSKLQALNMAFASSMAEIAVAEEGGLSIQAKTEAIASSLSSAFLQTTGVVNYQFVNEIKSLIYMIAQATTNEVASSEASAGGGGGSGQGRYVSSSAAGTYGSAPQSTGENRPAPQGPPQQGPTYGPSAAVLVSAVGGYGQGPAAPSQQGPTGPSQQRQANQGPYGLSVQQEPESQGSYGPETNAAAAAAGGYGPGAVGQQGLGAGGQQGPGGQRP]
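Each decoded sample above consists of the prompt followed by the generated sequence in square brackets. The following is a minimal sketch of two convenience helpers for working with that layout: one assembles a prompt from a list of property values, the other extracts the bracketed sequence from a decoded sample. The names `build_prompt` and `extract_sequence` are hypothetical and not part of the released code, and the three-decimal formatting is an assumption based only on the example prompt shown here.

```python
import re

# Hypothetical helpers (not part of the SilkomeGPT release).
# The prompt layout "Task<v1,v2,...>" is copied from the example above.
TASK = "GenerateSilkContent"

def build_prompt(values, task=TASK):
    """Format property values as 'Task<v1,v2,...>' with three decimals (assumed precision)."""
    return "{}<{}>".format(task, ",".join("{:.3f}".format(v) for v in values))

def extract_sequence(decoded):
    """Return the amino acid string inside the first [...] of a decoded sample, or None."""
    match = re.search(r"\[([A-Z]+)\]", decoded)
    return match.group(1) if match else None

prompt = build_prompt([0.177, 0.222, 0.082, 0.065, 0.225, 0.241, 0.266, 0.515])
print(prompt)

# A shortened stand-in for a decoded model output, used only for illustration
sample = prompt + " [AAAAGGSGGSGGYGPGG]"
print(extract_sequence(sample))
```

A sample that is cut off by `max_length` before the closing bracket does not match the pattern, so `extract_sequence` returns `None` for it; whether such samples should be kept or discarded is left to the user.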
Citation
To cite this work:
@article{WeiKaplanBuehler_2023,
title = {Generative Modeling, Design, and Analysis of Spider Silk Protein Sequences for Enhanced Mechanical Properties},
author = {W. Lu and D. L. Kaplan and M. J. Buehler},
journal = {Adv. Funct. Mater.},
year = {2023},
volume = {},
pages = {},
url = {https://doi.org/10.1002/adfm.202311324}
}