---
license: bsd-3-clause
---

# ProCALM

ProCALM (Protein Conditionally Adapted Language Model) is a suite of models in which ProGen2-base is fine-tuned with conditional adapters to generate functional enzymes conditioned on EC number, taxonomy, or both.

All ProCALM models share a single `tokenizer.json`, and individual models are organized into subfolders. We have uploaded the most relevant models here, but please reach out if you would like to use other models from our paper. "1.5B" and "9B" refer to checkpoints trained to 1.5 billion and 9 billion tokens, respectively.
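
As an illustration, the snippet below is a minimal sketch of downloading one model subfolder together with the shared tokenizer using `huggingface_hub`. The repo id `jsunn-y/ProCALM` and the chosen subfolder name are assumptions based on this repository's layout; generation itself requires the ProCALM code from our GitHub repository.

```python
# Minimal sketch (not official usage): fetch one ProCALM checkpoint subfolder
# plus the shared tokenizer.json from the Hugging Face Hub.
# The repo id and subfolder name below are assumptions.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="jsunn-y/ProCALM",        # assumed repo id for this repository
    allow_patterns=[
        "ec-onehot-swissprot/*",      # one of the model subfolders listed below
        "tokenizer.json",             # tokenizer shared by all models
    ],
)
print("Files downloaded to:", local_dir)
```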

| Name | Description |
| --- | --- |
| `progen2-base` | Original ProGen2 model with ~760 million parameters |
| `ec-onehot-uniref` | Trained with one-hot-encoded EC conditioning on ~29 million enzymes from UniRef |
| `ec-onehot-swissprot` | Trained with one-hot-encoded EC conditioning on ~150,000 enzymes from the Swissprot training set |
| `tax-swissprot` | Trained with one-hot-encoded taxonomy conditioning on ~150,000 enzymes from the Swissprot training set |
| `ec+tax-swissprot` | Trained jointly with one-hot-encoded EC and taxonomy conditioning, using parallel adapters, on ~150,000 enzymes from the Swissprot training set |
| `ec-drfp-swissprot` | Trained with DRFP-encoded EC conditioning on ~150,000 enzymes from the Swissprot training set |
| `ec-creep-swissprot` | Trained with CREEP-encoded EC conditioning on ~150,000 enzymes from the Swissprot training set |

More usage details can be found in our GitHub repository and in our paper.
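
For orientation only, here is a minimal sketch of loading the shared `tokenizer.json` with the `tokenizers` library and encoding a sequence. It assumes the file uses the standard Hugging Face tokenizers JSON format (as in ProGen2); the full conditional-generation pipeline with adapters is documented in the GitHub repository.

```python
# Minimal sketch (assumption: tokenizer.json is a standard Hugging Face
# tokenizers-format file, as used by ProGen2). Conditional generation with
# the adapters requires the ProCALM code from the GitHub repository.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")   # path from the download sketch above
encoding = tok.encode("1MKVLLTAG")            # hypothetical ProGen2-style prompt
print(encoding.tokens)
print(encoding.ids)
```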