license: bsd-3-clause
ProCALM
ProCALM (Protein Conditionally Adapted Language Model) is a suite of models where ProGen2-base is finetuned with conditional adapters for conditional generation of functional enzymes, based on EC number, taxonomy, or both.
ProCALM models share `tokenizer.json`, and individual models are organized into subfolders. We have uploaded the most relevant models here, but please reach out if you would like to use other models from our paper. `1.5B` and `9B` refer to checkpoints trained to 1.5 and 9 billion tokens, respectively.
Quickstart
Usage details with examples can be found on GitHub under "Generation" and in our paper. An example framework for generation from pretrained models:
import torch
from tokenizers import Tokenizer
from model import ProgenConditional  # presumably model.py from the ProCALM GitHub repository

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-swissprot/1.5B").to(device)
tokenizer = Tokenizer.from_pretrained("jsunn-y/ProCALM")

# context, condition_encodings, temperature, max_length, top_p, num_return_sequences, and pad_token_id must be defined by the caller.
with torch.no_grad():
    input_ids = torch.tensor(tokenizer.encode(context).ids).view([1, -1]).to(device)
    tokens_batch = model.generate(input_ids=input_ids, condition_encodings=condition_encodings, do_sample=True, temperature=temperature, max_length=max_length, top_p=top_p, num_return_sequences=num_return_sequences, pad_token_id=pad_token_id, eos_token_id=4)

# Convert the batch of token ids to nested Python lists, then decode each row to a string.
as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
sequences = tokenizer.decode_batch(as_lists(tokens_batch))
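
Decoded outputs may still carry terminus markers. The following is a minimal post-processing sketch, assuming the decoded strings follow the ProGen2 convention of marking the sequence start with `1` and the end with `2`, and that generation runs from start to end. This is an assumption, so check the tokenizer vocabulary before relying on it. The helper name `trim_generated` and the example strings are hypothetical.

```python
def trim_generated(sequence: str, start: str = "1", end: str = "2") -> str:
    """Return the residues between the terminus markers of one decoded sequence."""
    # Drop a leading start marker if the prompt began with one.
    if sequence.startswith(start):
        sequence = sequence[len(start):]
    # Keep only what precedes the first end marker, if one was generated.
    cut = sequence.find(end)
    if cut != -1:
        sequence = sequence[:cut]
    return sequence


# Made-up decoded strings, for illustration only.
decoded = ["1MKTAYIAKQR2", "1MSLLE", "MKV2"]
cleaned = [trim_generated(s) for s in decoded]
print(cleaned)
```

A sequence without an end marker (the second example) stopped at `max_length` rather than at the end token, so it may be incomplete and is often worth filtering out.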
Note that `condition_encodings` is a representation of the conditioning, which can be calculated using the `.pt` dictionaries provided in our GitHub repository under `data`.
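
To make the idea of a conditioning representation concrete, here is a toy sketch of a one-hot lookup table in plain Python. It is not the repository's encoding: the function `build_onehot_table`, the flat vocabulary, and the EC numbers are all illustrative assumptions. In practice the precomputed `.pt` dictionaries would presumably be loaded (for example with `torch.load`) and the looked-up tensor passed as `condition_encodings`; the exact key format and tensor shape are defined by the repository, not by this sketch.

```python
def build_onehot_table(labels):
    """Map each label to a one-hot list over the sorted label vocabulary."""
    vocab = sorted(set(labels))
    table = {}
    for index, label in enumerate(vocab):
        vector = [0.0] * len(vocab)
        vector[index] = 1.0
        table[label] = vector
    return table


# Hypothetical EC numbers; the real vocabulary comes from the training data.
ec_table = build_onehot_table(["1.1.1.1", "2.7.11.1", "4.2.1.20"])
condition = ec_table["2.7.11.1"]
print(condition)
```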
Summary of Available Models
Name | Description |
---|---|
progen2-base | Original ProGen2 model with ~760 million parameters |
ec-onehot-uniref | Trained with onehot-encoded EC conditioning, on ~29e6 enzymes from Uniref |
ec-onehot-swissprot | Trained with onehot-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
tax-swissprot | Trained with onehot-encoded taxonomy conditioning, on ~150e3 enzymes from Swissprot Train |
ec+tax-swissprot | Trained jointly on onehot-encoded EC conditioning and onehot-encoded taxonomy conditioning with parallel adapters, on ~150e3 enzymes from Swissprot Train |
ec-drfp-swissprot | Trained with DRFP-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
ec-creep-swissprot | Trained with CREEP-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
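
The names in the table correspond to subfolders, and the Quickstart example passes `subfolder="ec-onehot-swissprot/1.5B"`. The sketch below builds such a string and catches typos early. It assumes every conditional model follows the same `name/checkpoint` layout, which may not hold for every combination on the hub. The helper `subfolder_for` is hypothetical, and `progen2-base` is left out because its layout is not shown here.

```python
# Conditional model names taken from the table above.
MODEL_NAMES = {
    "ec-onehot-uniref",
    "ec-onehot-swissprot",
    "tax-swissprot",
    "ec+tax-swissprot",
    "ec-drfp-swissprot",
    "ec-creep-swissprot",
}
# Checkpoint labels: training budget in billions of tokens.
CHECKPOINTS = {"1.5B", "9B"}


def subfolder_for(name: str, checkpoint: str) -> str:
    """Return the subfolder string for a model name and checkpoint label."""
    if name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {name!r}")
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    return f"{name}/{checkpoint}"


print(subfolder_for("ec-onehot-swissprot", "1.5B"))
```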