license: bsd-3-clause

ProCALM

ProCALM (Protein Conditionally Adapted Language Model) is a suite of models in which ProGen2-base is finetuned with conditional adapters to generate functional enzymes conditioned on EC number, taxonomy, or both.

All ProCALM models share a single tokenizer.json, and each model is stored in its own subfolder. We have uploaded the most relevant models here, but please reach out if you would like to use other models from our paper. 1.5B and 9B refer to checkpoints trained to 1.5 billion and 9 billion tokens, respectively.

General Usage

Usage details with examples can be found in our GitHub repository under "Generation" and in our paper. An example framework for generating from pretrained models:

import torch
from tokenizers import Tokenizer
from model import ProgenConditional  # assumes the model code from the ProCALM GitHub repository is importable

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-swissprot/1.5B").to(device)
tokenizer = Tokenizer.from_pretrained("jsunn-y/ProCALM")

# Placeholders to be set by the caller: context, condition_encodings, temperature,
# max_length, top_p, num_return_sequences, pad_token_id
with torch.no_grad():
  input_ids = torch.tensor(tokenizer.encode(context).ids).view([1, -1]).to(device)
  tokens_batch = model.generate(input_ids=input_ids, condition_encodings=condition_encodings, do_sample=True, temperature=temperature, max_length=max_length, top_p=top_p, num_return_sequences=num_return_sequences, pad_token_id=pad_token_id, eos_token_id=4)

# Convert the batch of token ids to nested Python lists, then decode to strings
as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
sequences = tokenizer.decode_batch(as_lists(tokens_batch))
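
The decoded strings may still carry terminal marker tokens. The sketch below shows one possible cleanup step, assuming ProGen2's convention of marking the start and end of a sequence with the characters 1 and 2 (which would be consistent with eos_token_id=4 above, although that token mapping is an assumption here). The helper name clean_sequence is hypothetical and is not part of the ProCALM code.

```python
def clean_sequence(decoded: str, start: str = "1", end: str = "2") -> str:
    """Strip an assumed start marker and cut at the first assumed end marker."""
    seq = decoded.strip()
    # Drop the leading start marker if the context began with it.
    if seq.startswith(start):
        seq = seq[len(start):]
    # Keep only the residues before the first end marker, if one was generated.
    end_pos = seq.find(end)
    if end_pos != -1:
        seq = seq[:end_pos]
    return seq


# Toy strings standing in for the output of tokenizer.decode_batch(...)
decoded_batch = ["1MKTAYIAK2", "1MKT"]
cleaned = [clean_sequence(s) for s in decoded_batch]
```

Sequences that never reached the end marker (the second toy string) hit max_length before terminating, so it may be worth filtering them out separately.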

Note that condition_encodings is a representation of the conditioning, which can be calculated using the .pt dictionaries provided under data in our GitHub repository.
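
To illustrate what a one-hot conditioning lookup looks like in principle, here is a toy sketch in plain Python. It is not the format of the .pt dictionaries in the repository, whose layout and tensor shapes may differ; the function name build_onehot_table and the three EC numbers are chosen purely for illustration.

```python
def build_onehot_table(labels):
    """Map each label to a one-hot vector (list of floats) over the sorted label set."""
    vocab = sorted(set(labels))
    table = {}
    for idx, label in enumerate(vocab):
        vec = [0.0] * len(vocab)
        vec[idx] = 1.0  # a single 1.0 at this label's position
        table[label] = vec
    return table


# Toy vocabulary of three EC numbers (illustrative only)
ec_table = build_onehot_table(["1.1.1.1", "2.7.11.1", "3.2.1.4"])
encoding = ec_table["2.7.11.1"]
```

In practice the vector would be converted to a torch tensor on the same device as the model before being passed as conditioning.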

Summary of Available Models

| Name | Description |
| --- | --- |
| progen2-base | Original ProGen2 model with ~760 million parameters |
| ec-onehot-uniref | Trained with onehot-encoded EC conditioning, on ~29e6 enzymes from Uniref |
| ec-onehot-swissprot | Trained with onehot-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
| tax-swissprot | Trained with onehot-encoded taxonomy conditioning, on ~150e3 enzymes from Swissprot Train |
| ec+tax-swissprot | Trained jointly with onehot-encoded EC conditioning and onehot-encoded taxonomy conditioning using parallel adapters, on ~150e3 enzymes from Swissprot Train |
| ec-drfp-swissprot | Trained with DRFP-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
| ec-creep-swissprot | Trained with CREEP-encoded EC conditioning, on ~150e3 enzymes from Swissprot Train |
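
The usage example passes subfolder="ec-onehot-swissprot/1.5B", which suggests a "name/checkpoint" layout. The following sketch builds such a string from the conditioned model names listed above and the 1.5B and 9B checkpoint labels. The helper subfolder_for is hypothetical, and it is an assumption that every model follows this layout; not every combination is necessarily uploaded, and progen2-base is left out because its layout is not described here.

```python
# Conditioned model names taken from the summary of available models
MODEL_NAMES = {
    "ec-onehot-uniref",
    "ec-onehot-swissprot",
    "tax-swissprot",
    "ec+tax-swissprot",
    "ec-drfp-swissprot",
    "ec-creep-swissprot",
}
# Checkpoint labels (billions of training tokens)
CHECKPOINTS = {"1.5B", "9B"}


def subfolder_for(name: str, checkpoint: str) -> str:
    """Return the assumed 'name/checkpoint' subfolder string, validating both parts."""
    if name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {name!r}")
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    return f"{name}/{checkpoint}"


subfolder = subfolder_for("ec-onehot-swissprot", "1.5B")
```

The resulting string can then be passed as the subfolder argument of from_pretrained, as in the usage example.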