---
license: bsd-3-clause
---
|
|
|
# ProCALM
|
[ProCALM](https://github.com/jsunn-y/ProCALM/tree/main) (Protein Conditionally Adapted Language Model) is a suite of models in which [ProGen2-base](https://github.com/enijkamp/progen2) is finetuned with conditional adapters to generate functional enzymes conditioned on EC number, taxonomy, or both.
|
|
|
ProCALM models share `tokenizer.json`, and individual models are organized into subfolders. We have uploaded the most relevant models here, but please reach out if you would like to use other models from our paper. `1.5B` and `9B` refer to checkpoints trained to 1.5 and 9 billion tokens, respectively.
|
|
|
## General Usage
|
Usage details with examples can be found on [github](https://github.com/jsunn-y/ProCALM/tree/main) under "Generation" and in our paper. An example framework for generation from pretrained models:
|
```python
import torch
from tokenizers import Tokenizer

from model import ProgenConditional  # from the ProCALM repository

# Load a conditional checkpoint and the shared tokenizer.
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-swissprot/1.5B")
tokenizer = Tokenizer.from_pretrained("jsunn-y/ProCALM")

with torch.no_grad():
    # Encode the starting context and add a batch dimension.
    input_ids = torch.tensor(tokenizer.encode(context).ids).view([1, -1]).to(device)
    # Sample sequences conditioned on `condition_encodings` (see below).
    tokens_batch = model.generate(input_ids=input_ids, condition_encodings=condition_encodings, do_sample=True, temperature=temperature, max_length=max_length, top_p=top_p, num_return_sequences=num_return_sequences, pad_token_id=pad_token_id, eos_token_id=4)

# Convert each generated row of token ids to a list, then decode to sequences.
as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
sequences = tokenizer.decode_batch(as_lists(tokens_batch))
```
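The example above assumes that the starting context, device, and sampling parameters are defined by the caller. A minimal sketch of plausible values (assumptions for illustration, not settings prescribed by the paper):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

context = "1"              # "1" marks the N-terminus in ProGen2-style formatting (assumed here)
temperature = 1.0          # sampling temperature (assumed)
top_p = 0.9                # nucleus sampling threshold (assumed)
max_length = 1024          # maximum number of generated tokens (assumed)
num_return_sequences = 10  # sequences sampled per call (assumed)
pad_token_id = 0           # assumed; check tokenizer.json for the actual pad token id
```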
|
Note that `condition_encodings` is a tensor representation of the conditioning, which can be built from the `.pt` dictionaries provided in our github repository under `data`.
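A minimal sketch of that lookup, assuming a dictionary file name and key format (an EC number string mapped to a precomputed encoding tensor); see the `data` folder in the repository for the actual files:

```python
import torch

# Hypothetical file name; the real `.pt` dictionaries live under `data` in the github repository.
ec_to_encoding = torch.load("data/ec_onehot_encodings.pt")

# Look up the encoding for an EC number and add a batch dimension (key format assumed).
condition_encodings = ec_to_encoding["1.1.1.1"].unsqueeze(0).to(device)
```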
|
|
|
## Summary of Available Models
|
|
|
| Name | Description |
|:--------|:--------|
| progen2-base | Original ProGen2 model with ~760 million parameters |
| ec-onehot-uniref | Trained with onehot-encoded EC conditioning, on ~29 million enzymes from UniRef |
| ec-onehot-swissprot | Trained with onehot-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
| tax-swissprot | Trained with onehot-encoded taxonomy conditioning, on ~150,000 enzymes from the Swissprot train set |
| ec+tax-swissprot | Trained jointly with onehot-encoded EC conditioning and onehot-encoded taxonomy conditioning via parallel adapters, on ~150,000 enzymes from the Swissprot train set |
| ec-drfp-swissprot | Trained with DRFP-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
| ec-creep-swissprot | Trained with CREEP-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
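Each model name above corresponds to a subfolder of this repository, with checkpoints nested under it, so the `subfolder` argument combines a model name with a checkpoint name. A sketch, assuming the 9B-token checkpoint is available for this model:

```python
from model import ProgenConditional  # from the ProCALM repository

# Hypothetical combination "<model-name>/<checkpoint>", mirroring the
# "ec-onehot-swissprot/1.5B" example above.
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-uniref/9B")
```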
|
|
|