---
license: bsd-3-clause
---
|
|
|
# ProCALM
|
[ProCALM](https://github.com/jsunn-y/ProCALM/tree/main) (Protein Conditionally Adapted Language Model) is a suite of models in which [ProGen2-base](https://github.com/enijkamp/progen2) is finetuned with conditional adapters to generate functional enzymes conditioned on EC number, taxonomy, or both.
|
|
|
ProCALM models share `tokenizer.json`, and individual models are organized into subfolders. We have uploaded the most relevant models here, but please reach out if you would like to use other models from our paper. `1.5B` and `9B` refer to checkpoints trained to 1.5 and 9 billion tokens, respectively.
|
|
|
## General Usage
|
Usage details with examples can be found on [github](https://github.com/jsunn-y/ProCALM/tree/main) under "Generation" and in our paper. An example framework for generation from pretrained models:
|
```python
import torch
from tokenizers import Tokenizer

from model import ProgenConditional  # from the ProCALM repository

# Load a conditional checkpoint and the shared tokenizer.
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-swissprot/1.5B")
tokenizer = Tokenizer.from_pretrained("jsunn-y/ProCALM")

with torch.no_grad():
    # Encode the starting context and add a batch dimension.
    input_ids = torch.tensor(tokenizer.encode(context).ids).view([1, -1]).to(device)
    # Sample sequences conditioned on `condition_encodings` (see below).
    tokens_batch = model.generate(input_ids=input_ids, condition_encodings=condition_encodings, do_sample=True, temperature=temperature, max_length=max_length, top_p=top_p, num_return_sequences=num_return_sequences, pad_token_id=pad_token_id, eos_token_id=4)

# Convert each generated row of token ids to a list, then decode to sequences.
as_lists = lambda batch: [batch[i, ...].detach().cpu().numpy().tolist() for i in range(batch.shape[0])]
sequences = tokenizer.decode_batch(as_lists(tokens_batch))
```
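The example above assumes that the starting context, device, and sampling parameters are defined by the caller. A minimal sketch of plausible values (assumptions for illustration, not settings prescribed by the paper):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

context = "1"              # "1" marks the N-terminus in ProGen2-style formatting (assumed here)
temperature = 1.0          # sampling temperature (assumed)
top_p = 0.9                # nucleus sampling threshold (assumed)
max_length = 1024          # maximum number of generated tokens (assumed)
num_return_sequences = 10  # sequences sampled per call (assumed)
pad_token_id = 0           # assumed; check tokenizer.json for the actual pad token id
```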
|
Note that `condition_encodings` is a tensor representation of the conditioning, which can be built from the `.pt` dictionaries provided in our github repository under `data`.
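A minimal sketch of that lookup, assuming a dictionary file name and key format (an EC number string mapped to a precomputed encoding tensor); see the `data` folder in the repository for the actual files:

```python
import torch

# Hypothetical file name; the real `.pt` dictionaries live under `data` in the github repository.
ec_to_encoding = torch.load("data/ec_onehot_encodings.pt")

# Look up the encoding for an EC number and add a batch dimension (key format assumed).
condition_encodings = ec_to_encoding["1.1.1.1"].unsqueeze(0).to(device)
```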
|
|
|
## Summary of Available Models
|
|
|
| Name | Description |
|:--------|:--------|
| progen2-base | Original ProGen2 model with ~760 million parameters |
| ec-onehot-uniref | Trained with onehot-encoded EC conditioning, on ~29 million enzymes from UniRef |
| ec-onehot-swissprot | Trained with onehot-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
| tax-swissprot | Trained with onehot-encoded taxonomy conditioning, on ~150,000 enzymes from the Swissprot train set |
| ec+tax-swissprot | Trained jointly with onehot-encoded EC conditioning and onehot-encoded taxonomy conditioning via parallel adapters, on ~150,000 enzymes from the Swissprot train set |
| ec-drfp-swissprot | Trained with DRFP-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
| ec-creep-swissprot | Trained with CREEP-encoded EC conditioning, on ~150,000 enzymes from the Swissprot train set |
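Each model name above corresponds to a subfolder of this repository, with checkpoints nested under it, so the `subfolder` argument combines a model name with a checkpoint name. A sketch, assuming the 9B-token checkpoint is available for this model:

```python
from model import ProgenConditional  # from the ProCALM repository

# Hypothetical combination "<model-name>/<checkpoint>", mirroring the
# "ec-onehot-swissprot/1.5B" example above.
model = ProgenConditional.from_pretrained("jsunn-y/ProCALM", subfolder="ec-onehot-uniref/9B")
```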
|
|
|