PromoGen2 Model for Prokaryotic Promoter Sequence Generation
PromoGen2 is a specialized language model developed for generating and scoring prokaryotic promoter sequences. The model is particularly suitable for species with limited experimentally verified data. This model card provides guidance on loading the model, generating sequences, and scoring them using a custom scoring function.
Model Details
- Model type: Transformer-based language model (GPT-2 architecture)
- Primary use case: Generating and scoring species-specific promoter sequences
- Tags: Prokaryotic promoters, sequence generation, synthetic biology
Installation
Ensure the required packages are installed:
pip install torch "transformers[torch]" biopython datasets pandas numpy scipy seaborn matplotlib jupyter notebook
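After installing, it can help to confirm that the packages are importable before loading the model. The following is a minimal sketch, not part of the original model card; the helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative. Note that the biopython distribution is imported under the name `Bio`.

```python
import importlib.util

# Import names for the packages installed above. The biopython
# distribution is imported as "Bio"; the other names match.
REQUIRED_MODULES = [
    "torch", "transformers", "Bio", "datasets",
    "pandas", "numpy", "scipy", "seaborn", "matplotlib",
]

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```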
Loading the Model and Tokenizer
To get started, load the model and tokenizer with Hugging Face's transformers library.
from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline
import torch
# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jinyuan22/promogen2-xsmall")
tokenizer = AutoTokenizer.from_pretrained("jinyuan22/promogen2-xsmall")
# Set device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline("text-generation", model=model, device=device, tokenizer=tokenizer)
Generating Sequences
Use the text-generation pipeline to generate sequences from an input prompt, controlling the output with sampling parameters such as temperature, repetition penalty, and top-p. Customize the prompt (txt), the number of sequences, and the sampling parameters.
# Define input text and generation parameters
txt = "<|bos|>5"
num_return_sequences = 5
batch_size = 2
max_new_tokens = 50
repetition_penalty = 1.2
top_p = 0.9
temperature = 0.7
do_sample = True
# Generate sequences
all_outputs = []
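for i in range(0, num_return_sequences, batch_size):
    # Request only as many sequences as are still needed, so the last
    # batch does not overshoot num_return_sequences
    n = min(batch_size, num_return_sequences - i)
    outputs = pipe(
        txt,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        temperature=temperature,
        do_sample=do_sample
    )
    all_outputs.extend(outputs)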
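To build intuition for the temperature parameter used above, the sketch below applies temperature scaling to a small set of logits in plain Python. It is an illustration only: the logits and the four-token vocabulary are hypothetical, and this is not the code path the transformers library uses internally. Dividing the logits by a temperature below 1 sharpens the distribution, so sampling concentrates on the most likely tokens.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary (illustrative values)
logits = [2.0, 1.0, 0.5, 0.1]
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```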
Scoring Generated Sequences
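A custom scoring function (score) evaluates each generated sequence. It returns the model's loss for the sequence, that is, the mean per-token negative log-likelihood, so a lower score means the model considers the sequence more likely. The sequence is wrapped with the provided tag before scoring, or scored without a tag when tag is "none".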
@torch.no_grad()
def score(seq, tag="none"):
    """Return the mean per-token negative log-likelihood of seq (lower is more likely)."""
    # Format input with specified tag
    if tag == "none":
        inputs = tokenizer(f"<|bos|>5{seq}3<|eos|>", return_tensors="pt")
    else:
        inputs = tokenizer(f"<|bos|>{tag}5{seq}3{tag}<|eos|>", return_tensors="pt")
    # Move the inputs to the same device as the model
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    # Passing labels=input_ids makes the model return the language-modeling loss
    pred = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    return pred['loss'].item()
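Because the value returned by the scoring function is a mean cross-entropy loss, it can be converted into more familiar quantities. The sketch below is an illustration with hypothetical helper names (`perplexity_from_loss`, `total_log_likelihood`) that are not part of the model card. It assumes the loss is averaged over the predicted tokens, which for a causal language model is the number of input tokens minus one, since the first token has no preceding context.

```python
import math

def perplexity_from_loss(mean_nll):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll)

def total_log_likelihood(mean_nll, num_predicted_tokens):
    """Recover the summed log-likelihood from the per-token mean."""
    return -mean_nll * num_predicted_tokens

# Hypothetical score for a sequence tokenized into 11 tokens (10 predicted)
example_loss = 1.2
print(perplexity_from_loss(example_loss))
print(total_log_likelihood(example_loss, 10))
```

Lower loss, lower perplexity, and higher total log-likelihood all point in the same direction: the model finds the sequence more plausible.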
Post-processing and Saving Outputs
The generated sequences are cleaned of special tokens and then scored using the score function. Each sequence and its score are saved to an output file.
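# Post-process generated sequences
tag = "none"
seqs = []
for output in all_outputs:
    text = output["generated_text"]
    # Strip the special tokens and the 5/3 boundary markers
    for token in ("<|bos|>", "<|eos|>", "5", "3"):
        text = text.replace(token, "")
    # Strip the tag only when one was actually used
    if tag != "none":
        text = text.replace(tag, "")
    seqs.append(text)
scores = [score(seq, tag) for seq in seqs]
# Save sequences and scores (seq_score avoids shadowing the score function)
with open("output.txt", "w") as f:
    for i, (seq, seq_score) in enumerate(zip(seqs, scores)):
        f.write(f">{i}|score={seq_score}\n{seq}\n")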
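The output file uses a FASTA-like layout: a header line of the form `>index|score=value` followed by one sequence line. As a hedged sketch of how such a file could be read back and ranked, the helpers below (`parse_scored_records` and `rank_by_score`, both hypothetical names) parse that layout from a string and sort by ascending score, since a lower loss indicates a more likely sequence. The example records are made-up data, and the parser assumes every record consists of exactly two non-empty lines.

```python
def parse_scored_records(text):
    """Parse '>index|score=value' headers, each followed by one sequence line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        index, score_field = header[1:].split("|score=")
        records.append((int(index), float(score_field), seq))
    return records

def rank_by_score(records):
    """Sort records so the lowest loss (most likely sequence) comes first."""
    return sorted(records, key=lambda record: record[1])

# Made-up example content in the same layout as output.txt
example = ">0|score=1.52\nTTGACA\n>1|score=1.31\nTATAAT\n"
ranked = rank_by_score(parse_scored_records(example))
print(ranked[0])
```

To use it on a real run, the file contents could be passed in with `parse_scored_records(open("output.txt").read())`.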
Example Parameters
- txt: Input prompt string for generation
- tag: Tag that defines the context or label for generation ("none" if no specific tag is used)
- num_return_sequences: Total number of sequences to generate
- batch_size: Number of sequences generated per batch
- max_new_tokens: Maximum number of new tokens generated per sequence
- repetition_penalty: Penalty applied to already-generated tokens to discourage repetition
- top_p: Cumulative probability threshold for nucleus sampling
- temperature: Sampling temperature (lower values reduce diversity, higher values increase it)
- do_sample: Set to True for sampling-based generation
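To illustrate what the top_p parameter controls, the following is a simplified sketch of nucleus filtering in plain Python. It is not the transformers implementation, and the function name `nucleus_filter` and the example probabilities are hypothetical. The idea is to keep the smallest set of highest-probability tokens whose cumulative probability reaches top_p, then renormalize so sampling draws only from that set.

```python
def nucleus_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_probability} for the top-p nucleus."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the interval (0, 1]")
    # Visit token indices from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token distribution over a four-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(probs, top_p=0.9))
```

With a smaller top_p, fewer tokens survive the filter, which makes generation more conservative.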
Usage Notes
- The example code uses a GPU automatically when one is available and falls back to the CPU otherwise; the model and the tokenized inputs must be on the same device when scoring.
- This setup supports sequence generation tasks tailored to synthetic biology, particularly for organisms lacking experimentally verified promoter data.