PromoGen2 Model for Prokaryotic Promoter Sequence Generation

PromoGen2 is a specialized language model developed for generating and scoring prokaryotic promoter sequences. The model is particularly suitable for species with limited experimentally verified data. This model card provides guidance on loading the model, generating sequences, and scoring them using a custom scoring function.

Model Details

  • Model type: Transformer-based language model (GPT-2 architecture)
  • Primary use case: Generating and scoring species-specific promoter sequences
  • Tags: Prokaryotic promoters, sequence generation, synthetic biology

Installation

Ensure the required packages are installed:

pip install torch "transformers[torch]" biopython datasets pandas numpy scipy seaborn matplotlib jupyter notebook

Loading the Model and Tokenizer

To get started, load the model and tokenizer with Hugging Face's transformers library.

from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("jinyuan22/promogen2-base")
tokenizer = AutoTokenizer.from_pretrained("jinyuan22/promogen2-base")

# Set device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline("text-generation", model=model, device=device, tokenizer=tokenizer)

Generating Sequences

Use the text-generation pipeline to generate sequences based on an input sequence and various parameters such as sampling temperature, repetition penalty, and top-p sampling. Customize the input sequence (txt), number of sequences, and sampling parameters.

# Define input text and generation parameters
txt = "<|bos|>5"
num_return_sequences = 5
batch_size = 2
max_new_tokens = 50
repetition_penalty = 1.2
top_p = 0.9
temperature = 0.7
do_sample = True

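# Generate sequences in batches until num_return_sequences have been produced
all_outputs = []
for i in range(0, num_return_sequences, batch_size):
    # The last batch may be smaller so the total is exactly num_return_sequences
    n = min(batch_size, num_return_sequences - i)
    outputs = pipe(
        txt,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        temperature=temperature,
        do_sample=do_sample
    )
    all_outputs.extend(outputs)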
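
When num_return_sequences is not a multiple of batch_size, the final batch has to be smaller than the others, otherwise the loop produces more sequences than requested. The batching arithmetic can be checked on its own without loading the model. This is a minimal sketch, and the helper name batch_sizes is hypothetical, not part of PromoGen2 or transformers.

```python
def batch_sizes(total, batch_size):
    """Split `total` items into batches of at most `batch_size` items.

    Hypothetical helper for illustration; it mirrors the range() loop used
    for generation, with the last batch shrunk to fit.
    """
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    return [min(batch_size, total - start) for start in range(0, total, batch_size)]


# With the values used in this card: 5 sequences in batches of 2
print(batch_sizes(5, 2))  # [2, 2, 1]
```

Each entry would be passed as num_return_sequences for one pipeline call.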

Scoring Generated Sequences

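A custom scoring function (score) evaluates each generated sequence. It returns the model's mean per-token cross-entropy loss for the sequence, which is the average negative log-likelihood, so lower scores indicate sequences the model considers more likely. The sequence is wrapped with the provided tag before scoring, or with no tag when tag is "none".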

@torch.no_grad()
def score(seq, tag="none"):
    # Wrap the sequence in the same format used for generation, with or without a tag
    if tag == "none":
        inputs = tokenizer(f"<|bos|>5{seq}3<|eos|>", return_tensors="pt")
    else:
        inputs = tokenizer(f"<|bos|>{tag}5{seq}3{tag}<|eos|>", return_tensors="pt")
    # Move the tensors to the same device as the model
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    # Passing labels=input_ids makes the model return its language-modeling loss
    pred = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    # Mean cross-entropy per predicted token; lower means more likely
    return pred['loss'].item()
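
Because score returns a mean cross-entropy loss, it can be converted into more familiar quantities. The sketch below assumes the loss is in nats and is averaged over the n_tokens - 1 predicted positions, which is the usual behavior of a causal language model called with labels equal to its input IDs. The function names and the example loss value are made up for illustration.

```python
import math


def loss_to_perplexity(loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss)


def loss_to_log_likelihood(loss, n_tokens):
    """Total log-likelihood in nats.

    Assumption: the loss is averaged over the n_tokens - 1 positions that
    have a preceding token to predict from.
    """
    return -loss * (n_tokens - 1)


example_loss = 1.2  # illustrative value, not a real PromoGen2 score
print(loss_to_perplexity(example_loss))
print(loss_to_log_likelihood(example_loss, n_tokens=60))
```

Perplexity is convenient for comparing sequences of different lengths, while the total log-likelihood grows in magnitude with sequence length.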

Post-processing and Saving Outputs

The generated sequences are cleaned of special tokens and then scored using the score function. Each sequence and its score are saved to an output file.

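# Post-process generated sequences: strip special tokens, the 5/3 end markers, and the tag
tag = "none"

def clean(text):
    for token in ("<|bos|>", "<|eos|>", "5", "3", tag):
        text = text.replace(token, "")
    return text.strip()

seqs = [clean(output["generated_text"]) for output in all_outputs]
scores = [score(seq, tag) for seq in seqs]

# Save sequences and scores in a FASTA-like format
with open("output.txt", "w") as f:
    # Use a loop variable that does not shadow the score() function
    for i, (seq, seq_score) in enumerate(zip(seqs, scores)):
        f.write(f">{i}|score={seq_score}\n{seq}\n")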
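
The saved file uses a FASTA-like layout: a header line of the form >index|score=value followed by the sequence on the next line. A possible way to read it back and rank sequences is sketched below. The helper names are hypothetical, the parser assumes exactly one sequence line per header as written above, and the example records are invented for illustration.

```python
def parse_scored_records(text):
    """Parse '>index|score=value' headers, each followed by one sequence line.

    Returns a list of (index, score, sequence) tuples.
    """
    records = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            index, _, score_field = header.partition("|score=")
            records.append((int(index), float(score_field), line))
            header = None
    return records


def rank_by_score(records):
    """Sort ascending: the score is a loss, so lower values rank first."""
    return sorted(records, key=lambda record: record[1])


# Invented example content in the same layout as output.txt
example = ">0|score=1.5\nACGT\n>1|score=0.9\nTTGACA\n"
best = rank_by_score(parse_scored_records(example))[0]
print(best)  # (1, 0.9, 'TTGACA')
```

To use it on real output, pass the contents of output.txt, for example open("output.txt").read().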

Example Parameters

  • txt: Prompt string that seeds generation (for example "<|bos|>5")
  • tag: Tag to define the context or label for generation ("none" if no specific tag is used)
  • num_return_sequences: Total number of sequences to generate
  • batch_size: Number of sequences generated per pipeline call
  • max_new_tokens: Maximum number of new tokens generated per sequence, not counting the prompt
  • repetition_penalty: Penalty on repeated tokens; values above 1.0 discourage repetition
  • top_p: Cumulative probability threshold for nucleus sampling
  • temperature: Sampling temperature; lower values give less diverse output
  • do_sample: Set to True for sampling-based generation
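
To make temperature and top_p more concrete, the sketch below applies both to a toy four-token distribution. It is a plain-Python illustration of the general technique, not the code that transformers runs internally, and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # invented values for four tokens
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharper = softmax_with_temperature(toy_logits, temperature=0.7)
print(probs_default)
print(probs_sharper)
print(top_p_filter(probs_sharper, top_p=0.9))
```

A temperature below 1.0 concentrates probability on the most likely token, and the nucleus filter then drops the low-probability tail before sampling.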

Usage Notes

  • The model runs on either CPU or GPU. The loading code above selects CUDA automatically when it is available and falls back to CPU otherwise.
  • This setup supports sequence generation tasks tailored to synthetic biology, particularly for organisms lacking experimentally verified promoter data.
  • Model size: 148M parameters, stored as F32 Safetensors weights.