metadata

tags:
  - protein
  - ibm
  - mammal
  - pytorch
  - transformers
library_name: biomed
license: apache-2.0
base_model:
  - ibm/biomed.omics.bl.sm.ma-ted-400m

Protein solubility is a critical factor in both pharmaceutical research and production, as it can significantly affect the quality and function of a protein.
This model is an example of fine-tuning ibm/biomed.omics.bl.sm.ma-ted-400m for protein solubility prediction (binary classification) based solely on the amino acid sequence.

The benchmark is defined in: https://academic.oup.com/bioinformatics/article/34/15/2605/4938490
Data retrieved from: https://zenodo.org/records/1162886

Model Summary

Developers: IBM Research
GitHub Repository: https://github.com/BiomedSciAI/biomed-multi-alignment
Paper: TBD
Release Date: Oct 28th, 2024
License: Apache 2.0.

Usage

Using ibm/biomed.omics.bl.sm.ma-ted-400m requires installing https://github.com/BiomedSciAI/biomed-multi-alignment

pip install git+https://github.com/BiomedSciAI/biomed-multi-alignment.git
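
A quick way to confirm that the installation succeeded is to check that the packages imported in the usage example can be located. The following is a minimal sketch using only the standard library; the module names `mammal` and `fuse` are taken from the import statements in this README, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "mammal" and "fuse" are the top-level packages imported in this README's example.
    missing = missing_modules(["mammal", "fuse"])
    if missing:
        print(f"Missing packages: {missing}; try re-running the pip install command.")
    else:
        print("All required packages were found.")
```

This only checks that the packages are discoverable; it does not import them or verify their versions.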

A simple example for a task already supported by ibm/biomed.omics.bl.sm.ma-ted-400m:

from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp

from mammal.examples.protein_solubility.task import ProteinSolubilityTask
from mammal.keys import CLS_PRED, SCORES
from mammal.model import Mammal

# Load Model
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")

# Load Tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m.protein_solubility")

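# Example input: an arbitrary placeholder amino acid sequence, added here so the
# snippet runs as written. Replace it with the protein you want to evaluate.
protein_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"

# Convert to MAMMAL style
sample_dict = {"protein_seq": protein_seq}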
sample_dict = ProteinSolubilityTask.data_preprocessing(
    sample_dict=sample_dict,
    protein_sequence_key="protein_seq",
    tokenizer_op=tokenizer_op,
    device=model.device,
)

# Run in generate mode
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)

# Post-process the model's output
ans = ProteinSolubilityTask.process_model_output(
    tokenizer_op=tokenizer_op,
    decoder_output=batch_dict[CLS_PRED][0],
    decoder_output_scores=batch_dict[SCORES][0],
)

# Print prediction
print(f"{ans=}")
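
Because the prediction is based solely on the amino acid sequence, it can help to normalize the input string before assigning it to `protein_seq`. The following is a minimal sketch, not part of the MAMMAL API: the helper name `clean_protein_sequence` is hypothetical, and restricting input to the 20 standard one-letter codes is an assumption of this sketch. Whether the tokenizer accepts ambiguous or non-standard codes (such as X, B, or U) is not covered in this README.

```python
# Assumption: only the 20 standard amino acid one-letter codes are treated as valid.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_sequence(raw):
    """Remove whitespace, upper-case the string, and reject non-standard letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unexpected residue letters: {invalid}")
    return seq
```

For example, `clean_protein_sequence(" mktay iakqr\n")` returns `"MKTAYIAKQR"`, which could then be used as the value of `protein_seq`.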

For more advanced usage, see our detailed example at: https://github.com/BiomedSciAI/biomed-multi-alignment

Citation

If you found our work useful, please consider giving a star to the repo and citing our paper:

@article{TBD,
  title={TBD},
  author={IBM Research Team},
  journal={arXiv preprint arXiv:TBD},
  year={2024}
}