GenePT Composable Embeddings
This model is a set of embeddings for a list of about 33K functional genes, created by using OpenAI embedding models (and others in the future) to embed text about the genes. Details about the process and evaluations can be found in the paper:
Chen YT, Zou J. (2023+) GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv preprint: https://www.biorxiv.org/content/10.1101/2023.10.16.562533v2.
and on GitHub: https://github.com/yiqunchen/GenePT
In this repository, we (not the original authors) are collecting modifications of the original embeddings with the intent of creating a set of composable embeddings for genes. These embeddings will encode specific information about each gene regarding a set of factors, such as aging, drug interactions, pathways, etc. The repository also contains the original embeddings.
Dataset
The base dataset was collected from NCBI and UniProt, and contains a set of gene descriptions. We have used gpt-4o-mini (and potentially other models in the future) to generate descriptions of the genes and of the other factors mentioned above. We have collected the source datasets in the honicky/genept-composable-embeddings-source-data Dataset repository.
Model
The model is used by multiplying each gene's embedding vector by that gene's expression level and summing the results over all genes (i.e. a matrix multiplication of the expression matrix by the embedding matrix). See the original paper for more details.
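The expression-weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code from the GenePT-tools repository: the function name `embed_cells`, the toy 2-dimensional embeddings, and the array layout (cells in rows, genes in the same order in both arrays) are assumptions made for the example, and any normalization step used in the paper is left out.

```python
import numpy as np


def embed_cells(expression, gene_embeddings):
    """Return one embedding per cell as an expression-weighted sum of gene embeddings.

    expression:      (n_cells, n_genes) array of expression levels (assumed layout)
    gene_embeddings: (n_genes, dim) array, rows in the same gene order as the
                     columns of `expression` (assumed layout)
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    if expression.ndim != 2 or gene_embeddings.ndim != 2:
        raise ValueError("both inputs must be 2-D arrays")
    if expression.shape[1] != gene_embeddings.shape[0]:
        raise ValueError("number of genes differs between the two inputs")
    # Row i of the product is sum_g expression[i, g] * gene_embeddings[g].
    return expression @ gene_embeddings


# Toy data: 3 genes with made-up 2-dimensional embeddings, 2 cells.
gene_embeddings = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
expression = np.array([[2.0, 0.0, 1.0],
                       [0.0, 3.0, 0.0]])

cell_embeddings = embed_cells(expression, gene_embeddings)
print(cell_embeddings)
```

With real data, the rows of `gene_embeddings` would come from the embeddings in this repository, and the columns of the expression matrix would need to be aligned to the same gene order before multiplying.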
Code
The https://github.com/honicky/GenePT-tools repository contains the latest code for building and using the models, as well as some example notebooks.
License
The original models and data in this repository are licensed under the MIT license. The original GenePT weights are governed by the license of the original GenePT repository.