Converge-SC for Embeddings: How to Use

Task Description

Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. These embeddings encapsulate gene expression patterns, allowing for efficient computational analysis, visualization, and comparison of cells. The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. These embeddings can be used for downstream analysis tasks such as clustering, visualization, integration, and more.

Basic Usage

The examples folder, found under the "Files and versions" tab, contains both the notebook and the gene-mapping JSON file.

Go to the examples/get_embeddings.ipynb notebook to see how to generate embeddings for your single-cell data.

Pipeline Description

The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:

Loading your single-cell data (as an AnnData object)
Preprocessing and normalizing the data
Loading the pre-trained Converge-SC model and tokenizer
Generating embeddings for each cell
Storing the embeddings for downstream tasks

Input Data Requirements

Your data should be in the form of an AnnData object (.h5ad file) with:

Expression Data: Gene expression measurements in adata.X
Gene Information: Gene identifiers in adata.var_names
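
Before preprocessing, it can help to confirm the object's shape and to see whether adata.var_names holds Ensembl IDs or gene symbols. The sketch below is a hypothetical helper (summarize_inputs is not part of Converge-SC). It assumes only an object exposing .X.shape and .var_names, as AnnData does, and uses a simple regular expression for Ensembl gene IDs such as ENSG00000141510.

```python
import re

# Ensembl gene IDs look like ENSG00000141510 (human) or ENSMUSG00000059552 (mouse),
# optionally followed by a version suffix such as ".12".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def summarize_inputs(adata):
    """Summarize an AnnData-like object (hypothetical helper, not a Converge-SC API)."""
    n_cells, n_genes = adata.X.shape
    names = [str(name) for name in adata.var_names]
    if len(names) != n_genes:
        raise ValueError("adata.var_names must have one entry per column of adata.X")
    n_ensembl = sum(1 for name in names if ENSEMBL_GENE_ID.match(name))
    return {
        "n_cells": n_cells,
        "n_genes": n_genes,
        "n_ensembl_ids": n_ensembl,
        "needs_symbol_mapping": n_ensembl > 0,
    }


# Usage (assumes the anndata package and a file named my_data.h5ad):
#   import anndata as ad
#   adata = ad.read_h5ad("my_data.h5ad")
#   print(summarize_inputs(adata))
```

If needs_symbol_mapping comes back true, the gene name mapping step under Preprocessing Steps applies to your data.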

Preprocessing Steps

Before generating embeddings, you should preprocess your data:

Normalization: Normalize your data to a common scale

import scanpy as sc
   
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)  # Log-transform the data
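
To make the two calls concrete, the sketch below reproduces their arithmetic for a single cell in plain Python: each count is scaled so the cell sums to target_sum, then log(1 + x) is applied. It illustrates what the scanpy calls compute with their default settings and is not a replacement for them; normalize_and_log is a hypothetical name.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to scale; keep it as all zeros.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


cell = [0, 3, 1, 6]  # raw counts for four genes in one cell (total = 10)
print(normalize_and_log(cell))
```

Zero counts stay at zero after both steps, which is why log1p is preferred over a plain logarithm for sparse count data.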

Gene Name Mapping: Converge-SC's vocabulary uses gene symbols, not Ensembl IDs. If adata.var_names contains Ensembl IDs, map them to gene symbols first.

import json
   
# Load the mapping file
with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
    ensg_to_symbol = json.load(file)
    
# Map gene names
adata.var_names = adata.var_names.map(lambda col: ensg_to_symbol.get(col, col))
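
Because the lambda falls back to the original name, IDs missing from the mapping file stay unchanged, and distinct IDs can end up sharing one symbol. The sketch below shows that behavior on a toy mapping and reports which names were left unmapped or became duplicated. The IDs and symbols are made up for illustration, and map_gene_ids is a hypothetical helper.

```python
from collections import Counter


def map_gene_ids(gene_ids, ensg_to_symbol):
    """Map IDs to symbols, keeping the original ID when no mapping exists."""
    mapped = [ensg_to_symbol.get(g, g) for g in gene_ids]
    # Names absent from the mapping pass through unchanged
    # (this includes names that were already gene symbols).
    unmapped = [g for g in gene_ids if g not in ensg_to_symbol]
    counts = Counter(mapped)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return mapped, unmapped, duplicated


# Toy mapping with made-up IDs and symbols, for illustration only.
toy_mapping = {
    "ENSG_TOY_1": "GENE_A",
    "ENSG_TOY_2": "GENE_B",
    "ENSG_TOY_3": "GENE_A",  # two IDs sharing one symbol
}
gene_ids = ["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3", "ENSG_TOY_4"]

mapped, unmapped, duplicated = map_gene_ids(gene_ids, toy_mapping)
print(mapped)      # ['GENE_A', 'GENE_B', 'GENE_A', 'ENSG_TOY_4']
print(unmapped)    # ['ENSG_TOY_4']
print(duplicated)  # ['GENE_A']
```

If duplicates appear in real data, AnnData's adata.var_names_make_unique() is one way to disambiguate them, though how the Converge-SC tokenizer treats the suffixed names it produces is not covered here.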

Generating Embeddings

Load model and tokenizer

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)

Compute Embeddings

# gene_names: gene symbols for one cell; gene_values: the matching expression values
tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
embedding = model(**tokenized_cell)
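
The snippet above takes gene_names and gene_values as given. One plausible way to build them for a single cell is sketched below. cell_inputs is a hypothetical helper, and dropping zero-expression genes is an assumption about what the tokenizer expects rather than documented behavior, so treat examples/get_embeddings.ipynb as the authoritative version.

```python
def cell_inputs(expression_row, var_names, drop_zeros=True):
    """Pair one cell's expression values with gene names.

    drop_zeros=True keeps only expressed genes; whether Converge-SC's
    tokenizer wants this filtering is an assumption, not documented behavior.
    """
    if len(expression_row) != len(var_names):
        raise ValueError("expression_row and var_names must have the same length")
    pairs = [
        (str(name), float(value))
        for name, value in zip(var_names, expression_row)
        if not (drop_zeros and value == 0)
    ]
    gene_names = [name for name, _ in pairs]
    gene_values = [value for _, value in pairs]
    return gene_names, gene_values


# Toy cell with three genes; values stand in for normalized, log-transformed data.
gene_names, gene_values = cell_inputs([0.0, 2.5, 1.2], ["GENE_A", "GENE_B", "GENE_C"])
print(gene_names)   # ['GENE_B', 'GENE_C']
print(gene_values)  # [2.5, 1.2]

# With real data (assumes a dense adata.X; for a sparse matrix, use
# adata.X[0].toarray().ravel() as the row instead):
#   gene_names, gene_values = cell_inputs(adata.X[0], adata.var_names)
#   tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
```

Passing drop_zeros=False keeps every gene in the row, which may be the better choice if the model expects a fixed gene order.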
