Converge-SC for Embeddings: How to Use
Task Description
Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. These embeddings encapsulate gene expression patterns, allowing for efficient computational analysis, visualization, and comparison of cells. The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. These embeddings can be used for downstream analysis tasks such as clustering, visualization, integration, and more.
Basic Usage
The examples folder, found under the "Files and versions" tab, contains both the notebook and the gene mapping JSON file. Open the examples/get_embeddings.ipynb notebook to see how to generate embeddings for your single-cell data.
Pipeline Description
The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:
- Loading your single-cell data (as an AnnData object)
- Preprocessing and normalizing the data
- Loading the pre-trained Converge-SC model and tokenizer
- Generating embeddings for each cell
- Storing the embeddings for downstream tasks
Input Data Requirements
Your data should be in the form of an AnnData object (.h5ad file) with:
- Expression Data: Gene expression measurements in adata.X
- Gene Information: Gene identifiers in adata.var_names
Preprocessing Steps
Before generating embeddings, you should preprocess your data:
- Normalization: Scale every cell to the same total count, then log-transform the result
import scanpy as sc
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata) # Log-transform the data
- Gene Name Mapping: Converge-SC's vocabulary is in gene symbols, not ENSEMBL IDs, so you'll need to map ENSEMBL IDs to gene symbols if applicable
import json
# Load the ENSEMBL ID -> gene symbol mapping file
with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
    ensg_to_symbol = json.load(file)
# Map gene names; identifiers missing from the mapping are left unchanged
adata.var_names = adata.var_names.map(lambda col: ensg_to_symbol.get(col, col))
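The two preprocessing steps above can be illustrated on toy data in plain Python. This is only a sketch of the arithmetic and of the mapping fallback, not a replacement for the scanpy calls. The helper names (`normalize_log1p`, `map_gene_ids`) and the toy identifiers are hypothetical and chosen for illustration. The second helper also reports identifiers that were not found in the mapping and symbols that appear more than once after mapping, since several ENSEMBL IDs can map to the same symbol.

```python
import math
from collections import Counter


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]


def map_gene_ids(gene_ids, mapping):
    """Map IDs to symbols with the same fallback as above: unmapped IDs stay as-is."""
    mapped = [mapping.get(g, g) for g in gene_ids]
    unmapped = [g for g in gene_ids if g not in mapping]
    # Symbols that occur more than once after mapping
    duplicates = sorted(s for s, n in Counter(mapped).items() if n > 1)
    return mapped, unmapped, duplicates


# Toy data: hypothetical identifiers, not real ENSEMBL IDs or gene symbols
toy_mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}
toy_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
toy_counts = [2, 0, 6, 2]

values = normalize_log1p(toy_counts)
mapped, unmapped, duplicates = map_gene_ids(toy_ids, toy_mapping)

print(mapped)      # symbols where known, original ID otherwise
print(unmapped)    # IDs with no entry in the mapping
print(duplicates)  # symbols shared by more than one input ID
```

How to handle unmapped identifiers and duplicated symbols is not specified here; inspecting both lists before generating embeddings is a reasonable precaution.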
Generating Embeddings
Load Model and Tokenizer
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
Compute Embeddings
tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
embedding = model(**tokenized_cell)
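The snippet above does not define `gene_names` or `gene_values`. A minimal sketch of one way to build them is shown below, assuming they are parallel lists for a single cell: gene symbols and the matching expression values. That shape is an assumption, as is the hypothetical helper `cell_inputs`. Whether zero-expression genes should be dropped is not stated here, so filtering is optional and off by default; see examples/get_embeddings.ipynb for the actual procedure.

```python
def cell_inputs(var_names, row, drop_zeros=False):
    """Pair gene names with one cell's expression values.

    var_names: sequence of gene symbols (e.g. taken from adata.var_names)
    row: that cell's expression values, in the same gene order
    drop_zeros: if True, omit genes with zero expression (an assumption,
                not a documented requirement of the tokenizer)
    """
    if len(var_names) != len(row):
        raise ValueError("var_names and row must have the same length")
    pairs = [
        (gene, float(value))
        for gene, value in zip(var_names, row)
        if not (drop_zeros and value == 0)
    ]
    gene_names = [gene for gene, _ in pairs]
    gene_values = [value for _, value in pairs]
    return gene_names, gene_values


# Toy example with hypothetical gene symbols
gene_names, gene_values = cell_inputs(["GENE1", "GENE2", "GENE3"], [1.5, 0, 2])
print(gene_names, gene_values)
```

With an AnnData object, `row` would be one row of `adata.X` converted to a flat sequence; for a SciPy sparse matrix, something like `adata.X[i].toarray().ravel()` should work.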