GeneMamba: Efficient and Effective Large Cell Model on Single Cell Data

Model Description

GeneMamba is a pretrained model for analyzing single-cell RNA sequencing (scRNA-seq) data. It is built on the Mamba state space model architecture rather than a transformer, and it represents each cell as a sentence whose tokens are genes. The model is optimized for long sequences and supports context lengths of up to 8192 tokens. This lets it include low-expression genes that were previously ignored, giving a more complete view of each cell's gene expression profile.
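
To make the "cells as sentences, genes as tokens" framing concrete, the sketch below turns one cell's expression values into a gene-token sequence. It assumes a rank-based ordering (highest expression first) and drops zero-expression genes. Both choices are illustrative assumptions, not a description of GeneMamba's actual tokenizer. The helper name `cell_to_tokens` and the gene names are hypothetical; only the 8192-token limit comes from the description above.

```python
MAX_CONTEXT = 8192  # context length stated in the model description


def cell_to_tokens(expression, max_len=MAX_CONTEXT):
    """Turn {gene: expression} into a gene-token list (hypothetical helper).

    Assumption: genes are ordered by descending expression, ties are broken
    alphabetically, and genes with zero expression are dropped.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in expressed[:max_len]]


# Toy cell with made-up gene names and values.
cell = {"GeneA": 0.0, "GeneB": 5.2, "GeneC": 1.1, "GeneD": 5.2, "GeneE": 0.3}
tokens = cell_to_tokens(cell)
print(tokens)  # ['GeneB', 'GeneD', 'GeneC', 'GeneE']
```

With a long context, a low-expression gene such as GeneE stays in the sequence. Calling `cell_to_tokens(cell, max_len=2)` instead keeps only the two most expressed genes, which is the kind of truncation a short context forces.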

Key Features

Scalability: Trained on more than 50 million cells to support generalization across datasets.
Versatility: Supports multiple tasks, including gene classification, pathway analysis, and gene-pair correlations.
Pretrained Efficiency: Leverages large-scale pretraining to encode gene relationships effectively.

Applications

GeneMamba is suited to tasks that depend on modeling gene-gene interactions across many genes, such as:

Cell Type Prediction: Facilitates accurate classification of cell types.
Gene Pathway Analysis: Uncovers complex relationships between genes and pathways.
Context-Aware Gene Correlation: Detects gene expression patterns influenced by broader biological contexts.
Gene Ranking Reconstruction: Evaluates gene importance by reconstructing ranking correlations.

Training Dataset

GeneMamba was trained on a diverse collection of scRNA-seq datasets, encompassing various tissue types, species, and experimental conditions. Preprocessing used Scanpy: per-cell count normalization (sc.pp.normalize_total) followed by a logarithmic transformation (sc.pp.log1p). Together these steps reduce the effect of differing sequencing depth between cells and compress the dynamic range of expression values.
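
For readers unfamiliar with the two Scanpy calls named above, the following plain-Python sketch shows the arithmetic they perform on a count matrix: each cell's counts are scaled to a common total, then transformed with log(1 + x). The target total of 10,000 and the helper name `normalize_and_log` are assumptions made for illustration; the settings used for GeneMamba's pretraining are not stated here. In practice, the Scanpy functions operate on an AnnData object rather than nested lists.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p.

    Hypothetical helper mirroring the arithmetic of sc.pp.normalize_total
    followed by sc.pp.log1p; target_sum=1e4 is an assumed value.
    """
    processed = []
    for cell in counts:
        total = sum(cell)
        # All-zero cells stay all-zero; this avoids dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        processed.append([math.log1p(value * scale) for value in cell])
    return processed


# Two toy cells with different sequencing depth but the same proportions.
raw = [[1, 3, 0], [10, 30, 0]]
normalized = normalize_and_log(raw)
for row in normalized:
    print([round(value, 3) for value in row])
```

The two toy cells differ tenfold in raw depth but have the same gene proportions, so they end up with matching values after normalization.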

Usage

The model can be loaded and fine-tuned for specific scRNA-seq tasks using the Hugging Face Transformers library:

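import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained GeneMamba model and tokenizer.
# "your-hf-repo/GeneMamba" is a placeholder; replace it with the actual repository ID.
model = AutoModel.from_pretrained("your-hf-repo/GeneMamba")
tokenizer = AutoTokenizer.from_pretrained("your-hf-repo/GeneMamba")
model.eval()  # inference mode; skip this line when fine-tuning

# Example: encode one cell given as a space-separated sequence of gene tokens
inputs = tokenizer(["Gene1 Gene2 Gene3 ..."], return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():  # gradients are not needed for encoding
    outputs = model(**inputs)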

Supported Tasks

Token classification
Sequence-to-sequence modeling
Embedding generation for downstream analysis
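
For embedding generation, a common way to obtain one vector per cell is to average the per-token hidden states while ignoring padding positions. The sketch below shows this masked mean pooling in plain Python on toy data. Whether GeneMamba's own pipeline uses mean pooling is an assumption, and `mean_pool` is a hypothetical helper name. With real model outputs, the same computation would typically be applied to the hidden-state tensor and the tokenizer's attention mask.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1 (hypothetical helper).

    hidden_states: list of token vectors for one cell, shape [seq_len][dim].
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / kept for value in summed]


# Toy example: three tokens of dimension 2; the last one is padding.
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
cell_embedding = mean_pool(states, mask)
print(cell_embedding)  # [2.0, 3.0]
```

Masking matters because padded positions would otherwise pull the cell embedding toward arbitrary values, as the large padding vector in the toy example would.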

Performance Metrics

GeneMamba has been benchmarked against other state-of-the-art models, such as scGPT and Geneformer, across 10 tasks, where it consistently ranked competitively. Reported metrics include:

Clusterness Score (CTS)
Hopkins Score (HS)
Average Task Performance (bubble plot visualization available in supplementary material)
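
The Hopkins statistic is a standard measure of clustering tendency, and the sketch below shows one common formulation on toy 2-D points. It is an illustration only: the exact definition, sampling scheme, and embedding space behind the Hopkins Score reported for GeneMamba are not specified here, and the function name `hopkins_statistic` is hypothetical. Under this formulation, values near 0.5 suggest roughly uniform data and values near 1 suggest clustered data.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its nearest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def hopkins_statistic(points, sample_size, seed=0):
    """One common form of the Hopkins statistic: sum(u) / (sum(u) + sum(w)).

    u: distances from uniform random points (drawn in the data's bounding box)
       to their nearest data point.
    w: distances from sampled data points to their nearest other data point.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    sampled_idx = rng.sample(range(len(points)), sample_size)
    w = [
        nearest_distance(points[i], points[:i] + points[i + 1:])
        for i in sampled_idx
    ]
    u = [
        nearest_distance([rng.uniform(lows[d], highs[d]) for d in range(dims)], points)
        for _ in range(sample_size)
    ]
    return sum(u) / (sum(u) + sum(w))


# Toy data: two tight clusters far apart, so the score should be well above 0.5.
data_rng = random.Random(42)
clustered = [
    [cx + data_rng.gauss(0, 0.1), cy + data_rng.gauss(0, 0.1)]
    for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
    for _ in range(50)
]
score = hopkins_statistic(clustered, sample_size=20)
print(round(score, 3))
```

Applied to cell embeddings, a score of this kind indicates whether the embedding space has cluster structure at all, without requiring cell-type labels.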

Limitations

The model may require fine-tuning for datasets with novel cell types or experimental conditions.
Performance might vary for genes or pathways underrepresented in the training dataset.

Ethical Considerations

GeneMamba is intended for research and educational purposes. Users should validate results before applying them in clinical or commercial settings. Model usage should comply with relevant ethical guidelines and regulations.

Citation

If you use GeneMamba in your research, please cite:

@article{GeneMamba2024,
  title={GeneMamba: Efficient and Effective Large Cell Model on Single Cell Data},
  author={Cong, et al.},
  journal={Under Review},
  year={2024}
}