license: cc-by-nc-nd-4.0
Model Card for CellFM
CellFM is a large-scale foundation model pre-trained on transcriptomics of 100 million human cells.
Model Details
CellFM is a large-scale foundation model that helps efficiently analyze single-cell data and exploit the rich knowledge contained in single-cell atlas datasets. CellFM was pre-trained on a dataset of approximately 100 million human cells. The model can be applied to a range of single-cell tasks, including cell type annotation, prediction of responses to perturbations and gene function prediction.
Model Description
- Developed by: Yuansong Zeng, Jiancong Xie, Ningyuan Shangguan, Zhuoyi Wei, Wenbing Li, Yun Su, Shuangyu Yang, Chengyang Zhang, Jinbo Zhang, Nan Fang, Hongyu Zhang, Yutong Lu, Weijiang Yu, Jue Fan, Huiying Zhao, Yuedong Yang
- Model type: RetNet-based foundation model
- License: cc-by-nc-nd-4.0
- Github Repository: CellFM
Uses
Installation
To reproduce CellFM, we suggest first creating a conda environment:
conda create -n CellFM python=3.9
conda activate CellFM
and then install the required packages below:
- mindspore=2.2.10
- scanpy=1.10
- scib=1.1.5
Optional
- gears
- torch
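After installation, it can be useful to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of the CellFM repository: it assumes the PyPI distribution names are the same as the package names in the list, and the helper name check_requirements is hypothetical.

```python
from importlib import metadata

# Versions listed in this model card. The distribution names are assumed to
# match the package names used above.
REQUIRED = {"mindspore": "2.2.10", "scanpy": "1.10", "scib": "1.1.5"}


def check_requirements(required):
    """Return {package: (expected, installed_or_None, ok)} for each requirement."""
    report = {}
    for name, expected in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Prefix match, so that "1.10" also accepts 1.10.0, 1.10.1, and so on.
        ok = installed is not None and (
            installed == expected or installed.startswith(expected + ".")
        )
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_requirements(REQUIRED).items():
        status = "OK" if ok else "MISMATCH or MISSING"
        print(f"{name}: expected {expected}, found {installed} -> {status}")
```

The optional packages (gears, torch) are left out of the check because no versions are given for them.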
Data preprocessing
To run CellFM, first preprocess the data, which should be in h5 or h5ad format. The preprocessing pipelines for the different downstream tasks are described in process.ipynb. We recommend storing the processed datasets in the Datasets directory.
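Before training, one may want to check which processed files in a dataset directory have a supported extension. The sketch below is illustrative only: the helper names is_supported and find_datasets are hypothetical, and treating the file stem as the dataset name is an assumption rather than documented CellFM behaviour.

```python
from pathlib import Path

# File extensions named in the text above.
SUPPORTED_SUFFIXES = {".h5", ".h5ad"}


def is_supported(filename):
    """True if the file name ends in one of the supported extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_SUFFIXES


def find_datasets(dataset_dir):
    """Map an assumed dataset name (the file stem) to its path."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return {}
    return {
        p.stem: p
        for p in sorted(root.iterdir())
        if p.is_file() and is_supported(p.name)
    }


if __name__ == "__main__":
    # "Datasets" is the directory recommended above; adjust the path as needed.
    for name, path in find_datasets("Datasets").items():
        print(f"{name}: {path}")
```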
Train on the new dataset
We provide a script, train.py, for fine-tuning or training on new datasets. For example, to train on the HumanPBMC dataset with a single NPU device, run:
# Train with single device
python train.py --data HumanPBMC --batch 4 --epoch 5 --load_pretrain [--fp16] [--lora LORA_RANK] [--workpath /DIR/TO/WORKSPACE]
- --data: dataset name. Note that the dataset should be located in /DIR/TO/WORKSPACE/datasets with h5 or h5ad format.
- --batch: batch size.
- --epoch: the number of training epochs.
- --load_pretrain: load the pretrained weight of CellFM.
- --fp16: optional. Run training in half-precision mode.
- --lora: optional. Update the weights with the LoRA algorithm, using LORA_RANK as the hidden dimension of the LoRA block. The default is 0, i.e. LoRA is not used.
- --workpath: optional when training with a single device. The absolute path of the working directory; defaults to the directory containing the code.
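When launching several runs, the flags described above can be assembled programmatically. This is a minimal sketch that uses only the flags documented in this section. The function build_train_command is a hypothetical helper, not part of the repository, and it assumes the command is run from the directory containing train.py.

```python
import shlex


def build_train_command(data, batch, epoch, load_pretrain=True,
                        fp16=False, lora_rank=0, workpath=None):
    """Assemble the train.py argument list from the flags documented above."""
    cmd = ["python", "train.py", "--data", data,
           "--batch", str(batch), "--epoch", str(epoch)]
    if load_pretrain:
        cmd.append("--load_pretrain")
    if fp16:
        cmd.append("--fp16")
    if lora_rank > 0:  # 0 is the documented default, meaning LoRA is off
        cmd += ["--lora", str(lora_rank)]
    if workpath is not None:
        cmd += ["--workpath", workpath]
    return cmd


cmd = build_train_command("HumanPBMC", batch=4, epoch=5)
print(shlex.join(cmd))
# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids quoting problems if the work path contains spaces.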
We also provide a script for parallel training within one node. For the same example, the command below is equivalent to the command above, except that it runs on 8 devices, with each device handling inputs with a batch size of 4.
# Train parallelly in one node
bash 1node_train.sh train 4 5 HumanPBMC
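With 8 devices each handling a batch of 4, the number of cells processed per optimisation step is larger than in the single-device run. The arithmetic can be sketched as follows, assuming plain data parallelism (each device sees a distinct batch) and that a final partial batch is not dropped. The dataset size of 10,000 cells is a hypothetical value used only for illustration.

```python
import math


def global_batch_size(per_device_batch, num_devices):
    """Cells processed per step across all devices."""
    return per_device_batch * num_devices


def steps_per_epoch(num_cells, per_device_batch, num_devices):
    """Steps needed to see every cell once, keeping a final partial batch."""
    return math.ceil(num_cells / global_batch_size(per_device_batch, num_devices))


print(global_batch_size(4, 8))        # 32
print(steps_per_epoch(10_000, 4, 8))  # 313 (10,000 cells is hypothetical)
```

Because the global batch grows from 4 to 32, hyperparameters tuned for the single-device run, such as the learning rate, may need to be revisited.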
Tutorials
We provide tutorials for CellFM applications. Checkpoints of our model are stored in Model.
Tutorial 1: Cell Annotation
Tutorial 2: Gene Function Prediction
BinaryclassGeneFunction.ipynb.
MulticlassGeneFunction.ipynb.
Tutorial 3: Batch Effect Correction
Tutorial 4: Perturbation
Tutorial 5: Identifying Cell-type-specific lncRNAs
IdentifyingCelltypelncRNAs.ipynb.
Limitations
Despite the advances in CellFM, several limitations remain to be explored. First, the attention map in CellFM is limited in its ability to capture gene relationships related to static or global biological knowledge. In the future, we will explore new explainability techniques to overcome this challenge. Second, the current model was trained without multi-species data, which restricts its use in broader biological contexts and in cross-species comparisons. Finally, the model's construction did not leverage existing biological prior knowledge, which could affect its depth and accuracy in interpreting biological phenomena.
Training Data
CellFM was trained on a dataset of 19,914 samples, a total of 102,304,686 human cells from different organs and sequencing technologies. All training data utilized in this study were sourced from reputable public databases.
Evaluation
CellFM was evaluated on various single-cell datasets that were not included in the training set. Performance was assessed on cell annotation, perturbation prediction, and gene function prediction.
Citation
BibTeX:
@article{CellFM,
title={CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells},
author={Yuansong Zeng and Jiancong Xie and Zhuoyi Wei and Yun Su and Ningyuan Shangguan and Shuangyu Yang and Chengyang Zhang and Wenbing Li and Jinbo Zhang and Nan Fang and Hongyu Zhang and Huiying Zhao and Yutong Lu and Jue Fan and Weijiang Yu and Yuedong Yang},
journal={},
year={2024},
}