---
license: cc-by-nc-nd-4.0
---

# Model Card for CellFM

CellFM is a large-scale foundation model pre-trained on transcriptomics of 100 million human cells.

## Model Details

CellFM is a large-scale foundation model that enables efficient analysis of single-cell data and exploits the rich knowledge contained in single-cell atlas datasets. It was pre-trained on approximately 100 million human cells and can be applied to a range of single-cell tasks, including cell type annotation, prediction of responses to perturbations, and gene function prediction.

### Model Description

- **Developed by:** Yuansong Zeng, Jiancong Xie, Ningyuan Shangguan, Zhuoyi Wei, Wenbing Li, Yun Su, Shuangyu Yang, Chengyang Zhang, Jinbo Zhang, Nan Fang, Hongyu Zhang, Yutong Lu, Weijiang Yu, Jue Fan, Huiying Zhao, Yuedong Yang
- **Model type:** RetNet-based foundation model
- **License:** cc-by-nc-nd-4.0
- **GitHub Repository:** [CellFM](https://github.com/biomed-AI/CellFM)

## Uses

### Installation

To reproduce **CellFM**, we suggest first creating a conda environment:

~~~shell
conda create -n CellFM python=3.9
conda activate CellFM
~~~

and then installing the required packages below (a quick import check is sketched after the list):

- mindspore=2.2.10
- scanpy=1.10
- scib=1.1.5

#### Optional

- gears
- torch
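
As a quick sanity check that the pinned dependencies are installed and importable, a minimal sketch (the script name is illustrative):

~~~python
# check_env.py -- confirm the pinned CellFM dependencies resolve.
from importlib.metadata import version

for pkg in ("mindspore", "scanpy", "scib"):
    __import__(pkg)                 # fails loudly if the package is missing
    print(f"{pkg} {version(pkg)}")  # compare against the pins listed above
~~~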

### Data preprocessing

To run **CellFM**, the data must first be preprocessed and stored in h5 or h5ad format. The preprocessing pipelines for the different downstream tasks are described in [process.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/process.ipynb). We recommend storing the processed datasets in the [Datasets](#Datasets) directory.
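
For illustration only, a typical scanpy pass over a raw h5ad file might look like the sketch below; the file names and thresholds are placeholders, and the authoritative, task-specific steps are the ones in process.ipynb:

~~~python
# Illustrative scanpy preprocessing for an h5ad input; paths and
# parameter values are placeholders, not the pipeline's actual settings.
import scanpy as sc

adata = sc.read_h5ad("raw_data.h5ad")          # cells x genes AnnData

sc.pp.filter_cells(adata, min_genes=200)       # drop near-empty cells
sc.pp.filter_genes(adata, min_cells=3)         # drop rarely expressed genes
sc.pp.normalize_total(adata, target_sum=1e4)   # library-size normalization
sc.pp.log1p(adata)                             # log(1 + x) transform

adata.write_h5ad("datasets/HumanPBMC.h5ad")    # store under the datasets directory
~~~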

### Train on a new dataset

We provide a script, [train.py](https://github.com/biomed-AI/CellFM/blob/main/train.py), for fine-tuning or training on new datasets. For example, we can train on the HumanPBMC dataset with a single NPU device by executing:

~~~shell
# Train with a single device
python train.py --data HumanPBMC --batch 4 --epoch 5 --load_pretrain [--fp16] [--lora LORA_RANK] [--workpath /DIR/TO/WORKSPACE]
~~~

- --data: dataset name. Note that the dataset should be located in /DIR/TO/WORKSPACE/datasets in h5 or h5ad format.
- --batch: batch size.
- --epoch: the number of training epochs.
- --load_pretrain: load the pretrained weights of **CellFM**.
- --fp16: optional. Run training in half-precision mode.
- --lora: optional. Update the weights with the LoRA algorithm, using LORA_RANK as the hidden dimension of the LoRA blocks. Defaults to 0, i.e. LoRA is not used.
- --workpath: optional when training with a single device. Sets the **absolute path** of the work directory; defaults to the directory containing the code.

We also provide a script for parallel training within a single node. The command below runs the same example as above, except that it uses 8 devices, each handling a batch of size 4:

```shell
# Train in parallel on one node
bash 1node_train.sh train 4 5 HumanPBMC
```
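
Assuming a run saves a standard MindSpore checkpoint (the file path below is hypothetical), its parameters can be inspected with MindSpore's stock loader:

~~~python
# Peek at a saved checkpoint; substitute the .ckpt file your run produced.
import mindspore as ms

param_dict = ms.load_checkpoint("workspace/CellFM_finetuned.ckpt")  # hypothetical path

# load_checkpoint returns a dict mapping parameter names to mindspore.Parameter.
for name, param in param_dict.items():
    print(name, tuple(param.shape))
~~~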

### Tutorials

We provide tutorials for CellFM applications.

#### Tutorial 1: Cell Annotation

[CellAnnotation](https://github.com/biomed-AI/CellFM/blob/main/tutorials/CellAnnotation)

#### Tutorial 2: Gene Function Prediction

- [BinaryclassGeneFunction.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/BinaryclassGeneFunction.ipynb)
- [MulticlassGeneFunction.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/MulticlassGeneFunction.ipynb)

#### Tutorial 3: Batch Effect Correction

[BatchIntegration.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/BatchIntegration/BatchIntegration.ipynb)

#### Tutorial 4: Perturbation

[GenePerturbation.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/Perturbation/GenePerturbation.ipynb)

#### Tutorial 5: Identifying Cell-type-specific lncRNAs

[IdentifyingCelltypelncRNAs.ipynb](https://github.com/biomed-AI/CellFM/blob/main/tutorials/IdentifyingCelltypelncRNAs.ipynb)

## Limitations

Despite the advances in CellFM, several limitations remain. First, the attention map in CellFM is limited in capturing gene relationships tied to static or global biological knowledge; we will explore new explainability techniques to overcome this challenge. Second, the current model was trained without multi-species data, which restricts its use in broader biological contexts and cross-species comparisons. Finally, the model's construction did not leverage existing biological prior knowledge, which could affect its depth and accuracy in interpreting biological phenomena.

## Training Data

CellFM was trained on 19,914 samples totaling 102,304,686 human cells from different organs and sequencing technologies. All training data were sourced from reputable public databases.

## Evaluation

CellFM was evaluated on single-cell datasets that were not included in the training set, across cell type annotation, perturbation-response prediction, and gene function prediction tasks.

## Citation

**BibTeX:**

~~~bibtex
@article{CellFM,
  title={CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells},
  author={Yuansong Zeng and Jiancong Xie and Zhuoyi Wei and Yun Su and Ningyuan Shangguan and Shuangyu Yang and Chengyang Zhang and Wenbing Li and Jinbo Zhang and Nan Fang and Hongyu Zhang and Huiying Zhao and Yutong Lu and Jue Fan and Weijiang Yu and Yuedong Yang},
  journal={},
  year={2024}
}
~~~