Question about misalignment between example_lengths_file and genecorpus_30M_2048.dataset

#61
by katarinayuan - opened

Hi,
I noticed that the provided "example_lengths_file" is recommended to be passed as the example_lengths_file argument to GeneformerPretrainer, but the lengths in it are sorted. In CustomDistributedLengthGroupedSampler, however, the lengths should be in the same order as the examples in genecorpus_30M_2048.dataset rather than sorted. Could you help clarify this? Thank you!

Thank you for your interest in Geneformer and for catching this! I added the unsorted lengths file to the datasets repository and updated the links in the example for pretraining. For future reference, if you need to extract a lengths file from a different .dataset, you can do so as below. We previously provided the lengths file pre-extracted from the .dataset because it was slow to extract dynamically during training with large datasets, but Huggingface has since released updates that I believe resolve this issue, so pre-extracting is less necessary now.

from datasets import load_from_disk
genecorpus30m = load_from_disk("/path/to/genecorpus_30M_2048.dataset/")
genecorpus30m_lengths = genecorpus30m["length"]
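Continuing from the snippet above, if you want to save the extracted lengths for reuse (e.g. to pass as example_lengths_file), a minimal sketch, assuming the pretraining example's convention of loading the lengths file with pickle (the output path is a placeholder):

import pickle

# write the per-example lengths, in dataset order, to a pickle file
with open("/path/to/genecorpus_30M_2048_lengths.pkl", "wb") as fp:
    pickle.dump(genecorpus30m_lengths, fp)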

ctheodoris changed discussion status to closed

If I have a new dataset, how do I add the ["length"] attribute to it?

Thank you for your question. The provided transcriptome tokenizer does this for you.

I notice "Cells should be labeled with the total read count in the cell (column attribute "n_counts")", so can we do something like this: TranscriptomeTokenizer({"n_counts": "length"}, nproc=4)?

No, these are different values. "n_counts" is the total read count in the cell. "length" is the length of the rank value encoding (number of genes in the final rank value encoding). The "length" will be added automatically by the transcriptome tokenizer - you don't need to do anything to add it.
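For reference, a minimal usage sketch of the tokenizer. The custom_attr_name_dict argument here only maps optional cell metadata columns you want to carry over; the attribute name and the tokenize_data arguments are assumptions based on the provided examples, and "length" is added automatically regardless:

from geneformer import TranscriptomeTokenizer

# keep an optional cell metadata column (hypothetical attribute name)
tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)

# tokenize the input files into a Huggingface .dataset;
# paths and argument order are assumptions based on the documented example
tk.tokenize_data("/path/to/loom_data_directory/",
                 "/path/to/output_directory/",
                 "output_prefix")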

Thanks a lot for your amazingly fast reply 😄

About the tokenizer: "Genes should be labeled with Ensembl IDs (row attribute "ensembl_id"), which provide a unique identifier for conversion to tokens. Cells should be labeled with the total read count in the cell (column attribute "n_counts") to be used for normalization."
Could genes be labeled with gene names instead of Ensembl IDs? The token_dictionary.pkl file would then look like this:
{'<pad>': 0, '<mask>': 1, 'gene_name1': 2, 'gene_name2': 3, ..., 'gene_name100': 101}
Would this change affect the tokenizer's computation?
Thank you very much for your patient and fast replies.

Thank you for your question. Please convert the gene names in your dataset to Ensembl IDs (e.g. using Ensembl Biomart) as indicated in the provided instructions. We use Ensembl IDs because they are unique and consistent, whereas gene names arise from various naming schemes. If you convert the token dictionary instead of your dataset, you risk changing a token to a different gene than the one it was trained as, and the code also expects this dictionary to be in its current format in various places, so you would have to change many parts of the code to accept a new format.
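If it is useful to do that conversion programmatically rather than through the Biomart web interface, here is a minimal sketch using the mygene package (not part of Geneformer; the gene symbols below are only examples):

import mygene

mg = mygene.MyGeneInfo()
gene_names = ["TP53", "GAPDH", "ACTB"]  # example gene symbols

# query gene symbols for their Ensembl gene IDs (human)
results = mg.querymany(gene_names, scopes="symbol",
                       fields="ensembl.gene", species="human")

# build a symbol -> Ensembl ID map, skipping symbols with no match
symbol_to_ensembl = {}
for hit in results:
    if "ensembl" in hit:
        ens = hit["ensembl"]
        # a symbol can map to multiple Ensembl records; take the first here
        symbol_to_ensembl[hit["query"]] = (
            ens[0]["gene"] if isinstance(ens, list) else ens["gene"]
        )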

When I switched from the sorted version to the unsorted version, I noticed something interesting: after 3 epochs the loss of the unsorted run was larger than that of the sorted run. Why did this happen? I would have thought the unsorted version is closer to the actual situation. If I have misunderstood, please forgive me...
On the other hand, it was amazing that the unsorted version only took half the time! The two versions have the same format and the same file size, and the only difference is the order, so why did their training times differ so much?
I really appreciate you responding so quickly every time O(∩_∩)O~

Thank you for your question. The length-grouped training improves the speed of training, as discussed in the manuscript, so using the matched lengths likely sped up the training by leading to better length-grouping than random, which would be the result of non-matched lengths. In our experience the model performance is essentially identical when using different seeds, etc., as shown in the manuscript. I would suggest you zoom out to see the entire training loss plot to confirm that the pattern is the same in terms of smooth improvement in loss and that it's not just a scaling issue of the y-axis. We provide the pretrained model in this repository so that users do not need to retrain it from scratch.

Hi, nice to see the discussion here. I just want to mention something. I have a similar observation to @allenxiao's: the converged loss when pretraining with matched lengths is larger than when pretraining with unmatched lengths. I confirmed that the steps of the loss curves are aligned, so it's unlikely to be a y-axis scaling issue. Any guess as to why this happens?

Thanks for your comment. If you look in the pretrainer module, you can see where the variable "lengths" is used. More randomization may generally lead to improved pretraining, though in our experience with subsets of the data it did not make a large enough difference to warrant the slowdown of training. It could also be differentially affected by the batch size used (i.e. better randomization may have a bigger effect depending on batch size); I'm not sure if you changed that based on your resources. I also don't know how large the difference in loss was for you and whether it translates into improved performance on downstream tasks or not.

When converting gene names to Ensembl IDs for some datasets, we found that some IDs have been retired. In this situation, should we drop the retired gene or just use the retired ID? If we use retired IDs in the model, would that have a negative impact on model training?
Thank you for your reply.

Thank you for your question. I cannot tell if you mean that your dataset has genes that are no longer within the annotations of Ensembl or that the token dictionary has retired IDs. The purpose of converting the gene names to Ensembl IDs is just to arrive at a common, consistent, and unique annotation for the tokenizer to use to convert the genes to tokens for use with the model. If the Ensembl IDs are not within the provided token dictionary, then those genes were not within the vocabulary of the pretrained model. This can happen with genes that were not detected in any of the 30 million cells used for pretraining and also with genes that were not protein-coding or miRNA genes, since those were our inclusion criteria. These genes can be added to the vocabulary but were just not pretrained. If you present the model with additional data during fine-tuning that contains these genes, then it will learn from that data.
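If it helps to check which of your Ensembl IDs fall outside the pretrained vocabulary, a minimal sketch (the token dictionary path and the example IDs are placeholders):

import pickle

# load the provided token dictionary (Ensembl ID -> token id)
with open("/path/to/token_dictionary.pkl", "rb") as fp:
    token_dictionary = pickle.load(fp)

# hypothetical list of Ensembl IDs from your own dataset
my_ensembl_ids = ["ENSG00000141510", "ENSG00000111640"]

# genes absent from the dictionary were not in the pretraining vocabulary
missing = [g for g in my_ensembl_ids if g not in token_dictionary]
print(f"{len(missing)} of {len(my_ensembl_ids)} genes are outside the pretrained vocabulary")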

Sorry, I didn't make myself clear. I mean, for the purpose of pretraining with my own dataset (not Geneformer's 30 million cells), when converting gene names to Ensembl IDs, some genes are no longer within the annotations of Ensembl's latest version. For example, one gene ID was included in GRCh37, but when GRCh38 was released it was no longer included, so it is retired. However, the cells' expression data for the retired gene still exist, because the collected dataset is organized by gene names. So could I use the old-version Ensembl IDs (converted from gene names) for model pretraining? If yes, the dataset may contain genes from both GRCh37 and GRCh38; would that be allowed in Geneformer's pretraining scheme with regard to consistency ("The purpose of converting the gene names to Ensembl IDs is just to arrive at a common consistent and unique annotation for the tokenizer")? Apart from that, I think the uniqueness requirement for the gene ID annotation is satisfied.
Thank you for the great tool and your reply.

Thank you for the additional information. If I understand correctly, from a purely computational standpoint this will not pose an issue for pretraining as long as the same transcript is not labeled with two different Ensembl IDs depending on which genome assembly was used for the original data preprocessing. However, from a biological standpoint, it would be prudent to determine the reasons why the genes were retired, as this may inform whether it would be best to exclude them.
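As a quick sanity check on that consistency condition, one could verify that no gene name in the dataset ends up labeled with more than one Ensembl ID; a minimal sketch, where the (gene name, Ensembl ID) pairs are hypothetical records from your own conversion step:

from collections import defaultdict

# hypothetical (gene_name, ensembl_id) pairs collected during conversion,
# possibly mixing GRCh37- and GRCh38-derived IDs
conversion_records = [
    ("GENE_A", "ENSG00000000001"),
    ("GENE_A", "ENSG00000000002"),  # same gene under a second (retired) ID
    ("GENE_B", "ENSG00000000003"),
]

symbol_to_ids = defaultdict(set)
for symbol, ensembl_id in conversion_records:
    symbol_to_ids[symbol].add(ensembl_id)

# flag gene names labeled with more than one Ensembl ID across assemblies
conflicts = {s: ids for s, ids in symbol_to_ids.items() if len(ids) > 1}
print(f"{len(conflicts)} gene name(s) map to multiple Ensembl IDs")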
