	---
	license: cc-by-nc-sa-4.0
	widget:
	- text: ACCTGA<mask>TTCTGAGTC
	tags:
	- DNA
	- biology
	- genomics
	- segmentation
	---
	# segment-nt-multi-species

	Segment-NT-multi-species is a segmentation model that leverages the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic
	elements in a sequence at single-nucleotide resolution. It is the result of finetuning the [Segment-NT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome
	as well as the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.

	For the multi-species finetuning, we curated a dataset restricted to a subset of the annotations used to train Segment-NT, mainly because only this subset of annotations is
	available for these species. The annotations therefore cover the 7 main gene elements available from [Ensembl](https://www.ensembl.org/index.html), namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
	splice acceptor and splice donor sites.


	Developed by: [InstaDeep](https://huggingface.co/InstaDeepAI)

	### Model Sources

	<!-- Provide the basic links for the model. -->

	- Repository: [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
	- Paper: [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

	### How to use

	<!-- Need to adapt this section to our model. Need to figure out how to load the models from huggingface and do inference on them -->
	Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:
	```bash
	pip install --upgrade git+https://github.com/huggingface/transformers.git
	```

	A short code snippet is given below to retrieve both the logits and the embeddings from a dummy DNA sequence.
	```
	⚠️ The maximum sequence length is set by default to the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However, Segment-NT has
	been shown to generalize to sequences of up to 50,000 bp. If you need to run inference on sequences between 30 kbp and 50 kbp, make sure to change the `rescaling_factor`
	argument in the config to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference
	(i.e. 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the maximum number of tokens on which the backbone nucleotide-transformer was trained, i.e. `2048`.
	```
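
As a rough illustration of the note above, the arithmetic behind `rescaling_factor` can be sketched as follows. The helper name `compute_rescaling_factor` is hypothetical and not part of the model code. It assumes 6-mer tokenization, a sequence length that is a multiple of 6, and one prepended CLS token, which matches the 40008 bp example in the note.

```python
def compute_rescaling_factor(
    sequence_length_bp: int,
    kmer_size: int = 6,             # assumption: every token covers 6 nucleotides
    max_num_tokens_nt: int = 2048,  # value quoted in the note above
) -> float:
    """Hypothetical helper: num_dna_tokens_inference / max_num_tokens_nt."""
    if sequence_length_bp % kmer_size != 0:
        raise ValueError("This sketch only handles lengths that are a multiple of kmer_size.")
    # Number of 6-mer tokens plus the prepended CLS token
    num_dna_tokens_inference = sequence_length_bp // kmer_size + 1
    return num_dna_tokens_inference / max_num_tokens_nt


# 40008 bp -> 6668 six-mers + 1 CLS = 6669 tokens
print(round(compute_rescaling_factor(40008), 4))  # 3.2563
```

The resulting value is what the note asks you to set as the `rescaling_factor` argument in the model config before running inference on longer sequences.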

	```python
	# Load model and tokenizer
	from transformers import AutoTokenizer, AutoModel
	import torch

	tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
	model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)

	# Choose the length to which the input sequences are padded. By default, the
	# model max length is chosen, but feel free to decrease it as the time taken to
	# obtain the embeddings increases significantly with it.
	# The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by
	# 2 to the power of the number of downsampling blocks, i.e. 4.
	max_length = 12 + 1

	assert (max_length - 1) % 4 == 0, (
	    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
	    "2 to the power of the number of downsampling blocks, i.e. 4.")

	# Create a dummy dna sequence and tokenize it
	sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
	tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

	# Infer
	attention_mask = tokens != tokenizer.pad_token_id
	outs = model(
	    tokens,
	    attention_mask=attention_mask,
	    output_hidden_states=True,
	)

	# Obtain the logits over the genomic features
	logits = outs.logits.detach()
	# Transform them into probabilities
	probabilities = torch.nn.functional.softmax(logits, dim=-1)
	print(f"Probabilities shape: {probabilities.shape}")

	# Get probabilities associated with intron
	idx_intron = model.config.features.index("intron")
	probabilities_intron = probabilities[:,:,idx_intron]
	print(f"Intron probabilities shape: {probabilities_intron.shape}")
	```
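
The snippet above hard-codes `max_length = 12 + 1`. For other inputs, a valid padding length can be derived from the number of DNA tokens of the longest sequence. The helper below is a minimal sketch with a hypothetical name (`padded_max_length`); it only encodes the divisibility rule stated in the code comments and assumes one prepended CLS token.

```python
def padded_max_length(num_dna_tokens: int, num_downsampling_blocks: int = 2) -> int:
    """Hypothetical helper: smallest valid `max_length` for a given DNA token count."""
    multiple = 2 ** num_downsampling_blocks  # 4 for 2 downsampling blocks
    # Round the DNA token count up to the next multiple (ceiling division)
    padded_dna_tokens = -(-num_dna_tokens // multiple) * multiple
    # Add one position for the prepended CLS token
    return padded_dna_tokens + 1


max_length = padded_max_length(9)
# Same check as in the snippet above
assert (max_length - 1) % 4 == 0
print(max_length)  # 13
```

Any larger value that satisfies the same check also works; smaller values simply reduce padding and therefore inference time.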


	## Training data

	The segment-nt-multi-species model was finetuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out
	for validation (training monitoring) and for testing (final evaluation).

	## Training procedure

	### Preprocessing

	The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences into 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

	```
	<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
	```
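
To make the input format concrete, here is a simplified sketch of non-overlapping 6-mer tokenization in plain Python. It is not the actual Nucleotide Transformer tokenizer: the function name `kmer_tokenize` is hypothetical, the handling of leftover nucleotides as single-nucleotide tokens is an assumption, and special cases such as `N` bases and padding are ignored.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Simplified sketch: CLS token followed by left-to-right, non-overlapping k-mers."""
    tokens = ["<CLS>"]
    position = 0
    # Consume full k-mers from left to right
    while position + k <= len(sequence):
        tokens.append(sequence[position:position + k])
        position += k
    # Assumption: leftover nucleotides become single-nucleotide tokens
    tokens.extend(sequence[position:])
    return tokens


print(kmer_tokenize("ACGTGTACGTGCACGGACGACTAGTCAGCA"))
```

For this 30-nucleotide input, the sketch yields the CLS token followed by the five 6-mers shown in the example above.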

	### Training

	The model was finetuned on a DGX H100 node with 8 GPUs, on a total of 8B tokens, for 3 days.


	### Architecture

	The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed
	the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
	blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters
	to 562M.
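
The two downsampling blocks are also the reason the number of DNA tokens must be divisible by 4. The sketch below traces the sequence length through the head under the assumption that each downsampling block halves the length and each upsampling block doubles it; this is a common U-Net convention, not something the description above confirms. The function name `unet_lengths` is hypothetical.

```python
def unet_lengths(num_dna_tokens: int, num_blocks: int = 2) -> list[int]:
    """Trace the sequence length through the down- and upsampling blocks.

    Assumption: each downsampling block halves the length (integer division)
    and each upsampling block doubles it.
    """
    lengths = [num_dna_tokens]
    for _ in range(num_blocks):   # downsampling path
        lengths.append(lengths[-1] // 2)
    for _ in range(num_blocks):   # upsampling path
        lengths.append(lengths[-1] * 2)
    return lengths


print(unet_lengths(12))  # [12, 6, 3, 6, 12]
print(unet_lengths(10))  # [10, 5, 2, 4, 8]
```

With 12 tokens the output length matches the input, so every token gets a prediction. With 10 tokens the lengths no longer line up after upsampling, which is the situation the divisibility check in the usage snippet guards against.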

	### BibTeX entry and citation info

	#TODO: Add bibtex citation here
	```bibtex

	```