---
metrics:
- matthews_correlation
- f1
tags:
- biology
- medical
---

This version of DNABERT2 has been modified so that it can also output the attention, for attention analysis.

**To the author of DNABERT2, feel free to use those modifications.**

Use ```--model_name_or_path jaandoui/DNABERT2-AttentionExtracted``` instead of the original repository to have access to the attention.

Most of the modifications were made in Bert_Layer.py. The model has been modified specifically for fine-tuning and has not been tried for pretraining.

Each modification is marked with "JAANDOUI" before or next to it, so to see all modifications, search for "JAANDOUI". "JAANDOUI TODO" marks a part where something might still be missing if that part is going to be used.

Now in ```Trainer``` (or ```CustomTrainer``` if overridden), in ```compute_loss(..)```, when calling the model with ```outputs = model(**inputs, return_dict=True, output_attentions=True)```, activate the extraction of attention with ```output_attentions=True``` (```return_dict=True``` is optional). You can now extract the attention from ```outputs.attentions```.

Read more about model outputs here: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/output#transformers.utils.ModelOutput

The official link to DNABERT2: [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).

README OF THE OFFICIAL DNABERT2:

We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.

To load the model from huggingface:
```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```

To calculate the embedding of a DNA sequence:
```
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
```
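
To collect the attention from the modified repository (`jaandoui/DNABERT2-AttentionExtracted`), the call described at the top of this README can be wrapped as in the sketch below. The helper names `get_attentions` and `mean_attention` are hypothetical and not part of the repository. The averaging step assumes that every entry of `outputs.attentions` is a tensor of shape `[batch, heads, seq_len, seq_len]`, as in standard Hugging Face BERT models; this has not been checked against the modified code, so print the shapes first.

```python
import torch


def get_attentions(model, inputs):
    """Run one forward pass and return the per-layer attention tensors.

    `inputs` is a dict of keyword arguments for the model, for example
    {"input_ids": tensor}. The call mirrors the one shown for compute_loss.
    """
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, return_dict=True, output_attentions=True)
    return outputs.attentions


def mean_attention(attentions):
    """Average attention over layers and heads.

    Assumption: each element has shape [batch, heads, seq_len, seq_len].
    Returns a tensor of shape [batch, seq_len, seq_len].
    """
    stacked = torch.stack(list(attentions))  # [layers, batch, heads, seq, seq]
    return stacked.mean(dim=(0, 2))
```

A possible use, after loading the tokenizer and model from `jaandoui/DNABERT2-AttentionExtracted` with `trust_remote_code=True`, would be `attentions = get_attentions(model, {"input_ids": inputs})` followed by `mean_attention(attentions)`. Whether the model class you load accepts `output_attentions=True` in the same way as inside `compute_loss(..)` should be verified on your setup.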