Despite multiple attempts and an examination of the model configuration, the model hosted on Hugging Face (`huggingface.co`) still does not seem to handle sequences longer than 512 tokens.

#3
by hengchuangyin - opened

Part 1: Tokenization and Dataset Preparation

from transformers import AutoTokenizer, BertForSequenceClassification
from torch.utils.data import Dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = BertForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", num_labels=8)

class DNADataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data          # list of (DNA sequence, class index) pairs
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq, label = self.data[idx]
        # Pad/truncate every sequence to 600 tokens, i.e. beyond the 512-token limit in question.
        inputs = self.tokenizer(seq, return_tensors='pt', padding='max_length', max_length=600, truncation=True)
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }

Part 2: Retrieving Model Configuration

from transformers import AutoConfig

config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
print(config.max_position_embeddings)  # the positional-embedding size declared in the config
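As a quick sanity check, the tokenized length of a long input can be compared against that config value; the sequence below is a made-up placeholder, and `tokenizer` is the one loaded in Part 1:

long_dna = "ACGT" * 1000  # made-up 4,000-base sequence, purely for illustration

# Number of tokens DNABERT-2's BPE tokenizer actually produces for it
n_tokens = len(tokenizer(long_dna)["input_ids"])
print(n_tokens, config.max_position_embeddings)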

Any news regarding this? I'm working on a paper, and my research topic involves sequences longer than 512 tokens.

I tried it myself, and it DOES work on more than 512 tokens. Make sure you have the right Transformers package:
pip install transformers==4.29 (or the conda equivalent)
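
For anyone landing here with the same problem, below is a minimal sketch of a check you could run with that pinned transformers version, following the AutoModel + trust_remote_code loading path shown on the model card. The repeated DNA string is a made-up placeholder, and the `[0]` indexing assumes the model returns the hidden states as the first element of its output tuple:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# Made-up long sequence, repeated enough times that it should tokenize to well over 512 tokens.
long_dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC" * 100

input_ids = tokenizer(long_dna, return_tensors="pt")["input_ids"]
print(input_ids.shape)  # confirm the token count actually exceeds 512

with torch.no_grad():
    hidden_states = model(input_ids)[0]  # expected shape: [1, sequence_length, 768]
print(hidden_states.shape)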
