Despite multiple attempts and inspection of the model configuration, the model hosted on Hugging Face (`huggingface.co`) does not seem to handle sequences longer than 512 tokens.
#3 opened by hengchuangyin
Part 1: Tokenization and Dataset Preparation
from transformers import AutoTokenizer, BertForSequenceClassification
from torch.utils.data import Dataset

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = BertForSequenceClassification.from_pretrained("zhihan1996/DNABERT-2-117M", num_labels=8)


class DNADataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data          # list of (sequence, label) pairs
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        seq, label = self.data[idx]
        # Pad/truncate every sequence to 600 tokens, i.e. beyond the 512 in the config.
        inputs = self.tokenizer(seq, return_tensors='pt', padding='max_length', max_length=600, truncation=True)
        return {
            'input_ids': inputs["input_ids"].squeeze(),
            'label': label
        }
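For context, a quick sanity check of the dataset class; the sample sequence, label, and variable names below are made up purely for illustration:

# Hypothetical (sequence, label) pair just to inspect the tokenized shape.
sample_data = [("ACGT" * 200, 0)]             # made-up 800 bp sequence with dummy label 0
dataset = DNADataset(sample_data, tokenizer)
item = dataset[0]
print(item["input_ids"].shape)   # torch.Size([600]) because of padding/truncation to max_length=600
print(item["label"])             # 0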
Part 2: Retrieving Model Configuration
from transformers import AutoConfig

config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
print(config.max_position_embeddings)  # reports 512 for this checkpoint
Any news regarding this? I'm working on a paper, and my research topic involves using sequences longer than 512 tokens.
I tried it myself, and it DOES work on more than 512 tokens. Make sure you have the right Transformers version:
pip/conda install transformers==4.29
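If it helps, here is a minimal sketch to verify this (the DNA sequence is a made-up placeholder; loading with AutoModel and taking the first output as the hidden states follows the DNABERT-2 model card, and the padded length of 600 mirrors the dataset code above):

import torch
from transformers import AutoTokenizer, AutoModel

# Assumes transformers==4.29, as suggested above.
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

# Made-up sequence; pad/truncate to 600 tokens, i.e. past the 512 reported by the config.
seq = "ACGTACGTACGT" * 300
inputs = tokenizer(seq, return_tensors="pt", padding="max_length", max_length=600, truncation=True)

with torch.no_grad():
    hidden_states = model(inputs["input_ids"])[0]  # first output is the hidden states, per the model card

print(hidden_states.shape)  # expected torch.Size([1, 600, 768]) if inputs longer than 512 are handled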