Crash while loading tokenizer

#1
by legraphista - opened
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)

results in

FileNotFoundError                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained('THUDM/LongCite-llama3.1-8b', trust_remote_code=True)

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:847, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    845     if os.path.isdir(pretrained_model_name_or_path):
    846         tokenizer_class.register_for_auto_class()
--> 847     return tokenizer_class.from_pretrained(
    848         pretrained_model_name_or_path, *inputs, trust_remote_code=trust_remote_code, **kwargs
    849     )
    850 elif config_tokenizer_class is not None:
    851     tokenizer_class = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:58, in TikTokenizer.from_pretrained(path, *inputs, **kwargs)
     56 @staticmethod
     57 def from_pretrained(path, *inputs, **kwargs):
---> 58     return TikTokenizer(vocab_file=os.path.join(path, "tokenizer.tiktoken"))

File ~/.cache/huggingface/modules/transformers_modules/THUDM/LongCite-llama3.1-8b/8265f5e5bceab232605db43e6e0c6579ff941354/tiktoken_tokenizer.py:67, in TikTokenizer.__init__(self, vocab_file)
     65 if vocab_file is not None:
     66     mergeable_ranks = {}
---> 67     with open(vocab_file) as f:
     68         for line in f:
     69             token, rank = line.strip().split()

FileNotFoundError: [Errno 2] No such file or directory: 'THUDM/LongCite-llama3.1-8b/tokenizer.tiktoken'
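For context on the root cause: the custom `tiktoken_tokenizer.py` shipped with the repo joins whatever `path` it receives with the filename `tokenizer.tiktoken`, which only works when `path` is a local directory. When `from_pretrained` passes the hub id instead, the join produces a relative path that doesn't exist locally. A minimal sketch of what goes wrong:

```python
import os

# The custom from_pretrained treats the model id as a local directory,
# so joining it with the vocab filename yields a nonexistent relative path:
vocab_file = os.path.join("THUDM/LongCite-llama3.1-8b", "tokenizer.tiktoken")
print(vocab_file)  # THUDM/LongCite-llama3.1-8b/tokenizer.tiktoken
print(os.path.exists(vocab_file))  # False, hence the FileNotFoundError
```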

yes, the same issue.

A workaround is to download the model locally (with `huggingface-cli download`) and load it via its local path instead of the model id.

ok, thanks for your workaround!

Awesome, thank you for the workaround.

Here is a bit more detail for those who, like me, are using paths instead of ids for the first time :)

  1. Download the model (the CLI takes the repo id, not the URL):

    huggingface-cli download THUDM/LongCite-llama3.1-8b

  2. Point `from_pretrained` at the local path. Important: you must provide the snapshot directory; the repo root /home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/ alone won't work.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    path = '/home/someuser/.cache/huggingface/hub/models--THUDM--LongCite-llama3.1-8b/snapshots/58260b89bc2a547b814f44b89914b1e282b2d5cd/'
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map='auto',
    )
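If you'd rather not copy the commit hash by hand, the snapshot directory can be resolved from the cache itself: the hub cache records the commit of the `main` ref in a `refs/main` file next to the `snapshots/` directory. A minimal sketch, assuming the standard huggingface_hub cache layout (the helper name `resolve_snapshot_path` is mine; it is demonstrated here against a throwaway fake cache):

```python
import os
import tempfile

def resolve_snapshot_path(cache_dir: str, repo_id: str) -> str:
    """Return the snapshot directory for a repo in a huggingface_hub cache.

    The cache stores each repo under models--ORG--NAME; refs/main holds
    the commit hash, and the files live under snapshots/<hash>.
    """
    repo_dir = os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))
    with open(os.path.join(repo_dir, "refs", "main")) as f:
        commit = f.read().strip()
    return os.path.join(repo_dir, "snapshots", commit)

# Demonstrate against a throwaway cache laid out like the real one:
cache = tempfile.mkdtemp()
repo_dir = os.path.join(cache, "models--THUDM--LongCite-llama3.1-8b")
commit = "58260b89bc2a547b814f44b89914b1e282b2d5cd"
os.makedirs(os.path.join(repo_dir, "refs"))
os.makedirs(os.path.join(repo_dir, "snapshots", commit))
with open(os.path.join(repo_dir, "refs", "main"), "w") as f:
    f.write(commit)

path = resolve_snapshot_path(cache, "THUDM/LongCite-llama3.1-8b")
print(path)
```

Alternatively, `huggingface_hub.snapshot_download(repo_id)` returns this path directly after downloading.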

To the developers: Thank you for this amazing model. I had high expectations, and they have been surpassed.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Thanks for pointing out this bug. We have fixed it now.

NeoZ123 changed discussion status to closed