Trained on custom dataset not working
Hi authors,
I was trying to train on my custom dataset. After training successfully on my action data for a while, I saved the weights to a local directory. I then ran this code:
import numpy as np
from transformers import AutoProcessor
# Load the tokenizer from the local directory where the trained weights were saved
tokenizer = AutoProcessor.from_pretrained("new_weight/fast_tokenizer")
# Tokenize & decode action chunks (we use dummy data here)
action_data = np.random.rand(1, 20, 19, 12).tolist() # one batch of action chunks
tokens = tokenizer(action_data) # tokens = list[int]
decoded_actions = tokenizer.decode(tokens)
Then, the following error occurred:
Traceback (most recent call last):
File "/home/X/Desktop/robocasa/", line 10, in
tokens = tokenizer(action_data) # tokens = list[int]
File "/home/X/miniconda3/envs/robocasa/lib/python3.10/site-packages/transformers/", line 2868, in call
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/X/miniconda3/envs/robocasa/lib/python3.10/site-packages/transformers/", line 2928, in _call_one
raise ValueError(
ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Could you please look at this issue and suggest how to fix it? Thanks!
One thing to note: your input action chunk should be 3-dimensional ([batch, chunk_horizon, action_dim]), but yours is currently 4-dimensional.
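For reference, here is a minimal sketch of collapsing the extra leading dimension with NumPy. It assumes the last two axes of the dummy array are (chunk_horizon, action_dim) and that everything before them is batch-like; that layout is an assumption for illustration, not something confirmed in this thread.

```python
import numpy as np

# Dummy data with the 4-D shape used in the snippet above: (1, 20, 19, 12).
action_data = np.random.rand(1, 20, 19, 12)

# Assumption: the last two axes are (chunk_horizon, action_dim) and all axes
# before them are batch-like, so the leading axes can be merged into one.
action_chunks = action_data.reshape(-1, *action_data.shape[-2:])

print(action_chunks.shape)  # (20, 19, 12)
```

This only fixes the shape; it does not by itself guarantee that the loaded object accepts numeric arrays.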
Hello, author,
I also tried training the weights, and this is my code.
tokenizer = AutoProcessor.from_pretrained("michaelyeah7/my_new_tokenizer", trust_remote_code=True)
action_data_numpy = []
for traj in dataset:
    action = traj['action']
    chunked_action = create_overlapping_chunks(action, chunk_size=4)
    # Tokenize chunked_action
    chunked_action_np = chunked_action.numpy()
    print("chunked_action_np shape", chunked_action_np.shape)
    tokens = tokenizer(chunked_action_np)
My input action chunk is 3-dimensional but I still encountered a similar issue.
traj[action] Tensor("concat:0", shape=(None, 7), dtype=float32)
chunked_action_np shape (36, 4, 7)
Traceback (most recent call last):
File "/home/user/open-pi-zero/", line 232, in
tokens = tokenizer(chunked_action_np)
File "/home/user/miniconda3/envs/op0/lib/python3.10/site-packages/transformers/", line 2868, in call
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/user/miniconda3/envs/op0/lib/python3.10/site-packages/transformers/", line 2928, in _call_one
raise ValueError(
ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Could you please provide some suggestions? Thank you!
Actually, I just played with the code. I tried 1-, 2-, 3-, and 4-dimensional inputs, and none of them work.
It seems the tokenizer is just being loaded as a regular PreTrainedTokenizerFast when you load from disk.
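A quick way to check this on your side is to look at the class of the object that `from_pretrained` returned. The helper below is a hypothetical sketch (the name `describe_loaded_processor` is made up for illustration); it compares class names only, so it does not need to import transformers itself.

```python
def describe_loaded_processor(obj) -> str:
    """Report the class of a loaded object and flag a text-tokenizer fallback.

    Hypothetical helper: it checks class names in the MRO instead of
    importing transformers, so it works on any object.
    """
    mro_names = [cls.__name__ for cls in type(obj).__mro__]
    if "PreTrainedTokenizerFast" in mro_names:
        # A text tokenizer expects str inputs, which matches the ValueError above.
        return f"{type(obj).__name__}: looks like a plain text tokenizer"
    return f"{type(obj).__name__}: not a plain text tokenizer"
```

Calling `print(describe_loaded_processor(tokenizer))` right after loading should show which case you are in.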
For now @XuZhao2025 @andvg3 you could try the following, as a workaround:
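# Load the reference FAST processor to get its class, then load the custom weights through that class
base = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
tokenizer = base.from_pretrained("michaelyeah7/my_new_tokenizer")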
I believe they have some misalignments between