[CLS] Token
by insaf-im
import torch
from transformers import VideoMAEModel

model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
pixel_values = torch.randn(1, 16, 3, 224, 224)  # dummy video: 16 frames of 224x224 RGB
last_hidden_states = model(pixel_values).last_hidden_state
list(last_hidden_states.shape)
[1, 1568, 768]
The VideoMAE encoder outputs a sequence of 1568 tokens, each a 768-dimensional feature vector (1 is the batch size).
Can I please know which is the [CLS] token?
Hi,
VideoMAE does not use a CLS token. The sequence length is equal to (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.
Hence, in this case: (16//2) * (224 // 16)**2 = 1568.
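To make the arithmetic explicit, here is the same computation spelled out with the default config values of the "MCG-NJU/videomae-base" checkpoint:

```python
# Default config values for MCG-NJU/videomae-base
num_frames = 16
tubelet_size = 2
image_size = 224
patch_size = 16

num_patches_per_frame = (image_size // patch_size) ** 2  # 14 * 14 = 196
seq_len = (num_frames // tubelet_size) * num_patches_per_frame  # 8 * 196
print(seq_len)  # 1568
```

Each tubelet spans 2 frames, so 16 frames give 8 temporal slots, and each frame is divided into a 14 x 14 grid of 16 x 16 patches.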
To get a representation of an entire video, you can simply average pool the last hidden states along the sequence dimension:
import torch
video_features = torch.mean(last_hidden_states, dim=1)  # shape: [1, 768]