How do you encode an image in only 81 tokens?

by ChristineLai - opened 19 days ago

Discussion

ChristineLai

19 days ago

How do you encode an image in only 81 tokens?

alexchen4ai

Nexa AI org 19 days ago

The trick is in the projector: we use a reshape to convert the 729 image tokens into 81 tokens.

asdfjghkjhiu

15 days ago

I saw in your blog post that you reshape the image features from [batch, 729, hidden_size] to [batch, 81, hidden_size*9]. Is that done before the projection, so that the projection layer is then [hidden_size*9, hidden_size] to map into the LLM embedding domain?

alexchen4ai

Nexa AI org 15 days ago

Yes, that is right.
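
For readers following along, the exchange above can be sketched in NumPy. This is only an illustration, not Nexa AI's actual code. It assumes that every 9 consecutive tokens are merged (the thread does not say which tokens are grouped, and a spatial 3x3 grouping is equally possible) and that the projector is a single weight matrix (it may well be an MLP). The function name `merge_and_project` and the toy sizes are hypothetical.

```python
import numpy as np


def merge_and_project(image_features, weight, group=9):
    """Merge every `group` consecutive tokens, then project to the LLM width.

    image_features: [batch, num_tokens, vision_hidden]
    weight:         [vision_hidden * group, llm_hidden] (assumed single linear layer)
    """
    batch, num_tokens, hidden = image_features.shape
    if num_tokens % group != 0:
        raise ValueError("num_tokens must be divisible by group")
    # [batch, 729, hidden] -> [batch, 81, hidden * 9]: no values are dropped,
    # the sequence just gets 9x shorter and each token 9x wider.
    merged = image_features.reshape(batch, num_tokens // group, hidden * group)
    # [batch, 81, hidden * 9] @ [hidden * 9, llm_hidden] -> [batch, 81, llm_hidden]
    return merged @ weight


rng = np.random.default_rng(0)
batch, vision_hidden, llm_hidden = 2, 16, 32  # toy sizes, not the real model's
features = rng.standard_normal((batch, 729, vision_hidden))
weight = rng.standard_normal((vision_hidden * 9, llm_hidden))

out = merge_and_project(features, weight)
print(out.shape)  # (2, 81, 32)
```

The reshape itself has no parameters; the only learned part in this sketch is the projection, whose input width grows by a factor of 9 to absorb the merged tokens.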

asdfjghkjhiu

15 days ago

Is that specific to the Qwen2 LLM, or do you think this strategy would work with other LLMs? Have you run any experiments?
