How do you encode an image in only 81 tokens?

#2
by ChristineLai - opened

How do you encode an image in only 81 tokens?

Nexa AI org

The trick is in the projector: we use a reshape to convert the 729 image tokens into 81 tokens.

I saw in your blog post that you reshape the image features from [batch, 729, hidden_size] to [batch, 81, hidden_size * 9]. Is that done before the projection, so that the projection layer is then [hidden_size * 9, hidden_size] to map into the LLM embedding domain?

Nexa AI org

Yes, that is right.
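
For anyone following along, here is a minimal NumPy sketch of the reshape-then-project idea confirmed above. It is not the actual model code. The toy sizes (`vision_hidden`, `llm_hidden`), the helper name `compress_image_tokens`, and the single linear layer are all assumptions for illustration; the real projector may be an MLP, and it may group tokens differently (for example by 3x3 spatial neighborhood rather than 9 consecutive tokens).

```python
import numpy as np

def compress_image_tokens(features, group=9):
    """Fold `group` consecutive tokens into the channel dimension.

    [batch, num_tokens, hidden] -> [batch, num_tokens // group, hidden * group]
    (hypothetical helper; grouping by consecutive tokens is an assumption)
    """
    batch, num_tokens, hidden = features.shape
    if num_tokens % group != 0:
        raise ValueError("token count must be divisible by the group size")
    return features.reshape(batch, num_tokens // group, hidden * group)

rng = np.random.default_rng(0)
batch, vision_hidden, llm_hidden = 2, 16, 32  # toy sizes, not the real ones

# Stand-in for the vision encoder output: 729 tokens per image.
image_features = rng.standard_normal((batch, 729, vision_hidden))

# 729 tokens -> 81 tokens, each 9x wider. No parameters, no information lost.
compressed = compress_image_tokens(image_features)

# Projection into the LLM embedding space: weight is [hidden * 9, llm_hidden].
w = rng.standard_normal((vision_hidden * 9, llm_hidden)) * 0.02
b = np.zeros(llm_hidden)
llm_inputs = compressed @ w + b

print(compressed.shape)  # (2, 81, 144)
print(llm_inputs.shape)  # (2, 81, 32)
```

The reshape itself only rearranges values, so the LLM sees 9x fewer image tokens while the projection layer still has access to every original feature.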

Nexa AI org

I saw in your blog post that you reshape the image features from [batch, 729, hidden_size] to [batch, 81, hidden_size * 9]. Is that done before the projection, so that the projection layer is then [hidden_size * 9, hidden_size] to map into the LLM embedding domain?

Yes, that is right

Is that specifically because of the nature of the Qwen2 LLM, or do you think this strategy would work with other LLMs? Have you run any experiments?
