How do you encode an image in only 81 tokens?
The trick is in the projector: we use a reshape to convert the 729 image tokens into 81 tokens.
I saw in your blog post that you reshape the image features from [batch, 729, hidden_size] to [batch, 81, hidden_size*9]. Is that done before projection, so that the projection layer is then [hidden_size*9, hidden_size] to map into the LLM embedding domain?
Yes, that is right.
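For readers following along, here is a minimal sketch of the reshape described above, written in plain Python so the shapes are easy to check. The 729 and 81 come from this thread; `HIDDEN_SIZE = 4`, the helper name `merge_tokens`, and the choice to group 9 consecutive tokens (what a plain row-major reshape does) are assumptions for illustration, not confirmed details of the actual model. The batch dimension is left out, since the same step applies to each image independently.

```python
# Toy sizes: 729 and 81 come from the thread; HIDDEN_SIZE = 4 is an arbitrary
# stand-in for the vision encoder's real width (assumption).
NUM_TOKENS = 729
GROUP = 9            # 729 / 81 tokens merged into one
HIDDEN_SIZE = 4


def merge_tokens(features, group=GROUP):
    """Reshape [tokens, hidden] -> [tokens // group, hidden * group].

    Hypothetical helper: mirrors a row-major reshape, so each output token
    is the concatenation of `group` consecutive input tokens.
    """
    if len(features) % group != 0:
        raise ValueError("token count must be divisible by group")
    merged = []
    for start in range(0, len(features), group):
        row = []
        for token in features[start:start + group]:
            row.extend(token)  # append this token's features to the merged token
        merged.append(row)
    return merged


# One image's features: token i is filled with the value i so the grouping is visible.
image_features = [[float(i)] * HIDDEN_SIZE for i in range(NUM_TOKENS)]
merged = merge_tokens(image_features)

print(len(image_features), len(image_features[0]))  # 729 4
print(len(merged), len(merged[0]))                  # 81 36
```

With a tensor library this would typically be a single reshape call followed by a linear layer whose input width is hidden_size*9. No information is dropped by the reshape itself; the sequence just becomes 9x shorter and each token 9x wider before projection.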
Is that specific to the Qwen2 LLM, or do you think this strategy would also work with other LLMs? Have you run any experiments?