Newer LLaVA versions (1.6) with multi-image training!
#1
by
ibadami
- opened
This is amazing! Do you plan to add newer LLaVA versions trained with multi-image support?
We have actually already tried LLaVA-NeXT (LLaVA 1.6) models. However, because of their dynamic image resolution feature, they convert a single image into about 2880 tokens on average (see https://huggingface.co/blog/idefics2). In comparison, LLaVA 1.5 converts each image into 576 tokens, which is acceptable for our experiments. At roughly 2880 tokens per image, multi-image reasoning becomes too expensive, and our context length cannot accommodate that many tokens for a single image alone.
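To make the budget concrete, here is a rough back-of-the-envelope sketch. The 576 and 2880 figures are the per-image counts quoted above; the context length of 4096 and the 256 tokens reserved for text are assumptions chosen only for illustration, not our actual settings, and `images_that_fit` is a hypothetical helper.

```python
def images_that_fit(context_length: int, tokens_per_image: int, text_tokens: int = 0) -> int:
    """Return how many whole images fit in the context after reserving text tokens."""
    remaining = context_length - text_tokens
    if remaining <= 0:
        return 0
    return remaining // tokens_per_image


# Per-image visual token counts quoted in this thread.
LLAVA_1_5_TOKENS = 576
LLAVA_NEXT_TOKENS = 2880  # approximate average

# Assumed values, for illustration only.
CONTEXT_LENGTH = 4096
TEXT_TOKENS = 256

for name, per_image in [("LLaVA 1.5", LLAVA_1_5_TOKENS), ("LLaVA-NeXT", LLAVA_NEXT_TOKENS)]:
    count = images_that_fit(CONTEXT_LENGTH, per_image, TEXT_TOKENS)
    print(f"{name}: {per_image} tokens/image, {count} image(s) fit")
```

Under these assumed numbers, several LLaVA 1.5 images fit in one prompt, while a single LLaVA-NeXT image already uses most of the window.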
We are working on more token-efficient designs for multi-image reasoning, such as using a resampler. These methods are still under exploration.
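For readers unfamiliar with the idea, a resampler generally means a small set of learned query vectors that cross-attend to the image tokens, so each image is represented by far fewer tokens. The snippet below is only a generic single-head sketch in NumPy with random weights, not the design we are exploring; the sizes (64 latents, hidden size 32) and the `resample` function are assumptions for illustration.

```python
import numpy as np


def resample(image_tokens, latents, w_q, w_k, w_v):
    """Compress image tokens into len(latents) tokens via one cross-attention step."""
    q = latents @ w_q        # (n_latents, d)
    k = image_tokens @ w_k   # (n_tokens, d)
    v = image_tokens @ w_v   # (n_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over image tokens
    return weights @ v       # (n_latents, d)


rng = np.random.default_rng(0)
d = 32                                       # assumed hidden size
image_tokens = rng.normal(size=(576, d))     # one LLaVA 1.5 image worth of tokens
latents = rng.normal(size=(64, d))           # assumed number of learned queries
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

compressed = resample(image_tokens, latents, w_q, w_k, w_v)
print(image_tokens.shape, "->", compressed.shape)
```

In a real model the latents and projection matrices would be trained, and the output length is fixed by the number of latents regardless of how many tokens the image produced.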
wenhu
changed discussion status to
closed