TIGER-Lab/Mantis-llava-7b · Newer llava versions 1.6 with multi image training!

Apr 26

This is amazing! Do you plan to add newer LLAVA versions trained with multi-image support?

TIGER-Lab org Apr 26

Actually, we have already tried LLaVA-NeXT (LLaVA 1.6) models. However, because of their dynamic image resolution feature, they usually convert a single image into about 2880 tokens on average (see https://huggingface.co/blog/idefics2). By comparison, LLaVA 1.5 converts each image into 576 tokens, which is acceptable for our experiments. 2880 tokens per image is too expensive for multi-image reasoning, and our context length cannot afford that many tokens for just a single image.
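To make the token budget concrete, here is a rough back-of-the-envelope sketch. The 576 figure follows from a 336x336 input with 14x14 patches (a 24x24 grid), and 2880 is the average quoted above. The context length of 4096 and the 256 tokens reserved for text are assumptions chosen for illustration only, not the actual training configuration.

```python
def vit_tokens_per_image(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT encoder emits for one square image."""
    side = image_size // patch_size
    return side * side


def max_images(context_len: int, tokens_per_image: int, text_budget: int = 0) -> int:
    """How many images fit in the context after reserving room for text."""
    return max(0, (context_len - text_budget) // tokens_per_image)


LLAVA_15_TOKENS = vit_tokens_per_image()  # 24 * 24 patches
LLAVA_16_TOKENS = 2880                    # average quoted in the reply above

CONTEXT_LEN = 4096   # assumed context length, for illustration only
TEXT_BUDGET = 256    # assumed room reserved for the prompt and answer

images_15 = max_images(CONTEXT_LEN, LLAVA_15_TOKENS, TEXT_BUDGET)
images_16 = max_images(CONTEXT_LEN, LLAVA_16_TOKENS, TEXT_BUDGET)

print(f"LLaVA 1.5: {LLAVA_15_TOKENS} tokens/image, {images_15} images fit")
print(f"LLaVA 1.6: {LLAVA_16_TOKENS} tokens/image, {images_16} images fit")
```

Under these assumed numbers, several LLaVA 1.5 images fit in one context, while a single LLaVA 1.6 image already takes most of it.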

We are working on a more token-efficient approach to multi-image reasoning, such as using a resampler. The methods are still under exploration.

wenhu changed discussion status to closed May 18