THUDM/cogvlm2-llama3-chat-19B · multiple image support?

wamozart

May 30, 2024

Hi.

Great work. Does it support multiple images as input as well?

zRzRzRzRzRzRzR

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 4, 2024

Not yet. Just one image.

alextsgnv

Jun 6, 2024

edited Jun 6, 2024

Hello, if the question is still relevant, then this can be done.

https://github.com/THUDM/CogVLM/issues/143

It worked in the previous version, and it still works in this one, but I did not evaluate the quality.

wamozart

Jun 6, 2024

@alextsgnv Were you able to do that? If so, can you share the code snippet?

alextsgnv

Jun 6, 2024

@wamozart Yes, I did. The code is in the link:
https://github.com/THUDM/CogVLM/issues/143

deleted

Jul 16, 2024

edited Jul 16, 2024

I'm glad I found this model as well. Thank you so much for sharing. I don't think anyone has realized yet just how powerful this vision model is!
Like @wamozart, I was looking for batch image processing, in my case to caption my own dataset for fine-tuning a new SDXL model. I used the GPT4V captioner by @JiayeV:
https://github.com/jiayev/GPT4V-Image-Captioner
It is really worth it for high-quality LoRA training, which I did (https://civitai.com/user/Ababiya), and it turned out really great. But it would be pricey to caption my 30k dataset that way for fine-tuning a checkpoint, and that's where CogVLM2 comes in. Thank you @alextsgnv for the link, I'll look into it. Please make it support batch image processing like @JiayeV's captioner; that would be a game changer.
I also did a small comparison between these two models and the results. I used https://civitai.com/models/133005/juggernaut-xl to train on the dataset, along with a ComfyUI workflow. You be the judge.
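
Since the model takes one image per request (per the reply above), one way to approximate batch captioning is simply to loop over a folder and call the model once per file. The following is a minimal sketch under that assumption, not the method from the linked issue or from GPT4V-Image-Captioner. `caption_fn` is a hypothetical placeholder for whatever single-image inference call you already have, and the `.txt` sidecar layout and the extension list are assumptions as well.

```python
from pathlib import Path
from typing import Callable, Iterator

# Assumption: which file extensions count as images in the dataset folder.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def iter_images(folder: Path) -> Iterator[Path]:
    """Yield image files directly inside `folder`, in sorted order."""
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            yield path


def caption_folder(
    folder: Path,
    caption_fn: Callable[[Path], str],  # hypothetical: wraps one single-image model call
    overwrite: bool = False,
) -> int:
    """Caption every image one at a time and write <name>.txt next to it.

    Returns the number of caption files written. Existing caption files are
    skipped unless `overwrite` is True, so an interrupted run can be resumed.
    """
    written = 0
    for image_path in iter_images(folder):
        out_path = image_path.with_suffix(".txt")
        if out_path.exists() and not overwrite:
            continue
        caption = caption_fn(image_path).strip()
        out_path.write_text(caption + "\n", encoding="utf-8")
        written += 1
    return written
```

To use it, pass a function that loads one image, runs your existing single-image inference, and returns the caption string. Skipping existing files matters for a large dataset, because a crash partway through does not force recaptioning everything.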