Some things struck me as odd.
I was even able to move this mmproj into the GGUF folder of a fine-tuned Yi-34B-based model, treat that model as a vision model, and it worked normally.
This is mysterious to me, and I don't understand why it works. Maybe I should look up and read the relevant materials.
Anyway, thanks for everything you have done!
I believe this is as expected! There could be some issues, but since it's a fine-tune of the same base model, the projector still maps the image's embedding space onto the model's embedding space, and I don't believe that embedding space changes much with fine-tunes. From my limited understanding, conceptually we create an embedding space that both the LLM and the vision part of the model "speak". Now that they speak the same language, we basically just use that language everywhere: when the model receives an image, the image gets converted into this "language" and takes up tokens just like putting text in does.
I don't 100% know this to be the case, but I've read it in a few places. I'm certainly not an expert, but conceptually it makes sense to me. Reading the LLaVA paper itself, or maybe the CLIP paper/relevant docs, would probably be most helpful here. I've had someone explain something like this to me in the past, but I've not done too much digging independently other than just trying to make things work hahah
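To make that a bit more concrete, here's a toy sketch of the idea. This is not llama.cpp's actual code; the class, dimensions, and tensors below are all made up purely to illustrate how a LLaVA-style projector maps vision features into the LLM's embedding space, where they're then spliced into the sequence like extra tokens:

```python
# Toy sketch only -- illustrative names and dimensions, not real llama.cpp/LLaVA code.
import torch
import torch.nn as nn

class ToyProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.5 uses a small MLP like this; earlier LLaVA used a single linear layer.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (num_patches, vision_dim) from a CLIP-like vision tower
        return self.mlp(patch_features)  # (num_patches, llm_dim)

# Pretend output of a CLIP-like vision encoder for one image (576 patches).
image_patches = torch.randn(576, 1024)

projector = ToyProjector()
image_embeds = projector(image_patches)  # now in the LLM's embedding space

# Pretend embeddings of the text prompt tokens (what the LLM's embedding table would produce).
text_embeds = torch.randn(12, 4096)

# The projected image features are just concatenated into the sequence like extra "tokens",
# which is why an image consumes context length the same way text does.
llm_input = torch.cat([image_embeds, text_embeds], dim=0)
print(llm_input.shape)  # torch.Size([588, 4096])
```

Since the projector only has to land its outputs in the base model's embedding space, a fine-tune that barely shifts that space can plausibly reuse the same mmproj, which would line up with what you observed.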
Thanks for your explanation. I roughly understand it.