@merve on Hugging Face: "OmniVision-968M: a new local VLM for edge devices, fast & small but performant…"

I think you forgot to add the actual audio processor ~

Right now it's incomplete, bro! Basically it is a LLaVA model.
For it to be anymodal you need to add the input processor for audio as well:

You're on the right track, beginning with the input methods first.
Get over this first hurdle: add the audio processor. There are two methods here: speech input (basic) and audio (Stable Audio), which identifies sounds...
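
To make the "audio processor" idea concrete, here is a rough toy sketch in plain Python of what such a processor does: slice the waveform into frames, compute a couple of features per frame, and project them to the model's embedding width so they can sit next to the text and image tokens. Everything here is an assumption for illustration (the function names, the two hand-made features, the random projection, the embedding width of 8). A real processor would more likely use log-mel spectrogram features and a pretrained audio encoder, with a projection that is learned during training.

```python
import math
import random

def frame_signal(samples, frame_len=400, hop=160):
    """Split a mono waveform (list of floats) into overlapping frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Two toy features per frame: log energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1e-10), crossings / (len(frame) - 1)]

def make_projection(in_dim, embed_dim, seed=0):
    """Random stand-in for the projection layer (trained in a real model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
            for _ in range(embed_dim)]

def audio_to_embeddings(samples, projection):
    """Map a waveform to one embedding vector per frame."""
    embeddings = []
    for frame in frame_signal(samples):
        feats = frame_features(frame)
        embeddings.append([sum(w * f for w, f in zip(row, feats))
                           for row in projection])
    return embeddings

# One second of a 440 Hz tone sampled at 16 kHz (synthetic test input).
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
projection = make_projection(in_dim=2, embed_dim=8)
audio_tokens = audio_to_embeddings(wave, projection)
```

Each entry of `audio_tokens` plays the same role as an image patch embedding in a LLaVA-style model: a vector in the language model's input space.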

Then we can design the correct outputs:

First we will need to train the model on these inputs. Just being able to return text is fine, since we are dealing with an LLM first and need to get these input processors embedded into the model space:

So:
given an image and text input
or
given a speech and image input
or
given a text and sound input

All output to text only!! (Very important stage.) We need to get the model super fit on this task before moving to generation.
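
A minimal sketch of how the first-stage training samples could be laid out, assuming a simple record with one optional slot per input modality and a text-only target. The class name, field names, and file names are hypothetical; the three allowed input pairs are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional

# The three input pairings listed above; every one of them maps to text.
ALLOWED_INPUTS = {
    frozenset({"image", "text"}),
    frozenset({"speech", "image"}),
    frozenset({"text", "sound"}),
}

@dataclass
class Stage1Example:
    target_text: str              # the only output in this stage
    text: Optional[str] = None
    image: Optional[str] = None   # path to an image file (hypothetical)
    speech: Optional[str] = None  # path to a speech clip (hypothetical)
    sound: Optional[str] = None   # path to a non-speech clip (hypothetical)

    def modalities(self):
        """Return the set of input modalities that are filled in."""
        slots = {"text": self.text, "image": self.image,
                 "speech": self.speech, "sound": self.sound}
        return frozenset(name for name, value in slots.items()
                         if value is not None)

    def is_valid(self):
        """True if the target is non-empty text and the inputs are an allowed pair."""
        return bool(self.target_text) and self.modalities() in ALLOWED_INPUTS

examples = [
    Stage1Example("A cat on a sofa.", text="What is this?", image="cat.jpg"),
    Stage1Example("A cat on a sofa.", speech="question.wav", image="cat.jpg"),
    Stage1Example("That is a dog barking.", text="What is this?", sound="bark.wav"),
]
```

Filtering the dataset with `is_valid()` before training is a cheap way to make sure nothing with a media target slips into the text-only stage.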

For the opposite modality we use the exact same training set, but this time we provide the media output as well as the text, embedding both tasks. You can even create clones and merge these models, then begin the process again on the merged model. This provides a mass scattering of tensors and their activations, so that you can do fine-tuning instead of full model training! You fine-tune each task, hence the requirement to overfit the task (it is only going to overfit on those select parameters, i.e. 13,7878908 params). On the next pass a new set of parameters will be randomly chosen, hence seeds!
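
Here is a toy sketch of the clone, fine-tune, merge loop, with a flat list of floats standing in for the model weights. The function names are made up for illustration, and merging by plain averaging is an assumption on my part; other merge methods exist. The point it shows is that the seed decides which parameters get trained on a given pass, so different seeds scatter the updates over different parts of the model.

```python
import random

def pick_trainable(num_params, k, seed):
    """Choose k parameter indices to train; the seed fixes the choice."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_params), k))

def finetune_step(weights, grads, trainable, lr=0.1):
    """One gradient step that only touches the selected parameters."""
    updated = list(weights)
    for i in trainable:
        updated[i] -= lr * grads[i]
    return updated

def merge(models):
    """Merge clones by averaging each weight (assumed merge rule)."""
    return [sum(ws) / len(models) for ws in zip(*models)]

base = [0.0] * 10    # stand-in for the pretrained weights
grads = [1.0] * 10   # stand-in for gradients from one task

# Two clones of the base model, each trained on a different random subset.
clone_a = finetune_step(base, grads, pick_trainable(10, 3, seed=1))
clone_b = finetune_step(base, grads, pick_trainable(10, 3, seed=2))

# Merge, then the next pass would start again from `merged` with new seeds.
merged = merge([clone_a, clone_b])
```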

You have a small journey ahead. You will find this model very heavy, so keep your model as small as possible!
