merve 
posted an update Nov 16, 2024
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a new vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO to reduce hallucinations
⚡️ Apache 2.0 license 🔥

Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model https://huggingface.co/NexaAIDev/omnivision-968M

Looks great! I am currently working on simplifying the training and fine-tuning of multimodal LLMs in Torch: https://github.com/ritabratamaiti/AnyModal

I think you forgot to add the actual audio processor.

Right now it's incomplete, bro! Basically it is a LLaVA model.
For it to really be "any modal" you need to add an input processor for audio as well.

You're on the right track, beginning with the input methods first.
Get over this first hurdle: add the audio processor. There are two methods here: speech input (basic) and audio input (Stable Audio), which identifies sounds.
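
To make "input processor" concrete, here is a minimal, framework-free sketch of the idea: cut a waveform into frames and project each frame into the model's embedding width so it can sit next to text tokens. Everything in it is an assumption for illustration (the names `frame_waveform`, `make_projection`, `project`, the 16 kHz rate, the frame sizes, the tiny width of 8), and it is not AnyModal's API. In practice a pretrained speech or audio encoder would replace the raw framing step and the projection would be trained.

```python
import math
import random

def frame_waveform(samples, frame_len, hop):
    """Split a 1-D list of audio samples into overlapping fixed-length frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def make_projection(in_dim, out_dim, seed=0):
    """Random (untrained) linear map from frame size to embedding width."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(frames, weights):
    """Apply the linear map to every frame: one embedding vector per frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]

# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 16 kHz.
waveform = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
frames = frame_waveform(waveform, frame_len=400, hop=160)
weights = make_projection(in_dim=400, out_dim=8)  # 8 stands in for the LLM width
audio_tokens = project(frames, weights)
```

The result is a short sequence of "audio tokens" with the same width as the text embeddings, which is the shape the language model needs regardless of which encoder produced them.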

Then we can design the correct outputs.

First we need to train the model on these inputs. Returning only text is fine for now, since we are dealing with an LLM first and need to get these input processors embedded into the model space.

So:
given an image and text input
or
given a speech and image input
or
given a text and sound input

All output to text only!! (Very important stage.) We need to get the model super fit on this task before moving to generation.
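
A small sketch of how that first stage could be enforced on a dataset: every sample carries one of the three input pairings listed above and a text target, nothing else. The `Sample` class, its field names, and `is_stage_one_sample` are hypothetical and only illustrate the constraint.

```python
from dataclasses import dataclass
from typing import Optional

# The three input pairings from the list above.
ALLOWED_INPUTS = {
    frozenset({"image", "text"}),
    frozenset({"speech", "image"}),
    frozenset({"text", "sound"}),
}

@dataclass
class Sample:
    target_text: str                 # stage one: the output is always text
    image: Optional[str] = None      # path or ID of the media, if present
    text: Optional[str] = None
    speech: Optional[str] = None
    sound: Optional[str] = None

    def modalities(self):
        """Return the set of input modalities this sample actually provides."""
        names = ("image", "text", "speech", "sound")
        return frozenset(n for n in names if getattr(self, n) is not None)

def is_stage_one_sample(sample):
    """True if the inputs are an allowed pairing and a text target exists."""
    return sample.modalities() in ALLOWED_INPUTS and bool(sample.target_text)

ok = Sample(target_text="A dog barking.", text="What is this sound?", sound="clip.wav")
bad = Sample(target_text="A cat.", image="cat.png")  # single modality only
```

Filtering the training set with a check like this keeps the generation targets out until the text-only stage is solid.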

For the opposite direction (media output) we use the exact same training set, but this time we provide the media output as well as the text, embedding both tasks. You can even create clones and merge these models, then begin the process again on the merged model. This scatters the tensors and their activations widely, so that you can do fine-tuning instead of full model training! You fine-tune each task, hence the requirement to overfit the task. It is only going to overfit on the selected parameters (i.e. 13,7878908 params), so on the next pass a new set of parameters will be randomly chosen, hence seeds!
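
One way to read the "hence seeds" point, sketched with plain Python: a seed decides which parameter tensors are trainable on a given pass, so a new seed gives a new subset. The function `pick_trainable`, the tensor names, the sizes, and the budget are all made up for illustration, and this is an interpretation of the comment, not a description of any particular fine-tuning library.

```python
import random

def pick_trainable(param_sizes, budget, seed):
    """Pick a seed-dependent subset of tensors whose total size fits the budget."""
    rng = random.Random(seed)
    names = sorted(param_sizes)      # fixed order so the seed alone decides
    rng.shuffle(names)
    chosen, total = [], 0
    for name in names:
        size = param_sizes[name]
        if total + size <= budget:   # skip tensors that would exceed the budget
            chosen.append(name)
            total += size
    return sorted(chosen), total

# Hypothetical tensor names and parameter counts.
param_sizes = {
    "layer0.attn": 400, "layer0.mlp": 800,
    "layer1.attn": 400, "layer1.mlp": 800,
    "audio_proj": 200, "image_proj": 200,
}

first_pass, first_total = pick_trainable(param_sizes, budget=1000, seed=1)
second_pass, second_total = pick_trainable(param_sizes, budget=1000, seed=2)
```

Everything not returned by `pick_trainable` would stay frozen for that pass, and the same seed always reproduces the same selection.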

You have a small journey ahead. You will find this model very heavy, so keep your model as small as possible.
