Building Your Own Multimodal Large Model from Scratch

For the Chinese version of this README, please refer to the Chinese documentation (δΈ­ζ–‡ζ–‡ζ‘£).

Model Architecture πŸ€–

In this VLM (Visual Language Model), the visual encoder is CLIP or SigLIP, both of which already provide preliminary image-text semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space. The forward method of QWenModel is overridden so that the placeholder image tokens in the input sequence are replaced with the projected visual features.
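The sketch below illustrates this idea: a two-layer MLP projector followed by a merge step that scatters projected visual features into the positions of the image placeholder tokens. It is a minimal illustration, not the repository's actual code; the names `VisualProjector`, `merge_image_features`, and the dimension/parameter names are assumptions for exposition.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP mapping CLIP/SigLIP patch features into the LLM embedding space.

    Hypothetical class name; the real project may structure this differently.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.mlp(image_features)


def merge_image_features(
    input_ids: torch.Tensor,      # (batch, seq_len) token ids with image placeholders
    inputs_embeds: torch.Tensor,  # (batch, seq_len, llm_dim) text token embeddings
    image_embeds: torch.Tensor,   # (batch, num_patches, llm_dim) projected visual features
    image_token_id: int,
) -> torch.Tensor:
    """Replace embeddings at image-placeholder positions with visual features.

    Assumes each sequence contains exactly num_patches placeholder tokens,
    so the flattened visual features line up with the masked positions.
    """
    mask = input_ids == image_token_id  # True at image placeholder positions
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[mask] = image_embeds.reshape(-1, image_embeds.size(-1)).to(inputs_embeds.dtype)
    return inputs_embeds
```

In an overridden `forward`, this merge would run right after the embedding lookup and before the transformer layers, so the rest of the language model treats the visual features as ordinary token embeddings.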

GitHub Repository 🏠

The code for training and running the model can be found in the Basic-Visual-Language-Model repository.

References πŸ“š

Special thanks to the following projects for their great work πŸ™Œ:

Contact βœ‰

If you have any questions or ideas, feel free to reach out to me 😊:

hsinyanghuang7@gmail.com

I will respond as soon as I see your email!
