Visual Question Answering
English
nms05 committed on
Commit 8bb8a28
1 Parent(s): af0e838

Update README.md


Weights (align and fine-tune stages) for the DinoV2-SigLIP-Phi3(LoRA) model. The model details are as follows:

* **Vision Encoder** - DinoV2 + SigLIP @384px resolution.
* **Connector** - MLP (DinoV2 and SigLIP features are concatenated and then projected into the Phi3 representation space)
* **Language Model** - Phi3 + LoRA
* **Pre-train (Align) Dataset** - LLaVA-CC3M-Pretrain-595K
* **Fine-tune (Instruction) Dataset** - LLaVA-v1.5-Instruct + LRV-Instruct
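The list above names the components but not how data flows between them. The following is a minimal, dependency-free Python sketch of that flow, not the project's implementation. It rests on assumptions this README does not state: both encoders emit the same number of patch tokens and their features are concatenated channel-wise per patch; the connector is a 2-layer MLP with GELU (a LLaVA-1.5-style projector); and LoRA is the standard low-rank update y = W x + (alpha / r) B A x applied to frozen Phi3 linear layers. The names `MLPConnector`, `LoRALinear`, `linear`, `gelu`, `rand_matrix` and all dimensions are hypothetical toy values; the real code lives in the project's GitHub repository.

```python
import math
import random

# Toy sizes for illustration only (hypothetical). The real sizes depend on the
# DinoV2 / SigLIP / Phi3 checkpoints, which this README does not specify.
NUM_PATCHES = 4   # patch tokens per image (assumed equal for both encoders)
DINO_DIM = 3      # per-patch DinoV2 feature size
SIGLIP_DIM = 2    # per-patch SigLIP feature size
HIDDEN_DIM = 5    # connector hidden width
LLM_DIM = 6       # Phi3 token-embedding size


def gelu(x):
    """Exact GELU, x * Phi(x), via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b for a single vector; weight is a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MLPConnector:
    """Hypothetical 2-layer GELU MLP: concat(DinoV2, SigLIP) patch features -> Phi3 space."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = rand_matrix(hidden_dim, in_dim, rng), [0.0] * hidden_dim
        self.w2, self.b2 = rand_matrix(out_dim, hidden_dim, rng), [0.0] * out_dim

    def __call__(self, dino_patches, siglip_patches):
        if len(dino_patches) != len(siglip_patches):
            raise ValueError("both encoders must yield the same number of patches")
        tokens = []
        for d, s in zip(dino_patches, siglip_patches):
            fused = d + s  # list concatenation = channel-wise concat for one patch
            h = [gelu(v) for v in linear(fused, self.w1, self.b1)]
            tokens.append(linear(h, self.w2, self.b2))  # one visual token for the LLM
        return tokens


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                              # frozen pretrained weight
        self.a = rand_matrix(rank, in_dim, rng)           # trainable, random init
        self.b = [[0.0] * rank for _ in range(out_dim)]   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = linear(x, self.weight, [0.0] * len(self.weight))
        ax = linear(x, self.a, [0.0] * len(self.a))
        bax = linear(ax, self.b, [0.0] * len(self.b))
        return [y + self.scale * delta for y, delta in zip(base, bax)]


# Usage: random stand-ins for per-patch encoder outputs.
rng = random.Random(42)
dino = [[rng.gauss(0.0, 1.0) for _ in range(DINO_DIM)] for _ in range(NUM_PATCHES)]
siglip = [[rng.gauss(0.0, 1.0) for _ in range(SIGLIP_DIM)] for _ in range(NUM_PATCHES)]
connector = MLPConnector(DINO_DIM + SIGLIP_DIM, HIDDEN_DIM, LLM_DIM)
visual_tokens = connector(dino, siglip)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 6
```

Zero-initializing `B` means each adapted layer starts out identical to the frozen Phi3 layer, so fine-tuning begins from the pretrained behavior; in LLaVA-style recipes only the adapter matrices and the connector would receive gradients.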

Scripts to build and train the model are available at [DinoV2-SigLIP-Phi3-LoRA-VLM](https://github.com/NMS05/DinoV2-SigLIP-Phi3-LoRA-VLM).

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
@@ -0,0 +1,9 @@
+ # DinoV2-SigLIP-Phi3(LoRA)
+
+ ## Model and Dataset Details
+
+ * **Vision Encoder** - DinoV2 + SigLIP @384px resolution.
+ * **Connector** - MLP (Dino and SigLIP features are concatenated and then projected to Phi3 representation space)
+ * **Language Model** - Phi3 + LoRA
+ * **Pre-train (Align) Dataset** - LLaVA-CC3M-Pretrain-595K
+ * **Fine-tune (Instruction) Dataset** - LLAVA-v1.5-Instruct + LRV-Instruct