zhiqings/LLaVA-RLHF-13b-v1.5-336


zhiqings committed on Sep 27, 2023

Commit 91ba418 · 1 Parent(s): b67d89e

Create README.md

Files changed (1)

README.md ADDED +61 -0

	@@ -0,0 +1,61 @@

+---
+license: apache-2.0
+inference: false
+---
+# LLaVA-RLHF Model Card
+## Model details
+**Model type:**
+LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding. Its visual reasoning and perception capabilities are in the spirit of the multimodal GPT-4.
+Trained with Factually Augmented RLHF, LLaVA-RLHF is reported to be more helpful and to hallucinate less than LLaVA and other open-source LMMs.
+**Usage:**
+**NOTE: The RLHFed model is trained with LoRA and the bfloat16 data type.**
+To use it, load the LLaVA-SFT+ base model and then apply the RLHF LoRA adapter on top of it with PEFT.
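+```python
+import torch
+from llava.model import LlavaLlamaForCausalLM  # assumes the LLaVA codebase is installed
+from peft import PeftModel
+dtype = torch.bfloat16  # matches the data type used for RLHF training
+model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"  # LLaVA-SFT+ base weights
+lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"  # RLHF LoRA adapter
+model = LlavaLlamaForCausalLM.from_pretrained(
+    model_path,
+    device_map={"": "cuda:0"},  # place the whole model on the first GPU
+    torch_dtype=dtype,
+)
+model = PeftModel.from_pretrained(model, lora_path)  # wrap the base model with the adapter
+```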
+**Model date:**
+LLaVA-RLHF was trained in Sept 2023.
+**Paper or resources for more information:**
+https://llava-rlhf.github.io/
+**License:**
+Apache License 2.0
+**Where to send questions or comments about the model:**
+https://github.com/Edward-Sun/LLaVA-RLHF/issues
+## Intended use
+**Primary intended uses:**
+The primary use of LLaVA-RLHF is research on large multimodal chatbots.
+**Primary intended users:**
+The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+## Training dataset
+- 595K filtered image-text pairs from CC3M.
+- 150K GPT-generated multimodal instruction-following chat data.
+- 83K VQA v2 instruction-following VQA data.
+- 16K A-OKVQA instruction-following CoT-VQA data.
+- 23K FLICKR instruction-following spotting captioning data.
+- 10K LLaVA-based human preference data.
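
The usage section of the README above ships the RLHF weights as a LoRA adapter that is applied on top of the LLaVA-SFT+ base model. As a rough illustration of what applying such an adapter means numerically, the sketch below implements the standard LoRA update, `W + (alpha / r) * (B @ A)`, in pure Python on a toy matrix. The function names, shapes, and values are hypothetical and chosen only for illustration. This is not the PEFT implementation, and the adapter rank and scaling actually used for this model are not stated in the README.

```python
# Illustrative only: the standard LoRA update on a toy weight matrix.
# All names and numbers here are hypothetical, not taken from the model.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def apply_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a), with r the adapter rank."""
    rank = len(lora_a)  # lora_a has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (1, 2)
lora_b = [[0.5], [0.0]]  # shape (2, 1)
merged = apply_lora(base, lora_a, lora_b, alpha=2.0)
```

The base weights stay untouched and only the small `lora_a` and `lora_b` factors are stored, which is why the adapter is distributed separately from the SFT+ checkpoint.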