---
license: apache-2.0
inference: false
---
# LLaVA-RLHF Model Card
## Model details
**Model type:**
LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding, with visual reasoning and perception capabilities in the spirit of the multimodal GPT-4.
Trained with Factually Augmented RLHF, LLaVA-RLHF is reported to be more helpful and to hallucinate less than LLaVA and other open-source LMMs.
**Usage:**
**NOTE: The RLHFed model is trained with LoRA and the bfloat16 data type.**
Users must apply the PEFT LoRA adapter on top of the LLaVA-SFT+ model.
```python
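import torch
from peft import PeftModel
# Assumption: LlavaLlamaForCausalLM is importable from the LLaVA codebase (`llava` package).
from llava.model import LlavaLlamaForCausalLM

dtype = torch.bfloat16  # matches the data type used for RLHF training

model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"
lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"

# Load the LLaVA-SFT+ base model onto a single GPU.
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)
# Wrap the base model with the RLHF LoRA adapter.
model = PeftModel.from_pretrained(
    model,
    lora_path,
)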
```
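
To make concrete what "applying" a LoRA adapter to the SFT+ weights means, here is a minimal toy sketch of the standard LoRA update, `W + (alpha / r) * B @ A`, in plain Python. The matrices, shapes, `alpha` value, and helper names (`matmul`, `apply_lora`) are hypothetical and chosen only for illustration; in practice PEFT performs this step on the real model weights.

```python
# Toy illustration of the standard LoRA update on a single weight matrix.
# All names and numbers are hypothetical; PEFT handles this for real models.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def apply_lora(base_weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    rank = len(lora_a)              # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]

# Frozen base weight (2 x 3) and a rank-1 adapter.
base_weight = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # (r, in_features) = (1, 3)
lora_b = [[0.5], [-0.5]]     # (out_features, r) = (2, 1)

adapted = apply_lora(base_weight, lora_a, lora_b, alpha=2.0)
print(adapted)  # [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

The base weights stay untouched and only the small `A` and `B` factors are stored in the adapter, which is why the adapter has to be loaded on top of the SFT+ model.
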
**Model date:**
LLaVA-RLHF was trained in September 2023.
**Paper or resources for more information:**
https://llava-rlhf.github.io/
**License:**
Apache License 2.0
**Where to send questions or comments about the model:**
https://github.com/llava-rlhf/LLaVA-RLHF/issues
## Intended use
**Primary intended uses:**
The primary use of LLaVA-RLHF is research on large multimodal chatbots.
**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
## Training dataset
- 595K filtered image-text pairs from CC3M.
- 150K GPT-generated multimodal instruction-following chat data.
- 83K VQA v2 instruction-following VQA data.
- 16K A-OKVQA instruction-following CoT-VQA data.
- 23K FLICKR instruction-following spotting captioning data.
- 10K LLaVA-based human preference data.
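
For a quick sense of relative scale, the rounded counts listed above can be tallied as follows. This is only an arithmetic sketch: the dictionary keys are hypothetical labels, and the total says nothing about which training stage uses which dataset.

```python
# Rounded counts from the list above, in thousands of examples.
# The keys are hypothetical labels chosen for this sketch.
dataset_sizes_k = {
    "cc3m_image_text_pairs": 595,
    "gpt_generated_chat": 150,
    "vqa_v2": 83,
    "a_okvqa_cot": 16,
    "flickr_spotting_captioning": 23,
    "human_preference": 10,
}

total_k = sum(dataset_sizes_k.values())

# Percentage share of each dataset in the combined listing.
shares = {
    name: round(100 * size / total_k, 1)
    for name, size in dataset_sizes_k.items()
}

print(f"total: {total_k}K examples")
for name, share in shares.items():
    print(f"{name}: {share}%")
```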