---
license: apache-2.0
inference: false
---

# LLaVA-RLHF Model Card

## Model details

**Model type:**
LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding, achieving strong visual reasoning and perception capabilities in the spirit of the multimodal GPT-4.
Trained with Factually Augmented RLHF, LLaVA-RLHF is more helpful and less prone to hallucination than LLaVA and other open-source LMMs.

**Usage:**
**NOTE: The RLHF-ed model is trained with LoRA in the bfloat16 data type.**
Users must apply the PEFT-LoRA adapter on top of the LLaVA-SFT+ base model, as shown below.

```python
import torch
from peft import PeftModel
# LlavaLlamaForCausalLM is provided by the LLaVA codebase bundled with LLaVA-RLHF.
from llava.model import LlavaLlamaForCausalLM

dtype = torch.bfloat16

model_path = "LLaVA-RLHF-13b-v1.5-336/sft_model"
lora_path = "LLaVA-RLHF-13b-v1.5-336/rlhf_lora_adapter_model"

# Load the LLaVA-SFT+ base model in bfloat16 on a single GPU.
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)

# Apply the RLHF LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(
    model,
    lora_path,
)
```
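
After loading, the LoRA weights can optionally be merged into the base model for plain inference. A minimal sketch (assuming the tokenizer files ship in the SFT+ base model directory, and using PEFT's `merge_and_unload`):

```python
from transformers import AutoTokenizer

# Assumption: the tokenizer is stored alongside the SFT+ base model.
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Merge the LoRA weights into the base model to avoid adapter overhead at inference time.
model = model.merge_and_unload()
model.eval()
```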

**Model date:**
LLaVA-RLHF was trained in September 2023.

**Paper or resources for more information:**
https://llava-rlhf.github.io/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/Edward-Sun/LLaVA-RLHF/issues

## Intended use

**Primary intended uses:**
The primary use of LLaVA-RLHF is research on large multimodal chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset

- 595K filtered image-text pairs from CC3M.
- 150K GPT-generated multimodal instruction-following chat data.
- 83K VQA v2 instruction-following VQA data.
- 16K A-OKVQA instruction-following CoT-VQA data.
- 23K FLICKR instruction-following spotting captioning data.
- 10K LLaVA-based human preference data.