Xkev committed on
Commit 519623a
1 Parent(s): cef4116

Update README.md

Files changed (1)
  1. README.md +81 -1
README.md CHANGED
@@ -5,4 +5,84 @@ language:
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: visual-question-answering
---
# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->

This model card was generated from [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).

## Model Details

<!-- Provide a longer summary of what this model is. -->

- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## Reproduction

<!-- This section describes the evaluation protocols and provides the results. -->

To reproduce our results, use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) with the following generation settings.

| Parameter      | Value |
|----------------|-------|
| do_sample      | True  |
| temperature    | 0.6   |
| top_p          | 0.9   |
| max_new_tokens | 2048  |

You can change these settings in [this file](https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/llama_vision.py), lines 80-83, and modify max_new_tokens throughout the file.

Note: We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend max_new_tokens to 2048.
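
For reference, these are standard `transformers` sampling arguments. A minimal sketch of the values (not the actual code in `llama_vision.py`, which should be checked directly):

```python
# Sampling settings from the table above, expressed as keyword arguments for
# transformers' model.generate(). VLMEvalKit sets equivalent values around
# lines 80-83 of vlmeval/vlm/llama_vision.py; variable names there may differ.
generation_kwargs = dict(
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=2048,
)
```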

After you get the results, you should filter the model output and **keep only the text between \<CONCLUSION\> and \</CONCLUSION\>**.

In theory this should make no difference, but empirically we observe some performance differences because the judge (GPT-4o) can be inaccurate at times.

By keeping only the text between \<CONCLUSION\> and \</CONCLUSION\>, most answers can be extracted directly by the VLMEvalKit system, which is much less biased.
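
A minimal post-processing sketch for this filtering step (falling back to the raw output when the tags are missing is an assumption here, not part of the documented protocol):

```python
import re

def keep_conclusion(output: str) -> str:
    """Return the text inside the last <CONCLUSION>...</CONCLUSION> block,
    or the raw output if the tags are missing."""
    matches = re.findall(r"<CONCLUSION>(.*?)</CONCLUSION>", output, flags=re.DOTALL)
    return matches[-1].strip() if matches else output.strip()

# Example: only "B" is kept and passed on for answer extraction / judging.
print(keep_conclusion("...reasoning... <CONCLUSION> B </CONCLUSION>"))
```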

## How to Get Started with the Model

You can use the inference code for Llama-3.2-11B-Vision-Instruct.
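
For example, a sketch following the standard Llama-3.2-Vision `transformers` snippet; the repository ID below is a placeholder for this model's Hugging Face repo, and the image URL is just an arbitrary example:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "<this model's repo id>"  # placeholder: replace with this repository's ID

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any example image works here.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many animals are in this image?"},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sampling settings from the Reproduction section above.
output = model.generate(**inputs, do_sample=True, temperature=0.6, top_p=0.9, max_new_tokens=2048)
print(processor.decode(output[0], skip_special_tokens=True))
```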

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model is trained on the LLaVA-o1-100k dataset (to be released).

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model is finetuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes) with the following settings.
Using the same settings should accurately reproduce our results.

| Parameter                   | Value   |
|-----------------------------|---------|
| FSDP                        | enabled |
| lr                          | 1e-5    |
| num_epochs                  | 3       |
| batch_size_training         | 4       |
| use_fast_kernels            | True    |
| run_validation              | False   |
| batching_strategy           | padding |
| context_length              | 4096    |
| gradient_accumulation_steps | 1       |
| gradient_clipping           | False   |
| gradient_clipping_threshold | 1.0     |
| weight_decay                | 0.0     |
| gamma                       | 0.85    |
| seed                        | 42      |
| use_fp16                    | False   |
| mixed_precision             | True    |
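
As a rough sketch, these settings can be passed to llama-recipes' finetuning entry point as command-line overrides; the parameter names above appear to match llama-recipes training-config fields. The launcher line and script path below are placeholders that depend on the llama-recipes version, so check the repository's finetuning docs before running:

```python
# Sketch only: turn the table above into CLI overrides for llama-recipes.
overrides = {
    "enable_fsdp": True,  # "FSDP: enabled" in the table
    "lr": 1e-5,
    "num_epochs": 3,
    "batch_size_training": 4,
    "use_fast_kernels": True,
    "run_validation": False,
    "batching_strategy": "padding",
    "context_length": 4096,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": False,
    "gradient_clipping_threshold": 1.0,
    "weight_decay": 0.0,
    "gamma": 0.85,
    "seed": 42,
    "use_fp16": False,
    "mixed_precision": True,
}

flags = " ".join(f"--{name} {value}" for name, value in overrides.items())

# Placeholder command: the finetuning script path and launcher settings depend on
# the llama-recipes version; adjust nproc_per_node to your GPU count.
print(
    "torchrun --nnodes 1 --nproc_per_node 8 finetuning.py "
    "--model_name meta-llama/Llama-3.2-11B-Vision-Instruct " + flags
)
```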

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Like other VLMs, the model may generate biased or offensive content due to limitations in its training data.
On the technical side, the model's performance in aspects such as instruction following still falls short of leading industry models.