Reverb committed on
Commit 3ed7e0b · verified · 1 Parent(s): 65a9c2c

Update README.md

Files changed (1)
  1. README.md +34 -14
README.md CHANGED
@@ -29,20 +29,30 @@ tags:
 
  Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.
 
- # Model Summary
-
- - **Developed by:** Hugging Face
- - **Fine Tuned by:** Basel Anaya
- - **Model type:** Multi-modal model (image+text)
- - **Language(s) (NLP):** en
- - **License:** Apache 2.0
- - **Parent Models:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) and [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- - **Resources for more information:**
- - Description of [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS): [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- ](https://huggingface.co/papers/2306.16527)
- - Paper: [What matters when building vision-language models?
- ](https://huggingface.co/papers/2405.02246)
-
 
  # Technical summary
 
@@ -73,6 +83,16 @@ Idefics2 is trained in 2 stages for maximum efficiency. In a first stage, images
 
  We use DoRA to train the parameters initialized from pre-trained backbones and full fine-tuning for newly initialized parameters (modality connector), as we find this strategy to be more stable as well as more computationally efficient.
 
 
  # How to Get Started
 
 
 
  Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.
 
+ ## Model Information
+ - Base Model: [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)
+ - Dataset Used: [DocVQA dataset](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
+ - Introduced in Mathew et al. (2021)
+ - Consists of 50,000 questions defined on 12,000+ document images
+ - For further information, visit the [challenge page](https://rrc.cvc.uab.es/?ch=17) and [paper](https://arxiv.org/abs/2007.00398)
+
+ ## Training Details
+ - Training took approximately 38 hours on an A100 80GB GPU, and the model was fine-tuned using QLoRA.
+ - Trained on the 39.5k-example train split of [DocVQA single page questions](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
+ - Training Log:
+
+ | Epoch | Loss | Grad Norm | Learning Rate |
+ |-------|-------|-----------|---------------|
+ | 0.01 | 2.3776| 10.40 | 4.8e-05 |
+ | 0.25 | 0.5029| 6.10 | 9.5412e-05 |
+ | 0.50 | 0.434 | 5.74 | 7.5973e-05 |
+ | 0.75 | 0.4608| 7.46 | 7.3925e-05 |
+ | 1.0 | 0.3846| 4.77 | 5.0369e-05 |
+ | 1.25 | 0.3226| 3.63 | 4.9857e-05 |
+ | 1.5 | 0.3175| 5.03 | 2.5277e-05 |
+ | 1.75 | 0.2918| 5.63 | 2.5789e-05 |
+
+ {'train_runtime': 141781.6786, 'train_samples_per_second': 0.557, 'train_steps_per_second': 0.035, 'train_loss': 0.3973848872424526, 'epoch': 2.0}
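The summary line in the training log can be turned back into totals as a quick sanity check. The sketch below is illustrative only: the helper name `summarize_run` is made up for this example, and because the logged rates are rounded, the derived figures are estimates, not exact counts.

```python
# Summary dictionary copied from the training log.
train_log = {
    "train_runtime": 141781.6786,        # seconds
    "train_samples_per_second": 0.557,
    "train_steps_per_second": 0.035,
    "train_loss": 0.3973848872424526,
    "epoch": 2.0,
}


def summarize_run(log: dict) -> dict:
    """Derive approximate totals from a training summary dict (hypothetical helper)."""
    runtime_s = log["train_runtime"]
    samples_seen = runtime_s * log["train_samples_per_second"]
    steps_taken = runtime_s * log["train_steps_per_second"]
    return {
        "hours": runtime_s / 3600,
        "samples_seen": samples_seen,
        "samples_per_epoch": samples_seen / log["epoch"],
        "optimizer_steps": steps_taken,
        # Samples per optimizer step, i.e. the effective batch size
        # (estimated from two rounded rates, so only approximate).
        "effective_batch_size": samples_seen / steps_taken,
    }


if __name__ == "__main__":
    for key, value in summarize_run(train_log).items():
        print(f"{key}: {value:.2f}")
```

With the logged values, this works out to roughly 39.4 hours of wall-clock training and about 39.5k samples per epoch, which lines up with the stated size of the train split. The ratio of the two rates suggests an effective batch size near 16, though that figure is inferred and not stated in the README.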
 
  # Technical summary
 
 
 
  We use DoRA to train the parameters initialized from pre-trained backbones and full fine-tuning for newly initialized parameters (modality connector), as we find this strategy to be more stable as well as more computationally efficient.
 
+ # Vision Encoder Efficiency
+
+ Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
+
+ 1. **Deactivate image splitting**: To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the SFT model has been trained with image splitting.
+
+ 2. **Decrease maximum image resolution**: To do so, add `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adjusted to fit your needs (the default value is 980). We recommend using values that are multiples of 14. There are no changes required on the model side.
+
+ `do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
+
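The two memory-saving options described in this section could be wired up roughly as follows. This is a minimal sketch, not the author's code: `do_image_splitting`, `size`, and `AutoProcessor.from_pretrained` come from the text, while the helper names (`round_to_patch`, `build_processor_kwargs`, `load_processor`) are hypothetical, and the base model id is used as a stand-in because the fine-tuned checkpoint id is not given here. Rounding edges down to a multiple of 14 is a convenience added for this example, following the recommendation in the text.

```python
from typing import Any, Dict, Optional

PATCH_MULTIPLE = 14  # the README recommends edge sizes that are multiples of 14


def round_to_patch(value: int, multiple: int = PATCH_MULTIPLE) -> int:
    """Round down to the nearest positive multiple of `multiple`."""
    return max(multiple, (value // multiple) * multiple)


def build_processor_kwargs(
    image_splitting: bool = True,
    longest_edge: Optional[int] = None,
    shortest_edge: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect the processor arguments described in the section above."""
    kwargs: Dict[str, Any] = {}
    if not image_splitting:
        # Option 1: deactivate image splitting.
        kwargs["do_image_splitting"] = False
    if longest_edge is not None and shortest_edge is not None:
        # Option 2: decrease the maximum image resolution.
        kwargs["size"] = {
            "longest_edge": round_to_patch(longest_edge),
            "shortest_edge": round_to_patch(shortest_edge),
        }
    return kwargs


def load_processor(model_id: str = "HuggingFaceM4/idefics2-8b", **kwargs: Any):
    """Load the processor; the model id is an assumption (the base model)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(model_id, **kwargs)


if __name__ == "__main__":
    low_memory = build_processor_kwargs(
        image_splitting=False, longest_edge=448, shortest_edge=378
    )
    print(low_memory)
    # processor = load_processor(**low_memory)  # downloads processor files
```

Both options only affect the processor, so the model itself can be loaded exactly as before. For OCR-heavy inputs such as document images, it may be worth keeping image splitting enabled and lowering only the resolution.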
 
  # How to Get Started