Update README.md
README.md (CHANGED)

````diff
@@ -117,6 +117,7 @@ print(text_outputs)
 - **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
 - **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
 - **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
+- **Critic / Preference Learning Stage:** 9.4k question-image input from [LLaVA-RLHF](https://llava-rlhf.github.io/) with self-generated responses, reward signal from [llava-critic-72b](https://huggingface.co/lmms-lab/llava-critic-72b), iterative DPO for 3 rounds, full model
 - **Precision:** bfloat16

 ## Hardware & Software
@@ -131,4 +132,14 @@
 @article{li2024llavaonevision,
   title={LLaVA-OneVision},
 }
+
+@article{xiong2024llavacritic,
+  title={LLaVA-Critic: Learning to Evaluate Multimodal Models},
+  author={Xiong, Tianyi and Wang, Xiyao and Guo, Dong and Ye, Qinghao and Fan, Haoqi and Gu, Quanquan and Huang, Heng and Li, Chunyuan},
+  year={2024},
+  eprint={2410.02712},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2410.02712},
+}
 ```
````
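The new Critic / Preference Learning stage runs iterative DPO, with preference labels coming from llava-critic-72b as the reward signal. As a rough illustration only (not the repository's actual training code, and with the `beta` value and log-probabilities chosen as toy assumptions), the per-pair DPO objective can be sketched in plain Python:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    logp_* are sequence log-probabilities under the policy being trained;
    ref_logp_* are the same sequences scored by the frozen reference policy.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference does, scaled by beta.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return math.log1p(math.exp(-margin))

# Toy pair: the critic preferred the first response, and the policy already
# ranks it higher relative to the reference, so the loss is modest.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

In the iterative setup described above, each of the 3 rounds would regenerate responses, re-score them with the critic, and minimize this loss over the fresh preference pairs.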