SPHINX-V Model Card

Model type:

SPHINX-V is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.
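To make the multi-prompt input concrete, below is a minimal sketch (in Python) of how an image paired with several visual prompts of different types might be represented. The data structures and field names here are illustrative assumptions, not the repository's actual API; see the code link below for real usage.

# Hypothetical sketch: representing multi-type visual prompts for a
# SPHINX-V style model. All names are illustrative assumptions, not the
# official Draw-and-Understand interface.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class PointPrompt:
    xy: Tuple[float, float]  # (x, y) in pixel coordinates

@dataclass
class BoxPrompt:
    xyxy: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Several prompts of different types can be attached to one image at once.
visual_prompts: List[Union[PointPrompt, BoxPrompt]] = [
    PointPrompt(xy=(320.0, 180.0)),
    BoxPrompt(xyxy=(50.0, 60.0, 200.0, 240.0)),
]

request = {
    "image": "example.jpg",  # hypothetical local image path
    "visual_prompts": visual_prompts,
    "text": "Describe the object marked by each visual prompt.",
}
print(request)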

Paper or resources for more information:

Project Page: Draw-and-Understand
Paper: https://arxiv.org/abs/2403.20271
Code: https://github.com/AFeng-x/Draw-and-Understand
Dataset: MDVP-Data & MDVP-Bench

Intended use

Primary intended uses: The primary use of SPHINX-V is research on visual-prompting large multimodal models and chatbots.

Primary intended users: The model is primarily designed for researchers and enthusiasts in computer vision, natural language processing, and interactive artificial intelligence.

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Citations

@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}