YipengZhang
committed on
Update README.md
README.md
CHANGED
@@ -5,39 +5,36 @@ pipeline_tag: image-text-to-text
 
 <br>
 
-# LLaVA Model Card
+# LLaVA-UHD v2 Model Card
 
 ## Model details
 
 **Model type:**
-LLaVA
-
+LLaVA-UHD v2, an advanced MLLM built around a hierarchical window transformer that captures diverse visual granularity
+by constructing and integrating a high-resolution feature pyramid.
 
 **Model date:**
-LLaVA-
+LLaVA-UHD v2 was trained in November 2024.
 
 **Paper or resources for more information:**
-https://
+https://github.com/thunlp/LLaVA-UHD
 
 ## License
 Llama 2 is licensed under the LLAMA 2 Community License,
 Copyright (c) Meta Platforms, Inc. All Rights Reserved.
 
 **Where to send questions or comments about the model:**
-https://github.com/
+https://github.com/thunlp/LLaVA-UHD/issues
 
 ## Intended use
 **Primary intended uses:**
-The primary use of LLaVA is research on large multimodal models and chatbots.
+The primary use of LLaVA-UHD v2 is research on large multimodal models and chatbots.
 
 **Primary intended users:**
 The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
 
 ## Training dataset
 - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
--
--
--
-
-## Evaluation dataset
-A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.
+- JBU Pretrain: MS-COCO Stuff 2017
+- Pretrain: LLaVA-Pretrain 558K (filtered image-text pairs from LAION/CC/SBU, captioned by BLIP)
+- SFT: the 858k-mixed dataset at https://huggingface.co/datasets/YipengZhang/LLaVA-UHD-v2-SFT-Data
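The SFT data listed above is hosted on the Hugging Face Hub. Below is a minimal sketch for fetching it (and, optionally, the model weights) locally with `huggingface_hub`; the dataset id is the one given in the Training dataset list, while the model repo id `YipengZhang/LLaVA-UHD-v2` is an assumption inferred from this card and may differ.

```python
# Minimal sketch: download the SFT data and the checkpoint from the Hub.
# The dataset id comes from the Training dataset section above; the model repo id
# "YipengZhang/LLaVA-UHD-v2" is a hypothetical placeholder; adjust to the actual repo.
from huggingface_hub import snapshot_download

# SFT data: the 858k-mixed instruction-tuning set referenced in this card
data_dir = snapshot_download(
    repo_id="YipengZhang/LLaVA-UHD-v2-SFT-Data",
    repo_type="dataset",
)

# Model weights (hypothetical repo id)
model_dir = snapshot_download(repo_id="YipengZhang/LLaVA-UHD-v2")

print("SFT data:", data_dir)
print("Checkpoint:", model_dir)
```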