PrismCaptioner Model Card
Model details
PrismCaptioners are open-source captioners built on the LLaVA architecture and finetuned on the GPT4V-assisted dataset ALLaVA. We have released PrismCaptioner-7B and PrismCaptioner-2B.
PrismCaptioner-7B details
- Vision Backbone: google/siglip-so400m-patch14-384
- Language Backbone: internlm/internlm2-7b
- Dataset: 1x ALLaVA-Caption-[LAION/VFLAN]
For more information, please refer to the paper and codebase: [Paper] [Code]
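As an illustrative reference for the components listed above, here is a minimal sketch of loading the two backbones on their own with Hugging Face transformers. Note that this loads the raw backbones only, not the finetuned PrismCaptioner checkpoint, which is loaded through the Prism repo as shown under Model usage.

```python
# Sketch: load the listed backbones standalone (not the finetuned captioner itself).
from transformers import AutoModel, AutoProcessor, AutoModelForCausalLM, AutoTokenizer

# Vision backbone: SigLIP so400m, patch size 14, 384px input resolution.
vision = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Language backbone: InternLM2-7B (custom modeling code requires trust_remote_code).
llm = AutoModelForCausalLM.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
```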
Intended uses
- Perception Module: The model can be integrated into Prism as a perception module to solve vision-language tasks together with an external LLM (see the sketch after this list).
- Effective Captioner: The model can produce high-quality captions for given images.
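To make the perception-module use concrete, below is a minimal sketch of the decoupled, Prism-style flow. Here `captioner` stands for a PrismCaptioner instance loaded as in Model usage, while `query_llm` and `describe_then_reason` are hypothetical names for illustration, not APIs from the Prism repo.

```python
def describe_then_reason(captioner, query_llm, image_path: str, question: str) -> str:
    """Two-stage pipeline: perception (captioner) followed by reasoning (external LLM)."""
    # Stage 1: the captioner turns the image into a detailed textual description.
    caption = captioner.generate([image_path,
        'Given the image below, please provide a detailed description of what you see.'])
    # Stage 2: an external LLM answers the question using only the caption.
    prompt = f"Image description:\n{caption}\n\nQuestion: {question}\nAnswer:"
    return query_llm(prompt)
```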
Model usage
Clone the Prism repo and complete the preparation steps described there. You can then use PrismCaptioners as shown in the usage example below.
```python
# Run from the Prism repo root after completing the preparation.
from decouple import supported_VLM

model = supported_VLM['prismcaptioner-7b']()
# generate() takes [image_path, prompt] and returns the generated caption.
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
```
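PrismCaptioner-2B should load through the same interface; the registry key below is an assumption following the 7B naming and may differ, so check `supported_VLM` in the Prism repo.

```python
# Assumed key following the 7B naming; verify against supported_VLM in the repo.
model_2b = supported_VLM['prismcaptioner-2b']()
res_2b = model_2b.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
```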