	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: visual-question-answering
	library_name: transformers
	---

	<br>
	<br>

	# BLIVA Model Card

	## Model details

	Model type:
	BLIVA is an open-source Vision-Language model trained by initializing from InstructBLIP and aligning with Vicuna on multimodal instruction-finetuning data.
	It is composed of an EVA-CLIP vision encoder, a Q-Former, a projection layer, and an auto-regressive language model based on the decoder-only transformer architecture.
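
As a rough illustration of how the components listed above could fit together, here is a minimal shape-tracing sketch. It runs no model. The wiring (encoder, then Q-Former, then projection, then language model), the helper names, and all dimensions are assumptions made for illustration; they are not taken from the BLIVA code, and the actual model may combine visual features differently.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Shape of a token sequence: (number of tokens, embedding width)."""
    tokens: int
    dim: int


def vision_encoder(num_patches: int, vision_dim: int) -> Shape:
    # Stand-in for the EVA-CLIP encoder: one embedding per image patch.
    return Shape(num_patches, vision_dim)


def q_former(patches: Shape, num_queries: int, qformer_dim: int) -> Shape:
    # Stand-in for the Q-Former: a fixed set of learned queries attends to
    # the patch embeddings, so the output length does not depend on
    # patches.tokens.
    return Shape(num_queries, qformer_dim)


def project(features: Shape, lm_dim: int) -> Shape:
    # Stand-in for the projection layer: maps features to the language
    # model's embedding width without changing the token count.
    return Shape(features.tokens, lm_dim)


def trace_pipeline(num_patches: int, vision_dim: int, num_queries: int,
                   qformer_dim: int, lm_dim: int, text_tokens: int) -> Shape:
    # Assumed wiring: visual tokens are prepended to the text tokens.
    patches = vision_encoder(num_patches, vision_dim)
    queries = q_former(patches, num_queries, qformer_dim)
    visual = project(queries, lm_dim)
    return Shape(visual.tokens + text_tokens, lm_dim)


if __name__ == "__main__":
    # Hypothetical sizes chosen only to make the example concrete.
    print(trace_pipeline(num_patches=257, vision_dim=1408, num_queries=32,
                         qformer_dim=768, lm_dim=2048, text_tokens=10))
```

Under this assumed wiring, the number of visual tokens the language model sees is set by the Q-Former's query count rather than by the image resolution.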

	Model date:
	BLIVA_FlanT5 was trained in July 2023.

	Paper or resources for more information:
	https://gordonhu608.github.io/bliva/

	License:
	Apache 2.0 License

	Where to send questions or comments about the model:
	https://github.com/mlpc-ucsd/BLIVA

	## Intended use
	Primary intended uses:
	BLIVA FlanT5 is primarily intended for commercial applications of large multimodal models.

	Primary intended users:
	The primary intended users of this model are commercial companies working in computer vision, natural language processing, machine learning, and artificial intelligence.

	## Training dataset
	Pre-training data: 558K filtered image-text pairs from LAION, CC-3M, and SBU, selected by LLaVA.

	Instruction-finetuning data: COCO-Caption, TextCaps, VQAv2, OKVQA, A-OKVQA, LLaVA-150K, OCR-VQA.

	## Evaluation dataset
	For zero-shot evaluation on general image tasks, we selected Nocaps, Flickr30K, VizWiz, Visual Spatial Reasoning (VSR), IconQA, Visual Dialog, ScienceQA, MSRVTT QA, TextVQA, and Hateful Memes.

	For zero-shot evaluation on text-rich image OCR tasks, we selected ST-VQA, OCR-VQA, Text-VQA, and Doc-VQA.

	More details are available in our GitHub repository: https://github.com/mlpc-ucsd/BLIVA