PreFLMR model card

PreFLMR is an open-source model for multimodal knowledge retrieval. It is a transformer-based model that uses a combination of text and image inputs to retrieve relevant documents from a large corpus.

Model Details

PreFLMR_ViT-L_ENCN is based on PreFLMR_ViT-L, with the text encoder replaced by bge-m3. It is trained on both Chinese and English datasets.

Model Description

Model type: FLMRModelForRetrieval
Language(s) (NLP): English, Chinese
License: MIT License

Paper and resources for more detail

Blog Post for quick overview: https://www.jinghong-chen.net/preflmr-sota-open-sourced-multi/
Paper: https://arxiv.org/abs/2402.08327
Gradio Demo: https://u60544-b8d4-53eaa55d.westx.seetacloud.com:8443/
Repository: https://github.com/LinWeizheDragon/FLMR
Project Page: https://preflmr.github.io/

Uses

Direct Use

This model can be used directly to retrieve documents from a large corpus using a combination of text and image input queries. The retrieval usage can be found in the official implementation.
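
To illustrate the "late-interaction" scoring named in the paper title, here is a minimal pure-Python sketch, not the official implementation. It assumes ColBERT-style scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The function names `late_interaction_score` and `rank_documents`, and the toy 2-dimensional embeddings, are hypothetical; in the real model the embeddings come from the text and vision encoders.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query token embeddings, of the best-matching document token score.

    Assumes both lists are non-empty; max() raises ValueError on an empty document.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], corpus: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted from highest to lowest late-interaction score."""
    return sorted(
        corpus,
        key=lambda doc_id: late_interaction_score(query_vecs, corpus[doc_id]),
        reverse=True,
    )


# Toy example: two query token embeddings and a two-document corpus (made-up values).
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5]],
}
ranking = rank_documents(query, corpus)
```

In this toy case `doc_a` ranks first because each query token finds an exact match among its token embeddings. Retrieval over a large corpus would use an index rather than this exhaustive loop.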

Downstream Use

This model can be combined with language models to create a retrieval-augmented language model. Its use for knowledge-based VQA can be found in RAVQA.

How to Get Started with the Model

For details on training, indexing, and performing retrieval, please refer to the official implementation.

Training datasets

The model is pre-trained on three types of tasks with a total of nine datasets:

Image to Text retrieval: WIT, KVQA, and CC3M
Question to Text retrieval: MSMARCO
Image & Question to Text retrieval: LLaVA, OVEN, OKVQA, Infoseek and E-VQA

These datasets were converted to retrieval format. For details on the dataset split and conversion process, please refer to the paper "PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers". We will release the preprocessed datasets soon.

Evaluation datasets

We evaluate our models on WIT, LLaVA, OVEN, KVQA, IGLUE (subset of WIT), Infoseek, E-VQA, OKVQA and MSMARCO.

Model	Vision Encoder	Text Encoder	Checkpoint Name	No. Param.	WIT(EN)	WIT(CN)	LLaVA(EN)	LLaVA(CN)	OVEN(EN)	OVEN(CN)	KVQA(EN)	KVQA(CN)	IGLUE(EN)	Infoseek(EN)	Infoseek(CN)	EVQA(EN)	EVQA(CN)	OKVQA(EN)	OKVQA(CN)	MSMARCO(EN)	MSMARCO(CN)
PreFLMR	ViT-L	Base-v2	LinWeizheDragon/PreFLMR_ViT-L	543M	60.5	10.9	71.8	3.2	59.8	6.6	43.6	3.2	69.2	57.9	7.9	70.8	2.8	68.5	2.1	78.7	10.3
PreFLMR	ViT-L_ENCN	bge-m3	LinWeizheDragon/PreFLMR_ViT-L_ENCN	883M	60.8	83.4	71.11	58.93	60.8	58.83	41.05	37.27		41.91	39.70	57.97	46.64	13.87	13.32	82.6	82.33

For the evaluation metrics, WIT uses Recall@10, IGLUE uses Recall@1, and all other datasets use Recall@5.
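
As a rough illustration of how a Recall@K figure like those in the table can be computed, here is a minimal sketch. It assumes that a query counts as a hit when at least one relevant document appears among its top K retrieved results, and that the score is the percentage of queries that are hits. The function `recall_at_k` and the toy ids are hypothetical; the exact evaluation protocol for each dataset is described in the paper.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant document in the top k.

    ranked:   query id -> retrieved document ids, best first
    relevant: query id -> set of relevant document ids
    """
    if not ranked:
        raise ValueError("ranked must contain at least one query")
    hits = 0
    for query_id, doc_ids in ranked.items():
        positives = relevant.get(query_id, set())
        # Only the first k retrieved documents are considered.
        if any(doc_id in positives for doc_id in doc_ids[:k]):
            hits += 1
    return 100.0 * hits / len(ranked)


# Toy example with three queries (made-up ids, not taken from any dataset).
ranked = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}}

recall_1 = recall_at_k(ranked, relevant, 1)
recall_3 = recall_at_k(ranked, relevant, 3)
```

Here no query has a relevant document at rank 1, while two of the three queries have one within the top 3, so the score rises as K grows.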

Citation

BibTeX:

@article{Lin_Mei_Chen_Byrne_2024, 
        title={PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers}, 
        url={http://arxiv.org/abs/2402.08327}, 
        number={arXiv:2402.08327}, 
        publisher={arXiv}, 
        author={Lin, Weizhe and Mei, Jingbiao and Chen, Jinghong and Byrne, Bill}, 
        year={2024}}