---
library_name: transformers
tags:
- KBVQA
- Multimodal
- Retrieval
- Knowledge Retrieval
- RAG
- FLMR
- PreFLMR
- ColBERT
---
# Model Card for PreFLMR
## Model Details

### Model Description
This is the PreFLMR model, a fine-grained late-interaction multi-modal retriever introduced in [PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers](https://arxiv.org/abs/2402.08327).
- **Model type:** PreFLMR is an open-source model for general knowledge retrieval. It is a transformer-based model that uses a combination of text and image inputs to retrieve relevant documents from a large corpus (see the scoring sketch after this list).
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
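As a member of the ColBERT/FLMR family of late-interaction retrievers, PreFLMR scores a query against a document at the token level. A sketch of the standard MaxSim relevance score used by such retrievers (the notation here is illustrative, not taken verbatim from the paper):

```latex
s(Q, D) = \sum_{i=1}^{|Q|} \max_{1 \le j \le |D|} \mathbf{q}_i^{\top} \mathbf{d}_j
```

where q_i are the token-level query embeddings (covering both text and image tokens) and d_j are the token-level document embeddings.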
### Model Sources
- **Repository:** https://github.com/LinWeizheDragon/FLMR
- **Paper:** https://arxiv.org/abs/2402.08327
- **Demo:** http://region-3.seetacloud.com:38703/
- **Blog Post:** https://www.jinghong-chen.net/preflmr-sota-open-sourced-multi/
- **Project Page:** https://preflmr.github.io/
## Uses

### Direct Use
This model can be used directly to retrieve documents from a large corpus using a combination of text and image input queries. Usage instructions for retrieval can be found in the official implementation.
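For intuition, ranking with a late-interaction retriever reduces to computing the MaxSim score above for each candidate document. A minimal PyTorch sketch, assuming token-level embeddings have already been produced by the query and context encoders (the shapes and the `maxsim_scores` helper are illustrative assumptions, not the exact output format of the FLMR implementation):

```python
import torch

def maxsim_scores(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); doc_embs: (num_docs, num_doc_tokens, dim)."""
    # Token-level similarities: (num_docs, num_query_tokens, num_doc_tokens)
    sim = torch.einsum("qd,ntd->nqt", query_emb, doc_embs)
    # For each query token, keep its best-matching document token, then sum
    return sim.max(dim=-1).values.sum(dim=-1)

# Toy example: one query with 8 token embeddings, 4 documents with 16 tokens each
query_emb = torch.randn(8, 128)
doc_embs = torch.randn(4, 16, 128)
scores = maxsim_scores(query_emb, doc_embs)
print(scores.argsort(descending=True))  # document indices ranked by relevance
```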
### Downstream Use
This model can be combined with language models to create a retrieval-augmented language model. Usage instructions for knowledge-based VQA can be found at https://github.com/linweizhedragon/retrieval-augmented-visual-question-answering.
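A minimal sketch of the retrieval-augmented pattern: retrieve the top-k documents for a question, then prepend them to the language model's prompt. The `retrieve_top_k` helper and the generator model below are hypothetical placeholders, not part of this repository:

```python
from transformers import pipeline

def retrieve_top_k(question: str, k: int = 2) -> list:
    # Placeholder: in practice, rank a corpus with PreFLMR as sketched above
    return ["Paris is the capital of France.", "France is in Western Europe."][:k]

generator = pipeline("text-generation", model="gpt2")  # illustrative generator choice
question = "What is the capital of France?"
docs = retrieve_top_k(question)
prompt = "\n".join(docs) + f"\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=16)[0]["generated_text"])
```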
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoConfig, AutoModel, AutoImageProcessor, AutoTokenizer
import torch

checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-L"
image_processor_name = "openai/clip-vit-large-patch14"

# Load the query and context tokenizers shipped with the checkpoint
query_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer", trust_remote_code=True)
context_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer", trust_remote_code=True)

model = AutoModel.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
    trust_remote_code=True,
)
image_processor = AutoImageProcessor.from_pretrained(image_processor_name)

# Tokenize two multi-modal queries and their candidate documents
Q_encoding = query_tokenizer([
    "Using the provided image, obtain documents that address the subsequent question: What is the capital of France?",
    "Extract documents linked to the question provided in conjunction with the image: What is the capital of China?",
])
D_encoding = context_tokenizer([
    "Paris is the capital of France.",
    "Beijing is the capital of China.",
    "Paris is the capital of France.",
    "Beijing is the capital of China.",
])

# Dummy image inputs for demonstration; for real images, use
# image_processor(images, return_tensors="pt")["pixel_values"] instead
Q_pixel_values = torch.zeros(2, 3, 224, 224)

inputs = dict(
    query_input_ids=Q_encoding["input_ids"],
    query_attention_mask=Q_encoding["attention_mask"],
    query_pixel_values=Q_pixel_values,
    context_input_ids=D_encoding["input_ids"],
    context_attention_mask=D_encoding["attention_mask"],
    use_in_batch_negatives=True,
)

res = model.forward(**inputs)
print(res)
```
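With `use_in_batch_negatives=True`, each query is scored against every document in the batch, and the non-matching documents serve as negatives; the printed output should therefore include a contrastive in-batch loss alongside the retrieval scores (see the official implementation for the exact output fields).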
## Training datasets
The model is trained on a combination of eight image-text datasets and a text-only dataset.
## Citation

**BibTeX:**
```bibtex
@article{Lin_Mei_Chen_Byrne_2024,
  title={PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers},
  url={http://arxiv.org/abs/2402.08327},
  number={arXiv:2402.08327},
  publisher={arXiv},
  author={Lin, Weizhe and Mei, Jingbiao and Chen, Jinghong and Byrne, Bill},
  year={2024}
}
```