---
library_name: transformers
tags: [KBVQA, Multimodal, Retrieval, Knowledge Retrieval, RAG, FLMR, PreFLMR, ColBERT]
---
# Model Card for PreFLMR_ViT-G
## Model Details
### Model Description
This is the PreFLMR ViT-G model, a fine-grained late-interaction multi-modal retriever for knowledge-intensive tasks such as knowledge-based VQA (KBVQA).
- **Model type:** PreFLMR is an open-source model for general knowledge retrieval. It is a transformer-based model that uses a combination of text and image inputs to retrieve relevant documents from a large corpus.
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
### Model Sources
- **Repository:** https://github.com/LinWeizheDragon/FLMR
- **Paper:** https://arxiv.org/abs/2402.08327
- **Demo:** http://region-3.seetacloud.com:38703/
- **Blog Post:** https://www.jinghong-chen.net/preflmr-sota-open-sourced-multi/
- **Project Page:** https://preflmr.github.io/
## Uses
### Direct Use
This model can be used directly to retrieve documents from a large corpus given combined text and image queries. Retrieval usage is documented in the [official implementation](https://github.com/LinWeizheDragon/FLMR); a minimal sketch of the underlying scoring follows.
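For intuition, here is a minimal sketch of the late-interaction (MaxSim) scoring used by ColBERT-style retrievers such as PreFLMR. The function name and tensor shapes are illustrative only, not part of the model's API.
```python
import torch

def late_interaction_score(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> torch.Tensor:
    """MaxSim scoring for ColBERT-style late interaction.

    query_embeds: (num_query_tokens, dim) -- one multi-modal query
    doc_embeds:   (num_doc_tokens, dim)   -- one candidate document
    Both are assumed to be L2-normalised per token.
    """
    # Token-level similarity matrix: (num_query_tokens, num_doc_tokens)
    sim = query_embeds @ doc_embeds.T
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative shapes only: 32 query tokens, 180 document tokens, 128-dim embeddings.
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(late_interaction_score(q, d))
```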
### Downstream Use
This model can be combined with a language model to build a retrieval-augmented generation system, as sketched below. Usage for knowledge-based VQA can be found at https://github.com/linweizhedragon/retrieval-augmented-visual-question-answering
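As an illustration of the general pattern (not the RAVQA pipeline itself), the sketch below places retrieved passages into the prompt of a generator. The generator checkpoint and the `build_rag_prompt` helper are placeholders chosen for this example.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def build_rag_prompt(question: str, passages: list[str]) -> str:
    # Concatenate retrieved passages into a simple instruction-style prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer the question using the context.\nContext:\n{context}\nQuestion: {question}\nAnswer:"

gen_name = "google/flan-t5-base"  # placeholder generator; any seq2seq LM works
gen_tokenizer = AutoTokenizer.from_pretrained(gen_name)
generator = AutoModelForSeq2SeqLM.from_pretrained(gen_name)

passages = ["Paris is the capital of France."]  # e.g. the top passages returned by the retriever
prompt = build_rag_prompt("What is the capital of France?", passages)
outputs = generator.generate(**gen_tokenizer(prompt, return_tensors="pt"), max_new_tokens=20)
print(gen_tokenizer.decode(outputs[0], skip_special_tokens=True))
```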
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoConfig, AutoModel, AutoImageProcessor, AutoTokenizer
import torch
# This card is for the ViT-G variant; the image processor must match the checkpoint's
# vision encoder (assumed here to be the LAION CLIP ViT-bigG-14 model).
checkpoint_path = "LinWeizheDragon/PreFLMR_ViT-G"
image_processor_name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"

# Query and context (document) tokenizers are stored in subfolders of the checkpoint.
query_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="query_tokenizer", trust_remote_code=True)
context_tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, subfolder="context_tokenizer", trust_remote_code=True)

model = AutoModel.from_pretrained(
    checkpoint_path,
    query_tokenizer=query_tokenizer,
    context_tokenizer=context_tokenizer,
    trust_remote_code=True,
)
image_processor = AutoImageProcessor.from_pretrained(image_processor_name)

# Two example queries (retrieval instruction + question) ...
Q_encoding = query_tokenizer([
    "Using the provided image, obtain documents that address the subsequent question: What is the capital of France?",
    "Extract documents linked to the question provided in conjunction with the image: What is the capital of China?",
])
# ... and two candidate documents per query.
D_encoding = context_tokenizer([
    "Paris is the capital of France.", "Beijing is the capital of China.",
    "Paris is the capital of France.", "Beijing is the capital of China.",
])

# Placeholder image inputs: two all-zero 3x224x224 images. See below for preparing real images.
Q_pixel_values = torch.zeros(2, 3, 224, 224)

inputs = dict(
    query_input_ids=Q_encoding['input_ids'],
    query_attention_mask=Q_encoding['attention_mask'],
    query_pixel_values=Q_pixel_values,
    context_input_ids=D_encoding['input_ids'],
    context_attention_mask=D_encoding['attention_mask'],
    use_in_batch_negatives=True,
)

# Scores each query against the candidate documents (with in-batch negatives).
res = model.forward(**inputs)
print(res)
```
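The example above feeds all-zero pixel values as placeholders. A minimal sketch of preparing real query images with the loaded `image_processor` (the file names are illustrative):
```python
from PIL import Image

# Convert the query images into the pixel-value tensor expected by the model.
images = [Image.open("query_image_1.jpg").convert("RGB"),
          Image.open("query_image_2.jpg").convert("RGB")]
Q_pixel_values = image_processor(images=images, return_tensors="pt")["pixel_values"]
```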
## Training datasets
The model is trained on a combination of eight image-text datasets and a text-only dataset.
## Citation
**BibTeX:**
```
@article{Lin_Mei_Chen_Byrne_2024,
  title={PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers},
  url={http://arxiv.org/abs/2402.08327},
  number={arXiv:2402.08327},
  publisher={arXiv},
  author={Lin, Weizhe and Mei, Jingbiao and Chen, Jinghong and Byrne, Bill},
  year={2024}
}
```