Model Card for LoTLIP ViT-B/32
Model Details
Model Description
A LoTLIP ViT-B/32 model pre-trained on a 100M-scale dataset.
Direct Use
Zero-shot long text-image retrieval, short text-image retrieval, and image classification, among others.
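As a rough illustration of what zero-shot retrieval involves, the sketch below ranks texts for each image (and images for each text) by cosine similarity between embeddings. It does not load LoTLIP: the embedding vectors are toy placeholders standing in for encoder outputs, and the helper names (`l2_normalize`, `cosine_similarity_matrix`, `best_match`) are hypothetical, not part of the LoTLIP codebase.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity_matrix(image_embs, text_embs):
    """Return sim[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def best_match(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda j: scores[j])

# Toy placeholder embeddings (assumption: in practice these would come
# from the model's image encoder and text encoder).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

sim = cosine_similarity_matrix(image_embs, text_embs)

# Image-to-text: for each image, pick the most similar text.
i2t = [best_match(row) for row in sim]
# Text-to-image: for each text, pick the most similar image.
t2i = [best_match([sim[i][j] for i in range(len(image_embs))])
       for j in range(len(text_embs))]

print(i2t, t2i)
```

Zero-shot classification can be framed the same way, with one text embedding per class name and the best-matching text taken as the predicted label.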
How to Get Started with the Model
Use the code at https://github.com/wuw2019/LoTLIP to get started with the model.
Training Details
Training Data
The model is trained on a 100M-scale dataset that contains long text-image pairs.
Evaluation
Please refer to https://github.com/wuw2019/LoTLIP.
Testing Details
Testing Data
Testing is performed on DCI, IIW, and ShareGPT4V (SV-10k) for long text-image retrieval, and on ImageNet-1k for classification.
Results
Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I |
---|---|---|---|---|---|---|---|
LoTLIP-ViT-B-32 | 100M | 59.90 | 56.36 | 93.14 | 91.83 | 83.76 | 78.97 |
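
The I2T (image-to-text) and T2I (text-to-image) columns are retrieval scores. This card does not state the exact metric, so the sketch below assumes Recall@K over a similarity matrix in which image i is paired with text i. The function names and the toy matrix are illustrative only and are not taken from the LoTLIP evaluation code.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries (rows) whose paired item (same index) ranks in the top k."""
    hits = 0
    for q, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += q in top_k
    return hits / len(sim)

def transpose(matrix):
    """Swap rows and columns so texts become the queries."""
    return [list(col) for col in zip(*matrix)]

# Toy image-by-text similarity matrix (assumption: pair i matches pair i).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]

i2t_r1 = recall_at_k(sim, k=1)             # images query texts
t2i_r1 = recall_at_k(transpose(sim), k=1)  # texts query images
print(f"I2T R@1 = {100 * i2t_r1:.2f}, T2I R@1 = {100 * t2i_r1:.2f}")
```

For the evaluation protocol actually used to produce the numbers above, refer to the repository linked in the Evaluation section.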
Citation
BibTeX:
@inproceedings{LoTLIP,
title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
booktitle={arXiv},
year={2024}
}