Model Card for LoTLIP ViT-B/32

Model Details

Model Description

LoTLIP ViT-B/32 model pre-trained on a 100M-scale dataset.

Direct Use

Zero-shot long text-image retrieval, short text-image retrieval, and image classification, among others.
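The zero-shot uses listed above share one mechanism: embed the image and the candidate texts, then rank the texts by cosine similarity. The following is a minimal sketch of that ranking step only, not the LoTLIP loading or encoding API. The toy 3-d vectors and the function names (`l2_normalize`, `cosine_scores`, `rank_candidates`) are hypothetical stand-ins for real encoder outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_scores(query_emb, candidate_embs):
    """Cosine similarity between one query embedding and each candidate embedding."""
    q = l2_normalize(query_emb)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidate_embs]

def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = cosine_scores(query_emb, candidate_embs)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy embeddings (hypothetical values); in practice these would come from
# the model's image encoder and text encoder.
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # e.g. embedding of one caption or class prompt
    [1.0, 0.2, 0.0],  # e.g. embedding of a second caption or class prompt
    [0.0, 0.0, 1.0],  # e.g. embedding of a third caption or class prompt
]

# Retrieval: the full ranking of texts for this image.
ranking = rank_candidates(image_emb, text_embs)
# Classification: when the texts are class prompts, the top-ranked one is the prediction.
predicted_class = ranking[0]
```

For retrieval the whole ranking is used; for classification only the top-ranked prompt matters. Swapping the roles of image and text embeddings gives text-to-image retrieval.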

How to Get Started with the Model

Code for getting started with the model is available at https://github.com/wuw2019/LoTLIP.

Training Details

Training Data

The model is trained on a 100M-scale dataset that contains long text-image pairs.

Evaluation

For evaluation details, please refer to https://github.com/wuw2019/LoTLIP.

Testing Details

Testing Data

Testing uses DCI, IIW, and ShareGPT4V for long text-image retrieval, and ImageNet1k for classification.

Results

| Model | Pre-training Data Scale | DCI I2T | DCI T2I | IIW I2T | IIW T2I | SV-10k I2T | SV-10k T2I |
|---|---|---|---|---|---|---|---|
| LoTLIP-ViT-B-32 | 100M | 59.90 | 56.36 | 93.14 | 91.83 | 83.76 | 78.97 |
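The card does not state which metric the I2T (image-to-text) and T2I (text-to-image) columns report. Assuming they are Recall@K percentages, as is common for retrieval benchmarks, the sketch below shows how such scores could be computed from an image-by-text similarity matrix. It further assumes one caption per image, with the correct match for image `i` being text `i`. The function names and the toy matrix are hypothetical and are not taken from the LoTLIP evaluation code.

```python
def recall_at_k(sim, k=1):
    """Percentage of queries whose correct candidate (same index) is in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(sim)

def transpose(sim):
    """Swap query and candidate roles (image-to-text becomes text-to-image)."""
    return [list(col) for col in zip(*sim)]

# Toy similarity matrix (hypothetical values): rows are images, columns are texts.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]

i2t = recall_at_k(sim, k=1)             # images as queries, texts as candidates
t2i = recall_at_k(transpose(sim), k=1)  # texts as queries, images as candidates
```

In this toy case the third image ranks the first text highest, so I2T Recall@1 is two out of three, while every text retrieves its own image and T2I Recall@1 is 100.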

Citation

BibTeX:

@inproceedings{LoTLIP,
  title={LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
  author={Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
  booktitle={arXiv},
  year={2024}
}