Unmasked Token Alignment (UTA) for Efficient Visual-Language Representation Learning

This repository provides the inference code for our TMLR paper "Enhancing Vision-Language Model with Unmasked Token Alignment".

Abstract:

Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks.
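To make the objective concrete, the sketch below shows one way the unmasked token alignment loss could be written in PyTorch. It is illustrative only, not the repository's implementation: student_vit (a trainable ViT that encodes only the kept patch tokens) and frozen_clip_visual (a frozen CLIP vision encoder returning per-token features of matching dimensionality) are hypothetical callables.

    import torch
    import torch.nn.functional as F

    def uta_alignment_loss(student_vit, frozen_clip_visual, images, keep_ratio=0.5):
        # Target token features come from the frozen CLIP vision encoder
        # applied to the full (unmasked) image.
        with torch.no_grad():
            target_tokens = frozen_clip_visual(images)     # (B, N, D)

        # Randomly choose which patch tokens the student gets to see.
        B, N, _ = target_tokens.shape
        num_keep = int(N * keep_ratio)
        keep_idx = torch.rand(B, N, device=images.device).argsort(dim=1)[:, :num_keep]

        # The student processes only the kept tokens -- no [MASK] tokens are
        # inserted, which avoids the training-finetuning inconsistency.
        student_tokens = student_vit(images, keep_idx)     # (B, num_keep, D)
        target_kept = torch.gather(
            target_tokens, 1,
            keep_idx.unsqueeze(-1).expand(-1, -1, target_tokens.size(-1)),
        )

        # Align each unmasked student token with its frozen CLIP counterpart.
        sim = F.cosine_similarity(student_tokens, target_kept, dim=-1)
        return (1.0 - sim).mean()

Because the student is trained to match the frozen CLIP vision encoder token by token, it stays compatible with the CLIP text encoder, which is what enables zero-shot evaluation without any training on image-text pairs.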

Models:

We release three pre-trained models:

Model          Zero-shot Accuracy (ImageNet-1K)  Link
UTA-B          77.0%                             weights
UTA-L-pix336   81.4%                             weights
UTA-g-pix336   83.9%                             weights

Getting Started:

  1. Clone this repository:

    git clone https://huggingface.co/jjjjh/UTA
    cd UTA
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download the pre-trained models: You can download each model using the "weights" links in the Models table above.

  4. Run inference: The inference code is provided in imagenet_zeroshot_eval.py. Use the following command to run the ImageNet zero-shot evaluation:

    python imagenet_zeroshot_eval.py --imagenet-path [path to imagenet] --model [model name] --ckpt-path [path to checkpoint]
    
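For example, assuming the UTA-B checkpoint has been downloaded to ./UTA-B.pth (the paths and the exact model-name string are placeholders; see the script for the accepted values):

    python imagenet_zeroshot_eval.py --imagenet-path /data/imagenet --model UTA-B --ckpt-path ./UTA-B.pth

For reference, the standard CLIP-style zero-shot evaluation that such a script performs is sketched below. This is illustrative, not the actual code in imagenet_zeroshot_eval.py; image_encoder, text_encoder, and tokenize are placeholder callables.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_accuracy(image_encoder, text_encoder, tokenize, loader, class_names):
        # Embed one prompt per class with the CLIP text encoder.
        prompts = tokenize([f"a photo of a {name}." for name in class_names])
        text_feats = F.normalize(text_encoder(prompts), dim=-1)      # (C, D)

        correct = total = 0
        for images, labels in loader:
            img_feats = F.normalize(image_encoder(images), dim=-1)   # (B, D)
            # Each image is assigned the class whose text embedding is most
            # similar (cosine similarity on the normalized features).
            preds = (img_feats @ text_feats.t()).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
        return correct / total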

Citation:

If you find this work helpful, please cite our paper:

@article{liu2024enhancing,
    title={Enhancing Vision-Language Model with Unmasked Token Alignment},
    author={Jihao Liu and Jinliang Zheng and Boxiao Liu and Yu Liu and Hongsheng Li},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2024},
    url={https://openreview.net/forum?id=JkFEVbW6wE}
}

Contributing:

Contributions to this repository are welcome. Please feel free to open an issue or submit a pull request.

Contact:

If you have any questions or suggestions, please feel free to contact Jihao Liu via email.
