---
license: apache-2.0
datasets:
- nlphuji/flickr30k
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: zero-shot-image-classification
---

# Model Card

Fine-tune HF's ModernBERT-base as a text encoder for Contrastive Language-Image Pretraining (CLIP)!
Use natural language to search for images.
# How to Get Started

To use a pretrained model to search through a directory of images, see demo.py. For training, see train.py.
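If you just want the retrieval idea in isolation, here is a minimal sketch of a text-to-image search over precomputed image embeddings. The function and variable names are hypothetical and the 512-d shapes assume the projection space described below; see demo.py for the actual interface.

```python
# Hypothetical sketch: rank precomputed, L2-normalized image embeddings
# against an embedded text query. Names here are illustrative, not demo.py's API.
import numpy as np

def top_k_images(query_embedding: np.ndarray,    # (512,) embedded text query
                 image_embeddings: np.ndarray,   # (num_images, 512), L2-normalized
                 image_paths: list[str],
                 k: int = 5) -> list[str]:
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = image_embeddings @ query            # cosine similarities
    best = np.argsort(-scores)[:k]               # indices of the k best matches
    return [image_paths[i] for i in best]
```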
# Model Details

**Text encoder:** ModernBERT-base
https://huggingface.co/answerdotai/ModernBERT-base

**Vision encoder:** Idefics3 variant extracted from HF's SmolVLM!
https://huggingface.co/blog/smolvlm
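For reference, the text-encoder backbone loads with the standard transformers API (ModernBERT needs a recent transformers release, 4.48 or later). The mean pooling below is an assumption; check train.py for the pooling actually used ahead of the projection head.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
text_encoder = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

captions = ["a dog catching a frisbee", "two people hiking up a snowy ridge"]
batch = tokenizer(captions, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = text_encoder(**batch).last_hidden_state          # (B, T, 768)

# Mask-aware mean pooling over tokens -> one 768-d vector per caption.
mask = batch["attention_mask"].unsqueeze(-1).float()
text_features = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, 768)
```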
# Model Description

ModernBERT-base-CLIP is a multimodal model for Contrastive Language-Image Pretraining (CLIP), designed to align text and image representations in a shared embedding space. It leverages a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (extracted from SmolVLM) to generate embeddings, which are projected into a 512-dimensional space using linear layers. The model enables natural-language image retrieval and zero-shot classification by optimizing a contrastive loss, which maximizes the similarity between matching text-image pairs while minimizing the similarity for non-matching pairs. Training was conducted on the Flickr30k dataset, with one-shot evaluation performed on COCO images (... or your own!) using the demo.py script.
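A minimal sketch of the dual-encoder head described above, assuming mean pooling of the ModernBERT hidden states and a 1152-dimensional frozen vision feature (both assumptions; the actual dimensions, pooling, and temperature live in train.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModernBertClipHead(nn.Module):
    """Illustrative dual-encoder head: project pooled text features and
    precomputed (frozen) image features into a shared 512-d space."""

    def __init__(self, text_encoder, text_dim=768, image_dim=1152, embed_dim=512):
        super().__init__()
        self.text_encoder = text_encoder                       # fine-tuned ModernBERT-base
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable temperature (~ ln(1/0.07))

    def forward(self, input_ids, attention_mask, image_features):
        hidden = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        text_features = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

        text_emb = F.normalize(self.text_proj(text_features), dim=-1)
        image_emb = F.normalize(self.image_proj(image_features), dim=-1)

        # (batch, batch) cosine-similarity logits: captions x images.
        return self.logit_scale.exp() * text_emb @ image_emb.T
```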
# Datasets

flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k (training)

COCO captions: https://cocodataset.org/#captions-2015 (demo)
# Training Procedure

Vision embeddings are precomputed and stored as .npy files. The model is trained using the InfoNCE contrastive loss, which encourages positive pairs, i.e. matching text and image embeddings, to be closer in the shared embedding space while pushing negative pairs apart (a minimal sketch of this loss appears at the end of this card).

# Hardware

NVIDIA 3080 Ti
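For reference, a minimal sketch of the symmetric InfoNCE objective described under Training Procedure, applied to the (batch, batch) similarity logits from the dual-encoder sketch above; train.py holds the actual implementation, which may differ in details.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(logits: torch.Tensor) -> torch.Tensor:
    """Symmetric CLIP-style InfoNCE loss over a (batch, batch) similarity
    matrix whose diagonal holds the matching text-image pairs."""
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_text_to_image = F.cross_entropy(logits, targets)     # each caption vs. all images
    loss_image_to_text = F.cross_entropy(logits.T, targets)   # each image vs. all captions
    return (loss_text_to_image + loss_image_to_text) / 2
```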