metadata

license: mit
tags:
  - vision
  - language
  - fashion
  - ecommerce

Model Card: Fashion CLIP

Disclaimer: The model card adapts the model card from here.

Model Details

FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained checkpoint (ViT-B/32) released by OpenAI, we train FashionCLIP on a large, high-quality novel fashion dataset to study whether domain specific fine-tuning of CLIP-like models is sufficient to produce product representations that are zero-shot transferable to entirely new datasets and tassks. FashionCLIP was not developed for model deplyoment - to do so, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.

Model Date

March 2023

Model Type

The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained, starting from a pre-trained checkpoint, to maximize the similarity of (image, text) pairs via a contrastive loss on a fashion dataset containing 800K products.

Documents

Data

The model was trained on (image, text) pairs obtained from the Farfecth dataset[^1 Awaiting official release.], an English dataset comprising over 800K fashion products, with more than 3K brands across dozens of object types. The image used for encoding is the standard product image, which is a picture of the item over a white background, with no humans. The text used is a concatenation of the highlight (e.g., “stripes”, “long sleeves”, “Armani”) and short description (“80s styled t-shirt”)) available in the Farfetch dataset.

Limitations, Bias and Fiarness

We acknowledge certain limitations of FashionCLIP and expect that it inherits certain limitations and biases present in the original CLIP model. We do not expect our fine tuning to significantly augment these limitation: we acknowledge that the fashion data we use makes explicit assumptions about the notion of gender as in "blue shoes for woman" that inevitably to associate aspects of clothing to specific people.

Our investingations also suggests that the data used introduces certain limitaions in FashionCLIP. From the textual modality, given that most captions dervied from the Farfetch dataset are long, we observe that FashionCLIP maybe more performant in longer queries than shorter ones. From the image modality, FashionCLIP is also biased towards standard product images (centered, white background).

Model selection, i.e. selecting an appropariate stopping critera during fine-tuning, remains an open challenge. We observed that using loss on an in-domain (i.e. same distribution as test) validation dataset is a poor selection critera when out-of-domain generalization (i.e. across different datasets) is desired, even when the dataset used is relatively diverse and large.