---
language:
- en
license: cc-by-nc-4.0
---
## Model Details

CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness via counterfactual text-guided contrastive language-image pretraining.
### Model Date

January 2024

### Model Type

The model uses a ViT-L/14 transformer as its image encoder and a causal text transformer as its text encoder.
Both encoders were initialized with the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training.
The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high while the similarity of in-batch negatives and hard negatives is low.
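A minimal sketch of this objective is given below; it is not the authors' training code. Treating the hard negatives as extra text candidates concatenated to the batch is an assumption based on the description above, and the temperature value of 0.05 comes from the "Relevant factors" section of this card.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_embeds, text_embeds, hard_negative_embeds, temperature=0.05):
    """Illustrative InfoNCE-style loss: each image is pulled toward its paired text
    and pushed away from in-batch negative texts and hard-negative texts."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    # Candidate texts: the batch texts (positives on the diagonal) plus hard negatives.
    candidates = F.normalize(torch.cat([text_embeds, hard_negative_embeds], dim=0), dim=-1)

    # Cosine-similarity logits scaled by the temperature (0.05 in this card).
    logits = image_embeds @ candidates.t() / temperature

    # The i-th image's positive text is the i-th candidate.
    targets = torch.arange(image_embeds.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```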
Input: image and text

Output: image and text representations
## Uses

### Use with Transformers

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the CFT-CLIP processor and model from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("humane-lab/CFT-CLIP")
model = AutoModel.from_pretrained("humane-lab/CFT-CLIP")

# Preprocess an image-text pair.
image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

# Projected text and image representations.
outputs = model(**inputs)
text_embeds = outputs.text_embeds
image_embeds = outputs.image_embeds
```
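Continuing the example above, the two representations can be compared with cosine similarity to score how well the thumbnail represents the text; this scoring step is a usage sketch rather than part of the original example.

```python
import torch.nn.functional as F

# Cosine similarity between the image and text representations from the example above;
# a higher score indicates that the thumbnail represents the text more closely.
score = F.cosine_similarity(image_embeds, text_embeds)
print(score)
```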
### Intended Use

The model is intended as a research output for research communities.

### Primary intended uses

The primary intended users of these models are AI researchers.

### Out-of-Scope Use Cases

The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.
## Factors

### Relevant factors

We trained the models with the AdamW optimizer with an initial learning rate of 1e-4, updated by a cosine annealing scheduler.
The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set.
Model training was early-stopped when the validation loss did not decrease for five consecutive checks, measured every 20 iterations.
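A minimal sketch of this optimization setup in PyTorch is shown below; `model`, `train_loader`, `valid_loader`, `total_steps`, and the `training_step`/`validation_loss` helpers are placeholders, while the optimizer, learning rate, scheduler, validation interval, and early-stopping rule follow the description above.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # initial LR of 1e-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

best_loss, bad_checks = float("inf"), 0
for step, batch in enumerate(train_loader, start=1):  # loader with minibatch size 128
    loss = training_step(model, batch)  # placeholder: forward pass + contrastive loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

    if step % 20 == 0:  # check validation loss every 20 iterations
        val_loss = validation_loss(model, valid_loader)  # placeholder helper
        if val_loss < best_loss:
            best_loss, bad_checks = val_loss, 0
        else:
            bad_checks += 1
        if bad_checks == 5:  # stop after five consecutive non-improving checks
            break
```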
### Evaluation factors

We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24); the decision threshold was optimized on the validation set.

## Metrics

Model performance measures: the F1-score between model predictions and labels, and the Spearman correlation between the models' cosine similarities and the labels.

Decision thresholds: chosen from cosine similarities on the validation set.

Approaches to uncertainty and variability: measured by repeating the run with 5 different random seeds.
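A sketch of how these measures can be computed from cosine-similarity scores is shown below, assuming scikit-learn and SciPy; the exhaustive threshold search is one way to realize a validation-based threshold choice and is not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score


def pick_threshold(val_sims, val_labels):
    """Choose the cosine-similarity cutoff that maximizes F1 on the validation set."""
    candidates = np.unique(val_sims)
    scores = [f1_score(val_labels, val_sims >= t) for t in candidates]
    return candidates[int(np.argmax(scores))]


def evaluate(test_sims, test_labels, threshold):
    """F1 of thresholded predictions plus Spearman correlation of the raw similarities."""
    preds = (test_sims >= threshold).astype(int)
    rho, _ = spearmanr(test_sims, test_labels)
    return {"f1": f1_score(test_labels, preds), "spearman": rho}
```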
## Data

### Training Data

The model was trained on the summary text paired with the thumbnail image (the image appearing in the first paragraph of the article) from the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/).
The original implementation had two variants: one using [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other using titles instead of summary text from the BBC Dataset.
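Purely as an illustration of what one pretraining example looks like, a pair could be represented as below; the class and field names are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass

from PIL import Image


@dataclass
class PretrainingPair:
    """One (thumbnail, text) pretraining example."""
    image: Image.Image  # the thumbnail, i.e., the image from the article's first paragraph
    text: str           # the article summary (or the title, in the title variant)
```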
### Evaluation Data

For NELA-GT-2021, annotation was performed on 1,000 samples randomly drawn from 10,000 samples not included in the train and validation sets.
For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).
## Evaluation

For comparison, we measured the ability of pretrained vision-language models. In addition to CLIP, we used BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.

|Model|F1|Spearman|
|---|---|---|
|CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
|CLIPAdapt|0.767±0.006|0.459±0.004|
|CLIP|0.763|0.409|
|BLIP|0.737|0.408|
|BLIP-2|0.707|0.415|
|BLIP-2+SBERT|0.694|0.341|
## Ethical Considerations

For pretraining, this study used publicly available news articles shared by news media.
While we tried to build a high-quality corpus for pretraining, it is possible that the model learned hidden biases present in online news.
Also, since CFT-CLIP was updated from pretrained CLIP weights, it may inherit the biases of CLIP.
Users should be cautious about applying the method to problems in a general context and be aware of potential biases.