---
language:
- en
license: cc-by-nc-4.0
---
## Model Details
CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness through counterfactual text-guided contrastive language-image pretraining.
### Model Date
January 2024
### Model Type
The model uses a ViT-L/14 transformer architecture as an image encoder and a causal text transformer as a text encoder.
Both encoders were initialized from the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training.
The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high and the similarity of in-batch negatives and hard negatives is low.
Input: image and text
Output: image and text representations
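The training objective described above can be illustrated with a minimal, dependency-free sketch. It assumes an InfoNCE-style image-to-text loss in which each image's positive text competes against the other texts in the batch plus one hard negative text per image. The exact loss formulation, the function names, and the toy vectors are assumptions made for illustration, not the released training code; the temperature default of 0.05 is the value reported in this card.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(image_embeds, text_embeds, hard_neg_embeds, tau=0.05):
    """InfoNCE-style loss over in-batch negatives plus one hard negative per image.

    Assumption: image i is paired with text i, and hard_neg_embeds[i] is a
    hard negative text for image i.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    hards = [l2_normalize(v) for v in hard_neg_embeds]

    total = 0.0
    for i, img in enumerate(images):
        # Logits: similarity to every text in the batch, then to the hard negative
        logits = [dot(img, t) / tau for t in texts]
        logits.append(dot(img, hards[i]) / tau)
        # Numerically stable log-sum-exp over all candidates
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(images)

# Toy batch of two (image, text) pairs with 2-d embeddings
image_embeds = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.5, 0.5]]

aligned = contrastive_loss(image_embeds, text_embeds, hard_negatives)
swapped = contrastive_loss(image_embeds, text_embeds[::-1], hard_negatives)
print(f"aligned: {aligned:.4f}, swapped: {swapped:.4f}")
```

In this toy batch the correctly paired texts yield a much lower loss than the swapped ones, which is the behavior the objective rewards.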
## Uses
### Use with Transformers
```python
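import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the processor and model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("humane-lab/cft-clip")
model = AutoModel.from_pretrained("humane-lab/cft-clip")

# Load an image and pair it with a candidate text
image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt", padding=True)

# Encode both inputs without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

text_embeds = outputs.text_embeds    # text representation
image_embeds = outputs.image_embeds  # image representation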
```
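To compare a thumbnail with a text, a common choice is the cosine similarity between the two representations, which is also the score referred to in the evaluation sections of this card. Below is a minimal sketch that works on plain Python lists. The helper name `cosine_similarity` and the toy vectors are hypothetical; with the snippet above, the inputs could be obtained with, for example, `image_embeds[0].tolist()` and `text_embeds[0].tolist()`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real image and text embeddings
image_vec = [0.2, 0.8, 0.1]
text_vec = [0.25, 0.75, 0.0]
score = cosine_similarity(image_vec, text_vec)
print(f"similarity: {score:.3f}")
```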
### Intended Use
The model is intended as a research output for research communities.
### Primary intended uses
The primary intended users of this model are AI researchers.
### Out-of-Scope Use Cases
The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.
## Factors
### Relevant factors
We trained the model using the AdamW optimizer with an initial learning rate of 1e-4, which was decayed by a cosine annealing scheduler.
The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set.
Training was stopped early when the validation loss did not decrease for five consecutive checks, with the validation loss measured every 20 iterations.
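As a rough illustration of the schedule, the sketch below evaluates the standard closed-form cosine annealing curve starting from the learning rate of 1e-4. The total number of steps and the minimum learning rate are not stated in this card, so `total_steps` and `min_lr` are assumptions; in PyTorch, a comparable schedule is available as `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under cosine annealing from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical schedule length
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_annealing_lr(step, total_steps):.2e}")
```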
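The early-stopping rule can be sketched as follows. This is a minimal illustration that assumes "not decreased" means "no improvement over the best validation loss seen so far"; the class name, the loop, and the loss values are hypothetical.

```python
class EarlyStopping:
    """Stop when validation loss fails to improve `patience` checks in a row."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

EVAL_EVERY = 20  # validate every 20 training iterations

# Hypothetical validation losses, one per validation check
val_losses = [0.90, 0.80, 0.75, 0.76, 0.77, 0.75, 0.78, 0.79, 0.74]

stopper = EarlyStopping(patience=5)
stopped_at = None
for check, loss in enumerate(val_losses, start=1):
    iteration = check * EVAL_EVERY
    if stopper.should_stop(loss):
        stopped_at = iteration
        break
print("stopped at iteration:", stopped_at)
```

With these made-up losses, training stops at the eighth validation check, before the final (lower) value is ever observed, which shows the usual trade-off of a fixed patience.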
### Evaluation factors
We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24). The decision threshold was optimized on the validation set.
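A threshold-based evaluation of this kind can be sketched as follows: pick the cosine-similarity cutoff that maximizes F1 on validation data, then apply that fixed cutoff to the test data. The helper names, the candidate-threshold strategy, and all scores and labels below are hypothetical and only illustrate the procedure.

```python
def f1_score(labels, predictions):
    """Binary F1 with 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def predict(scores, threshold):
    """Label a pair as positive when its similarity reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def best_threshold(val_scores, val_labels):
    """Pick the candidate threshold with the highest validation F1."""
    candidates = sorted(set(val_scores))
    return max(candidates, key=lambda t: f1_score(val_labels, predict(val_scores, t)))

# Hypothetical cosine similarities and binary labels (1 = representative)
val_scores = [0.31, 0.12, 0.27, 0.08, 0.22, 0.15]
val_labels = [1, 0, 1, 0, 1, 0]
threshold = best_threshold(val_scores, val_labels)

test_scores = [0.29, 0.10, 0.18, 0.25]
test_labels = [1, 0, 0, 1]
test_f1 = f1_score(test_labels, predict(test_scores, threshold))
print(f"threshold={threshold}, test F1={test_f1:.3f}")
```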
## Metrics
Model performance measures: F1 score between model predictions and labels, and Spearman correlation between the model's cosine similarity and the labels.
Decision thresholds: Selected on the validation set based on cosine similarity.
Approaches to uncertainty and variability: Measured by repeating the experiment with five different random seeds.
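For reference, the Spearman measure is the Pearson correlation computed on ranks. The sketch below implements it with the standard library only, using average ranks for ties (which matters here because binary labels are heavily tied). The function names and the data are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` is a common alternative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities and binary labels
similarities = [0.31, 0.12, 0.27, 0.08, 0.22, 0.15]
labels = [1, 0, 1, 0, 1, 0]
rho = spearman(similarities, labels)
print(f"Spearman: {rho:.3f}")
```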
## Data
### Training Data
The model was trained using the summary text and thumbnail image for the image in the first paragraph of the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/).
The original implementation had two variants: one trained on [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other trained on titles instead of summary text from the BBC dataset.
### Evaluation Data
For annotation, 1,000 samples were randomly drawn from 10,000 NELA-GT-2021 samples that were not included in the training and validation sets.
For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).
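The sampling step described in this section can be reproduced in spirit with a seeded random draw without replacement. This is only a sketch: the ID list stands in for the 10,000 held-out samples, and the seed value is an arbitrary assumption, not the one used for the released data.

```python
import random

# Hypothetical IDs for the 10,000 held-out samples
held_out_ids = list(range(10_000))

rng = random.Random(42)  # fixed seed (an arbitrary choice) for reproducibility
annotation_ids = rng.sample(held_out_ids, k=1_000)  # draw without replacement
print(len(annotation_ids), annotation_ids[:5])
```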
## Evaluation
We measured the ability of pretrained vision-language models to assess news thumbnail representativeness. In addition to CLIP, we evaluated BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.
|Model|F1|Spearman|
|---|---|---|
|CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
|CLIPAdapt|0.767±0.006|0.459±0.004|
|CLIP|0.763|0.409|
|BLIP|0.737|0.408|
|BLIP-2|0.707|0.415|
|BLIP-2+SBERT|0.694|0.341|
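
The spread reported next to the CFT-CLIP and CLIPAdapt scores presumably comes from the repeated runs with different random seeds mentioned under Metrics. A minimal sketch of such an aggregation is shown below; the per-seed scores are made up, and the use of the sample standard deviation is an assumption, since this card does not state which spread statistic was used.

```python
import statistics

# Hypothetical F1 scores from five runs with different random seeds
f1_by_seed = [0.70, 0.72, 0.71, 0.69, 0.73]

mean_f1 = statistics.mean(f1_by_seed)
std_f1 = statistics.stdev(f1_by_seed)  # sample standard deviation (assumption)
print(f"{mean_f1:.3f}±{std_f1:.3f}")
```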
## Ethical Considerations
For pretraining, this study used publicly available news articles shared by news media.
While we tried to have a high-quality corpus for pretraining, it is possible that the model learned hidden biases in online news.
Also, since CFT-CLIP was updated from the pretrained CLIP weights, it may inherit the biases of CLIP.
Users should be cautious about applying the method to problems in a general context and be aware of potential biases.