---
language:
- en
license: cc-by-nc-4.0
---
## Model Details

CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness via counterfactual text-guided contrastive language-image pretraining.
### Model Date

January 2024

### Model Type

The model uses a ViT-L/14 transformer as its image encoder and a causal text transformer as its text encoder.
Both encoders were initialized with the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training.
The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high while the similarity of in-batch negatives and hard negatives is low.
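A minimal sketch of this objective is given below; it is not the authors' training code. Treating the hard negatives as extra text candidates concatenated to the batch is an assumption based on the description above, and the temperature value of 0.05 comes from the "Relevant factors" section of this card.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_embeds, text_embeds, hard_negative_embeds, temperature=0.05):
    """Illustrative InfoNCE-style loss: each image is pulled toward its paired text
    and pushed away from in-batch negative texts and hard-negative texts."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    # Candidate texts: the batch texts (positives on the diagonal) plus hard negatives.
    candidates = F.normalize(torch.cat([text_embeds, hard_negative_embeds], dim=0), dim=-1)

    # Cosine-similarity logits scaled by the temperature (0.05 in this card).
    logits = image_embeds @ candidates.t() / temperature

    # The i-th image's positive text is the i-th candidate.
    targets = torch.arange(image_embeds.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```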
Input: image and text

Output: image and text representations
## Uses

### Use with Transformers

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the CFT-CLIP processor and model from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("humane-lab/CFT-CLIP")
model = AutoModel.from_pretrained("humane-lab/CFT-CLIP")

# Preprocess an image-text pair.
image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

# Projected text and image representations.
outputs = model(**inputs)
text_embeds = outputs.text_embeds
image_embeds = outputs.image_embeds
```
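Continuing the example above, the two representations can be compared with cosine similarity to score how well the thumbnail represents the text; this scoring step is a usage sketch rather than part of the original example.

```python
import torch.nn.functional as F

# Cosine similarity between the image and text representations from the example above;
# a higher score indicates that the thumbnail represents the text more closely.
score = F.cosine_similarity(image_embeds, text_embeds)
print(score)
```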
### Intended Use

The model is intended as a research output for research communities.

### Primary intended uses

The primary intended users of these models are AI researchers.

### Out-of-Scope Use Cases

The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.
## Factors

### Relevant factors

We trained the models with the AdamW optimizer with an initial learning rate of 1e-4, updated by a cosine annealing scheduler.
The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set.
Model training was early-stopped when the validation loss did not decrease for five consecutive checks, measured every 20 iterations.
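A minimal sketch of this optimization setup in PyTorch is shown below; `model`, `train_loader`, `valid_loader`, `total_steps`, and the `training_step`/`validation_loss` helpers are placeholders, while the optimizer, learning rate, scheduler, validation interval, and early-stopping rule follow the description above.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # initial LR of 1e-4
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

best_loss, bad_checks = float("inf"), 0
for step, batch in enumerate(train_loader, start=1):  # loader with minibatch size 128
    loss = training_step(model, batch)  # placeholder: forward pass + contrastive loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

    if step % 20 == 0:  # check validation loss every 20 iterations
        val_loss = validation_loss(model, valid_loader)  # placeholder helper
        if val_loss < best_loss:
            best_loss, bad_checks = val_loss, 0
        else:
            bad_checks += 1
        if bad_checks == 5:  # stop after five consecutive non-improving checks
            break
```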
### Evaluation factors

We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24); the decision threshold was optimized on the validation set.

## Metrics

Model performance measures: the F1-score between model predictions and labels, and the Spearman correlation between the models' cosine similarities and the labels.

Decision thresholds: chosen from cosine similarities on the validation set.

Approaches to uncertainty and variability: measured by repeating the run with 5 different random seeds.
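A sketch of how these measures can be computed from cosine-similarity scores is shown below, assuming scikit-learn and SciPy; the exhaustive threshold search is one way to realize a validation-based threshold choice and is not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score


def pick_threshold(val_sims, val_labels):
    """Choose the cosine-similarity cutoff that maximizes F1 on the validation set."""
    candidates = np.unique(val_sims)
    scores = [f1_score(val_labels, val_sims >= t) for t in candidates]
    return candidates[int(np.argmax(scores))]


def evaluate(test_sims, test_labels, threshold):
    """F1 of thresholded predictions plus Spearman correlation of the raw similarities."""
    preds = (test_sims >= threshold).astype(int)
    rho, _ = spearmanr(test_sims, test_labels)
    return {"f1": f1_score(test_labels, preds), "spearman": rho}
```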
## Data

### Training Data

The model was trained on the summary text paired with the thumbnail image (the image appearing in the first paragraph of the article) from the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/).
The original implementation had two variants: one using [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other using titles instead of summary text from the BBC Dataset.
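Purely as an illustration of what one pretraining example looks like, a pair could be represented as below; the class and field names are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass

from PIL import Image


@dataclass
class PretrainingPair:
    """One (thumbnail, text) pretraining example."""
    image: Image.Image  # the thumbnail, i.e., the image from the article's first paragraph
    text: str           # the article summary (or the title, in the title variant)
```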
### Evaluation Data

For NELA-GT-2021, annotation was performed on 1,000 samples randomly drawn from 10,000 samples not included in the train and validation sets.
For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).
## Evaluation

For comparison, we measured the ability of pretrained vision-language models. In addition to CLIP, we used BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.

|Model|F1|Spearman|
|---|---|---|
|CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
|CLIPAdapt|0.767±0.006|0.459±0.004|
|CLIP|0.763|0.409|
|BLIP|0.737|0.408|
|BLIP-2|0.707|0.415|
|BLIP-2+SBERT|0.694|0.341|
## Ethical Considerations

For pretraining, this study used publicly available news articles shared by news media.
While we tried to build a high-quality corpus for pretraining, it is possible that the model learned hidden biases present in online news.
Also, since CFT-CLIP was updated from pretrained CLIP weights, it may inherit the biases of CLIP.
Users should be cautious about applying the method to problems in a general context and be aware of potential biases.