	---
	language:
	- en
	tags:
	- image-to-text
	license: mit
	datasets:
	- coco2017
	---

	# Vit2-DistilGPT2
	This model takes an image as input and outputs a caption. It was trained on the COCO 2017 dataset, and the full training script is available in [this Kaggle kernel](https://www.kaggle.com/sachin/visionencoderdecoder-model-training).

	## Usage
	```python
	from PIL import Image
	from transformers import GPT2Tokenizer, ViTFeatureExtractor, VisionEncoderDecoderModel

	# Load the ViT encoder + DistilGPT2 decoder checkpoint
	model = VisionEncoderDecoderModel.from_pretrained("sachin/vit2distilgpt2")
	# Note: newer transformers releases deprecate ViTFeatureExtractor in favor of ViTImageProcessor
	vit_feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

	# Make sure GPT2 adds its special token at both the beginning and the end of a sequence
	def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
	    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
	    return outputs

	GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
	gpt2_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
	# Set pad_token to unk_token -> be careful here, as unk_token_id == eos_token_id == bos_token_id
	gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token

	image_path = "path/to/image.jpg"  # placeholder: replace with your own image file
	# The feature extractor resizes and normalizes the image; pixel_values already has a batch dimension
	pixel_values = vit_feature_extractor(Image.open(image_path).convert("RGB"), return_tensors="pt").pixel_values
	generated_ids = model.generate(pixel_values)
	generated_sentences = gpt2_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
	```
	Note that the generated caption may contain repeated sentences, so a post-processing step may be required.
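
One possible post-processing step is sketched below. It is not part of the original training or inference code: the helper name `dedupe_caption` is hypothetical, and the sketch assumes that repeated sentences are separated by periods and that a trailing fragment which is a prefix of an earlier sentence is a truncated repeat.

```python
def dedupe_caption(caption: str) -> str:
    """Drop repeated (or truncated repeated) sentences from a caption.

    Hypothetical helper: assumes sentences are separated by '.'.
    """
    kept = []  # sentences to keep, in their original casing
    seen = []  # lowercase, whitespace-normalized keys of kept sentences
    for sentence in caption.split("."):
        cleaned = " ".join(sentence.split())  # collapse runs of whitespace
        key = cleaned.lower()
        if not key:
            continue  # skip empty pieces left over from splitting
        # Skip exact repeats and fragments that are a prefix of an earlier sentence
        if any(previous.startswith(key) for previous in seen):
            continue
        seen.append(key)
        kept.append(cleaned)
    return (". ".join(kept) + ".") if kept else ""


if __name__ == "__main__":
    print(dedupe_caption("a man riding a wave. a man riding a wave. a man rid"))
```

With the usage snippet above, this could be applied as `[dedupe_caption(s) for s in generated_sentences]`. The prefix check is a heuristic: it would also drop a genuinely shorter sentence that happens to begin an earlier one.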

	## Bias Warning
	This model may be biased because of the dataset, the short training run, and the model itself. The following image shows an example of gender bias.
	![](https://i.imgur.com/9zVN022.png)

	## Results
	<iframe src="https://wandb.ai/sachinruk/Vit2GPT2/reports/Shared-panel-22-01-27-23-01-56--VmlldzoxNDkyMTM3?highlightShare" style="border:none;height:1024px;width:100%"></iframe>