Update README.md

7dae687 almost 2 years ago

3.75 kB


	## CLIP-Vit-Bert-Chinese pretrained model
	这是中文版本的CLIP预训练模型，基于LiT-tuning（Locked-image Text tuning）的策略，使用140万中文图文对数据进行多模态对比学习预训练。

	Github: [CLIP-Chinese](https://github.com/yangjianxin1/CLIP-Chinese)

	Bolg: [CLIP-Chinese：中文多模态对比学习CLIP预训练模型](https://mp.weixin.qq.com/s/6gQX91M-Lt7eiMimhYRJEw)

	## Model and Training Detail
	该模型主要由文本编码器与图像编码器组成，其中文本编码器为Bert，图像编码器为Vit，我们将该模型称为BertCLIP模型。训练时，Bert使用Langboat/mengzi-bert-base的权重进行初始化，Vit使用openai/clip-vit-large-patch32
	的权重进行初始化。采用LiT-tuning（Locked-image Text tuning）的策略进行训练，也就是冻结Vit的权重，训练BertCLIP模型剩余的权重。

	## Usage
	首先将项目clone到本地，并且安装依赖包
	```bash
	git clone https://github.com/yangjianxin1/CLIP-Chinese
	pip install -r requirements.txt
	```

	使用如下脚本，就可成功加载预训练权重，对图片和文本进行预处理，并且得到模型的输出
	```python
	from transformers import CLIPProcessor
	from component.model import BertCLIPModel
	from PIL import Image
	import requests

	model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
	# 加载预训练模型权重
	model = BertCLIPModel.from_pretrained(model_name_or_path)
	CLIPProcessor.tokenizer_class = 'BertTokenizerFast'
	# 初始化processor
	processor = CLIPProcessor.from_pretrained(model_name_or_path)
	# 预处理输入
	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	image = Image.open(requests.get(url, stream=True).raw)
	inputs = processor(text=["一只小狗在摇尾巴", "一只小猪在吃饭"], images=image, return_tensors="pt", padding=True)
	inputs.pop('token_type_ids') # 输入中不包含token_type_ids

	outputs = model(**inputs)

	# 对于每张图片，计算其与所有文本的相似度
	logits_per_image = outputs.logits_per_image # image-text的相似度得分
	probs = logits_per_image.softmax(dim=1) # 对分数进行归一化

	# 对于每个文本，计算其与所有图片的相似度
	logits_per_text = outputs.logits_per_text # text-image的相似度得分
	probs = logits_per_text.softmax(dim=1) # 对分数进行归一化

	# 获得文本编码
	text_embeds = outputs.text_embeds
	# 获得图像编码
	image_embeds = outputs.image_embeds
	```

	单独加载图像编码器，进行下游任务
	```python
	from PIL import Image
	import requests
	from transformers import CLIPProcessor, CLIPVisionModel

	model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
	model = CLIPVisionModel.from_pretrained(model_name_or_path)
	CLIPProcessor.tokenizer_class = 'BertTokenizerFast'
	processor = CLIPProcessor.from_pretrained(model_name_or_path)

	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	inputs = processor(images=image, return_tensors="pt")

	outputs = model(**inputs)
	last_hidden_state = outputs.last_hidden_state
	pooled_output = outputs.pooler_output
	```

	单独加载文本编码器，进行下游任务

	```python
	from component.model import BertCLIPTextModel
	from transformers import BertTokenizerFast

	model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
	model = BertCLIPTextModel.from_pretrained(model_name_or_path)
	tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)

	inputs = tokenizer(["一只小狗在摇尾巴", "一只小猪在吃饭"], padding=True, return_tensors="pt")
	inputs.pop('token_type_ids') # 输入中不包含token_type_ids

	outputs = model(**inputs)
	last_hidden_state = outputs.last_hidden_state
	pooled_output = outputs.pooler_output
	```