
	---
	license: cc-by-4.0
	datasets:
	- patomp/thai-mscoco-2014-captions
	metrics:
	- recall
	---
	## Requirements

	```bash
	pip install pythainlp
	pip install "gensim>=4.3.1"
	pip install git+https://github.com/openai/CLIP.git
	```

	## Usage

	Encode a text:
	```python
	from transformers import AutoModel

	text = 'หมากำลังวิ่งในสนามหญ้า'
	model = AutoModel.from_pretrained("patomp/thai-light-multimodal-clip-and-distill", trust_remote_code=True)

	embeddings = model(text)
	print("Text features shape:", embeddings.shape)

	```

	Encode an image with the CLIP `ViT-B/32` image encoder:
	```python
	import torch
	import clip
	import requests
	from PIL import Image

	device = "cuda" if torch.cuda.is_available() else "cpu"
	model, preprocess = clip.load("ViT-B/32", device=device)

	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	image = Image.open(requests.get(url, stream=True).raw)
	image = preprocess(image).unsqueeze(0).to(device)

	with torch.no_grad():
	    image_features = model.encode_image(image)

	print("Image features shape:", image_features.shape)
	```
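
To compare a text with an image, the two embeddings can be scored with cosine similarity. The sketch below assumes the text embedding lies in the same space, and has the same dimensionality, as the CLIP `ViT-B/32` image features; the benchmark setup suggests this, but it is not stated explicitly here. The vectors and file names are toy, hypothetical values. With real outputs you would first flatten each result to a plain list, for example with `.squeeze().tolist()` if the output is a tensor or array.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (hypothetical values, 3-d for brevity).
text_vec = [0.2, 0.1, 0.7]
image_vecs = {
    "dog.jpg": [0.25, 0.05, 0.8],
    "cat.jpg": [0.9, 0.3, 0.1],
}

# Score every image against the text and keep the best match.
scores = {name: cosine_similarity(text_vec, vec) for name, vec in image_vecs.items()}
best = max(scores, key=scores.get)
print("Best match:", best)
```

Ranking all candidate images by this score is the basis of the text-find-image retrieval task reported in the benchmark.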

	## Benchmark

	Evaluated on the test set of the [Thai MS COCO 2014 dataset](https://huggingface.co/datasets/patomp/thai-mscoco-2014-captions):

	| Model \ Metrics | text-find-image recall@1 | text-find-image recall@10 | image-find-text recall@1 | image-find-text recall@10 | # text samples per second* |
	| :--- | --- | --- | --- | --- | --- |
	| Multilingual Encoder | | | | | |
	| [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) | 0.075 | 0.242 | 0.096 | 0.286 | 251 |
	| [XLM-Roberta-Large-Vit-B-32](https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32) | 0.226 | 0.565 | 0.265 | 0.596 | 20 |
	| Thai Encoder (WangchanBERTa-based) | | | | | |
	| [Thai-Cross-CLIP](https://github.com/vikimark/Thai-Cross-CLIP) | 0.167 | 0.475 | 0.197 | 0.523 | 48 |
	| Thai Encoder (Thai2Fit-based) | | | | | |
	| [thai-light-multimodal-clip-and-distill](https://huggingface.co/patomp/thai-light-multimodal-clip-and-distill) | 0.082 | 0.328 | 0.118 | 0.401 | 450 |
	| [thai-light-multimodal-distill](https://huggingface.co/patomp/thai-light-multimodal-distill) | 0.084 | 0.319 | 0.122 | 0.401 | 450 |
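
For readers unfamiliar with the metric, recall@k is commonly defined as the fraction of queries for which at least one correct item appears among the k highest-scoring candidates. The sketch below illustrates that common definition on a toy similarity matrix. It is not the evaluation script used for the table, and the function name `recall_at_k` and all values are hypothetical.

```python
def recall_at_k(similarity, relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k.

    similarity: one row of scores per query, one column per candidate.
    relevant:   for each query, the set of correct candidate indices.
    """
    hits = 0
    for scores, correct in zip(similarity, relevant):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 candidate images.
similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct image (0) is ranked first
    [0.2, 0.4, 0.8],  # query 1: correct image (1) is ranked second
    [0.7, 0.6, 0.1],  # query 2: correct image (2) is ranked last
]
relevant = [{0}, {1}, {2}]

print("recall@1:", recall_at_k(similarity, relevant, k=1))
print("recall@2:", recall_at_k(similarity, relevant, k=2))
```

Using a set of correct indices per query lets the same function cover the image-find-text direction, where one image can have several valid captions.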

	## Reference

	Parts of this content are adapted from https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-32.

	For more details, see https://github.com/calzonelover/Lightweight-Multi-modal-Encoder-for-Thai.