CogVLM-grounding-generalist-hf-quant4

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: visual-question-answering
	---

	# CogVLM

	CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks (NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC) and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. CogVLM can also [chat with you](http://36.103.203.44:7861/) about images.

	<div align="center">
	<img src="https://github.com/THUDM/CogVLM/raw/main/assets/metrics-min.png" alt="img" style="zoom: 50%;" />
	</div>
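
The `quant4` suffix in the repository name suggests 4-bit weights. As a rough illustration of why that matters for a 17B-parameter model, the sketch below estimates the size of the weights alone at two precisions. It is back-of-the-envelope arithmetic under stated assumptions: the helper `weight_memory_gib` is hypothetical, every parameter is assumed to be stored at the given bit width, and quantization metadata, activations, and any layers kept in higher precision are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / 2**30


# 10B vision + 7B language parameters, as quoted in the description above.
total_params = 10e9 + 7e9

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(total_params, bits):.1f} GiB")
```

Actual GPU memory use will be higher than these figures, since generation also needs room for activations and the key-value cache.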

	# Dependencies

	```bash
	pip install torch==2.1.0 transformers==4.35.0 accelerate==0.24.1 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.22.post7 triton==2.1.0
	```
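
Because the versions above are pinned exactly, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the standard library; the helper `check_pins` and the `PINNED` table are illustrative names, not part of this repository. It assumes that a local build suffix such as `+cu121` on a wheel version should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINNED = {
    "torch": "2.1.0",
    "transformers": "4.35.0",
    "accelerate": "0.24.1",
    "sentencepiece": "0.1.99",
    "einops": "0.7.0",
    "xformers": "0.0.22.post7",
    "triton": "2.1.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (wanted, found)} for every missing or mismatched package."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        # Assumption: drop a local suffix such as "+cu121" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in check_pins(PINNED).items():
    print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every pinned package is present at the expected version.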

	# Quickstart

	```python
	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM, LlamaTokenizer

	model_path = 'Model/folder/path/here'

	# The tokenizer is loaded from the Vicuna checkpoint; the model comes from the local folder.
	tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
	model = AutoModelForCausalLM.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    low_cpu_mem_usage=True,
	    trust_remote_code=True
	).eval()

	# chat example
	query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
	image = Image.open("your/image/path/here").convert('RGB')
	inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode

	# Add a batch dimension and move every tensor to the GPU.
	inputs = {
	    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
	    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
	    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
	    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
	}
	gen_kwargs = {"max_length": 2048, "do_sample": False}

	with torch.no_grad():
	    # Both dicts must be unpacked into keyword arguments.
	    outputs = model.generate(**inputs, **gen_kwargs)
	    # Drop the prompt tokens so that only the generated answer is decoded.
	    outputs = outputs[:, inputs['input_ids'].shape[1]:]
	    print(tokenizer.decode(outputs[0]))

	```
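
The query above asks the model to embed boxes of the form `[[x0,y0,x1,y1]]` in its answer. A minimal sketch of pulling those boxes out of the decoded text and converting them to pixel coordinates is shown below. It rests on assumptions that should be checked against the upstream CogVLM documentation: coordinates are taken to be relative positions multiplied by 1000, and several boxes for one phrase are taken to be separated by `;` inside one pair of double brackets. The helper `extract_boxes` and the sample response are hypothetical.

```python
import re

# Matches one [[...]] group containing digits, commas, semicolons and spaces.
GROUP_PATTERN = re.compile(r"\[\[([\d,;\s]+)\]\]")


def extract_boxes(text, width, height, scale=1000):
    """Return a list of (x0, y0, x1, y1) boxes in pixels found in `text`.

    Assumption: coordinates are relative values multiplied by `scale`.
    """
    boxes = []
    for group in GROUP_PATTERN.findall(text):
        for part in group.split(";"):
            values = [int(v) for v in part.split(",") if v.strip()]
            if len(values) != 4:
                continue  # skip malformed entries rather than guess
            x0, y0, x1, y1 = values
            boxes.append((
                x0 * width / scale,
                y0 * height / scale,
                x1 * width / scale,
                y1 * height / scale,
            ))
    return boxes


# Hypothetical response text, for illustration only.
response = "A dog [[120,340,560,900]] next to a ball [[600,700,680,780]]."
for box in extract_boxes(response, width=1000, height=500):
    print(box)
```

In practice `width` and `height` would come from `image.size` of the PIL image passed to the model.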

	# License

	The code in this repository is open source under the [Apache-2.0 license](https://github.com/THUDM/CogVLM/raw/main/LICENSE), while the use of the CogVLM model weights must comply with the [Model License](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE).



	# Citation

	If you find our work helpful, please consider citing the following paper:
	```
	@article{wang2023cogvlm,
	  title={CogVLM: Visual Expert for Pretrained Language Models},
	  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
	  year={2023},
	  eprint={2311.03079},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV}
	}
	```