---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
---

# Model card for InternViT-6B-224px

## Model Details
- **Model Type:** feature backbone
- **Model Stats:**
  - Params (M): 5903
  - Image size: 224 x 224
- **Papers:**
  - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- **GitHub:**
  - https://github.com/OpenGVLab/InternVL
- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi

## Model Usage

### Image Embeddings

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)
```
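For downstream use you typically want a plain tensor rather than the raw model output. The snippet below is a minimal sketch that continues from the code above; it assumes `outputs` exposes a `last_hidden_state` field like a standard `transformers` vision-model output (the exact fields come from this repository's remote code, so verify them there), and mean pooling over tokens is shown only as one illustrative way to get a single vector per image:

```python
# Sketch only: assumes `outputs` exposes `last_hidden_state` like a standard
# transformers vision-model output; check the repo's remote code to confirm.
token_features = outputs.last_hidden_state          # (batch, num_tokens, hidden_dim)

# One simple way to get a single embedding per image: mean-pool over tokens.
image_embedding = token_features.float().mean(dim=1)  # (batch, hidden_dim)
print(image_embedding.shape)
```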