---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
---

# Model card for InternViT-6B-224px

## Model Details

- **Model Type:** feature backbone
- **Model Stats:**
  - Params (M): 5903
  - Image size: 224 x 224
- **Papers:**
  - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- **GitHub:** https://github.com/OpenGVLab/InternVL
- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi
## Model Usage

### Image Embeddings

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the model in bfloat16; trust_remote_code is required because the
# modeling code is shipped inside the model repository.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

# The matching preprocessor resizes and normalizes the image to 224 x 224.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Forward pass: produces the image feature representation.
outputs = model(pixel_values)
```
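
Downstream, image embeddings extracted by a feature backbone like this are usually compared with cosine similarity (e.g. for retrieval or deduplication). The sketch below uses plain Python lists as stand-in embeddings; the vectors and the `cosine_similarity` helper are illustrative, not part of the model's API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in embeddings; in practice these would come from pooled model outputs.
emb_cat = [0.8, 0.1, 0.1]
emb_kitten = [0.7, 0.2, 0.1]
emb_car = [0.1, 0.1, 0.9]

# Semantically similar images should score higher than dissimilar ones.
print(cosine_similarity(emb_cat, emb_kitten) > cosine_similarity(emb_cat, emb_car))
```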