Strange CUDA Error for Multi-GPU Setup
#6
by floschne - opened
Hi,
First of all, thanks for sharing the model and providing such detailed docs!
I'm experiencing a strange CUDA error when running the model on four A40 (46 GB) GPUs with the device map code you provided.
2025-02-08 23:57:22.458 | ERROR | __main__:main:210 - Error during response generation for sample 0: varlen_fwd(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: Optional[torch.Tensor], arg4: torch.Tensor, arg5: torch.Tensor, arg6: Optional[torch.Tensor], arg7: Optional[torch.Tensor], arg8: Optional[torch.Tensor], arg9: Optional[torch.Tensor], arg10: int, arg11: int, arg12: float, arg13: float, arg14: bool, arg15: bool, arg16: int, arg17: int, arg18: float, arg19: bool, arg20: Optional[torch.Generator]) -> list[torch.Tensor]
Invoked with: tensor([[[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., -0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., -0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., 0., -0., 0.],
[0., 0., 0., ..., -0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.],
[0., 0., -0., ..., 0., 0., 0.]],
...,
[-0.0012, -0.0014, -0.0009, ..., 0.0038, 0.0004, -0.0022]],
[[ 0.0004, -0.0002, -0.0007, ..., -0.0007, -0.0022, 0.0021],
[ 0.0006, -0.0008, 0.0017, ..., -0.0023, -0.0007, 0.0018],
[-0.0007, -0.0018, 0.0015, ..., -0.0010, -0.0012, -0.0011],
...,
[ 0.0023, -0.0028, 0.0023, ..., 0.0049, 0.0030, -0.0028],
[-0.0020, 0.0014, 0.0004, ..., 0.0001, -0.0033, -0.0050],
[-0.0012, -0.0014, -0.0009, ..., 0.0038, 0.0004, -0.0022]]],
device='cuda:1', dtype=torch.bfloat16), None, tensor([1748], device='cuda:1', dtype=torch.int32), tensor([1748], device='cuda:1', dtype=torch.int32), None, None, None, None, 4575867776795852673, 4575867776795852673, 0.0, 0.08838834764831845, False, True, -1, -1, 0.0, False, None
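One detail in the call above may be worth checking: the two integer arguments printed as 4575867776795852673 sit in the positions the signature lists as `arg10: int` and `arg11: int`. The sketch below assumes those parameters are 32-bit C++ ints in the compiled extension (an assumption, not something the log confirms) and simply tests whether the logged values would fit. The helper name `fits_int32` is hypothetical.

```python
# Bounds of a signed 32-bit integer.
# Assumption: the binding's `int` parameters are 32-bit C++ ints.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True if `value` is representable as a signed 32-bit int."""
    return INT32_MIN <= value <= INT32_MAX


# Value copied from the "Invoked with" part of the log (arg10 and arg11).
logged_int_arg = 4575867776795852673

# Sequence length that appears in the logged tensors, for comparison.
plausible_seqlen = 1748

print(fits_int32(logged_int_arg))    # False
print(fits_int32(plausible_seqlen))  # True
```

If that assumption holds, an out-of-range value alone could be enough for the binding to reject the call with "incompatible function arguments", which would point at how that integer is computed rather than at the tensors themselves.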
After some googling, I think it's somehow related to flash-attention, but I'm afraid I can't fix it w/o some serious monkey patching... I'm using flash-attn==2.7.3, CUDA==12.1, transformers==4.48.0. Do you know how to fix this?
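For anyone trying to reproduce this, the exact package versions matter, so here is a minimal sketch for printing them from the environment that raises the error. It only uses the standard library. The helper name `installed_versions` is hypothetical, and the package list is just the packages mentioned above plus `torch`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as published on PyPI.
    report = installed_versions(["torch", "transformers", "flash-attn"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same virtual environment as the inference script avoids reporting versions from a different interpreter.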