Instructions for using microsoft/git-base-vatex with libraries, inference providers, and local apps.

Libraries

How to use microsoft/git-base-vatex with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/git-base-vatex")

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
model = AutoModelForMultimodalLM.from_pretrained("microsoft/git-base-vatex")

Local Apps

vLLM

How to use microsoft/git-base-vatex with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/git-base-vatex"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/git-base-vatex",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
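
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes a server started with the commands above is listening on the given base URL, and the helper names `build_completion_request` and `send_completion_request` are illustrative, not part of any library.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion_request(request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request(
    "http://localhost:8000", "microsoft/git-base-vatex", "Once upon a time,"
)
# Uncomment once the server is running:
# print(send_completion_request(request))
```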

Use Docker

docker model run hf.co/microsoft/git-base-vatex

SGLang

How to use microsoft/git-base-vatex with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/git-base-vatex" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/git-base-vatex",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/git-base-vatex" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/git-base-vatex",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

GIT (GenerativeImage2Text), base-sized, fine-tuned on VATEX

GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on VATEX. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.

Disclaimer: The team releasing GIT did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on many (image, text) pairs.

The goal for the model is simply to predict the next text token, given the image tokens and the previous text tokens.

The model has full access to (i.e. a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e. a causal attention mask is used for the text tokens) when predicting the next text token.
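
The attention pattern described above can be illustrated with a small NumPy sketch. This is not the model's actual implementation; the function name `git_style_attention_mask` is hypothetical, and the choice that image tokens do not attend to text tokens is an assumption made here for illustration.

```python
import numpy as np


def git_style_attention_mask(num_image_tokens, num_text_tokens):
    """Return a boolean mask where mask[i, j] is True if position i may attend to j.

    Positions 0..num_image_tokens-1 are image tokens; the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = np.zeros((total, total), dtype=bool)

    # Every position (image or text) sees all image tokens: bidirectional over the image.
    mask[:, :num_image_tokens] = True

    # Text tokens see only themselves and earlier text tokens: causal over the text.
    causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=bool))
    mask[num_image_tokens:, num_image_tokens:] = causal

    # Assumption: image rows keep False in the text columns (image tokens ignore text).
    return mask


mask = git_style_attention_mask(num_image_tokens=2, num_text_tokens=3)
print(mask.astype(int))
```

Each text row has ones across all image columns and a lower-triangular pattern across the text columns, which is what lets the decoder predict the next text token without seeing later ones.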

This allows the model to be used for tasks like:

image and video captioning
visual question answering (VQA) on images and videos
even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).

Intended uses & limitations

You can use the raw model for video captioning. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

For code examples, we refer to the documentation.
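
As a rough sketch of what video captioning with this checkpoint might look like, the snippet below follows the general pattern of loading the processor and model, sampling frames, and calling `generate`. It assumes `transformers`, `torch`, and `numpy` are installed and that the video has already been decoded into a list of RGB frames (for example with a video library of your choice). The helpers `sample_frame_indices` and `caption_video` are illustrative names, and the even spacing of frames is an assumption rather than the sampling used during fine-tuning.

```python
import numpy as np


def sample_frame_indices(num_frames_total, clip_len):
    """Pick clip_len roughly evenly spaced frame indices from a video."""
    positions = np.linspace(0, num_frames_total - 1, num=clip_len)
    return positions.round().astype(int).tolist()


def caption_video(frames, model_id="microsoft/git-base-vatex", max_length=50):
    """Generate a caption for a list of RGB frames (H x W x 3 uint8 arrays).

    Imports are local so the sampling helper works without transformers installed.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # The processor turns the sampled frames into the pixel tensor the model expects.
    pixel_values = processor(images=list(frames), return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example: choose 6 frames from a 100-frame video.
indices = sample_frame_indices(num_frames_total=100, clip_len=6)
print(indices)
```

The number of frames to sample should match what the checkpoint expects; in the Transformers GIT configuration this appears to be exposed as `model.config.num_image_with_embedding`, so reading it from the loaded model is safer than hard-coding a value.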

Training data

From the paper:

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).

However, this describes the model referred to as "GIT" in the paper, which is not open-sourced.

This checkpoint is "GIT-base", which is a smaller variant of GIT trained on 10 million image-text pairs.

Next, the model was fine-tuned on VATEX.

See table 11 in the paper for more details.

Preprocessing

See the original repo for details of preprocessing during training.

During validation, the shorter edge of each frame is resized, and the frame is then center-cropped to a fixed resolution. Next, the frames are normalized across the RGB channels using the ImageNet mean and standard deviation.
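
The validation-time steps above can be sketched in NumPy as follows. This is an illustration, not the original preprocessing code: the default target size of 224 and the nearest-neighbor resize are assumptions, and the function names are hypothetical. In practice the processor loaded with `AutoProcessor` handles these steps.

```python
import numpy as np

# Standard ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_shorter_edge(frame, size):
    """Resize so the shorter edge equals size (nearest-neighbor, an assumption)."""
    height, width = frame.shape[:2]
    scale = size / min(height, width)
    new_h = max(size, round(height * scale))
    new_w = max(size, round(width * scale))
    rows = np.minimum((np.arange(new_h) * height / new_h).astype(int), height - 1)
    cols = np.minimum((np.arange(new_w) * width / new_w).astype(int), width - 1)
    return frame[rows][:, cols]


def center_crop(frame, size):
    """Cut a size x size window from the middle of the frame."""
    height, width = frame.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return frame[top:top + size, left:left + size]


def preprocess_frame(frame, size=224):
    """Resize, center-crop, and normalize one H x W x 3 uint8 RGB frame."""
    frame = center_crop(resize_shorter_edge(frame, size), size)
    pixels = frame.astype(np.float32) / 255.0
    pixels = (pixels - IMAGENET_MEAN) / IMAGENET_STD
    # Channels-first layout (3, size, size), as vision models commonly expect.
    return pixels.transpose(2, 0, 1)


example = preprocess_frame(np.zeros((300, 400, 3), dtype=np.uint8))
print(example.shape)
```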