Fine-Tune Meta Llama 3.2-Vision-Instruct Multimodal LLM on Intel Accelerators
In this article, I will show you - with code - how to fine-tune a Multimodal Large Language Model (MLLM), Meta Llama-3.2 11B Vision Instruct, on an image-caption dataset. Yes, Llama 3.3 was recently released, but a Vision Instruct version of it is not out yet. Extracting images and captions from PDF documents is an important step in its own right, but I do not cover it in this article; instead, I focus on the actual fine-tuning and on how to accelerate the process using Intel® Gaudi® 2 AI processors. For efficient training, I use Low-Rank Adaptation (LoRA). I prepared a Docker container and will share my strategy and code so that you can also fine-tune the Llama-3.2 model, or another multimodal model from Hugging Face.
Outline:
- What is a Multimodal LLM?
- Llama Models - Gain access to Meta Llama models on Hugging Face.
- Docker - Set up the environment to build a Docker image, and run a Docker container.
- Dataset - Use an image-caption dataset from Hugging Face.
- Fine-Tuning - Fine-tune the Llama-3.2 11B Vision Instruct model.
- Inference - Test the models before and after fine-tuning.
- Try it yourself - Try out the latest Intel AI accelerators on the Intel Tiber AI Cloud.
What is a Multimodal LLM?
Multimodal Large Language Models (MLLMs) extend Large Language Models (LLMs) like OpenAI GPT-4 and Llama 3 to incorporate kinds of media beyond text. MLLMs expand the reasoning of LLMs to encompass images, audio, and video inputs and outputs. For the fine-tuning in this article, I focus on a specific subcategory of MLLMs, called Visual Language Models, or VLMs for short. VLMs learn from image and text inputs, and generate text output that describes image content or answers questions based on those inputs. The VLM of interest here is Llama-3.2 11B Vision Instruct, which was trained specifically to answer questions about images. For example, one could prompt the VLM with “Describe the process being depicted in the image.”
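To make the combined image-and-text input concrete, here is a minimal sketch of such a prompt in the chat-message format that the Hugging Face processor for Llama-3.2 Vision models expects; the processor fills in the actual prompt template, and the image itself is supplied separately.
# A single user turn pairing an image placeholder with a text question.
# The PIL image is passed to the processor alongside this message list.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the process being depicted in the image."},
        ],
    }
]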
Llama Models
To use Meta Llama models, you need to first request access on Hugging Face. You can follow the instructions on Meta's Hugging Face organization card here: https://huggingface.co/meta-llama to gain access to their models. Then, generate an access token from within your Hugging Face profile settings: click your profile avatar in the top right of the screen, select "Access Tokens", click "+ Create new token", and then choose "Read access to contents of all public gated repos you can access". I recommend keeping the generated token in a hidden file.
Docker
Next, let's go through the environment and Docker setup. I am running in a Linux environment with Ubuntu 22.04 and have 8x Intel Gaudi 2 cards. Similar to nvidia-smi for NVIDIA GPUs, I can run hl-smi to see the status of the 8x Intel Gaudi cards. The output should be something similar to:
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.18.0-fw-53.1.1.1 |
| Driver Version: 1.18.0-ee698fb |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:cc:00.0 N/A | 0 |
| N/A 26C N/A 90W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:44:00.0 N/A | 0 |
| N/A 24C N/A 88W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:1a:00.0 N/A | 0 |
| N/A 28C N/A 89W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 27C N/A 85W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 24C N/A 66W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 25C N/A 87W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:43:00.0 N/A | 26 |
| N/A 29C N/A 278W / 600W | 98304MiB / 98304MiB | 41% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:cd:00.0 N/A | 0 |
| N/A 25C N/A 95W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
Create a requirements.txt file with the necessary Python packages:
huggingface_hub[cli]==0.27.0
optimum-habana==1.15.0
peft==0.14.0
Levenshtein==0.26.1
sentencepiece!=0.1.92
git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
I do assume that you have installed Docker on your Linux system. If not, you can follow the steps from the Docker Docs. Following the guide in the Intel Gaudi documentation here, I created a Dockerfile so that I could easily replicate my work and run the fine-tuning within a container. Please note that you may need to update the version numbers in the Dockerfile. I first build the image, then enter the Docker container interactively to experiment with the scripts I am running. If you want, you can have the Docker container run all your scripts for you, but I wanted to interact with it manually after setting up the environment.
# Get the Gaudi Docker image for Ubuntu 22.04
FROM vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
# Set environment variables
ENV HABANA_VISIBLE_DEVICES=all
ENV OMPI_MCA_btl_vader_single_copy_mechanism=none
# Set working directory in container
WORKDIR /app/home
# Copy the current directory into the container at /app/home. This ensures we copy requirements.txt to be able to install packages later.
COPY . .
# Specify a mount point in the container
VOLUME ["/app/home"]
# Clone optimum-habana to easily run some of the scripts
RUN git clone https://github.com/huggingface/optimum-habana
# Install python packages
RUN pip install -r requirements.txt
Now we are ready to build our Docker image and then create and run a Docker container based on that image. To build the Docker image, you can run
docker build -t llama32-visioninstruct-image .
Then, to create and run the container interactively, you can run
docker run -it --runtime=habana llama32-visioninstruct-image
Now you should be inside your Docker container at a command prompt, and you can enter your Hugging Face token to gain access to the Llama model. Since the Hugging Face CLI was installed from the requirements.txt file, run the login command with:
huggingface-cli login
Then copy-paste your Hugging Face token, and you should have access to the Llama model.
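Alternatively, if you kept the token in a hidden file as suggested earlier, you can log in programmatically with the huggingface_hub Python API. Here is a minimal sketch, assuming the token lives in a file such as ~/.hf_token (an example path, not a requirement):
# Read the Hugging Face token from a hidden file and log in programmatically.
from pathlib import Path
from huggingface_hub import login

token = Path("~/.hf_token").expanduser().read_text().strip()
login(token=token)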
Dataset
The dataset used for fine-tuning the Llama-3.2 11B Vision Instruct model consists of image-caption pairs from nielsr/docvqa_1200_examples. There are a total of 1000 image-caption pairs in the training set and 200 image-caption pairs in the test set. The best way to describe the dataset is to look at an example image-caption pair, such as train_1125 from the training set.
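If you want to inspect the data yourself before training, here is a minimal sketch using the Hugging Face datasets library; the exact column layout may differ slightly from what this sketch assumes.
# Load the DocVQA subset used for fine-tuning and inspect one training example.
from datasets import load_dataset

ds = load_dataset("nielsr/docvqa_1200_examples")
print(ds)  # shows the train/test splits and their column names

sample = ds["train"][0]
# The fine-tuning command below feeds the 'image' and 'query' columns to the model
# and uses the 'answers' column as the target text.
print(sample["query"])
print(sample["answers"])
sample["image"].save("train_example.png")  # the image is a PIL object; save it to view it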
Fine-Tuning
To make fine-tuning more efficient by not having to update all of the model's weights, I used Low-Rank Adaptation (LoRA):
[LoRA] freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
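For intuition, here is roughly what that setup corresponds to in the peft library. This is an illustrative sketch rather than the exact code in the fine-tuning script, and the target_modules shown are only an assumed example:
# Illustrative LoRA configuration matching the rank/alpha/dropout flags used below.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # --lora_rank=8
    lora_alpha=8,                         # --lora_alpha=8
    lora_dropout=0.1,                     # --lora_dropout=0.1
    target_modules=["q_proj", "v_proj"],  # assumed example; the script picks its own targets
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # wraps the frozen base model with small trainable adapters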
The code to run the example is taken directly from the Optimum for Intel® Gaudi® AI Accelerator image-to-text examples. I ran the following command within the Docker container for fine-tuning:
python3 optimum-habana/examples/gaudi_spawn.py \
--world_size 8 --use_mpi optimum-habana/examples/image-to-text/run_image2text_lora_finetune.py \
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
--dataset_name nielsr/docvqa_1200_examples \
--bf16 True \
--output_dir ./model_lora_llama_11b \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--weight_decay 0.01 \
--logging_steps 25 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 5e-5 \
--warmup_steps 50 \
--lr_scheduler_type "constant" \
--input_column_names 'image' 'query' \
--output_column_names 'answers' \
--remove_unused_columns False \
--do_train \
--do_eval \
--use_habana \
--use_lazy_mode \
--lora_rank=8 \
--lora_alpha=8 \
--lora_dropout=0.1 \
--low_cpu_mem_usage True \
--max_seq_length=512
After downloading the foundation model (around 10 minutes), the training took around 20 minutes on 8x Intel Gaudi 2 cards. The end of the training output was as follows:
[2025-01-28 16:09:05,995] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[INFO|trainer.py:826] 2025-01-28 16:09:10,247 >> ***** Running training *****
[INFO|trainer.py:827] 2025-01-28 16:09:10,247 >> Num examples = 1,000
[INFO|trainer.py:828] 2025-01-28 16:09:10,247 >> Num Epochs = 2
[INFO|trainer.py:829] 2025-01-28 16:09:10,247 >> Instantaneous batch size per device = 2
[INFO|trainer.py:832] 2025-01-28 16:09:10,247 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:833] 2025-01-28 16:09:10,247 >> Gradient Accumulation steps = 16
[INFO|trainer.py:834] 2025-01-28 16:09:10,247 >> Total optimization steps = 6
[INFO|trainer.py:835] 2025-01-28 16:09:10,255 >> Number of trainable parameters = 26,214,400
100%|██████████| 6/6 [07:20<00:00, 46.81s/it][INFO|trainer.py:1123] 2025-01-28 16:16:30,543 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 451.3828, 'train_samples_per_second': 4.431, 'train_steps_per_second': 0.013, 'train_loss': 2.194087028503418, 'epoch': 1.52, 'memory_allocated (GB)': 47.11, 'max_memory_allocated (GB)': 92.95, 'total_memory_available (GB)': 94.62}
100%|██████████| 6/6 [07:31<00:00, 75.24s/it]
[INFO|trainer.py:1656] 2025-01-28 16:16:41,712 >> Saving model checkpoint to ./model_lora_llama_11b
[INFO|configuration_utils.py:125] 2025-01-28 16:16:57,573 >> Configuration saved in ./model_lora_llama_11b/gaudi_config.json
***** train metrics *****
epoch = 1.5238
max_memory_allocated (GB) = 92.95
memory_allocated (GB) = 47.11
total_flos = 44372460GF
total_memory_available (GB) = 94.62
train_loss = 2.1941
train_runtime = 0:07:31.38
train_samples_per_second = 4.431
train_steps_per_second = 0.013
01/28/2025 16:17:12 - INFO - __main__ - generated: ['\n\nBengaluru']
100%|██████████| 200/200 [04:10<00:00, 1.25s/it]
***** eval metrics *****
eval_accuracy = 0.9142
Inference
Now that we have a fine-tuned version of the Llama-3.2 11B Vision Instruct model, we should test both the original and the fine-tuned model on the unseen (test) data. The test dataset has 200 samples in total. I used the same prompt for both models:
Answer briefly. Which brand has 10x Vitamin E in the picture?
Here is a test image (val_146), the answers from both models, and the original caption.
- Llama-3.2 11B Vision Instruct: “The brand with 10x Vitamin E in the picture is Vivel.”
- Fine-tuned model: “Vivel”
- Actual caption label: “vivel”
Both models performed well in this case. The base Llama-3.2 model gives a more detailed answer, even though the prompt asked it to be brief. The fine-tuned model answers with a single word, likely because it adheres to its labeled training data, where the answers are more concise.
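If you want to run a comparison like this yourself outside the optimum-habana example scripts, here is a minimal sketch using the generic transformers and peft APIs. It assumes the LoRA adapter was saved to ./model_lora_llama_11b as in the training command above, and it omits Gaudi-specific device handling, which optimum-habana normally takes care of:
# Load the base model plus the LoRA adapter and answer a question about a test image.
import torch
from datasets import load_dataset
from transformers import AutoProcessor, MllamaForConditionalGeneration
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(base_id)
model = MllamaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "./model_lora_llama_11b")  # LoRA output directory

# Grab an image from the test split and ask the same question used above.
image = load_dataset("nielsr/docvqa_1200_examples", split="test")[0]["image"]
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Answer briefly. Which brand has 10x Vitamin E in the picture?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))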
Try it yourself
We were able to walk through a hands-on example of fine-tuning the Llama-3.2 11B Vision Instruct multimodal LLM on an Intel AI Accelerator. Try one of the powerful Intel accelerators yourself on the Intel® Tiber™ AI Cloud. The documentation to get started with a new account can be found here, and for more Intel AI software documentation and tutorials, see Intel AI Development Resources.
Come and chat with other developers on Intel’s Discord server: Intel DevHub.
Also, feel free to comment on the article below or reach out to me on LinkedIn. Happy coding!