How to run Llama 3.2 11B Vision with Hugging Face Transformers on a cloud GPU

Meta’s Llama foundation models have spurred AI innovation by making AI accessible to numerous developers and startups. The Llama family of large language models (LLMs) is now the leading open-source model family, having been downloaded nearly 350 million times. Over the past few months, vision-language (VL) models such as Mistral’s Pixtral and Qwen2-VL have added a new dimension to open-source AI innovation by making it possible to query images, understand them, and combine that understanding with language capabilities. We recently analyzed Pixtral 12B and found its image recognition, text extraction, and multimodal processing capabilities impressive. In this article, we’ll explore how to run Meta’s new multimodal model, Llama 3.2 11B Vision, on the Ori GPU cloud, and also see how it compares with Pixtral. Here’s a quick rundown of the Llama 3.2 11B Vision specifications:
| Attribute | Llama 3.2 Multimodal |
|---|---|
| Architecture | Natively multimodal: a vision adapter trained on image-text pairs connects a pre-trained image encoder to the pre-trained Llama 3.1 language model |
| Parameters | Two models, with 90B and 11B parameters respectively |
| Model variants | Instruction-tuned: for visual recognition, image reasoning, captioning, and assistant-like chat with images. Pre-trained: adaptable to a variety of image reasoning tasks |
| Context length | 128K tokens |
| License | Llama 3.2 Community: Commercial and research |
Performance benchmark results shared by Meta show the 11B model matching state-of-the-art (SOTA) performance and the 90B model outperforming models such as Claude 3 Haiku and GPT-4o mini. Although smaller than some of the other recent Llama models, such as Llama 3.1 405B, these medium-sized models pack a punch with their strong multimodal performance.

Source: Meta AI Blog
Llama 3.2 Vision shows potential applications across a multitude of use cases such as Visual Question Answering (VQA), Document VQA, image captioning, text retrieval, visual grounding to pinpoint objects and segments, and Optical Character Recognition (OCR).
How to use Llama 3.2 with Hugging Face Transformers on an Ori virtual machine
Prerequisites
Create a GPU virtual machine (VM) on Ori Global Cloud. We chose an NVIDIA L40S with 48 GB of VRAM and 90 GiB of system memory for this demo because of its strong inference performance. We chose Ubuntu 22.04 as our OS, though Debian is also an option. For this demo we’ll be loading the Llama 3.2 11B Vision Instruct model.
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch and TensorFlow, and Jupyter Notebook are preinstalled for you.
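Optionally, before you start, you can confirm that the drivers and GPU are visible from Python. This is a minimal sanity check, assuming PyTorch is already installed (via the init script or Step 2 below):

```python
# Minimal GPU sanity check (assumes PyTorch is installed)
import torch

print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: an NVIDIA L40S on our VM
```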
Step 1: SSH into your VM, install Python and create a virtual environment
```bash
apt install python3.10-venv
python3.10 -m venv llama-env
source llama-env/bin/activate
```
Step 2: Install PyTorch if you didn’t use the corresponding init script
```bash
pip3 install torch torchvision torchaudio
```
Step 3: Install the Hugging Face CLI and log in
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
Step 4: Install the Transformers package
```bash
pip install --upgrade transformers
```
Step 5: Open a Python file with an editor of your choice
```bash
vim llama_demo.py
```
Step 6: Use the following Python code to run Hugging Face Transformers. For our demo, we’ll use the “Vision-Instruct” model as it is fine-tuned for vision-related instructions.
We’ve passed the model the URL of the image so it can read the raw image byte stream. However, you can also use an image from a local directory, as shown in the commented-out lines.
```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and let device_map="auto" place it on the GPU
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch the image from a URL; uncomment the local_path lines to use a local file instead
url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"
# local_path = "/root/bird.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open(local_path)

# The chat template inserts the image placeholder token for us
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=28000)
print(processor.decode(output[0]))
```
This code snippet is based on the example provided by Meta here. However, we’ve made a couple of minor edits: we raised max_new_tokens to 28,000 and added support for local images.
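The 11B model in bfloat16 needs roughly 22 GB of weights, which fits comfortably on the L40S. If you’re on a smaller GPU, one option is 4-bit quantization via bitsandbytes. The sketch below shows only the changed loading step; it assumes `pip install bitsandbytes`, and we haven’t benchmarked its effect on output quality:

```python
# Hypothetical 4-bit loading variant for GPUs with less VRAM
# (assumes `pip install bitsandbytes`; quantization can affect output quality)
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
# ...the rest of the demo script stays the same
```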
Step 7: Run the Python file and the model output will be shown in the terminal
```bash
python3.10 llama_demo.py
```
Alternative ways to run Llama 3.2 on the cloud
vLLM
Use vLLM to run Llama 3.2 on a cloud GPU or on a local machine. Meta has shared a code example here.
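For reference, here’s a minimal offline-inference sketch, assuming a recent vLLM release with Llama 3.2 Vision (mllama) support. The argument values and prompt format follow vLLM’s multimodal examples and may change between versions:

```python
# Sketch: offline inference with vLLM (API details may vary by vLLM version)
import requests
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,   # keep the KV cache small enough for a single GPU
    max_num_seqs=16,
    enforce_eager=True,
)

url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Llama 3.2 Vision prompt format: the image placeholder precedes the text
prompt = "<|image|><|begin_of_text|>Describe this image in detail."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=512),
)
print(outputs[0].outputs[0].text)
```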
Jupyter notebook
Install Jupyter Notebook and spin up a notebook on a port of your choice; we specified 8889 here:
```bash
pip3 install notebook
jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0
```
The command will return a URL pointing at the local machine. You can also replace the localhost string with your VM’s IP to access Jupyter via your browser. Create a new notebook and run the Python code in it.
How good is Llama 3.2 11B Vision Instruct?
We ran a series of visual language prompts to put Llama 3.2 through its paces. Here are some results of queries about open-source images.

Image Source: Pixabay
Prompt: What is the price of the service?
Response: The price of the service is 14.00 euros.
Prompt: What is the product or service sold?
Response: The product or service sold is a ticket.
Prompt: Who is selling the product or service?
Response: The product or service is being sold by the personenschiffahrt, as indicated by the text on the ticket.
Prompt: When was this ticket sold?
Response: The ticket was sold on September 13, 2013.
Prompt: Based on the information in this image, how much do 10 tickets cost?
Response: To find the cost of 10 tickets, we need to multiply the cost of one ticket by 10. **Cost of one ticket:** 14.00 euros **Cost of 10 tickets:** 14.00 euros x 10 = 140 euros. Therefore, 10 tickets cost 140 euros.
Llama 3.2 did a stellar job of recognizing the text fields in the image and connecting the dots; it also showed us the stepwise calculation for the price of 10 tickets.

Image Source: Pixabay
Prompt: What is the maximum speed at this intersection?
Response: The maximum speed at this intersection is 20 miles per hour.
Although no units are specified in the image, this is the correct answer: the sign is in Alaska, USA, where mph is used.

Image Source: Pixabay
Prompt: What is the maximum speed on this road?
Response: The maximum speed on this road is 100 kilometers per hour.
Although no units are specified in the image, this is the correct answer: the road is in Chile, which uses km/h.

Image Source: Wikimedia
Prompt: How much will this project cost in total?
Response: **Calculating the Total Cost of the Truss Bridge Project** To determine the total cost of the Truss Bridge Project, we need to sum up the costs of all the materials listed in the table.
**Cost Breakdown:**
* **3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack)**: $23.76
* **3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses)**: $5.52
* **3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses)**: $5.88
* **Syringe 100ml and 120cm handy plastic tubing (hydraulic system)**: $17.70
* **Syringe 100CC/ML (hydraulic system)**: $15.98
* **8 oz. Carpenter's Wood Glue**: $3.47
**Total Cost Calculation:** $23.76 + $5.52 + $5.88 + $17.70 + $15.98 + $3.47 = **$72.31**
The total cost of the Truss Bridge Project is **$72.31**.
Prompt: How much will this project cost if the price of glue increased by 50%
Response: To calculate the new cost of the project with a 50% increase in glue price, we first need to find the original cost of the glue. The original cost of the glue is $3.47. To find the new cost, we multiply the original cost by 1.5 (1 + 0.5 = 1.5).
**New Cost of Glue = Original Cost x 1.5**
**New Cost of Glue = $3.47 x 1.5**
**New Cost of Glue = $5.21**
Now that we have the new cost of the glue, we can calculate the total cost of the project. The total cost of the project is the sum of the costs of all the materials and shipping. The total cost of the project is $72.31. To find the new total cost, we subtract the original cost of the glue from the total cost and then add the new cost of the glue.
**New Total Cost = Total Cost - Original Cost of Glue + New Cost of Glue**
**New Total Cost = $72.31 - $3.47 + $5.21**
**New Total Cost = $73.05**
The new total cost of the project with a 50% increase in glue price is $73.05.
Prompt: Suggest a way to reduce the total cost of the project by 20%
Response: To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.
**Step 1: Calculate the 20% reduction amount**
$72.31 * 0.20 = $14.46
**Step 2: Find ways to reduce the cost**
The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.
**Step 3: Calculate the new total cost**
$72.31 - $0.69 = $71.62
By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.
When we asked Llama 3.2 to find ways to reduce the project cost by 20%, we observed a hallucination. Instead of reducing the cost of the whole project by 20%, it cut the price of a single component by 20%, a saving of $0.69 that falls far short of what is needed. Although Llama 3.2 excelled at OCR and analyzing the images, we felt its processing of the extracted information could be improved.
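For comparison, here is the arithmetic the model was actually asked for, as a quick Python check using the totals from the response above:

```python
# Quick check of the 20% reduction Llama 3.2 was asked for
total = 72.31                       # total project cost from the table
target = round(total * 0.80, 2)     # 57.85: cost after a true 20% cut
savings_needed = round(total - target, 2)
model_savings = 0.69                # what the model's answer actually saved

print(f"Savings needed: ${savings_needed}")  # $14.46
print(f"Model's savings: ${model_savings}")  # $0.69, nowhere near 20%
```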

Image Source: Pixabay
Prompt: what is the city in this image?
Response: The city in this image is London, England. The prominent building in the background is the Shard, a 72-story skyscraper located in Southwark, London.

Image Source: Pixabay
Prompt: what is the mountain in this image?
Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India.

Image Source: Pixabay
Prompt: How many types of berries are there in this image?
Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants.

Image Source: Pixabay
Prompt: Provide a one-line caption for this image
Response: This image depicts a vast expanse of cracked salt-covered terrain, with a mountain range in the background, set against a sky transitioning from blue to orange hues.
Llama 3.2 did well in analyzing all the images above, though it tends to be verbose at times.
Llama 3.2 11B Vision Instruct vs Pixtral 12B
To compare Llama 3.2 with Pixtral, we ran the same prompts that we used for our Pixtral demo blog post and found that Llama 3.2 Vision Instruct was equally good. However, Llama 3.2 showed slightly better prompt adherence when asked to restrict an image description to a single line.
On the other hand, Llama 3.2 often provided more detail than needed, though its habit of laying out the steps behind a response could be helpful in many scenarios. Overall, the two models perform quite similarly; here are a few examples that illustrate some of the differences between them.
See how Llama 3.2 compares with Pixtral

Image Source: Pixabay
Prompt: How many people in this image?
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
Pixtral Response: There are three people in the image.

Image Source: Pixabay
Prompt: Translate to English
Llama 3.2 Response: The image is a digital graphic with a flower and leaves, accompanied by German text. The purpose of the image appears to be a welcome message or greeting.
* A flower:
  + The flower is white with yellow center.
  + It has green leaves.
  + The flower is placed on the right side of the image.
* Leaves:
  + There are three leaves.
  + They are green and have a slight sheen to them.
  + They are placed behind the flower.
* Text:
  + The text is in white cursive font.
  + It reads "Lieber Gast... Herzlich Willkommen!" which translates to "Dear guest... Welcome!"
  + The text is placed on the left side of the image.
Pixtral Response: Sure, here is the translation of the text from German to English: "The text reads: 'Dear Guest... You are welcome!'" So, the whole translation is: "Dear Guest... You are welcome!"

Image Source: NPS
Prompt: Describe this image in one sentence
Llama 3.2 Response: The image shows a mountainous landscape with a river running through it, set against a cloudy blue sky.
Pixtral Response: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
In the example above, Llama 3.2 adhered to the prompt more closely and limited its response to a single sentence, whereas Pixtral didn’t. However, both models are very close in performance and in general excel at multimodal tasks.
Chart your own AI reality with Ori
Ori Global Cloud provides flexible AI infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables AI startups and businesses to:
- Deploy AI-optimized GPU instances for training, fine-tuning, and inference workloads
- Scale effortlessly from fractional GPUs to bespoke private clouds
- Reduce GPU costs by up to 75% compared to hyperscale cloud providers

