Update README.md
Browse files
README.md
CHANGED
@@ -49,6 +49,39 @@ Only the weights and activations of the linear operators within transformers blo
49 |   This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
50 |
51 |   ```python
52 |   vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
53 |   ```
54 |
49 |   This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
50 |
51 |   ```python
52 | + from vllm import LLM, SamplingParams
53 | + from vllm.assets.image import ImageAsset
54 | +
55 | + # Initialize the LLM
56 | + model_name = "neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic"
57 | + llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True, tensor_parallel_size=4)
58 | +
59 | + # Load the image
60 | + image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
61 | +
62 | + # Create the prompt
63 | + question = "If I had to write a haiku for this one, it would be: "
64 | + prompt = f"<|image|><|begin_of_text|>{question}"
65 | +
66 | + # Set up sampling parameters
67 | + sampling_params = SamplingParams(temperature=0.2, max_tokens=30)
68 | +
69 | + # Generate the response
70 | + inputs = {
71 | +     "prompt": prompt,
72 | +     "multi_modal_data": {
73 | +         "image": image
74 | +     },
75 | + }
76 | + outputs = llm.generate(inputs, sampling_params=sampling_params)
77 | +
78 | + # Print the generated text
79 | + print(outputs[0].outputs[0].text)
80 | + ```
81 | +
82 | + vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
83 | +
84 | + ```
85 |   vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
86 |   ```
87 |