thwin27/Aria-sequential_mlp-FP8-dynamic

Image-Text-to-Text

text-generation

compressed-tensors

thwin27 committed about 1 month ago

Commit ef43790 (1 parent: 88aa751)

Update README.md

Files changed (1): README.md (+54 -3)

@@ -6,12 +6,63 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
 # Aria-sequential_mlp-FP8-dynamic
-#### Warning: There is no inference code for transformers/vLLM yet!
-FP8-Dynamic quantization from [Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) made with [LLM Compressor](https://github.com/vllm-project/llm-compressor).
-Generated with the following code:
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM
 from llmcompressor.modifiers.quantization import QuantizationModifier

 pipeline_tag: image-text-to-text
 ---
 # Aria-sequential_mlp-FP8-dynamic
+FP8-Dynamic quantization from [Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) made with [llm-compressor](https://github.com/vllm-project/llm-compressor). Requires about xx.x GB of VRAM.
+### Installation
+```
+pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow compressed-tensors
+pip install flash-attn --no-build-isolation
+```
+### Inference
+Run this model with:
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import AutoModelForCausalLM, AutoProcessor
+torch.cuda.set_device(0)
+model_id_or_path = "thwin27/Aria-sequential_mlp-FP8-dynamic"
+model = AutoModelForCausalLM.from_pretrained(model_id_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
+processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
+image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
+image = Image.open(requests.get(image_path, stream=True).raw)
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"text": None, "type": "image"},
+            {"text": "what is the image?", "type": "text"},
+        ],
+    }
+]
+text = processor.apply_chat_template(messages, add_generation_prompt=True)
+inputs = processor(text=text, images=image, return_tensors="pt")
+inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
+    output = model.generate(
+        **inputs,
+        max_new_tokens=500,
+        stop_strings=["<|im_end|>"],
+        tokenizer=processor.tokenizer,
+        do_sample=True,
+        temperature=0.9,
+    )
+    output_ids = output[0][inputs["input_ids"].shape[1]:]
+    result = processor.decode(output_ids, skip_special_tokens=True)
+print(result)
+print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
+```
+### Quantization
 ```python
 from transformers import AutoProcessor, AutoModelForCausalLM
 from llmcompressor.modifiers.quantization import QuantizationModifier