thwin27 committed
Commit ef43790
1 Parent(s): 88aa751

Update README.md

Files changed (1): README.md +54 -3
README.md CHANGED
@@ -6,12 +6,63 @@ base_model:
  pipeline_tag: image-text-to-text
  ---
  # Aria-sequential_mlp-FP8-dynamic
- #### Warning: There is no inference code for transformers/vLLM yet!
- FP8-Dynamic quantization from [Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) made with [LLM Compressor](https://github.com/vllm-project/llm-compressor).
- Generated with the following code:
+ FP8-Dynamic quantization of [Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) made with [llm-compressor](https://github.com/vllm-project/llm-compressor); requires about xx.x GB of VRAM.
+ ### Installation
+ ```
+ pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow compressed-tensors
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ### Inference
+ Run this model with:
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
+ torch.cuda.set_device(0)
+
+ model_id_or_path = "thwin27/Aria-sequential_mlp-bnb_FP8-dynamic"
+
+ model = AutoModelForCausalLM.from_pretrained(model_id_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
+
+ image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
+
+ image = Image.open(requests.get(image_path, stream=True).raw)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"text": None, "type": "image"},
+             {"text": "what is the image?", "type": "text"},
+         ],
+     }
+ ]
+
+ text = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=text, images=image, return_tensors="pt")
+ inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
+     output = model.generate(
+         **inputs,
+         max_new_tokens=500,
+         stop_strings=["<|im_end|>"],
+         tokenizer=processor.tokenizer,
+         do_sample=True,
+         temperature=0.9,
+     )
+     output_ids = output[0][inputs["input_ids"].shape[1]:]
+     result = processor.decode(output_ids, skip_special_tokens=True)
+
+ print(result)
+ print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
+ ```
+
+ ### Quantization
  ```python
  from transformers import AutoProcessor, AutoModelForCausalLM
  from llmcompressor.modifiers.quantization import QuantizationModifier
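
The hunk ends at the unchanged context above, so the quantization script is cut off here. For orientation, here is a minimal sketch of how a data-free FP8-dynamic one-shot run with llm-compressor typically continues; the source model id, ignore patterns, and output directory below are assumptions, not necessarily the exact script in this repo:

```python
# Hedged sketch of a typical llm-compressor FP8-dynamic one-shot run.
# Ignore patterns and paths are assumptions, not the exact recipe from this commit.
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "rhymes-ai/Aria-sequential_mlp"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# FP8_DYNAMIC uses static per-channel weight scales and dynamic per-token
# activation scales, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    # Assumed exclusions: keep the LM head and vision-side modules in higher precision.
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

save_dir = "Aria-sequential_mlp-FP8-dynamic"
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```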