---
license: apache-2.0
datasets:
- google/docci
- gokaygokay/random_instruct_docci
language:
- en
pipeline_tag: image-text-to-text
---
A fine-tuned version of the [moondream2](https://huggingface.co/vikhyatk/moondream2) model, trained on the [gokaygokay/random_instruct_docci](https://huggingface.co/datasets/gokaygokay/random_instruct_docci) dataset. It produces extremely detailed image captions.
```
pip install transformers timm einops bitsandbytes accelerate flash-attn
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image

DEVICE = "cuda"
DTYPE = (
    torch.float32 if DEVICE == "cpu" else torch.float16
)  # CPU doesn't support float16
revision = "3ec40c7b6b5d87bc0c51edee45e21f5f29b449d8"

tokenizer = AutoTokenizer.from_pretrained(
    "fal-ai/moondream2-docci-instruct",
    trust_remote_code=True,
    revision=revision,
)

moondream = AutoModelForCausalLM.from_pretrained(
    "fal-ai/moondream2-docci-instruct",
    trust_remote_code=True,
    torch_dtype=DTYPE,
    device_map={"": DEVICE},
    attn_implementation="flash_attention_2",
    revision=revision,
)
moondream.eval()

image_path = "<your_image_path>"
image = Image.open(image_path).convert("RGB")

md_answer = moondream.answer_question(
    moondream.encode_image(image),
    "what is this picture about",
    tokenizer=tokenizer,
)
print(md_answer)
```
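
The snippet above hard-codes `DEVICE = "cuda"` and `attn_implementation="flash_attention_2"`, which assumes a GPU with `flash-attn` installed. The following is a minimal sketch of how the device, dtype, and attention settings could be chosen together for machines without a GPU. The helper `load_options` and its parameters are hypothetical names introduced here for illustration; they are not part of this model or of `transformers`, and the sketch assumes that leaving out `attn_implementation` lets `transformers` fall back to its default attention.

```python
def load_options(cuda_available: bool, flash_attn_installed: bool = True) -> dict:
    """Hypothetical helper: pick device, dtype name, and attention settings.

    Mirrors the rule used in the snippet above: float16 on GPU, float32 on CPU.
    FlashAttention 2 is only requested when a GPU and the flash-attn package
    are both available.
    """
    if not cuda_available:
        # CPU path: float32, and no FlashAttention request at all
        return {"device": "cpu", "dtype_name": "float32"}

    options = {"device": "cuda", "dtype_name": "float16"}
    if flash_attn_installed:
        options["attn_implementation"] = "flash_attention_2"
    return options


# Intended use (an assumption, not tested against this checkpoint):
#   opts = load_options(torch.cuda.is_available())
#   DEVICE = opts["device"]
#   DTYPE = getattr(torch, opts["dtype_name"])
#   pass attn_implementation=opts["attn_implementation"] only if the key exists
```

Returning the dtype as a name rather than a `torch` object keeps the helper free of imports, so it can be checked without loading the model.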