---
license: llama3
language:
- en
pipeline_tag: image-text-to-text
tags:
- text-generation-inference
extra_gated_fields:
  First Name: text
  Last Name: text
  Country: country
  Affiliation: text
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model in accordance with the META LLAMA 3 COMMUNITY LICENSE AGREEMENT: checkbox
---
# Dragonfly Model Card

**Note: Users are permitted to use this model in accordance with the Llama 3 Community License Agreement.**

## Model Details

Dragonfly is a multimodal visual-language model, trained by instruction tuning on Llama 3.

- **Developed by:** [Together AI](https://www.together.ai/)
- **Model type:** An autoregressive visual-language model based on the transformer architecture
- **License:** [Llama 3 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)
- **Finetuned from model:** [Llama 3](https://github.com/meta-llama/llama3)

### Model Sources

- **Repository:** https://github.com/togethercomputer/Dragonfly
- **Blog:** https://www.together.ai/blog/dragonfly-v1
- **Paper:** https://arxiv.org/abs/2406.00977

## Uses

The primary use of Dragonfly is research on large visual-language models. It is primarily intended for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## How to Get Started with the Model
### Installation

Create a conda environment and install the necessary packages:

```bash
conda env create -f environment.yml
conda activate dragonfly_env
```
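Optionally, confirm that the new environment provides a CUDA-enabled PyTorch build; both the FlashAttention install below and the inference example assume a GPU. This is a quick optional check, not a required step:

```python
# Optional sanity check: the inference example below places the model on "cuda:0".
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a GPU machine
```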
Install FlashAttention:

```bash
pip install flash-attn --no-build-isolation
```
Finally, install the Dragonfly package itself in editable mode:

```bash
pip install --upgrade -e .
```
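To verify the editable install, the Dragonfly modules used in the inference example below should now import cleanly; a minimal check:

```python
# Minimal import check: these are the same modules used in the inference example below.
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor

print("Dragonfly package installed")
```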
### Inference

Once the installation is complete, you can run inference with the steps below.

Question: Summarize the visual content of the image.

![Skateboard](skateboard.png)

Load the necessary packages:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
```
Instantiate the tokenizer, processor, and model:

```python
device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-v1")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-v1")
model = model.to(torch.bfloat16)
model = model.to(device)
```
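As a rough guide, bfloat16 weights take about 2 bytes per parameter, so the 8B-parameter language model plus the CLIP vision tower occupy on the order of 16 GB of GPU memory. An optional check after loading:

```python
# Optional: report how much GPU memory the loaded weights occupy on cuda:0.
print(f"Allocated: {torch.cuda.memory_allocated(device) / 1024**3:.1f} GiB")
```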
Now, let's load the image and process it:

```python
image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
# images = [None]  # if you do not want to pass any images

# The prompt follows the Llama 3 chat format: a user turn followed by the assistant header.
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nSummarize the visual content of the image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=2048, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
```
Finally, let's generate a response from the model:

```python
temperature = 0  # 0 selects greedy decoding (do_sample is False below)

with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
```
An example response:

```plaintext
In the heart of a vibrant skatepark, a skateboarder is caught in a moment of pure exhilaration. The skateboarder, dressed in a black t-shirt adorned with a yellow graphic and black pants, is suspended in mid-air, performing an impressive trick on a concrete ramp. The skateboarder's arms are outstretched, adding balance to the daring stunt.

The skatepark itself is a concrete playground, with the skateboarder's ramp being the main focus. In the background, palm trees sway gently, adding a touch of nature to the urban setting. A few spectators can be seen in the distance, their attention riveted on the airborne skateboarder.

The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding.<|eot_id|>
```
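Because `skip_special_tokens=False` is passed above, the decoded text retains Llama 3 control tokens such as `<|eot_id|>`. A minimal post-processing sketch, assuming the chat markers used in the prompt above, keeps only the assistant's reply:

```python
# Minimal cleanup sketch: drop everything up to the assistant header (if present)
# and strip the end-of-turn token so only the reply text remains.
assistant_header = "<|start_header_id|>assistant<|end_header_id|>"

response = generation_text[0]
if assistant_header in response:
    response = response.split(assistant_header)[-1]
response = response.replace("<|eot_id|>", "").strip()
print(response)
```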
## Training Details

See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977).

## Evaluation

See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977).

## Credits

We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:

- [Meta Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): We utilized the Llama 3 model as our foundational language model.
- [CLIP](https://huggingface.co/openai/clip-vit-base-patch32): Our vision backbone is the CLIP model from OpenAI.
- Our codebase is built upon the following two codebases:
  - [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)
  - [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)

## BibTeX

```bibtex
@misc{chen2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model},
      author={Kezhen Chen and Rahul Thapa and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Model Card Authors

Rahul Thapa, Kezhen Chen, Rahul Chalamala

## Model Card Contact

Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)