---
datasets:
- "google/DOCCI"
language:
- en
library_name: peft
tags:
- florence-2
- lora
- adapter
- image-captioning
- peft
model-index:
- name: Florence-2-DOCCI-FT
results:
- task:
type: image-to-text
name: Image Captioning
dataset:
name: foundation-multimodal-models/DetailCaps-4870
type: other
metrics:
- type: meteor
value: 0.267
- type: bleu
value: 0.185
- type: cider
value: 0.086
- type: capture
value: 0.576
- type: rouge-l
value: 0.287
---
# Florence-2 DOCCI-FT LoRA Adapter
This repository contains a LoRA adapter for the microsoft/Florence-2-base-ft model, trained on the google/DOCCI dataset. It is designed to enhance the model's captioning capabilities, producing more detailed captions.
## Usage
To use this LoRA adapter, you'll need to load it on top of the microsoft/Florence-2-base-ft model using the PEFT library. Here's an example of how to use it:
```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from peft import PeftModel
import requests

def caption(image):
    # Load the base Florence-2 model and its processor
    base_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

    # Apply the LoRA adapter on top of the base model
    adapter_name = "NikshepShetty/Florence-2-DOCCI-FT"
    model = PeftModel.from_pretrained(base_model, adapter_name, trust_remote_code=True)

    # Florence-2 selects its task via a prompt tag
    prompt = "<MORE_DETAILED_CAPTION>"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task="<MORE_DETAILED_CAPTION>", image_size=(image.width, image.height))
    print(parsed_answer)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
caption(image)
```
This code demonstrates how to:
1. Load the base Florence-2 model
2. Load the LoRA adapter
3. Process an image and generate a detailed caption
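If you plan to run many images through the adapted model, you can optionally fold the LoRA weights into the base model so inference carries no adapter overhead. A minimal sketch using PEFT's `merge_and_unload` (the save path below is a hypothetical example):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach the adapter, as in the usage example above
base_model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, "NikshepShetty/Florence-2-DOCCI-FT")

# Fold the LoRA deltas into the base weights; returns a standard model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("florence-2-docci-merged")  # hypothetical path; save for reuse
```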
Note: Make sure you have the required libraries installed: transformers, peft, einops, flash_attn, timm, Pillow, and requests.
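If any of these are missing, an install along these lines should work (flash_attn typically needs a CUDA toolchain; adjust to your environment):

```bash
pip install transformers peft einops flash_attn timm Pillow requests
```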
## Evaluation results
Our LoRA adapter shows improvements over the base Florence-2 model across all metrics when using the `<MORE_DETAILED_CAPTION>` task prompt, evaluated on 1,000 images from the foundation-multimodal-models/DetailCaps-4870 dataset:
| Metric | Base Model | Adapted Model | Improvement |
|---------|------------|---------------|-------------|
| CAPTURE | 0.546 | 0.576 | +5.5% |
| METEOR | 0.213 | 0.267 | +25.4% |
| BLEU | 0.110 | 0.185 | +68.2% |
| CIDEr | 0.031 | 0.086 | +177.4% |
| ROUGE-L | 0.275 | 0.287 | +4.4% |
These results demonstrate that our LoRA adapter enhances the image captioning capabilities of the Florence-2 base model, particularly in generating more detailed and accurate captions.
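For reference, the n-gram metrics above (BLEU, METEOR, CIDEr, ROUGE-L) can be computed with standard COCO-caption tooling; CAPTURE comes from the DetailCaps benchmark and has its own scorer. A rough sketch using pycocoevalcap, where the `preds`/`refs` contents are hypothetical placeholders standing in for generated captions paired with DetailCaps ground truth (note the METEOR scorer additionally requires Java):

```python
# Hedged sketch: assumes `pip install pycocoevalcap`; preds and refs map the
# same image ids to lists of generated and reference captions, respectively.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

preds = {"img1": ["a detailed caption generated by the adapted model"]}  # hypothetical
refs = {"img1": ["a reference caption from DetailCaps-4870"]}            # hypothetical

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("CIDEr", Cider()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(refs, preds)
    # Bleu returns a list of BLEU-1..4 scores; the others return a single float
    print(name, score)
```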