Commit 7b9a2e3 · readme
BoltzmannEntropy committed
1 Parent(s): 318be25
README.md CHANGED
@@ -33,8 +33,9 @@ It is available on HF at the follwing HF space: https://huggingface.co/spaces/Bo
 ## Abstract
 
 
-This research introduces a
+This research introduces a technique in the realm of automatic image captioning that leverages the collaborative potential between Vision-Language Models (VLMs) and Large Language Models (LLMs). Our method uses a two-tiered framework: initially, several VLMs provide diverse verbal descriptions of an image based on unique prompts; subsequently, this array of captions is integrated by a central 'judge' LLM to produce a cohesive and comprehensive caption that encapsulates the essence of the image. The objective of this synergistic approach is to improve the precision, richness, and contextual appropriateness of image descriptions by pooling diverse model capabilities. We validate the effectiveness of our dual-model strategy across a spectrum of VLMs, namely Qwen2-VL, Phi-3-vision, and Moondream2, and assess its performance on different datasets. Our empirical results indicate that this ensemble method yields image captions of substantially higher quality and depth than traditional single-model approaches, marking a significant stride in the evolution of computer vision technology.
 
+**Note: The code for the 'judge' LLM has not been fully completed yet.**
 
 ## Introduction
 
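For reference, the added abstract describes a two-tier pipeline: several VLMs each caption the image under a distinct prompt, and a 'judge' LLM fuses the candidates into one caption. Below is a minimal, hypothetical sketch of that flow. The function names, stub captioners, and judge prompt are illustrative assumptions, not the Space's actual code (the commit itself notes the judge is unfinished).

```python
# Hypothetical sketch of the two-tier captioning flow described in the
# abstract; none of these names come from the Space's actual code.
from typing import Callable, Dict, List


def fuse_captions(
    image_path: str,
    captioners: Dict[str, Callable[[str, str], str]],  # model name -> VLM callable
    prompts: Dict[str, str],                            # model name -> unique prompt
    judge: Callable[[str], str],                        # 'judge' LLM callable
) -> str:
    # Tier 1: each VLM describes the image from its own prompt.
    candidates: List[str] = [
        f"- {name}: {vlm(image_path, prompts[name])}"
        for name, vlm in captioners.items()
    ]
    # Tier 2: the judge LLM merges the candidates into one caption.
    judge_prompt = (
        "Combine these candidate captions into a single cohesive, "
        "comprehensive caption of the image:\n" + "\n".join(candidates)
    )
    return judge(judge_prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without downloading any weights.
    stubs: Dict[str, Callable[[str, str], str]] = {
        "Qwen2-VL": lambda img, p: f"a stub caption of {img} for '{p}'",
        "Phi-3-vision": lambda img, p: f"another stub caption of {img}",
        "Moondream2": lambda img, p: f"a third stub caption of {img}",
    }

    def stub_judge(prompt: str) -> str:
        # Trivially echo the last candidate; a real judge would synthesize.
        return prompt.splitlines()[-1].split(": ", 1)[-1]

    uniform_prompts = {name: "Describe the image." for name in stubs}
    print(fuse_captions("cat.jpg", stubs, uniform_prompts, stub_judge))
```

One design note on this sketch: each candidate caption is prefixed with its model's name before reaching the judge, so a fused prompt lets the judge weigh model-specific strengths rather than treating the candidates as anonymous text.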