	Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉🧶

	![image_1](image_1.jpg)

	OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first.
	📝 OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training.
	👀 What's cool is that it can take both image and text queries! This works because the image and text features aren't fused together, so a query from either encoder can be compared against the detected objects.
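
To make the "not fused" point concrete, here is a minimal sketch in plain Python. The embeddings are made-up 3-dimensional vectors, and `score_queries` is a hypothetical helper for illustration, not part of any library. It only shows the idea that a query embedding, whether it came from the text encoder or from the image encoder, is scored against per-object embeddings with the same similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def score_queries(object_embeddings, query_embeddings):
    """Return one row of similarities per object, one column per query."""
    return [[cosine(obj, q) for q in query_embeddings] for obj in object_embeddings]

# Made-up embeddings standing in for the detector's per-object outputs.
object_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.95]]

# Queries live in the same space regardless of which encoder produced them.
text_query = [1.0, 0.0, 0.0]   # pretend output of the text encoder
image_query = [0.0, 0.1, 1.0]  # pretend output of the image encoder on an example crop

scores = score_queries(object_embeddings, [text_query, image_query])

# Index of the best-matching query for each object.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)
```

Because the queries are just vectors, adding a new class at inference time only means adding another query; nothing in the detector has to be retrained.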

	![image_2](image_2.jpg)

	Taking a look at the architecture, the authors first do contrastive pre-training of a vision encoder and a text encoder (just like CLIP).
	They then take that model, remove the final pooling layer, attach lightweight classification and box prediction heads, and fine-tune it for detection.
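
Here is a rough sketch of what removing the pooling layer buys you: every output token of the vision encoder is kept, and each one can predict its own box. Everything below is an assumption for illustration: the token features and weights are random, the sizes are tiny, and the box head is a single linear layer followed by a sigmoid, which is a simplification of the real head.

```python
import math
import random

def linear(x, weights, bias):
    """Apply a dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def box_head(token, weights, bias):
    """Map one token embedding to a normalized (cx, cy, w, h) box."""
    return [sigmoid(v) for v in linear(token, weights, bias)]

random.seed(0)
dim, num_tokens = 8, 6  # toy sizes, not the real model's

# Without pooling, the encoder output stays a sequence of token embeddings.
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

# Randomly initialized box head (4 outputs per token).
box_weights = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(4)]
box_bias = [0.0] * 4

# One candidate box per token.
boxes = [box_head(token, box_weights, box_bias) for token in tokens]
print(len(boxes), "candidate boxes")
```

With a pooled output there would be a single vector per image; keeping the tokens is what turns the image encoder into a dense predictor.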

	![image_3](image_3.jpg)

	During fine-tuning for object detection, they calculate the loss over bipartite matches.
	Simply put, the loss is calculated between the predicted objects and the ground-truth objects, and the goal is to find a one-to-one matching between these two sets, where each ground-truth object is matched to exactly one prediction.
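
A toy version of the matching step, in plain Python. This is a sketch under assumptions: the cost is just the L1 distance between boxes (real detectors also include a classification term), and the search is brute force over permutations, which is only workable for a handful of objects. In practice this is solved with the Hungarian algorithm. The boxes and the `best_matching` helper are made up for illustration.

```python
from itertools import permutations

def l1_cost(pred_box, gt_box):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))

def best_matching(preds, gts):
    """Find the one-to-one assignment of predictions to ground truth with the lowest total cost.

    Returns (total_cost, assignment), where assignment[i] is the index of the
    prediction matched to gts[i]. Assumes len(preds) >= len(gts).
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment

# Three predictions, two ground-truth objects (all values made up).
preds = [
    [0.50, 0.50, 0.20, 0.20],
    [0.10, 0.10, 0.10, 0.10],
    [0.80, 0.30, 0.30, 0.40],
]
gts = [
    [0.82, 0.31, 0.28, 0.42],
    [0.48, 0.52, 0.22, 0.18],
]

cost, assignment = best_matching(preds, gts)
print(assignment, round(cost, 2))
```

Predictions left unmatched (index 1 here) are the ones the model should learn to treat as "no object".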

	OWL-ViT is very scalable.
	One can easily scale most language models or vision-language models because they require no supervision, but this isn't the case for object detection: you still need supervision.
	Moreover, only scaling the encoders creates a bottleneck after a while.

	![image_1](image_1.jpg)

	The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT to label a large set of images, "self-trained" a new detector on those labels, and then fine-tuned the model on human-annotated data (see below).

	![image_4](image_4.jpg)
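
The labelling step of that recipe can be sketched as a simple confidence filter. This is an illustration only: the detections are made up, `pseudo_label` is a hypothetical helper, and the score threshold is an assumption for this sketch rather than a recommended value.

```python
def pseudo_label(detections, min_score):
    """Keep only confident detections so they can serve as training labels."""
    return [det for det in detections if det["score"] >= min_score]

# Hypothetical output of an existing detector on one unlabelled image.
detections = [
    {"label": "owl", "box": [0.10, 0.20, 0.40, 0.60], "score": 0.82},
    {"label": "branch", "box": [0.05, 0.55, 0.90, 0.15], "score": 0.35},
    {"label": "moon", "box": [0.70, 0.05, 0.10, 0.10], "score": 0.05},
]

# Step 1: the existing detector's confident outputs become pseudo-labels.
pseudo_labels = pseudo_label(detections, min_score=0.3)

# Step 2 (not shown): train a new detector on the pseudo-labels.
# Step 3 (not shown): fine-tune that detector on human-annotated data.
print([det["label"] for det in pseudo_labels])
```

The threshold trades label quality against quantity: a higher value keeps fewer but cleaner boxes, a lower one keeps more but noisier ones.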

	Thanks to this, OWLv2 scaled very well and tops leaderboards on open-vocabulary object detection 👑

	![image_5](image_5.jpg)

	Want to try OWL models? I've created a [notebook](https://t.co/ick5tA6nyx) for you to see how to use them with 🤗 Transformers.
	If you want to play with it directly, you can use this [Space](https://t.co/oghdLOtoa5).
	All the models and applications of the OWL series are in this [collection](https://huggingface.co/collections/merve/owl-series-65aaac3114e6582c300544df).

	> [!TIP]
	Resources:
	[Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683)
	by Matthias Minderer, Alexey Gritsenko, Neil Houlsby (2023)
	[GitHub](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
	[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/owlv2)


	> [!NOTE]
	[Original tweet](https://twitter.com/mervenoyann/status/1748411972675150040) (January 19, 2024)