I read the Painter paper by BAAI (Beijing Academy of Artificial Intelligence) to convert the weights to transformers, and I absolutely loved their approach, so I wanted to take time to unfold it here!
So essentially, this model takes inspiration from in-context learning in LLMs: you give an example input/output pair, then the actual input you want the model to complete (one-shot learning). They adapted this idea to images, hence the name "Images Speak in Images".
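To make that concrete, here is a minimal sketch of how such a visual prompt can be laid out. The grid layout and the helper name are my assumptions for illustration (equal-sized RGB arrays, example pair on top, query next to a blank region the model is asked to paint):

```python
import numpy as np

def build_prompt_canvas(example_in, example_out, query_in):
    """Assemble a 2x2 in-context prompt canvas (illustrative layout).

    Top row:    example input | example output  (the "demonstration")
    Bottom row: query input   | blank region    (to be painted by the model)

    All images are assumed to be HxWx3 uint8 arrays of equal size.
    """
    h, w, _ = query_in.shape
    blank = np.zeros((h, w, 3), dtype=query_in.dtype)       # region the model predicts
    top = np.concatenate([example_in, example_out], axis=1)  # demonstration pair
    bottom = np.concatenate([query_in, blank], axis=1)       # query + masked output
    return np.concatenate([top, bottom], axis=0)
```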
This model doesn't have any multimodal parts; it just has an image encoder and a lightweight decoder head (a linear layer, a conv layer, and another linear layer), so it's a single modality.
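For intuition, here is a hedged PyTorch sketch of what a head like that could look like; the class name, dimensions, and exact layer shapes are my assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class PixelHead(nn.Module):
    """Sketch of a lightweight decoder head: linear -> conv -> linear,
    mapping encoder features back to per-patch pixel values.
    Sizes are illustrative, not the paper's exact settings."""

    def __init__(self, dim=1024, patch=16):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)                  # first linear layer
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)       # conv over the feature map
        self.proj_out = nn.Linear(dim, patch * patch * 3)   # RGB pixels per patch

    def forward(self, feats):
        # feats: (B, H, W, C) encoder output
        x = self.proj_in(feats)
        x = self.conv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return self.proj_out(x)  # (B, H, W, patch*patch*3)
```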
The magic sauce is the data: they input the task in the form of an image and its associated transformation, plus another image they want the transformation applied to, and take a smooth ℓ1 loss between the predictions and the ground truth. This is like the T5 of image models 😀
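A minimal sketch of that objective, assuming the loss is computed only over the masked (to-be-painted) region; the paper's masking scheme is more involved, so the mask handling here is my simplification:

```python
import torch
import torch.nn.functional as F

def masked_pixel_loss(pred, target, mask):
    """Smooth-l1 regression loss between predicted and ground-truth
    pixels, restricted to the masked region (mask = 1 where the model
    must paint). Shapes are assumed broadcast-compatible."""
    loss = F.smooth_l1_loss(pred, target, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```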
What is so cool about it is that it can actually adapt to out-of-domain tasks: in the paper's task chart, the model was trained only on the tasks above the dashed line, and the authors found it generalized to the tasks below the line. Image tasks generalize surprisingly well 🤯
Resources:
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
by Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang (2022)
GitHub
Original tweet (March 23, 2024)