DINOv2 is the king of self-supervised learning on images 🦖🦕 But how does it work? I've tried to explain it before, but let's expand on it 🧶
![image_1](image_1.jpg)
DINOv2 is essentially DINO on steroids, so let's talk about DINO first.
🦕 DINO is a pre-training technique for training ViTs with self-supervision, and it uses an unusual form of distillation 🧟‍♂️🧑🏻‍🏫
Distillation is a technique where a large pre-trained model (the teacher) guides a smaller, randomly initialized model (the student).
While training the student, you take both models' outputs, calculate the divergence between them, and update the student to minimize that divergence.
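To make the classic distillation setup concrete, here is a minimal sketch in plain Python. The logits are made-up numbers and the helper names (`softmax`, `kl_divergence`) are hypothetical, not taken from any DINO codebase; a real setup would compute this on model outputs and backpropagate through the student only.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Made-up outputs for one image (assumption, for illustration only).
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [0.5, 0.4, 0.3]

# The quantity the student would be trained to minimize.
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the student's distribution matches the teacher's, which is what pulls the student toward the teacher.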
In this case, we have no labels! And the teacher is not pretrained!!!! 🤯
Well, the outputs here are probability distributions, and the teacher is not trained by gradient descent: its weights are iteratively updated as an exponential moving average (EMA) of the student's weights.
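The EMA update can be sketched in a few lines. This is a toy version on plain lists of floats; the function name `ema_update`, the weights, and the momentum values are assumptions chosen for illustration, and real training would apply the same rule to every parameter tensor of the networks.

```python
def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]

# Toy weights (assumption): the teacher starts at zero, the student sits elsewhere.
teacher = [0.0, 0.0]
student = [1.0, -1.0]

# A lower momentum than the default is used here so the drift is visible quickly.
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

With a momentum close to 1 the teacher changes slowly, so it behaves like a smoothed average of recent student checkpoints.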
![image_2](image_2.jpg)
DINO doesn't use any contrastive loss or clustering, only a cross-entropy loss (again, what a paper), which on its own can lead the model to collapse.
Collapse can be avoided by applying multiple normalizations to the teacher output, but the authors instead center (to squish logits) and sharpen (through a low temperature) the teacher outputs.
Finally, local and global crops are given to the student while only global crops are given to the teacher, which pushes the student to infer the global context from small parts of the image.
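Here is a rough sketch of centering and sharpening in plain Python. The temperatures, the center momentum, and all helper names are assumptions for illustration rather than the authors' code: the teacher logits are shifted by a running center and divided by a low temperature before the softmax, and the student is compared to that target with cross entropy.

```python
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(logits, center, temperature=0.04):
    """Center (subtract a running mean), then sharpen (low temperature)."""
    centered = [x - c for x, c in zip(logits, center)]
    return softmax(centered, temperature)

def student_probs(logits, temperature=0.1):
    return softmax(logits, temperature)

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """H(teacher, student); only the student would receive gradients."""
    return -sum(pt * math.log(ps + eps) for pt, ps in zip(p_teacher, p_student))

def update_center(center, batch_teacher_logits, momentum=0.9):
    """Move the center toward the batch mean of the teacher logits (an EMA)."""
    n = len(batch_teacher_logits)
    batch_mean = [sum(col) / n for col in zip(*batch_teacher_logits)]
    return [momentum * c + (1.0 - momentum) * b for c, b in zip(center, batch_mean)]

# Made-up logits for a batch of two images (assumption).
batch = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]
center = [0.0, 0.0, 0.0]

target = teacher_probs(batch[0], center)
loss = cross_entropy(target, student_probs([0.5, 0.4, 0.3]))
center = update_center(center, batch)
```

Centering keeps any single output dimension from dominating, while sharpening keeps the target from flattening into a uniform distribution; the two effects balance each other.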
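The crop assignment can be sketched as a list of (teacher view, student view) pairs over which the loss is averaged. The crop counts and the function name `crop_pairs` below are assumptions for illustration; the point is only that the teacher side never contains a local crop and that a view is never compared with itself.

```python
def crop_pairs(num_global=2, num_local=6):
    """List (teacher_view, student_view) index pairs used for the loss.

    Views 0..num_global-1 are global crops; the remaining ones are local crops.
    """
    global_ids = list(range(num_global))
    all_ids = list(range(num_global + num_local))
    # The teacher only sees global crops, the student sees every crop,
    # and identical views are skipped.
    return [(t, s) for t in global_ids for s in all_ids if s != t]

pairs = crop_pairs()
print(len(pairs), pairs[:4])
```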
![image_3](image_3.jpg)
How does DINOv2 improve DINO? ⚡️
More efficient thanks to FSDP and Flash Attention 🦖
Has a very efficient data curation pipeline that apparently scales to 100M+ images (shown below) 🧑🏻‍🏫
Distills the smaller models from the large ViT-g instead of training them from scratch
![image_4](image_4.jpg)
The model is so powerful that you can use DINOv2 features with kNN or linear classifiers, without any fine-tuning!
But if you'd like DINOv2 to work even better, [NielsRogge](https://twitter.com/NielsRogge) has built a [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DINOv2/Fine_tune_DINOv2_for_image_classification_%5Bminimal%5D.ipynb) to fine-tune it using `Trainer`.
📖 He also has a [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DINOv2/Train_a_linear_classifier_on_top_of_DINOv2_for_semantic_segmentation.ipynb) if you'd rather train only a linear classifier.
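To make the kNN option mentioned above concrete, here is a toy sketch of a nearest-neighbor classifier over frozen embeddings. The 2-D vectors and labels are made up and stand in for features you would extract once with a frozen DINOv2 backbone; the helper names are hypothetical, and in practice you would likely reach for an existing kNN implementation instead.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, train_embeddings, train_labels, k=3):
    """Label a query embedding by majority vote among its k most similar neighbors."""
    sims = [cosine_similarity(query, e) for e in train_embeddings]
    top_k = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Made-up "frozen features" (assumption): no model weights are touched here.
train_embeddings = [
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
    [0.1, 1.0], [0.0, 0.9], [0.2, 0.8],
]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_predict([0.95, 0.05], train_embeddings, train_labels)
print(prediction)
```

Nothing is trained in this sketch: the quality of the prediction depends entirely on how well the embeddings separate the classes, which is why this kind of evaluation is a common test of self-supervised features.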
📔 All the different DINO/v2 model checkpoints are [here](https://huggingface.co/models?search=dino).
Lastly, special thanks to [ykilcher](https://twitter.com/ykilcher), as I couldn't make sense of certain things in the paper and watched his awesome [tutorial](https://youtube.com/watch?v=h3ij3F) 🤩
> [!TIP]
Resources:
[DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski (2023)
[GitHub](https://github.com/facebookresearch/dinov2)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/dinov2)
> [!NOTE]
[Original tweet](https://twitter.com/mervenoyann/status/1743290724672495827) (January 5, 2024)