Made a small write-up and experimental fine-tuning guide for MetaCLIP 2 for image classification on downstream tasks. The blog, titled "Fine-Tuning MetaCLIP 2 for Image Classification on Downstream Tasks," demonstrates step-by-step fine-tuning on CIFAR-10 and is easy to adapt to other datasets. For more details, check out the linked blog below.
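The full walkthrough is in the blog; as a rough sketch of the recipe (not the blog's exact code), the snippet below linear-probes a frozen backbone on CIFAR-10. The checkpoint ID is a placeholder, and it assumes the model loads through transformers' Auto classes and exposes a CLIP-style `get_image_features` and `config.projection_dim`.

```python
# Minimal sketch of the recipe (not the blog's exact code): linear-probe a frozen
# MetaCLIP 2 image encoder on CIFAR-10. The checkpoint ID is a placeholder, and the
# CLIP-style get_image_features / config.projection_dim calls are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from transformers import AutoImageProcessor, AutoModel

CKPT = "<metaclip-2-checkpoint-id>"  # placeholder: substitute the checkpoint used in the blog
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(CKPT)
backbone = AutoModel.from_pretrained(CKPT).to(device).eval()
for p in backbone.parameters():                 # freeze the pretrained encoder
    p.requires_grad_(False)

head = nn.Linear(backbone.config.projection_dim, 10).to(device)  # 10 CIFAR-10 classes

def collate(batch):
    images, labels = zip(*batch)                # CIFAR-10 yields (PIL image, int label) pairs
    pixels = processor(images=list(images), return_tensors="pt")["pixel_values"]
    return pixels, torch.tensor(labels)

loader = DataLoader(CIFAR10("data", train=True, download=True),
                    batch_size=64, shuffle=True, collate_fn=collate)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for pixels, labels in loader:                   # one pass as a smoke test
    pixels, labels = pixels.to(device), labels.to(device)
    with torch.no_grad():
        feats = backbone.get_image_features(pixel_values=pixels)  # pooled image embeddings
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The blog's guide may instead unfreeze and fine-tune the full encoder; the frozen-backbone linear probe above is just the lightest-weight variant of the same setup.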
New Research Alert - ICCV 2025 (Poster)!
Title: Is Less More? Exploring Token Condensation as Training-Free Test-Time Adaptation
Description: Token Condensation as Adaptation (TCA) improves the performance and efficiency of vision-language models in zero-shot inference by introducing domain anchor tokens.
Authors: Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo
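As a rough, generic illustration of the token-condensation idea (this is not TCA's anchor-token selection, which is defined in the paper), one can score patch tokens against anchor embeddings, keep the top matches, and merge the rest:

```python
# Conceptual illustration only, not TCA's actual algorithm: score patch tokens
# against a set of anchor embeddings, keep the best-matching tokens, and merge
# the rest into a single summary token so downstream computation sees fewer tokens.
import torch
import torch.nn.functional as F

def condense_tokens(patch_tokens, anchors, keep_ratio=0.25):
    # patch_tokens: (N, D) visual patch embeddings; anchors: (A, D) anchor embeddings
    sims = F.normalize(patch_tokens, dim=-1) @ F.normalize(anchors, dim=-1).T  # (N, A)
    scores = sims.max(dim=-1).values             # each token's best anchor similarity
    k = max(1, int(keep_ratio * patch_tokens.shape[0]))
    keep_idx = scores.topk(k).indices
    drop_mask = torch.ones(patch_tokens.shape[0], dtype=torch.bool, device=patch_tokens.device)
    drop_mask[keep_idx] = False
    kept = patch_tokens[keep_idx]
    if drop_mask.any():                          # merge dropped tokens into one summary token
        kept = torch.cat([kept, patch_tokens[drop_mask].mean(dim=0, keepdim=True)], dim=0)
    return kept

# Toy usage with random tensors standing in for real ViT features.
tokens = torch.randn(196, 512)                   # 14x14 patch grid, 512-dim embeddings
anchors = torch.randn(4, 512)
print(condense_tokens(tokens, anchors).shape)    # -> torch.Size([50, 512])
```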
New Research Alert - ICCV 2025 (Oral)!
Title: Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Description: The proposed method enhances stereo matching by efficiently fusing unbiased monocular priors from vision foundation models, addressing misalignment and local-optima issues with a binary local ordering map and pixel-wise linear regression.
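As a generic illustration of one named ingredient (pixel-wise linear regression for aligning a monocular prior to the disparity domain; this is not the paper's exact formulation), a scale and shift can be fit by least squares at pixels with valid disparity:

```python
# Generic illustration (not the paper's exact method): align a monocular depth
# prior to sparse stereo disparities by least-squares fitting a scale and shift
# over pixels where a valid disparity is available.
import torch

def align_mono_to_disparity(mono_inv_depth, sparse_disp, valid_mask):
    # mono_inv_depth, sparse_disp: (H, W) tensors; valid_mask: (H, W) boolean
    x = mono_inv_depth[valid_mask]                        # prior values at valid pixels
    y = sparse_disp[valid_mask]                           # observed disparities
    A = torch.stack([x, torch.ones_like(x)], dim=1)       # design matrix [x, 1]
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution  # [scale, shift]
    scale, shift = sol[0, 0], sol[1, 0]
    return scale * mono_inv_depth + shift                 # prior mapped into disparity units

# Toy usage with synthetic data standing in for real inputs.
H, W = 64, 128
mono = torch.rand(H, W)
disp = 30.0 * mono + 2.0 + 0.1 * torch.randn(H, W)
mask = torch.rand(H, W) > 0.8                             # sparse valid disparities
aligned = align_mono_to_disparity(mono, disp, mask)
```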
New Research Alert - ICCV 2025 (Oral)!
Title: Understanding Co-speech Gestures in-the-wild
Description: JEGAL is a tri-modal model that learns from gestures, speech, and text simultaneously, enabling devices to interpret co-speech gestures in the wild.
Authors: @sindhuhegde, K R Prajwal, Taein Kwon, and Andrew Zisserman