Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Paper • 2501.04001 • Published 8 days ago • 40