import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("PLLaVA")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1786336055425138939) (May 3, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""Parameter-free LLaVA for video captioning works like magic! 🤩 Let's take a look!
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Most of the video captioning models work by downsampling video frames to reduce computational complexity and memory requirements without losing a lot of information in the process.  
PLLaVA on the other hand, uses pooling! 🤩  

How? 🧐  
It takes in frames of video, passed to ViT and then projection layer, and then output goes through average pooling where input shape is (# frames, width, height, text decoder input dim) 👇 
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""Pooling operation surprisingly reduces the loss of spatial and temporal information. See below some examples on how it can capture the details 🤗 
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""According to authors' findings, it performs way better than many of the existing models (including proprietary VLMs) and scales very well (on text decoder). 
""")
st.markdown(""" """)

st.image("pages/PLLaVA/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Model repositories 🤗 [7B](https://t.co/AeSdYsz1U7), [13B](https://t.co/GnI1niTxO7), [34B](https://t.co/HWAM0ZzvDc)  
Spaces 🤗 [7B](https://t.co/Oms2OLkf7O), [13B](https://t.co/C2RNVNA4uR)  
""")
st.markdown(""" """)

st.info("""
Resources:  
[PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994) 
by Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng (2024)  
[GitHub](https://github.com/magic-research/PLLaVA)""", icon="📚")

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("DocOwl 1.5")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("CuMo")