import streamlit as st
from streamlit_extras.switch_page_button import switch_page


translations = {
'en': {
    'title': 'CuMo',
    'original_tweet': 
       """
       [Original tweet](https://twitter.com/mervenoyann/status/1790665706205307191) (May 15, 2024)
       """,
    'tweet_1':
        """
        It's raining vision language models ☔️  
        CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part 🤓  
        """,
    'tweet_2':
        """
        The authors first pre-trained the MLP while freezing the image encoder and text decoder, then warmed up the whole network by unfreezing and fine-tuning it, which they state stabilizes visual instruction tuning when the experts are brought in.  
        """,
    'tweet_3':
        """
        The mixture-of-experts MLP blocks above are simply copies of the same MLP block, initialized from the single MLP that was trained during pre-training and fine-tuned during pre-finetuning 👇  
        """,
    'tweet_4':
        """
        It works very well (I also tested it myself) and outperforms the previous SOTA of its size, <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a>! 😍  
        I wonder how it would compare to IDEFICS2-8B. You can try it yourself [here](https://t.co/MLIYKVh5Ee).  
        """,
    'ressources':
        """
        Resources:  
        [CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949) 
        by Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen (2024)  
        [GitHub](https://github.com/SHI-Labs/CuMo)
        """
      },
'fr': {
    'title': 'CuMo',
    'original_tweet': 
       """
       [Tweet de base](https://twitter.com/mervenoyann/status/1790665706205307191) (en anglais) (15 mai 2024)
       """,
    'tweet_1':
        """
        Il pleut des modèles de langage/vision ☔️  
        CuMo est un nouveau modèle de langage/vision qui intègre le MoE à chaque étape du VLM (encodeur d'images, MLP et décodeur de texte) et utilise Mistral-7B pour la partie décodeur 🤓          
        """,
    'tweet_2':
        """
        Les auteurs ont tout d'abord effectué un pré-entraînement du MLP en gelant l'encodeur d'images et le décodeur de texte, puis ils ont réchauffé l'ensemble du réseau en le réglant avec précision, ce qui, selon eux, permet de stabiliser le réglage des instructions visuelles lors de l'intervention des experts.          
        """,
    'tweet_3':
        """
        Le mélange d'experts de blocs MLP ci-dessus est simplement le même bloc MLP initialisé à partir du MLP unique qui a été entraîné pendant le pré-entraînement et finetuné dans le pré-finetuning 👇
        """,
    'tweet_4':
        """     
        Cela fonctionne très bien (je l'ai testé moi-même) et surpasse le précédent SOTA de taille équivalente, <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a> ! 😍  
        Je me demande comment il se compare à IDEFICS2-8B. Vous pouvez l'essayer vous-même [ici](https://t.co/MLIYKVh5Ee).  
        """,
    'ressources':
        """
        Ressources :  
        [CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts](https://arxiv.org/abs/2405.05949) 
        de Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen (2024)  
        [GitHub](https://github.com/SHI-Labs/CuMo)
        """
    }
}    
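

# Illustrative sketch only: not used by this page and not CuMo's actual code.
# tweet_3 above says the MoE MLP blocks start out as copies of the single MLP
# trained in the earlier stages. The helpers below show that "upcycling" idea
# in plain Python, assuming a toy MLP stored as a dict of nested weight lists
# and a hypothetical top-k router. The names `upcycle_mlp` and `route_top_k`,
# the weight layout, and the softmax gating are assumptions for illustration.
import copy
import math


def upcycle_mlp(trained_mlp, num_experts):
    """Return `num_experts` independent deep copies of one trained MLP's weights."""
    return [copy.deepcopy(trained_mlp) for _ in range(num_experts)]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Every expert starts identical to the trained MLP, so the model's outputs are
# unchanged at initialization; the experts only diverge once they are trained.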


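def language_selector():
    """Render the language dropdown and return the selected language code ('en' or 'fr')."""
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    # A non-empty label keeps the widget accessible; "collapsed" hides it visually.
    selected_lang = st.selectbox(
        'Language',
        options=list(languages.keys()),
        format_func=lambda x: languages[x],
        key='lang_selector',
        label_visibility='collapsed',
    )
    return 'en' if selected_lang == 'EN' else 'fr'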

left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])
    
st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)

st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/CuMo/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/CuMo/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/CuMo/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/CuMo/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.info(translations[lang]["ressources"], icon="📚")  

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
# Navigation buttons: previous paper, home, next paper
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("PLLaVA")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("PLLaVA")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("DenseConnector")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("DenseConnector")