import streamlit as st
from streamlit_extras.switch_page_button import switch_page


translations = {
'en': {
    'title': 'DocOwl 1.5',
    'original_tweet': 
       """
       [Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)
       """,
    'tweet_1':
        """
        DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license 😍📝  
        Time to dive in and learn more 🧶 
        """,
    'tweet_2':
        """
        This model consists of a ViT-based visual encoder that takes in crops of the image as well as the original image itself.  
        The encoder outputs then go through a convolution-based module, after which they are merged with the text and fed to the LLM.
        """,
    'tweet_3':
        """
        Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen.  
        Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
        """,
    'tweet_4':
        """
        They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs 🤓
        """,
    'tweet_5':
        """
        They train the model on various downstream tasks, including:  
        - document understanding (DUE benchmark and more)  
        - table parsing (TURL, PubTabNet)  
        - chart parsing (PlotQA and more)  
        - image parsing (OCR-CC)  
        - text localization (DocVQA and more) 
        """,
    'tweet_6':
        """
        They contribute a new model called DocOwl 1.5-Chat by:  
        1. creating a new document-chat dataset with questions from document VQA datasets  
        2. feeding them to ChatGPT to get long answers  
        3. fine-tuning the base model on it (which IMO works very well!)
        """,
    'tweet_7':
        """
        The resulting generalist model and the chat model are pretty much state-of-the-art 😍  
        Below you can see how it compares to fine-tuned models.
        """,
    'tweet_8':
        """
        All the models and datasets (including some eval datasets for the tasks above!) are in this [organization](https://t.co/sJdTw1jWTR).  
        The [Space](https://t.co/57E9DbNZXf). 
        """,
    'ressources':
        """
        Resources:  
        [mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895) 
        by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)  
        [GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
        """
      },
'fr': {
    'title': 'DocOwl 1.5',
    'original_tweet': 
       """
       [Tweet de base](https://twitter.com/mervenoyann/status/1782421257591357824) (en anglais) (22 avril 2024)
       """,
    'tweet_1':
        """
        DocOwl 1.5 est le modèle de compréhension de documents à l'état de l'art d'Alibaba, sous licence Apache 2.0 😍📝  
        Il est temps de découvrir ce modèle 🧶 
        """,
    'tweet_2':
        """
        Ce modèle se compose d'un encodeur visuel basé sur un ViT qui prend en compte les crops de l'image et l'image originale elle-même.  
        Les sorties de l'encodeur passent ensuite par un modèle convolutif, après quoi les sorties sont fusionnées avec le texte, puis transmises au LLM. 
        """,
    'tweet_3':
        """
        Au départ, les auteurs n'entraînent que la partie basée sur la convolution (appelée H-Reducer) et l'encodeur de vision tout en gardant le LLM gelé.  
        Ensuite, pour le finetuning (légendage d'image, VQA, etc.), ils gèlent l'encodeur de vision et entraînent le H-Reducer et le LLM. 
        """,
    'tweet_4':
        """  
        Ils utilisent également une simple projection linéaire sur le texte et les documents. Vous pouvez voir ci-dessous comment ils modélisent les prompts et les sorties textuelles 🤓 
        """,
    'tweet_5':
        """
        Ils entraînent le modèle sur diverses tâches en aval, notamment :  
        - compréhension de documents (DUE benchmark et autres)  
        - analyse de tableaux (TURL, PubTabNet)  
        - analyse de graphiques (PlotQA et autres)  
        - analyse d'images (OCR-CC)  
        - localisation de textes (DocVQA et autres) 
        """,
    'tweet_6':
        """
        Ils proposent un nouveau modèle appelé DocOwl 1.5-Chat en :  
        1. créant un nouveau jeu de données document-chat avec des questions provenant de jeux de données VQA de documents  
        2. les envoyant à ChatGPT pour obtenir des réponses longues  
        3. finetunant le modèle de base à l'aide de ce jeu de données (ce qui fonctionne très bien selon moi)
        """,
    'tweet_7':
        """
        Le modèle généraliste qui en résulte et le modèle de chat sont pratiquement à l'état de l'art 😍  
        Ci-dessous, vous pouvez voir comment ils se comparent aux modèles finetunés.
        """,
    'tweet_8':
        """
        Tous les modèles et jeux de données (y compris certains jeux de données d'évaluation sur les tâches susmentionnées !) se trouvent dans cette [organisation](https://t.co/sJdTw1jWTR).  
        Le [Space](https://t.co/57E9DbNZXf).
        """,
    'ressources':
        """
        Ressources :  
        [mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895) 
        de Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)  
        [GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
        """
    }
}    
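

# Optional sanity check (an illustrative addition, not part of the original page).
# The rendering code below indexes translations[lang][key] directly, so a key that is
# missing from one language would raise a KeyError only once that language is selected.
# `find_missing_keys` is a hypothetical helper name chosen for this sketch; it reports,
# per language, the keys that some other language defines but this one lacks.
def find_missing_keys(all_translations):
    # Collect the union of keys across every language.
    all_keys = set()
    for strings in all_translations.values():
        all_keys.update(strings)
    # Keep only the languages that lack at least one key.
    return {
        lang_code: sorted(all_keys - set(strings))
        for lang_code, strings in all_translations.items()
        if all_keys - set(strings)
    }

# Possible usage: find_missing_keys(translations) returns an empty dict when every
# language defines the same set of keys.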


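def language_selector():
    """Render a flag-only language dropdown and return the selected language code."""
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    # A non-empty label with label_visibility='collapsed' avoids Streamlit's
    # empty-label accessibility warning while keeping the label hidden.
    selected_lang = st.selectbox(
        'Language',
        options=list(languages.keys()),
        format_func=lambda x: languages[x],
        key='lang_selector',
        label_visibility='collapsed',
    )
    return 'en' if selected_lang == 'EN' else 'fr'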

left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])
    
st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)

st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_2.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_5.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_6"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_6.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_7"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_7.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_8"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/DocOwl_1.5/image_8.jpeg", use_column_width=True)
st.markdown(""" """)

st.info(translations[lang]["ressources"], icon="📚")  

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("Grounding DINO")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("Grounding DINO")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("MiniGemini")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("MiniGemini")