import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("DocOwl 1.5")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""DocOwl 1.5 is a state-of-the-art document understanding model by Alibaba, released under the Apache 2.0 license.
Time to dive in and learn more 🧶
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""This model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself.
The encoder outputs then go through a convolution-based module, after which they are merged with the text and fed to the LLM.
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen.
Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_3.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs.
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""They train the model on various downstream tasks, including:
- document understanding (DUE benchmark and more)
- table parsing (TURL, PubTabNet)
- chart parsing (PlotQA and more)
- image parsing (OCR-CC)
- text localization (DocVQA and more)
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_5.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
They contribute a new model called DocOwl 1.5-Chat by:
1. creating a new document-chat dataset with questions from document VQA datasets
2. feeding them to ChatGPT to get long answers
3. fine-tuning the base model on it (which IMO works very well!)
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_6.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The resulting generalist model and the chat model are pretty much state-of-the-art.
Below you can see how it compares to fine-tuned models.
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_7.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""All the models and the datasets (plus some eval datasets for the tasks above!) are in this [organization](https://t.co/sJdTw1jWTR).
There is also a [Space](https://t.co/57E9DbNZXf) to try it out.
""")
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_8.jpeg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
# Navigation buttons: previous paper, home, next paper.
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Grounding DINO")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("PLLaVA")