vision_papers / pages /14_DocOwl_1.5.py

merve HF staff

Multilingual version (#1)

1d4f55a verified 6 days ago


	import streamlit as st
	from streamlit_extras.switch_page_button import switch_page


	translations = {
	'en': {'title': 'DocOwl 1.5',
	'original_tweet':
	"""
	[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)
	""",
	'tweet_1':
	"""
	DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license 😍📝
	Time to dive in and learn more 🧶
	""",
	'tweet_2':
	"""
	This model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself.
	The encoder outputs then go through a convolution-based module; its outputs are merged with the text and fed to the LLM.
	""",
	'tweet_3':
	"""
	Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen.
	Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
	""",
	'tweet_4':
	"""
	They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs 🤓
	""",
	'tweet_5':
	"""
	They train the model on various downstream tasks, including:
	- document understanding (DUE benchmark and more)
	- table parsing (TURL, PubTabNet)
	- chart parsing (PlotQA and more)
	- image parsing (OCR-CC)
	- text localization (DocVQA and more)
	""",
	'tweet_6':
	"""
	They contribute a new model called DocOwl 1.5-Chat by:
	1. creating a new document-chat dataset with questions from document VQA datasets
	2. feeding them to ChatGPT to get long answers
	3. fine-tuning the base model on it (which IMO works very well!)
	""",
	'tweet_7':
	"""
	The resulting generalist model and the chat model are pretty much state-of-the-art 😍
	Below you can see how it compares to fine-tuned models.
	""",
	'tweet_8':
	"""
	All the models and the datasets (plus some eval datasets for the above tasks!) are in this [organization](https://t.co/sJdTw1jWTR).
	You can also check out the [Space](https://t.co/57E9DbNZXf).
	""",
	'ressources':
	"""
	Resources:
	[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
	by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
	[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
	"""
	},
	'fr': {
	'title': 'DocOwl 1.5',
	'original_tweet':
	"""
	[Tweet de base](https://twitter.com/mervenoyann/status/1782421257591357824) (en anglais) (22 avril 2024)
	""",
	'tweet_1':
	"""
	DocOwl 1.5 est le modèle de compréhension de documents d'Alibaba sous licence Apache 2.0 😍📝
	Il est temps de découvrir ce modèle 🧶
	""",
	'tweet_2':
	"""
	Ce modèle se compose d'un encodeur visuel basé sur un ViT qui prend en compte les crops de l'image et l'image originale elle-même.
	Les sorties de l'encodeur passent ensuite par un modèle convolutif, après quoi les sorties sont fusionnées avec le texte, puis transmises au LLM.
	""",
	'tweet_3':
	"""
	Au départ, les auteurs n'entraînent que la partie basée sur la convolution (appelée H-Reducer) et l'encodeur de vision tout en gardant le LLM gelé.
	Ensuite, pour le finetuning (légendage d'image, VQA, etc.), ils gèlent l'encodeur de vision et entraînent le H-Reducer et le LLM.
	""",
	'tweet_4':
	"""
	Ils utilisent également une simple projection linéaire sur le texte et les documents. Vous pouvez voir ci-dessous comment ils modélisent les prompts et les sorties textuelles 🤓
	""",
	'tweet_5':
	"""
	Ils entraînent le modèle sur diverses tâches en aval, notamment :
	- compréhension de documents (benchmark DUE et autres)
	- analyse de tableaux (TURL, PubTabNet)
	- analyse de graphiques (PlotQA et autres)
	- analyse d'images (OCR-CC)
	- localisation de textes (DocVQA et autres)
	""",
	'tweet_6':
	"""
	Ils proposent un nouveau modèle appelé DocOwl 1.5-Chat en :
	1. créant un nouveau jeu de données document-chat avec des questions provenant de jeux de données VQA sur documents
	2. les envoyant à ChatGPT pour obtenir des réponses longues
	3. finetunant le modèle de base sur ce jeu de données (ce qui fonctionne très bien selon moi)
	""",
	'tweet_7':
	"""
	Le modèle généraliste qui en résulte et le modèle de chat sont pratiquement à l'état de l'art 😍
	Ci-dessous, vous pouvez voir comment ils se comparent aux modèles finetunés.
	""",
	'tweet_8':
	"""
	Tous les modèles et jeux de données (y compris certains jeux de données d'évaluation sur les tâches susmentionnées !) se trouvent dans cette [organisation](https://t.co/sJdTw1jWTR). Le [Space](https://t.co/57E9DbNZXf).
	""",
	'ressources':
	"""
	Ressources :
	[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
	de Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
	[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
	"""
	}
	}


	def language_selector():
	    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
	    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
	    return 'en' if selected_lang == 'EN' else 'fr'

	left_column, right_column = st.columns([5, 1])

	# Add a selector to the right column
	with right_column:
	    lang = language_selector()

	# Add a title to the left column
	with left_column:
	    st.title(translations[lang]["title"])

	st.success(translations[lang]["original_tweet"], icon="ℹ️")
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_1.jpg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_2.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_3.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_4.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_5.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_6"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_6.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_7"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_7.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_8"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/DocOwl_1.5/image_8.jpeg", use_column_width=True)
	st.markdown(""" """)

	st.info(translations[lang]["ressources"], icon="📚")

	st.markdown(""" """)
	st.markdown(""" """)
	st.markdown(""" """)
	col1, col2, col3 = st.columns(3)
	with col1:
	    if lang == "en":
	        if st.button('Previous paper', use_container_width=True):
	            switch_page("Grounding DINO")
	    else:
	        if st.button('Papier précédent', use_container_width=True):
	            switch_page("Grounding DINO")
	with col2:
	    if lang == "en":
	        if st.button("Home", use_container_width=True):
	            switch_page("Home")
	    else:
	        if st.button("Accueil", use_container_width=True):
	            switch_page("Home")
	with col3:
	    if lang == "en":
	        if st.button("Next paper", use_container_width=True):
	            switch_page("MiniGemini")
	    else:
	        if st.button("Papier suivant", use_container_width=True):
	            switch_page("MiniGemini")