vision_papers / pages /0_KOSMOS-2.py

merve HF staff

Multilingual version (#1)

1d4f55a verified 6 days ago


	import streamlit as st
	from streamlit_extras.switch_page_button import switch_page


	translations = {
	'en': {'title': 'KOSMOS-2',
	'original_tweet':
	"""
	[Original tweet](https://x.com/mervenoyann/status/1720126908384366649) (November 2, 2023)
	""",
	'tweet_1':
	"""
	New 🤗 Transformers release includes a very powerful Multimodal Large Language Model (MLLM) by @Microsoft called KOSMOS-2! 🤩
	The highlight of KOSMOS-2 is grounding: the model is incredibly accurate! 🌎
	Play with the demo [here](https://huggingface.co/spaces/ydshieh/Kosmos-2) by [@ydshieh](https://x.com/ydshieh).
	But how does this model work? Let's take a look! 👀🧶
	""",
	'tweet_2':
	"""
	Grounding helps machine learning models relate to real-world examples. Including grounding makes models more performant in terms of accuracy and robustness during inference. It also helps reduce the so-called "hallucinations" in language models.
	""",
	'tweet_3':
	"""
	In KOSMOS-2, the model is grounded to perform the following tasks, and is evaluated on them 👇
	- multimodal grounding & phrase grounding, e.g. localizing the object through natural language query
	- multimodal referring, e.g. describing object characteristics & location
	- perception-language tasks
	- language understanding and generation
	""",
	'tweet_4':
	"""
	The dataset used for grounding, called GRiT, is also available on the [Hugging Face Hub](https://huggingface.co/datasets/zzliang/GRIT).
	Thanks to the 🤗 Transformers integration, you can use KOSMOS-2 with a few lines of code 🤩
	See below! 👇
	""",
	'ressources':
	"""
	Resources:
	[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
	by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
	[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
	[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
	"""
	},
	'fr': {
	'title': 'KOSMOS-2',
	'original_tweet':
	"""
	[Tweet de base](https://x.com/mervenoyann/status/1720126908384366649) (en anglais) (2 novembre 2023)
	""",
	'tweet_1':
	"""
	La nouvelle version de 🤗 Transformers inclut un très puissant <i>Multimodal Large Language Model</i> (MLLM) de @Microsoft appelé KOSMOS-2 ! 🤩
	Le point fort de KOSMOS-2 est l'ancrage, le modèle est incroyablement précis ! 🌎
	Jouez avec la démo [ici](https://huggingface.co/spaces/ydshieh/Kosmos-2) de [@ydshieh](https://x.com/ydshieh).
	Mais comment fonctionne-t-il ? Jetons un coup d'œil ! 👀🧶
	""",
	'tweet_2':
	"""
	L'ancrage permet aux modèles d'apprentissage automatique d'être liés à des exemples du monde réel. L'inclusion de l'ancrage rend les modèles plus performants en termes de précision et de robustesse lors de l'inférence. Cela permet également de réduire les « hallucinations » dans les modèles de langage. """,
	'tweet_3':
	"""
	Dans KOSMOS-2, le modèle est ancré pour effectuer les tâches suivantes et est évalué sur 👇
	- l'ancrage multimodal et l'ancrage de phrases, par exemple la localisation de l'objet par le biais d'une requête en langage naturel
	- la référence multimodale, par exemple la description des caractéristiques et de l'emplacement de l'objet
	- tâches de perception-langage
	- compréhension et génération du langage
	""",
	'tweet_4':
	"""
	Le jeu de données utilisé pour l'ancrage, appelé GRiT, est également disponible sur le [Hub d'Hugging Face](https://huggingface.co/datasets/zzliang/GRIT).
	Grâce à l'intégration dans 🤗 Transformers, vous pouvez utiliser KOSMOS-2 avec quelques lignes de code 🤩.
	Voir ci-dessous ! 👇
	""",
	'ressources':
	"""
	Ressources :
	[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
	de Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
	[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
	[Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
	"""
	}
	}


	def language_selector():
	    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
	    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
	    return 'en' if selected_lang == 'EN' else 'fr'

	left_column, right_column = st.columns([5, 1])

	# Add a selector to the right column
	with right_column:
	    lang = language_selector()

	# Add a title to the left column
	with left_column:
	    st.title(translations[lang]["title"])

	st.success(translations[lang]["original_tweet"], icon="ℹ️")
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.video("pages/KOSMOS-2/video_1.mp4", format="video/mp4")
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/KOSMOS-2/image_1.jpg", use_column_width=True)
	st.markdown(""" """)

	with st.expander("Code"):
	    if lang == "en":
	        st.code("""
	from PIL import Image
	from transformers import AutoProcessor, AutoModelForVision2Seq

	model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
	processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

	# user_image_path: path to your own image file
	image_input = Image.open(user_image_path)
	# prepend different preprompts optionally to describe images
	brief_preprompt = "<grounding>An image of"
	detailed_preprompt = "<grounding>Describe this image in detail:"
	# pick one of the preprompts as the text input
	text_input = brief_preprompt

	inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")

	generated_ids = model.generate(
	    pixel_values=inputs["pixel_values"],
	    input_ids=inputs["input_ids"],
	    attention_mask=inputs["attention_mask"],
	    image_embeds=None,
	    image_embeds_position_mask=inputs["image_embeds_position_mask"],
	    use_cache=True,
	    max_new_tokens=128,
	)

	generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

	processed_text, entities = processor.post_process_generation(generated_text)

	# check out the Space for inference with bbox drawing
	""")
	    else:
	        st.code("""
	from PIL import Image
	from transformers import AutoProcessor, AutoModelForVision2Seq

	model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
	processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

	# user_image_path : chemin vers votre propre image
	image_input = Image.open(user_image_path)
	# ajouter différents préprompts facultatifs pour décrire les images
	brief_preprompt = "<grounding>An image of"
	detailed_preprompt = "<grounding>Describe this image in detail:"
	# choisir l'un des préprompts comme texte d'entrée
	text_input = brief_preprompt

	inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")

	generated_ids = model.generate(
	    pixel_values=inputs["pixel_values"],
	    input_ids=inputs["input_ids"],
	    attention_mask=inputs["attention_mask"],
	    image_embeds=None,
	    image_embeds_position_mask=inputs["image_embeds_position_mask"],
	    use_cache=True,
	    max_new_tokens=128,
	)

	generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

	processed_text, entities = processor.post_process_generation(generated_text)

	# consultez le Space pour l'inférence avec le tracé des bbox
	""")
	st.markdown(""" """)

	st.info(translations[lang]["ressources"], icon="📚")

	st.markdown(""" """)
	st.markdown(""" """)
	st.markdown(""" """)
	col1, col2, col3 = st.columns(3)
	with col1:
	    if lang == "en":
	        if st.button('Previous paper', use_container_width=True):
	            switch_page("Home")
	    else:
	        if st.button('Papier précédent', use_container_width=True):
	            switch_page("Home")
	with col2:
	    if lang == "en":
	        if st.button("Home", use_container_width=True):
	            switch_page("Home")
	    else:
	        if st.button("Accueil", use_container_width=True):
	            switch_page("Home")
	with col3:
	    if lang == "en":
	        if st.button("Next paper", use_container_width=True):
	            switch_page("MobileSAM")
	    else:
	        if st.button("Papier suivant", use_container_width=True):
	            switch_page("MobileSAM")
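
The inference snippet shown in the "Code" expander stops at `entities` and points to the Space for bounding-box drawing. As a minimal sketch of the step in between, the helper below converts grounded entities to pixel coordinates. It assumes each entity is a tuple of `(phrase, (start, end), [(x1, y1, x2, y2), ...])` with box coordinates normalized to the range 0 to 1. The function name `entities_to_pixel_boxes` and the sample data are hypothetical and are not part of this page or of Transformers.

```python
from typing import List, Tuple

# Assumed entity layout: (phrase, (char_start, char_end), [normalized boxes]).
NormBox = Tuple[float, float, float, float]
Entity = Tuple[str, Tuple[int, int], List[NormBox]]
PixelBox = Tuple[int, int, int, int]


def entities_to_pixel_boxes(
    entities: List[Entity], image_width: int, image_height: int
) -> List[Tuple[str, PixelBox]]:
    """Scale normalized (x1, y1, x2, y2) boxes to integer pixel coordinates."""
    pixel_boxes = []
    for phrase, _span, norm_boxes in entities:
        # One phrase may be grounded to several boxes; emit one entry per box.
        for x1, y1, x2, y2 in norm_boxes:
            pixel_boxes.append(
                (
                    phrase,
                    (
                        round(x1 * image_width),
                        round(y1 * image_height),
                        round(x2 * image_width),
                        round(y2 * image_height),
                    ),
                )
            )
    return pixel_boxes


# Hypothetical sample data, made up for illustration only.
example_entities = [("a snowman", (12, 21), [(0.25, 0.5, 0.75, 1.0)])]
pixel_boxes = entities_to_pixel_boxes(example_entities, image_width=640, image_height=480)
print(pixel_boxes)
```

The resulting pixel boxes could then be passed to a drawing routine of your choice, for example Pillow's `ImageDraw.rectangle`, using the width and height of the image given to the processor.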