import streamlit as st
from streamlit_extras.switch_page_button import switch_page
translations = {
'en': {'title': 'KOSMOS-2',
'original_tweet':
"""
[Original tweet](https://x.com/mervenoyann/status/1720126908384366649) (November 2, 2023)
""",
'tweet_1':
"""
The new 🤗 Transformers release includes a very powerful Multimodal Large Language Model (MLLM) by @Microsoft called KOSMOS-2! 🤩
The highlight of KOSMOS-2 is grounding: the model is *incredibly* accurate! 🌎
Play with the demo [here](https://huggingface.co/spaces/ydshieh/Kosmos-2) by [@ydshieh](https://x.com/ydshieh).
But how does this model work? Let's take a look! 👀🧶
""",
'tweet_2':
"""
Grounding helps machine learning models relate to real-world examples. Including grounding makes models more accurate and more robust during inference. It also helps reduce so-called "hallucinations" in language models.
""",
'tweet_3':
"""
In KOSMOS-2, the model is grounded to perform the following tasks, and is evaluated on them 👇
- multimodal grounding & phrase grounding, e.g. localizing an object from a natural-language query
- multimodal referring, e.g. describing object characteristics & location
- perception-language tasks
- language understanding and generation
""",
'tweet_4':
"""
The dataset used for grounding, called GRiT, is also available on the [Hugging Face Hub](https://huggingface.co/datasets/zzliang/GRIT).
Thanks to the 🤗 Transformers integration, you can use KOSMOS-2 with a few lines of code 🤩
See below! 👇
""",
'ressources':
"""
Resources:
[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
"""
},
'fr': {
'title': 'KOSMOS-2',
'original_tweet':
"""
[Tweet de base](https://x.com/mervenoyann/status/1720126908384366649) (en anglais) (2 novembre 2023)
""",
'tweet_1':
"""
La nouvelle version de 🤗 Transformers inclut un très puissant <i>Multimodal Large Language Model</i> (MLLM) de @Microsoft appelé KOSMOS-2 ! 🤩
Le point fort de KOSMOS-2 est l'ancrage, le modèle est *incroyablement* précis ! 🌎
Jouez avec la démo [ici](https://huggingface.co/spaces/ydshieh/Kosmos-2) de [@ydshieh](https://x.com/ydshieh).
Mais comment fonctionne-t-il ? Jetons un coup d'œil ! 👀🧶
""",
'tweet_2':
"""
L'ancrage permet aux modèles d'apprentissage automatique d'être liés à des exemples du monde réel. L'inclusion de l'ancrage rend les modèles plus performants en termes de précision et de robustesse lors de l'inférence. Cela permet également de réduire les « hallucinations » dans les modèles de langage.
""",
'tweet_3':
"""
Dans KOSMOS-2, le modèle est ancré pour effectuer les tâches suivantes et est évalué sur 👇
- l'ancrage multimodal et l'ancrage de phrases, par exemple la localisation de l'objet par le biais d'une requête en langage naturel
- la référence multimodale, par exemple la description des caractéristiques et de l'emplacement de l'objet
- les tâches de perception-langage
- la compréhension et la génération du langage
""",
'tweet_4':
"""
Le jeu de données utilisé pour l'ancrage, appelé GRiT, est également disponible sur le [Hub d'Hugging Face](https://huggingface.co/datasets/zzliang/GRIT).
Grâce à l'intégration dans 🤗 Transformers, vous pouvez utiliser KOSMOS-2 avec quelques lignes de code 🤩.
Voir ci-dessous ! 👇
""",
'ressources':
"""
Ressources :
[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
de Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
[Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
"""
}
}
def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    # Streamlit warns on an empty label, so pass one and collapse it instead.
    selected_lang = st.selectbox('Language', options=list(languages.keys()),
                                 format_func=lambda x: languages[x],
                                 key='lang_selector', label_visibility='collapsed')
    return 'en' if selected_lang == 'EN' else 'fr'
left_column, right_column = st.columns([5, 1])
# Add a selector to the right column
with right_column:
    lang = language_selector()
# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])
st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)
st.video("pages/KOSMOS-2/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)
st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)
st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/KOSMOS-2/image_1.jpg", use_container_width=True)
st.markdown(""" """)
with st.expander("Code"):
    if lang == "en":
        st.code("""
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

# user_image_path is a placeholder: set it to the path of your own image
image_input = Image.open(user_image_path)

# optionally prepend one of these preprompts to describe the image
brief_preprompt = "<grounding>An image of"
detailed_preprompt = "<grounding>Describe this image in detail:"
text_input = brief_preprompt  # or detailed_preprompt

inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text, entities = processor.post_process_generation(generated_text)
# check out the Space for inference with bbox drawing
""")
    else:
        st.code("""
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

# user_image_path est un espace réservé : indiquez le chemin de votre propre image
image_input = Image.open(user_image_path)

# ajouter au choix l'un de ces préprompts pour décrire l'image
brief_preprompt = "<grounding>An image of"
detailed_preprompt = "<grounding>Describe this image in detail:"
text_input = brief_preprompt  # ou detailed_preprompt

inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text, entities = processor.post_process_generation(generated_text)
# consultez le Space pour l'inférence avec le tracé des bbox
""")
st.markdown(""" """)
st.info(translations[lang]["ressources"], icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("Home")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("Home")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("MobileSAM")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("MobileSAM")