timeki TheoLvs committed on
Commit
bcc8503
1 Parent(s): 14a5a97

Add content recommendation (#17)


- First commit CQA with Agents (481f3b1453fde4c19018915d101d575b6ea25a3e)
- Connecting to front (088e816846227b694f2d56ca3af739cc010de4bc)
- Update app.py (fd67e156abd0293625d2b73765bda2d3905fa5de)
- agents mode (99e91d83efb40b6cfec5a887f0d464eaffd09431)
- bugfixs (72edd2d9e6ad64e3ecb59505b744cd415b9a6776)
- Update requirements.txt (ae857ef845ac5b3baed5ef7de1e1b8b63874947e)
- Update requirements.txt (25e32e6bdf0ca289bef8617d92ad77d7edeac19f)
- add dora graph recommandation (6b43c8608bbebdffdebdbd315d70c7df60199fab)
- Merge branch 'bugfix/add_dummy_searchs' into feature/graph_recommandation (aa904c191cf4dd783e3ec870883a06746fe52bf8)
- Update .gitignore (8b71e5e7c71f665a8515d8bdbfee913fdaff12f0)
- Update .gitignore (9dd246e7f975322a6be247188bffb7aa0f6d954e)
- add message types (bed4e9bbfb6f7c823789daf54c443f3f27198b45)
- WIP (6a39d69f772f97cef8b0b551a888ace822713753)
- WIP (7da1a3ac2237ded7b9891fdcda32d0674a9b7b4a)
- add steps and fix css for gradio 5 (df5d08d8710beacb04a9c0c281a195c6dd7cc800)
- remove unused code (4ab651938b1c2af7c0d8f155488820e47b42c6c8)
- Update .gitignore (5228f5c0f26f78825d572edbe82200ee3ece6a60)
- fixs (ccd4b9e7b0d2b6d7c0a8b9f2a6513609b4bfe3e1)
- Merge branch 'pr/14' into feature/graph_recommandation (6edd6c2f1c5215b2e3e72b2c36759b934d748606)
- Merge branch 'feature/add_steps_display' into feature/graph_recommandation (89a69e623fa7fec59d135eb57624bcb8d8f67985)
- WIP (49acaf1b850b0914bc1f5d52ceab47d2a22fe944)
- add graphs from ourworldindata (7335378313ce70a4d4ec305a6114e8e6d167af12)
- Merge branch 'main' into feature/graph_recommandation (0c4d82b36b5d6d2f79215460720118c746d88804)
- fix merge (57a1ed70b9a0ad0e48de283e2f044e1e38eac8e1)
- Merge branch 'main' into feature/graph_recommandation (196d79336e51d5deef0215353095954d98d4165b)
- add notification on tabs (484fc0d6f3d80e3fe3afb6ffff20560ef35b6b7c)
- switch owid vectorstore to pinecone (5664fc84569d8e455ce09e2e29c48d7881e126e3)
- remove unused code (fd2ccc64d648490a0ee5acf2159827091d9fc123)
- Move configuration to an other pannel (5c3a3a4b99323e93ebf0c852bc2c2a5401929dae)
- Display images before answering (12c9afe58a2de4b7413bc860f808501c6ee2a6aa)
- Display image in a parallel task (6c5a20c1cab94c503e004acc0991ce4149835a7f)
- Merge branch 'main' into feature/graph_recommandation (b58c53f25d1cc85cfe82bbd452f8d5d11c306da3)
- Add step by step notebook to run steps of langgraphs (a059c938ce111e6673981ad32785d0d4e1c0d177)
- put event handling in separate file (76603dfba448efa3334c2bcb0169f8e1fbd92c60)
- Rerank documents and force summary for policy makers (d562d3805e6230ad8db525229b4bdde42185e721)
- edit prints for logs (9609df9642e477c67d7bf03a83becaef5c3e2b6f)
- Add OpenAlex papers recommandation (c3b815e6e630f551740188fdef719d0df16acd7e)
- Merge branch 'add_openalex_papers' into feature/graph_recommandation (6541df34bd07c0c7e2d1f1c4ffbf03c2778187a3)
- fix merge (d78271b334fb803de4c424420c297cc6983f0c93)
- add time measure (09457a7da47af1e265d9c5c906cecbf2d4586174)
- fix css (22ac4eb7878d7c9e0f3581972f399ad3119dfdde)
- remove unused code (4c4fe76848d84079766d5ec6e94c1e941d5fea01)
- fix answer latency when having multiple sources (40084ba7b8c0424741cfe3d2142eea4d24683c07)
- remove unused code (58bf75084a4d964a486a5a602e81a17e56e1cc82)
- Change number of figures (781788244a88876bdecfc5b3ddbcf65e8bd9ead6)
- front UI change (c9346b33c3251304f5b52f5c837347f10842b87c)
- add owid subtitles (7ec5d9ecfda9f0688da0521126034de80cf9dffc)
- move code from papers in separate file (363fe2eb8548665fb4fe577db1dce7ea682bac8d)
- add search only button (be2863be1c4cfdedca78497d548be321d218c312)
- config in a modal object (7283e6a6e173ab3ced6c21e292e1b6f94e659141)
- few code cleaning (094ee349527297a47eaf9bbb0903651170be47ec)
- update display and fix search only (d396732ee8df5f4aa33c10cca64d6b05d197e4d5)
- Update 20241104 - CQA - StepByStep CQA.ipynb (d7adcaada52ac5feae76017b457623ab308bbfc8)


Co-authored-by: Theo Alves <TheoLvs@users.noreply.huggingface.co>

app.py CHANGED
@@ -1,13 +1,12 @@
1
  from climateqa.engine.embeddings import get_embeddings_function
2
  embeddings_function = get_embeddings_function()
3
 
4
- from climateqa.knowledge.openalex import OpenAlex
5
  from sentence_transformers import CrossEncoder
6
 
7
  # reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")
8
- oa = OpenAlex()
9
 
10
  import gradio as gr
 
11
  import pandas as pd
12
  import numpy as np
13
  import os
@@ -29,7 +28,9 @@ from utils import create_user_id
29
 
30
  from gradio_modal import Modal
31
 
 
32
 
 
33
 
34
  # ClimateQ&A imports
35
  from climateqa.engine.llm import get_llm
@@ -39,13 +40,15 @@ from climateqa.engine.reranker import get_reranker
39
  from climateqa.engine.embeddings import get_embeddings_function
40
  from climateqa.engine.chains.prompts import audience_prompts
41
  from climateqa.sample_questions import QUESTIONS
42
- from climateqa.constants import POSSIBLE_REPORTS
43
  from climateqa.utils import get_image_from_azure_blob_storage
44
- from climateqa.engine.keywords import make_keywords_chain
45
- # from climateqa.engine.chains.answer_rag import make_rag_papers_chain
46
- from climateqa.engine.graph import make_graph_agent,display_graph
 
 
47
 
48
- from front.utils import make_html_source, make_html_figure_sources,parse_output_llm_with_sources,serialize_docs,make_toolbox
49
 
50
  # Load environment variables in local mode
51
  try:
@@ -54,6 +57,8 @@ try:
54
  except Exception as e:
55
  pass
56
 
 
 
57
  # Set up Gradio Theme
58
  theme = gr.themes.Base(
59
  primary_hue="blue",
@@ -104,52 +109,47 @@ CITATION_TEXT = r"""@misc{climateqa,
104
 
105
 
106
  # Create vectorstore and retriever
107
- vectorstore = get_pinecone_vectorstore(embeddings_function)
108
- llm = get_llm(provider="openai",max_tokens = 1024,temperature = 0.0)
109
- reranker = get_reranker("large")
110
- agent = make_graph_agent(llm,vectorstore,reranker)
111
 
 
 
112
 
 
113
 
 
 
 
114
 
115
- async def chat(query,history,audience,sources,reports):
116
  """taking a query and a message history, use a pipeline (reformulation, retriever, answering) to yield a tuple of:
117
  (messages in gradio format, messages in langchain format, source documents)"""
118
 
119
  date_now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
120
  print(f">> NEW QUESTION ({date_now}) : {query}")
121
 
122
- if audience == "Children":
123
- audience_prompt = audience_prompts["children"]
124
- elif audience == "General public":
125
- audience_prompt = audience_prompts["general"]
126
- elif audience == "Experts":
127
- audience_prompt = audience_prompts["experts"]
128
- else:
129
- audience_prompt = audience_prompts["experts"]
130
 
131
  # Prepare default values
132
- if len(sources) == 0:
133
- sources = ["IPCC"]
134
 
135
- # if len(reports) == 0: # TODO
136
- reports = []
137
 
138
- inputs = {"user_input": query,"audience": audience_prompt,"sources_input":sources}
139
  result = agent.astream_events(inputs,version = "v1")
140
-
141
- # path_reformulation = "/logs/reformulation/final_output"
142
- # path_keywords = "/logs/keywords/final_output"
143
- # path_retriever = "/logs/find_documents/final_output"
144
- # path_answer = "/logs/answer/streamed_output_str/-"
145
 
146
  docs = []
 
 
147
  docs_html = ""
148
  output_query = ""
149
  output_language = ""
150
  output_keywords = ""
151
- gallery = []
152
  start_streaming = False
 
153
  figures = '<div class="figures-container"><p></p> </div>'
154
 
155
  steps_display = {
@@ -166,36 +166,29 @@ async def chat(query,history,audience,sources,reports):
166
  node = event["metadata"]["langgraph_node"]
167
 
168
  if event["event"] == "on_chain_end" and event["name"] == "retrieve_documents" :# when documents are retrieved
169
- try:
170
- docs = event["data"]["output"]["documents"]
171
- docs_html = []
172
- textual_docs = [d for d in docs if d.metadata["chunk_type"] == "text"]
173
- for i, d in enumerate(textual_docs, 1):
174
- if d.metadata["chunk_type"] == "text":
175
- docs_html.append(make_html_source(d, i))
176
-
177
- used_documents = used_documents + [f"{d.metadata['short_name']} - {d.metadata['name']}" for d in docs]
178
- history[-1].content = "Adding sources :\n\n - " + "\n - ".join(np.unique(used_documents))
179
-
180
- docs_html = "".join(docs_html)
181
-
182
- except Exception as e:
183
- print(f"Error getting documents: {e}")
184
- print(event)
185
-
186
  elif event["name"] in steps_display.keys() and event["event"] == "on_chain_start": #display steps
187
- event_description,display_output = steps_display[node]
188
  if not hasattr(history[-1], 'metadata') or history[-1].metadata["title"] != event_description: # if a new step begins
189
  history.append(ChatMessage(role="assistant", content = "", metadata={'title' :event_description}))
190
 
191
  elif event["name"] != "transform_query" and event["event"] == "on_chat_model_stream" and node in ["answer_rag", "answer_search","answer_chitchat"]:# if streaming answer
192
- if start_streaming == False:
193
- start_streaming = True
194
- history.append(ChatMessage(role="assistant", content = ""))
195
- answer_message_content += event["data"]["chunk"].content
196
- answer_message_content = parse_output_llm_with_sources(answer_message_content)
197
- history[-1] = ChatMessage(role="assistant", content = answer_message_content)
198
- # history.append(ChatMessage(role="assistant", content = new_message_content))
199
 
200
  if event["name"] == "transform_query" and event["event"] =="on_chain_end":
201
  if hasattr(history[-1],"content"):
@@ -204,7 +197,7 @@ async def chat(query,history,audience,sources,reports):
204
  if event["name"] == "categorize_intent" and event["event"] == "on_chain_start":
205
  print("X")
206
 
207
- yield history,docs_html,output_query,output_language,gallery, figures #,output_query,output_keywords
208
 
209
  except Exception as e:
210
  print(event, "has failed")
@@ -232,68 +225,7 @@ async def chat(query,history,audience,sources,reports):
232
  print(f"Error logging on Azure Blob Storage: {e}")
233
  raise gr.Error(f"ClimateQ&A Error: {str(e)[:100]} - The error has been noted, try another question and if the error remains, you can contact us :)")
234
 
235
-
236
-
237
-
238
- # image_dict = {}
239
- # for i,doc in enumerate(docs):
240
-
241
- # if doc.metadata["chunk_type"] == "image":
242
- # try:
243
- # key = f"Image {i+1}"
244
- # image_path = doc.metadata["image_path"].split("documents/")[1]
245
- # img = get_image_from_azure_blob_storage(image_path)
246
-
247
- # # Convert the image to a byte buffer
248
- # buffered = BytesIO()
249
- # img.save(buffered, format="PNG")
250
- # img_str = base64.b64encode(buffered.getvalue()).decode()
251
-
252
- # # Embedding the base64 string in Markdown
253
- # markdown_image = f"![Alt text](data:image/png;base64,{img_str})"
254
- # image_dict[key] = {"img":img,"md":markdown_image,"short_name": doc.metadata["short_name"],"figure_code":doc.metadata["figure_code"],"caption":doc.page_content,"key":key,"figure_code":doc.metadata["figure_code"], "img_str" : img_str}
255
- # except Exception as e:
256
- # print(f"Skipped adding image {i} because of {e}")
257
-
258
- # if len(image_dict) > 0:
259
-
260
- # gallery = [x["img"] for x in list(image_dict.values())]
261
- # img = list(image_dict.values())[0]
262
- # img_md = img["md"]
263
- # img_caption = img["caption"]
264
- # img_code = img["figure_code"]
265
- # if img_code != "N/A":
266
- # img_name = f"{img['key']} - {img['figure_code']}"
267
- # else:
268
- # img_name = f"{img['key']}"
269
-
270
- # history.append(ChatMessage(role="assistant", content = f"\n\n{img_md}\n<p class='chatbot-caption'><b>{img_name}</b> - {img_caption}</p>"))
271
-
272
- docs_figures = [d for d in docs if d.metadata["chunk_type"] == "image"]
273
- for i, doc in enumerate(docs_figures):
274
- if doc.metadata["chunk_type"] == "image":
275
- try:
276
- key = f"Image {i+1}"
277
-
278
- image_path = doc.metadata["image_path"].split("documents/")[1]
279
- img = get_image_from_azure_blob_storage(image_path)
280
-
281
- # Convert the image to a byte buffer
282
- buffered = BytesIO()
283
- img.save(buffered, format="PNG")
284
- img_str = base64.b64encode(buffered.getvalue()).decode()
285
-
286
- figures = figures + make_html_figure_sources(doc, i, img_str)
287
-
288
- gallery.append(img)
289
-
290
- except Exception as e:
291
- print(f"Skipped adding image {i} because of {e}")
292
-
293
-
294
-
295
-
296
- yield history,docs_html,output_query,output_language,gallery, figures#,output_query,output_keywords
297
 
298
 
299
  def save_feedback(feed: str, user_id):
@@ -317,29 +249,9 @@ def log_on_azure(file, logs, share_client):
317
  file_client.upload_file(logs)
318
 
319
 
320
- def generate_keywords(query):
321
- chain = make_keywords_chain(llm)
322
- keywords = chain.invoke(query)
323
- keywords = " AND ".join(keywords["keywords"])
324
- return keywords
325
 
326
 
327
 
328
- papers_cols_widths = {
329
- "doc":50,
330
- "id":100,
331
- "title":300,
332
- "doi":100,
333
- "publication_year":100,
334
- "abstract":500,
335
- "rerank_score":100,
336
- "is_oa":50,
337
- }
338
-
339
- papers_cols = list(papers_cols_widths.keys())
340
- papers_cols_widths = list(papers_cols_widths.values())
341
-
342
-
343
  # --------------------------------------------------------------------
344
  # Gradio
345
  # --------------------------------------------------------------------
@@ -370,10 +282,23 @@ def vote(data: gr.LikeData):
370
  else:
371
  print(data)
372
 
375
  with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=theme,elem_id = "main-component") as demo:
 
 
 
 
376
 
 
377
  with gr.Tab("ClimateQ&A"):
378
 
379
  with gr.Row(elem_id="chatbot-row"):
@@ -396,12 +321,16 @@ with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=t
396
 
397
  with gr.Row(elem_id = "input-message"):
398
  textbox=gr.Textbox(placeholder="Ask me anything here!",show_label=False,scale=7,lines = 1,interactive = True,elem_id="input-textbox")
399
-
 
 
 
 
400
 
401
- with gr.Column(scale=1, variant="panel",elem_id = "right-panel"):
402
 
403
 
404
- with gr.Tabs() as tabs:
405
  with gr.TabItem("Examples",elem_id = "tab-examples",id = 0):
406
 
407
  examples_hidden = gr.Textbox(visible = False)
@@ -427,91 +356,210 @@ with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=t
427
  )
428
 
429
  samples.append(group_examples)
 
 
 
431
 
432
- with gr.Tab("Sources",elem_id = "tab-citations",id = 1):
433
- sources_textbox = gr.HTML(show_label=False, elem_id="sources-textbox")
434
- docs_textbox = gr.State("")
435
-
436
-
437
-
438
 
439
- # with Modal(visible = False) as config_modal:
440
- with gr.Tab("Configuration",elem_id = "tab-config",id = 2):
441
 
442
- gr.Markdown("Reminder: You can talk in any language, ClimateQ&A is multi-lingual!")
 
443
 
444
 
445
- dropdown_sources = gr.CheckboxGroup(
446
- ["IPCC", "IPBES","IPOS"],
447
- label="Select source",
448
- value=["IPCC"],
449
- interactive=True,
450
- )
 
 
451
 
452
- dropdown_reports = gr.Dropdown(
453
- POSSIBLE_REPORTS,
454
- label="Or select specific reports",
455
- multiselect=True,
456
- value=None,
457
- interactive=True,
458
- )
 
 
459
 
460
- dropdown_audience = gr.Dropdown(
461
- ["Children","General public","Experts"],
462
- label="Select audience",
463
- value="Experts",
464
- interactive=True,
465
- )
466
 
467
- output_query = gr.Textbox(label="Query used for retrieval",show_label = True,elem_id = "reformulated-query",lines = 2,interactive = False)
468
- output_language = gr.Textbox(label="Language",show_label = True,elem_id = "language",lines = 1,interactive = False)
469
 
 
 
 
 
 
 
 
 
 
 
470
 
471
- with gr.Tab("Figures",elem_id = "tab-figures",id = 3):
472
- with Modal(visible=False, elem_id="modal_figure_galery") as modal:
473
- gallery_component = gr.Gallery(object_fit='scale-down',elem_id="gallery-component", height="80vh")
474
-
475
- show_full_size_figures = gr.Button("Show figures in full size",elem_id="show-figures",interactive=True)
476
- show_full_size_figures.click(lambda : Modal(visible=True),None,modal)
477
 
478
- figures_cards = gr.HTML(show_label=False, elem_id="sources-figures")
479
-
480
 
 
 
 
 
 
 
481
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
482
 
 
483
 
484
 
485
  #---------------------------------------------------------------------------------------
486
  # OTHER TABS
487
  #---------------------------------------------------------------------------------------
488
 
 
489
 
490
- # with gr.Tab("Figures",elem_id = "tab-images",elem_classes = "max-height other-tabs"):
491
- # gallery_component = gr.Gallery(object_fit='cover')
492
 
493
- # with gr.Tab("Papers (beta)",elem_id = "tab-papers",elem_classes = "max-height other-tabs"):
494
 
495
- # with gr.Row():
496
- # with gr.Column(scale=1):
497
- # query_papers = gr.Textbox(placeholder="Question",show_label=False,lines = 1,interactive = True,elem_id="query-papers")
498
- # keywords_papers = gr.Textbox(placeholder="Keywords",show_label=False,lines = 1,interactive = True,elem_id="keywords-papers")
499
- # after = gr.Slider(minimum=1950,maximum=2023,step=1,value=1960,label="Publication date",show_label=True,interactive=True,elem_id="date-papers")
500
- # search_papers = gr.Button("Search",elem_id="search-papers",interactive=True)
501
 
502
- # with gr.Column(scale=7):
 
 
 
 
 
 
503
 
504
- # with gr.Tab("Summary",elem_id="papers-summary-tab"):
505
- # papers_summary = gr.Markdown(visible=True,elem_id="papers-summary")
 
 
 
 
506
 
507
- # with gr.Tab("Relevant papers",elem_id="papers-results-tab"):
508
- # papers_dataframe = gr.Dataframe(visible=True,elem_id="papers-table",headers = papers_cols)
509
 
510
- # with gr.Tab("Citations network",elem_id="papers-network-tab"):
511
- # citations_network = gr.HTML(visible=True,elem_id="papers-citations-network")
512
 
513
 
514
-
515
  with gr.Tab("About",elem_classes = "max-height other-tabs"):
516
  with gr.Row():
517
  with gr.Column(scale=1):
@@ -519,13 +567,15 @@ with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=t
519
 
520
 
521
 
522
- gr.Markdown("""
523
- ### More info
524
- - See more info at [https://climateqa.com](https://climateqa.com/docs/intro/)
525
- - Feedbacks on this [form](https://forms.office.com/e/1Yzgxm6jbp)
526
-
527
- ### Citation
528
- """)
 
 
529
  with gr.Accordion(CITATION_LABEL,elem_id="citation", open = False,):
530
  # # Display citation label and text)
531
  gr.Textbox(
@@ -538,25 +588,61 @@ with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=t
538
 
539
 
540
 
541
- def start_chat(query,history):
542
- # history = history + [(query,None)]
543
- # history = [tuple(x) for x in history]
544
  history = history + [ChatMessage(role="user", content=query)]
545
- return (gr.update(interactive = False),gr.update(selected=1),history)
 
 
 
546
 
547
  def finish_chat():
548
- return (gr.update(interactive = True,value = ""))
 
 
 
550
  (textbox
551
- .submit(start_chat, [textbox,chatbot], [textbox,tabs,chatbot],queue = False,api_name = "start_chat_textbox")
552
- .then(chat, [textbox,chatbot,dropdown_audience, dropdown_sources,dropdown_reports], [chatbot,sources_textbox,output_query,output_language,gallery_component,figures_cards],concurrency_limit = 8,api_name = "chat_textbox")
553
  .then(finish_chat, None, [textbox],api_name = "finish_chat_textbox")
 
554
  )
555
 
556
  (examples_hidden
557
- .change(start_chat, [examples_hidden,chatbot], [textbox,tabs,chatbot],queue = False,api_name = "start_chat_examples")
558
- .then(chat, [examples_hidden,chatbot,dropdown_audience, dropdown_sources,dropdown_reports], [chatbot,sources_textbox,output_query,output_language,gallery_component, figures_cards],concurrency_limit = 8,api_name = "chat_examples")
559
  .then(finish_chat, None, [textbox],api_name = "finish_chat_examples")
 
560
  )
561
 
562
 
@@ -567,9 +653,23 @@ with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=t
567
  return [gr.update(visible=visible_bools[i]) for i in range(len(samples))]
568
 
569
 
 
 
 
 
 
 
 
570
 
 
571
  dropdown_samples.change(change_sample_questions,dropdown_samples,samples)
572
 
 
 
 
 
 
 
573
 
574
  demo.queue()
575
 
 
 from climateqa.engine.embeddings import get_embeddings_function
 embeddings_function = get_embeddings_function()

 from sentence_transformers import CrossEncoder

 # reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

 import gradio as gr
+from gradio_modal import Modal
 import pandas as pd
 import numpy as np
 import os

 from gradio_modal import Modal

+from PIL import Image

+from langchain_core.runnables.schema import StreamEvent

 # ClimateQ&A imports
 from climateqa.engine.llm import get_llm

 from climateqa.engine.embeddings import get_embeddings_function
 from climateqa.engine.chains.prompts import audience_prompts
 from climateqa.sample_questions import QUESTIONS
+from climateqa.constants import POSSIBLE_REPORTS, OWID_CATEGORIES
 from climateqa.utils import get_image_from_azure_blob_storage
+from climateqa.engine.graph import make_graph_agent
+from climateqa.engine.embeddings import get_embeddings_function
+from climateqa.engine.chains.retrieve_papers import find_papers
+
+from front.utils import serialize_docs,process_figures

+from climateqa.event_handler import init_audience, handle_retrieved_documents, stream_answer,handle_retrieved_owid_graphs

 # Load environment variables in local mode
 try:

 except Exception as e:
     pass

+import requests
+
 # Set up Gradio Theme
 theme = gr.themes.Base(
     primary_hue="blue",
 
 # Create vectorstore and retriever
+vectorstore = get_pinecone_vectorstore(embeddings_function, index_name = os.getenv("PINECONE_API_INDEX"))
+vectorstore_graphs = get_pinecone_vectorstore(embeddings_function, index_name = os.getenv("PINECONE_API_INDEX_OWID"), text_key="description")

+llm = get_llm(provider="openai",max_tokens = 1024,temperature = 0.0)
+reranker = get_reranker("nano")

+agent = make_graph_agent(llm=llm, vectorstore_ipcc=vectorstore, vectorstore_graphs=vectorstore_graphs, reranker=reranker)

+def update_config_modal_visibility(config_open):
+    new_config_visibility_status = not config_open
+    return gr.update(visible=new_config_visibility_status), new_config_visibility_status

+async def chat(query, history, audience, sources, reports, relevant_content_sources, search_only):
     """taking a query and a message history, use a pipeline (reformulation, retriever, answering) to yield a tuple of:
     (messages in gradio format, messages in langchain format, source documents)"""

     date_now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
     print(f">> NEW QUESTION ({date_now}) : {query}")

+    audience_prompt = init_audience(audience)

     # Prepare default values
+    if sources is None or len(sources) == 0:
+        sources = ["IPCC", "IPBES", "IPOS"]

+    if reports is None or len(reports) == 0:
+        reports = []

+    inputs = {"user_input": query,"audience": audience_prompt,"sources_input":sources, "relevant_content_sources" : relevant_content_sources, "search_only": search_only}
     result = agent.astream_events(inputs,version = "v1")
+

     docs = []
+    used_figures=[]
+    related_contents = []
     docs_html = ""
     output_query = ""
     output_language = ""
     output_keywords = ""
     start_streaming = False
+    graphs_html = ""
     figures = '<div class="figures-container"><p></p> </div>'

     steps_display = {

             node = event["metadata"]["langgraph_node"]

             if event["event"] == "on_chain_end" and event["name"] == "retrieve_documents" :# when documents are retrieved
+                docs, docs_html, history, used_documents, related_contents = handle_retrieved_documents(event, history, used_documents)
+
+            elif event["event"] == "on_chain_end" and node == "categorize_intent" and event["name"] == "_write": # when the query is transformed
+
+                intent = event["data"]["output"]["intent"]
+                if "language" in event["data"]["output"]:
+                    output_language = event["data"]["output"]["language"]
+                else :
+                    output_language = "English"
+                history[-1].content = f"Language identified : {output_language} \n Intent identified : {intent}"
+
+
             elif event["name"] in steps_display.keys() and event["event"] == "on_chain_start": #display steps
+                event_description, display_output = steps_display[node]
                 if not hasattr(history[-1], 'metadata') or history[-1].metadata["title"] != event_description: # if a new step begins
                     history.append(ChatMessage(role="assistant", content = "", metadata={'title' :event_description}))

             elif event["name"] != "transform_query" and event["event"] == "on_chat_model_stream" and node in ["answer_rag", "answer_search","answer_chitchat"]:# if streaming answer
+                history, start_streaming, answer_message_content = stream_answer(history, event, start_streaming, answer_message_content)
+
+            elif event["name"] in ["retrieve_graphs", "retrieve_graphs_ai"] and event["event"] == "on_chain_end":
+                graphs_html = handle_retrieved_owid_graphs(event, graphs_html)
+

             if event["name"] == "transform_query" and event["event"] =="on_chain_end":
                 if hasattr(history[-1],"content"):

             if event["name"] == "categorize_intent" and event["event"] == "on_chain_start":
                 print("X")

+            yield history, docs_html, output_query, output_language, related_contents , graphs_html, #,output_query,output_keywords

     except Exception as e:
         print(event, "has failed")

         print(f"Error logging on Azure Blob Storage: {e}")
         raise gr.Error(f"ClimateQ&A Error: {str(e)[:100]} - The error has been noted, try another question and if the error remains, you can contact us :)")

+    yield history, docs_html, output_query, output_language, related_contents, graphs_html
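The loop above drives the whole UI from LangGraph events. Outside Gradio, the same agent.astream_events stream can be consumed directly; a minimal sketch (run against an agent built as above, with the node names wired up in make_graph_agent, and using the same input keys that chat() builds):

    import asyncio
    from climateqa.engine.chains.prompts import audience_prompts

    async def run_once(agent, question: str):
        # Stream LangGraph events and print answer tokens as they arrive.
        inputs = {"user_input": question, "audience": audience_prompts["experts"],
                  "sources_input": ["IPCC"], "relevant_content_sources": ["IPCC figures"],
                  "search_only": False}
        async for event in agent.astream_events(inputs, version="v1"):
            kind, name = event["event"], event["name"]
            node = event.get("metadata", {}).get("langgraph_node", "")
            if kind == "on_chain_end" and name == "retrieve_documents":
                print(f'{len(event["data"]["output"]["documents"])} documents retrieved')
            elif kind == "on_chat_model_stream" and node in ("answer_rag", "answer_search", "answer_chitchat"):
                print(event["data"]["chunk"].content, end="", flush=True)

    # asyncio.run(run_once(agent, "Is sea level rise accelerating?"))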
229
 
230
 
231
  def save_feedback(feed: str, user_id):
 
249
  file_client.upload_file(logs)
250
 
251
 
255
  # --------------------------------------------------------------------
256
  # Gradio
257
  # --------------------------------------------------------------------
 
282
  else:
283
  print(data)
284
 
285
+ def save_graph(saved_graphs_state, embedding, category):
286
+ print(f"\nCategory:\n{saved_graphs_state}\n")
287
+ if category not in saved_graphs_state:
288
+ saved_graphs_state[category] = []
289
+ if embedding not in saved_graphs_state[category]:
290
+ saved_graphs_state[category].append(embedding)
291
+ return saved_graphs_state, gr.Button("Graph Saved")
292
+
293
 
294
 
295
  with gr.Blocks(title="Climate Q&A", css_paths=os.getcwd()+ "/style.css", theme=theme,elem_id = "main-component") as demo:
296
+ chat_completed_state = gr.State(0)
297
+ current_graphs = gr.State([])
298
+ saved_graphs = gr.State({})
299
+ config_open = gr.State(False)
300
 
301
+
302
  with gr.Tab("ClimateQ&A"):
303
 
304
  with gr.Row(elem_id="chatbot-row"):
 
321
 
322
  with gr.Row(elem_id = "input-message"):
323
  textbox=gr.Textbox(placeholder="Ask me anything here!",show_label=False,scale=7,lines = 1,interactive = True,elem_id="input-textbox")
324
+
325
+ config_button = gr.Button("",elem_id="config-button")
326
+ # config_checkbox_button = gr.Checkbox(label = '⚙️', value="show",visible=True, interactive=True, elem_id="checkbox-config")
327
+
328
+
329
 
330
+ with gr.Column(scale=2, variant="panel",elem_id = "right-panel"):
331
 
332
 
333
+ with gr.Tabs(elem_id = "right_panel_tab") as tabs:
334
  with gr.TabItem("Examples",elem_id = "tab-examples",id = 0):
335
 
336
  examples_hidden = gr.Textbox(visible = False)
 
356
  )
357
 
358
  samples.append(group_examples)
359
+
360
+ # with gr.Tab("Configuration", id = 10, ) as tab_config:
361
+ # # gr.Markdown("Reminders: You can talk in any language, ClimateQ&A is multi-lingual!")
362
 
363
+ # pass
364
+
365
+ # with gr.Row():
366
+
367
+ # dropdown_sources = gr.CheckboxGroup(
368
+ # ["IPCC", "IPBES","IPOS"],
369
+ # label="Select source",
370
+ # value=["IPCC"],
371
+ # interactive=True,
372
+ # )
373
+ # dropdown_external_sources = gr.CheckboxGroup(
374
+ # ["IPCC figures","OpenAlex", "OurWorldInData"],
375
+ # label="Select database to search for relevant content",
376
+ # value=["IPCC figures"],
377
+ # interactive=True,
378
+ # )
379
+
380
+ # dropdown_reports = gr.Dropdown(
381
+ # POSSIBLE_REPORTS,
382
+ # label="Or select specific reports",
383
+ # multiselect=True,
384
+ # value=None,
385
+ # interactive=True,
386
+ # )
387
+
388
+ # search_only = gr.Checkbox(label="Search only without chating", value=False, interactive=True, elem_id="checkbox-chat")
389
+
390
+
391
+ # dropdown_audience = gr.Dropdown(
392
+ # ["Children","General public","Experts"],
393
+ # label="Select audience",
394
+ # value="Experts",
395
+ # interactive=True,
396
+ # )
397
+
398
+
399
+ # after = gr.Slider(minimum=1950,maximum=2023,step=1,value=1960,label="Publication date",show_label=True,interactive=True,elem_id="date-papers", visible=False)
400
+
401
 
402
+ # output_query = gr.Textbox(label="Query used for retrieval",show_label = True,elem_id = "reformulated-query",lines = 2,interactive = False, visible= False)
403
+ # output_language = gr.Textbox(label="Language",show_label = True,elem_id = "language",lines = 1,interactive = False, visible= False)
 
 
 
 
404
 
 
 
405
 
406
+ # dropdown_external_sources.change(lambda x: gr.update(visible = True ) if "OpenAlex" in x else gr.update(visible=False) , inputs=[dropdown_external_sources], outputs=[after])
407
+ # # dropdown_external_sources.change(lambda x: gr.update(visible = True ) if "OpenAlex" in x else gr.update(visible=False) , inputs=[dropdown_external_sources], outputs=[after], visible=True)
408
 
409
 
410
+ with gr.Tab("Sources",elem_id = "tab-sources",id = 1) as tab_sources:
411
+ sources_textbox = gr.HTML(show_label=False, elem_id="sources-textbox")
412
+
413
+
414
+
415
+ with gr.Tab("Recommended content", elem_id="tab-recommended_content",id=2) as tab_recommended_content:
416
+ with gr.Tabs(elem_id = "group-subtabs") as tabs_recommended_content:
417
+
418
+ with gr.Tab("Figures",elem_id = "tab-figures",id = 3) as tab_figures:
419
+ sources_raw = gr.State()
420
+
421
+ with Modal(visible=False, elem_id="modal_figure_galery") as figure_modal:
422
+ gallery_component = gr.Gallery(object_fit='scale-down',elem_id="gallery-component", height="80vh")
423
+
424
+ show_full_size_figures = gr.Button("Show figures in full size",elem_id="show-figures",interactive=True)
425
+ show_full_size_figures.click(lambda : Modal(visible=True),None,figure_modal)
426
+
427
+ figures_cards = gr.HTML(show_label=False, elem_id="sources-figures")
428
+
429
+
430
+
431
+ with gr.Tab("Papers",elem_id = "tab-citations",id = 4) as tab_papers:
432
+ # btn_summary = gr.Button("Summary")
433
+ # Simulated window for the Summary
434
+ with gr.Accordion(visible=True, elem_id="papers-summary-popup", label= "See summary of relevant papers", open= False) as summary_popup:
435
+ papers_summary = gr.Markdown("", visible=True, elem_id="papers-summary")
436
+
437
+ # btn_relevant_papers = gr.Button("Relevant papers")
438
+ # Simulated window for the Relevant Papers
439
+ with gr.Accordion(visible=True, elem_id="papers-relevant-popup",label= "See relevant papers", open= False) as relevant_popup:
440
+ papers_html = gr.HTML(show_label=False, elem_id="papers-textbox")
441
+
442
+ btn_citations_network = gr.Button("Explore papers citations network")
443
+ # Simulated window for the Citations Network
444
+ with Modal(visible=False) as papers_modal:
445
+ citations_network = gr.HTML("<h3>Citations Network Graph</h3>", visible=True, elem_id="papers-citations-network")
446
+ btn_citations_network.click(lambda: Modal(visible=True), None, papers_modal)
447
+
448
+
449
+
450
+ with gr.Tab("Graphs", elem_id="tab-graphs", id=5) as tab_graphs:
451
+
452
+ graphs_container = gr.HTML("<h2>There are no graphs to be displayed at the moment. Try asking another question.</h2>",elem_id="graphs-container")
453
+ current_graphs.change(lambda x : x, inputs=[current_graphs], outputs=[graphs_container])
454
+
455
+ with Modal(visible=False,elem_id="modal-config") as config_modal:
456
+ gr.Markdown("Reminders: You can talk in any language, ClimateQ&A is multi-lingual!")
457
 
458
+
459
+ # with gr.Row():
460
+
461
+ dropdown_sources = gr.CheckboxGroup(
462
+ ["IPCC", "IPBES","IPOS"],
463
+ label="Select source (by default search in all sources)",
464
+ value=["IPCC"],
465
+ interactive=True,
466
+ )
467
+
468
+ dropdown_reports = gr.Dropdown(
469
+ POSSIBLE_REPORTS,
470
+ label="Or select specific reports",
471
+ multiselect=True,
472
+ value=None,
473
+ interactive=True,
474
+ )
475
+
476
+ dropdown_external_sources = gr.CheckboxGroup(
477
+ ["IPCC figures","OpenAlex", "OurWorldInData"],
478
+ label="Select database to search for relevant content",
479
+ value=["IPCC figures"],
480
+ interactive=True,
481
+ )
482
 
483
+ search_only = gr.Checkbox(label="Search only for recommended content without chating", value=False, interactive=True, elem_id="checkbox-chat")
 
 
 
 
 
484
 
 
 
485
 
486
+ dropdown_audience = gr.Dropdown(
487
+ ["Children","General public","Experts"],
488
+ label="Select audience",
489
+ value="Experts",
490
+ interactive=True,
491
+ )
492
+
493
+
494
+ after = gr.Slider(minimum=1950,maximum=2023,step=1,value=1960,label="Publication date",show_label=True,interactive=True,elem_id="date-papers", visible=False)
495
+
496
 
497
+ output_query = gr.Textbox(label="Query used for retrieval",show_label = True,elem_id = "reformulated-query",lines = 2,interactive = False, visible= False)
498
+ output_language = gr.Textbox(label="Language",show_label = True,elem_id = "language",lines = 1,interactive = False, visible= False)
 
 
 
 
499
 
 
 
500
 
501
+ dropdown_external_sources.change(lambda x: gr.update(visible = True ) if "OpenAlex" in x else gr.update(visible=False) , inputs=[dropdown_external_sources], outputs=[after])
502
+
503
+ close_config_modal = gr.Button("Validate and Close",elem_id="close-config-modal")
504
+ close_config_modal.click(fn=update_config_modal_visibility, inputs=[config_open], outputs=[config_modal, config_open])
505
+ # dropdown_external_sources.change(lambda x: gr.update(visible = True ) if "OpenAlex" in x else gr.update(visible=False) , inputs=[dropdown_external_sources], outputs=[after], visible=True)
506
+
507
 
508
+
509
+ config_button.click(fn=update_config_modal_visibility, inputs=[config_open], outputs=[config_modal, config_open])
510
+
511
+ # with gr.Tab("OECD",elem_id = "tab-oecd",id = 6):
512
+ # oecd_indicator = "RIVER_FLOOD_RP100_POP_SH"
513
+ # oecd_topic = "climate"
514
+ # oecd_latitude = "46.8332"
515
+ # oecd_longitude = "5.3725"
516
+ # oecd_zoom = "5.6442"
517
+ # # Create the HTML content with the iframe
518
+ # iframe_html = f"""
519
+ # <iframe src="https://localdataportal.oecd.org/maps.html?indicator={oecd_indicator}&topic={oecd_topic}&latitude={oecd_latitude}&longitude={oecd_longitude}&zoom={oecd_zoom}"
520
+ # width="100%" height="600" frameborder="0" style="border:0;" allowfullscreen></iframe>
521
+ # """
522
+ # oecd_textbox = gr.HTML(iframe_html, show_label=False, elem_id="oecd-textbox")
523
 
524
+
525
 
526
 
527
  #---------------------------------------------------------------------------------------
528
  # OTHER TABS
529
  #---------------------------------------------------------------------------------------
530
 
531
+ # with gr.Tab("Settings",elem_id = "tab-config",id = 2):
532
 
533
+ # gr.Markdown("Reminder: You can talk in any language, ClimateQ&A is multi-lingual!")
 
534
 
 
535
 
536
+ # dropdown_sources = gr.CheckboxGroup(
537
+ # ["IPCC", "IPBES","IPOS", "OpenAlex"],
538
+ # label="Select source",
539
+ # value=["IPCC"],
540
+ # interactive=True,
541
+ # )
542
 
543
+ # dropdown_reports = gr.Dropdown(
544
+ # POSSIBLE_REPORTS,
545
+ # label="Or select specific reports",
546
+ # multiselect=True,
547
+ # value=None,
548
+ # interactive=True,
549
+ # )
550
 
551
+ # dropdown_audience = gr.Dropdown(
552
+ # ["Children","General public","Experts"],
553
+ # label="Select audience",
554
+ # value="Experts",
555
+ # interactive=True,
556
+ # )
557
 
 
 
558
 
559
+ # output_query = gr.Textbox(label="Query used for retrieval",show_label = True,elem_id = "reformulated-query",lines = 2,interactive = False)
560
+ # output_language = gr.Textbox(label="Language",show_label = True,elem_id = "language",lines = 1,interactive = False)
561
 
562
 
 
563
  with gr.Tab("About",elem_classes = "max-height other-tabs"):
564
  with gr.Row():
565
  with gr.Column(scale=1):
 
567
 
568
 
569
 
570
+ gr.Markdown(
571
+ """
572
+ ### More info
573
+ - See more info at [https://climateqa.com](https://climateqa.com/docs/intro/)
574
+ - Feedbacks on this [form](https://forms.office.com/e/1Yzgxm6jbp)
575
+
576
+ ### Citation
577
+ """
578
+ )
579
  with gr.Accordion(CITATION_LABEL,elem_id="citation", open = False,):
580
  # # Display citation label and text)
581
  gr.Textbox(
 
588
 
589
 
590
 
+def start_chat(query,history,search_only):
     history = history + [ChatMessage(role="user", content=query)]
+    if search_only:
+        return (gr.update(interactive = False),gr.update(selected=1),history)
+    else:
+        return (gr.update(interactive = False),gr.update(selected=2),history)

 def finish_chat():
+    return gr.update(interactive = True,value = "")
+
+# Initialize visibility states
+summary_visible = False
+relevant_visible = False
+
+# Functions to toggle visibility
+def toggle_summary_visibility():
+    global summary_visible
+    summary_visible = not summary_visible
+    return gr.update(visible=summary_visible)
+
+def toggle_relevant_visibility():
+    global relevant_visible
+    relevant_visible = not relevant_visible
+    return gr.update(visible=relevant_visible)
+

+def change_completion_status(current_state):
+    current_state = 1 - current_state
+    return current_state
+
+def update_sources_number_display(sources_textbox, figures_cards, current_graphs, papers_html):
+    sources_number = sources_textbox.count("<h2>")
+    figures_number = figures_cards.count("<h2>")
+    graphs_number = current_graphs.count("<iframe")
+    papers_number = papers_html.count("<h2>")
+    sources_notif_label = f"Sources ({sources_number})"
+    figures_notif_label = f"Figures ({figures_number})"
+    graphs_notif_label = f"Graphs ({graphs_number})"
+    papers_notif_label = f"Papers ({papers_number})"
+    recommended_content_notif_label = f"Recommended content ({figures_number + graphs_number + papers_number})"
+
+    return gr.update(label = recommended_content_notif_label), gr.update(label = sources_notif_label), gr.update(label = figures_notif_label), gr.update(label = graphs_notif_label), gr.update(label = papers_notif_label)
+
 (textbox
+    .submit(start_chat, [textbox,chatbot, search_only], [textbox,tabs,chatbot],queue = False,api_name = "start_chat_textbox")
+    .then(chat, [textbox,chatbot,dropdown_audience, dropdown_sources,dropdown_reports, dropdown_external_sources, search_only] ,[chatbot,sources_textbox,output_query,output_language, sources_raw, current_graphs],concurrency_limit = 8,api_name = "chat_textbox")
     .then(finish_chat, None, [textbox],api_name = "finish_chat_textbox")
+    # .then(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_sources, tab_figures, tab_graphs, tab_papers] )
 )

 (examples_hidden
+    .change(start_chat, [examples_hidden,chatbot, search_only], [textbox,tabs,chatbot],queue = False,api_name = "start_chat_examples")
+    .then(chat, [examples_hidden,chatbot,dropdown_audience, dropdown_sources,dropdown_reports, dropdown_external_sources, search_only] ,[chatbot,sources_textbox,output_query,output_language, sources_raw, current_graphs],concurrency_limit = 8,api_name = "chat_textbox")
     .then(finish_chat, None, [textbox],api_name = "finish_chat_examples")
+    # .then(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_sources, tab_figures, tab_graphs, tab_papers] )
 )

     return [gr.update(visible=visible_bools[i]) for i in range(len(samples))]

+sources_raw.change(process_figures, inputs=[sources_raw], outputs=[figures_cards, gallery_component])
+
+# update sources numbers
+sources_textbox.change(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_recommended_content, tab_sources, tab_figures, tab_graphs, tab_papers])
+figures_cards.change(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_recommended_content, tab_sources, tab_figures, tab_graphs, tab_papers])
+current_graphs.change(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_recommended_content, tab_sources, tab_figures, tab_graphs, tab_papers])
+papers_html.change(update_sources_number_display, [sources_textbox, figures_cards, current_graphs,papers_html],[tab_recommended_content, tab_sources, tab_figures, tab_graphs, tab_papers])

+# other questions examples
 dropdown_samples.change(change_sample_questions,dropdown_samples,samples)

+# search for papers
+textbox.submit(find_papers,[textbox,after, dropdown_external_sources], [papers_html,citations_network,papers_summary])
+examples_hidden.change(find_papers,[examples_hidden,after,dropdown_external_sources], [papers_html,citations_network,papers_summary])
+
+# btn_summary.click(toggle_summary_visibility, outputs=summary_popup)
+# btn_relevant_papers.click(toggle_relevant_visibility, outputs=relevant_popup)

 demo.queue()
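The tab notification labels above are derived by counting HTML fragments in each panel; the counting itself is plain string work and can be checked without Gradio:

    sources_html = "<h2>IPCC AR6 WGI</h2><h2>IPBES Global Assessment</h2>"
    graphs_html = "<iframe></iframe>"

    print(f"Sources ({sources_html.count('<h2>')})")   # Sources (2)
    print(f"Graphs ({graphs_html.count('<iframe')})")  # Graphs (1)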
 
climateqa/constants.py CHANGED
@@ -42,4 +42,25 @@ POSSIBLE_REPORTS = [
     "IPBES IAS A C5",
     "IPBES IAS A C6",
     "IPBES IAS A SPM"
-]
+]
+
+OWID_CATEGORIES = ['Access to Energy', 'Agricultural Production',
+       'Agricultural Regulation & Policy', 'Air Pollution',
+       'Animal Welfare', 'Antibiotics', 'Biodiversity', 'Biofuels',
+       'Biological & Chemical Weapons', 'CO2 & Greenhouse Gas Emissions',
+       'COVID-19', 'Clean Water', 'Clean Water & Sanitation',
+       'Climate Change', 'Crop Yields', 'Diet Compositions',
+       'Electricity', 'Electricity Mix', 'Energy', 'Energy Efficiency',
+       'Energy Prices', 'Environmental Impacts of Food Production',
+       'Environmental Protection & Regulation', 'Famines', 'Farm Size',
+       'Fertilizers', 'Fish & Overfishing', 'Food Supply', 'Food Trade',
+       'Food Waste', 'Food and Agriculture', 'Forests & Deforestation',
+       'Fossil Fuels', 'Future Population Growth',
+       'Hunger & Undernourishment', 'Indoor Air Pollution', 'Land Use',
+       'Land Use & Yields in Agriculture', 'Lead Pollution',
+       'Meat & Dairy Production', 'Metals & Minerals',
+       'Natural Disasters', 'Nuclear Energy', 'Nuclear Weapons',
+       'Oil Spills', 'Outdoor Air Pollution', 'Ozone Layer', 'Pandemics',
+       'Pesticides', 'Plastic Pollution', 'Renewable Energy', 'Soil',
+       'Transport', 'Urbanization', 'Waste Management', 'Water Pollution',
+       'Water Use & Stress', 'Wildfires']
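OWID_CATEGORIES enumerates the Our World in Data topics the graph recommender can draw from; a quick membership helper is enough to validate a topic string against it (a sketch, the helper below is not part of the repository):

    from climateqa.constants import OWID_CATEGORIES

    def is_known_owid_category(category: str) -> bool:
        # Compare case-insensitively since user or metadata casing may differ from the stored list.
        return category.strip().lower() in {c.lower() for c in OWID_CATEGORIES}

    print(is_known_owid_category("Climate Change"))  # True
    print(is_known_owid_category("Astrology"))       # False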
climateqa/engine/chains/answer_chitchat.py CHANGED
@@ -45,8 +45,12 @@ def make_chitchat_node(llm):
     chitchat_chain = make_chitchat_chain(llm)

     async def answer_chitchat(state,config):
+        print("---- Answer chitchat ----")
+
         answer = await chitchat_chain.ainvoke({"question":state["user_input"]},config)
-        return {"answer":answer}
+        state["answer"] = answer
+        return state
+        # return {"answer":answer}

     return answer_chitchat

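The node now returns the whole state object rather than a partial {"answer": ...} dict; with plain dictionaries the two return shapes look like this (illustration only, independent of LangGraph's own state-merging semantics):

    state = {"user_input": "hello", "answer": None}

    def node_partial(s):
        return {"answer": "hi"}          # only the answer key

    def node_full(s):
        s["answer"] = "hi"
        return s                         # the full state, user_input included

    print(node_partial(dict(state)))     # {'answer': 'hi'}
    print(node_full(dict(state)))        # {'user_input': 'hello', 'answer': 'hi'}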
climateqa/engine/chains/answer_rag.py CHANGED
@@ -7,6 +7,9 @@ from langchain_core.prompts.base import format_document

 from climateqa.engine.chains.prompts import answer_prompt_template,answer_prompt_without_docs_template,answer_prompt_images_template
 from climateqa.engine.chains.prompts import papers_prompt_template
+import time
+from ..utils import rename_chain, pass_values
+

 DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

@@ -40,6 +43,7 @@ def make_rag_chain(llm):
     prompt = ChatPromptTemplate.from_template(answer_prompt_template)
     chain = ({
         "context":lambda x : _combine_documents(x["documents"]),
+        "context_length":lambda x : print("CONTEXT LENGTH : " , len(_combine_documents(x["documents"]))),
         "query":itemgetter("query"),
         "language":itemgetter("language"),
         "audience":itemgetter("audience"),
@@ -51,7 +55,6 @@ def make_rag_chain_without_docs(llm):
     chain = prompt | llm | StrOutputParser()
     return chain

-
 def make_rag_node(llm,with_docs = True):

     if with_docs:
@@ -60,7 +63,17 @@ def make_rag_node(llm,with_docs = True):
         rag_chain = make_rag_chain_without_docs(llm)

     async def answer_rag(state,config):
+        print("---- Answer RAG ----")
+        start_time = time.time()
+
         answer = await rag_chain.ainvoke(state,config)
+
+        end_time = time.time()
+        elapsed_time = end_time - start_time
+        print("RAG elapsed time: ", elapsed_time)
+        print("Answer size : ", len(answer))
+        # print(f"\n\nAnswer:\n{answer}")
+
         return {"answer":answer}

     return answer_rag
@@ -68,32 +81,32 @@ def make_rag_node(llm,with_docs = True):



-# def make_rag_papers_chain(llm):
+def make_rag_papers_chain(llm):

-#     prompt = ChatPromptTemplate.from_template(papers_prompt_template)
-#     input_documents = {
-#         "context":lambda x : _combine_documents(x["docs"]),
-#         **pass_values(["question","language"])
-#     }
+    prompt = ChatPromptTemplate.from_template(papers_prompt_template)
+    input_documents = {
+        "context":lambda x : _combine_documents(x["docs"]),
+        **pass_values(["question","language"])
+    }

-#     chain = input_documents | prompt | llm | StrOutputParser()
-#     chain = rename_chain(chain,"answer")
+    chain = input_documents | prompt | llm | StrOutputParser()
+    chain = rename_chain(chain,"answer")

-#     return chain
+    return chain






-# def make_illustration_chain(llm):
+def make_illustration_chain(llm):

-#     prompt_with_images = ChatPromptTemplate.from_template(answer_prompt_images_template)
+    prompt_with_images = ChatPromptTemplate.from_template(answer_prompt_images_template)

-#     input_description_images = {
-#         "images":lambda x : _combine_documents(get_image_docs(x["docs"])),
-#         **pass_values(["question","audience","language","answer"]),
-#     }
+    input_description_images = {
+        "images":lambda x : _combine_documents(get_image_docs(x["docs"])),
+        **pass_values(["question","audience","language","answer"]),
+    }

-#     illustration_chain = input_description_images | prompt_with_images | llm | StrOutputParser()
-#     return illustration_chain
+    illustration_chain = input_description_images | prompt_with_images | llm | StrOutputParser()
+    return illustration_chain
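make_rag_papers_chain, now active again, takes a dict carrying docs, question and language; a sketch of a direct invocation (assuming an OpenAI-backed llm from get_llm and LangChain Document inputs, the paper text being a made-up placeholder):

    from langchain_core.documents import Document
    from climateqa.engine.llm import get_llm
    from climateqa.engine.chains.answer_rag import make_rag_papers_chain

    llm = get_llm(provider="openai", max_tokens=1024, temperature=0.0)
    papers_chain = make_rag_papers_chain(llm)

    papers = [Document(page_content="Observed global mean sea level rise has accelerated since the 1990s.")]
    answer = papers_chain.invoke({
        "docs": papers,                 # combined into the prompt context
        "question": "Is sea level rise accelerating?",
        "language": "English",
    })
    print(answer)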
climateqa/engine/chains/chitchat_categorization.py ADDED
@@ -0,0 +1,43 @@
+
+from langchain_core.pydantic_v1 import BaseModel, Field
+from typing import List
+from typing import Literal
+from langchain.prompts import ChatPromptTemplate
+from langchain_core.utils.function_calling import convert_to_openai_function
+from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
+
+
+class IntentCategorizer(BaseModel):
+    """Analyzing the user message input"""
+
+    environment: bool = Field(
+        description="Return 'True' if the question relates to climate change, the environment, nature, etc. (Example: should I eat fish?). Return 'False' if the question is just chit chat or not related to the environment or climate change.",
+    )
+
+
+def make_chitchat_intent_categorization_chain(llm):
+
+    openai_functions = [convert_to_openai_function(IntentCategorizer)]
+    llm_with_functions = llm.bind(functions = openai_functions,function_call={"name":"IntentCategorizer"})
+
+    prompt = ChatPromptTemplate.from_messages([
+        ("system", "You are a helpful assistant, you will analyze, translate and reformulate the user input message using the function provided"),
+        ("user", "input: {input}")
+    ])
+
+    chain = prompt | llm_with_functions | JsonOutputFunctionsParser()
+    return chain
+
+
+def make_chitchat_intent_categorization_node(llm):
+
+    categorization_chain = make_chitchat_intent_categorization_chain(llm)
+
+    def categorize_message(state):
+        output = categorization_chain.invoke({"input": state["user_input"]})
+        print(f"\n\nChit chat output intent categorization: {output}\n")
+        state["search_graphs_chitchat"] = output["environment"]
+        print(f"\n\nChit chat output intent categorization: {state}\n")
+        return state
+
+    return categorize_message
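A sketch of exercising this node on its own (assuming an OpenAI function-calling capable llm obtained from get_llm):

    from climateqa.engine.llm import get_llm
    from climateqa.engine.chains.chitchat_categorization import make_chitchat_intent_categorization_node

    llm = get_llm(provider="openai", max_tokens=256, temperature=0.0)
    categorize = make_chitchat_intent_categorization_node(llm)

    state = {"user_input": "Should I eat less meat to reduce my footprint?"}
    state = categorize(state)
    # "search_graphs_chitchat" tells the graph whether OWID charts are worth retrieving
    # even when the message was routed to the chitchat branch.
    print(state["search_graphs_chitchat"])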
climateqa/engine/chains/graph_retriever.py ADDED
@@ -0,0 +1,128 @@
+import sys
+import os
+from contextlib import contextmanager
+
+from ..reranker import rerank_docs
+from ..graph_retriever import retrieve_graphs # GraphRetriever
+from ...utils import remove_duplicates_keep_highest_score
+
+
+def divide_into_parts(target, parts):
+    # Base value for each part
+    base = target // parts
+    # Remainder to distribute
+    remainder = target % parts
+    # List to hold the result
+    result = []
+
+    for i in range(parts):
+        if i < remainder:
+            # These parts get base value + 1
+            result.append(base + 1)
+        else:
+            # The rest get the base value
+            result.append(base)
+
+    return result
+
+
+@contextmanager
+def suppress_output():
+    # Open a null device
+    with open(os.devnull, 'w') as devnull:
+        # Store the original stdout and stderr
+        old_stdout = sys.stdout
+        old_stderr = sys.stderr
+        # Redirect stdout and stderr to the null device
+        sys.stdout = devnull
+        sys.stderr = devnull
+        try:
+            yield
+        finally:
+            # Restore stdout and stderr
+            sys.stdout = old_stdout
+            sys.stderr = old_stderr
+
+
+def make_graph_retriever_node(vectorstore, reranker, rerank_by_question=True, k_final=15, k_before_reranking=100):
+
+    async def node_retrieve_graphs(state):
+        print("---- Retrieving graphs ----")
+
+        POSSIBLE_SOURCES = ["IEA", "OWID"]
+        questions = state["remaining_questions"] if state["remaining_questions"] is not None and state["remaining_questions"]!=[] else [state["query"]]
+        # sources_input = state["sources_input"]
+        sources_input = ["auto"]
+
+        auto_mode = "auto" in sources_input
+
+        # There are several options to get the final top k
+        # Option 1 - Get 100 documents by question and rerank by question
+        # Option 2 - Get 100/n documents by question and rerank the total
+        if rerank_by_question:
+            k_by_question = divide_into_parts(k_final,len(questions))
+
+        docs = []
+
+        for i,q in enumerate(questions):
+
+            question = q["question"] if isinstance(q, dict) else q
+
+            print(f"Subquestion {i}: {question}")
+
+            # If auto mode, we use all sources
+            if auto_mode:
+                sources = POSSIBLE_SOURCES
+            # Otherwise, we use the config
+            else:
+                sources = sources_input
+
+            if any([x in POSSIBLE_SOURCES for x in sources]):
+
+                sources = [x for x in sources if x in POSSIBLE_SOURCES]
+
+                # Search the document store using the retriever
+                docs_question = await retrieve_graphs(
+                    query = question,
+                    vectorstore = vectorstore,
+                    sources = sources,
+                    k_total = k_before_reranking,
+                    threshold = 0.5,
+                )
+                # docs_question = retriever.get_relevant_documents(question)
+
+                # Rerank
+                if reranker is not None and docs_question!=[]:
+                    with suppress_output():
+                        docs_question = rerank_docs(reranker,docs_question,question)
+                else:
+                    # Add a default reranking score
+                    for doc in docs_question:
+                        doc.metadata["reranking_score"] = doc.metadata["similarity_score"]
+
+                # If rerank by question we select the top documents for each question
+                if rerank_by_question:
+                    docs_question = docs_question[:k_by_question[i]]
+
+                # Add sources used in the metadata
+                for doc in docs_question:
+                    doc.metadata["sources_used"] = sources
+
+                print(f"{len(docs_question)} graphs retrieved for subquestion {i + 1}: {docs_question}")
+
+                docs.extend(docs_question)
+
+            else:
+                print(f"There are no graphs which match the sources filtered on. Sources filtered on: {sources}. Sources available: {POSSIBLE_SOURCES}.")
+
+        # Remove duplicates and keep the duplicate document with the highest reranking score
+        docs = remove_duplicates_keep_highest_score(docs)
+
+        # Sorting the list in descending order by rerank_score
+        # Then select the top k
+        docs = sorted(docs, key=lambda x: x.metadata["reranking_score"], reverse=True)
+        docs = docs[:k_final]
+
+        return {"recommended_content": docs}
+
+    return node_retrieve_graphs
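divide_into_parts spreads the final top-k budget as evenly as possible across sub-questions, with the remainder going to the first parts; for example:

    print(divide_into_parts(15, 4))   # [4, 4, 4, 3]
    print(divide_into_parts(15, 1))   # [15]

With the default k_final=15, a query decomposed into four sub-questions therefore gets at most 4, 4, 4 and 3 recommended graphs respectively before the global sort and cut.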
climateqa/engine/chains/intent_categorization.py CHANGED
@@ -17,8 +17,8 @@ class IntentCategorizer(BaseModel):
17
  intent: str = Field(
18
  enum=[
19
  "ai_impact",
20
- "geo_info",
21
- "esg",
22
  "search",
23
  "chitchat",
24
  ],
@@ -28,11 +28,12 @@ class IntentCategorizer(BaseModel):
28
 
29
  Examples:
30
  - ai_impact = Environmental impacts of AI: "What are the environmental impacts of AI", "How does AI affect the environment"
31
- - geo_info = Geolocated info about climate change: Any question where the user wants to know localized impacts of climate change, eg: "What will be the temperature in Marseille in 2050"
32
- - esg = Any question about the ESG regulation, frameworks and standards like the CSRD, TCFD, SASB, GRI, CDP, etc.
33
  - search = Searching for any quesiton about climate change, energy, biodiversity, nature, and everything we can find the IPCC or IPBES reports or scientific papers,
34
  - chitchat = Any general question that is not related to the environment or climate change or just conversational, or if you don't think searching the IPCC or IPBES reports would be relevant
35
  """,
 
 
 
36
  )
37
 
38
 
@@ -43,7 +44,7 @@ def make_intent_categorization_chain(llm):
43
  llm_with_functions = llm.bind(functions = openai_functions,function_call={"name":"IntentCategorizer"})
44
 
45
  prompt = ChatPromptTemplate.from_messages([
46
- ("system", "You are a helpful assistant, you will analyze, translate and reformulate the user input message using the function provided"),
47
  ("user", "input: {input}")
48
  ])
49
 
@@ -56,7 +57,10 @@ def make_intent_categorization_node(llm):
56
  categorization_chain = make_intent_categorization_chain(llm)
57
 
58
  def categorize_message(state):
59
- output = categorization_chain.invoke({"input":state["user_input"]})
 
 
 
60
  if "language" not in output: output["language"] = "English"
61
  output["query"] = state["user_input"]
62
  return output
 
17
  intent: str = Field(
18
  enum=[
19
  "ai_impact",
20
+ # "geo_info",
21
+ # "esg",
22
  "search",
23
  "chitchat",
24
  ],
 
28
 
29
  Examples:
30
  - ai_impact = Environmental impacts of AI: "What are the environmental impacts of AI", "How does AI affect the environment"
 
 
31
  - search = Searching for any question about climate change, energy, biodiversity, nature, and everything we can find the IPCC or IPBES reports or scientific papers,
32
  - chitchat = Any general question that is not related to the environment or climate change or just conversational, or if you don't think searching the IPCC or IPBES reports would be relevant
33
  """,
34
+ # - geo_info = Geolocated info about climate change: Any question where the user wants to know localized impacts of climate change, eg: "What will be the temperature in Marseille in 2050"
35
+ # - esg = Any question about the ESG regulation, frameworks and standards like the CSRD, TCFD, SASB, GRI, CDP, etc.
36
+
37
  )
38
 
39
 
 
44
  llm_with_functions = llm.bind(functions = openai_functions,function_call={"name":"IntentCategorizer"})
45
 
46
  prompt = ChatPromptTemplate.from_messages([
47
+ ("system", "You are a helpful assistant, you will analyze, translate and categorize the user input message using the function provided. Categorize the user input as ai ONLY if it is related to Artificial Intelligence, search if it is related to the environment, climate change, energy, biodiversity, nature, etc. and chitchat if it is just general conversation."),
48
  ("user", "input: {input}")
49
  ])
50
 
 
57
  categorization_chain = make_intent_categorization_chain(llm)
58
 
59
  def categorize_message(state):
60
+ print("---- Categorize_message ----")
61
+
62
+ output = categorization_chain.invoke({"input": state["user_input"]})
63
+ print(f"\n\nOutput intent categorization: {output}\n")
64
  if "language" not in output: output["language"] = "English"
65
  output["query"] = state["user_input"]
66
  return output
climateqa/engine/chains/prompts.py CHANGED
@@ -147,4 +147,27 @@ audience_prompts = {
147
  "children": "6 year old children that don't know anything about science and climate change and need metaphors to learn",
148
  "general": "the general public who know the basics in science and climate change and want to learn more about it without technical terms. Still use references to passages.",
149
  "experts": "expert and climate scientists that are not afraid of technical terms",
150
- }
147
  "children": "6 year old children that don't know anything about science and climate change and need metaphors to learn",
148
  "general": "the general public who know the basics in science and climate change and want to learn more about it without technical terms. Still use references to passages.",
149
  "experts": "expert and climate scientists that are not afraid of technical terms",
150
+ }
151
+
152
+
153
+ answer_prompt_graph_template = """
154
+ Given the user question and a list of graphs which are related to the question, rank the graphs based on relevance to the user question. ALWAYS follow the guidelines given below.
155
+
156
+ ### Guidelines ###
157
+ - Keep all the graphs that are given to you.
158
+ - NEVER modify the graph HTML embedding, the category, or the source; leave them exactly as they are given.
159
+ - Return the ranked graphs as a list of dictionaries with keys 'embedding', 'category', and 'source'.
160
+ - Return a valid JSON output.
161
+
162
+ -----------------------
163
+ User question:
164
+ {query}
165
+
166
+ Graphs and their HTML embedding:
167
+ {recommended_content}
168
+
169
+ -----------------------
170
+ {format_instructions}
171
+
172
+ Output the result as json with a key "graphs" containing a list of dictionaries of the relevant graphs with keys 'embedding', 'category', and 'source'. Do not modify the graph HTML embedding, the category or the source. Do not put any message or text before or after the JSON output.
173
+ """
climateqa/engine/chains/query_transformation.py CHANGED
@@ -69,15 +69,15 @@ class QueryAnalysis(BaseModel):
69
  # """
70
  # )
71
 
72
- sources: List[Literal["IPCC", "IPBES", "IPOS","OpenAlex"]] = Field(
73
  ...,
74
  description="""
75
  Given a user question choose which documents would be most relevant for answering their question,
76
  - IPCC is for questions about climate change, energy, impacts, and everything we can find the IPCC reports
77
  - IPBES is for questions about biodiversity and nature
78
  - IPOS is for questions about the ocean and deep sea mining
79
- - OpenAlex is for any other questions that are not in the previous categories but could be found in the scientific litterature
80
  """,
 
81
  )
82
  # topics: List[Literal[
83
  # "Climate change",
@@ -138,6 +138,8 @@ def make_query_transform_node(llm,k_final=15):
138
  rewriter_chain = make_query_rewriter_chain(llm)
139
 
140
  def transform_query(state):
 
 
141
 
142
  if "sources_auto" not in state or state["sources_auto"] is None or state["sources_auto"] is False:
143
  auto_mode = False
@@ -158,6 +160,12 @@ def make_query_transform_node(llm,k_final=15):
158
  for question in new_state["questions"]:
159
  question_state = {"question":question}
160
  analysis_output = rewriter_chain.invoke({"input":question})
 
 
 
 
 
 
161
  question_state.update(analysis_output)
162
  questions.append(question_state)
163
 
 
69
  # """
70
  # )
71
 
72
+ sources: List[Literal["IPCC", "IPBES", "IPOS"]] = Field( #,"OpenAlex"]] = Field(
73
  ...,
74
  description="""
75
  Given a user question choose which documents would be most relevant for answering their question,
76
  - IPCC is for questions about climate change, energy, impacts, and everything we can find the IPCC reports
77
  - IPBES is for questions about biodiversity and nature
78
  - IPOS is for questions about the ocean and deep sea mining
 
79
  """,
80
+ # - OpenAlex is for any other questions that are not in the previous categories but could be found in the scientific literature
81
  )
82
  # topics: List[Literal[
83
  # "Climate change",
 
138
  rewriter_chain = make_query_rewriter_chain(llm)
139
 
140
  def transform_query(state):
141
+ print("---- Transform query ----")
142
+
143
 
144
  if "sources_auto" not in state or state["sources_auto"] is None or state["sources_auto"] is False:
145
  auto_mode = False
 
160
  for question in new_state["questions"]:
161
  question_state = {"question":question}
162
  analysis_output = rewriter_chain.invoke({"input":question})
163
+
164
+ # TODO WARNING: the LLM should always return something
165
+ # The case when the llm does not return any sources
166
+ if not analysis_output["sources"] or not all(source in ["IPCC", "IPBS", "IPOS"] for source in analysis_output["sources"]):
167
+ analysis_output["sources"] = ["IPCC", "IPBES", "IPOS"]
168
+
169
  question_state.update(analysis_output)
170
  questions.append(question_state)
171
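The fallback above can be read as a small validation rule; a hypothetical standalone version for clarity (the helper name is not in the codebase):
def _validate_sources(sources, allowed=("IPCC", "IPBES", "IPOS")):
    # Fall back to every allowed source when the LLM returns nothing or an unknown source
    if not sources or not all(s in allowed for s in sources):
        return list(allowed)
    return list(sources)

# _validate_sources([])        -> ["IPCC", "IPBES", "IPOS"]
# _validate_sources(["IPCC"])  -> ["IPCC"]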
 
climateqa/engine/chains/retrieve_documents.py CHANGED
@@ -8,10 +8,13 @@ from langchain_core.runnables import RunnableParallel, RunnablePassthrough
8
  from langchain_core.runnables import RunnableLambda
9
 
10
  from ..reranker import rerank_docs
11
- from ...knowledge.retriever import ClimateQARetriever
12
  from ...knowledge.openalex import OpenAlexRetriever
13
  from .keywords_extraction import make_keywords_extraction_chain
14
  from ..utils import log_event
 
 
 
15
 
16
 
17
 
@@ -57,105 +60,244 @@ def query_retriever(question):
57
  """Just a dummy tool to simulate the retriever query"""
58
  return question
59
 
 
 
 
 
 
 
60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
61
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62
 
 
 
 
 
 
63
 
 
 
64
 
 
 
 
 
65
 
66
- def make_retriever_node(vectorstore,reranker,llm,rerank_by_question=True, k_final=15, k_before_reranking=100, k_summary=5):
 
 
 
67
 
68
- # The chain callback is not necessary, but it propagates the langchain callbacks to the astream_events logger to display intermediate results
69
- @chain
70
- async def retrieve_documents(state,config):
71
-
72
- keywords_extraction = make_keywords_extraction_chain(llm)
73
-
74
- current_question = state["remaining_questions"][0]
75
- remaining_questions = state["remaining_questions"][1:]
76
-
77
- # ToolMessage(f"Retrieving documents for question: {current_question['question']}",tool_call_id = "retriever")
 
 
 
 
 
 
 
78
 
 
 
79
 
80
- # # There are several options to get the final top k
81
- # # Option 1 - Get 100 documents by question and rerank by question
82
- # # Option 2 - Get 100/n documents by question and rerank the total
83
- # if rerank_by_question:
84
- # k_by_question = divide_into_parts(k_final,len(questions))
85
- if "documents" in state and state["documents"] is not None:
86
- docs = state["documents"]
87
- else:
88
- docs = []
89
-
90
-
91
-
92
- k_by_question = k_final // state["n_questions"]
93
-
94
- sources = current_question["sources"]
95
- question = current_question["question"]
96
- index = current_question["index"]
97
 
 
 
 
 
 
 
 
98
 
99
- await log_event({"question":question,"sources":sources,"index":index},"log_retriever",config)
 
 
 
 
 
 
 
 
 
 
100
 
101
 
102
- if index == "Vector":
103
-
104
- # Search the document store using the retriever
105
- # Configure high top k for further reranking step
106
- retriever = ClimateQARetriever(
107
- vectorstore=vectorstore,
108
- sources = sources,
109
- min_size = 200,
110
- k_summary = k_summary,
111
- k_total = k_before_reranking,
112
- threshold = 0.5,
113
- )
114
- docs_question = await retriever.ainvoke(question,config)
115
 
116
- elif index == "OpenAlex":
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
117
 
118
- keywords = keywords_extraction.invoke(question)["keywords"]
119
- openalex_query = " AND ".join(keywords)
 
 
 
 
 
 
 
 
 
 
 
 
120
 
121
- print(f"... OpenAlex query: {openalex_query}")
122
 
123
- retriever_openalex = OpenAlexRetriever(
124
- min_year = state.get("min_year",1960),
125
- max_year = state.get("max_year",None),
126
- k = k_before_reranking
127
- )
128
- docs_question = await retriever_openalex.ainvoke(openalex_query,config)
 
 
 
 
 
 
 
129
 
130
- else:
131
- raise Exception(f"Index {index} not found in the routing index")
132
-
133
- # Rerank
134
- if reranker is not None:
135
- with suppress_output():
136
- docs_question = rerank_docs(reranker,docs_question,question)
137
- else:
138
- # Add a default reranking score
139
- for doc in docs_question:
140
- doc.metadata["reranking_score"] = doc.metadata["similarity_score"]
141
-
142
- # If rerank by question we select the top documents for each question
143
- if rerank_by_question:
144
- docs_question = docs_question[:k_by_question]
145
-
146
- # Add sources used in the metadata
147
  for doc in docs_question:
148
- doc.metadata["sources_used"] = sources
149
- doc.metadata["question_used"] = question
150
- doc.metadata["index_used"] = index
151
-
152
- # Add to the list of docs
153
- docs.extend(docs_question)
154
 
155
- # Sorting the list in descending order by rerank_score
156
- docs = sorted(docs, key=lambda x: x.metadata["reranking_score"], reverse=True)
157
- new_state = {"documents":docs,"remaining_questions":remaining_questions}
158
- return new_state
 
 
 
 
 
 
 
 
159
 
160
- return retrieve_documents
 
 
 
 
 
 
 
 
 
 
161
 
 
8
  from langchain_core.runnables import RunnableLambda
9
 
10
  from ..reranker import rerank_docs
11
+ # from ...knowledge.retriever import ClimateQARetriever
12
  from ...knowledge.openalex import OpenAlexRetriever
13
  from .keywords_extraction import make_keywords_extraction_chain
14
  from ..utils import log_event
15
+ from langchain_core.vectorstores import VectorStore
16
+ from typing import List
17
+ from langchain_core.documents.base import Document
18
 
19
 
20
 
 
60
  """Just a dummy tool to simulate the retriever query"""
61
  return question
62
 
63
+ def _add_sources_used_in_metadata(docs,sources,question,index):
64
+ for doc in docs:
65
+ doc.metadata["sources_used"] = sources
66
+ doc.metadata["question_used"] = question
67
+ doc.metadata["index_used"] = index
68
+ return docs
69
 
70
+ def _get_k_summary_by_question(n_questions):
71
+ if n_questions == 0:
72
+ return 0
73
+ elif n_questions == 1:
74
+ return 5
75
+ elif n_questions == 2:
76
+ return 3
77
+ elif n_questions == 3:
78
+ return 2
79
+ else:
80
+ return 1
81
+
82
+ def _get_k_images_by_question(n_questions):
83
+ if n_questions == 0:
84
+ return 0
85
+ elif n_questions == 1:
86
+ return 7
87
+ elif n_questions == 2:
88
+ return 5
89
+ elif n_questions == 3:
90
+ return 2
91
+ else:
92
+ return 1
93
+
94
+ def _add_metadata_and_score(docs: List) -> Document:
95
+ # Add score to metadata
96
+ docs_with_metadata = []
97
+ for i,(doc,score) in enumerate(docs):
98
+ doc.page_content = doc.page_content.replace("\r\n"," ")
99
+ doc.metadata["similarity_score"] = score
100
+ doc.metadata["content"] = doc.page_content
101
+ doc.metadata["page_number"] = int(doc.metadata["page_number"]) + 1
102
+ # doc.page_content = f"""Doc {i+1} - {doc.metadata['short_name']}: {doc.page_content}"""
103
+ docs_with_metadata.append(doc)
104
+ return docs_with_metadata
105
 
106
+ async def get_IPCC_relevant_documents(
107
+ query: str,
108
+ vectorstore:VectorStore,
109
+ sources:list = ["IPCC","IPBES","IPOS"],
110
+ search_figures:bool = False,
111
+ reports:list = [],
112
+ threshold:float = 0.6,
113
+ k_summary:int = 3,
114
+ k_total:int = 10,
115
+ k_images: int = 5,
116
+ namespace:str = "vectors",
117
+ min_size:int = 200,
118
+ search_only:bool = False,
119
+ ) :
120
 
121
+ # Check if all elements in the list are either IPCC or IPBES
122
+ assert isinstance(sources,list)
123
+ assert sources
124
+ assert all([x in ["IPCC","IPBES","IPOS"] for x in sources])
125
+ assert k_total > k_summary, "k_total should be greater than k_summary"
126
 
127
+ # Prepare base search kwargs
128
+ filters = {}
129
 
130
+ if len(reports) > 0:
131
+ filters["short_name"] = {"$in":reports}
132
+ else:
133
+ filters["source"] = { "$in": sources}
134
 
135
+ # INIT
136
+ docs_summaries = []
137
+ docs_full = []
138
+ docs_images = []
139
 
140
+ if search_only:
141
+ # Only search for images if search_only is True
142
+ if search_figures:
143
+ filters_image = {
144
+ **filters,
145
+ "chunk_type":"image"
146
+ }
147
+ docs_images = vectorstore.similarity_search_with_score(query=query,filter = filters_image,k = k_images)
148
+ docs_images = _add_metadata_and_score(docs_images)
149
+ else:
150
+ # Regular search flow for text and optionally images
151
+ # Search for k_summary documents in the summaries dataset
152
+ filters_summaries = {
153
+ **filters,
154
+ "chunk_type":"text",
155
+ "report_type": { "$in":["SPM"]},
156
+ }
157
 
158
+ docs_summaries = vectorstore.similarity_search_with_score(query=query,filter = filters_summaries,k = k_summary)
159
+ docs_summaries = [x for x in docs_summaries if x[1] > threshold]
160
 
161
+ # Search for k_total - k_summary documents in the full reports dataset
162
+ filters_full = {
163
+ **filters,
164
+ "chunk_type":"text",
165
+ "report_type": { "$nin":["SPM"]},
166
+ }
167
+ k_full = k_total - len(docs_summaries)
168
+ docs_full = vectorstore.similarity_search_with_score(query=query,filter = filters_full,k = k_full)
 
 
 
 
 
 
 
 
 
169
 
170
+ if search_figures:
171
+ # Images
172
+ filters_image = {
173
+ **filters,
174
+ "chunk_type":"image"
175
+ }
176
+ docs_images = vectorstore.similarity_search_with_score(query=query,filter = filters_image,k = k_images)
177
 
178
+ docs_summaries, docs_full, docs_images = _add_metadata_and_score(docs_summaries), _add_metadata_and_score(docs_full), _add_metadata_and_score(docs_images)
179
+
180
+ # Filter if length are below threshold
181
+ docs_summaries = [x for x in docs_summaries if len(x.page_content) > min_size]
182
+ docs_full = [x for x in docs_full if len(x.page_content) > min_size]
183
+
184
+ return {
185
+ "docs_summaries" : docs_summaries,
186
+ "docs_full" : docs_full,
187
+ "docs_images" : docs_images,
188
+ }
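For illustration (values assumed), the metadata filters assembled above take this shape when passed to the vector store:
# filters_summaries = {"source": {"$in": ["IPCC", "IPBES"]}, "chunk_type": "text", "report_type": {"$in": ["SPM"]}}
# filters_full      = {"source": {"$in": ["IPCC", "IPBES"]}, "chunk_type": "text", "report_type": {"$nin": ["SPM"]}}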
189
 
190
 
 
 
 
 
 
 
 
 
 
 
 
 
 
191
 
192
+ # The chain callback is not necessary, but it propagates the langchain callbacks to the astream_events logger to display intermediate results
193
+ # @chain
194
+ async def retrieve_documents(state,config, vectorstore,reranker,llm,rerank_by_question=True, k_final=15, k_before_reranking=100, k_summary=5, k_images=5):
195
+ """
196
+ Retrieve and rerank documents based on the current question in the state.
197
+
198
+ Args:
199
+ state (dict): The current state containing documents, related content, relevant content sources, remaining questions and n_questions.
200
+ config (dict): Configuration settings for logging and other purposes.
201
+ vectorstore (object): The vector store used to retrieve relevant documents.
202
+ reranker (object): The reranker used to rerank the retrieved documents.
203
+ llm (object): The language model used for processing.
204
+ rerank_by_question (bool, optional): Whether to rerank documents by question. Defaults to True.
205
+ k_final (int, optional): The final number of documents to retrieve. Defaults to 15.
206
+ k_before_reranking (int, optional): The number of documents to retrieve before reranking. Defaults to 100.
207
+ k_summary (int, optional): The number of summary documents to retrieve. Defaults to 5.
208
+ k_images (int, optional): The number of image documents to retrieve. Defaults to 5.
209
+ Returns:
210
+ dict: The updated state containing the retrieved and reranked documents, related content, and remaining questions.
211
+ """
212
+ print("---- Retrieve documents ----")
213
+
214
+ # Get the documents from the state
215
+ if "documents" in state and state["documents"] is not None:
216
+ docs = state["documents"]
217
+ else:
218
+ docs = []
219
+ # Get the related_content from the state
220
+ if "related_content" in state and state["related_content"] is not None:
221
+ related_content = state["related_content"]
222
+ else:
223
+ related_content = []
224
+
225
+ search_figures = "IPCC figures" in state["relevant_content_sources"]
226
+ search_only = state["search_only"]
227
 
228
+ # Get the current question
229
+ current_question = state["remaining_questions"][0]
230
+ remaining_questions = state["remaining_questions"][1:]
231
+
232
+ k_by_question = k_final // state["n_questions"]
233
+ k_summary_by_question = _get_k_summary_by_question(state["n_questions"])
234
+ k_images_by_question = _get_k_images_by_question(state["n_questions"])
235
+
236
+ sources = current_question["sources"]
237
+ question = current_question["question"]
238
+ index = current_question["index"]
239
+
240
+ print(f"Retrieve documents for question: {question}")
241
+ await log_event({"question":question,"sources":sources,"index":index},"log_retriever",config)
242
 
 
243
 
244
+ if index == "Vector": # always true for now
245
+ docs_question_dict = await get_IPCC_relevant_documents(
246
+ query = question,
247
+ vectorstore=vectorstore,
248
+ search_figures = search_figures,
249
+ sources = sources,
250
+ min_size = 200,
251
+ k_summary = k_summary_by_question,
252
+ k_total = k_before_reranking,
253
+ k_images = k_images_by_question,
254
+ threshold = 0.5,
255
+ search_only = search_only,
256
+ )
257
 
258
+
259
+ # Rerank
260
+ if reranker is not None:
261
+ with suppress_output():
262
+ docs_question_summary_reranked = rerank_docs(reranker,docs_question_dict["docs_summaries"],question)
263
+ docs_question_fulltext_reranked = rerank_docs(reranker,docs_question_dict["docs_full"],question)
264
+ docs_question_images_reranked = rerank_docs(reranker,docs_question_dict["docs_images"],question)
265
+ if rerank_by_question:
266
+ docs_question_summary_reranked = sorted(docs_question_summary_reranked, key=lambda x: x.metadata["reranking_score"], reverse=True)
267
+ docs_question_fulltext_reranked = sorted(docs_question_fulltext_reranked, key=lambda x: x.metadata["reranking_score"], reverse=True)
268
+ docs_question_images_reranked = sorted(docs_question_images_reranked, key=lambda x: x.metadata["reranking_score"], reverse=True)
269
+ else:
270
+ docs_question = docs_question_dict["docs_summaries"] + docs_question_dict["docs_full"]
271
+ # Add a default reranking score
 
 
 
272
  for doc in docs_question:
273
+ doc.metadata["reranking_score"] = doc.metadata["similarity_score"]
274
+
275
+ docs_question = docs_question_summary_reranked + docs_question_fulltext_reranked
276
+ docs_question = docs_question[:k_by_question]
277
+ images_question = docs_question_images_reranked[:k_images]
 
278
 
279
+ if reranker is not None and rerank_by_question:
280
+ docs_question = sorted(docs_question, key=lambda x: x.metadata["reranking_score"], reverse=True)
281
+
282
+ # Add sources used in the metadata
283
+ docs_question = _add_sources_used_in_metadata(docs_question,sources,question,index)
284
+ images_question = _add_sources_used_in_metadata(images_question,sources,question,index)
285
+
286
+ # Add to the list of docs
287
+ docs.extend(docs_question)
288
+ related_content.extend(images_question)
289
+ new_state = {"documents":docs, "related_contents": related_content,"remaining_questions":remaining_questions}
290
+ return new_state
291
 
292
+
293
+
294
+ def make_retriever_node(vectorstore,reranker,llm,rerank_by_question=True, k_final=15, k_before_reranking=100, k_summary=5):
295
+ @chain
296
+ async def retrieve_docs(state, config):
297
+ state = await retrieve_documents(state,config, vectorstore,reranker,llm,rerank_by_question, k_final, k_before_reranking, k_summary)
298
+ return state
299
+
300
+ return retrieve_docs
301
+
302
+
303
 
climateqa/engine/chains/retrieve_papers.py ADDED
@@ -0,0 +1,95 @@
1
+ from climateqa.engine.keywords import make_keywords_chain
2
+ from climateqa.engine.llm import get_llm
3
+ from climateqa.knowledge.openalex import OpenAlex
4
+ from climateqa.engine.chains.answer_rag import make_rag_papers_chain
5
+ from front.utils import make_html_papers
6
+ from climateqa.engine.reranker import get_reranker
7
+
8
+ oa = OpenAlex()
9
+
10
+ llm = get_llm(provider="openai",max_tokens = 1024,temperature = 0.0)
11
+ reranker = get_reranker("nano")
12
+
13
+
14
+ papers_cols_widths = {
15
+ "id":100,
16
+ "title":300,
17
+ "doi":100,
18
+ "publication_year":100,
19
+ "abstract":500,
20
+ "is_oa":50,
21
+ }
22
+
23
+ papers_cols = list(papers_cols_widths.keys())
24
+ papers_cols_widths = list(papers_cols_widths.values())
25
+
26
+
27
+
28
+ def generate_keywords(query):
29
+ chain = make_keywords_chain(llm)
30
+ keywords = chain.invoke(query)
31
+ keywords = " AND ".join(keywords["keywords"])
32
+ return keywords
33
+
34
+
35
+ async def find_papers(query,after, relevant_content_sources, reranker= reranker):
36
+ if "OpenAlex" in relevant_content_sources:
37
+ summary = ""
38
+ keywords = generate_keywords(query)
39
+ df_works = oa.search(keywords,after = after)
40
+
41
+ print(f"Found {len(df_works)} papers")
42
+
43
+ if not df_works.empty:
44
+ df_works = df_works.dropna(subset=["abstract"])
45
+ df_works = df_works[df_works["abstract"] != ""].reset_index(drop = True)
46
+ df_works = oa.rerank(query,df_works,reranker)
47
+ df_works = df_works.sort_values("rerank_score",ascending=False)
48
+ docs_html = []
49
+ for i in range(10):
50
+ docs_html.append(make_html_papers(df_works, i))
51
+ docs_html = "".join(docs_html)
52
+ G = oa.make_network(df_works)
53
+
54
+ height = "750px"
55
+ network = oa.show_network(G,color_by = "rerank_score",notebook=False,height = height)
56
+ network_html = network.generate_html()
57
+
58
+ network_html = network_html.replace("'", "\"")
59
+ css_to_inject = "<style>#mynetwork { border: none !important; } .card { border: none !important; }</style>"
60
+ network_html = network_html + css_to_inject
61
+
62
+
63
+ network_html = f"""<iframe style="width: 100%; height: {height};margin:0 auto" name="result" allow="midi; geolocation; microphone; camera;
64
+ display-capture; encrypted-media;" sandbox="allow-modals allow-forms
65
+ allow-scripts allow-same-origin allow-popups
66
+ allow-top-navigation-by-user-activation allow-downloads" allowfullscreen=""
67
+ allowpaymentrequest="" frameborder="0" srcdoc='{network_html}'></iframe>"""
68
+
69
+
70
+ docs = df_works["content"].head(10).tolist()
71
+
72
+ df_works = df_works.reset_index(drop = True).reset_index().rename(columns = {"index":"doc"})
73
+ df_works["doc"] = df_works["doc"] + 1
74
+ df_works = df_works[papers_cols]
75
+
76
+ yield docs_html, network_html, summary
77
+
78
+ chain = make_rag_papers_chain(llm)
79
+ result = chain.astream_log({"question": query,"docs": docs,"language":"English"})
80
+ path_answer = "/logs/StrOutputParser/streamed_output/-"
81
+
82
+ async for op in result:
83
+
84
+ op = op.ops[0]
85
+
86
+ if op['path'] == path_answer: # reformulated question
87
+ new_token = op['value'] # str
88
+ summary += new_token
89
+ else:
90
+ continue
91
+ yield docs_html, network_html, summary
92
+ else :
93
+ print("No papers found")
94
+ else :
95
+ yield "","", ""
climateqa/engine/chains/retriever.py ADDED
@@ -0,0 +1,126 @@
1
+ # import sys
2
+ # import os
3
+ # from contextlib import contextmanager
4
+
5
+ # from ..reranker import rerank_docs
6
+ # from ...knowledge.retriever import ClimateQARetriever
7
+
8
+
9
+
10
+
11
+ # def divide_into_parts(target, parts):
12
+ # # Base value for each part
13
+ # base = target // parts
14
+ # # Remainder to distribute
15
+ # remainder = target % parts
16
+ # # List to hold the result
17
+ # result = []
18
+
19
+ # for i in range(parts):
20
+ # if i < remainder:
21
+ # # These parts get base value + 1
22
+ # result.append(base + 1)
23
+ # else:
24
+ # # The rest get the base value
25
+ # result.append(base)
26
+
27
+ # return result
28
+
29
+
30
+ # @contextmanager
31
+ # def suppress_output():
32
+ # # Open a null device
33
+ # with open(os.devnull, 'w') as devnull:
34
+ # # Store the original stdout and stderr
35
+ # old_stdout = sys.stdout
36
+ # old_stderr = sys.stderr
37
+ # # Redirect stdout and stderr to the null device
38
+ # sys.stdout = devnull
39
+ # sys.stderr = devnull
40
+ # try:
41
+ # yield
42
+ # finally:
43
+ # # Restore stdout and stderr
44
+ # sys.stdout = old_stdout
45
+ # sys.stderr = old_stderr
46
+
47
+
48
+
49
+ # def make_retriever_node(vectorstore,reranker,rerank_by_question=True, k_final=15, k_before_reranking=100, k_summary=5):
50
+
51
+ # def retrieve_documents(state):
52
+
53
+ # POSSIBLE_SOURCES = ["IPCC","IPBES","IPOS"] # ,"OpenAlex"]
54
+ # questions = state["questions"]
55
+
56
+ # # Use sources from the user input or from the LLM detection
57
+ # if "sources_input" not in state or state["sources_input"] is None:
58
+ # sources_input = ["auto"]
59
+ # else:
60
+ # sources_input = state["sources_input"]
61
+ # auto_mode = "auto" in sources_input
62
+
63
+ # # There are several options to get the final top k
64
+ # # Option 1 - Get 100 documents by question and rerank by question
65
+ # # Option 2 - Get 100/n documents by question and rerank the total
66
+ # if rerank_by_question:
67
+ # k_by_question = divide_into_parts(k_final,len(questions))
68
+
69
+ # docs = []
70
+
71
+ # for i,q in enumerate(questions):
72
+
73
+ # sources = q["sources"]
74
+ # question = q["question"]
75
+
76
+ # # If auto mode, we use the sources detected by the LLM
77
+ # if auto_mode:
78
+ # sources = [x for x in sources if x in POSSIBLE_SOURCES]
79
+
80
+ # # Otherwise, we use the config
81
+ # else:
82
+ # sources = sources_input
83
+
84
+ # # Search the document store using the retriever
85
+ # # Configure high top k for further reranking step
86
+ # retriever = ClimateQARetriever(
87
+ # vectorstore=vectorstore,
88
+ # sources = sources,
89
+ # # reports = ias_reports,
90
+ # min_size = 200,
91
+ # k_summary = k_summary,
92
+ # k_total = k_before_reranking,
93
+ # threshold = 0.5,
94
+ # )
95
+ # docs_question = retriever.get_relevant_documents(question)
96
+
97
+ # # Rerank
98
+ # if reranker is not None:
99
+ # with suppress_output():
100
+ # docs_question = rerank_docs(reranker,docs_question,question)
101
+ # else:
102
+ # # Add a default reranking score
103
+ # for doc in docs_question:
104
+ # doc.metadata["reranking_score"] = doc.metadata["similarity_score"]
105
+
106
+ # # If rerank by question we select the top documents for each question
107
+ # if rerank_by_question:
108
+ # docs_question = docs_question[:k_by_question[i]]
109
+
110
+ # # Add sources used in the metadata
111
+ # for doc in docs_question:
112
+ # doc.metadata["sources_used"] = sources
113
+
114
+ # # Add to the list of docs
115
+ # docs.extend(docs_question)
116
+
117
+ # # Sorting the list in descending order by rerank_score
118
+ # # Then select the top k
119
+ # docs = sorted(docs, key=lambda x: x.metadata["reranking_score"], reverse=True)
120
+ # docs = docs[:k_final]
121
+
122
+ # new_state = {"documents":docs}
123
+ # return new_state
124
+
125
+ # return retrieve_documents
126
+
climateqa/engine/chains/set_defaults.py ADDED
@@ -0,0 +1,13 @@
1
+ def set_defaults(state):
2
+ print("---- Setting defaults ----")
3
+
4
+ if not state["audience"] or state["audience"] is None:
5
+ state.update({"audience": "experts"})
6
+
7
+ sources_input = state["sources_input"] if "sources_input" in state else ["auto"]
8
+ state.update({"sources_input": sources_input})
9
+
10
+ # if not state["sources_input"] or state["sources_input"] is None:
11
+ # state.update({"sources_input": ["auto"]})
12
+
13
+ return state
climateqa/engine/chains/translation.py CHANGED
@@ -30,10 +30,11 @@ def make_translation_chain(llm):
30
 
31
 
32
  def make_translation_node(llm):
33
-
34
  translation_chain = make_translation_chain(llm)
35
 
36
  def translate_query(state):
 
 
37
  user_input = state["user_input"]
38
  translation = translation_chain.invoke({"input":user_input})
39
  return {"query":translation["translation"]}
 
30
 
31
 
32
  def make_translation_node(llm):
 
33
  translation_chain = make_translation_chain(llm)
34
 
35
  def translate_query(state):
36
+ print("---- Translate query ----")
37
+
38
  user_input = state["user_input"]
39
  translation = translation_chain.invoke({"input":user_input})
40
  return {"query":translation["translation"]}
climateqa/engine/graph.py CHANGED
@@ -7,7 +7,7 @@ from langgraph.graph import END, StateGraph
7
  from langchain_core.runnables.graph import CurveStyle, MermaidDrawMethod
8
 
9
  from typing_extensions import TypedDict
10
- from typing import List
11
 
12
  from IPython.display import display, HTML, Image
13
 
@@ -18,6 +18,9 @@ from .chains.translation import make_translation_node
18
  from .chains.intent_categorization import make_intent_categorization_node
19
  from .chains.retrieve_documents import make_retriever_node
20
  from .chains.answer_rag import make_rag_node
 
 
 
21
 
22
  class GraphState(TypedDict):
23
  """
@@ -26,16 +29,21 @@ class GraphState(TypedDict):
26
  user_input : str
27
  language : str
28
  intent : str
 
29
  query: str
30
  remaining_questions : List[dict]
31
  n_questions : int
32
  answer: str
33
  audience: str = "experts"
34
  sources_input: List[str] = ["IPCC","IPBES"]
 
35
  sources_auto: bool = True
36
  min_year: int = 1960
37
  max_year: int = None
38
  documents: List[Document]
 
 
 
39
 
40
  def search(state): #TODO
41
  return state
@@ -52,6 +60,13 @@ def route_intent(state):
52
  else:
53
  # Search route
54
  return "search"
 
 
 
 
 
 
 
55
 
56
  def route_translation(state):
57
  if state["language"].lower() == "english":
@@ -66,11 +81,18 @@ def route_based_on_relevant_docs(state,threshold_docs=0.2):
66
  else:
67
  return "answer_rag_no_docs"
68
 
 
 
 
 
 
 
 
69
 
70
  def make_id_dict(values):
71
  return {k:k for k in values}
72
 
73
- def make_graph_agent(llm,vectorstore,reranker,threshold_docs = 0.2):
74
 
75
  workflow = StateGraph(GraphState)
76
 
@@ -80,21 +102,26 @@ def make_graph_agent(llm,vectorstore,reranker,threshold_docs = 0.2):
80
  translate_query = make_translation_node(llm)
81
  answer_chitchat = make_chitchat_node(llm)
82
  answer_ai_impact = make_ai_impact_node(llm)
83
- retrieve_documents = make_retriever_node(vectorstore,reranker,llm)
84
- answer_rag = make_rag_node(llm,with_docs=True)
85
- answer_rag_no_docs = make_rag_node(llm,with_docs=False)
 
 
86
 
87
  # Define the nodes
 
88
  workflow.add_node("categorize_intent", categorize_intent)
89
  workflow.add_node("search", search)
90
  workflow.add_node("answer_search", answer_search)
91
  workflow.add_node("transform_query", transform_query)
92
  workflow.add_node("translate_query", translate_query)
93
  workflow.add_node("answer_chitchat", answer_chitchat)
94
- # workflow.add_node("answer_ai_impact", answer_ai_impact)
95
- workflow.add_node("retrieve_documents",retrieve_documents)
96
- workflow.add_node("answer_rag",answer_rag)
97
- workflow.add_node("answer_rag_no_docs",answer_rag_no_docs)
 
 
98
 
99
  # Entry point
100
  workflow.set_entry_point("categorize_intent")
@@ -106,6 +133,12 @@ def make_graph_agent(llm,vectorstore,reranker,threshold_docs = 0.2):
106
  make_id_dict(["answer_chitchat","search"])
107
  )
108
 
 
 
 
 
 
 
109
  workflow.add_conditional_edges(
110
  "search",
111
  route_translation,
@@ -113,8 +146,9 @@ def make_graph_agent(llm,vectorstore,reranker,threshold_docs = 0.2):
113
  )
114
  workflow.add_conditional_edges(
115
  "retrieve_documents",
116
- lambda state : "retrieve_documents" if len(state["remaining_questions"]) > 0 else "answer_search",
117
- make_id_dict(["retrieve_documents","answer_search"])
 
118
  )
119
 
120
  workflow.add_conditional_edges(
@@ -122,14 +156,21 @@ def make_graph_agent(llm,vectorstore,reranker,threshold_docs = 0.2):
122
  lambda x : route_based_on_relevant_docs(x,threshold_docs=threshold_docs),
123
  make_id_dict(["answer_rag","answer_rag_no_docs"])
124
  )
 
 
 
 
 
125
 
126
  # Define the edges
127
  workflow.add_edge("translate_query", "transform_query")
128
  workflow.add_edge("transform_query", "retrieve_documents")
 
 
129
  workflow.add_edge("answer_rag", END)
130
  workflow.add_edge("answer_rag_no_docs", END)
131
- workflow.add_edge("answer_chitchat", END)
132
- # workflow.add_edge("answer_ai_impact", END)
133
 
134
  # Compile
135
  app = workflow.compile()
@@ -146,4 +187,4 @@ def display_graph(app):
146
  draw_method=MermaidDrawMethod.API,
147
  )
148
  )
149
- )
 
7
  from langchain_core.runnables.graph import CurveStyle, MermaidDrawMethod
8
 
9
  from typing_extensions import TypedDict
10
+ from typing import List, Dict
11
 
12
  from IPython.display import display, HTML, Image
13
 
 
18
  from .chains.intent_categorization import make_intent_categorization_node
19
  from .chains.retrieve_documents import make_retriever_node
20
  from .chains.answer_rag import make_rag_node
21
+ from .chains.graph_retriever import make_graph_retriever_node
22
+ from .chains.chitchat_categorization import make_chitchat_intent_categorization_node
23
+ # from .chains.set_defaults import set_defaults
24
 
25
  class GraphState(TypedDict):
26
  """
 
29
  user_input : str
30
  language : str
31
  intent : str
32
+ search_graphs_chitchat : bool
33
  query: str
34
  remaining_questions : List[dict]
35
  n_questions : int
36
  answer: str
37
  audience: str = "experts"
38
  sources_input: List[str] = ["IPCC","IPBES"]
39
+ relevant_content_sources: List[str] = ["IPCC figures"]
40
  sources_auto: bool = True
41
  min_year: int = 1960
42
  max_year: int = None
43
  documents: List[Document]
44
+ related_contents : Dict[str,Document]
45
+ recommended_content : List[Document]
46
+ search_only : bool = False
47
 
48
  def search(state): #TODO
49
  return state
 
60
  else:
61
  # Search route
62
  return "search"
63
+
64
+ def chitchat_route_intent(state):
65
+ intent = state["search_graphs_chitchat"]
66
+ if intent is True:
67
+ return "retrieve_graphs_chitchat"
68
+ elif intent is False:
69
+ return END
70
 
71
  def route_translation(state):
72
  if state["language"].lower() == "english":
 
81
  else:
82
  return "answer_rag_no_docs"
83
 
84
+ def route_retrieve_documents(state):
85
+ if state["search_only"] :
86
+ return END
87
+ elif len(state["remaining_questions"]) > 0:
88
+ return "retrieve_documents"
89
+ else:
90
+ return "answer_search"
91
 
92
  def make_id_dict(values):
93
  return {k:k for k in values}
94
 
95
+ def make_graph_agent(llm, vectorstore_ipcc, vectorstore_graphs, reranker, threshold_docs=0.2):
96
 
97
  workflow = StateGraph(GraphState)
98
 
 
102
  translate_query = make_translation_node(llm)
103
  answer_chitchat = make_chitchat_node(llm)
104
  answer_ai_impact = make_ai_impact_node(llm)
105
+ retrieve_documents = make_retriever_node(vectorstore_ipcc, reranker, llm)
106
+ retrieve_graphs = make_graph_retriever_node(vectorstore_graphs, reranker)
107
+ answer_rag = make_rag_node(llm, with_docs=True)
108
+ answer_rag_no_docs = make_rag_node(llm, with_docs=False)
109
+ chitchat_categorize_intent = make_chitchat_intent_categorization_node(llm)
110
 
111
  # Define the nodes
112
+ # workflow.add_node("set_defaults", set_defaults)
113
  workflow.add_node("categorize_intent", categorize_intent)
114
  workflow.add_node("search", search)
115
  workflow.add_node("answer_search", answer_search)
116
  workflow.add_node("transform_query", transform_query)
117
  workflow.add_node("translate_query", translate_query)
118
  workflow.add_node("answer_chitchat", answer_chitchat)
119
+ workflow.add_node("chitchat_categorize_intent", chitchat_categorize_intent)
120
+ workflow.add_node("retrieve_graphs", retrieve_graphs)
121
+ workflow.add_node("retrieve_graphs_chitchat", retrieve_graphs)
122
+ workflow.add_node("retrieve_documents", retrieve_documents)
123
+ workflow.add_node("answer_rag", answer_rag)
124
+ workflow.add_node("answer_rag_no_docs", answer_rag_no_docs)
125
 
126
  # Entry point
127
  workflow.set_entry_point("categorize_intent")
 
133
  make_id_dict(["answer_chitchat","search"])
134
  )
135
 
136
+ workflow.add_conditional_edges(
137
+ "chitchat_categorize_intent",
138
+ chitchat_route_intent,
139
+ make_id_dict(["retrieve_graphs_chitchat", END])
140
+ )
141
+
142
  workflow.add_conditional_edges(
143
  "search",
144
  route_translation,
 
146
  )
147
  workflow.add_conditional_edges(
148
  "retrieve_documents",
149
+ # lambda state : "retrieve_documents" if len(state["remaining_questions"]) > 0 else "answer_search",
150
+ route_retrieve_documents,
151
+ make_id_dict([END,"retrieve_documents","answer_search"])
152
  )
153
 
154
  workflow.add_conditional_edges(
 
156
  lambda x : route_based_on_relevant_docs(x,threshold_docs=threshold_docs),
157
  make_id_dict(["answer_rag","answer_rag_no_docs"])
158
  )
159
+ workflow.add_conditional_edges(
160
+ "transform_query",
161
+ lambda state : "retrieve_graphs" if "OurWorldInData" in state["relevant_content_sources"] else END,
162
+ make_id_dict(["retrieve_graphs", END])
163
+ )
164
 
165
  # Define the edges
166
  workflow.add_edge("translate_query", "transform_query")
167
  workflow.add_edge("transform_query", "retrieve_documents")
168
+
169
+ workflow.add_edge("retrieve_graphs", END)
170
  workflow.add_edge("answer_rag", END)
171
  workflow.add_edge("answer_rag_no_docs", END)
172
+ workflow.add_edge("answer_chitchat", "chitchat_categorize_intent")
173
+
174
 
175
  # Compile
176
  app = workflow.compile()
 
187
  draw_method=MermaidDrawMethod.API,
188
  )
189
  )
190
+ )
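A rough usage sketch (not part of this commit; state keys follow GraphState above, object construction is assumed elsewhere):
# app = make_graph_agent(llm, vectorstore_ipcc, vectorstore_graphs, reranker, threshold_docs=0.2)
# state = {"user_input": "What drives sea level rise?", "audience": "experts",
#          "sources_input": ["auto"], "relevant_content_sources": ["IPCC figures"], "search_only": False}
# result = await app.ainvoke(state)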
climateqa/engine/graph_retriever.py ADDED
@@ -0,0 +1,88 @@
1
+ from langchain_core.retrievers import BaseRetriever
2
+ from langchain_core.documents.base import Document
3
+ from langchain_core.vectorstores import VectorStore
4
+ from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
5
+
6
+ from typing import List
7
+
8
+ # class GraphRetriever(BaseRetriever):
9
+ # vectorstore:VectorStore
10
+ # sources:list = ["OWID"] # later: add OurWorldInData; will need to be integrated with the other retriever
11
+ # threshold:float = 0.5
12
+ # k_total:int = 10
13
+
14
+ # def _get_relevant_documents(
15
+ # self, query: str, *, run_manager: CallbackManagerForRetrieverRun
16
+ # ) -> List[Document]:
17
+
18
+ # # Check if all elements in the list are IEA or OWID
19
+ # assert isinstance(self.sources,list)
20
+ # assert self.sources
21
+ # assert any([x in ["OWID"] for x in self.sources])
22
+
23
+ # # Prepare base search kwargs
24
+ # filters = {}
25
+
26
+ # filters["source"] = {"$in": self.sources}
27
+
28
+ # docs = self.vectorstore.similarity_search_with_score(query=query, filter=filters, k=self.k_total)
29
+
30
+ # # Filter if scores are below threshold
31
+ # docs = [x for x in docs if x[1] > self.threshold]
32
+
33
+ # # Remove duplicate documents
34
+ # unique_docs = []
35
+ # seen_docs = []
36
+ # for i, doc in enumerate(docs):
37
+ # if doc[0].page_content not in seen_docs:
38
+ # unique_docs.append(doc)
39
+ # seen_docs.append(doc[0].page_content)
40
+
41
+ # # Add score to metadata
42
+ # results = []
43
+ # for i,(doc,score) in enumerate(unique_docs):
44
+ # doc.metadata["similarity_score"] = score
45
+ # doc.metadata["content"] = doc.page_content
46
+ # results.append(doc)
47
+
48
+ # return results
49
+
50
+ async def retrieve_graphs(
51
+ query: str,
52
+ vectorstore:VectorStore,
53
+ sources:list = ["OWID"], # later: add OurWorldInData; will need to be integrated with the other retriever
54
+ threshold:float = 0.5,
55
+ k_total:int = 10,
56
+ )-> List[Document]:
57
+
58
+ # Check if all elements in the list are IEA or OWID
59
+ assert isinstance(sources,list)
60
+ assert sources
61
+ assert any([x in ["OWID"] for x in sources])
62
+
63
+ # Prepare base search kwargs
64
+ filters = {}
65
+
66
+ filters["source"] = {"$in": sources}
67
+
68
+ docs = vectorstore.similarity_search_with_score(query=query, filter=filters, k=k_total)
69
+
70
+ # Filter if scores are below threshold
71
+ docs = [x for x in docs if x[1] > threshold]
72
+
73
+ # Remove duplicate documents
74
+ unique_docs = []
75
+ seen_docs = []
76
+ for i, doc in enumerate(docs):
77
+ if doc[0].page_content not in seen_docs:
78
+ unique_docs.append(doc)
79
+ seen_docs.append(doc[0].page_content)
80
+
81
+ # Add score to metadata
82
+ results = []
83
+ for i,(doc,score) in enumerate(unique_docs):
84
+ doc.metadata["similarity_score"] = score
85
+ doc.metadata["content"] = doc.page_content
86
+ results.append(doc)
87
+
88
+ return results
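Illustrative call (the vectorstore is assumed to index OWID graph embeddings):
# graphs = await retrieve_graphs("global sea level rise", vectorstore, sources=["OWID"], threshold=0.5, k_total=10)
# for g in graphs:
#     print(g.metadata["similarity_score"], g.metadata["content"][:80])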
climateqa/engine/keywords.py CHANGED
@@ -11,10 +11,12 @@ class KeywordsOutput(BaseModel):
11
 
12
  keywords: list = Field(
13
  description="""
14
- Generate 1 or 2 relevant keywords from the user query to ask a search engine for scientific research papers.
 
15
 
16
  Example:
17
  - "What is the impact of deep sea mining ?" -> ["deep sea mining"]
 
18
  - "How will El Nino be impacted by climate change" -> ["el nino"]
19
  - "Is climate change a hoax" -> [Climate change","hoax"]
20
  """
 
11
 
12
  keywords: list = Field(
13
  description="""
14
+ Generate 1 or 2 relevant keywords from the user query to ask a search engine for scientific research papers. Answer only with English keywords.
15
+ Do not use special characters or accents.
16
 
17
  Example:
18
  - "What is the impact of deep sea mining ?" -> ["deep sea mining"]
19
+ - "Quel est l'impact de l'exploitation minière en haute mer ?" -> ["deep sea mining"]
20
  - "How will El Nino be impacted by climate change" -> ["el nino"]
21
  - "Is climate change a hoax" -> [Climate change","hoax"]
22
  """
climateqa/engine/reranker.py CHANGED
@@ -1,11 +1,14 @@
1
  import os
 
2
  from scipy.special import expit, logit
3
  from rerankers import Reranker
 
4
 
 
5
 
6
- def get_reranker(model = "nano",cohere_api_key = None):
7
 
8
- assert model in ["nano","tiny","small","large"]
9
 
10
  if model == "nano":
11
  reranker = Reranker('ms-marco-TinyBERT-L-2-v2', model_type='flashrank')
@@ -17,11 +20,18 @@ def get_reranker(model = "nano",cohere_api_key = None):
17
  if cohere_api_key is None:
18
  cohere_api_key = os.environ["COHERE_API_KEY"]
19
  reranker = Reranker("cohere", lang='en', api_key = cohere_api_key)
 
 
 
 
 
20
  return reranker
21
 
22
 
23
 
24
  def rerank_docs(reranker,docs,query):
 
 
25
 
26
  # Get a list of texts from langchain docs
27
  input_docs = [x.page_content for x in docs]
 
1
  import os
2
+ from dotenv import load_dotenv
3
  from scipy.special import expit, logit
4
  from rerankers import Reranker
5
+ from sentence_transformers import CrossEncoder
6
 
7
+ load_dotenv()
8
 
9
+ def get_reranker(model = "nano", cohere_api_key = None):
10
 
11
+ assert model in ["nano","tiny","small","large", "jina"]
12
 
13
  if model == "nano":
14
  reranker = Reranker('ms-marco-TinyBERT-L-2-v2', model_type='flashrank')
 
20
  if cohere_api_key is None:
21
  cohere_api_key = os.environ["COHERE_API_KEY"]
22
  reranker = Reranker("cohere", lang='en', api_key = cohere_api_key)
23
+ elif model == "jina":
24
+ # Reached token quota so does not work
25
+ reranker = Reranker("jina-reranker-v2-base-multilingual", api_key = os.getenv("JINA_RERANKER_API_KEY"))
26
+ # does not work without a GPU? It also returns a different result structure, so the retriever node code would need to be adapted
27
+ # reranker = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", automodel_args={"torch_dtype": "auto"}, trust_remote_code=True,)
28
  return reranker
29
 
30
 
31
 
32
  def rerank_docs(reranker,docs,query):
33
+ if docs == []:
34
+ return []
35
 
36
  # Get a list of texts from langchain docs
37
  input_docs = [x.page_content for x in docs]
climateqa/engine/vectorstore.py CHANGED
@@ -4,6 +4,7 @@
4
  import os
5
  from pinecone import Pinecone
6
  from langchain_community.vectorstores import Pinecone as PineconeVectorstore
 
7
 
8
  # LOAD ENVIRONMENT VARIABLES
9
  try:
@@ -13,7 +14,12 @@ except:
13
  pass
14
 
15
 
16
- def get_pinecone_vectorstore(embeddings,text_key = "content"):
 
 
 
 
 
17
 
18
  # # initialize pinecone
19
  # pinecone.init(
@@ -27,7 +33,7 @@ def get_pinecone_vectorstore(embeddings,text_key = "content"):
27
  # return vectorstore
28
 
29
  pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
30
- index = pc.Index(os.getenv("PINECONE_API_INDEX"))
31
 
32
  vectorstore = PineconeVectorstore(
33
  index, embeddings, text_key,
 
4
  import os
5
  from pinecone import Pinecone
6
  from langchain_community.vectorstores import Pinecone as PineconeVectorstore
7
+ from langchain_chroma import Chroma
8
 
9
  # LOAD ENVIRONMENT VARIABLES
10
  try:
 
14
  pass
15
 
16
 
17
+ def get_chroma_vectorstore(embedding_function, persist_directory="/home/dora/climate-question-answering/data/vectorstore"):
18
+ vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embedding_function)
19
+ return vectorstore
20
+
21
+
22
+ def get_pinecone_vectorstore(embeddings,text_key = "content", index_name = os.getenv("PINECONE_API_INDEX")):
23
 
24
  # # initialize pinecone
25
  # pinecone.init(
 
33
  # return vectorstore
34
 
35
  pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
36
+ index = pc.Index(index_name)
37
 
38
  vectorstore = PineconeVectorstore(
39
  index, embeddings, text_key,
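Usage sketch (embedding objects and environment variables are assumed; the OWID index name below is hypothetical):
# vectorstore_ipcc   = get_pinecone_vectorstore(embeddings, index_name=os.getenv("PINECONE_API_INDEX"))
# vectorstore_graphs = get_pinecone_vectorstore(embeddings, index_name=os.getenv("PINECONE_INDEX_OWID"))  # hypothetical env var
# vectorstore_local  = get_chroma_vectorstore(embedding_function=embeddings)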
climateqa/event_handler.py ADDED
@@ -0,0 +1,123 @@
1
+ from langchain_core.runnables.schema import StreamEvent
2
+ from gradio import ChatMessage
3
+ from climateqa.engine.chains.prompts import audience_prompts
4
+ from front.utils import make_html_source,parse_output_llm_with_sources,serialize_docs,make_toolbox,generate_html_graphs
5
+ import numpy as np
6
+
7
+ def init_audience(audience :str) -> str:
8
+ if audience == "Children":
9
+ audience_prompt = audience_prompts["children"]
10
+ elif audience == "General public":
11
+ audience_prompt = audience_prompts["general"]
12
+ elif audience == "Experts":
13
+ audience_prompt = audience_prompts["experts"]
14
+ else:
15
+ audience_prompt = audience_prompts["experts"]
16
+ return audience_prompt
17
+
18
+ def handle_retrieved_documents(event: StreamEvent, history : list[ChatMessage], used_documents : list[str]) -> tuple[str, list[ChatMessage], list[str]]:
19
+ """
20
+ Handles the retrieved documents and returns the HTML representation of the documents
21
+
22
+ Args:
23
+ event (StreamEvent): The event containing the retrieved documents
24
+ history (list[ChatMessage]): The current message history
25
+ used_documents (list[str]): The list of used documents
26
+
27
+ Returns:
28
+ tuple[str, list[ChatMessage], list[str]]: The updated HTML representation of the documents, the updated message history and the updated list of used documents
29
+ """
30
+ try:
31
+ docs = event["data"]["output"]["documents"]
32
+ docs_html = []
33
+ textual_docs = [d for d in docs if d.metadata["chunk_type"] == "text"]
34
+ for i, d in enumerate(textual_docs, 1):
35
+ if d.metadata["chunk_type"] == "text":
36
+ docs_html.append(make_html_source(d, i))
37
+
38
+ used_documents = used_documents + [f"{d.metadata['short_name']} - {d.metadata['name']}" for d in docs]
39
+ if used_documents!=[]:
40
+ history[-1].content = "Adding sources:\n\n - " + "\n - ".join(np.unique(used_documents))
41
+
42
+ docs_html = "".join(docs_html)
43
+
44
+ related_contents = event["data"]["output"]["related_contents"]
45
+
46
+ except Exception as e:
47
+ print(f"Error getting documents: {e}")
48
+ print(event)
49
+ return docs, docs_html, history, used_documents, related_contents
50
+
51
+ def stream_answer(history: list[ChatMessage], event : StreamEvent, start_streaming : bool, answer_message_content : str)-> tuple[list[ChatMessage], bool, str]:
52
+ """
53
+ Handles the streaming of the answer and updates the history with the new message content
54
+
55
+ Args:
56
+ history (list[ChatMessage]): The current message history
57
+ event (StreamEvent): The event containing the streamed answer
58
+ start_streaming (bool): A flag indicating if the streaming has started
59
+ new_message_content (str): The content of the new message
60
+
61
+ Returns:
62
+ tuple[list[ChatMessage], bool, str]: The updated history, the updated streaming flag and the updated message content
63
+ """
64
+ if start_streaming == False:
65
+ start_streaming = True
66
+ history.append(ChatMessage(role="assistant", content = ""))
67
+ answer_message_content += event["data"]["chunk"].content
68
+ answer_message_content = parse_output_llm_with_sources(answer_message_content)
69
+ history[-1] = ChatMessage(role="assistant", content = answer_message_content)
70
+ # history.append(ChatMessage(role="assistant", content = new_message_content))
71
+ return history, start_streaming, answer_message_content
72
+
73
+ def handle_retrieved_owid_graphs(event :StreamEvent, graphs_html: str) -> str:
74
+ """
75
+ Handles the retrieved OWID graphs and returns the HTML representation of the graphs
76
+
77
+ Args:
78
+ event (StreamEvent): The event containing the retrieved graphs
79
+ graphs_html (str): The current HTML representation of the graphs
80
+
81
+ Returns:
82
+ str: The updated HTML representation
83
+ """
84
+ try:
85
+ recommended_content = event["data"]["output"]["recommended_content"]
86
+
87
+ unique_graphs = []
88
+ seen_embeddings = set()
89
+
90
+ for x in recommended_content:
91
+ embedding = x.metadata["returned_content"]
92
+
93
+ # Check if the embedding has already been seen
94
+ if embedding not in seen_embeddings:
95
+ unique_graphs.append({
96
+ "embedding": embedding,
97
+ "metadata": {
98
+ "source": x.metadata["source"],
99
+ "category": x.metadata["category"]
100
+ }
101
+ })
102
+ # Add the embedding to the seen set
103
+ seen_embeddings.add(embedding)
104
+
105
+
106
+ categories = {}
107
+ for graph in unique_graphs:
108
+ category = graph['metadata']['category']
109
+ if category not in categories:
110
+ categories[category] = []
111
+ categories[category].append(graph['embedding'])
112
+
113
+
114
+ for category, embeddings in categories.items():
115
+ graphs_html += f"<h3>{category}</h3>"
116
+ for embedding in embeddings:
117
+ graphs_html += f"<div>{embedding}</div>"
118
+
119
+
120
+ except Exception as e:
121
+ print(f"Error getting graphs: {e}")
122
+
123
+ return graphs_html
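A rough consumer loop for these handlers (a sketch only; the event names are assumed to match the node names defined in graph.py and the LangChain astream_events v1 schema):
# async for event in app.astream_events(state, version="v1"):
#     if event["event"] == "on_chain_end" and event["name"] == "retrieve_documents":
#         docs, docs_html, history, used_documents, related_contents = \
#             handle_retrieved_documents(event, history, used_documents)
#     elif event["event"] == "on_chain_end" and event["name"] == "retrieve_graphs":
#         graphs_html = handle_retrieved_owid_graphs(event, graphs_html)
#     elif event["event"] == "on_chat_model_stream":
#         history, start_streaming, answer_message_content = \
#             stream_answer(history, event, start_streaming, answer_message_content)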
climateqa/knowledge/openalex.py CHANGED
@@ -41,6 +41,10 @@ class OpenAlex():
41
  break
42
 
43
  df_works = pd.DataFrame(page)
 
 
 
 
44
  df_works = df_works.dropna(subset = ["title"])
45
  df_works["primary_location"] = df_works["primary_location"].map(replace_nan_with_empty_dict)
46
  df_works["abstract"] = df_works["abstract_inverted_index"].apply(lambda x: self.get_abstract_from_inverted_index(x)).fillna("")
@@ -51,8 +55,9 @@ class OpenAlex():
51
  df_works["num_tokens"] = df_works["content"].map(lambda x : num_tokens_from_string(x))
52
 
53
  df_works = df_works.drop(columns = ["abstract_inverted_index"])
54
- # df_works["subtitle"] = df_works["title"] + " - " + df_works["primary_location"]["source"]["display_name"] + " - " + df_works["publication_year"]
55
-
 
56
  return df_works
57
  else:
58
  raise Exception("Keywords must be a string")
@@ -62,11 +67,10 @@ class OpenAlex():
62
 
63
  scores = reranker.rank(
64
  query,
65
- df["content"].tolist(),
66
- top_k = len(df),
67
  )
68
- scores.sort(key = lambda x : x["corpus_id"])
69
- scores = [x["score"] for x in scores]
70
  df["rerank_score"] = scores
71
  return df
72
 
 
41
  break
42
 
43
  df_works = pd.DataFrame(page)
44
+
45
+ if df_works.empty:
46
+ return df_works
47
+
48
  df_works = df_works.dropna(subset = ["title"])
49
  df_works["primary_location"] = df_works["primary_location"].map(replace_nan_with_empty_dict)
50
  df_works["abstract"] = df_works["abstract_inverted_index"].apply(lambda x: self.get_abstract_from_inverted_index(x)).fillna("")
 
55
  df_works["num_tokens"] = df_works["content"].map(lambda x : num_tokens_from_string(x))
56
 
57
  df_works = df_works.drop(columns = ["abstract_inverted_index"])
58
+ df_works["display_name"] = df_works["primary_location"].apply(lambda x :x["source"] if type(x) == dict and 'source' in x else "").apply(lambda x : x["display_name"] if type(x) == dict and "display_name" in x else "")
59
+ df_works["subtitle"] = df_works["title"].astype(str) + " - " + df_works["display_name"].astype(str) + " - " + df_works["publication_year"].astype(str)
60
+
61
  return df_works
62
  else:
63
  raise Exception("Keywords must be a string")
 
67
 
68
  scores = reranker.rank(
69
  query,
70
+ df["content"].tolist()
 
71
  )
72
+ scores = sorted(scores.results, key = lambda x : x.document.doc_id)
73
+ scores = [x.score for x in scores]
74
  df["rerank_score"] = scores
75
  return df
76
 
climateqa/knowledge/retriever.py CHANGED
@@ -1,81 +1,102 @@
1
- # https://github.com/langchain-ai/langchain/issues/8623
2
-
3
- import pandas as pd
4
-
5
- from langchain_core.retrievers import BaseRetriever
6
- from langchain_core.vectorstores import VectorStoreRetriever
7
- from langchain_core.documents.base import Document
8
- from langchain_core.vectorstores import VectorStore
9
- from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
10
-
11
- from typing import List
12
- from pydantic import Field
13
-
14
- class ClimateQARetriever(BaseRetriever):
15
- vectorstore:VectorStore
16
- sources:list = ["IPCC","IPBES","IPOS"]
17
- reports:list = []
18
- threshold:float = 0.6
19
- k_summary:int = 3
20
- k_total:int = 10
21
- namespace:str = "vectors",
22
- min_size:int = 200,
23
-
24
-
25
- def _get_relevant_documents(
26
- self, query: str, *, run_manager: CallbackManagerForRetrieverRun
27
- ) -> List[Document]:
28
-
29
- # Check if all elements in the list are either IPCC or IPBES
30
- assert isinstance(self.sources,list)
31
- assert all([x in ["IPCC","IPBES","IPOS"] for x in self.sources])
32
- assert self.k_total > self.k_summary, "k_total should be greater than k_summary"
33
-
34
- # Prepare base search kwargs
35
- filters = {}
36
-
37
- if len(self.reports) > 0:
38
- filters["short_name"] = {"$in":self.reports}
39
- else:
40
- filters["source"] = { "$in":self.sources}
41
-
42
- # Search for k_summary documents in the summaries dataset
43
- filters_summaries = {
44
- **filters,
45
- "report_type": { "$in":["SPM"]},
46
- }
47
-
48
- docs_summaries = self.vectorstore.similarity_search_with_score(query=query,filter = filters_summaries,k = self.k_summary)
49
- docs_summaries = [x for x in docs_summaries if x[1] > self.threshold]
50
-
51
- # Search for k_total - k_summary documents in the full reports dataset
52
- filters_full = {
53
- **filters,
54
- "report_type": { "$nin":["SPM"]},
55
- }
56
- k_full = self.k_total - len(docs_summaries)
57
- docs_full = self.vectorstore.similarity_search_with_score(query=query,filter = filters_full,k = k_full)
58
-
59
- # Concatenate documents
60
- docs = docs_summaries + docs_full
61
-
62
- # Filter if scores are below threshold
63
- docs = [x for x in docs if len(x[0].page_content) > self.min_size]
64
- # docs = [x for x in docs if x[1] > self.threshold]
65
-
66
- # Add score to metadata
67
- results = []
68
- for i,(doc,score) in enumerate(docs):
69
- doc.page_content = doc.page_content.replace("\r\n"," ")
70
- doc.metadata["similarity_score"] = score
71
- doc.metadata["content"] = doc.page_content
72
- doc.metadata["page_number"] = int(doc.metadata["page_number"]) + 1
73
- # doc.page_content = f"""Doc {i+1} - {doc.metadata['short_name']}: {doc.page_content}"""
74
- results.append(doc)
75
-
76
- # Sort by score
77
- # results = sorted(results,key = lambda x : x.metadata["similarity_score"],reverse = True)
78
-
79
- return results
80
-
81
-
1
+ # # https://github.com/langchain-ai/langchain/issues/8623
2
+
3
+ # import pandas as pd
4
+
5
+ # from langchain_core.retrievers import BaseRetriever
6
+ # from langchain_core.vectorstores import VectorStoreRetriever
7
+ # from langchain_core.documents.base import Document
8
+ # from langchain_core.vectorstores import VectorStore
9
+ # from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
10
+
11
+ # from typing import List
12
+ # from pydantic import Field
13
+
14
+ # def _add_metadata_and_score(docs: List) -> Document:
15
+ # # Add score to metadata
16
+ # docs_with_metadata = []
17
+ # for i,(doc,score) in enumerate(docs):
18
+ # doc.page_content = doc.page_content.replace("\r\n"," ")
19
+ # doc.metadata["similarity_score"] = score
20
+ # doc.metadata["content"] = doc.page_content
21
+ # doc.metadata["page_number"] = int(doc.metadata["page_number"]) + 1
22
+ # # doc.page_content = f"""Doc {i+1} - {doc.metadata['short_name']}: {doc.page_content}"""
23
+ # docs_with_metadata.append(doc)
24
+ # return docs_with_metadata
25
+
26
+ # class ClimateQARetriever(BaseRetriever):
27
+ # vectorstore:VectorStore
28
+ # sources:list = ["IPCC","IPBES","IPOS"]
29
+ # reports:list = []
30
+ # threshold:float = 0.6
31
+ # k_summary:int = 3
32
+ # k_total:int = 10
33
+ # namespace:str = "vectors",
34
+ # min_size:int = 200,
35
+
36
+
37
+
38
+ # def _get_relevant_documents(
39
+ # self, query: str, *, run_manager: CallbackManagerForRetrieverRun
40
+ # ) -> List[Document]:
41
+
42
+ # # Check if all elements in the list are either IPCC or IPBES
43
+ # assert isinstance(self.sources,list)
44
+ # assert self.sources
45
+ # assert all([x in ["IPCC","IPBES","IPOS"] for x in self.sources])
46
+ # assert self.k_total > self.k_summary, "k_total should be greater than k_summary"
47
+
48
+ # # Prepare base search kwargs
49
+ # filters = {}
50
+
51
+ # if len(self.reports) > 0:
52
+ # filters["short_name"] = {"$in":self.reports}
53
+ # else:
54
+ # filters["source"] = { "$in":self.sources}
55
+
56
+ # # Search for k_summary documents in the summaries dataset
57
+ # filters_summaries = {
58
+ # **filters,
59
+ # "chunk_type":"text",
60
+ # "report_type": { "$in":["SPM"]},
61
+ # }
62
+
63
+ # docs_summaries = self.vectorstore.similarity_search_with_score(query=query,filter = filters_summaries,k = self.k_summary)
64
+ # docs_summaries = [x for x in docs_summaries if x[1] > self.threshold]
65
+ # # docs_summaries = []
66
+
67
+ # # Search for k_total - k_summary documents in the full reports dataset
68
+ # filters_full = {
69
+ # **filters,
70
+ # "chunk_type":"text",
71
+ # "report_type": { "$nin":["SPM"]},
72
+ # }
73
+ # k_full = self.k_total - len(docs_summaries)
74
+ # docs_full = self.vectorstore.similarity_search_with_score(query=query,filter = filters_full,k = k_full)
75
+
76
+ # # Images
77
+ # filters_image = {
78
+ # **filters,
79
+ # "chunk_type":"image"
80
+ # }
81
+ # docs_images = self.vectorstore.similarity_search_with_score(query=query,filter = filters_image,k = k_full)
82
+
83
+ # # docs_images = []
84
+
85
+ # # Concatenate documents
86
+ # # docs = docs_summaries + docs_full + docs_images
87
+
88
+ # # Filter if scores are below threshold
89
+ # # docs = [x for x in docs if x[1] > self.threshold]
90
+
91
+ # docs_summaries, docs_full, docs_images = _add_metadata_and_score(docs_summaries), _add_metadata_and_score(docs_full), _add_metadata_and_score(docs_images)
92
+
93
+ # # Filter if length are below threshold
94
+ # docs_summaries = [x for x in docs_summaries if len(x.page_content) > self.min_size]
95
+ # docs_full = [x for x in docs_full if len(x.page_content) > self.min_size]
96
+
97
+
98
+ # return {
99
+ # "docs_summaries" : docs_summaries,
100
+ # "docs_full" : docs_full,
101
+ # "docs_images" : docs_images,
102
+ # }
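Although the module is now entirely commented out, the disabled code still documents the retrieval strategy: derive a metadata filter from the selected reports or sources, query SPM summary chunks first, fill the remaining budget from full-report text chunks, and fetch image chunks separately. A minimal sketch of just the filter-building step, assuming the same Pinecone-style `$in`/`$nin` operators used above (the function name is illustrative):

```python
def build_filters(sources: list, reports: list) -> dict:
    """Rebuild the metadata filters used by the (now disabled) ClimateQARetriever."""
    base = {"short_name": {"$in": reports}} if reports else {"source": {"$in": sources}}
    return {
        "summaries": {**base, "chunk_type": "text", "report_type": {"$in": ["SPM"]}},
        "full": {**base, "chunk_type": "text", "report_type": {"$nin": ["SPM"]}},
        "images": {**base, "chunk_type": "image"},
    }
```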
climateqa/utils.py CHANGED
@@ -20,3 +20,16 @@ def get_image_from_azure_blob_storage(path):
20
  file_object = get_file_from_azure_blob_storage(path)
21
  image = Image.open(file_object)
22
  return image
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
  file_object = get_file_from_azure_blob_storage(path)
21
  image = Image.open(file_object)
22
  return image
23
+
24
+ def remove_duplicates_keep_highest_score(documents):
25
+ unique_docs = {}
26
+
27
+ for doc in documents:
28
+ doc_id = doc.metadata.get('doc_id')
29
+ if doc_id in unique_docs:
30
+ if doc.metadata['reranking_score'] > unique_docs[doc_id].metadata['reranking_score']:
31
+ unique_docs[doc_id] = doc
32
+ else:
33
+ unique_docs[doc_id] = doc
34
+
35
+ return list(unique_docs.values())
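A short usage sketch for the helper above; the documents are dummy examples, but they carry the `doc_id` and `reranking_score` metadata the function expects, and the import path assumes the repository layout shown in this diff:

```python
from langchain_core.documents.base import Document
from climateqa.utils import remove_duplicates_keep_highest_score  # added in this diff

docs = [
    Document(page_content="chunk A", metadata={"doc_id": 1, "reranking_score": 0.42}),
    Document(page_content="chunk A bis", metadata={"doc_id": 1, "reranking_score": 0.87}),
    Document(page_content="chunk B", metadata={"doc_id": 2, "reranking_score": 0.55}),
]

deduped = remove_duplicates_keep_highest_score(docs)
# keeps "chunk A bis" (higher score for doc_id 1) and "chunk B"
```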
front/utils.py CHANGED
@@ -1,12 +1,19 @@
1
 
2
  import re
 
 
 
 
 
 
3
 
4
- def make_pairs(lst):
 
5
  """from a list of even lenght, make tupple pairs"""
6
  return [(lst[i], lst[i + 1]) for i in range(0, len(lst), 2)]
7
 
8
 
9
- def serialize_docs(docs):
10
  new_docs = []
11
  for doc in docs:
12
  new_doc = {}
@@ -17,7 +24,7 @@ def serialize_docs(docs):
17
 
18
 
19
 
20
- def parse_output_llm_with_sources(output):
21
  # Split the content into a list of text and "[Doc X]" references
22
  content_parts = re.split(r'\[(Doc\s?\d+(?:,\s?Doc\s?\d+)*)\]', output)
23
  parts = []
@@ -32,6 +39,119 @@ def parse_output_llm_with_sources(output):
32
  content_parts = "".join(parts)
33
  return content_parts
34
 
35
 
36
  def make_html_source(source,i):
37
  meta = source.metadata
@@ -108,6 +228,31 @@ def make_html_source(source,i):
108
  return card
109
 
110
 
111
  def make_html_figure_sources(source,i,img_str):
112
  meta = source.metadata
113
  content = source.page_content.strip()
 
1
 
2
  import re
3
+ from collections import defaultdict
4
+ from climateqa.utils import get_image_from_azure_blob_storage
5
+ from climateqa.engine.chains.prompts import audience_prompts
6
+ from PIL import Image
7
+ from io import BytesIO
8
+ import base64
9
 
10
+
11
+ def make_pairs(lst:list)->list:
12
  """from a list of even lenght, make tupple pairs"""
13
  return [(lst[i], lst[i + 1]) for i in range(0, len(lst), 2)]
14
 
15
 
16
+ def serialize_docs(docs:list)->list:
17
  new_docs = []
18
  for doc in docs:
19
  new_doc = {}
 
24
 
25
 
26
 
27
+ def parse_output_llm_with_sources(output:str)->str:
28
  # Split the content into a list of text and "[Doc X]" references
29
  content_parts = re.split(r'\[(Doc\s?\d+(?:,\s?Doc\s?\d+)*)\]', output)
30
  parts = []
 
39
  content_parts = "".join(parts)
40
  return content_parts
41
 
42
+ def process_figures(docs:list)->tuple:
43
+ gallery=[]
44
+ used_figures =[]
45
+ figures = '<div class="figures-container"><p></p> </div>'
46
+ docs_figures = [d for d in docs if d.metadata["chunk_type"] == "image"]
47
+ for i, doc in enumerate(docs_figures):
48
+ if doc.metadata["chunk_type"] == "image":
49
+ if doc.metadata["figure_code"] != "N/A":
50
+ title = f"{doc.metadata['figure_code']} - {doc.metadata['short_name']}"
51
+ else:
52
+ title = f"{doc.metadata['short_name']}"
53
+
54
+
55
+ if title not in used_figures:
56
+ used_figures.append(title)
57
+ try:
58
+ key = f"Image {i+1}"
59
+
60
+ image_path = doc.metadata["image_path"].split("documents/")[1]
61
+ img = get_image_from_azure_blob_storage(image_path)
62
+
63
+ # Convert the image to a byte buffer
64
+ buffered = BytesIO()
65
+ max_image_length = 500
66
+ img_resized = img.resize((max_image_length, int(max_image_length * img.size[1]/img.size[0])))
67
+ img_resized.save(buffered, format="PNG")
68
+
69
+ img_str = base64.b64encode(buffered.getvalue()).decode()
70
+
71
+ figures = figures + make_html_figure_sources(doc, i, img_str)
72
+ gallery.append(img)
73
+ except Exception as e:
74
+ print(f"Skipped adding image {i} because of {e}")
75
+
76
+ return figures, gallery
77
+
78
+
79
+ def generate_html_graphs(graphs:list)->str:
80
+ # Organize graphs by category
81
+ categories = defaultdict(list)
82
+ for graph in graphs:
83
+ category = graph['metadata']['category']
84
+ categories[category].append(graph['embedding'])
85
+
86
+ # Begin constructing the HTML
87
+ html_code = '''
88
+ <!DOCTYPE html>
89
+ <html lang="en">
90
+ <head>
91
+ <meta charset="UTF-8">
92
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
93
+ <title>Graphs by Category</title>
94
+ <style>
95
+ .tab-content {
96
+ display: none;
97
+ }
98
+ .tab-content.active {
99
+ display: block;
100
+ }
101
+ .tabs {
102
+ margin-bottom: 20px;
103
+ }
104
+ .tab-button {
105
+ background-color: #ddd;
106
+ border: none;
107
+ padding: 10px 20px;
108
+ cursor: pointer;
109
+ margin-right: 5px;
110
+ }
111
+ .tab-button.active {
112
+ background-color: #ccc;
113
+ }
114
+ </style>
115
+ <script>
116
+ function showTab(tabId) {
117
+ var contents = document.getElementsByClassName('tab-content');
118
+ var buttons = document.getElementsByClassName('tab-button');
119
+ for (var i = 0; i < contents.length; i++) {
120
+ contents[i].classList.remove('active');
121
+ buttons[i].classList.remove('active');
122
+ }
123
+ document.getElementById(tabId).classList.add('active');
124
+ document.querySelector('button[data-tab="'+tabId+'"]').classList.add('active');
125
+ }
126
+ </script>
127
+ </head>
128
+ <body>
129
+ <div class="tabs">
130
+ '''
131
+
132
+ # Add buttons for each category
133
+ for i, category in enumerate(categories.keys()):
134
+ active_class = 'active' if i == 0 else ''
135
+ html_code += f'<button class="tab-button {active_class}" onclick="showTab(\'tab-{i}\')" data-tab="tab-{i}">{category}</button>'
136
+
137
+ html_code += '</div>'
138
+
139
+ # Add content for each category
140
+ for i, (category, embeds) in enumerate(categories.items()):
141
+ active_class = 'active' if i == 0 else ''
142
+ html_code += f'<div id="tab-{i}" class="tab-content {active_class}">'
143
+ for embed in embeds:
144
+ html_code += embed
145
+ html_code += '</div>'
146
+
147
+ html_code += '''
148
+ </body>
149
+ </html>
150
+ '''
151
+
152
+ return html_code
153
+
154
+
155
 
156
  def make_html_source(source,i):
157
  meta = source.metadata
 
228
  return card
229
 
230
 
231
+ def make_html_papers(df,i):
232
+ title = df['title'][i]
233
+ content = df['abstract'][i]
234
+ url = df['doi'][i]
235
+ publication_date = df['publication_year'][i]
236
+ subtitle = df['subtitle'][i]
237
+
238
+ card = f"""
239
+ <div class="card" id="doc{i}">
240
+ <div class="card-content">
241
+ <h2>Doc {i+1} - {title}</h2>
242
+ <p>{content}</p>
243
+ </div>
244
+ <div class="card-footer">
245
+ <span>{subtitle}</span>
246
+ <a href="{url}" target="_blank" class="pdf-link">
247
+ <span role="img" aria-label="Open paper">🔗</span>
248
+ </a>
249
+ </div>
250
+ </div>
251
+ """
252
+
253
+ return card
254
+
255
+
256
  def make_html_figure_sources(source,i,img_str):
257
  meta = source.metadata
258
  content = source.page_content.strip()
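`make_html_papers` expects the OpenAlex DataFrame produced earlier in this diff, i.e. one row per paper with `title`, `abstract`, `doi`, `publication_year` and the new `subtitle` column. A hedged usage sketch with dummy data, assuming front/utils.py is importable as `front.utils`:

```python
import pandas as pd
from front.utils import make_html_papers  # assumes the repo layout shown in this diff

df_papers = pd.DataFrame({
    "title": ["Sample climate paper"],
    "abstract": ["A short placeholder abstract."],
    "doi": ["https://doi.org/10.0000/example"],
    "publication_year": [2024],
    "subtitle": ["Sample climate paper - Example Journal - 2024"],
})

papers_html = "".join(make_html_papers(df_papers, i) for i in range(len(df_papers)))
```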
sandbox/20240310 - CQA - Semantic Routing 1.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
sandbox/20240702 - CQA - Graph Functionality.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
sandbox/20241104 - CQA - StepByStep CQA.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
style.css CHANGED
@@ -3,6 +3,61 @@
3
  --user-image: url('https://ih1.redbubble.net/image.4776899543.6215/st,small,507x507-pad,600x600,f8f8f8.jpg');
4
  } */
5
 
 
7
  /* fix for huggingface infinite growth*/
8
  main.flex.flex-1.flex-col {
@@ -85,7 +140,12 @@ body.dark .tip-box * {
85
  font-size:14px !important;
86
 
87
  }
88
-
 
 
 
 
 
89
 
90
  a {
91
  text-decoration: none;
@@ -161,60 +221,111 @@ a {
161
  border:none;
162
  }
163
 
164
- /* .gallery-item > div:hover{
165
- background-color:#7494b0 !important;
166
- color:white!important;
167
- }
168
 
169
- .gallery-item:hover{
170
- border:#7494b0 !important;
171
  }
172
 
173
- .gallery-item > div{
174
- background-color:white !important;
175
- color:#577b9b!important;
176
  }
177
 
178
- .label{
179
- color:#577b9b!important;
180
- } */
181
 
182
- /* .paginate{
183
- color:#577b9b!important;
 
 
184
  } */
185
 
186
 
 
 
 
 
187
 
188
- /* span[data-testid="block-info"]{
189
- background:none !important;
190
- color:#577b9b;
191
- } */
192
 
193
- /* Pseudo-element for the circularly cropped picture */
194
- /* .message.bot::before {
195
- content: '';
196
  position: absolute;
197
- top: -10px;
198
- left: -10px;
199
- width: 30px;
200
- height: 30px;
201
- background-image: var(--user-image);
202
- background-size: cover;
203
- background-position: center;
 
 
204
  border-radius: 50%;
205
- z-index: 10;
206
- }
207
- */
208
-
209
- label.selected{
210
- background:none !important;
211
  }
212
-
213
- #submit-button{
214
- padding:0px !important;
215
  }
216
 
 
 
217
  @media screen and (min-width: 1024px) {
 
 
 
 
 
 
218
  .gradio-container {
219
  max-height: calc(100vh - 190px) !important;
220
  overflow: hidden;
@@ -225,6 +336,8 @@ label.selected{
225
 
226
  } */
227
 
 
 
228
  div#tab-examples{
229
  height:calc(100vh - 190px) !important;
230
  overflow-y: scroll !important;
@@ -236,6 +349,10 @@ label.selected{
236
  overflow-y: scroll !important;
237
  /* overflow-y: auto !important; */
238
  }
 
 
 
 
239
 
240
  div#sources-figures{
241
  height:calc(100vh - 300px) !important;
@@ -243,6 +360,18 @@ label.selected{
243
  overflow-y: scroll !important;
244
  }
245
 
 
  div#tab-config{
247
  height:calc(100vh - 190px) !important;
248
  overflow-y: scroll !important;
@@ -409,8 +538,7 @@ span.chatbot > p > img{
409
  }
410
 
411
  #dropdown-samples{
412
- /*! border:none !important; */
413
- /*! border-width:0px !important; */
414
  background:none !important;
415
 
416
  }
@@ -468,6 +596,10 @@ span.chatbot > p > img{
468
  input[type="checkbox"]:checked + .dropdown-content {
469
  display: block;
470
  }
 
 
 
 
471
 
472
  .dropdown-content {
473
  display: none;
@@ -489,7 +621,7 @@ span.chatbot > p > img{
489
  border-bottom: 5px solid black;
490
  }
491
 
492
- .loader {
493
  border: 1px solid #d0d0d0 !important; /* Light grey background */
494
  border-top: 1px solid #db3434 !important; /* Blue color */
495
  border-right: 1px solid #3498db !important; /* Blue color */
@@ -499,41 +631,64 @@ span.chatbot > p > img{
499
  animation: spin 2s linear infinite;
500
  display:inline-block;
501
  margin-right:10px !important;
502
- }
503
 
504
- .checkmark{
505
  color:green !important;
506
  font-size:18px;
507
  margin-right:10px !important;
508
- }
509
 
510
- @keyframes spin {
511
  0% { transform: rotate(0deg); }
512
  100% { transform: rotate(360deg); }
513
- }
514
 
515
 
516
- .relevancy-score{
517
  margin-top:10px !important;
518
  font-size:10px !important;
519
  font-style:italic;
520
- }
521
 
522
- .score-green{
523
  color:green !important;
524
- }
525
 
526
- .score-orange{
527
  color:orange !important;
528
- }
529
 
530
- .score-red{
531
  color:red !important;
532
- }
533
  .message-buttons-left.panel.message-buttons.with-avatar {
534
  display: none;
535
  }
536
 
 
537
  /* Specific fixes for Hugging Face Space iframe */
538
  .h-full {
539
  height: auto !important;
 
3
  --user-image: url('https://ih1.redbubble.net/image.4776899543.6215/st,small,507x507-pad,600x600,f8f8f8.jpg');
4
  } */
5
 
6
+ #tab-recommended_content{
7
+ padding-top: 0px;
8
+ padding-left : 0px;
9
+ padding-right: 0px;
10
+ }
11
+ #group-subtabs {
12
+ /* display: block; */
13
+ width: 100%; /* Ensures the parent uses the full width */
14
+ position : sticky;
15
+ }
16
+
17
+ #group-subtabs .tab-container {
18
+ display: flex;
19
+ text-align: center;
20
+ width: 100%; /* Ensures the tabs span the full width */
21
+ }
22
+
23
+ #group-subtabs .tab-container button {
24
+ flex: 1; /* Makes each button take equal width */
25
+ }
26
+
27
+
28
+ #papers-summary-popup button span{
29
+ /* make the accordion label bold, centered, and bigger */
30
+ font-size: 16px;
31
+ font-weight: bold;
32
+ text-align: center;
33
+
34
+ }
35
+
36
+ #papers-relevant-popup span{
37
+ /* make the accordion label bold, centered, and bigger */
38
+ font-size: 16px;
39
+ font-weight: bold;
40
+ text-align: center;
41
+ }
42
+
43
+
44
+
45
+ #tab-citations .button{
46
+ padding: 12px 16px;
47
+ font-size: 16px;
48
+ font-weight: bold;
49
+ cursor: pointer;
50
+ border: none;
51
+ outline: none;
52
+ text-align: left;
53
+ transition: background-color 0.3s ease;
54
+ }
55
+
56
+
57
+ .gradio-container {
58
+ width: 100%!important;
59
+ max-width: 100% !important;
60
+ }
61
 
62
  /* fix for huggingface infinite growth*/
63
  main.flex.flex-1.flex-col {
 
140
  font-size:14px !important;
141
 
142
  }
143
+ .card-content img {
144
+ display: block;
145
+ margin: auto;
146
+ max-width: 100%; /* Ensures the image is responsive */
147
+ height: auto;
148
+ }
149
 
150
  a {
151
  text-decoration: none;
 
221
  border:none;
222
  }
223
 
 
 
 
 
224
 
225
+ label.selected{
226
+ background: #93c5fd !important;
227
  }
228
 
229
+ #submit-button{
230
+ padding:0px !important;
 
231
  }
232
 
233
+ #modal-config .block.modal-block.padded {
234
+ padding-top: 25px;
235
+ height: 100vh;
236
+
237
+ }
238
+ #modal-config .modal-container{
239
+ margin: 0px;
240
+ padding: 0px;
241
+ }
242
+ /* Modal styles */
243
+ #modal-config {
244
+ position: fixed;
245
+ top: 0;
246
+ left: 0;
247
+ height: 100vh;
248
+ width: 500px;
249
+ background-color: white;
250
+ box-shadow: 2px 0 10px rgba(0, 0, 0, 0.1);
251
+ z-index: 1000;
252
+ padding: 15px;
253
+ transform: none;
254
+ }
255
+ #modal-config .close{
256
+ display: none;
257
+ }
258
 
259
+ /* Push main content to the right when modal is open */
260
+ /* .modal ~ * {
261
+ margin-left: 300px;
262
+ transition: margin-left 0.3s ease;
263
  } */
264
 
265
+ #modal-config .modal .wrap ul{
266
+ position:static;
267
+ top: 100%;
268
+ left: 0;
269
+ /* min-height: 100px; */
270
+ height: 100%;
271
+ /* margin-top: 0; */
272
+ z-index: 9999;
273
+ pointer-events: auto;
274
+ height: 200px;
275
+ }
276
+ #config-button{
277
+ background: none;
278
+ border: none;
279
+ padding: 8px;
280
+ cursor: pointer;
281
+ width: 40px;
282
+ height: 40px;
283
+ display: flex;
284
+ align-items: center;
285
+ justify-content: center;
286
+ border-radius: 50%;
287
+ transition: background-color 0.2s;
288
+ }
289
 
290
+ #config-button::before {
291
+ content: '⚙️';
292
+ font-size: 20px;
293
+ }
294
 
295
+ #config-button:hover {
296
+ background-color: rgba(0, 0, 0, 0.1);
297
+ }
 
298
 
299
+ #checkbox-config{
300
+ display: block;
 
301
  position: absolute;
302
+ background: none;
303
+ border: none;
304
+ padding: 8px;
305
+ cursor: pointer;
306
+ width: 40px;
307
+ height: 40px;
308
+ display: flex;
309
+ align-items: center;
310
+ justify-content: center;
311
  border-radius: 50%;
312
+ transition: background-color 0.2s;
313
+ font-size: 20px;
314
+ text-align: center;
 
 
 
315
  }
316
+ #checkbox-config:checked{
317
+ display: block;
 
318
  }
319
 
320
+
321
+
322
  @media screen and (min-width: 1024px) {
323
+ /* Additional style for scrollable tab content */
324
+ /* div#tab-recommended_content {
325
+ overflow-y: auto;
326
+ max-height: 80vh;
327
+ } */
328
+
329
  .gradio-container {
330
  max-height: calc(100vh - 190px) !important;
331
  overflow: hidden;
 
336
 
337
  } */
338
 
339
+
340
+
341
  div#tab-examples{
342
  height:calc(100vh - 190px) !important;
343
  overflow-y: scroll !important;
 
349
  overflow-y: scroll !important;
350
  /* overflow-y: auto !important; */
351
  }
352
+ div#graphs-container{
353
+ height:calc(100vh - 210px) !important;
354
+ overflow-y: scroll !important;
355
+ }
356
 
357
  div#sources-figures{
358
  height:calc(100vh - 300px) !important;
 
360
  overflow-y: scroll !important;
361
  }
362
 
363
+ div#graphs-container{
364
+ height:calc(100vh - 300px) !important;
365
+ max-height: 90vh !important;
366
+ overflow-y: scroll !important;
367
+ }
368
+
369
+ div#tab-citations{
370
+ height:calc(100vh - 300px) !important;
371
+ max-height: 90vh !important;
372
+ overflow-y: scroll !important;
373
+ }
374
+
375
  div#tab-config{
376
  height:calc(100vh - 190px) !important;
377
  overflow-y: scroll !important;
 
538
  }
539
 
540
  #dropdown-samples{
541
+
 
542
  background:none !important;
543
 
544
  }
 
596
  input[type="checkbox"]:checked + .dropdown-content {
597
  display: block;
598
  }
599
+
600
+ #checkbox-chat input[type="checkbox"] {
601
+ display: flex !important;
602
+ }
603
 
604
  .dropdown-content {
605
  display: none;
 
621
  border-bottom: 5px solid black;
622
  }
623
 
624
+ .loader {
625
  border: 1px solid #d0d0d0 !important; /* Light grey background */
626
  border-top: 1px solid #db3434 !important; /* Blue color */
627
  border-right: 1px solid #3498db !important; /* Blue color */
 
631
  animation: spin 2s linear infinite;
632
  display:inline-block;
633
  margin-right:10px !important;
634
+ }
635
 
636
+ .checkmark{
637
  color:green !important;
638
  font-size:18px;
639
  margin-right:10px !important;
640
+ }
641
 
642
+ @keyframes spin {
643
  0% { transform: rotate(0deg); }
644
  100% { transform: rotate(360deg); }
645
+ }
646
 
647
 
648
+ .relevancy-score{
649
  margin-top:10px !important;
650
  font-size:10px !important;
651
  font-style:italic;
652
+ }
653
 
654
+ .score-green{
655
  color:green !important;
656
+ }
657
 
658
+ .score-orange{
659
  color:orange !important;
660
+ }
661
 
662
+ .score-red{
663
  color:red !important;
664
+ }
665
+
666
+ /* Mobile specific adjustments */
667
+ @media screen and (max-width: 767px) {
668
+ div#tab-recommended_content {
669
+ max-height: 50vh; /* Reduce height for smaller screens */
670
+ overflow-y: auto;
671
+ }
672
+ }
673
+
674
+ /* Additional style for scrollable tab content */
675
+ div#tab-saved-graphs {
676
+ overflow-y: auto; /* Enable vertical scrolling */
677
+ max-height: 80vh; /* Adjust height as needed */
678
+ }
679
+
680
+ /* Mobile specific adjustments */
681
+ @media screen and (max-width: 767px) {
682
+ div#tab-saved-graphs {
683
+ max-height: 50vh; /* Reduce height for smaller screens */
684
+ overflow-y: auto;
685
+ }
686
+ }
687
  .message-buttons-left.panel.message-buttons.with-avatar {
688
  display: none;
689
  }
690
 
691
+
692
  /* Specific fixes for Hugging Face Space iframe */
693
  .h-full {
694
  height: auto !important;