Space: MachineLearningReply/q-and-a-tool

amrohendawi committed on Apr 15, 2024

Commit 2846658 (1 parent: 3ccc981)

haystack 2.0 implementation

Files changed (11)

.gitignore +2 -1
Dockerfile +29 -0
README.md +45 -48
app.py +207 -251
authenticator_config.yaml +15 -0
document_qa_engine.py +120 -0
generate_keys.py +0 -15
hashed_password.pkl +0 -0
requirements.txt +18 -10
ml_logo.png → resources/ml_logo.png +0 -0
utils.py +58 -0

.gitignore CHANGED

@@ -2,4 +2,5 @@
 .vscode
 .idea
 *.pyc
-**/.DS_Store

 .vscode
 .idea
 *.pyc
+**/.DS_Store
+venv/

Dockerfile ADDED

	@@ -0,0 +1,29 @@

+FROM python:3.10-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    curl \
+    software-properties-common \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt .
+RUN pip3 install -r requirements.txt
+COPY . .
+# extract version
+COPY .git ./.git
+RUN git rev-parse --short HEAD > revision.txt
+RUN rm -rf ./.git
+EXPOSE 8501
+HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+ENV PYTHONPATH "${PYTHONPATH}:."
+ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
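
The `HEALTHCHECK` above probes `http://localhost:8501/_stcore/health` with `curl --fail`. For readers who want to run the same probe from a script or test, here is a minimal Python sketch using only the standard library. The helper name `streamlit_is_healthy` and its parameters are assumptions made for illustration; they are not part of this commit.

```python
import urllib.error
import urllib.request


def streamlit_is_healthy(base_url: str = "http://localhost:8501", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200, else False."""
    # Same path the Dockerfile HEALTHCHECK uses.
    url = base_url.rstrip("/") + "/_stcore/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status (raised as HTTPError).
        return False
```

Like `curl --fail`, the sketch treats any non-success status or connection error as unhealthy.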

README.md CHANGED

@@ -9,70 +9,63 @@ app_file: app.py
 pinned: false
 ---
-# Template Streamlit App for Haystack Search Pipelines
-This template [Streamlit](https://docs.streamlit.io/) app set up for simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to do QA with **Retrievel Augmented Generation**, or **Ectractive QA**
-See the ['How to use this template'](#how-to-use-this-template) instructions below to create a simple UI for your own Haystack search pipelines.
-Below you will also find instructions on how you could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
 ## Installation and Running
-To run the bare application which does _nothing_:
-1. Install requirements: `pip install -r requirements.txt`
-2. Run the streamlit app: `streamlit run app.py`
-This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll notice that the app will only show you instructions on what to edit.
-### Optional Configurations
-You can set optional cofigurations to set the:
--  `--task` you want to start the app with: `rag` or `extractive` (default: rag)
--  `--store` you want to use: `inmemory`, `opensearch`, `weaviate` or `milvus` (default: inmemory)
--  `--name` you want to have for the app. (default: 'My Search App')
-E.g.:
-```bash
-streamlit run app.py -- --store opensearch --task extractive --name 'My Opensearch Documentation Search'
-```
-In a `.env` file, include all the config settings that you would like to use based on:
-- The DocumentStore of your choice
-- The Extractive/Generative model of your choice
-While the `/utils/config.py` will create default values for some configurations, others have to be set in the `.env` such as the `OPENAI_KEY`
-Example `.env`
-```
-OPENAI_KEY=YOUR_KEY
-EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L12-v2
-GENERATIVE_MODEL=text-davinci-003
-```
-## How to use this template
-1. Create a new repository from this template or simply open it in a codespace to start playing around 💙
-2. Make sure your `requirements.txt` file includes the Haystack and Streamlit versions you would like to use.
-3. Change the code in `utils/haystack.py` if you would like a different pipeline.
-4. Create a `.env`file with all of your configuration settings.
-5. Make any UI edits you'd like to and [share with the Haystack community](https://haystack.deepeset.ai/community)
-6. Run the app as show in [installation and running](#installation-and-running)
 ### Repo structure
-- `./utils`: This is where we have 3 files:
-    - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it uses default values. An example of this is in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
-    - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and cache it, and `query()` which is the function called by `app.py` once a user query is received.
     - `ui.py`: Use this file for any UI and initial value setups.
-- `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search bar, a 'Run' button, and a response that you can highlight answers with.
 ### What to edit?
 There are default pipelines both in `start_haystack_extractive()` and `start_haystack_rag()`
 - Change the pipelines to use the embedding models, extractive or generative models as you need.
-- If using the `rag` task, change the `default_prompt_template` to use one of our available ones on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
 ## Pushing to Hugging Face Spaces 🤗
@@ -83,15 +76,19 @@ A few things to pay attention to:
 1. Create a New Space on Hugging Face with the Streamlit SDK.
 2. Create a Hugging Face token on your HF account.
 3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
-4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for your HF Space too!
-5. This readme is set up to tell HF spaces that it's using streamlit and that the app is running on `app.py`, make any changes to the frontmatter of this readme to display the title, emoji etc you desire.
-6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information, and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml) working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
 ```yaml
 name: Sync to Hugging Face hub
 on:
   push:
-    branches: [main]
   # to run this workflow manually from the Actions tab
   workflow_dispatch:

 pinned: false
 ---
+# Document Insights - Extractive & Generative Methods using Haystack
+This template [Streamlit](https://docs.streamlit.io/) app set up for
+simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to
+do QA with **Retrievel Augmented Generation**, or **Ectractive QA**
+Below you will also find instructions on how you
+could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
 ## Installation and Running
+### Local development
+To run the bare application which does _nothing_:
+1. Install requirements: `pip install -r requirements.txt`
+2. Run the streamlit app: `streamlit run app.py`
+This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll
+notice that the app will only show you instructions on what to edit.
+### Docker
+To run the app in a Docker container:
+1. Build the Docker image: `docker build -t haystack-streamlit .`
+2. Run the Docker container: `docker run -p 8501:8501 haystack-streamlit` (make sure to bind any other ports you need)
+3. Open your browser and go to `http://localhost:8501`
 ### Repo structure
+- `./utils`: This is where we have 3 files:
+    - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it
+      uses default values. An example of this is
+      in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
+    - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search
+      pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and
+      cache it, and `query()` which is the function called by `app.py` once a user query is received.
     - `ui.py`: Use this file for any UI and initial value setups.
+- `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search
+  bar, a 'Run' button, and a response that you can highlight answers with.
+- `requirements.txt`: This file includes the required libraries to run the Streamlit app.
+- `document_qa_engine.py`: This file includes the QA pipeline with Haystack.
 ### What to edit?
 There are default pipelines both in `start_haystack_extractive()` and `start_haystack_rag()`
 - Change the pipelines to use the embedding models, extractive or generative models as you need.
+- If using the `rag` task, change the `default_prompt_template` to use one of our available ones
+  on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
+### Using local LLM models
+To use the `local LLM` mode you can use [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/).
+For more info on how to run the app with a local LLM model please refer to the documentation of the tool you are using.
+The `local_llm` mode expects an API available at `http://localhost:1234/v1`.
 ## Pushing to Hugging Face Spaces 🤗
 1. Create a New Space on Hugging Face with the Streamlit SDK.
 2. Create a Hugging Face token on your HF account.
 3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
+4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for
+   your HF Space too!
+5. This readme is set up to tell HF spaces that it's using streamlit and that the app is running on `app.py`, make any
+   changes to the frontmatter of this readme to display the title, emoji etc you desire.
+6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information,
+   and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml)
+   working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
 ```yaml
 name: Sync to Hugging Face hub
 on:
   push:
+    branches: [ main ]
   # to run this workflow manually from the Actions tab
   workflow_dispatch:

app.py CHANGED

@@ -1,284 +1,240 @@
-from utils.check_pydantic_version import use_pydantic_v1
-use_pydantic_v1() #This function has to be run before importing haystack. as haystack requires pydantic v1 to run
-from operator import index
-import streamlit as st
-import logging
 import os
-from annotated_text import annotation
-from json import JSONDecodeError
-from markdown import markdown
-from utils.config import parser
-from utils.haystack import start_document_store, query, initialize_pipeline, start_preprocessor_node, start_retriever, start_reader
-from utils.ui import reset_results, set_initial_state
 import pandas as pd
-import haystack
-from datetime import datetime
-import streamlit.components.v1 as components
 import streamlit_authenticator as stauth
-import pickle
 from streamlit_modal import Modal
-import numpy as np
-names = ['mlreply']
-usernames = ['mlreply']
-with open('hashed_password.pkl','rb') as f:
-    hashed_passwords = pickle.load(f)
-# Whether the file upload should be enabled or not
-DISABLE_FILE_UPLOAD = bool(os.getenv("DISABLE_FILE_UPLOAD"))
-def show_documents_list(retrieved_documents):
-    data = []
-    for i, document in enumerate(retrieved_documents):
-        data.append([document.meta['name']])
-    df = pd.DataFrame(data, columns=['Uploaded Document Name'])
-    df.drop_duplicates(subset=['Uploaded Document Name'], inplace=True)
-    df.index = np.arange(1, len(df) + 1)
-    return df
-# Define a function to handle file uploads
-def upload_files():
-    uploaded_files = upload_container.file_uploader(
-            "upload", type=["pdf", "txt", "docx"], accept_multiple_files=True, label_visibility="hidden", key=1
-        )
-    return uploaded_files
-# Define a function to process a single file
-def process_file(data_file, preprocesor, document_store):
-    # read file and add content
-    file_contents = data_file.read().decode("utf-8")
-    docs = [{
-        'content': str(file_contents),
-        'meta': {'name': str(data_file.name)}
-    }]
-    try:
-        names = [item.meta.get('name') for item in document_store.get_all_documents()]
-        #if args.store == 'inmemory':
-        # doc = converter.convert(file_path=files, meta=None)
-        if data_file.name in names:
-            print(f"{data_file.name} already processed")
-        else:
-            print(f'preprocessing uploaded doc {data_file.name}.......')
-            #print(data_file.read().decode("utf-8"))
-            preprocessed_docs = preprocesor.process(docs)
-            print('writing to document store.......')
-            document_store.write_documents(preprocessed_docs)
-            print('updating emebdding.......')
-            document_store.update_embeddings(retriever)
-    except Exception as e:
-        print(e)
-# Define a function to upload the documents to haystack document store
-def upload_document():
-    if data_files is not None:
-        for data_file in data_files:
-            # Upload file
-            if data_file:
-                try:
-                    #raw_json = upload_doc(data_file)
-                    # Call the process_file function for each uploaded file
-                    if args.store == 'inmemory':
-                        processed_data = process_file(data_file, preprocesor, document_store)
-                    #upload_container.write(str(data_file.name) + " &nbsp;&nbsp; ✅ ")
-                except Exception as e:
-                    upload_container.write(str(data_file.name) + " &nbsp;&nbsp; ❌ ")
-                    upload_container.write("_This file could not be parsed, see the logs for more information._")
-# Define a function to reset the documents in haystack document store
-def reset_documents():
-    print('\nReseting documents list at ' + str(datetime.now()) + '\n')
-    st.session_state.data_files = None
-    document_store.delete_documents()
-try:
-    args = parser.parse_args()
-    preprocesor = start_preprocessor_node()
-    document_store = start_document_store(type=args.store)
-    document_store.get_all_documents()
-    retriever = start_retriever(document_store)
-    reader = start_reader()
     st.set_page_config(
-        page_title="MLReplySearch",
-        layout="centered",
         page_icon=":shark:",
         menu_items={
             'Get Help': 'https://www.extremelycoolapp.com/help',
             'Report a bug': "https://www.extremelycoolapp.com/bug",
             'About': "# This is a header. This is an *extremely* cool app!"
         }
     )
-    st.sidebar.image("ml_logo.png", use_column_width=True)
-    authenticator = stauth.Authenticate(names, usernames, hashed_passwords, "document_search", "random_text", cookie_expiry_days=1)
-    name, authentication_status, username = authenticator.login("Login", "main")
-    if authentication_status == False:
-        st.error("Username/Password is incorrect")
-    if authentication_status == None:
-        st.warning("Please enter your username and password")
-    if authentication_status:
-        # Sidebar for Task Selection
-        st.sidebar.header('Options:')
-        # OpenAI Key Input
-        openai_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password")
-        if openai_key:
-            task_options = ['Extractive', 'Generative']
-        else:
-            task_options = ['Extractive']
-        task_selection = st.sidebar.radio('Select the task:', task_options)
-        # Check the task and initialize pipeline accordingly
-        if task_selection == 'Extractive':
-            pipeline_extractive = initialize_pipeline("extractive", document_store, retriever, reader)
-        elif task_selection == 'Generative' and openai_key:  # Check for openai_key to ensure user has entered it
-            pipeline_rag = initialize_pipeline("rag", document_store, retriever, reader, openai_key=openai_key)
-        set_initial_state()
-        modal = Modal("Manage Files", key="demo-modal")
-        open_modal = st.sidebar.button("Manage Files", use_container_width=True)
-        if open_modal:
-            modal.open()
-        st.write('# ' + args.name)
-        if modal.is_open():
-            with modal.container():
-                if not DISABLE_FILE_UPLOAD:
-                    upload_container = st.container()
-                    data_files = upload_files()
-                    upload_document()
-                    st.session_state.sidebar_state = 'collapsed'
-                st.table(show_documents_list(document_store.get_all_documents()))
-        # File upload block
-       # if not DISABLE_FILE_UPLOAD:
-        #    upload_container = st.sidebar.container()
-         #   upload_container.write("## File Upload:")
-          #  data_files = upload_files()
-            # Button to update files in the documentStore
-           # upload_container.button('Upload Files', on_click=upload_document, args=())
-        # Button to reset the documents in DocumentStore
-        st.sidebar.button("Reset documents", on_click=reset_documents, args=(), use_container_width=True)
-        if "question" not in st.session_state:
-            st.session_state.question = ""
-        # Search bar
-        question = st.text_input("Question", value=st.session_state.question, max_chars=100, on_change=reset_results, label_visibility="hidden")
-        run_pressed = st.button("Run")
-        run_query = (
-            run_pressed or question != st.session_state.question #or task_selection != st.session_state.task
         )
-        # Get results for query
-        if run_query and question:
-            if task_selection == 'Extractive':
-                reset_results()
-                st.session_state.question = question
-                with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
-                    try:
-                        st.session_state.results_extractive = query(pipeline_extractive, question)
-                        st.session_state.task = task_selection
-                    except JSONDecodeError as je:
-                        st.error(
-                            "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
-                        )
-                    except Exception as e:
-                        logging.exception(e)
-                        st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
-            elif task_selection == 'Generative':
-                reset_results()
-                st.session_state.question = question
-                with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
-                    try:
-                        st.session_state.results_generative = query(pipeline_rag, question)
-                        st.session_state.task = task_selection
-                    except JSONDecodeError as je:
-                        st.error(
-                            "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
-                        )
-                    except Exception as e:
-                        if "API key is invalid" in str(e):
-                            logging.exception(e)
-                            st.error("🐞 &nbsp;&nbsp; incorrect API key provided. You can find your API key at https://platform.openai.com/account/api-keys.")
-                        else:
-                            logging.exception(e)
-                            st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
-        # Display results
-        if (st.session_state.results_extractive or st.session_state.results_generative) and run_query:
-            # Handle Extractive Answers
-            if task_selection == 'Extractive':
-                results = st.session_state.results_extractive
-                st.subheader("Extracted Answers:")
-                if 'answers' in results:
-                    answers = results['answers']
-                    treshold = 0.2
-                    higher_then_treshold = any(ans.score > treshold for ans in answers)
-                    if not higher_then_treshold:
-                        st.markdown(f"<span style='color:red'>Please note none of the answers achieved a score higher then {int(treshold) * 100}%. Which probably means that the desired answer is not in the searched documents.</span>", unsafe_allow_html=True)
-                    for count, answer in enumerate(answers):
-                        if answer.answer:
-                            text, context = answer.answer, answer.context
-                            start_idx = context.find(text)
-                            end_idx = start_idx + len(text)
-                            score = round(answer.score, 3)
-                            st.markdown(f"**Answer {count + 1}:**")
-                            st.markdown(
-                                context[:start_idx] + str(annotation(body=text, label=f'SCORE {score}', background='#964448', color='#ffffff')) + context[end_idx:],
-                                unsafe_allow_html=True,
-                            )
-                        else:
-                            st.info(
-                                "🤔 &nbsp;&nbsp; Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!"
-                            )
-            # Handle Generative Answers
-            elif task_selection == 'Generative':
-                results = st.session_state.results_generative
-                st.subheader("Generated Answer:")
-                if 'results' in results:
-                    st.markdown("**Answer:**")
-                    st.write(results['results'][0])
-            # Handle Retrieved Documents
-            if 'documents' in results:
-                retrieved_documents = results['documents']
-                st.subheader("Retriever Results:")
-                data = []
-                for i, document in enumerate(retrieved_documents):
-                    # Truncate the content
-                    truncated_content = (document.content[:150] + '...') if len(document.content) > 150 else document.content
-                    data.append([i + 1, document.meta['name'], truncated_content])
-                # Convert data to DataFrame and display using Streamlit
-                df = pd.DataFrame(data, columns=['Ranked Context', 'Document Name', 'Content'])
-                st.table(df)
-except SystemExit as e:
-    os._exit(e.code)

 import os
+from dotenv import load_dotenv
 import pandas as pd
+import streamlit as st
 import streamlit_authenticator as stauth
 from streamlit_modal import Modal
+from utils import new_file, clear_memory, append_documentation_to_sidebar, load_authenticator_config, init_qa, \
+    append_header
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack import Document
+load_dotenv()
+OPENAI_MODELS = ['gpt-3.5-turbo',
+                 "gpt-4",
+                 "gpt-4-1106-preview"]
+OPEN_MODELS = [
+    'mistralai/Mistral-7B-Instruct-v0.1',
+    'HuggingFaceH4/zephyr-7b-beta'
+]
+def reset_chat_memory():
+    st.button(
+        'Reset chat memory',
+        key="reset-memory-button",
+        on_click=clear_memory,
+        help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.",
+        disabled=False)
+def manage_files(modal, document_store):
+    open_modal = st.sidebar.button("Manage Files", use_container_width=True)
+    if open_modal:
+        modal.open()
+    if modal.is_open():
+        with modal.container():
+            uploaded_file = st.file_uploader(
+                "Upload a CV in PDF format",
+                type=("pdf",),
+                on_change=new_file(),
+                disabled=st.session_state['document_qa_model'] is None,
+                label_visibility="collapsed",
+                help="The document is used to answer your questions. The system will process the document and store it in a RAG to answer your questions.",
+            )
+            edited_df = st.data_editor(use_container_width=True, data=st.session_state['files'],
+                                       num_rows='dynamic',
+                                       column_order=['name', 'size', 'is_active'],
+                                       column_config={'name': {'editable': False}, 'size': {'editable': False},
+                                                      'is_active': {'editable': True, 'type': 'checkbox',
+                                                                    'width': 100}}
+                                       )
+            st.session_state['files'] = pd.DataFrame(columns=['name', 'content', 'size', 'is_active'])
+            if uploaded_file:
+                st.session_state['file_uploaded'] = True
+                st.session_state['files'] = pd.concat([st.session_state['files'], edited_df])
+                with st.spinner('Processing the CV content...'):
+                    store_file_in_table(document_store, uploaded_file)
+                    ingest_document(uploaded_file)
+def ingest_document(uploaded_file):
+    if not st.session_state['document_qa_model']:
+        st.warning('Please select a model to start asking questions')
+    else:
+        try:
+            st.session_state['document_qa_model'].ingest_pdf(uploaded_file)
+            st.success('Document processed successfully')
+        except Exception as e:
+            st.error(f"Error processing the document: {e}")
+            st.session_state['file_uploaded'] = False
+def store_file_in_table(document_store, uploaded_file):
+    pdf_content = uploaded_file.getvalue()
+    st.session_state['pdf_content'] = pdf_content
+    st.session_state.messages = []
+    document = Document(content=pdf_content, meta={"name": uploaded_file.name})
+    df = pd.DataFrame(st.session_state['files'])
+    df['is_active'] = False
+    st.session_state['files'] = pd.concat([df, pd.DataFrame(
+        [{"name": uploaded_file.name, "content": pdf_content, "size": len(pdf_content),
+          "is_active": True}])])
+    document_store.write_documents([document])
+def init_session_state():
+    st.session_state.setdefault('files', pd.DataFrame(columns=['name', 'content', 'size', 'is_active']))
+    st.session_state.setdefault('models', [])
+    st.session_state.setdefault('api_keys', {})
+    st.session_state.setdefault('current_selected_model', 'gpt-3.5-turbo')
+    st.session_state.setdefault('current_api_key', '')
+    st.session_state.setdefault('messages', [])
+    st.session_state.setdefault('pdf_content', None)
+    st.session_state.setdefault('memory', None)
+    st.session_state.setdefault('pdf', None)
+    st.session_state.setdefault('document_qa_model', None)
+    st.session_state.setdefault('file_uploaded', False)
+def set_page_config():
     st.set_page_config(
+        page_title="CV Insights AI Assistant",
         page_icon=":shark:",
+        initial_sidebar_state="expanded",
+        layout="wide",
         menu_items={
             'Get Help': 'https://www.extremelycoolapp.com/help',
             'Report a bug': "https://www.extremelycoolapp.com/bug",
             'About': "# This is a header. This is an *extremely* cool app!"
         }
     )
+def update_running_model(api_key, model):
+    st.session_state['api_keys'][model] = api_key
+    st.session_state['document_qa_model'] = init_qa(model, api_key)
+def init_api_key_dict():
+    st.session_state['models'] = OPENAI_MODELS + list(OPEN_MODELS) + ['local LLM']
+    for model_name in OPENAI_MODELS:
+        st.session_state['api_keys'][model_name] = None
+def display_chat_messages(chat_box, chat_input):
+    with chat_box:
+        if chat_input:
+            for message in st.session_state.messages:
+                with st.chat_message(message["role"]):
+                    st.markdown(message["content"], unsafe_allow_html=True)
+            st.chat_message("user").markdown(chat_input)
+            st.session_state.messages.append({"role": "user", "content": chat_input})
+            with st.chat_message("assistant"):
+                response = st.session_state['document_qa_model'].process_message(chat_input)
+                st.markdown(response)
+                st.session_state.messages.append({"role": "assistant", "content": response})
+def setup_model_selection():
+    model = st.selectbox(
+        "Model:",
+        options=st.session_state['models'],
+        index=0,  # default to the first model in the list gpt-3.5-turbo
+        placeholder="Select model",
+        help="Select an LLM:"
+    )
+    if model:
+        if model != st.session_state['current_selected_model']:
+            st.session_state['current_selected_model'] = model
+            if model == 'local LLM':
+                st.session_state['document_qa_model'] = init_qa(model)
+    api_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password",
+                                    disabled=st.session_state['current_selected_model'] == 'local LLM')
+    if api_key and api_key != st.session_state['current_api_key']:
+        update_running_model(api_key, model)
+        st.session_state['current_api_key'] = api_key
+    return model
+def setup_task_selection(model):
+    # enable extractive and generative tasks if we're using a local LLM or an OpenAI model with an API key
+    if model == 'local LLM' or st.session_state['api_keys'].get(model):
+        task_options = ['Extractive', 'Generative']
+    else:
+        task_options = ['Extractive']
+    task_selection = st.sidebar.radio('Select the task:', task_options)
+    # TODO: Add the task selection logic here (initializing the model based on the task)
+def setup_page_body():
+    chat_box = st.container(height=350, border=False)
+    chat_input = st.chat_input(
+        placeholder="Upload a document to start asking questions...",
+        disabled=not st.session_state['file_uploaded'],
+    )
+    if st.session_state['file_uploaded']:
+        display_chat_messages(chat_box, chat_input)
+class StreamlitApp:
+    def __init__(self):
+        self.authenticator_config = load_authenticator_config()
+        self.document_store = InMemoryDocumentStore()
+        set_page_config()
+        self.authenticator = self.init_authenticator()
+        init_session_state()
+        init_api_key_dict()
+    def init_authenticator(self):
+        return stauth.Authenticate(
+            self.authenticator_config['credentials'],
+            self.authenticator_config['cookie']['name'],
+            self.authenticator_config['cookie']['key'],
+            self.authenticator_config['cookie']['expiry_days']
         )
+    def setup_sidebar(self):
+        with st.sidebar:
+            st.sidebar.image("resources/ml_logo.png", use_column_width=True)
+            # Sidebar for Task Selection
+            st.sidebar.header('Options:')
+            model = setup_model_selection()
+            setup_task_selection(model)
+            st.divider()
+            self.authenticator.logout()
+            reset_chat_memory()
+            modal = Modal("Manage Files", key="demo-modal")
+            manage_files(modal, self.document_store)
+            st.divider()
+            append_documentation_to_sidebar()
+    def run(self):
+        name, authentication_status, username = self.authenticator.login()
+        if authentication_status:
+            self.run_authenticated_app()
+        elif st.session_state["authentication_status"] is False:
+            st.error('Username/password is incorrect')
+        elif st.session_state["authentication_status"] is None:
+            st.warning('Please enter your username and password')
+    def run_authenticated_app(self):
+        self.setup_sidebar()
+        append_header()
+        setup_page_body()
+app = StreamlitApp()
+app.run()
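
The branching in `StreamlitApp.run` depends on a three-valued login status: `True` after a successful login, `False` after a failed attempt, and `None` before any attempt. A minimal sketch of that dispatch follows, with the Streamlit calls replaced by returned strings so it can run on its own. `login_outcome` is a hypothetical helper and is not part of app.py.

```python
from typing import Optional


def login_outcome(authentication_status: Optional[bool]) -> str:
    """Map the tri-state login status to the action the app would take.

    Hypothetical helper for illustration; app.py calls Streamlit directly.
    """
    if authentication_status:
        return "run_app"
    if authentication_status is False:
        return "error: Username/password is incorrect"
    # None: the login form has not been submitted yet.
    return "warning: Please enter your username and password"


for status in (True, False, None):
    print(status, "->", login_outcome(status))
```

Checking `is False` explicitly matters here because `None` is also falsy; a plain `else` after the first branch would show the error message before the user has typed anything.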

authenticator_config.yaml ADDED

	@@ -0,0 +1,15 @@

+credentials:
+  usernames:
+    mlreply:
+      email: mlreply@reply.de
+      failed_login_attempts: 0 # Will be managed automatically
+      logged_in: False # Will be managed automatically
+      name: ML Reply
+      password: mlreply # Will be hashed automatically
+cookie:
+  expiry_days: 1
+  key: some_signature_key # Must be string
+  name: some_cookie_name
+#pre-authorized:
+#  emails:
+#    - melsby@gmail.com
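
`init_authenticator` in app.py reads `credentials` plus the three `cookie` fields from this file, and `load_authenticator_config` in utils.py returns `None` when the file is missing or cannot be parsed. A rough sketch of a pre-flight check on the parsed config is shown below. `missing_config_keys` and `REQUIRED_COOKIE_KEYS` are hypothetical names, and the check operates on an already-parsed dict, so it needs no YAML library.

```python
# Keys that app.py looks up under "cookie" when building the authenticator.
REQUIRED_COOKIE_KEYS = ("name", "key", "expiry_days")


def missing_config_keys(config):
    """Return dotted paths of the keys the app reads that are absent.

    Hypothetical helper for illustration; not part of this commit.
    """
    # A failed load yields None, so every top-level section is missing.
    if not isinstance(config, dict):
        return ["credentials", "cookie"]
    missing = []
    if "credentials" not in config:
        missing.append("credentials")
    cookie = config.get("cookie")
    if not isinstance(cookie, dict):
        missing.append("cookie")
    else:
        missing.extend(
            f"cookie.{key}" for key in REQUIRED_COOKIE_KEYS if key not in cookie
        )
    return missing


example = {
    "credentials": {"usernames": {}},
    "cookie": {"expiry_days": 1, "key": "some_signature_key"},
}
print(missing_config_keys(example))  # ['cookie.name']
```

Running such a check before constructing the authenticator would turn a `TypeError` or `KeyError` deep in the constructor call into a readable message.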

document_qa_engine.py ADDED

	@@ -0,0 +1,120 @@

+from typing import List
+from pypdf import PdfReader
+from haystack.utils import Secret
+from haystack import Pipeline, Document, component
+from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
+from haystack.components.writers import DocumentWriter
+from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
+from haystack.components.builders import PromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator, HuggingFaceTGIChatGenerator
+from haystack.components.generators import OpenAIGenerator, HuggingFaceTGIGenerator
+from haystack.document_stores.types import DuplicatePolicy
+SENTENCE_RETREIVER_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+MAX_TOKENS = 500
+template = """
+As a professional HR recruiter given the following information, answer the question shortly and concisely in 1 or 2 sentences.
+Context:
+{% for document in documents %}
+    {{ document.content }}
+{% endfor %}
+Question: {{question}}
+Answer:
+"""
+@component
+class UploadedFileConverter:
+    """
+    A component to convert uploaded PDF files to Documents
+    """
+    @component.output_types(documents=List[Document])
+    def run(self, uploaded_file):
+        pdf = PdfReader(uploaded_file)
+        documents = []
+        # uploaded file name without its extension; "_<page number>" is appended below
+        name = uploaded_file.name.rsplit('.', 1)[0]
+        for page in pdf.pages:
+            documents.append(
+                Document(
+                    content=page.extract_text(),
+                    meta={'name': name + f"_{page.page_number}"}))
+        return {"documents": documents}
+def create_ingestion_pipeline(document_store):
+    doc_embedder = SentenceTransformersDocumentEmbedder(model=SENTENCE_RETREIVER_MODEL)
+    doc_embedder.warm_up()
+    pipeline = Pipeline()
+    pipeline.add_component("converter", UploadedFileConverter())
+    pipeline.add_component("cleaner", DocumentCleaner())
+    pipeline.add_component("splitter",
+                           DocumentSplitter(split_by="passage", split_length=100, split_overlap=10))
+    pipeline.add_component("embedder", doc_embedder)
+    pipeline.add_component("writer",
+                           DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
+    pipeline.connect("converter", "cleaner")
+    pipeline.connect("cleaner", "splitter")
+    pipeline.connect("splitter", "embedder")
+    pipeline.connect("embedder", "writer")
+    return pipeline
+def create_query_pipeline(document_store, model_name, api_key):
+    prompt_builder = PromptBuilder(template=template)
+    if model_name == "local LLM":
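+        # Placeholder key: without one, OpenAIGenerator falls back to the OPENAI_API_KEY env var
+        generator = OpenAIGenerator(api_key=Secret.from_token("not-needed"), model=model_name,
+                                    api_base_url="http://localhost:1234/v1",
+                                    generation_kwargs={"max_tokens": MAX_TOKENS}
+                                    )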
+    elif "gpt" in model_name:
+        generator = OpenAIGenerator(api_key=Secret.from_token(api_key), model=model_name,
+                                    generation_kwargs={"max_tokens": MAX_TOKENS}
+                                    )
+    else:
+        generator = HuggingFaceTGIGenerator(token=Secret.from_token(api_key), model=model_name,
+                                            generation_kwargs={"max_new_tokens": MAX_TOKENS}
+                                            )
+    query_pipeline = Pipeline()
+    query_pipeline.add_component("text_embedder",
+                                 SentenceTransformersTextEmbedder(model=SENTENCE_RETREIVER_MODEL))
+    query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
+    query_pipeline.add_component("prompt_builder", prompt_builder)
+    query_pipeline.add_component("generator", generator)
+    query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
+    query_pipeline.connect("retriever.documents", "prompt_builder.documents")
+    query_pipeline.connect("prompt_builder", "generator")
+    return query_pipeline
+class DocumentQAEngine:
+    def __init__(self,
+                 model_name,
+                 api_key=None
+                 ):
+        self.api_key = api_key
+        self.model_name = model_name
+        document_store = InMemoryDocumentStore()
+        self.chunks = []
+        self.query_pipeline = create_query_pipeline(document_store, model_name, api_key)
+        self.pdf_ingestion_pipeline = create_ingestion_pipeline(document_store)
+    def ingest_pdf(self, uploaded_file):
+        self.pdf_ingestion_pipeline.run({"converter": {"uploaded_file": uploaded_file}})
+    def process_message(self, query):
+        response = self.query_pipeline.run({"text_embedder": {"text": query}, "prompt_builder": {"question": query}})
+        return response["generator"]["replies"][0]
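
Document names in `UploadedFileConverter` combine the uploaded file's name, minus its extension, with the page number. The sketch below isolates that naming step and shows why `str.rstrip` is a risky tool for it: it removes a set of characters, not a suffix. `page_document_name` is a hypothetical helper and is not part of this commit.

```python
def page_document_name(file_name: str, page_number: int) -> str:
    """Build '<stem>_<page>' from an uploaded file name.

    Hypothetical helper for illustration only.
    """
    # Split on the last dot only, so dots inside the name are preserved
    # and the extension is dropped regardless of its case.
    stem = file_name.rsplit(".", 1)[0]
    return f"{stem}_{page_number}"


# str.rstrip treats its argument as a set of characters, not a suffix:
print("CV_UPDATED.PDF".rstrip(".PDF"))  # CV_UPDATE (the trailing 'D' is lost)
print("resume.pdf".rstrip(".PDF"))      # resume.pdf (lowercase is untouched)

print(page_document_name("CV_UPDATED.PDF", 0))  # CV_UPDATED_0
print(page_document_name("resume.pdf", 2))      # resume_2
```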

generate_keys.py DELETED

@@ -1,15 +0,0 @@
-# -*- coding: utf-8 -*-
-import pickle
-from pathlib import Path
-import streamlit_authenticator as stauth
-names = ['mlreply']
-usernames = ['mlreply']
-passwords = ['mlreply1']
-hashed_passwords = stauth.Hasher((passwords)).generate()
-with open('hashed_password.pkl','wb') as f:
-    pickle.dump(hashed_passwords, f)

hashed_password.pkl DELETED

Binary file (78 Bytes)

requirements.txt CHANGED

@@ -1,10 +1,18 @@
-scikit-learn==1.3.2
-safetensors==0.3.3.post1
-farm-haystack[inference,weaviate,opensearch,file-conversion,pdf]==1.20.0
-milvus-haystack
-streamlit==1.23.0
-streamlit-authenticator==0.1.5
-streamlit_modal
-markdown
-st-annotated-text
-datasets

+# Streamlit
+streamlit~=1.32.2
+streamlit-modal==0.1.2
+streamlit-authenticator==0.3.2
+streamlit-pdf-viewer==0.0.9
+# LLM
+haystack-ai~=2.0.0
+sentence_transformers~=2.6.0
+# Utils
+pandas~=2.2.1
+pypdf~=4.2.0
+pytest~=8.1.1
+python-dotenv~=1.0.1
+# Dev Utils
+watchdog

ml_logo.png → resources/ml_logo.png RENAMED

File without changes

utils.py ADDED

	@@ -0,0 +1,58 @@

+from document_qa_engine import DocumentQAEngine
+import streamlit as st
+import logging
+from yaml import load, SafeLoader, YAMLError
+def load_authenticator_config(file_path='authenticator_config.yaml'):
+    try:
+        with open(file_path, 'r') as file:
+            authenticator_config = load(file, Loader=SafeLoader)
+            return authenticator_config
+    except FileNotFoundError:
+        logging.error(f"File {file_path} not found.")
+    except YAMLError as error:
+        logging.error(f"Error parsing YAML file: {error}")
+def new_file():
+    st.session_state['loaded_embeddings'] = None
+    st.session_state['doc_id'] = None
+    st.session_state['uploaded'] = True
+    clear_memory()
+def clear_memory():
+    if st.session_state['memory']:
+        st.session_state['memory'].clear()
+def init_qa(model, api_key=None):
+    print(f"Initializing QA with model: {model}")  # never print the API key
+    return DocumentQAEngine(model, api_key=api_key)
+def append_header():
+    _, header_container, _ = st.columns([0.25, 0.5, 0.25])
+    with header_container:
+        st.header('📄 Document Insights :rainbow[AI] Assistant 📚', divider='rainbow')
+        st.text("📥 Upload documents in PDF format. Get insights and ask questions.")
+def append_documentation_to_sidebar():
+    with st.expander("Disclaimer"):
+        st.markdown(
+            """
+            :warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely
+            for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use
+            or handling of the data submitted to third-party LLMs.
+            """)
+    with st.expander("Documentation"):
+        st.markdown(
+            """
+            Upload a CV as a PDF document. Once the spinner stops, you can proceed to ask your questions. The answers will
+            be displayed in the right column. The system will answer your questions using the content of the document
+            and mark references over the PDF viewer.
+            """)
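
`init_qa` receives the API key alongside the model name, so any diagnostic output around it should avoid exposing the key. A minimal sketch of masking a secret before it reaches a log line is given below. `mask_secret` is a hypothetical helper, not part of utils.py, and the key shown is a made-up example.

```python
def mask_secret(secret, visible=4):
    """Return a log-safe form of a secret, keeping only its last characters.

    Hypothetical helper for illustration only.
    """
    if not secret:
        return "<none>"
    # Very short secrets are masked entirely rather than partly revealed.
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


print(mask_secret("sk-example-1234"))  # ***********1234
print(mask_secret(None))               # <none>
```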