boldhasnain committed
Commit 4e20cdf · verified · 1 Parent(s): 8045fd9

Upload 9 files

Files changed (9)
  1. Dockerfile +18 -0
  2. README.md +19 -11
  3. docker_rag.sh +27 -0
  4. feedback_loop.txt +49 -0
  5. landing_page.py +313 -0
  6. requirements.txt +53 -0
  7. software_data.txt +0 -0
  8. software_final.txt +0 -0
  9. streamlit_rag.sh +14 -0
Dockerfile ADDED
@@ -0,0 +1,18 @@
+ FROM python:3.9
+
+ MAINTAINER pranavrao25
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+
+ RUN apt-get update \
+     && apt-get -y install tesseract-ocr
+
+ RUN pip install -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 8501
+
+ # exec-form CMD does not spawn a shell, so "nohup ... &" would be passed as literal
+ # arguments; run Streamlit in the foreground as the container's main process instead
+ CMD ["streamlit", "run", "landing_page.py"]
README.md CHANGED
@@ -1,11 +1,19 @@
- ---
- title: RAG
- emoji: 🌍
- colorFrom: red
- colorTo: red
- sdk: docker
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multi-modal RAG based LLM for Information Retrieval
+
+ In this project we have set up a RAG system with the following features:
+ <ol>
+ <li>Custom PDF input</li>
+ <li>Multi-modal interface with support for images & text</li>
+ <li>Feedback recording and reuse</li>
+ <li>Use of agents for context retrieval</li>
+ </ol>
+
+ The project primarily runs on Streamlit<br>
+ Here is the [Docker Image](https://hub.docker.com/repository/docker/pranavrao25/ragimage/general)<br>
+
+ Procedure to run the pipeline:
+ 1. Clone the project
+ 2. To run via the Docker image, run the ```docker_rag.sh``` script as ```/bin/bash ./docker_rag.sh``` (see the sketch below)
+ 3. To run directly with Streamlit:
+    1. Install the requirements: ```pip install -r requirements.txt```
+    2. Run the ```streamlit_rag.sh``` script as ```/bin/zsh ./streamlit_rag.sh```
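The Docker path in step 2 boils down to pulling and running the published image; a minimal equivalent, assuming Docker is already installed (image tag taken from ```docker_rag.sh```):

```bash
docker run -p 8501:8501 pranavrao25/ragimage:image
```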
docker_rag.sh ADDED
@@ -0,0 +1,27 @@
+ #!/bin/bash
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     mkdir pdfs
+     exit 0
+ }
+
+ sudo apt-get update
+ sudo apt-get install tesseract-ocr
+ echo "TESSERACT INSTALLED"
+ sudo apt install apt-transport-https ca-certificates curl software-properties-common
+ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
+ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
+ apt-cache policy docker-ce
+ sudo apt install docker.io
+ echo "DOCKER INSTALLED"
+ sudo apt install python3.12-venv
+ python3 -m venv ragenv
+ echo "VIRTUAL ENVIRONMENT CREATED"
+ source ragenv/bin/activate
+ echo "RUNNING RAG"
+ sudo docker run -p 8501:8501 pranavrao25/ragimage:image &
+ wait $!
feedback_loop.txt ADDED
@@ -0,0 +1,49 @@
+ The feedback for How to create or edit recipes? is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ ent process? is POSITIVE Good and the response is Alignment marks are specific features or structures with well-known coordinates used in the alignment process. To execute an alignment, an image displaying these structural characteristics is needed. These alignment marks can either be specially designed adjusting aids or distinctive features of already patterned structures. Both types of features serve the same purpose: to ensure precise alignment by providing reference points that can be easily identified and used for accurate positioning during the alignment process.
+
+ The feedback for What does alignment accuracy refer to, and how is it achieved in a microscopy context? is POSITIVE Good and the response is Alignment accuracy refers to the variation between the actual position and the target position of a specimen under a microscope. Achieving high alignment accuracy involves a process called alignment, where the beam deflection is corrected through digital shift and rotation to minimize the mismatch between the actual and target positions. This method can achieve precision within tens of nanometers, ensuring that the specimen is accurately positioned and aligned for detailed examination and analysis.
+
+ The feedback for What are adjustment points in the context of using a microscope, and why are they important? is POSITIVE Good and the response is Adjustment points are specific locations on a specimen with well-known coordinates, essential for determining the coordinate transformation from the specimen system to the stage system on a microscope. When a specimen is loaded onto the microscope stage, the absolute position and angle relative to the stage are initially undefined. The adjustment procedure uses these points to establish a precise coordinate system, enabling accurate navigation of the specimen. The accuracy of this adjustment relies on the precision of the microscope stage and the selection of adjustment points, ensuring the system gathers maximum information while avoiding overdetermination.
+
+ The feedback for What is present in the menu bar? is POSITIVE Good and the response is The present in the menu bar are: File Menu, Edit Menu, View Menu, Sample Menu, Image Menu, and Settings Menu.
+
+ The feedback for What is a beam blanker? is POSITIVE Good and the response is A beam blanker is a device that moves back with the intended direction of motion to avoid unintended exposure during standby times and beam settling times. An electrostatic blanker has the advantage of being able to switch the beam on and off very fast.
+
+ The feedback for How to create a recipe with exclusive function? is POSITIVE Good and the response is 1. To create a recipe with exclusive function, you can either choose the entry 'Exclusive' in the Material drop-down list or click at the (Create Recipe) Pen button next to the Material drop-down list. 2. If there are any previously assigned parameters, they will be used as default when changing to Exclusive Mode. Doing so, the Pen button changes to the Save Changes button. By pressing this button, the user is offered a possibility to save the edited set of exposure parameters as a new recipe. Apart from user saving the edited recipe as a new recipe, the user can even overwrite the existing recipe as well.
+
+ The feedback for What does Capture button in Image menu do? is NEGATIVE and the response is To capture an image in eLitho, first select the desired image mode (switch between Live and sample mode). Then click on the "Capture" button in the image menu. This will start capturing the image. You can stop the image capture by clicking on the "Freeze" button or by pressing the "End Frame" key.
+
+ The feedback for Hello is POSITIVE and the response is Hello! How can I help you today?
landing_page.py ADDED
@@ -0,0 +1,313 @@
+ import shutil
+ import streamlit as st
+ st.set_page_config(
+     page_title="RAG Configuration",
+     page_icon="🤖",
+     layout="wide",
+     initial_sidebar_state="collapsed"
+ )
+ import re
+ import os
+ import pytesseract  # used below for OCR on extracted figures
+ import spire.pdf
+ import fitz
+ from PIL import Image  # used below to load extracted figure files
+ from src.Databases import *
+ from langchain.text_splitter import *
+ from sentence_transformers import SentenceTransformer, CrossEncoder
+ from langchain_community.llms import HuggingFaceHub
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from transformers import (AutoFeatureExtractor, AutoModel, AutoImageProcessor)
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+
+ class SentenceTransformerEmbeddings:
+     """
+     Wrapper class for the SentenceTransformer class
+     """
+
+     def __init__(self, model_name: str):
+         """
+         Initialises a SentenceTransformer
+         """
+         self.model = SentenceTransformer(model_name)
+
+     def embed_documents(self, texts):
+         """
+         Returns a list of embeddings for the given texts.
+         """
+         return self.model.encode(texts, convert_to_tensor=True).tolist()
+
+     def embed_query(self, text):
+         """
+         Returns a list of embeddings for the given text.
+         """
+         return self.model.encode(text, convert_to_tensor=True).tolist()
+
+
+ @st.cache_resource(show_spinner=False)
+ def settings():
+     return HuggingFaceEmbedding(model_name="BAAI/bge-base-en")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")  # 768 dimensions + euclidean
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_image_model(model):
+     extractor = AutoFeatureExtractor.from_pretrained(model)
+     im_model = AutoModel.from_pretrained(model)
+     return extractor, im_model
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_bi_encoder():
+     return HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2", model_kwargs={"device": "cpu"})
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_cross():
+     return CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_chat_model():
+     template = '''
+     You are an assistant for question-answering tasks.
+     Use the following pieces of retrieved context to answer the question accurately.
+     If the question is not related to the context, just answer 'I don't know'.
+     Question: {question}
+     Context: {context}
+     Answer:
+     '''
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.1",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512, "query_wrapper_prompt": template}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_q_model():
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.3",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_nomic_model():
+     return AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5"), AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5",
+                                                                                                              trust_remote_code=True)
+
+
+ @st.cache_resource(show_spinner=False)
+ def vector_database_prep(file):
+     def data_prep(file):
+         def findWholeWord(w):
+             return re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
+
+         file_name = file.name
+         pdf_file_path = os.path.join(os.getcwd(), 'pdfs', file_name)
+         image_folder = os.path.join(os.getcwd(), f'figures_{file_name}')
+         if not os.path.exists(image_folder):
+             os.makedirs(image_folder)
+
+         # everything down here is wrt the pages dir
+         print('1. folder made')
+         # extract embedded images page by page with Spire.PDF
+         with spire.pdf.PdfDocument() as doc:
+             doc.LoadFromFile(pdf_file_path)
+             images = []
+             for page_num in range(doc.Pages.Count):
+                 page = doc.Pages[page_num]
+                 for image_num in range(len(page.ImagesInfo)):
+                     imageFileName = os.path.join(image_folder, f'figure-{page_num}-{image_num}.png')
+                     image = page.ImagesInfo[image_num]
+                     image.Image.Save(imageFileName)
+                     images.append({
+                         "image_file_name": imageFileName,
+                         "image": image
+                     })
+         print('2. image extraction done')
+         image_info = []
+         for image_file in os.listdir(image_folder):
+             if image_file.endswith('.png'):
+                 image_info.append({
+                     "image_file_name": image_file[:-4],
+                     "image": Image.open(os.path.join(image_folder, image_file)),
+                     "pg_no": int(image_file.split('-')[1])
+                 })
+         print('3. temporary')
+         figures = []
+         with fitz.open(pdf_file_path) as pdf_file:
+             # collect the full text, skipping table-of-contents and index pages
+             data = ""
+             for page in pdf_file:
+                 text = page.get_text()
+                 if not (findWholeWord('table of contents')(text) or findWholeWord('index')(text)):
+                     data += text
+             data = data.replace('}', '-')
+             data = data.replace('{', '-')
+             print('4. Data extraction done')
+             # group page text under bold headers and rename images after figure mentions
+             hs = []
+             for i in image_info:
+                 src = i['image_file_name'] + '.png'
+                 headers = {'_': []}
+                 header = '_'
+                 page = pdf_file[i['pg_no']]
+                 texts = page.get_text('dict')
+                 for block in texts['blocks']:
+                     if block['type'] == 0:
+                         for line in block['lines']:
+                             for span in line['spans']:
+                                 if 'bol' in span['font'].lower() and not span['text'].isnumeric():
+                                     header = span['text']
+                                     print("header: ", header)
+                                     headers[header] = [header]
+                                 else:
+                                     headers[header].append(span['text'])
+                                 try:
+                                     if findWholeWord('fig')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].split('fig')[-1])
+                                     elif findWholeWord('figure')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].lower().split('figure')[-1])
+                                     else:
+                                         pass
+                                 except re.error:
+                                     pass
+                 if not i['image_file_name'].endswith('.png'):
+                     s = i['image_file_name'] + '.png'
+                     i['image_file_name'] = s
+                 os.rename(os.path.join(image_folder, src), os.path.join(image_folder, i['image_file_name']))
+                 hs.append({"image": i, "header": headers})
+             print('5. header and figures done')
+             # gather every text span that mentions each figure
+             figure_contexts = {}
+             for fig in figures:
+                 figure_contexts[fig] = []
+                 for page_num in range(len(pdf_file)):
+                     page = pdf_file[page_num]
+                     texts = page.get_text('dict')
+                     for block in texts['blocks']:
+                         if block['type'] == 0:
+                             for line in block['lines']:
+                                 for span in line['spans']:
+                                     if findWholeWord(fig)(span['text']):
+                                         print('figure mention: ', span['text'])
+                                         figure_contexts[fig].append(span['text'])
+             print('6. Figure context collected')
+             # combine header text with OCR output for each image
+             contexts = []
+             for h in hs:
+                 context = ""
+                 for q in h['header'].values():
+                     context += "".join(q)
+                 s = pytesseract.image_to_string(h['image']['image'])
+                 qwea = context + '\n' + s if len(s) != 0 else context
+                 contexts.append((
+                     h['image']['image_file_name'],
+                     qwea,
+                     h['image']['image']
+                 ))
+             print('7. Overall context collected')
+             image_content = []
+             for fig in figure_contexts:
+                 for c in contexts:
+                     if findWholeWord(fig)(c[0]):
+                         s = c[1] + '\n' + "\n".join(figure_contexts[fig])
+                         s = str("\n".join(
+                             [
+                                 "".join([h for h in i.strip() if h.isprintable()])
+                                 for i in s.split('\n')
+                                 if len(i.strip()) != 0
+                             ]
+                         ))
+                         image_content.append((
+                             c[0],
+                             s,
+                             c[2]
+                         ))
+             print('8. Figure context added')
+
+         return data, image_content
+
+     # Vector Database objects
+     extractor, i_model = st.session_state['extractor'], st.session_state['image_model']
+     pinecone_embed = st.session_state['pinecone_embed']
+     weaviate_embed = st.session_state['weaviate_embed']
+
+     vb1 = UnifiedDatabase('vb1', 'lancedb/rag')
+     vb1.model_prep(extractor, i_model, weaviate_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb2 = UnifiedDatabase('vb2', 'lancedb/rag')
+     vb2.model_prep(extractor, i_model, pinecone_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb_list = [vb1, vb2]
+
+     data, image_content = data_prep(file)
+     for vb in vb_list:
+         vb.upsert(data)
+         vb.upsert(image_content)  # image_content is a list of (image_file_path, context, PIL image) tuples
+     return vb_list
+
+
+ os.environ["HUGGINGFACEHUB_API_TOKEN"] = st.secrets["HUGGINGFACEHUB_API_TOKEN"]
+ os.environ["LANGCHAIN_PROJECT"] = st.secrets["LANGCHAIN_PROJECT"]
+ os.environ["OPENAI_API_KEY"] = st.secrets["GPT_KEY"]
+ st.session_state['pdf_file'] = []
+ st.session_state['vb_list'] = []
+ st.session_state['Settings.embed_model'] = settings()
+ st.session_state['processor'], st.session_state['vision_model'] = load_nomic_model()
+ st.session_state['bi_encoder'] = load_bi_encoder()
+ st.session_state['chat_model'] = load_chat_model()
+ st.session_state['cross_model'] = load_cross()
+ st.session_state['q_model'] = load_q_model()
+ st.session_state['extractor'], st.session_state['image_model'] = load_image_model("google/vit-base-patch16-224-in21k")
+ st.session_state['pinecone_embed'] = pine_embedding_model()
+ st.session_state['weaviate_embed'] = weaviate_embedding_model()
+
+ st.title('Multi-modal RAG based LLM for Information Retrieval')
+ st.subheader('Converse with our Chatbot')
+ st.markdown('Upload a PDF file as a source.')
+ uploaded_file = st.file_uploader("Choose a PDF document...", type=["pdf"], accept_multiple_files=False)
+ if uploaded_file is not None:
+     # save the upload locally, then move it into the pdfs directory
+     with open(uploaded_file.name, mode='wb') as w:
+         w.write(uploaded_file.getvalue())
+     if not os.path.exists(os.path.join(os.getcwd(), 'pdfs')):
+         os.makedirs(os.path.join(os.getcwd(), 'pdfs'))
+     shutil.move(uploaded_file.name, os.path.join(os.getcwd(), 'pdfs'))
+     st.session_state['pdf_file'] = uploaded_file.name
+     with st.spinner('Extracting'):
+         vb_list = vector_database_prep(uploaded_file)
+     st.session_state['vb_list'] = vb_list
+     st.switch_page('pages/rag.py')
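Note that ```landing_page.py``` reads three keys via ```st.secrets``` at startup, so they must exist in ```.streamlit/secrets.toml``` before launch. A minimal sketch, assuming local (non-Spaces) deployment; the key names come from the code above, and all values are placeholders:

```bash
mkdir -p .streamlit
cat > .streamlit/secrets.toml <<'EOF'
HUGGINGFACEHUB_API_TOKEN = "hf_..."   # Hugging Face Hub token
LANGCHAIN_PROJECT = "my-rag-project"  # LangSmith project name (placeholder)
GPT_KEY = "sk-..."                    # exported as OPENAI_API_KEY
EOF
```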
requirements.txt ADDED
@@ -0,0 +1,53 @@
+ streamlit
+ langchain_openai
+ requests
+ langchain
+ langchain_community
+ datasets
+ openai
+ numpy
+ transformers
+ torch
+ sentence_transformers
+ langchain_huggingface
+ ragas
+ weaviate-client
+ streamlit_feedback
+ pinecone-client
+ langchain_pinecone
+ langchain_weaviate
+ langsmith
+ langgraph
+ pandas
+ scipy
+ pillow
+ torchvision
+ unidecode
+ pytesseract
+ langchain_mistralai
+ pymupdf
+ llmlingua
+ accelerate
+ pyarrow
+ lancedb
+ pillow_heif
+ llama-index-vector-stores-lancedb
+ llama-index
+ ftfy
+ tqdm
+ llama-index-multi-modal-llms-openai
+ llama-index-embeddings-huggingface
+ llama-index-readers-file
+ einops
+ unstructured
+ unstructured_inference
+ unstructured.pytesseract
+ pdfminer
+ llama-index-embeddings-clip
+ scikit-image
+ scikit-learn
+ matplotlib
+ Spire.Pdf
+ python-pptx
software_data.txt ADDED
The diff for this file is too large to render. See raw diff
 
software_final.txt ADDED
The diff for this file is too large to render. See raw diff
 
streamlit_rag.sh ADDED
@@ -0,0 +1,14 @@
+ #!/bin/zsh
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     rm -rf lancedb
+     mkdir pdfs
+     exit 0
+ }
+
+ streamlit run landing_page.py &
+ wait $!