boldhasnain committed
Commit 4e20cdf · verified · 1 Parent(s): 8045fd9

Upload 9 files

Files changed (9)
  1. Dockerfile +18 -0
  2. README.md +19 -11
  3. docker_rag.sh +27 -0
  4. feedback_loop.txt +49 -0
  5. landing_page.py +313 -0
  6. requirements.txt +53 -0
  7. software_data.txt +0 -0
  8. software_final.txt +0 -0
  9. streamlit_rag.sh +14 -0
Dockerfile ADDED
@@ -0,0 +1,18 @@
+ FROM python:3.9
+
+ MAINTAINER pranavrao25
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+
+ RUN apt-get update \
+     && apt-get -y install tesseract-ocr
+
+ RUN pip install -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 8501
+
+ # exec-form CMD does not spawn a shell, so "nohup ... &" would be passed as literal
+ # arguments; run Streamlit in the foreground as the container's main process instead
+ CMD ["streamlit", "run", "landing_page.py"]
README.md CHANGED
@@ -1,11 +1,19 @@
- ---
- title: RAG
- emoji: 🌍
- colorFrom: red
- colorTo: red
- sdk: docker
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multi-modal RAG based LLM for Information Retrieval
+
+ In this project we have set up a RAG system with the following features:
+ <ol>
+ <li>Custom PDF input</li>
+ <li>Multi-modal interface with support for images & text</li>
+ <li>Feedback recording and reuse</li>
+ <li>Use of agents for context retrieval</li>
+ </ol>
+
+ The project primarily runs on Streamlit<br>
+ Here is the [Docker Image](https://hub.docker.com/repository/docker/pranavrao25/ragimage/general)<br>
+
+ Procedure to run the pipeline:
+ 1. Clone the project
+ 2. To run via the Docker image, run the ```docker_rag.sh``` script as ```/bin/bash ./docker_rag.sh``` (see the sketch below)
+ 3. To run directly with Streamlit:
+    1. Install the requirements: ```pip install -r requirements.txt```
+    2. Run the ```streamlit_rag.sh``` script as ```/bin/zsh ./streamlit_rag.sh```
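The Docker path in step 2 boils down to pulling and running the published image; a minimal equivalent, assuming Docker is already installed (image tag taken from ```docker_rag.sh```):

```bash
docker run -p 8501:8501 pranavrao25/ragimage:image
```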
docker_rag.sh ADDED
@@ -0,0 +1,27 @@
+ #!/bin/bash
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     mkdir pdfs
+     exit 0
+ }
+
+ sudo apt-get update
+ sudo apt-get install tesseract-ocr
+ echo "TESSERACT INSTALLED"
+ sudo apt install apt-transport-https ca-certificates curl software-properties-common
+ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
+ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
+ apt-cache policy docker-ce
+ sudo apt install docker.io
+ echo "DOCKER INSTALLED"
+ sudo apt install python3.12-venv
+ python3 -m venv ragenv
+ echo "VIRTUAL ENVIRONMENT CREATED"
+ source ragenv/bin/activate
+ echo "RUNNING RAG"
+ sudo docker run -p 8501:8501 pranavrao25/ragimage:image &
+ wait $!
feedback_loop.txt ADDED
@@ -0,0 +1,49 @@
+ The feedback for How to create or edit recipes? is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ ent process? is POSITIVE Good and the response is Alignment marks are specific features or structures with well-known coordinates used in the alignment process. To execute an alignment, an image displaying these structural characteristics is needed. These alignment marks can either be specially designed adjusting aids or distinctive features of already patterned structures. Both types of features serve the same purpose: to ensure precise alignment by providing reference points that can be easily identified and used for accurate positioning during the alignment process.
+
+ The feedback for What does alignment accuracy refer to, and how is it achieved in a microscopy context? is POSITIVE Good and the response is Alignment accuracy refers to the variation between the actual position and the target position of a specimen under a microscope. Achieving high alignment accuracy involves a process called alignment, where the beam deflection is corrected through digital shift and rotation to minimize the mismatch between the actual and target positions. This method can achieve precision within tens of nanometers, ensuring that the specimen is accurately positioned and aligned for detailed examination and analysis.
+
+ The feedback for What are adjustment points in the context of using a microscope, and why are they important? is POSITIVE Good and the response is Adjustment points are specific locations on a specimen with well-known coordinates, essential for determining the coordinate transformation from the specimen system to the stage system on a microscope. When a specimen is loaded onto the microscope stage, the absolute position and angle relative to the stage are initially undefined. The adjustment procedure uses these points to establish a precise coordinate system, enabling accurate navigation of the specimen. The accuracy of this adjustment relies on the precision of the microscope stage and the selection of adjustment points, ensuring the system gathers maximum information while avoiding overdetermination.
+
+ The feedback for What is present in the menu bar? is POSITIVE Good and the response is The present in the menu bar are: File Menu, Edit Menu, View Menu, Sample Menu, Image Menu, and Settings Menu.
+
+ The feedback for What is a beam blanker? is POSITIVE Good and the response is A beam blanker is a device that moves back with the intended direction of motion to avoid unintended exposure during standby times and beam settling times. An electrostatic blanker has the advantage of being able to switch the beam on and off very fast.
+
+ The feedback for How to create a recipe with exclusive function? is POSITIVE Good and the response is 1. To create a recipe with exclusive function, you can either choose the entry 'Exclusive' in the Material drop-down list or click at the (Create Recipe) Pen button next to the Material drop-down list. 2. If there are any previously assigned parameters, they will be used as default when changing to Exclusive Mode. Doing so, the Pen button changes to the Save Changes button. By pressing this button, the user is offered a possibility to save the edited set of exposure parameters as a new recipe. Apart from user saving the edited recipe as a new recipe, the user can even overwrite the existing recipe as well.
+
+ The feedback for What does Capture button in Image menu do? is NEGATIVE and the response is To capture an image in eLitho, first select the desired image mode (switch between Live and sample mode). Then click on the "Capture" button in the image menu. This will start capturing the image. You can stop the image capture by clicking on the "Freeze" button or by pressing the "End Frame" key.
+
+ The feedback for Hello is POSITIVE and the response is Hello! How can I help you today?
landing_page.py ADDED
@@ -0,0 +1,313 @@
+ import shutil
+ import streamlit as st
+ st.set_page_config(
+     page_title="RAG Configuration",
+     page_icon="🤖",
+     layout="wide",
+     initial_sidebar_state="collapsed"
+ )
+ import re
+ import os
+ import pytesseract  # used below for OCR on extracted figures
+ import spire.pdf
+ import fitz
+ from PIL import Image  # used below to load extracted figure files
+ from src.Databases import *
+ from langchain.text_splitter import *
+ from sentence_transformers import SentenceTransformer, CrossEncoder
+ from langchain_community.llms import HuggingFaceHub
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from transformers import (AutoFeatureExtractor, AutoModel, AutoImageProcessor)
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+
+ class SentenceTransformerEmbeddings:
+     """
+     Wrapper class for the SentenceTransformer class
+     """
+
+     def __init__(self, model_name: str):
+         """
+         Initialises a SentenceTransformer
+         """
+         self.model = SentenceTransformer(model_name)
+
+     def embed_documents(self, texts):
+         """
+         Returns a list of embeddings for the given texts.
+         """
+         return self.model.encode(texts, convert_to_tensor=True).tolist()
+
+     def embed_query(self, text):
+         """
+         Returns a list of embeddings for the given text.
+         """
+         return self.model.encode(text, convert_to_tensor=True).tolist()
+
+
+ @st.cache_resource(show_spinner=False)
+ def settings():
+     return HuggingFaceEmbedding(model_name="BAAI/bge-base-en")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")  # 768 dimensions + euclidean
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_image_model(model):
+     extractor = AutoFeatureExtractor.from_pretrained(model)
+     im_model = AutoModel.from_pretrained(model)
+     return extractor, im_model
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_bi_encoder():
+     return HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2", model_kwargs={"device": "cpu"})
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_cross():
+     return CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_chat_model():
+     template = '''
+     You are an assistant for question-answering tasks.
+     Use the following pieces of retrieved context to answer the question accurately.
+     If the question is not related to the context, just answer 'I don't know'.
+     Question: {question}
+     Context: {context}
+     Answer:
+     '''
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.1",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512, "query_wrapper_prompt": template}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_q_model():
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.3",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_nomic_model():
+     return AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5"), AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5",
+                                                                                                              trust_remote_code=True)
+
+
+ @st.cache_resource(show_spinner=False)
+ def vector_database_prep(file):
+     def data_prep(file):
+         def findWholeWord(w):
+             return re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
+
+         file_name = file.name
+         pdf_file_path = os.path.join(os.getcwd(), 'pdfs', file_name)
+         image_folder = os.path.join(os.getcwd(), f'figures_{file_name}')
+         if not os.path.exists(image_folder):
+             os.makedirs(image_folder)
+
+         # everything down here is wrt the pages dir
+         print('1. folder made')
+         # extract embedded images page by page with Spire.PDF
+         with spire.pdf.PdfDocument() as doc:
+             doc.LoadFromFile(pdf_file_path)
+             images = []
+             for page_num in range(doc.Pages.Count):
+                 page = doc.Pages[page_num]
+                 for image_num in range(len(page.ImagesInfo)):
+                     imageFileName = os.path.join(image_folder, f'figure-{page_num}-{image_num}.png')
+                     image = page.ImagesInfo[image_num]
+                     image.Image.Save(imageFileName)
+                     images.append({
+                         "image_file_name": imageFileName,
+                         "image": image
+                     })
+         print('2. image extraction done')
+         image_info = []
+         for image_file in os.listdir(image_folder):
+             if image_file.endswith('.png'):
+                 image_info.append({
+                     "image_file_name": image_file[:-4],
+                     "image": Image.open(os.path.join(image_folder, image_file)),
+                     "pg_no": int(image_file.split('-')[1])
+                 })
+         print('3. temporary')
+         figures = []
+         with fitz.open(pdf_file_path) as pdf_file:
+             # collect the full text, skipping table-of-contents and index pages
+             data = ""
+             for page in pdf_file:
+                 text = page.get_text()
+                 if not (findWholeWord('table of contents')(text) or findWholeWord('index')(text)):
+                     data += text
+             data = data.replace('}', '-')
+             data = data.replace('{', '-')
+             print('4. Data extraction done')
+             # group page text under bold headers and rename images after figure mentions
+             hs = []
+             for i in image_info:
+                 src = i['image_file_name'] + '.png'
+                 headers = {'_': []}
+                 header = '_'
+                 page = pdf_file[i['pg_no']]
+                 texts = page.get_text('dict')
+                 for block in texts['blocks']:
+                     if block['type'] == 0:
+                         for line in block['lines']:
+                             for span in line['spans']:
+                                 if 'bol' in span['font'].lower() and not span['text'].isnumeric():
+                                     header = span['text']
+                                     print("header: ", header)
+                                     headers[header] = [header]
+                                 else:
+                                     headers[header].append(span['text'])
+                                 try:
+                                     if findWholeWord('fig')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].split('fig')[-1])
+                                     elif findWholeWord('figure')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].lower().split('figure')[-1])
+                                     else:
+                                         pass
+                                 except re.error:
+                                     pass
+                 if not i['image_file_name'].endswith('.png'):
+                     s = i['image_file_name'] + '.png'
+                     i['image_file_name'] = s
+                 os.rename(os.path.join(image_folder, src), os.path.join(image_folder, i['image_file_name']))
+                 hs.append({"image": i, "header": headers})
+             print('5. header and figures done')
+             # gather every text span that mentions each figure
+             figure_contexts = {}
+             for fig in figures:
+                 figure_contexts[fig] = []
+                 for page_num in range(len(pdf_file)):
+                     page = pdf_file[page_num]
+                     texts = page.get_text('dict')
+                     for block in texts['blocks']:
+                         if block['type'] == 0:
+                             for line in block['lines']:
+                                 for span in line['spans']:
+                                     if findWholeWord(fig)(span['text']):
+                                         print('figure mention: ', span['text'])
+                                         figure_contexts[fig].append(span['text'])
+             print('6. Figure context collected')
+             # combine header text with OCR output for each image
+             contexts = []
+             for h in hs:
+                 context = ""
+                 for q in h['header'].values():
+                     context += "".join(q)
+                 s = pytesseract.image_to_string(h['image']['image'])
+                 qwea = context + '\n' + s if len(s) != 0 else context
+                 contexts.append((
+                     h['image']['image_file_name'],
+                     qwea,
+                     h['image']['image']
+                 ))
+             print('7. Overall context collected')
+             image_content = []
+             for fig in figure_contexts:
+                 for c in contexts:
+                     if findWholeWord(fig)(c[0]):
+                         s = c[1] + '\n' + "\n".join(figure_contexts[fig])
+                         s = str("\n".join(
+                             [
+                                 "".join([h for h in i.strip() if h.isprintable()])
+                                 for i in s.split('\n')
+                                 if len(i.strip()) != 0
+                             ]
+                         ))
+                         image_content.append((
+                             c[0],
+                             s,
+                             c[2]
+                         ))
+             print('8. Figure context added')
+
+         return data, image_content
+
+     # Vector Database objects
+     extractor, i_model = st.session_state['extractor'], st.session_state['image_model']
+     pinecone_embed = st.session_state['pinecone_embed']
+     weaviate_embed = st.session_state['weaviate_embed']
+
+     vb1 = UnifiedDatabase('vb1', 'lancedb/rag')
+     vb1.model_prep(extractor, i_model, weaviate_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb2 = UnifiedDatabase('vb2', 'lancedb/rag')
+     vb2.model_prep(extractor, i_model, pinecone_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb_list = [vb1, vb2]
+
+     data, image_content = data_prep(file)
+     for vb in vb_list:
+         vb.upsert(data)
+         vb.upsert(image_content)  # image_content is a list of (image_file_path, context, PIL image) tuples
+     return vb_list
+
+
+ os.environ["HUGGINGFACEHUB_API_TOKEN"] = st.secrets["HUGGINGFACEHUB_API_TOKEN"]
+ os.environ["LANGCHAIN_PROJECT"] = st.secrets["LANGCHAIN_PROJECT"]
+ os.environ["OPENAI_API_KEY"] = st.secrets["GPT_KEY"]
+ st.session_state['pdf_file'] = []
+ st.session_state['vb_list'] = []
+ st.session_state['Settings.embed_model'] = settings()
+ st.session_state['processor'], st.session_state['vision_model'] = load_nomic_model()
+ st.session_state['bi_encoder'] = load_bi_encoder()
+ st.session_state['chat_model'] = load_chat_model()
+ st.session_state['cross_model'] = load_cross()
+ st.session_state['q_model'] = load_q_model()
+ st.session_state['extractor'], st.session_state['image_model'] = load_image_model("google/vit-base-patch16-224-in21k")
+ st.session_state['pinecone_embed'] = pine_embedding_model()
+ st.session_state['weaviate_embed'] = weaviate_embedding_model()
+
+ st.title('Multi-modal RAG based LLM for Information Retrieval')
+ st.subheader('Converse with our Chatbot')
+ st.markdown('Upload a PDF file as a source.')
+ uploaded_file = st.file_uploader("Choose a PDF document...", type=["pdf"], accept_multiple_files=False)
+ if uploaded_file is not None:
+     # save the upload locally, then move it into the pdfs directory
+     with open(uploaded_file.name, mode='wb') as w:
+         w.write(uploaded_file.getvalue())
+     if not os.path.exists(os.path.join(os.getcwd(), 'pdfs')):
+         os.makedirs(os.path.join(os.getcwd(), 'pdfs'))
+     shutil.move(uploaded_file.name, os.path.join(os.getcwd(), 'pdfs'))
+     st.session_state['pdf_file'] = uploaded_file.name
+     with st.spinner('Extracting'):
+         vb_list = vector_database_prep(uploaded_file)
+     st.session_state['vb_list'] = vb_list
+     st.switch_page('pages/rag.py')
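Note that ```landing_page.py``` reads three keys via ```st.secrets``` at startup, so they must exist in ```.streamlit/secrets.toml``` before launch. A minimal sketch, assuming local (non-Spaces) deployment; the key names come from the code above, and all values are placeholders:

```bash
mkdir -p .streamlit
cat > .streamlit/secrets.toml <<'EOF'
HUGGINGFACEHUB_API_TOKEN = "hf_..."   # Hugging Face Hub token
LANGCHAIN_PROJECT = "my-rag-project"  # LangSmith project name (placeholder)
GPT_KEY = "sk-..."                    # exported as OPENAI_API_KEY
EOF
```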
requirements.txt ADDED
@@ -0,0 +1,53 @@
+ streamlit
+ langchain_openai
+ requests
+ langchain
+ langchain_community
+ datasets
+ openai
+ numpy
+ transformers
+ torch
+ sentence_transformers
+ langchain_huggingface
+ ragas
+ weaviate-client
+ streamlit_feedback
+ pinecone-client
+ langchain_pinecone
+ langchain_weaviate
+ langsmith
+ langgraph
+ pandas
+ scipy
+ pillow
+ torchvision
+ unidecode
+ pytesseract
+ langchain_mistralai
+ pymupdf
+ llmlingua
+ accelerate
+ pyarrow
+ lancedb
+ pillow_heif
+ llama-index-vector-stores-lancedb
+ llama-index
+ ftfy
+ tqdm
+ llama-index-multi-modal-llms-openai
+ llama-index-embeddings-huggingface
+ llama-index-readers-file
+ einops
+ unstructured
+ unstructured_inference
+ unstructured.pytesseract
+ pdfminer
+ llama-index-embeddings-clip
+ scikit-image
+ scikit-learn
+ matplotlib
+ Spire.Pdf
+ python-pptx
software_data.txt ADDED
The diff for this file is too large to render. See raw diff
 
software_final.txt ADDED
The diff for this file is too large to render. See raw diff
 
streamlit_rag.sh ADDED
@@ -0,0 +1,14 @@
+ #!/bin/zsh
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     rm -rf lancedb
+     mkdir pdfs
+     exit 0
+ }
+
+ streamlit run landing_page.py &
+ wait $!