Merge branch 'main' into pdf-render
- CHANGELOG.md +31 -10
- README.md +7 -3
- document_qa/document_qa_engine.py +66 -25
- document_qa/grobid_processors.py +1 -1
- pyproject.toml +1 -1
- streamlit_app.py +15 -11
CHANGELOG.md
@@ -4,27 +4,49 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

+## [0.3.1] - 2023-11-22
+
+### Added
+
++ Include biblio in embeddings by @lfoppiano in #21
+
+### Fixed
+
++ Fix conversational memory by @lfoppiano in #20
+
+## [0.3.0] - 2023-11-18
+
+### Added
+
++ add zephyr-7b by @lfoppiano in #15
++ add conversational memory in #18
+
+## [0.2.1] - 2023-11-01
+
+### Fixed
+
++ fix env variables by @lfoppiano in #9

 ## [0.2.0] – 2023-10-31

 ### Added
+
 + Selection of chunk size on which embeddings are created upon
-+ Mistral model to be used freely via the Huggingface free API
++ Mistral model to be used freely via the Huggingface free API

 ### Changed
-
+
++ Improved documentation, adding privacy statement
 + Moved settings on the sidebar
 + Disable NER extraction by default, and allow user to activate it
 + Read API KEY from the environment variables and if present, avoid asking the user
 + Avoid changing model after update

-
-
 ## [0.1.3] – 2023-10-30

 ### Fixed

-+ ChromaDb accumulating information even when new papers were uploaded
++ ChromaDb accumulating information even when new papers were uploaded

 ## [0.1.2] – 2023-10-26

@@ -36,9 +58,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ### Fixed

-+ Github action build
-+ dependencies of langchain and chromadb
-
++ Github action build
++ dependencies of langchain and chromadb

 ## [0.1.0] – 2023-10-26

@@ -54,8 +75,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 + Kick off application
 + Support for GPT-3.5
 + Support for Mistral + SentenceTransformer
-+ Streamlit application
-+ Docker image
++ Streamlit application
++ Docker image
 + pypi package

 <!-- markdownlint-disable-file MD024 MD033 -->
README.md
@@ -14,6 +14,8 @@ license: apache-2.0

 **Work in progress** :construction_worker:

+<img src="https://github.com/lfoppiano/document-qa/assets/15426/f0a04a86-96b3-406e-8303-904b93f00015" width=300 align="right" />
+
 ## Introduction

 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
@@ -23,11 +25,13 @@ We target only the full-text using [Grobid](https://github.com/kermitt2/grobid)

 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span stype="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span stype="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

-The conversation is
+The conversation is kept in memory up by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
+
+(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)

 **Demos**:
-- (
-- (
+- (stable version): https://lfoppiano-document-qa.hf.space/
+- (unstable version): https://document-insights.streamlit.app/

 ## Getting started

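The README addition above describes a buffered sliding-window memory whose contents are injected into the context as "previous messages". The following is a minimal, library-free sketch of that idea, not the project's implementation: `SlidingWindowMemory` and `build_context` are hypothetical names, and the window is counted in individual messages because that is how the README words it.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent `window_size` messages (hypothetical helper)."""

    def __init__(self, window_size=4):
        # A deque with maxlen drops the oldest entry once it is full.
        self._messages = deque(maxlen=window_size)

    def add(self, role, content):
        self._messages.append((role, content))

    def as_text(self):
        return "\n".join(f"{role}: {content}" for role, content in self._messages)


def build_context(passages, memory):
    """Return the retrieved passages plus one extra entry holding the remembered messages."""
    context = list(passages)
    history = memory.as_text()
    if history:
        context.append("Previous messages:\n" + history)
    return context


memory = SlidingWindowMemory(window_size=4)
for i in range(1, 4):
    memory.add("Human", f"question {i}")
    memory.add("AI", f"answer {i}")

# Six messages were added, so the first question/answer pair has been dropped.
context = build_context(["passage A", "passage B"], memory)
```

Appending the history as one more context entry mirrors what the engine diff does with an extra `Document`; the retrieval step itself is left out of the sketch.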
document_qa/document_qa_engine.py
@@ -3,17 +3,18 @@ import os
 from pathlib import Path
 from typing import Union, Any

+from document_qa.grobid_processors import GrobidProcessor
 from grobid_client.grobid_client import GrobidClient
-from langchain.chains import create_extraction_chain
-from langchain.chains.question_answering import load_qa_chain
+from langchain.chains import create_extraction_chain, ConversationChain, ConversationalRetrievalChain
+from langchain.chains.question_answering import load_qa_chain, stuff_prompt, refine_prompts, map_reduce_prompt, \
+    map_rerank_prompt
 from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
 from langchain.retrievers import MultiQueryRetriever
+from langchain.schema import Document
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import Chroma
 from tqdm import tqdm

-from document_qa.grobid_processors import GrobidProcessor
-

 class DocumentQAEngine:
     llm = None
@@ -23,15 +24,24 @@ class DocumentQAEngine:
     embeddings_map_from_md5 = {}
     embeddings_map_to_md5 = {}

+    default_prompts = {
+        'stuff': stuff_prompt,
+        'refine': refine_prompts,
+        "map_reduce": map_reduce_prompt,
+        "map_rerank": map_rerank_prompt
+    }
+
     def __init__(self,
                  llm,
                  embedding_function,
                  qa_chain_type="stuff",
                  embeddings_root_path=None,
                  grobid_url=None,
+                 memory=None
                  ):
         self.embedding_function = embedding_function
         self.llm = llm
+        self.memory = memory
         self.chain = load_qa_chain(llm, chain_type=qa_chain_type)

         if embeddings_root_path is not None:
@@ -87,14 +97,14 @@ class DocumentQAEngine:
         return self.embeddings_map_from_md5[md5]

     def query_document(self, query: str, doc_id, output_parser=None, context_size=4, extraction_schema=None,
-                       verbose=False
+                       verbose=False) -> (
             Any, str):
         # self.load_embeddings(self.embeddings_root_path)

         if verbose:
             print(query)

-        response = self._run_query(doc_id, query, context_size=context_size
+        response = self._run_query(doc_id, query, context_size=context_size)
         response = response['output_text'] if 'output_text' in response else response

         if verbose:
@@ -144,21 +154,25 @@ class DocumentQAEngine:

         return parsed_output

-    def _run_query(self, doc_id, query,
+    def _run_query(self, doc_id, query, context_size=4):
         relevant_documents = self._get_context(doc_id, query, context_size)
-
-        return self.chain.run(input_documents=relevant_documents,
+        response = self.chain.run(input_documents=relevant_documents,
                                   question=query)
-
-
-
-
-        # return self.chain({"input_documents": relevant_documents, "question": prompt_chat_template}, return_only_outputs=True)
+
+        if self.memory:
+            self.memory.save_context({"input": query}, {"output": response})
+        return response

     def _get_context(self, doc_id, query, context_size=4):
         db = self.embeddings_dict[doc_id]
         retriever = db.as_retriever(search_kwargs={"k": context_size})
         relevant_documents = retriever.get_relevant_documents(query)
+        if self.memory and len(self.memory.buffer_as_messages) > 0:
+            relevant_documents.append(
+                Document(
+                    page_content="""Following, the previous question and answers. Use these information only when in the question there are unspecified references:\n{}\n\n""".format(
+                        self.memory.buffer_as_str))
+            )
         return relevant_documents

     def get_all_context_by_document(self, doc_id):
@@ -173,8 +187,10 @@ class DocumentQAEngine:
         relevant_documents = multi_query_retriever.get_relevant_documents(query)
         return relevant_documents

-    def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, verbose=False):
-        """
+    def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, include=(), verbose=False):
+        """
+        Extract text from documents using Grobid, if chunk_size is < 0 it keeps each paragraph separately
+        """
         if verbose:
             print("File", pdf_file_path)
         filename = Path(pdf_file_path).stem
@@ -189,6 +205,7 @@ class DocumentQAEngine:
         texts = []
         metadatas = []
         ids = []
+
         if chunk_size < 0:
             for passage in structure['passages']:
                 biblio_copy = copy.copy(biblio)
@@ -212,28 +229,49 @@ class DocumentQAEngine:
             metadatas = [biblio for _ in range(len(texts))]
             ids = [id for id, t in enumerate(texts)]

+        if "biblio" in include:
+            biblio_metadata = copy.copy(biblio)
+            biblio_metadata['type'] = "biblio"
+            biblio_metadata['section'] = "header"
+            for key in ['title', 'authors', 'publication_year']:
+                if key in biblio_metadata:
+                    texts.append("{}: {}".format(key, biblio_metadata[key]))
+                    metadatas.append(biblio_metadata)
+                    ids.append(key)
+
         return texts, metadatas, ids

-    def create_memory_embeddings(self, pdf_path, doc_id=None, chunk_size=500, perc_overlap=0.1):
-
+    def create_memory_embeddings(self, pdf_path, doc_id=None, chunk_size=500, perc_overlap=0.1, include_biblio=False):
+        include = ["biblio"] if include_biblio else []
+        texts, metadata, ids = self.get_text_from_document(
+            pdf_path,
+            chunk_size=chunk_size,
+            perc_overlap=perc_overlap,
+            include=include)
         if doc_id:
             hash = doc_id
         else:
             hash = metadata[0]['hash']

         if hash not in self.embeddings_dict.keys():
-            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+                                                           embedding=self.embedding_function,
+                                                           metadatas=metadata,
                                                            collection_name=hash)
         else:
-            self.embeddings_dict[hash].
-            self.embeddings_dict[hash]
+            # if 'documents' in self.embeddings_dict[hash].get() and len(self.embeddings_dict[hash].get()['documents']) == 0:
+            #     self.embeddings_dict[hash].delete(ids=self.embeddings_dict[hash].get()['ids'])
+            self.embeddings_dict[hash].delete_collection()
+            self.embeddings_dict[hash] = Chroma.from_texts(texts,
+                                                           embedding=self.embedding_function,
+                                                           metadatas=metadata,
                                                            collection_name=hash)

         self.embeddings_root_path = None

         return hash

-    def create_embeddings(self, pdfs_dir_path: Path, chunk_size=500, perc_overlap=0.1):
+    def create_embeddings(self, pdfs_dir_path: Path, chunk_size=500, perc_overlap=0.1, include_biblio=False):
         input_files = []
         for root, dirs, files in os.walk(pdfs_dir_path, followlinks=False):
             for file_ in files:
@@ -250,9 +288,12 @@ class DocumentQAEngine:
             if os.path.exists(data_path):
                 print(data_path, "exists. Skipping it ")
                 continue
-
-            texts, metadata, ids = self.get_text_from_document(
-
+            include = ["biblio"] if include_biblio else []
+            texts, metadata, ids = self.get_text_from_document(
+                input_file,
+                chunk_size=chunk_size,
+                perc_overlap=perc_overlap,
+                include=include)
             filename = metadata[0]['filename']

             vector_db_document = Chroma.from_texts(texts,
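The `include=("biblio",)` path added to `get_text_from_document` turns selected bibliographic fields into extra text chunks so that questions about the title, authors or year can be answered from the embeddings. The step can be isolated as the standalone sketch below. `append_biblio_chunks` is a hypothetical name and the sample record is made up; the field names and the "key: value" format follow the diff.

```python
import copy


def append_biblio_chunks(texts, metadatas, ids, biblio,
                         keys=("title", "authors", "publication_year")):
    """Add one "key: value" chunk per available bibliographic field (hypothetical helper)."""
    # Shallow copy so the tags below do not leak into the caller's record.
    biblio_metadata = copy.copy(biblio)
    biblio_metadata["type"] = "biblio"
    biblio_metadata["section"] = "header"
    for key in keys:
        if key in biblio_metadata:
            texts.append("{}: {}".format(key, biblio_metadata[key]))
            metadatas.append(biblio_metadata)
            ids.append(key)
    return texts, metadatas, ids


# Made-up record with no "authors" field, so only two chunks are added.
biblio = {"title": "An example paper", "publication_year": 2023}
texts, metadatas, ids = append_biblio_chunks(
    ["first paragraph"], [{"type": "paragraph"}], [0], biblio)
```

As in the diff, the ids of the added chunks are the field names while the paragraph ids are integers. The `Chroma.from_texts` calls visible in the diff do not pass `ids`, so the mixed types have no effect there.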
document_qa/grobid_processors.py
@@ -171,7 +171,7 @@ class GrobidProcessor(BaseProcessor):
         }
         try:
             year = dateparser.parse(doc_biblio.header.date).year
-            biblio["
+            biblio["publication_year"] = year
         except:
             pass

pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools", "setuptools-scm"]
 build-backend = "setuptools.build_meta"

 [tool.bumpversion]
-current_version = "0.3.
+current_version = "0.3.2"
 commit = "true"
 tag = "true"
 tag_name = "v{new_version}"
streamlit_app.py
@@ -115,6 +115,7 @@ def clear_memory():

 # @st.cache_resource
 def init_qa(model, api_key=None):
+    ## For debug add: callbacks=[PromptLayerCallbackHandler(pl_tags=["langchain", "chatgpt", "document-qa"])])
     if model == 'chatgpt-3.5-turbo':
         if api_key:
             chat = ChatOpenAI(model_name="gpt-3.5-turbo",
@@ -143,7 +144,7 @@ def init_qa(model, api_key=None):
         st.stop()
         return

-    return DocumentQAEngine(chat, embeddings, grobid_url=os.environ['GROBID_URL'])
+    return DocumentQAEngine(chat, embeddings, grobid_url=os.environ['GROBID_URL'], memory=st.session_state['memory'])


 @st.cache_resource
@@ -252,7 +253,8 @@ with st.sidebar:

     st.button(
         'Reset chat memory.',
-
+        key="reset-memory-button",
+        on_click=clear_memory,
         help="Clear the conversational memory. Currently implemented to retrain the 4 most recent messages.")

 left_column, right_column = st.columns([1, 1])
@@ -264,7 +266,9 @@ with right_column:
     st.markdown(
         ":warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use or handling of the data submitted to third parties LLMs.")

-
+    uploaded_file = st.file_uploader("Upload an article",
+                                     type=("pdf", "txt"),
+                                     on_change=new_file,
                                      disabled=st.session_state['model'] is not None and st.session_state['model'] not in
                                               st.session_state['api_keys'],
                                      help="The full-text is extracted using Grobid. ")
@@ -331,7 +335,8 @@ if uploaded_file and not st.session_state.loaded_embeddings:

         st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(tmp_file.name,
                                                                                                      chunk_size=chunk_size,
-
+                                                                                                     perc_overlap=0.1,
+                                                                                                     include_biblio=True)
         st.session_state['loaded_embeddings'] = True
         st.session_state.messages = []

@@ -384,8 +389,7 @@ with right_column:
         elif mode == "LLM":
             with st.spinner("Generating response..."):
                 _, text_response = st.session_state['rqa'][model].query_document(question, st.session_state.doc_id,
-
-                                                                                 memory=st.session_state.memory)
+                                                                                 context_size=context_size)

             if not text_response:
                 st.error("Something went wrong. Contact Luca Foppiano (Foppiano.Luca@nims.co.jp) to report the issue.")
@@ -404,11 +408,11 @@ with right_column:
                 st.write(text_response)
             st.session_state.messages.append({"role": "assistant", "mode": mode, "content": text_response})

-
-
-
-
-
+        # if len(st.session_state.messages) > 1:
+        #     last_answer = st.session_state.messages[len(st.session_state.messages)-1]
+        #     if last_answer['role'] == "assistant":
+        #         last_question = st.session_state.messages[len(st.session_state.messages)-2]
+        #         st.session_state.memory.save_context({"input": last_question['content']}, {"output": last_answer['content']})

     elif st.session_state.loaded_embeddings and st.session_state.doc_id:
         play_old_messages()