Luca Foppiano committed on
Commit
4e6f989
·
unverified ·
1 Parent(s): 048eb6f

Fix typo, acknowledge more contributors

  1. README.md +16 -11
README.md CHANGED
@@ -19,13 +19,13 @@ license: apache-2.0
19
  ## Introduction
20
 
21
  Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
22
- The streamlit application demonstrate the implementaiton of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
23
- Differently to most of the projects, we focus on scientific articles.
24
- We target only the full-text using [Grobid](https://github.com/kermitt2/grobid) that provide and cleaner results than the raw PDF2Text converter (which is comparable with most of other solutions).
25
 
26
  Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
27
 
28
- The conversation is kept in memory up by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
29
 
30
  (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
31
 
@@ -35,9 +35,9 @@ The conversation is kept in memory up by a buffered sliding window memory (top 4
35
 
36
  ## Getting started
37
 
38
- - Select the model+embedding combination you want ot use
39
  - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
40
- - Upload a scientific article as PDF document. You will see a spinner or loading indicator while the processing is in progress.
41
  - Once the spinner stops, you can proceed to ask your questions
42
 
43
  ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)
@@ -53,9 +53,9 @@ With default settings, each question uses around 1000 tokens.
53
 
54
  ### Chunks size
55
  When uploaded, each document is split into blocks of a determined size (250 tokens by default).
56
- This setting allow users to modify the size of such blocks.
57
- Smaller blocks will result in smaller context, yielding more precise sections of the document.
58
- Larger blocks will result in larger context less constrained around the question.
59
 
60
  ### Query mode
61
  Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
@@ -65,7 +65,7 @@ Indicates whether sending a question to the LLM (Language Model) or to the vecto
65
  ### NER (Named Entities Recognition)
66
 
67
  This feature is specifically crafted for people working with scientific documents in materials science.
68
- It enables to run NER on the response from the LLM, to identify materials mentions and properties (quantities, masurements).
69
  This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
70
 
71
 
@@ -78,7 +78,9 @@ To release a new version:
78
 
79
  To use docker:
80
 
81
- - docker run `lfoppiano/document-insights-qa:latest`
 
 
82
 
83
  To install the library with Pypi:
84
 
@@ -88,6 +90,9 @@ To install the library with Pypi:
88
  ## Acknowledgement
89
 
90
  This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
 
 
 
91
 
92
 
93
 
 
19
  ## Introduction
20
 
21
  Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, Mistral-7b-instruct and Zephyr-7b-beta.
22
+ The Streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
23
+ Unlike most projects, we focus on scientific articles.
24
+ We target only the full text, using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than the raw PDF2Text converter (which is comparable with most other solutions).
25
 
26
  Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
27
 
28
+ The conversation is kept in memory by a buffered sliding window memory (the 4 most recent messages) and the messages are injected into the context as "previous messages".
29
 
30
  (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
31
 
 
35
 
36
  ## Getting started
37
 
38
+ - Select the model+embedding combination you want to use
39
  - Enter your API Key ([Open AI](https://platform.openai.com/account/api-keys) or [Huggingface](https://huggingface.co/docs/hub/security-tokens)).
40
+ - Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
41
  - Once the spinner stops, you can proceed to ask your questions
42
 
43
  ![screenshot2.png](docs%2Fimages%2Fscreenshot2.png)
 
53
 
54
  ### Chunks size
55
  When uploaded, each document is split into blocks of a determined size (250 tokens by default).
56
+ This setting allows users to modify the size of such blocks.
57
+ Smaller blocks will result in a smaller context, yielding more precise sections of the document.
58
+ Larger blocks will result in a larger context less constrained around the question.
59
 
60
  ### Query mode
61
  Indicates whether the question is sent to the LLM (Large Language Model) or to the vector storage.
 
65
  ### NER (Named Entities Recognition)
66
 
67
  This feature is specifically crafted for people working with scientific documents in materials science.
68
+ It enables running NER on the response from the LLM, to identify material mentions and properties (quantities, measurements).
69
  This feature leverages both [grobid-quantities](https://github.com/kermitt2/grobid-quantities) and [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors) external services.
70
 
71
 
 
78
 
79
  To use docker:
80
 
81
+ - docker run `lfoppiano/document-insights-qa:{latest_version}`
82
+
83
+ - docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version
84
 
85
  To install the library with Pypi:
86
 
 
90
  ## Acknowledgement
91
 
92
  This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
93
+ Contributed by Pedro Ortiz Suarez (@pjox), Tomoya Mato (@t29mato).
94
+ Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
95
+
96
 
97
 
98
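
The README text in this diff describes two mechanisms: splitting an uploaded document into fixed-size blocks (250 tokens by default) and keeping only the 4 most recent messages as "previous messages". The sketch below illustrates both ideas in plain Python. It is not the project's actual code: the names `split_into_chunks` and `SlidingWindowMemory` are hypothetical, and tokens are approximated by whitespace-separated words rather than by a model tokenizer.

```python
from collections import deque


def split_into_chunks(text, chunk_size=250):
    """Split text into consecutive blocks of at most chunk_size tokens.

    Assumption: a "token" is a whitespace-separated word; a real
    pipeline would count tokens with the model's tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


class SlidingWindowMemory:
    """Keep only the most recent `window` messages of a conversation."""

    def __init__(self, window=4):
        # A deque with maxlen silently drops the oldest item when full.
        self._messages = deque(maxlen=window)

    def add(self, message):
        self._messages.append(message)

    def as_context(self):
        # Oldest retained message first, one message per line.
        return "\n".join(self._messages)


if __name__ == "__main__":
    print(split_into_chunks("a b c d e", chunk_size=2))

    memory = SlidingWindowMemory()
    for i in range(6):
        memory.add(f"message {i}")
    print(memory.as_context())
```

With a smaller `chunk_size`, each retrieved block covers less text, which matches the README's note that smaller blocks yield more precise sections of the document.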