# **Document Search System**

## **Overview**

The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.

---

## **Features**

1. **Query Classification:**
   - Detects malicious or inappropriate queries using a sentiment analysis model.
   - Blocks malicious queries so that they are not processed further.

2. **Query Transformation:**
   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
   - Uses rule-based transformations and text-to-text models.

3. **RAG Pipeline:**
   - Retrieves the top-k documents based on semantic similarity.
   - Generates context-aware responses using generative models.

4. **Blockchain Integration (Chagu):**
   - Logs all stages of query processing to a blockchain for integrity and traceability.
   - Validates blockchain integrity.

5. **Neo4j Integration:**
   - Stores and visualizes relationships between queries, responses, and documents.
   - Allows detailed querying and visualization of the data flow.

---

## **Workflow**

The system processes each user query in the following stages:

### **1. Input Query**

- A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

---

### **2. Detection Module**

- **Purpose**: Classify the query as "bad" or "good."
- **Steps**:
  1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and return a warning message.
  3. If "good," proceed to the **Transformation Module**.

---

### **3. Transformation Module**

- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- **Steps**:
  1. Identify missing context or ambiguous phrasing.
  2. Transform the query using:
     - Rule-based transformations for simple fixes.
     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
  3. Pass the transformed query to the **RAG Pipeline**.

---

### **4. RAG Pipeline**

- **Purpose**: Retrieve relevant data and generate a context-aware response.
- **Steps**:
  1. **Document Retrieval**:
     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
     - Compute semantic similarity between the query and stored documents.
     - Retrieve the top-k documents relevant to the query.
  2. **Response Generation**:
     - Use the retrieved documents as context.
     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.

---

### **5. Semantic Response Generation**

- **Purpose**: Provide a concise and meaningful answer.
- **Steps**:
  1. Combine the retrieved documents into a coherent context.
  2. Generate a response tailored to the query using the generative model.
  3. Return the response to the user, ensuring clarity and relevance.

---

### **6. Logging and Storage**

- **Blockchain Logging:**
  - Each query, transformed query, response, and document retrieval stage is logged to the blockchain for traceability.
  - Ensures data integrity and tamper-proof records.
- **Neo4j Storage:**
  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
  - Enables detailed analysis and graph-based visualization.

---

## **Neo4j Visualization**

Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:

![Neo4j Visualization](../../screenshots/Screenshot_from_2024-11-30_19-01-31.png)

- **Nodes**:
  - Query: Represents the user query.
  - TransformedQuery: Rephrased or improved query.
  - Document: Relevant documents retrieved based on the query.
  - Response: The generated response.
- **Relationships**:
  - `RETRIEVED`: Links the query to retrieved documents.
  - `TRANSFORMED_TO`: Links the original query to the transformed query.
  - `GENERATED`: Links the query to the generated response.

---

## **Setup Instructions**

1. Clone the repository:
   ```bash
   git clone https://github.com/your-repo/document-search-system.git
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Initialize the Neo4j database:
   - Connect to your Neo4j Aura instance.
   - Set up credentials in the code.
4. Load the dataset:
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
5. Run the system:
   ```bash
   python document_search_system.py
   ```

## **Neo4j Queries**

### **Retrieve All Queries Logged**

```cypher
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
```

### **Visualize Query Relationships**

```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m
```

### **Find Documents for a Query**

```cypher
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
```

## **Key Technologies**

- **Machine Learning Models:**
  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
  - `google/flan-t5-small` for query transformation.
  - `distilgpt2` for response generation.
- **Vector Similarity Search:** `all-MiniLM-L6-v2` embeddings for document retrieval.
- **Blockchain Logging:** Powered by `chainguard.blockchain_logger`.
- **Graph-Based Storage:** Relationships visualized and queried via Neo4j.
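
To illustrate the blockchain logging and integrity validation described in this README, here is a minimal sketch of a hash-chained log in plain Python. It is not the API of `chainguard.blockchain_logger`: the names `Block`, `HashChainLog`, and `_digest`, as well as the stage labels, are hypothetical and chosen only for this example. Each block stores the hash of the previous block, so changing any logged stage after the fact causes validation to fail.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One logged pipeline stage (hypothetical structure for illustration)."""
    index: int
    stage: str      # e.g. "query", "transformed_query", "retrieval", "response"
    payload: str    # the text logged for this stage
    prev_hash: str  # hash of the preceding block
    hash: str       # hash over this block's own fields


def _digest(index: int, stage: str, payload: str, prev_hash: str) -> str:
    """Return the SHA-256 hex digest of a block's fields."""
    body = json.dumps([index, stage, payload, prev_hash])
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class HashChainLog:
    """Append-only log in which every block commits to its predecessor."""

    GENESIS = "0" * 64  # placeholder previous-hash for the first block

    def __init__(self) -> None:
        self.blocks: list[Block] = []

    def append(self, stage: str, payload: str) -> Block:
        """Add a new block that links to the current last block."""
        prev_hash = self.blocks[-1].hash if self.blocks else self.GENESIS
        index = len(self.blocks)
        block = Block(index, stage, payload, prev_hash,
                      _digest(index, stage, payload, prev_hash))
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every hash and check every link in the chain."""
        prev_hash = self.GENESIS
        for i, block in enumerate(self.blocks):
            if block.index != i or block.prev_hash != prev_hash:
                return False
            expected = _digest(block.index, block.stage,
                               block.payload, block.prev_hash)
            if block.hash != expected:
                return False
            prev_hash = block.hash
        return True


if __name__ == "__main__":
    log = HashChainLog()
    log.append("query", "How to improve acting skills?")
    log.append("transformed_query", "What are ways to improve acting skills?")
    log.append("response", "Practice regularly and study performances.")
    print("chain valid:", log.is_valid())
```

In this sketch, one block is appended per pipeline stage, and `is_valid()` plays the role of the integrity check. A real deployment would also need persistence and some way to anchor the latest hash outside the process, which this example leaves out.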