	# Document Search System

	## Overview
	The Document Search System provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.

	---

	## Features
	1. Query Classification:
	   - Detects malicious or inappropriate queries using a sentiment analysis model.
	   - Blocks malicious queries and prevents them from further processing.

	2. Query Transformation:
	   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
	   - Uses rule-based transformations and advanced text-to-text models.

	3. RAG Pipeline:
	   - Retrieves top-k documents based on semantic similarity.
	   - Generates context-aware responses using generative models.

	4. Blockchain Integration (Chagu):
	   - Logs all stages of query processing into a blockchain for integrity and traceability.
	   - Validates blockchain integrity.

	5. Neo4j Integration:
	   - Stores and visualizes relationships between queries, responses, and documents.
	   - Allows detailed querying and visualization of the data flow.

	---

	## Workflow

	Each query passes through six stages, from input to logging:

	### 1. Input Query
	- A user provides a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

	---

	### 2. Detection Module
	- Purpose: Classify the query as "bad" or "good."
	- Steps:
	  1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
	  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and return a warning message.
	  3. If the query is classified as "good," pass it to the Transformation Module.
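
The detection steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `detect`, the warning text, the `threshold` parameter, and the rule that a `NEGATIVE` label means "bad" are all assumptions. The classifier is passed in as a callable that returns predictions in the `{"label": ..., "score": ...}` shape produced by a Hugging Face text-classification pipeline, and `stub_classifier` is a hypothetical offline stand-in so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List

# One prediction in the shape a Hugging Face text-classification pipeline
# returns, e.g. {"label": "NEGATIVE", "score": 0.99}.
Prediction = Dict[str, object]
Classifier = Callable[[str], List[Prediction]]

# Hypothetical wording; the README does not specify the warning text.
WARNING_MESSAGE = "Query blocked: it was classified as potentially malicious."


def detect(query: str, classifier: Classifier, threshold: float = 0.5) -> Dict[str, object]:
    """Classify a query as "good" or "bad" and decide whether to block it.

    Assumption: a NEGATIVE label with score >= threshold counts as "bad".
    """
    prediction = classifier(query)[0]
    is_bad = prediction["label"] == "NEGATIVE" and float(prediction["score"]) >= threshold
    return {
        "query": query,
        "verdict": "bad" if is_bad else "good",
        "blocked": is_bad,
        "message": WARNING_MESSAGE if is_bad else None,
    }


def stub_classifier(text: str) -> List[Prediction]:
    """Offline stand-in for the real model, used only to keep the sketch runnable."""
    suspicious = ("drop table", "' or 1=1")
    label = "NEGATIVE" if any(s in text.lower() for s in suspicious) else "POSITIVE"
    return [{"label": label, "score": 0.99}]
```

With `transformers` installed, the object returned by `pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")` could be passed as `classifier` in place of the stub. Note that a sentiment model scores tone rather than intent, so the mapping from `NEGATIVE` to "bad" is a heuristic.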

	---

	### 3. Transformation Module
	- Purpose: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
	- Steps:
	  1. Identify missing context or ambiguous phrasing.
	  2. Transform the query using:
	     - Rule-based transformations for simple fixes.
	     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
	  3. Pass the transformed query to the RAG Pipeline.
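
A rough sketch of the two-tier transformation described above: cheap rules first, then an optional model-based rephrase for queries that still look ambiguous. The specific rules, the `MIN_WORDS` cutoff, and the names `apply_rules` and `transform` are illustrative assumptions, not taken from the project.

```python
import re
from typing import Callable, Optional

# Hypothetical rewrite rules: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"\s+"), " "),                               # collapse runs of whitespace
    (re.compile(r"\bdocs\b", re.IGNORECASE), "documents"),   # expand a common abbreviation
    (re.compile(r"\binfo\b", re.IGNORECASE), "information"),
]

# Assumption: queries shorter than this many words are treated as ambiguous.
MIN_WORDS = 3


def apply_rules(query: str) -> str:
    """Apply the rule-based fixes to a raw query."""
    text = query.strip()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def transform(query: str, rephrase: Optional[Callable[[str], str]] = None) -> str:
    """Clean the query with rules, then hand short queries to a rephrasing model."""
    cleaned = apply_rules(query)
    if rephrase is not None and len(cleaned.split()) < MIN_WORDS:
        return rephrase(cleaned)
    return cleaned
```

The `rephrase` callable is where a text-to-text model such as `google/flan-t5-small` would plug in, for example by wrapping a `text2text-generation` pipeline and returning its `generated_text`. Keeping it optional means the rule-based path works without any model loaded.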

	---

	### 4. RAG Pipeline
	- Purpose: Retrieve relevant data and generate a context-aware response.
	- Steps:
	  1. Document Retrieval:
	     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
	     - Compute semantic similarity between the query and stored documents.
	     - Retrieve the top-k documents relevant to the query.
	  2. Response Generation:
	     - Use the retrieved documents as context.
	     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
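
The retrieval step can be illustrated with a small top-k ranking over precomputed embeddings. This sketch assumes cosine similarity as the similarity measure (the README only says "semantic similarity") and takes plain lists of floats, so it runs without the embedding model; in the real pipeline the vectors would come from `all-MiniLM-L6-v2`. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (document name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, `top_k([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, k=2)` ranks `a` first and `c` second, since `b` is orthogonal to the query.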

	---

	### 5. Semantic Response Generation
	- Purpose: Provide a concise and meaningful answer.
	- Steps:
	  1. Combine the retrieved documents into a coherent context.
	  2. Generate a response tailored to the query using the generative model.
	  3. Return the response to the user, ensuring clarity and relevance.
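
One possible way to combine the retrieved documents into a context and hand it to the generative model is sketched below. The prompt layout, the `max_chars` budget, and the names `build_prompt` and `answer` are assumptions for illustration; the generator is injected as a plain callable so the sketch does not depend on a particular model.

```python
from typing import Callable, List


def build_prompt(query: str, documents: List[str], max_chars: int = 1000) -> str:
    """Join retrieved documents into a context block, capped at max_chars characters."""
    parts: List[str] = []
    used = 0
    for doc in documents:
        remaining = max_chars - used
        if remaining <= 0:
            break
        snippet = doc.strip()[:remaining]  # truncate the last document to fit the budget
        if snippet:
            parts.append(snippet)
            used += len(snippet)
    context = "\n".join(parts)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """Build the prompt, call the generator, and return only the generated continuation."""
    prompt = build_prompt(query, documents)
    completion = generate(prompt)
    # Causal models such as distilgpt2 typically echo the prompt; drop it if present.
    if completion.startswith(prompt):
        completion = completion[len(prompt):]
    return completion.strip()
```

Capping the context matters for small models like `distilgpt2`, which have a limited input window; a character budget is a crude stand-in for a token budget.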

	---

	### 6. Logging and Storage
	- Blockchain Logging:
	  - Each stage (original query, transformed query, document retrieval, and response) is logged to the blockchain for traceability.
	  - Provides tamper-evident records that protect data integrity.
	- Neo4j Storage:
	  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
	  - Enables detailed analysis and graph-based visualization.
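
To show why chained logging makes tampering detectable, here is a minimal hash-chain sketch. It is not the `chainguard.blockchain_logger` API, whose interface is not documented here; the `ChainLog` class, its methods, and the block fields are hypothetical. Each block stores the hash of the previous block, so changing any earlier record invalidates every hash after it.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def _hash_block(index: int, stage: str, payload: str, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps(
        {"index": index, "stage": stage, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ChainLog:
    """Append-only log in which every block commits to the one before it."""

    def __init__(self) -> None:
        self.blocks: List[Dict[str, object]] = []

    def log(self, stage: str, payload: str) -> Dict[str, object]:
        index = len(self.blocks)
        prev_hash = str(self.blocks[-1]["hash"]) if self.blocks else GENESIS_HASH
        block = {
            "index": index,
            "stage": stage,
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": _hash_block(index, stage, payload, prev_hash),
        }
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every hash and check that each block links to its predecessor."""
        prev_hash = GENESIS_HASH
        for i, block in enumerate(self.blocks):
            expected = _hash_block(i, str(block["stage"]), str(block["payload"]), prev_hash)
            if block["index"] != i or block["prev_hash"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = str(block["hash"])
        return True
```

A typical run would call `log("query", ...)`, `log("transformed_query", ...)`, `log("retrieval", ...)`, and `log("response", ...)` in order, then `is_valid()` to confirm that nothing was altered.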

	---

	## Neo4j Visualization

	Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:

	![Neo4j Visualization](../../screenshots/Screenshot_from_2024-11-30_19-01-31.png)

	- Nodes:
	  - `Query`: Represents the user query.
	  - `TransformedQuery`: Rephrased or improved query.
	  - `Document`: Relevant documents retrieved based on the query.
	  - `Response`: The generated response.

	- Relationships:
	  - `RETRIEVED`: Links the query to retrieved documents.
	  - `TRANSFORMED_TO`: Links the original query to the transformed query.
	  - `GENERATED`: Links the query to the generated response.
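
The graph model above could be written to Neo4j with parameterized `MERGE` statements along these lines. This is a sketch under assumptions: the helper `graph_statements` is hypothetical, and while `Query.text` and `Document.name` match the properties used in this README's example queries, the `text` property on `TransformedQuery` and `Response` is a guess. The function only builds `(cypher, parameters)` pairs, so it can be run and inspected without a database.

```python
from typing import Dict, List, Tuple

Statement = Tuple[str, Dict[str, object]]


def graph_statements(
    query: str,
    transformed: str,
    documents: List[str],
    response: str,
) -> List[Statement]:
    """Build Cypher statements that record one processed query in the graph."""
    statements: List[Statement] = [
        (
            "MERGE (q:Query {text: $query}) "
            "MERGE (t:TransformedQuery {text: $transformed}) "
            "MERGE (q)-[:TRANSFORMED_TO]->(t)",
            {"query": query, "transformed": transformed},
        ),
        (
            "MERGE (q:Query {text: $query}) "
            "MERGE (r:Response {text: $response}) "
            "MERGE (q)-[:GENERATED]->(r)",
            {"query": query, "response": response},
        ),
    ]
    # One RETRIEVED relationship per retrieved document.
    for name in documents:
        statements.append(
            (
                "MERGE (q:Query {text: $query}) "
                "MERGE (d:Document {name: $name}) "
                "MERGE (q)-[:RETRIEVED]->(d)",
                {"query": query, "name": name},
            )
        )
    return statements
```

With the official `neo4j` Python driver, each pair could be executed as `session.run(cypher, params)`. Using `MERGE` with parameters avoids duplicate nodes when the same query or document appears again and keeps user text out of the Cypher string itself.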

	---

	## Setup Instructions
	1. Clone the repository:
	```bash
	git clone https://github.com/your-repo/document-search-system.git
	```

	2. Install dependencies:
	```bash
	pip install -r requirements.txt
	```
	3. Initialize the Neo4j database:
	   - Connect to your Neo4j Aura instance.
	   - Set up credentials in the code.
	4. Load the dataset:
	   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
	5. Run the system:
	```bash
	python document_search_system.py
	```
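
For the credentials step, one option is to read the Neo4j connection settings from environment variables rather than hardcoding them. The sketch below is an assumption about how this could be done, not how the project does it: the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper `load_neo4j_config` are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; adjust to match your deployment.
REQUIRED = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def load_neo4j_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read Neo4j connection settings, failing early if any are missing or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED if not source.get(name)]
    if missing:
        raise RuntimeError("Missing Neo4j settings: " + ", ".join(missing))
    return {
        "uri": source["NEO4J_URI"],
        "user": source["NEO4J_USERNAME"],
        "password": source["NEO4J_PASSWORD"],
    }
```

The returned values could then be passed to the driver, for example `GraphDatabase.driver(cfg["uri"], auth=(cfg["user"], cfg["password"]))`. Aura connection URIs normally use the `neo4j+s://` scheme.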

	---

	## Neo4j Queries

	### Retrieve All Queries Logged
	```cypher
	MATCH (q:Query)
	RETURN q.text AS query, q.timestamp AS timestamp
	ORDER BY timestamp DESC
	```

	### Visualize Query Relationships
	```cypher
	MATCH (n)-[r]->(m)
	RETURN n, r, m
	```

	### Find Documents for a Query
	```cypher
	MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
	RETURN d.name AS document_name
	```

	---

	## Key Technologies
	- Machine Learning Models:
	  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
	  - `google/flan-t5-small` for query transformation.
	  - `distilgpt2` for response generation.
	- Vector Similarity Search:
	  - `all-MiniLM-L6-v2` embeddings for document retrieval.
	- Blockchain Logging:
	  - Powered by `chainguard.blockchain_logger`.
	- Graph-Based Storage:
	  - Relationships visualized and queried via Neo4j.