metadata

title: Chagu Demo
emoji: 📊
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '

AI-Powered Document Search with Malicious Query Detection

This project implements a document search engine with AI-based malicious query detection. Users can search through movie reviews (IMDB dataset) and additional .txt files, while a pre-trained NLP model identifies and blocks potentially malicious queries.

Features

Semantic Search: Uses fuzzy matching for normal queries, allowing context-aware searches.
AI-Based Malicious Query Detection: Utilizes a pre-trained NLP model (DistilBERT) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
Flexible Document Ingestion: Supports loading documents from the IMDB dataset and additional .txt files.
Efficient Path Handling: Automatically handles dataset paths using the HOME environment variable.
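
The fuzzy matching mentioned under Semantic Search could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the README does not say which fuzzy-matching library is used, so the sketch assumes Python's standard difflib, and the names fuzzy_score and fuzzy_search are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Sequence


def _words(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def fuzzy_score(query: str, document: str) -> float:
    """Average, over query words, of the best similarity to any document word.

    Hypothetical helper: tolerates typos because each query word only has to
    be *close* to some word in the document, not identical to it.
    """
    query_words = _words(query)
    doc_words = set(_words(document))
    if not query_words or not doc_words:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, w).ratio() for w in doc_words)
        for q in query_words
    ]
    return sum(best) / len(best)


def fuzzy_search(query: str, documents: Sequence[str], top_k: int = 3) -> List[str]:
    """Return the top_k documents ranked by fuzzy_score, best first."""
    ranked = sorted(documents, key=lambda doc: fuzzy_score(query, doc), reverse=True)
    return list(ranked[:top_k])


docs = [
    "This movie had great acting and a compelling storyline.",
    "The plot was dull and the pacing was slow.",
    "A documentary about ocean wildlife.",
]
# A misspelled query still ranks the review about acting first.
top_hit = fuzzy_search("grate actng", docs, top_k=1)[0]
```

Word-level comparison is quadratic in the number of words, so it suits a demo-sized corpus; a larger collection would likely need an index.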

Technologies Used

Python 3.8+
Transformers: For NLP-based malicious query detection.
Hugging Face Pipeline: Uses the distilbert-base-uncased-finetuned-sst-2-english model for sentiment analysis.
Pathlib: For robust file and path handling.

Project Structure

├── rag_chagu_demo.py    # Main script containing the DocumentSearcher class
├── README.md            # This file
├── data-sets/           # This part has been moved to $HOME
│   ├── aclImdb/
│   │   ├── train/
│   │   │   ├── pos/     # Positive movie reviews
│   │   │   └── neg/     # Negative movie reviews
│   └── txt-files/       # Additional .txt files for document search
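
As a rough illustration of how the data-sets layout above might be resolved relative to the HOME environment variable, here is a small pathlib sketch. The function names dataset_paths and load_text_files are hypothetical, and the assumption that review files end in .txt is not stated in this README.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional, Union


def dataset_paths(home: Optional[Union[str, Path]] = None) -> Dict[str, Path]:
    """Build the dataset directories under $HOME (or an explicit base path)."""
    if home is None:
        # Fall back to Path.home() when HOME is not set in the environment.
        home = os.environ.get("HOME", str(Path.home()))
    root = Path(home) / "data-sets"
    return {
        "pos": root / "aclImdb" / "train" / "pos",
        "neg": root / "aclImdb" / "train" / "neg",
        "txt": root / "txt-files",
    }


def load_text_files(directory: Union[str, Path], limit: Optional[int] = None) -> List[str]:
    """Read up to `limit` .txt files from a directory; missing directory -> []."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    files = sorted(directory.glob("*.txt"))
    if limit is not None:
        files = files[:limit]
    return [f.read_text(encoding="utf-8", errors="ignore") for f in files]


paths = dataset_paths("/home/user")
print("Looking for positive reviews in:", paths["pos"])
print("Looking for negative reviews in:", paths["neg"])
```

Returning an empty list for a missing directory keeps the optional txt-files folder genuinely optional.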

Installation

Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:

pip install transformers

Dataset Setup

Place the IMDB dataset in the following structure:

$HOME/data-sets/aclImdb/train/pos/
$HOME/data-sets/aclImdb/train/neg/

Optionally, place additional .txt files under:

$HOME/data-sets/txt-files/

Usage

Run the script with the following command:

python rag_chagu_demo.py

Example Output


Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.

Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...

Malicious Query Detected - Confidence: 0.95
Malicious Query Results:

Document: ANOMALY: Query blocked due to detected malicious intent.

How It Works

The script initializes the DocumentSearcher class, which loads movie reviews and additional .txt documents.
The is_query_malicious() method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
If a query is flagged as malicious, it is blocked and an anomaly message is returned.
For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.

AI Model Used

The project uses the DistilBERT model (distilbert-base-uncased-finetuned-sst-2-english) from Hugging Face for detecting malicious queries based on sentiment analysis.
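
The detect-then-search flow described under How It Works can be sketched as below. This is an illustration under stated assumptions rather than the project's implementation: the README does not say how a sentiment result is mapped to "malicious", so the sketch assumes a NEGATIVE label at or above a hypothetical 0.9 confidence threshold. The classifier is passed in as a callable so the example runs without downloading a model; stub_classifier and substring_matcher are stand-ins invented for the demo.

```python
from typing import Callable, Dict, List, Sequence

# Shape of a Hugging Face sentiment pipeline result: [{"label": ..., "score": ...}]
Classifier = Callable[[str], List[Dict[str, object]]]
Matcher = Callable[[str, Sequence[str]], List[str]]

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def is_query_malicious(query: str, classifier: Classifier, threshold: float = 0.9) -> bool:
    """Assumed rule: flag the query when sentiment is NEGATIVE with high confidence."""
    result = classifier(query)[0]
    return result["label"] == "NEGATIVE" and float(result["score"]) >= threshold


def search(query: str, documents: Sequence[str],
           classifier: Classifier, matcher: Matcher) -> List[str]:
    """Block flagged queries; otherwise delegate to the document matcher."""
    if is_query_malicious(query, classifier):
        return [ANOMALY_MESSAGE]
    return matcher(query, documents)


# To use the real model instead of the stub (requires transformers plus a
# backend such as PyTorch, and downloads the model on first use):
#   from transformers import pipeline
#   classifier = pipeline(
#       "sentiment-analysis",
#       model="distilbert-base-uncased-finetuned-sst-2-english",
#   )

def stub_classifier(text: str) -> List[Dict[str, object]]:
    """Stand-in for the pipeline: treats 'drop table' as strongly negative."""
    if "drop table" in text.lower():
        return [{"label": "NEGATIVE", "score": 0.95}]
    return [{"label": "POSITIVE", "score": 0.99}]


def substring_matcher(query: str, documents: Sequence[str]) -> List[str]:
    """Stand-in for the fuzzy search: plain case-insensitive substring match."""
    return [doc for doc in documents if query.lower() in doc.lower()]


docs = ["This movie had great acting and a compelling storyline."]
blocked = search("DROP TABLE users;", docs, stub_classifier, substring_matcher)
allowed = search("great acting", docs, stub_classifier, substring_matcher)
```

Injecting the classifier also makes the threshold and labeling rule easy to test in isolation before swapping in the real pipeline.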

Why Use AI for Malicious Query Detection?

Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.

Improvements and Future Work

Custom Fine-Tuning: The current model uses a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide even better results.
Integration with Vector Search (FAISS): For larger datasets, integrating a vector search engine like FAISS could speed up the document retrieval process.
Real-Time Query Monitoring: Adding a real-time monitoring system to detect and log malicious queries for further analysis.

Contributing

Feel free to fork this repository and submit pull requests. Contributions are welcome!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any questions or issues, please contact the project maintainer:

Name: Talex Maxim
Email: taimax13@gmail.com
GitHub: taimax13