---
title: Chagu Demo
emoji: 📊
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
---

# **AI-Powered Document Search with Malicious Query Detection**

This project implements a semantic search engine for documents using **AI-based malicious query detection**. It allows users to search through movie reviews (IMDB dataset) and additional `.txt` files, while also identifying and blocking potential malicious queries using a pre-trained NLP model.

## **Features**

- **Semantic Search**: Uses fuzzy matching for normal queries, allowing context-aware searches.
- **AI-Based Malicious Query Detection**: Utilizes a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
- **Efficient Path Handling**: Automatically handles dataset paths using the `HOME` environment variable.

## **Technologies Used**

- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.

## **Project Structure**

    ├── rag_chagu_demo.py       # Main script containing the DocumentSearcher class
    ├── README.md               # This file
    ├── data-sets/              # This part shifted to $HOME
    │   ├── aclImdb/
    │   │   ├── train/
    │   │   │   ├── pos/        # Positive movie reviews
    │   │   │   └── neg/        # Negative movie reviews
    │   └── txt-files/          # Additional .txt files for document search

## **Installation**

Make sure you have Python installed (version 3.8 or higher).
Then, install the required dependencies:

```bash
pip install transformers
```

## Dataset Setup

Place the IMDB dataset in the following structure:

```bash
$HOME/data-sets/aclImdb/train/pos/
$HOME/data-sets/aclImdb/train/neg/
```

Optionally, place additional `.txt` files under:

```bash
$HOME/data-sets/txt-files/
```

## Usage

Run the script with the following command:

```bash
python rag_chagu_demo.py
```

## Example Output

```
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.
Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
Malicious Query Detected - Confidence: 0.95
Malicious Query Results:
Document: ANOMALY: Query blocked due to detected malicious intent.
```

## How It Works

1. The script initializes the `DocumentSearcher` class, which loads the movie reviews and any additional `.txt` documents.
2. The `is_query_malicious()` method runs each query through a pre-trained sentiment analysis model and uses the result as a signal of potential malicious intent.
3. If a query is flagged as malicious, it is blocked and an anomaly message is returned.
4. For normal queries, the script performs a fuzzy search through the documents and returns the most relevant matches.

## AI Model Used

The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.

## Why Use AI for Malicious Query Detection?

Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. A pre-trained NLP model draws on a semantic understanding of the text, which allows the system to detect a wider range of harmful queries.

## Improvements and Future Work

- **Custom Fine-Tuning**: The current model uses a pre-trained sentiment analysis model.
In future versions, a custom model fine-tuned on a dataset of malicious queries could provide even better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up the document retrieval process.
- **Real-Time Query Monitoring**: Adding a real-time monitoring system to detect and log malicious queries for further analysis.

## Contributing

Feel free to fork this repository and submit pull requests. Contributions are welcome!

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

For any questions or issues, please contact the project maintainer:

- Name: Talex Maxim
- Email: taimax13@gmail.com
- GitHub: taimax13
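
## Appendix: Sketch of the Query Flow

The flow described under How It Works (classify the query, block it if flagged, otherwise fuzzy-search the documents) can be illustrated with a minimal sketch. This is not the code in `rag_chagu_demo.py`. The function names other than `is_query_malicious`, the 0.9 threshold, the rule that a confident `NEGATIVE` label counts as malicious, and the use of `difflib` for fuzzy matching are all assumptions made for illustration. Only the model name and the anomaly message come from this README.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Union

# Message text taken from the example output in this README.
ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."

# A classifier maps a query to a list such as [{"label": "NEGATIVE", "score": 0.95}],
# the shape returned by a Hugging Face sentiment-analysis pipeline for one string.
Classifier = Callable[[str], List[Dict[str, Union[str, float]]]]


def load_sentiment_classifier() -> Classifier:
    """Load the sentiment model named in this README (downloads weights on first use)."""
    # Lazy import: needs `pip install transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def is_query_malicious(query: str, classifier: Classifier, threshold: float = 0.9) -> bool:
    """Assumption: a NEGATIVE label with score >= threshold is treated as malicious."""
    result = classifier(query)[0]
    return result["label"] == "NEGATIVE" and float(result["score"]) >= threshold


def fuzzy_search(query: str, documents: List[str], top_k: int = 3) -> List[str]:
    """Rank documents by difflib similarity to the query (hypothetical matcher)."""
    scored = [
        (SequenceMatcher(None, query.lower(), doc.lower()).ratio(), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def search(query: str, documents: List[str], classifier: Classifier) -> List[str]:
    """Block flagged queries; otherwise return the closest documents."""
    if is_query_malicious(query, classifier):
        return [ANOMALY_MESSAGE]
    return fuzzy_search(query, documents)
```

Calling `search(query, documents, load_sentiment_classifier())` would run the sketch against the real model. Because the classifier is passed in as an argument, a stub function that returns a fixed label and score can stand in for it when testing offline.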