---
title: AskMyPDF
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
---

# AskMyPDF

A comprehensive solution for querying your PDFs using modern LLM-based techniques. This tool extracts PDF contents and embeds them into a vector store, enabling natural language queries and context-rich answers. By leveraging LangChain, FAISS, and Hugging Face embeddings, it provides fast, flexible semantic search over document chunks.

## Features

- **PDF Parsing & Splitting:**
  Automatically load PDF content and split it into chunks sized for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Use `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, stored in a FAISS index for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrate few-shot examples to guide the model toward a specific output format, returning answers as structured JSON.

- **Chain Orchestration with LangChain:**
  Utilize LangChain’s `LLMChain` and prompt templates for controlled, reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation ensure input fits within model token limits, avoiding errors.

## Installation

This project requires **Python 3.11**. We recommend using a virtual environment to keep dependencies isolated.

1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/AskMyPDF.git
   cd AskMyPDF
   ```

2. **Set Up a Python 3.11 Environment** (optional but recommended)

   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

## Usage

Launch the Gradio app:

```bash
gradio app.py
```

## Output

The system will:

- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store them in a FAISS index.
- Retrieve the top chunks relevant to your query.
- Use the language model to produce a final JSON-structured answer.
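
As a rough, end-to-end illustration of these steps (not the literal contents of `app.py`; the file name, chunk sizes, and query are placeholders), the flow could be wired up with LangChain's community integrations like so:

```python
# Minimal sketch of the pipeline described above; assumes the
# langchain-community and langchain-text-splitters packages are installed.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Parse the PDF and split it into chunks ("example.pdf" is a placeholder).
docs = PyPDFLoader("example.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Embed the chunks and store them in a FAISS index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the chunks most relevant to a query.
relevant = store.similarity_search("What is the main conclusion?", k=4)
```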

## Implementation Details

- **Token-Based Splitting:**
  We tokenize the PDF text using Hugging Face’s `AutoTokenizer` for the all-MiniLM model. By maintaining a `chunk_size` and `chunk_overlap`, and truncating at the embedding stage, we ensure the embedding model’s maximum token length is respected (see the first sketch below).

- **Vector Store & Retrieval:**
  With FAISS indexing, similarity search is fast and scalable. Queries are answered by referencing only the relevant chunks, ensuring context-aware responses.

- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer. This guides the LLM to produce consistent, machine-readable output (see the second sketch below).

- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})` (also shown in the second sketch below). This approach is more flexible and allows passing parameters in a structured manner if needed later.
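
The token-based splitting could look like the following hypothetical sketch (the function name and chunk parameters are illustrative, not the app's actual code):

```python
# Token-safe splitting: chunk by token IDs so no chunk exceeds the
# embedding model's maximum input length. Names and defaults are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def split_by_tokens(text: str, chunk_size: int = 256, chunk_overlap: int = 32) -> list[str]:
    """Split text into chunks of at most chunk_size tokens, overlapping by chunk_overlap."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - chunk_overlap
    return [tokenizer.decode(ids[i:i + chunk_size]) for i in range(0, len(ids), step)]
```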
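
And the few-shot prompt plus structured invocation might look like this sketch (the template wording is illustrative, and `llm` and `context_text` are assumed to be defined elsewhere):

```python
# Few-shot prompt that steers the model toward JSON output, executed with
# chain.invoke() rather than chain.run(). Template text is illustrative.
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """Answer the question using only the context. Respond in JSON.

Example:
Context: The report was published in 2021.
Question: When was the report published?
Answer: {{"answer": "2021"}}

Context: {context}
Question: {question}
Answer:"""

prompt = PromptTemplate(input_variables=["context", "question"], template=template)
chain = LLMChain(llm=llm, prompt=prompt)  # `llm` is the configured language model

# invoke() takes a dict of inputs and returns a dict of outputs.
result = chain.invoke({"context": context_text, "question": "When was the report published?"})
print(result["text"])
```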

## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata, so queries can target specific documents or sections.
- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs like LLaMA or Falcon.
- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding (see the persistence sketch after this list).
  - Cache embeddings and query results to speed up repeated queries.
- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, chain-of-thought prompting, system messages, and instructions to improve answer quality and formatting.
- **Chunking Strategies:**
  - Implement advanced chunking strategies:
    - Use semantic chunking to divide text based on meaning or coherence rather than fixed sizes.
    - Include options for overlapping chunks to improve retrieval precision.
    - Integrate hierarchical chunking to preserve context across sections (e.g., chapters, headings, subheadings).
- **Improved Retrieval Techniques:**
  - Leverage Approximate Nearest Neighbor (ANN) algorithms to accelerate similarity search.
  - Integrate with advanced vector databases (e.g., Pinecone, Weaviate, Milvus) for efficient and scalable retrieval.
  - Use hybrid retrieval, combining vector similarity with traditional keyword-based retrieval for better query coverage.
- **Cross-Encoder Reranker:**
  - Introduce a cross-encoder reranker to improve the quality of retrieved results (see the reranking sketch after this list):
    - Apply a fine-tuned cross-encoder model to rerank top candidates from the initial vector search.
    - Use a pre-trained or task-specific cross-encoder (e.g., Hugging Face models such as `cross-encoder/ms-marco-TinyBERT-L-6`).
    - Improve relevance by jointly encoding the query and candidate passages, allowing contextual alignment and a more accurate similarity score.
    - Dynamically adjust the balance between retrieval speed and reranking quality by tuning the number of top candidates to rerank.
- **Graph-Based Retrieval Augmentation:**
  - Adopt GraphRAG approaches:
    - Represent documents and queries as nodes in a graph for relational context.
    - Use graph-based algorithms to enhance retrieval by modeling relationships (e.g., citations, semantic links).
    - Introduce parent-document retrievers that prioritize and rank content based on its originating document or source reliability.
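
As an example of the persistence idea, a FAISS index built with LangChain can be saved and reloaded roughly like this (the directory name is a placeholder, and `store`/`embeddings` are assumed from the pipeline sketch above):

```python
# Persist the FAISS index to disk, then reload it without re-embedding.
store.save_local("faiss_index")  # directory name is a placeholder

reloaded = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,  # required by recent LangChain versions
)
```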
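
The reranking idea could take roughly this shape with the sentence-transformers `CrossEncoder` class (the query and candidate counts are placeholders; `store` is the FAISS index from the pipeline sketch):

```python
# Rerank top-k vector-search candidates with a cross-encoder that jointly
# scores (query, passage) pairs for finer-grained relevance.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6")

query = "What is the main conclusion?"  # placeholder query
candidates = store.similarity_search(query, k=20)  # cast a wide net first

scores = reranker.predict([(query, doc.page_content) for doc in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_docs = [doc for _, doc in ranked[:4]]  # keep only the best few for the LLM
```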

With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.