title: AskMyPDF
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false

AskMyPDF

AskMyPDF lets you query your PDFs in natural language. It extracts the PDF contents, splits them into chunks, and embeds the chunks into a vector store, so that each question is answered from the passages most relevant to it. It is built on LangChain, FAISS, and HuggingFace embeddings, which together provide fast semantic search over document chunks.

Features

  • PDF Parsing & Splitting:
    Automatically load PDF content and break it down into chunks suitable for all-MiniLM embeddings.

  • Semantic Embeddings & Vector Store:
    Use sentence-transformers/all-MiniLM-L6-v2 embeddings to represent text as vectors.
    FAISS vector storage for efficient similarity search.

  • Few-Shot Prompting & Structured Answers:
    Integrate few-shot examples to guide the model towards a specific output format.
    Return answers in a structured JSON format.

  • Chain Orchestration with LangChain:
    Utilize LangChain’s LLMChain and prompt templates for controlled and reproducible queries.

  • Token-Safe Implementation:
    Custom token splitting and truncation ensure input fits within model token limits, avoiding errors.
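
The token-safe splitting described above can be illustrated with a small sketch. This is not the project's code: `split_tokens` and `truncate` are hypothetical helpers, and a whitespace split stands in for the real tokenizer so the example runs without extra dependencies.

```python
def split_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def truncate(tokens: list[str], max_tokens: int) -> list[str]:
    """Hard cap applied before embedding, as a second line of defence."""
    return tokens[:max_tokens]


# Assumption: whitespace tokens stand in for the model's real tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(truncate(chunk, max_tokens=4)))
```

Each chunk repeats the last token of the previous one, so a sentence cut at a chunk boundary still appears with some of its context in the next chunk.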

Installation

This project requires Python 3.11. We recommend using a virtual environment to keep dependencies isolated.

  1. Clone the Repository

    git clone https://huggingface.co/spaces/agoyal496/AskMyPDF
    cd AskMyPDF
    
  2. Set up a Python 3.11 environment (optional but recommended)

    python3.11 -m venv venv
    source venv/bin/activate
    
  3. Install Dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
    
  4. Usage

    gradio app.py

Output

The system will:

  • Parse and split the PDF into token-limited chunks.
  • Embed the chunks using all-MiniLM embeddings.
  • Store them in FAISS.
  • Retrieve the top chunks relevant to your query.
  • Use the language model to produce a final JSON-structured answer.
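
The retrieval step in the list above amounts to ranking chunk vectors by similarity to the query vector and keeping the best few. The sketch below shows that idea in plain Python with made-up three-dimensional vectors. The names `cosine` and `top_k` are illustrative, and cosine similarity is an assumption chosen for simplicity; the metric used by the actual FAISS index may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings: real all-MiniLM vectors have far more dimensions.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]
best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

A vector index such as FAISS performs the same ranking without comparing the query against every stored vector one at a time in Python, which is what keeps search fast on large documents.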

Implementation Details

  • Token-Based Splitting: We tokenize the PDF text using Hugging Face’s AutoTokenizer for the all-MiniLM model. Chunks are built from tokens according to chunk_size and chunk_overlap, and truncation is applied again at the embedding stage, so the embedding model’s maximum token length is always respected.
  • Vector Store & Retrieval: With FAISS indexing, similarity search is fast and scalable. Queries are answered by referencing only relevant chunks, ensuring context-aware responses.
  • Few-Shot Prompting: The prompt includes a few-shot example, demonstrating how the model should respond with a JSON-formatted answer. This guides the LLM to produce consistent and machine-readable output.
  • Chain Invocation: The chain is called with chain.invoke({}) rather than chain.run(). invoke takes its inputs as a dictionary, which makes it straightforward to pass additional named parameters later. chain.run() is also deprecated in recent LangChain releases.
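
The few-shot prompting idea above can be sketched as follows. The example question, the JSON keys `answer` and `found`, and the helpers `build_prompt` and `parse_answer` are all hypothetical; the project's real prompt and output schema are not shown in this README.

```python
import json

# Hypothetical few-shot example; not taken from the project's prompt.
FEW_SHOT_EXAMPLE = {
    "context": "Invoice 1001. Total due: $120.",
    "question": "What is the invoice total?",
    "answer": {"answer": "$120", "found": True},
}


def build_prompt(question: str, context: str) -> str:
    """Show the model one solved example, then the real question in the same layout."""
    example = FEW_SHOT_EXAMPLE
    return (
        "Answer the question using only the context. "
        'Reply with a single JSON object with the keys "answer" and "found".\n\n'
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"JSON: {json.dumps(example['answer'])}\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "JSON:"
    )


def parse_answer(raw: str) -> dict:
    """Parse the outermost {...} span of the model output; fall back to a not-found record."""
    fallback = {"answer": None, "found": False}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback


prompt = build_prompt("Who wrote the report?", "The report was written by Ada.")
print(prompt)
print(parse_answer('Sure, here it is: {"answer": "Ada", "found": true}'))
```

Parsing defensively is worthwhile because a model can wrap the JSON in extra text even when the few-shot example shows a bare object.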

Improvements

  • Multi-File Support:
    • Extend the script to handle multiple PDFs at once.
    • Aggregate or differentiate embeddings by metadata, ensuring queries can target specific documents or sections.
  • Model Agnosticism:
    • Easily switch embeddings or language models.
    • Try different Sentence Transformers models or local LLMs like LLaMA or Falcon.
  • Caching & Persistence:
    • Store FAISS indexes on disk for instant reloads without re-embedding.
    • Implement caching of embeddings and query results to speed up repeated queries.
  • Advanced Prompt Engineering:
    • Experiment with different few-shot examples, chain-of-thought prompting, system messages, and instructions to improve answer quality and formatting.
  • Chunking Strategies:
    • Implement advanced chunking strategies:
      • Use semantic chunking to divide text based on meaning or coherence rather than fixed sizes.
      • Include options for overlapping chunks to improve retrieval precision.
      • Integrate hierarchical chunking to preserve context across sections (e.g., chapters, headings, subheadings).
  • Improved Retrieval Techniques:
    • Leverage Approximate Nearest Neighbor (ANN) algorithms to accelerate similarity search.
    • Integrate with advanced vector databases (e.g., Pinecone, Weaviate, Milvus) for efficient and scalable retrieval.
    • Use hybrid retrieval models, combining vector similarity with traditional keyword-based retrieval for better query coverage.
  • Cross-Encoder Reranker:
    • Introduce a cross-encoder reranker to improve the quality of retrieved results:
      • Apply a fine-tuned cross-encoder model to rerank top candidates from the initial vector search.
      • Use a pre-trained or task-specific cross-encoder (e.g., models from Hugging Face like cross-encoder/ms-marco-TinyBERT-L-6).
      • Improve relevance by jointly encoding the query and candidate passages, allowing contextual alignment and a more accurate similarity score.
      • Dynamically adjust the balance between retrieval speed and reranking quality by tuning the number of top candidates to rerank.
  • Graph-Based Retrieval Augmentation:
    • Adopt GraphRAG approaches:
      • Represent documents and queries as nodes in a graph for relational context.
      • Use graph-based algorithms to enhance retrieval by modeling relationships (e.g., citations, semantic links).
    • Introduce parent document retrievers that prioritize and rank content based on its originating document or source reliability.
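
Of the improvements listed above, the retrieve-then-rerank idea is easy to sketch. In the example below, `rerank` rescores only the first `rerank_top_n` candidates from the initial search, which is the knob that trades speed against quality. The scorer `overlap_score` is a deliberately crude word-overlap stand-in, not a cross-encoder; in a real setup it would be replaced by a model that scores each query and passage pair jointly. All names here are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    rerank_top_n: int,
    final_k: int,
) -> list[str]:
    """Rescore the first rerank_top_n candidates and return the best final_k."""
    head = candidates[:rerank_top_n]
    rescored = sorted(head, key=lambda passage: score_pair(query, passage), reverse=True)
    return rescored[:final_k]


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


# Candidates in the order an initial vector search might return them.
candidates = [
    "Warranty claims must include the original receipt.",
    "Shipping takes five business days.",
    "The warranty covers parts for two years.",
    "Contact support by email.",
]
query = "warranty parts years"

wide = rerank(query, candidates, overlap_score, rerank_top_n=3, final_k=1)
narrow = rerank(query, candidates, overlap_score, rerank_top_n=2, final_k=1)
print(wide)
print(narrow)
```

With `rerank_top_n=3` the third candidate is promoted to first place; with `rerank_top_n=2` it is never rescored, which shows how a small rerank window can miss a better passage in exchange for fewer scorer calls.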

With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.