pipwnz

AI & ML interests

None yet

Organizations

None yet

pipwnz's activity

liked a model 12 days ago

deepvk/RuModernBERT-small

Fill-Mask • Updated 15 days ago • 5.87k • 12

liked a model 13 days ago

GoidaAlignment/KremlinAI-2

Text Generation • Updated Nov 19, 2024 • 88 • 7

liked 4 models about 2 months ago

DavidAU/Llama-3.2-8X3B-MOE-Dark-Champion-Instruct-uncensored-abliterated-18.4B-GGUF

Text Generation • Updated 23 days ago • 123k • 150

replied to singhsidhukuldeep's post 3 months ago

Creates dense visual embeddings for each page while maintaining visual information integrity.

I'm sorry, but where can I read about dense visual embeddings? In this article I only found a description of the ColPali strategy (similar to sparse).

reacted to singhsidhukuldeep's post with 🤗 3 months ago

Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss
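
To make the MaxSim step concrete, here is a minimal sketch of late-interaction scoring. It assumes each query is a matrix of token embeddings and each page is a matrix of patch embeddings in the same space. The function names and toy vectors are illustrative and are not taken from M3DocRAG. It ranks pages by brute force rather than through a Faiss IVFFlat index.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching page vector
    (by dot product), then sum those maxima over all query tokens."""
    sim = query_emb @ page_emb.T          # shape: (n_query_tokens, n_page_vectors)
    return float(sim.max(axis=1).sum())   # max over page vectors, sum over tokens


def rank_pages(query_emb: np.ndarray, pages: list, top_k: int = 3) -> list:
    """Brute-force ranking: score every page and return the top_k
    (page_index, score) pairs, highest score first."""
    scores = [maxsim_score(query_emb, p) for p in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy data (hypothetical): 2 query tokens, 3 pages, embedding dimension 2.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # matches both tokens
page_b = np.array([[1.0, 0.0], [0.9, 0.1]])              # matches first token only
page_c = np.array([[0.0, 0.0]])                          # matches nothing

ranking = rank_pages(query, [page_b, page_a, page_c], top_k=2)
print(ranking)  # page_a (index 1) ranks first, then page_b (index 0)
```

Exhaustive scoring like this touches every page for every query. An approximate index such as the IVFFlat one mentioned above trades a little recall for speed by searching only a subset of the stored vectors.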

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on the MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes: finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.