Pi's picture
2 4

Pi

pipwnz
·

AI & ML interests

None yet

Recent Activity

replied to singhsidhukuldeep's post about 1 month ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding. The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with: - Process 40,000+ pages across 3,000+ documents - Answer questions requiring information from multiple pages - Understand visual elements like charts, tables, and figures - Support both closed-domain (single document) and open-domain (multiple documents) queries Under the hood, M3DocRAG operates through three sophisticated stages: >> Document Embedding: - Converts PDF pages to RGB images - Uses ColPali to project both text queries and page images into a shared embedding space - Creates dense visual embeddings for each page while maintaining visual information integrity >> Page Retrieval: - Employs MaxSim scoring to compute relevance between queries and pages - Implements inverted file indexing (IVFFlat) for efficient search - Reduces retrieval latency from 20s to under 2s when searching 40K+ pages - Supports approximate nearest neighbor search via Faiss >> Question Answering: - Leverages Qwen2-VL 7B as the multi-modal language model - Processes retrieved pages through a visual encoder - Generates answers considering both textual and visual context The results are impressive: - State-of-the-art performance on MP-DocVQA benchmark - Superior handling of non-text evidence compared to text-only systems - Significantly better performance on multi-hop reasoning tasks This is a game-changer for industries dealing with large document volumes—finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
reacted to singhsidhukuldeep's post with 🤗 about 1 month ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding. The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with: - Process 40,000+ pages across 3,000+ documents - Answer questions requiring information from multiple pages - Understand visual elements like charts, tables, and figures - Support both closed-domain (single document) and open-domain (multiple documents) queries Under the hood, M3DocRAG operates through three sophisticated stages: >> Document Embedding: - Converts PDF pages to RGB images - Uses ColPali to project both text queries and page images into a shared embedding space - Creates dense visual embeddings for each page while maintaining visual information integrity >> Page Retrieval: - Employs MaxSim scoring to compute relevance between queries and pages - Implements inverted file indexing (IVFFlat) for efficient search - Reduces retrieval latency from 20s to under 2s when searching 40K+ pages - Supports approximate nearest neighbor search via Faiss >> Question Answering: - Leverages Qwen2-VL 7B as the multi-modal language model - Processes retrieved pages through a visual encoder - Generates answers considering both textual and visual context The results are impressive: - State-of-the-art performance on MP-DocVQA benchmark - Superior handling of non-text evidence compared to text-only systems - Significantly better performance on multi-hop reasoning tasks This is a game-changer for industries dealing with large document volumes—finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
View all activity

Organizations

None yet

pipwnz's activity

replied to singhsidhukuldeep's post about 1 month ago
view reply

Creates dense visual embeddings for each page while maintaining visual information integrity.

I'm sorry, but where can I read about dense visual embeddings? In this article I found only about the colpali strategy (similar to sparse)

reacted to singhsidhukuldeep's post with 🤗 about 1 month ago
view post
Post
1308
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes—finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
·
upvoted an article about 1 month ago
view article
Article

ColPali: Efficient Document Retrieval with Vision Language Models 👀

By manu
183