Rajdeep Ghosh's picture

1 2 18

Rajdeep Ghosh

rumbleFTW

·

AI & ML interests

Transformers, GANs, Audio synthesis, LLMs, Diffusion.

Recent Activity

Reacted to merve's post with ❤️ about 2 months ago

If you have documents that do not only have text and you're doing retrieval or RAG (using OCR and LLMs), give it up and give ColPali and vision language models a try 🤗 Why? Documents consist of multiple modalities: layout, table, text, chart, images. Document processing pipelines often consist of multiple models and they're immensely brittle and slow. 🥲 How? ColPali is a ColBERT-like document retrieval model built on PaliGemma, it operates over image patches directly, and indexing takes far less time with more accuracy. You can use it for retrieval, and if you want to do retrieval augmented generation, find the closest document, and do not process it, give it directly to a VLM like Qwen2-VL (as image input) and give your text query. 🤝 This is much faster + you do not lose out on any information + much easier to maintain too! 🥳 Multimodal RAG https://huggingface.co/collections/merve/multimodal-rag-66d97602e781122aae0a5139 💬 Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) https://huggingface.co/collections/merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e 📖

Reacted to merve's post with 👍 about 2 months ago

If you have documents that do not only have text and you're doing retrieval or RAG (using OCR and LLMs), give it up and give ColPali and vision language models a try 🤗 Why? Documents consist of multiple modalities: layout, table, text, chart, images. Document processing pipelines often consist of multiple models and they're immensely brittle and slow. 🥲 How? ColPali is a ColBERT-like document retrieval model built on PaliGemma, it operates over image patches directly, and indexing takes far less time with more accuracy. You can use it for retrieval, and if you want to do retrieval augmented generation, find the closest document, and do not process it, give it directly to a VLM like Qwen2-VL (as image input) and give your text query. 🤝 This is much faster + you do not lose out on any information + much easier to maintain too! 🥳 Multimodal RAG https://huggingface.co/collections/merve/multimodal-rag-66d97602e781122aae0a5139 💬 Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) https://huggingface.co/collections/merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e 📖

Reacted to merve's post with 🔥 about 2 months ago

If you have documents that do not only have text and you're doing retrieval or RAG (using OCR and LLMs), give it up and give ColPali and vision language models a try 🤗 Why? Documents consist of multiple modalities: layout, table, text, chart, images. Document processing pipelines often consist of multiple models and they're immensely brittle and slow. 🥲 How? ColPali is a ColBERT-like document retrieval model built on PaliGemma, it operates over image patches directly, and indexing takes far less time with more accuracy. You can use it for retrieval, and if you want to do retrieval augmented generation, find the closest document, and do not process it, give it directly to a VLM like Qwen2-VL (as image input) and give your text query. 🤝 This is much faster + you do not lose out on any information + much easier to maintain too! 🥳 Multimodal RAG https://huggingface.co/collections/merve/multimodal-rag-66d97602e781122aae0a5139 💬 Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) https://huggingface.co/collections/merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e 📖

View all activity

Organizations

rumbleFTW's activity

upvoted 2 collections 4 months ago

H2O Danube3

6 items • Updated Oct 17 • 53

H2O Danube2

4 items • Updated Oct 17 • 15