Use AI on Your PC: Optimize and Deploy a Multimodal Agentic Pipeline on AI PC Powered by Intel
✨ Highlights
- Learn how to build and deploy a multimodal agentic RAG pipeline on next-generation AI PCs powered by Intel.
- The structured pipeline delivers well-grounded, timely responses and efficient processing on local hardware.
- Whether you're a researcher, developer, or privacy-conscious user, this stack unlocks a new era of personal, private AI.
Agentic AI marks a major leap forward in how machines perceive, reason, and interact with the world, across language, vision, and video. Yet, bringing this intelligence into everyday use remains challenging: How do we make advanced multimodal systems fast, private, and accessible enough to run locally, on a laptop, not in the cloud?
Follow our demo featured at CVPR 2025 to learn how to build and deploy a multimodal agentic retrieval-augmented generation (RAG) pipeline that can understand complex visual-textual input, retrieve relevant information, and respond, all while running efficiently on next-generation AI PCs powered by Intel® Core™ Ultra processors.
The demo shows you how to:
- Structure and retrieve from vector databases with images and videos.
- Connect to large vision language models (VLMs) to generate rich responses.
- Use agentic logic with model context protocol (MCP) to support iterative reasoning, dynamic tool use, and multimodal task planning.
- Optimize pipelines using OpenVINO™ GenAI APIs, NNCF, and INT4 quantization for fast inference across central processing unit (CPU), graphics processing unit (GPU), and neural processing unit (NPU).
Read the guide and watch the demo below to understand how to make a full working pipeline that embodies the future of fast, private, and personalized agentic AI on an AI PC.
📚 Core Concepts
Q&A Based on Video Understanding with Multimodal RAG and VLM
Efficiently handling multimodal data, which combines text, images, videos, and other forms of media, is essential for modern AI applications such as personal assistants, medical diagnostics, and educational tools. The complexity arises from the need to integrate retrieval and generation capabilities seamlessly so that responses are precise and context-aware. The multimodal RAG pipeline addresses this need by structuring and retrieving multimodal data, interpreting it through advanced vision-language models, and optimizing it for fast local inference. See the figure below for an overview of the method.
Image credit: Intel.
Structuring and Retrieving from Multimodal Vector Databases
Multimodal RAG relies on efficiently retrieving relevant content from vector databases containing text, images, and videos. In our setup, BridgeTower, a multimodal embedding model, is used to create unified vector representations capturing semantic relationships across modalities. By encoding queries and multimedia content into these embeddings, BridgeTower enables rapid and accurate retrieval of contextually relevant multimodal data.
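To make the retrieval step more concrete, here is a minimal sketch of nearest-neighbor search over embeddings using cosine similarity. It assumes the embeddings have already been computed; the 4-dimensional vectors, the captions in the comments, and the function name `cosine_top_k` are hypothetical placeholders rather than real BridgeTower outputs or the demo's actual code, and a real vector database would typically handle this search internally.

```python
import numpy as np

def cosine_top_k(query_vec, item_vecs, k=2):
    """Return indices and scores of the k items most similar to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order.tolist(), scores[order].tolist()

# Hypothetical embeddings standing in for encoded video segments.
index_vectors = np.array([
    [0.9, 0.1, 0.0, 0.0],  # segment 0: "person opens a laptop"
    [0.0, 0.8, 0.2, 0.0],  # segment 1: "close-up of a keyboard"
    [0.1, 0.0, 0.9, 0.3],  # segment 2: "street scene at night"
])
# Hypothetical embedding of the user's text query.
query_vector = np.array([1.0, 0.2, 0.0, 0.0])

top_ids, top_scores = cosine_top_k(query_vector, index_vectors, k=2)
print(top_ids, [round(s, 3) for s in top_scores])
```

Because queries and media share one embedding space, the same similarity function can rank video segments against a text query without any modality-specific logic.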
Connecting to Large Vision-Language Models (VLMs)
Once the relevant multimodal content is retrieved, powerful vision-language models such as Phi-4-multimodal-instruct interpret this data. These models perform advanced visual understanding by extracting context to identify objects, scenes, and relationships within images and videos, enabling the generation of detailed and relevant textual descriptions. This allows the model to deliver precise and contextually accurate responses.
Optimizing with OpenVINO
To ensure efficient local execution, models are optimized using OpenVINO GenAI APIs, Neural Network Compression Framework (NNCF), and INT4 quantization. OpenVINO optimizes and converts models into its intermediate representation (IR) format, significantly enhancing inference speed and efficiency across CPUs, GPUs, and NPUs. The optimization workflow involves:
- Converting pre-trained models into OpenVINO IR format using the optimum-cli export command.
- Applying INT4 quantization to reduce computational load and memory usage.
- Deploying optimized models on Intel hardware for high-performance, low-latency inference.
This optimized pipeline enables seamless multimodal interactions locally on AI PCs powered by Intel.
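To illustrate why INT4 quantization reduces memory usage, the following sketch applies a simple symmetric, group-wise 4-bit quantization to a small weight vector in NumPy. This is an assumption-laden illustration of the general idea only: it is not how NNCF implements weight compression, and the function names, group size, and weight values are made up for the example.

```python
import numpy as np

def quantize_int4_symmetric(weights, group_size=4):
    """Quantize a 1-D float array to signed 4-bit codes, one scale per group.

    The array length must be a multiple of group_size.
    """
    groups = weights.reshape(-1, group_size)
    # Signed 4-bit integers span [-8, 7]; map each group's largest magnitude to 7.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    """Recover approximate float weights from the 4-bit codes and scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

# Hypothetical weights; real models hold millions of such values per layer.
weights = np.array([0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6], dtype=np.float32)
codes, scales = quantize_int4_symmetric(weights, group_size=4)
restored = dequantize(codes, scales)
max_error = float(np.abs(weights - restored).max())
print(codes.tolist(), max_error)
```

Storing 4-bit codes instead of 16-bit floats cuts weight storage roughly fourfold before counting the per-group scales, at the cost of a rounding error bounded by half a scale step per weight.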
Image credit: Intel.
Using Agent Logic with MCP
As AI models become more capable, they also need better ways to interact with the world around them. MCP is a lightweight framework that standardizes how context is passed into AI models during inference. It enables the use of tools and external APIs, allowing models to dynamically decide when and how to use these tools.
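As a rough illustration of what a tool invocation looks like, the sketch below dispatches a request loosely modelled on the JSON-RPC 2.0 style `tools/call` message that MCP uses. It relies only on the Python standard library; the `add_to_cart` tool, the `TOOLS` registry, and `handle_request` are hypothetical names for this example. A real MCP server would normally be built with an MCP SDK, which also handles transport, tool discovery, and input schemas.

```python
import json

# Hypothetical tool; its name and behavior are illustrative only.
def add_to_cart(item: str, quantity: int = 1) -> str:
    return f"Added {quantity} x {item} to the cart."

TOOLS = {"add_to_cart": add_to_cart}

def handle_request(raw: str) -> str:
    """Dispatch a 'tools/call' style request to a registered tool."""
    request = json.loads(raw)
    base = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        return json.dumps({**base, "error": {"code": -32601, "message": "Method not found"}})
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        # -32602 is the JSON-RPC 2.0 code for "Invalid params".
        return json.dumps({**base, "error": {"code": -32602, "message": "Unknown tool"}})
    text = tool(**params.get("arguments", {}))
    return json.dumps({**base, "result": {"content": [{"type": "text", "text": text}]}})

# The model decides to call a tool and emits a structured request like this one.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "add_to_cart", "arguments": {"item": "paint brush", "quantity": 2}},
}
response = json.loads(handle_request(json.dumps(request)))
print(response["result"]["content"][0]["text"])
```

The key design point is that the model never executes code itself: it emits a structured request, and the server decides whether and how to run the named tool.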
Agentic Pipeline
Agentic frameworks are AI systems built to think and act on their own when tackling complex goals. They break those goals down into subtasks, make decisions over multiple steps, and dynamically interact with tools or external environments as needed to complete objectives. This way, they can handle challenges more flexibly and with little to no human help.
So far, we have gone through the core concepts behind setting up the pipeline. Now, the question is: How can we run it locally?
💻 Hardware and Technologies: AI PC Powered by Intel Core Ultra
Unlike a typical computer, an AI PC comes with advanced hardware, such as powerful GPUs that provide fast AI model inference even for computationally heavy workloads. AI PCs also come with specialized AI chips called neural processing units (NPUs), which are optimized for low power consumption while maintaining high computational performance. Our demo code is designed to run smoothly on an AI PC, leveraging its powerful hardware and specialized components to deliver fast, efficient, and local AI processing. With the right setup, you will be able to experience firsthand how next-generation AI hardware can transform complex tasks into seamless workflows right on your own machine. Together, OpenVINO and the AI PC deliver a powerful combination to optimize, accelerate, and deploy cutting-edge large language and vision-language AI models locally.
🧩 Putting It All Together in the Demo
Image credit: Intel.
Our agentic framework (see figure above) is orchestrated by a central routing agent that determines the appropriate interactions with other agents and tools based on the input query. Specialized agents, such as a video understanding agent for interpreting visual content in videos and a shopping assistant agent for handling product-related queries, can be invoked by the routing agent as needed. For example, if the input involves video content, the routing agent will bring in the video understanding agent to interpret what is happening on screen. If it’s a product-related question, a shopping assistant agent will be invoked to help.
These agents are complemented by tools such as RAG modules that provide access to external knowledge, making sure the AI has access to the latest and most relevant information. In our setup, all three agents employ a single shared instance of the Qwen2.5-Instruct model running on GPU as their central “brain” to autonomously achieve a given goal by reasoning, planning, and invoking MCP tools if needed. All the tools run on CPU, except for multimodal RAG tools that run on GPUs.
We have integrated two RAG tools:
- One multimodal RAG tool specialized for video inputs, which employs the BridgeTower model (running on CPU) for multimodal embedding and Phi-4-Multimodal-Instruct (running on GPU) as the VLM. LangChain is used to develop this multimodal RAG.
- Another RAG tool that retrieves information from documents such as PDFs, which employs the BAAI/bge-large-en-v1.5 model (running on CPU) as the embedding model. LlamaIndex is used to develop this RAG.
Finally, the demo combines video search and shopping cart functionalities through MCP servers and a Gradio-based web interface.
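The routing pattern described in this section can be sketched in a few lines. Note the main assumption: in the demo the routing decision is made by an LLM, whereas this sketch substitutes a simple keyword match so that it runs without any model. The agent functions, keyword lists, and the `route` helper are all hypothetical and serve only to show the control flow.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the specialized agents.
def video_understanding_agent(query: str) -> str:
    return f"[video agent] would search indexed video segments for: {query}"

def shopping_assistant_agent(query: str) -> str:
    return f"[shopping agent] would handle the product request: {query}"

def routing_agent_direct(query: str) -> str:
    return f"[router] would answer directly: {query}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "video": video_understanding_agent,
    "shopping": shopping_assistant_agent,
    "router": routing_agent_direct,
}

# Hypothetical keyword lists replacing the LLM's routing decision.
ROUTING_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "video": ("video", "scene", "clip", "frame"),
    "shopping": ("buy", "cart", "price", "product"),
}

def route(query: str) -> str:
    """Pick the agent whose keywords appear in the query, else the router itself."""
    lowered = query.lower()
    for agent_name, keywords in ROUTING_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return agent_name
    return "router"

for user_query in ("What happens in the video after the first scene?",
                   "Add two brushes to my cart",
                   "Hello"):
    chosen = route(user_query)
    print(chosen, "->", AGENTS[chosen](user_query))
```

Replacing `route` with a call to a shared language model, while keeping the `AGENTS` table unchanged, would bring the sketch closer to the single-model, multi-agent layout described above.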
How to Reproduce the Demo
To set up and use the demo, refer to the OpenVINO toolkit on GitHub for step-by-step guidelines.
Below is an example showcasing the capabilities of our framework. It is based on this video and was run on an AI PC powered by Intel® Core™ Ultra processors.
Please note that this is still experimental, and all results should be independently verified by the user to prevent mistakes or misuse.
Conclusion
The demo shows that powerful AI does not need to be remote, expensive, or intrusive. By combining OpenVINO, MCP, and an AI PC powered by Intel Core Ultra processors, models can run efficiently and securely, right from your laptop. Whether you're a researcher, developer, or privacy-conscious user, this stack unlocks a new era of personal, private AI.
🙏 Acknowledgment
This tutorial builds on the work and insights of several contributors. While not all of them are authors of this blog, their efforts were critical in working on the demo. We would like to thank Raymond Lo, Gustavo Lujan Moreno, Adrian Boguszewski, James Lee and Vasudev Lal.