This repo is deprecated. Please use the new repos:
| Version | HuggingFace |
|---|---|
| fp16 (unquantized) | Mano-CUA-4B-Thinking-1.1 |
| MLX-8bit (Apple Silicon) | Mano-CUA-4B-Thinking-1.1-MLX-8bit |
Mano-P: Open-source GUI-VLA Agent for Edge Devices
Mano-P is a GUI-VLA agent project designed for edge devices. It is both an open-source project and a hardware product solution. The open-source release is phased, with each phase targeting a different group of developers:
- Phase 1: We will open-source the Mano-CUA Skills. This phase is aimed at agent enthusiasts, such as users of OpenClaw or Claude Code, who can use Mano-CUA Skills to build more intelligent CUA task workflows and remove the bottlenecks caused by human intervention.
- Phase 2: We will open-source the local-side models and SDK components of Mano-CUA. This phase targets developers with high security requirements, who can use GUI-VLA models that run inference locally on a Mac mini to build their own custom Skills, Tools, and more. All CUA operations are executed entirely on the local Mac mini and are not uploaded to external servers.
- Phase 3: We will open-source the training methodologies and the pruning and quantization techniques used for the Mano-P models. This phase is for developers with specific model training needs, who can apply these methods to create on-device GUI-VLA models tailored to their own requirements.
Our GUI-VLA models run inference directly on Mac mini and MacBook devices. Two deployment methods are currently supported: (1) direct deployment on a Mac mini or MacBook equipped with an M4 chip and 32GB or more of RAM, and (2) deployment using a compute stick connected via a USB 4.0 or higher port. Detailed instructions for both methods will be released in the near future, and we plan to support additional deployment options later.
Main Capabilities
- Complex GUI Automation: Autonomously complete complex interface operations containing hundreds of interactive elements
- Cross-System Data Integration: Extract and integrate multi-source data through purely visual interaction, with no API required
- Long-Task Planning Execution: Support enterprise-level business process automation spanning dozens to hundreds of steps
- Intelligent Report Generation: Automatically generate structured documents such as data analysis reports and work summaries
Technical Background
Mano-P builds on the complete technical framework of the Mano project (see Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), a "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, so that large models run efficiently on edge devices such as Mac mini, MacBook, and compute sticks.
Quick Start
Requirements
- macOS with Apple Silicon (M1+)
- Python >= 3.12
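The two requirements above can be checked before installing anything. The following is a minimal sketch, not part of the project: `meets_requirements` is a hypothetical helper name, and it assumes that a native Apple Silicon Python reports `Darwin` and `arm64` through the standard `platform` module (a Python running under Rosetta would report `x86_64` and fail the check).

```python
import platform
import sys


def meets_requirements(system, machine, version):
    """Return True for macOS on Apple Silicon with Python >= 3.12.

    Hypothetical helper: the arguments are passed in explicitly so the
    check can be exercised on any machine.
    """
    is_macos = system == "Darwin"
    is_apple_silicon = machine == "arm64"
    is_new_python = tuple(version[:2]) >= (3, 12)
    return is_macos and is_apple_silicon and is_new_python


# Check the interpreter that is running this script.
ok = meets_requirements(platform.system(), platform.machine(), sys.version_info)
print("Environment OK" if ok else "Unsupported environment")
```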
Installation
With Cider (recommended, includes W8A8 acceleration):
pip install mlx-vlm
pip install git+https://github.com/Mininglamp-AI/cider.git
Without Cider (FP16, PyTorch):
pip install transformers torch torchvision qwen-vl-utils
Single-Step Demo (FP16, PyTorch)
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
# 1. Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
"Mininglamp-2718/Mano-P",
torch_dtype="auto",
device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-P")
# 2. Load a screenshot
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)
# 3. Build prompt (the Chinese placeholder 具体动作 means "specific action"; it is kept verbatim as part of the prompt)
task = "Click the search bar and type hello"
prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
<action>具体动作</action>
## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
hover(start_box='<|box_start|>(x1,y1)<|box_end|>')
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
triple_click(start_box='<|box_start|>(x1,y1)<|box_end|>') left click at the coordinate (x1,y1) three times.
hotkey_click(start_box='<|box_start|>(x1,y1)<|box_end|>', key='') press command key and click at the coordinate (x1,y1).
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>') right click at the coordinate (x1,y1).
type(content='') type the content.
doubleclick(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>') # Drag an element from the start coordinate (x1,y1) to the end coordinate (x3,y3).
hotkey(key='') # Trigger a keyboard shortcut.
wait(duration='') # Sleep for specified duration (in seconds) and take a screenshot to check for any changes.
call_user() # Request human assistance
stop(reason='') # If the item can not found in the image, give the reason
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount') # Scroll on the specified direction at the coordinate (x1,y1) by the given amount
finish() # The task is completed.
## User Instruction
{task}"""
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image", "image": img},
{"type": "text", "text": prompt_text},
]},
]
# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text_input], images=image_inputs, videos=video_inputs,
padding=True, return_tensors="pt",
).to(model.device)
# Greedy decoding: with do_sample=False the temperature setting is not used, so it is omitted
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output)
Single-Step Demo (with Cider)
import mlx_vlm as pm
from vlm_service import custom_generate
from PIL import Image
# 1. Load model
model, processor = pm.load("Mininglamp-2718/Mano-P")
# 2. Load a screenshot (or any desktop screenshot image)
img = Image.open("screenshot.png")
# Resize to 1280px width (model's expected input resolution)
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)
# 3. Build prompt (the Chinese placeholder 具体动作 means "specific action"; it is kept verbatim as part of the prompt)
task = "Click the search bar and type hello"
prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
<action>具体动作</action>
## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
hover(start_box='<|box_start|>(x1,y1)<|box_end|>')
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
triple_click(start_box='<|box_start|>(x1,y1)<|box_end|>') left click at the coordinate (x1,y1) three times.
hotkey_click(start_box='<|box_start|>(x1,y1)<|box_end|>', key='') press command key and click at the coordinate (x1,y1).
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>') right click at the coordinate (x1,y1).
type(content='') type the content.
doubleclick(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>') # Drag an element from the start coordinate (x1,y1) to the end coordinate (x3,y3).
hotkey(key='') # Trigger a keyboard shortcut.
wait(duration='') # Sleep for specified duration (in seconds) and take a screenshot to check for any changes.
call_user() # Request human assistance
stop(reason='') # If the item can not found in the image, give the reason
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount') # Scroll on the specified direction at the coordinate (x1,y1) by the given amount
finish() # The task is completed.
## User Instruction
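{task}
当前步骤的截图为<image>"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_text},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Swap the <image> placeholder for the vision tokens. The placeholder line at the end of
# the prompt ("the screenshot for the current step is <image>") mirrors the multi-step
# template; without a placeholder, this replace would have nothing to act on.
prompt = prompt.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")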
# 4. Run inference
result = custom_generate(
model, processor, prompt,
[img],
max_tokens=512,
temperature=0.0,
prefill_step_size=2048,
)
print(f"Tokens: {result.generation_tokens}, Speed: {result.generation_tps:.1f} tok/s")
print(result.text)
Multi-Step Agent Loop
The model is designed for multi-turn interaction: execute an action, take a new screenshot, and feed it back together with the action history.
import mlx_vlm as pm
from vlm_service import custom_generate
from PIL import Image
import re
model, processor = pm.load("Mininglamp-2718/Mano-P")
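SYSTEM_PROMPT = "You are a helpful assistant."
# The Chinese placeholders in the template are kept verbatim as part of the prompt:
# 思考过程 = "thinking process", 动作描述 = "action description", 具体动作 = "specific action",
# 当前步骤的截图为 = "the screenshot for the current step is".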
INSTRUCTION_TEMPLATE = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
<think>思考过程</think>
<action_desp>动作描述</action_desp>
<action>具体动作</action>
## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
hover(start_box='<|box_start|>(x1,y1)<|box_end|>')
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
triple_click(start_box='<|box_start|>(x1,y1)<|box_end|>') left click at the coordinate (x1,y1) three times.
hotkey_click(start_box='<|box_start|>(x1,y1)<|box_end|>', key='') press command key and click at the coordinate (x1,y1).
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>') right click at the coordinate (x1,y1).
type(content='') type the content.
doubleclick(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>') # Drag an element from the start coordinate (x1,y1) to the end coordinate (x3,y3).
hotkey(key='') # Trigger a keyboard shortcut.
wait(duration='') # Sleep for specified duration (in seconds) and take a screenshot to check for any changes.
call_user() # Request human assistance
stop(reason='') # If the item can not found in the image, give the reason
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount') # Scroll on the specified direction at the coordinate (x1,y1) by the given amount
finish() # The task is completed.
## Note
- Use Chinese in `<think>` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `<action_desp>` part.
## User Instruction
{task}
{history}
当前步骤的截图为<image>"""
def resize(img, width=1280):
    """Resize to the given width, preserving the aspect ratio."""
    ratio = width / img.width
    return img.resize((width, int(img.height * ratio)), Image.LANCZOS)
def build_prompt(task, history_steps, current_img):
    """Build prompt with action history and current screenshot."""
    images = []
    # Include last history screenshot + current screenshot
    history_lines = []
    for i, step in enumerate(history_steps):
        if i == len(history_steps) - 1 and step.get("screenshot") is not None:
            images.append(step["screenshot"])
            # "Step N: <desc>, corresponding screenshot: <image>"
            history_lines.append(f"第{i+1}步: {step['desc']}, 对应截图为: <image>")
        else:
            # "Step N: <desc>"
            history_lines.append(f"第{i+1}步: {step['desc']}")
    history_text = "\n".join(history_lines) if history_lines else ""
    images.append(current_img)
    text = INSTRUCTION_TEMPLATE.format(task=task, history=history_text)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Replace <image> placeholders with vision tokens (right-to-left)
    for _ in range(len(images)):
        pos = prompt.rfind("<image>")
        if pos >= 0:
            prompt = prompt[:pos] + "<|vision_start|><|image_pad|><|vision_end|>" + prompt[pos + len("<image>"):]
    return prompt, images
def parse_output(text):
    """Extract think, action_desp, action from model output."""
    def extract(tag):
        # Missing tags yield an empty string instead of raising
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else ""
    return extract("think"), extract("action_desp"), extract("action")
# --- Agent loop ---
task = "Open Safari and search for 'MLX framework'"
history = []
max_steps = 10
for step in range(max_steps):
    # Take screenshot (replace with your own screenshot capture)
    screenshot = resize(Image.open(f"step_{step}.png"))
    # Build prompt and run inference
    prompt, images = build_prompt(task, history, screenshot)
    result = custom_generate(
        model, processor, prompt, images,
        max_tokens=512, temperature=0.0, prefill_step_size=2048,
    )
    think, action_desp, action = parse_output(result.text)
    print(f"[Step {step+1}] {action_desp}")
    print(f" Action: {action}")
    # Check terminal actions
    if action.startswith("finish"):
        print("Task completed!")
        break
    if action.startswith("stop"):
        print("Task infeasible.")
        break
    # Record history for next step
    history.append({"desc": action_desp, "screenshot": screenshot})
    # >>> Execute the action on screen, then loop back to take new screenshot <<<
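Before an action can be executed on screen, the `<action>` string has to be split into an action name and its arguments. The following is a minimal sketch of one way to do that, not the project's own parser: `parse_action` is a hypothetical helper, and it assumes that every argument value is wrapped in single quotes and contains no unescaped single quote (typed text such as `it's` would need extra handling).

```python
import re

# name(...) wrapper around the whole action string
_CALL_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
# key='value' pairs; assumes values contain no unescaped single quote
_ARG_RE = re.compile(r"(\w+)='((?:[^'\\]|\\.)*)'")
# <|box_start|>(x,y)<|box_end|> coordinate wrapper
_BOX_RE = re.compile(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>")


def parse_action(action):
    """Split an action string into (name, kwargs).

    Box arguments become (x, y) integer tuples; all other values stay strings.
    """
    call = _CALL_RE.match(action)
    if call is None:
        raise ValueError(f"unrecognized action: {action!r}")
    name, arg_text = call.group(1), call.group(2)
    kwargs = {}
    for key, value in _ARG_RE.findall(arg_text):
        box = _BOX_RE.fullmatch(value)
        kwargs[key] = (int(box.group(1)), int(box.group(2))) if box else value
    return name, kwargs


print(parse_action("click(start_box='<|box_start|>(500,38)<|box_end|>')"))
# ('click', {'start_box': (500, 38)})
print(parse_action("finish()"))
# ('finish', {})
```

Inside the agent loop, the parsed name can then select whichever handler performs the click, keystroke, or scroll in your own automation layer.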
Output Format
The model outputs its result as XML-style tags:
<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
Coordinates are normalized to the [0, 1000] range. To convert them to pixel coordinates:
pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
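The conversion above can be wrapped in a small function. This is a sketch under stated assumptions rather than the project's own code: `to_pixels` is a hypothetical name, integer arithmetic replaces the floating-point expression to avoid rounding surprises, and the result is clamped so that a coordinate of exactly 1000 stays inside the screen.

```python
def to_pixels(x, y, screen_width, screen_height):
    """Map model coordinates in [0, 1000] to pixel coordinates.

    Hypothetical helper. Uses integer floor division, which gives the same
    result as the exact value of x / 1000 * screen_width rounded down.
    """
    px = x * screen_width // 1000
    py = y * screen_height // 1000
    # Clamp so that x == 1000 or y == 1000 does not land one pixel past the edge.
    return min(px, screen_width - 1), min(py, screen_height - 1)


print(to_pixels(500, 38, 2560, 1440))
# (1280, 54)
print(to_pixels(1000, 1000, 1920, 1080))
# (1919, 1079)
```

Because the coordinates are normalized, the target width and height should be those of the surface where the action is executed, not those of the resized 1280px screenshot.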
W8A8 Acceleration (M5+ only)
On Apple M5 or later, enable INT8 acceleration for ~15-19% faster prefill:
from cider import convert_model, is_available
# Enable W8A8 (INT8) acceleration only when the hardware supports it
if is_available():
    convert_model(model.language_model)
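If the same script has to run on machines with and without Cider installed, the call above can be guarded. The following is a minimal sketch: `maybe_enable_w8a8` is a hypothetical wrapper, and it assumes only the two Cider functions shown above (`is_available` and `convert_model`) together with a model object that exposes `language_model`.

```python
def maybe_enable_w8a8(model):
    """Try to enable W8A8 acceleration; return True only if it was applied.

    Hypothetical wrapper: falls back silently when Cider is not installed
    or reports that acceleration is unavailable on this hardware.
    """
    try:
        from cider import convert_model, is_available
    except ImportError:
        # Cider is not installed; keep the model as loaded.
        return False
    if not is_available():
        return False
    convert_model(model.language_model)
    return True
```

A caller would run `maybe_enable_w8a8(model)` once, right after loading the model, and can log the returned flag to record which path was taken.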
Full Action Space
| Action | Syntax | Description |
|---|---|---|
| open_app | `open_app(app_name='')` | Open an application |
| open_url | `open_url(url='')` | Open a URL |
| click | `click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Left click |
| doubleclick | `doubleclick(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Double click |
| triple_click | `triple_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Triple click (select line) |
| right_single | `right_single(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Right click |
| hover | `hover(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Mouse hover |
| type | `type(content='text')` | Type text |
| hotkey | `hotkey(key='cmd+c')` | Keyboard shortcut |
| hotkey_click | `hotkey_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>', key='shift')` | Modifier + click |
| scroll | `scroll(start_box='<\|box_start\|>(x,y)<\|box_end\|>', direction='down', amount='3')` | Scroll |
| drag | `drag(start_box='<\|box_start\|>(x1,y1)<\|box_end\|>', end_box='<\|box_start\|>(x2,y2)<\|box_end\|>')` | Drag and drop |
| wait | `wait(duration='2')` | Wait (seconds) |
| finish | `finish()` | Task completed |
| stop | `stop(reason='...')` | Task infeasible |
| call_user | `call_user()` | Request human help |
📮 Contact
- 🏠 Website: https://github.com/Mininglamp-AI/Mano-P
- 📧 Email: model@mininglamp.com