Instructions for using agentscope-ai/QwenPaw-Flash-9B-Q8_0 with libraries and local apps.

Libraries

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="agentscope-ai/QwenPaw-Flash-9B-Q8_0",
	filename="QwenPaw-flash-9B-20260330-q8.gguf",
)

# messages must be a list of role/content dictionaries, not a plain string
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello, QwenPaw!"}
	]
)
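
The call above returns a chat-completion dictionary in the OpenAI response layout. A minimal sketch of pulling out the reply text, assuming that layout; `extract_reply` is a hypothetical helper name, and the sample dictionary is made up for illustration rather than real model output:

```python
from typing import Any


def extract_reply(response: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with "role" and "content" keys.
    content = choices[0]["message"].get("content")
    return content or ""


# Illustrative stand-in for the dict returned by llm.create_chat_completion(...).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi there!"},
            "finish_reason": "stop",
        }
    ]
}
reply = extract_reply(sample_response)
print(reply)
```

With a loaded model, the same helper can be applied directly to the value returned by `llm.create_chat_completion(...)`.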

Local Apps

llama.cpp

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0
# Run inference directly in the terminal:
llama-cli -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0
# Run inference directly in the terminal:
llama-cli -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0
# Run inference directly in the terminal:
./llama-cli -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Use Docker

docker model run hf.co/agentscope-ai/QwenPaw-Flash-9B-Q8_0
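
Once `llama-server` is running, it can be queried over HTTP. The sketch below builds a chat-completions request with only the Python standard library. It assumes the server listens on llama-server's default `http://localhost:8080` and exposes the OpenAI-style `/v1/chat/completions` route. `build_chat_request` is a hypothetical helper, and the actual network call is left commented out so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080/v1",  # assumed default llama-server address
    "agentscope-ai/QwenPaw-Flash-9B-Q8_0",
    "Hello, QwenPaw!",
)
print(request.full_url)

# To send it once the server is up:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```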

Ollama
How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Ollama:
```
ollama run hf.co/agentscope-ai/QwenPaw-Flash-9B-Q8_0
```

Unsloth Studio

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for agentscope-ai/QwenPaw-Flash-9B-Q8_0 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for agentscope-ai/QwenPaw-Flash-9B-Q8_0 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for agentscope-ai/QwenPaw-Flash-9B-Q8_0 to start chatting

Pi

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "agentscope-ai/QwenPaw-Flash-9B-Q8_0"
        }
      ]
    }
  }
}
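
The same provider entry can be generated programmatically, which helps when the server address or model id changes. This is a sketch that only mirrors the JSON shown above; `pi_provider_config` is a hypothetical helper, and merging into an existing `models.json` is left out:

```python
import json


def pi_provider_config(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a dict matching the models.json layout shown above."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = pi_provider_config("agentscope-ai/QwenPaw-Flash-9B-Q8_0")
config_text = json.dumps(config, indent=2)
print(config_text)  # paste or write this into ~/.pi/agent/models.json
```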

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf agentscope-ai/QwenPaw-Flash-9B-Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default agentscope-ai/QwenPaw-Flash-9B-Q8_0

Run Hermes

hermes

Docker Model Runner
How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Docker Model Runner:
```
docker model run hf.co/agentscope-ai/QwenPaw-Flash-9B-Q8_0
```

Lemonade

How to use agentscope-ai/QwenPaw-Flash-9B-Q8_0 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull agentscope-ai/QwenPaw-Flash-9B-Q8_0

Run and chat with the model

lemonade run user.QwenPaw-Flash-9B-Q8_0-{{QUANT_TAG}}

List all available models

lemonade list

QwenPaw-Flash

QwenPaw-Flash is a lightweight model optimized for the QwenPaw autonomous agent scenario. It was refined specifically for QwenPaw tasks during training, which improves agentic performance in tool invocation, command execution, memory management, and multi-step planning.

Capability

The core strength of QwenPaw-Flash stems from its native integration with the QwenPaw ecosystem. We built a large set of high-quality agent trajectories sampled from real QwenPaw environments to systematically improve the model's proficiency in high-frequency daily scenarios. Key features include:

Active Memory Management: Autonomously identifies, stores, and retrieves persistent user preferences and task states, ensuring high logical consistency across multi-turn interactions.
Native File Parsing: Optimized for terminal operations and file system orchestration. Excels at generating precise CLI commands and executing complex, multi-step file I/O tasks.
Efficient Information Search: Enhanced for web-search tool invocation. Features precise search intent recognition and multi-step web navigation to effectively identify and query online information.
Intelligent Guidance: Built-in awareness of the QwenPaw feature map. Proactively suggests functional paths and troubleshooting based on real-time operational context.

Model Overview

QwenPaw-Flash-2B/4B/9B are fine-tuned from Qwen3.5-2B/4B/9B, respectively, and share the same architectural parameters as their base models.

Type: Causal Language Model with Vision Encoder
Training Stage: Post-training
Number of Parameters: 2B/4B/9B
Hidden Dimension: 2048/2560/4096
Token Embedding: 248320 (Padded)
Number of Layers: 24/32/32
Hidden Layout: 6/8/8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
Gated DeltaNet:
- Number of Linear Attention Heads: 16/32/32 for V and 16/16/16 for QK
- Head Dimension: 128
Gated Attention:
- Number of Attention Heads: 8/16/16 for Q and 2/4/4 for KV
- Head Dimension: 256
Rotary Position Embedding Dimension: 64
Feed Forward Network:
- Intermediate Dimension: 6144/9216/12288
LM Output: 248320 (Tied to token embedding)
Context Length: 262,144 tokens natively
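
The figures above can be cross-checked with a little arithmetic. The sketch below recomputes the layer counts from the hidden layout and the token-embedding size, and estimates the key/value cache of the Gated Attention layers at full context. The 2-byte (16-bit) cache element size is an assumption, and the estimate ignores the Gated DeltaNet state and any runtime overhead:

```python
from dataclasses import dataclass

VOCAB_SIZE = 248_320        # padded token embedding size
CONTEXT_LENGTH = 262_144    # native context length in tokens
ATTN_HEAD_DIM = 256         # Gated Attention head dimension


@dataclass(frozen=True)
class Variant:
    name: str
    hidden_dim: int
    blocks: int       # repeats of (3 x Gated DeltaNet + 1 x Gated Attention)
    kv_heads: int     # KV heads per Gated Attention layer

    @property
    def num_layers(self) -> int:
        return self.blocks * (3 + 1)

    @property
    def embedding_params(self) -> int:
        return VOCAB_SIZE * self.hidden_dim

    def kv_cache_bytes(self, tokens: int = CONTEXT_LENGTH, bytes_per_elem: int = 2) -> int:
        # Only the one Gated Attention layer per block keeps a per-token K and V.
        return 2 * self.blocks * self.kv_heads * ATTN_HEAD_DIM * tokens * bytes_per_elem


VARIANTS = [
    Variant("2B", hidden_dim=2048, blocks=6, kv_heads=2),
    Variant("4B", hidden_dim=2560, blocks=8, kv_heads=4),
    Variant("9B", hidden_dim=4096, blocks=8, kv_heads=4),
]

for v in VARIANTS:
    gib = v.kv_cache_bytes() / 2**30
    print(f"{v.name}: {v.num_layers} layers, "
          f"{v.embedding_params:,} embedding params, "
          f"~{gib:.0f} GiB KV cache at full context")
```

Because only one layer in four keeps a per-token key/value cache, the full-context cache estimate stays in the single-digit GiB range under these assumptions.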

Benchmark Results

The complexity of QwenPaw's context engineering and tool usage poses heightened challenges for model evaluation. To address this, we have developed a dedicated benchmark tailored to the QwenPaw environment. This benchmark systematically evaluates model performance across five high-frequency usage scenarios, covering key operational dimensions.

Results indicate that QwenPaw-Flash delivers substantial improvements across multiple task categories and achieves performance comparable to leading flagship models while requiring significantly fewer resources.

Figure 1: QwenPaw-Flash-9B compared with other models.

Figure 2: QwenPaw-Flash-2B/4B/9B compared with their respective baseline models.

Quickstart

Serving QwenPaw-Flash

QwenPaw-Flash can be served via APIs using popular inference frameworks. Below are example commands to launch OpenAI-compatible API servers for QwenPaw-Flash.

llama.cpp

See the Qwen llama.cpp documentation for further usage guidance.

We recommend cloning llama.cpp and installing it by following the official guide. The commands here track the latest version of llama.cpp.

llama-server -m /path/to/.gguf
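
Loading a multi-gigabyte GGUF file can take a while, so client scripts may want to wait until the server is ready before sending requests. A minimal polling sketch follows. It assumes the server exposes llama-server's `/health` route on the default port 8080, and `http_probe` and `wait_until_ready` are hypothetical helper names:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 1.0,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


# Against a real server (assumed address):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8080/health"))

# Offline demonstration with a fake probe that succeeds on the third call.
calls = iter([False, False, True])
ready = wait_until_ready(lambda: next(calls), attempts=5, delay_s=0.0)
print("ready:", ready)
```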

Using QwenPaw-Flash via Chat Completions API

Once the server is running, you can access QwenPaw-Flash via standard HTTP requests or OpenAI-compatible SDKs.

Prerequisites

Ensure the OpenAI Python SDK is installed and your environment variables are configured:

pip install -U openai

# Set the following to match your server (llama-server listens on port 8080 by default)
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="EMPTY"

Text-Only Input Example

The following Python script demonstrates how to interact with the model using the OpenAI SDK:

from openai import OpenAI
# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Hello, QwenPaw!"},
]

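chat_response = client.chat.completions.create(
    model="<your_model_path>",  # replace with the model name or path your server uses
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    # top_k is not an OpenAI API parameter, so it is passed through extra_body
    extra_body={
        "top_k": 20,
    },
)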
print("Chat response:", chat_response)
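
The response object follows the OpenAI SDK layout, so the reply text is at `chat_response.choices[0].message.content`. A sketch of carrying that reply into a follow-up turn, using a stand-in object in place of a live response; `continue_conversation` is a hypothetical helper, and the stand-in text is not real model output:

```python
from types import SimpleNamespace


def continue_conversation(messages: list, chat_response, next_user_text: str) -> list:
    """Return a new message list with the assistant reply and the next user turn."""
    reply = chat_response.choices[0].message.content or ""
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_user_text},
    ]


# Stand-in with the same attribute layout as the SDK response object.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="Hello! How can I help?")
        )
    ]
)
history = [{"role": "user", "content": "Hello, QwenPaw!"}]
history = continue_conversation(history, fake_response, "List the files in this directory.")
for turn in history:
    print(f'{turn["role"]}: {turn["content"]}')
```

Passing the returned list as `messages` in the next `client.chat.completions.create(...)` call keeps the earlier turns in context.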