Qwen3.6-VL-REAP-26B-A3B — GGUF

GGUF quantizations of atbender/Qwen3.6-VL-REAP-26B-A3B, a REAP-pruned variant of Qwen3.6-VL. Both the language model (text quants) and the vision tower (mmproj) are included - drop the mmproj alongside any text quant for full multimodal (image + text) inference.

Files

| File | Quant | Size |
| --- | --- | --- |
| Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf | Q4_K_M | ~15 GB |
| Qwen3.6-VL-REAP-26B-A3B-text-IQ4_XS.gguf | IQ4_XS | ~14 GB |
| Qwen3.6-VL-REAP-26B-A3B-text-Q3_K_S.gguf | Q3_K_S | ~11 GB |
| mmproj-REAP-26B-F16.gguf | F16 (vision tower) | ~860 MB |

Quality vs bf16 (wikitext-2-raw, llama.cpp perplexity)

Each text quant was scored against the bf16 reference using llama-perplexity on the wikitext-2-raw test split (580 chunks, n_ctx = 512, ~297k tokens). The benchmark was run on a single RTX PRO 6000 (Blackwell, 96 GB).
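
For reference, a hedged sketch of this kind of measurement with llama.cpp's llama-perplexity (not necessarily the exact invocation used here; the bf16 GGUF name, dataset path, and logits file are placeholders):

# 1) Save reference logits from the bf16 model
llama-perplexity \
    -m Qwen3.6-VL-REAP-26B-A3B-text-bf16.gguf \
    -f wikitext-2-raw/wiki.test.raw -c 512 \
    --kl-divergence-base bf16-logits.bin

# 2) Score a quant against those logits (PPL, mean KLD, top-token agreement)
llama-perplexity \
    -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \
    -f wikitext-2-raw/wiki.test.raw -c 512 \
    --kl-divergence-base bf16-logits.bin --kl-divergence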

| Quant | PPL | ΔPPL vs bf16 | Mean KLD | Top-1 token agree |
| --- | --- | --- | --- | --- |
| bf16 (reference) | 9.2369 | 0 | — | 100% |
| Q4_K_M | 9.3858 | +1.62% | 0.0449 | 90.41% |
| IQ4_XS | 9.4293 | +2.08% | 0.0457 | 90.03% |
| Q3_K_S | 10.4822 | +13.51% | 0.1626 | 81.85% |

On Apple Silicon (llama.cpp, q8_0 KV cache): Q4_K_M had the best speed/quality trade-off across both standalone code-gen and agentic tasks. Q3_K_S held up reasonably well on quality at a smaller footprint, and IQ4_XS produced correct outputs but ran noticeably slower in the same harness. Your mileage may vary depending on your hardware and setup.
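
As a reference point, the KV-cache type in llama.cpp is selected per cache with the cache-type flags; a minimal sketch (quant file from the table above):

# q8_0 KV cache roughly halves KV memory vs f16; quantizing the V cache
# typically also requires flash attention to be enabled
llama-cli -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf -cnv \
    --cache-type-k q8_0 --cache-type-v q8_0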

Usage

Text-only

llama-cli -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf -cnv

Vision-language (multimodal)

The vision tower is loaded via --mmproj. Both llama-mtmd-cli (one-shot image+prompt) and llama-server (OpenAI-compatible HTTP server with image input) are supported.

One-shot CLI:

llama-mtmd-cli \
    -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \
    --mmproj mmproj-REAP-26B-F16.gguf \
    --image path/to/photo.jpg \
    -p "Describe this image."

Server (OpenAI-compatible /v1/chat/completions with image_url):

llama-server \
    -m Qwen3.6-VL-REAP-26B-A3B-text-Q4_K_M.gguf \
    --mmproj mmproj-REAP-26B-F16.gguf \
    --port 8080

Then send a chat-completions request with an image_url content part (a data URL or an HTTP URL); the server routes it through the mmproj automatically and advertises multimodal capability on the /v1/models endpoint when --mmproj is set.
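
A minimal sketch of such a request (the port matches the server command above; the image URL is a placeholder):

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
      }]
    }'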

Notes on the vision tower

  • Converted from atbender's source vision encoder weights (BF16) to GGUF F16 via llama.cpp's convert_hf_to_gguf.py --mmproj pipeline (a command sketch follows this list).
  • Validated end-to-end with llama-mtmd-cli on two test images (a real-world product photo and a music album scan); both produced accurate descriptions including readable on-image text, matching expectations from the bf16 reference.
  • F16 was kept (not quantized further) because the vision tower is small (~860 MB) and quality-sensitive; the marginal disk savings of FP8/Q8 don't justify the risk of degrading image grounding.
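
A sketch of that conversion step (the source model path is a placeholder; flags are from llama.cpp's convert_hf_to_gguf.py):

# --mmproj converts only the vision tower / projector into its own GGUF
python convert_hf_to_gguf.py /path/to/Qwen3.6-VL-REAP-26B-A3B \
    --mmproj --outtype f16 \
    --outfile mmproj-REAP-26B-F16.gguf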

Acknowledgements

Thanks to atbender for the REAP-pruned source model. License inherited from the base model (Apache 2.0).
