Quantized Qwen3-4B-Thinking-2507

This repository provides Q4 and Q5 quantized versions of the Qwen3-4B-Thinking-2507 model. The model emphasizes depth of reasoning (“thinking mode”): it outputs an internal “thought” section (within <think> ... </think> boundaries) followed by the final generated content. It offers significantly improved reasoning and general capabilities, including logic, math, science, coding, instruction following, tool usage, text generation, alignment with human preferences, and 256K long-context understanding. The quantized models can run on CPUs and edge devices, making the model's capabilities accessible without high-end hardware.

Model Overview

  • Original Model: Qwen3-4B-Thinking-2507
  • Thinking Mode: Enabled by default; no explicit tag is required in the prompt
  • Architecture: decoder-only model
  • Base Model: Qwen3-4B-Thinking-2507
  • Quantized Version:
    • Q4_K_M
    • Q5_K_M
  • Modalities: Text
  • Developer: Qwen
  • License: Apache 2.0
  • Languages: English
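Because thinking mode is always on, downstream code usually wants to separate the reasoning trace from the final answer. Below is a minimal stdlib-only sketch; the `split_thinking` helper name and the exact tag handling are illustrative assumptions, not part of the model's API. Note that the opening <think> tag may be absent from the raw output when the chat template emits it before generation starts.

```python
import re

def split_thinking(output: str):
    """Split model output into (reasoning trace, final answer).

    Handles both "<think>...</think>answer" and the common case where
    only the closing </think> tag appears in the generated text.
    """
    match = re.search(r"(?:<think>)?(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat everything as the final answer.
        return "", output.strip()
    thought = match.group(1).strip()
    answer = output[match.end():].strip()
    return thought, answer

raw = "Let me reason step by step...</think>LLMs are neural networks trained on text."
thought, answer = split_thinking(raw)
```

The same helper works unchanged whether or not the opening tag is present, so it can be applied to output from llama.cpp, vLLM, or any other runtime.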

Quantization Details

Q4_K_M Version

  • Approx. 71% size reduction
  • Lower memory footprint (~2.33 GB)
  • Best suited for deployment on edge devices or low-resource GPUs
  • Slight performance degradation in complex reasoning scenarios

Q5_K_M Version

  • Approx. 66% size reduction
  • Lower memory footprint (~2.69 GB)
  • Better performance retention, recommended when quality is a priority
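The file sizes above imply an effective bit width per weight, which is a quick way to sanity-check a quantized download. A back-of-the-envelope sketch (the 4.0e9 parameter count is taken from the model name and rounds the true count, so results are approximate):

```python
# Rough effective bits-per-weight implied by the quantized file sizes.
# Assumes ~4.0e9 parameters (from the model name); the true count differs slightly.
PARAMS = 4.0e9

def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    size_bits = size_gb * (1024 ** 3) * 8   # GiB -> bits
    return size_bits / params

q4 = bits_per_weight(2.33)  # Q4_K_M: ~5 effective bits per weight
q5 = bits_per_weight(2.69)  # Q5_K_M: ~5.8 effective bits per weight
```

Both figures land slightly above their nominal 4- and 5-bit labels because K-quants mix precisions across tensors and store per-block scaling metadata.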

Key Features

  • Significantly improved reasoning: logical reasoning, mathematics, science, coding, and academic benchmarks.
  • Better general capabilities: instruction following, tool usage, text generation, and alignment with human preferences.
  • Long-Context Understanding: Can handle up to 256K tokens, enabling analysis of very large documents.
  • Large-Scale Transformer: 36-layer decoder-only Transformer with Grouped Query Attention for efficient computation.
  • Deployment Ready: Compatible with CPUs, lightweight GPUs, and frameworks such as vLLM and llama.cpp.
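Grouped Query Attention matters for the long-context claim above: it shares each key/value head across several query heads, so the KV cache shrinks by the ratio of query heads to KV heads. A sketch of the arithmetic follows; only the 36-layer depth comes from this card, while the head counts, head dimension, and sequence length are placeholder assumptions, not Qwen3-4B's published configuration.

```python
# KV-cache size under Grouped Query Attention (GQA) vs. standard multi-head
# attention (MHA). 36 layers is from the model card; the head counts,
# head_dim, and seq_len below are illustrative placeholders.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each shaped [n_kv_heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

mha = kv_cache_bytes(n_layers=36, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=36, n_kv_heads=8,  head_dim=128, seq_len=4096)
ratio = mha / gqa  # cache shrinks by n_query_heads / n_kv_heads
```

With 32 query heads sharing 8 KV heads, the cache is 4x smaller, which is what makes 256K-token contexts feasible on modest hardware.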

Usage Example

Using llama.cpp for inference:

./llama-cli -hf SandLogicTechnologies/Qwen3-4B-Thinking-2507-GGUF -p "Give me a short introduction to large language models."

Recommended Use Cases

  • Advanced Reasoning Tasks: Logical reasoning, mathematics, science, and coding problems.
  • Academic Assistance: Solving benchmark questions, research summarization, and educational content generation.
  • Instruction Following: Chatbots and virtual assistants that respond accurately to user instructions.
  • Long-Context Applications: Analyzing and generating content from very large documents (up to 256K tokens).
  • Deployment in Low-Resource Environments: Running efficiently on CPUs, edge devices, or lightweight GPUs.

Acknowledgments

These quantized models are based on the original work by the Qwen development team.

Special thanks to:

  • The Qwen team for developing and releasing the Qwen3-4B-Thinking-2507 model.

  • Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.


Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.
