ROUTER LAYERS WERE ACCIDENTALLY QUANTIZED - UPDATED FILES COMING SOON™

Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER-GPTQ-(W4A16)

A standard-quality 4-bit GPTQ quantization of DavidAU's Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER, produced with llm-compressor in W4A16 format.

Model Details

Model Description

This is a 4-bit GPTQ quantized version of the Qwen3-42B MASTER-CODER model, intended for efficient deployment with minimal quality loss. The quantization uses the W4A16 format (4-bit weights, 16-bit activations) with standard-quality settings (group size 128) to balance size, speed, and output quality.

The base model is a 42-billion-parameter mixture-of-experts model enhanced with the "Brainstorm 20X" adapter. It targets coding, reasoning, and creative tasks, with native 256K context support.

  • Quantized by: RobPS
  • Model type: Causal Language Model (Mixture of Experts)
  • Language(s): English (primary), multilingual support
  • License: Apache 2.0
  • Quantization format: GPTQ W4A16 (4-bit weights, 16-bit activations)
  • Finetuned from model: DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER
  • Base model: Qwen/Qwen3-30B-A3B-Thinking-2507

Uses

Direct Use

This quantized model is optimized for:

  • Coding and Programming: Excellent performance across major and minor programming languages
  • Reasoning Tasks: Extended thinking capability for complex problem-solving
  • Creative Writing: Enhanced prose quality and detail through Brainstorm adapter
  • Instruction Following: Strong adherence to user instructions
  • Tool Usage: Integration with external tools and APIs
  • Agentic Applications: Multi-step reasoning and planning

The W4A16 quantization enables deployment on consumer hardware while keeping output quality close to FP16 (see Quantization Results below for the expected loss).
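For example, the checkpoint can be loaded directly with vLLM, which reads W4A16 checkpoints produced by llm-compressor. The sketch below is illustrative: the engine arguments (context window, GPU memory fraction) are assumptions, and the repository id should be adjusted if it differs from this card's.

```python
# Minimal vLLM loading sketch. Engine arguments are illustrative assumptions;
# adjust max_model_len / gpu_memory_utilization for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tcclaviger/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER",
    max_model_len=32768,           # raise toward 262144 (256K) if memory allows
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function that checks whether a string is a palindrome."],
    params,
)
print(outputs[0].outputs[0].text)
```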

Downstream Use

Suitable for:

  • Code generation and analysis pipelines
  • AI-assisted development environments
  • Creative writing tools
  • Educational tutoring systems
  • Research assistants
  • Agentic frameworks requiring reasoning capabilities

Out-of-Scope Use

  • Real-time safety-critical applications without human oversight
  • Medical diagnosis or legal advice without expert verification
  • Applications requiring 100% factual accuracy without validation
  • Tasks requiring context beyond 256K tokens

Quantization Details

Quantization Configuration

  • Method: GPTQ (built on Optimal Brain Quantization)
  • Format: W4A16 (4-bit weights, 16-bit activations)
  • Group Size: 128 (standard quality, optimal balance)
  • Symmetric Quantization: Yes
  • Strategy: Group quantization
  • Preserved Layers: lm_head (FP16)
  • Quantized Layers: All Linear layers

Calibration Dataset

  • Dataset: open-platypus
  • Samples: 128
  • Sequence Length: 512 tokens
  • Total Calibration Tokens: ~65,536
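
A minimal llm-compressor sketch reproducing the configuration above is shown below. The exact script used for this release may differ; in particular, the MoE router/gate ignore pattern is an assumption added in light of the note at the top of this card, and the calibration dataset name follows llm-compressor's built-in dataset registry.

```python
# Sketch of a GPTQ W4A16 oneshot run matching the configuration above.
# The router/gate ignore pattern is an ASSUMPTION (see the note at the top of
# this card about router layers); verify the module names for this architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = GPTQModifier(
    targets="Linear",                  # quantize all Linear layers
    scheme="W4A16",                    # 4-bit weights / 16-bit activations, group size 128
    ignore=["lm_head",                 # keep the output head unquantized
            r"re:.*mlp\.gate$"],       # assumed pattern to skip MoE router/gate layers
)

oneshot(
    model=model,
    dataset="open_platypus",           # calibration dataset (Open-Platypus)
    recipe=recipe,
    max_seq_length=512,                # 128 samples x 512 tokens ≈ 65,536 calibration tokens
    num_calibration_samples=128,
)

model.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16", save_compressed=True)
tokenizer.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16")
```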

Quantization Results

  • Original Size: ~84 GB (FP16)
  • Quantized Size: ~21 GB (W4A16, gs=128)
  • Size Reduction: ~75% (roughly 4× compression)
  • Expected Quality Loss: 2-5% perplexity increase (untested at this time)
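
For reference, the size figures above are consistent with a back-of-envelope estimate; this is only a rough sketch, since real checkpoints also include embeddings, the unquantized lm_head, zero-points, and metadata.

```python
# Rough size estimate for 42B parameters; actual file sizes add embeddings,
# the FP16 lm_head, and metadata on top of this, landing near ~21 GB / ~75%.
params = 42e9
fp16_gb   = params * 2 / 1e9           # 2 bytes per weight           -> ~84 GB
int4_gb   = params * 0.5 / 1e9         # 4 bits per weight            -> ~21 GB
scales_gb = (params / 128) * 2 / 1e9   # one FP16 scale per 128-group -> ~0.7 GB

print(f"FP16: {fp16_gb:.0f} GB, W4A16: {int4_gb + scales_gb:.1f} GB, "
      f"reduction: {1 - (int4_gb + scales_gb) / fp16_gb:.0%}")
```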

Model Architecture

Technical Specifications

  • Total Parameters: 42B (67 layers, 807 tensors)
  • Active Parameters: 3.3B per forward pass
  • Architecture: Mixture of Experts (MoE)
  • Total Experts: 128
  • Active Experts: 8 per token
  • Context Length: 262,144 tokens (256K)
  • Tokenizer: Standard Qwen3 tokenizer
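
To illustrate what "8 of 128 experts active per token" means, here is a toy top-k routing sketch; it is not the model's actual code, and the hidden size is a placeholder.

```python
# Toy illustration of MoE top-k routing: the router scores 128 experts per
# token and only the 8 highest-scoring experts process that token.
import torch

hidden_size, num_experts, top_k = 2048, 128, 8   # placeholder hidden size
x = torch.randn(1, hidden_size)                  # one token's hidden state
router = torch.nn.Linear(hidden_size, num_experts, bias=False)  # router / gate layer

probs = router(x).softmax(dim=-1)                        # (1, 128) expert probabilities
weights, expert_ids = torch.topk(probs, top_k, dim=-1)   # pick the top 8 experts
weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize mixing weights
print(expert_ids.tolist(), weights.tolist())
```

Because the routing decision is a top-k selection over these scores, small numeric errors in the gate layer can change which experts fire, which is why router layers are normally left unquantized (see the note at the top of this card).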

Performance Benchmarks (FP16 Base Model)

  • AIME25: 85.0
  • HMMT25: 71.4
  • LiveCodeBench: 66.0
  • WritingBench: 85.0

Recommended Generation Settings

For coding tasks

  • temperature: 0.3-0.6
  • top_p: 0.95
  • top_k: 20-40
  • repetition_penalty: 1.05-1.1
  • min_p: 0.05

For creative writing

  • temperature: 0.6-0.9
  • top_p: 0.95
  • top_k: 40-100
  • repetition_penalty: 1.08-1.12
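
These presets map directly onto sampling parameters in most inference engines. Below is a sketch using vLLM's SamplingParams, picking a single point from each range above; the chosen values and max_tokens are assumptions.

```python
# Recommended presets expressed as vLLM SamplingParams; each value is one
# point picked from the ranges listed above.
from vllm import SamplingParams

coding = SamplingParams(
    temperature=0.4, top_p=0.95, top_k=20,
    repetition_penalty=1.05, min_p=0.05, max_tokens=2048,
)

creative = SamplingParams(
    temperature=0.8, top_p=0.95, top_k=60,
    repetition_penalty=1.10, max_tokens=2048,
)
```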

Known Limitations

  • Quantization may introduce minor quality degradation (2-5% perplexity increase)
  • Very long context (>128K) may show reduced coherence compared to FP16
  • Arithmetic and factual accuracy should be verified for critical applications
  • May generate plausible but incorrect code without proper testing

Citation

Original Model

@misc{qwen3-master-coder-2025,
  author       = {DavidAU},
  title        = {Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER}}
}

Base Model

@article{qwen3-2025,
  title   = {Qwen3 Technical Report},
  author  = {Qwen Team},
  year    = {2025},
  journal = {arXiv preprint}
}

Acknowledgments

  • Original Model: DavidAU for the enhanced MASTER-CODER variant
  • Base Model: Qwen Team at Alibaba Cloud
  • Quantization Tool: vLLM team for llm-compressor
  • Brainstorm Adapter: DavidAU's custom enhancement

Model Card Authors

RobPS aka tcclaviger

Model Card Contact

For questions about this quantized version, please open an issue in the model repository.

For questions about the original model, refer to https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER.
