ROUTER LAYERS WERE ACCIDENTALLY QUANTIZED - UPDATED FILES COMING SOON™

Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER-GPTQ-(W4A16)

A standard-quality 4-bit GPTQ quantization of DavidAU's Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER, produced with llm-compressor in W4A16 format.

Model Details

Model Description

This is a 4-bit GPTQ quantized version of the Qwen3-42B MASTER-CODER model, intended for efficient deployment with minimal quality loss. The quantization uses the W4A16 format (4-bit weights, 16-bit activations) with standard-quality settings (group size 128) to balance size, speed, and output quality.

The base model is a 42-billion-parameter mixture-of-experts model enhanced with the "Brainstorm 20X" adapter. It targets coding, reasoning, and creative tasks, with native 256K context support.

  • Quantized by: RobPS
  • Model type: Causal Language Model (Mixture of Experts)
  • Language(s): English (primary), multilingual support
  • License: Apache 2.0
  • Quantization format: GPTQ W4A16 (4-bit weights, 16-bit activations)
  • Finetuned from model: DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER
  • Base model: Qwen/Qwen3-30B-A3B-Thinking-2507

Uses

Direct Use

This quantized model is optimized for:

  • Coding and Programming: Excellent performance across major and minor programming languages
  • Reasoning Tasks: Extended thinking capability for complex problem-solving
  • Creative Writing: Enhanced prose quality and detail through Brainstorm adapter
  • Instruction Following: Strong adherence to user instructions
  • Tool Usage: Integration with external tools and APIs
  • Agentic Applications: Multi-step reasoning and planning

The W4A16 quantization enables deployment on consumer hardware while keeping output quality close to FP16 (see Quantization Results below for the expected loss).
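For example, the checkpoint can be loaded directly with vLLM, which reads W4A16 checkpoints produced by llm-compressor. The sketch below is illustrative: the engine arguments (context window, GPU memory fraction) are assumptions, and the repository id should be adjusted if it differs from this card's.

```python
# Minimal vLLM loading sketch. Engine arguments are illustrative assumptions;
# adjust max_model_len / gpu_memory_utilization for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tcclaviger/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER",
    max_model_len=32768,           # raise toward 262144 (256K) if memory allows
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=512)
outputs = llm.generate(
    ["Write a Python function that checks whether a string is a palindrome."],
    params,
)
print(outputs[0].outputs[0].text)
```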

Downstream Use

Suitable for:

  • Code generation and analysis pipelines
  • AI-assisted development environments
  • Creative writing tools
  • Educational tutoring systems
  • Research assistants
  • Agentic frameworks requiring reasoning capabilities

Out-of-Scope Use

  • Real-time safety-critical applications without human oversight
  • Medical diagnosis or legal advice without expert verification
  • Applications requiring 100% factual accuracy without validation
  • Tasks requiring context beyond 256K tokens

Quantization Details

Quantization Configuration

  • Method: GPTQ (built on Optimal Brain Quantization)
  • Format: W4A16 (4-bit weights, 16-bit activations)
  • Group Size: 128 (standard quality, optimal balance)
  • Symmetric Quantization: Yes
  • Strategy: Group quantization
  • Preserved Layers: lm_head (FP16)
  • Quantized Layers: All Linear layers

Calibration Dataset

  • Dataset: open-platypus
  • Samples: 128
  • Sequence Length: 512 tokens
  • Total Calibration Tokens: ~65,536
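
A minimal llm-compressor sketch reproducing the configuration above is shown below. The exact script used for this release may differ; in particular, the MoE router/gate ignore pattern is an assumption added in light of the note at the top of this card, and the calibration dataset name follows llm-compressor's built-in dataset registry.

```python
# Sketch of a GPTQ W4A16 oneshot run matching the configuration above.
# The router/gate ignore pattern is an ASSUMPTION (see the note at the top of
# this card about router layers); verify the module names for this architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = GPTQModifier(
    targets="Linear",                  # quantize all Linear layers
    scheme="W4A16",                    # 4-bit weights / 16-bit activations, group size 128
    ignore=["lm_head",                 # keep the output head unquantized
            r"re:.*mlp\.gate$"],       # assumed pattern to skip MoE router/gate layers
)

oneshot(
    model=model,
    dataset="open_platypus",           # calibration dataset (Open-Platypus)
    recipe=recipe,
    max_seq_length=512,                # 128 samples x 512 tokens ≈ 65,536 calibration tokens
    num_calibration_samples=128,
)

model.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16", save_compressed=True)
tokenizer.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16")
```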

Quantization Results

  • Original Size: ~84 GB (FP16)
  • Quantized Size: ~21 GB (W4A16, gs=128)
  • Size Reduction: ~75% (roughly 4× compression)
  • Expected Quality Loss: 2-5% perplexity increase (untested at this time)
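
For reference, the size figures above are consistent with a back-of-envelope estimate; this is only a rough sketch, since real checkpoints also include embeddings, the unquantized lm_head, zero-points, and metadata.

```python
# Rough size estimate for 42B parameters; actual file sizes add embeddings,
# the FP16 lm_head, and metadata on top of this, landing near ~21 GB / ~75%.
params = 42e9
fp16_gb   = params * 2 / 1e9           # 2 bytes per weight           -> ~84 GB
int4_gb   = params * 0.5 / 1e9         # 4 bits per weight            -> ~21 GB
scales_gb = (params / 128) * 2 / 1e9   # one FP16 scale per 128-group -> ~0.7 GB

print(f"FP16: {fp16_gb:.0f} GB, W4A16: {int4_gb + scales_gb:.1f} GB, "
      f"reduction: {1 - (int4_gb + scales_gb) / fp16_gb:.0%}")
```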

Model Architecture

Technical Specifications

  • Total Parameters: 42B (67 layers, 807 tensors)
  • Active Parameters: 3.3B per forward pass
  • Architecture: Mixture of Experts (MoE)
  • Total Experts: 128
  • Active Experts: 8 per token
  • Context Length: 262,144 tokens (256K)
  • Tokenizer: Standard Qwen3 tokenizer
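
To illustrate what "8 of 128 experts active per token" means, here is a toy top-k routing sketch; it is not the model's actual code, and the hidden size is a placeholder.

```python
# Toy illustration of MoE top-k routing: the router scores 128 experts per
# token and only the 8 highest-scoring experts process that token.
import torch

hidden_size, num_experts, top_k = 2048, 128, 8   # placeholder hidden size
x = torch.randn(1, hidden_size)                  # one token's hidden state
router = torch.nn.Linear(hidden_size, num_experts, bias=False)  # router / gate layer

probs = router(x).softmax(dim=-1)                        # (1, 128) expert probabilities
weights, expert_ids = torch.topk(probs, top_k, dim=-1)   # pick the top 8 experts
weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize mixing weights
print(expert_ids.tolist(), weights.tolist())
```

Because the routing decision is a top-k selection over these scores, small numeric errors in the gate layer can change which experts fire, which is why router layers are normally left unquantized (see the note at the top of this card).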

Performance Benchmarks (FP16 Base Model)

  • AIME25: 85.0
  • HMMT25: 71.4
  • LiveCodeBench: 66.0
  • WritingBench: 85.0

Recommended Generation Settings

For coding tasks

  • temperature: 0.3-0.6
  • top_p: 0.95
  • top_k: 20-40
  • repetition_penalty: 1.05-1.1
  • min_p: 0.05

For creative writing

  • temperature: 0.6-0.9
  • top_p: 0.95
  • top_k: 40-100
  • repetition_penalty: 1.08-1.12
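
These presets map directly onto sampling parameters in most inference engines. Below is a sketch using vLLM's SamplingParams, picking a single point from each range above; the chosen values and max_tokens are assumptions.

```python
# Recommended presets expressed as vLLM SamplingParams; each value is one
# point picked from the ranges listed above.
from vllm import SamplingParams

coding = SamplingParams(
    temperature=0.4, top_p=0.95, top_k=20,
    repetition_penalty=1.05, min_p=0.05, max_tokens=2048,
)

creative = SamplingParams(
    temperature=0.8, top_p=0.95, top_k=60,
    repetition_penalty=1.10, max_tokens=2048,
)
```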

Known Limitations

  • Quantization may introduce minor quality degradation (2-5% perplexity increase)
  • Very long context (>128K) may show reduced coherence compared to FP16
  • Arithmetic and factual accuracy should be verified for critical applications
  • May generate plausible but incorrect code without proper testing

Citation

Original Model

@misc{qwen3-master-coder-2025,
  author       = {DavidAU},
  title        = {Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER}}
}

Base Model

@article{qwen3-2025,
  title   = {Qwen3 Technical Report},
  author  = {Qwen Team},
  year    = {2025},
  journal = {arXiv preprint}
}

Acknowledgments

  • Original Model: DavidAU for the enhanced MASTER-CODER variant
  • Base Model: Qwen Team at Alibaba Cloud
  • Quantization Tool: vLLM team for llm-compressor
  • Brainstorm Adapter: DavidAU's custom enhancement

Model Card Authors

RobPS aka tcclaviger

Model Card Contact

For questions about this quantized version, please open an issue in the model repository.

For questions about the original model, refer to https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER.
