ROUTER LAYERS WERE ACCIDENTALLY QUANTIZED - UPDATED FILES COMING SOON™
Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER-GPTQ-(W4A16)
A standard-quality 4-bit GPTQ quantization of DavidAU's Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER, produced with llm-compressor in the W4A16 format.
Model Details
Model Description
This is a 4-bit GPTQ quantized version of the Qwen3-42B MASTER-CODER model, intended for efficient deployment while preserving as much of the original model's quality as possible. The quantization uses the W4A16 format (4-bit weights, 16-bit activations) with standard-quality settings (group size 128) chosen to balance size, speed, and output quality.
The base model is a 42-billion-parameter mixture-of-experts model enhanced with the "Brainstorm 20X" adapter; it excels at coding, reasoning, and creative tasks and supports a native 256K context.
- Quantized by: RobPS
- Model type: Causal Language Model (Mixture of Experts)
- Language(s): English (primary), multilingual support
- License: Apache 2.0
- Quantization format: GPTQ W4A16 (4-bit weights, 16-bit activations)
- Finetuned from model: DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER
- Base model: Qwen/Qwen3-30B-A3B-Thinking-2507
Model Sources
- Original Model: [DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER](https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER)
- Base Model Repository: Qwen/Qwen3-30B-A3B-Thinking-2507
- Quantization Tool: llm-compressor
Uses
Direct Use
This quantized model is optimized for:
- Coding and Programming: Excellent performance across major and minor programming languages
- Reasoning Tasks: Extended thinking capability for complex problem-solving
- Creative Writing: Enhanced prose quality and detail through Brainstorm adapter
- Instruction Following: Strong adherence to user instructions
- Tool Usage: Integration with external tools and APIs
- Agentic Applications: Multi-step reasoning and planning
The W4A16 quantization enables deployment on consumer hardware while typically maintaining near-FP16 quality (see Quantization Results below for caveats).
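As a minimal illustration of direct use, the sketch below loads this checkpoint for offline inference with vLLM, which reads compressed-tensors/GPTQ checkpoints natively. The repository id is taken from this card; the context-length and memory settings are assumptions to adapt to your hardware.

```python
# Minimal vLLM offline-inference sketch (settings are illustrative, not tuned).
from vllm import LLM, SamplingParams

llm = LLM(
    model="tcclaviger/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER",
    max_model_len=32768,            # raise toward 262144 if KV-cache memory allows
    gpu_memory_utilization=0.90,
)

prompts = ["Write a Python function that validates an IPv4 address."]
params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=512)

for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```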
Downstream Use
Suitable for:
- Code generation and analysis pipelines
- AI-assisted development environments
- Creative writing tools
- Educational tutoring systems
- Research assistants
- Agentic frameworks requiring reasoning capabilities
Out-of-Scope Use
- Real-time safety-critical applications without human oversight
- Medical diagnosis or legal advice without expert verification
- Applications requiring 100% factual accuracy without validation
- Tasks requiring context beyond 256K tokens
Quantization Details
Quantization Configuration
- Method: GPTQ (post-training quantization based on Optimal Brain Quantization)
- Format: W4A16 (4-bit weights, 16-bit activations)
- Group Size: 128 (standard quality, optimal balance)
- Symmetric Quantization: Yes
- Strategy: Group quantization
- Preserved Layers: lm_head (FP16)
- Quantized Layers: All Linear layers
Calibration Dataset
- Dataset: open-platypus
- Samples: 128
- Sequence Length: 512 tokens
- Total Calibration Tokens: ~65,536
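For reference, here is a sketch of a one-shot llm-compressor run matching the configuration and calibration settings above. Argument names follow recent llm-compressor releases and should be checked against the version you have installed; the output directory name is illustrative.

```python
# One-shot GPTQ sketch matching the configuration above (W4A16, group size 128,
# lm_head kept in FP16, 128 open-platypus samples at 512 tokens each).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The W4A16 scheme defaults to symmetric, group-size-128 weight quantization.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",       # calibration dataset named above
    recipe=recipe,
    max_seq_length=512,            # calibration sequence length
    num_calibration_samples=128,   # calibration samples
)

model.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16", save_compressed=True)
tokenizer.save_pretrained("Qwen3-42B-MASTER-CODER-W4A16")
```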
Quantization Results
- Original Size: ~84 GB (FP16)
- Quantized Size: ~21 GB (W4A16, gs=128)
- Size Reduction: ~75% (roughly 4:1 compression vs. FP16)
- Expected Quality Loss: 2-5% perplexity increase (untested at this time)
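The size figures are consistent with simple back-of-the-envelope arithmetic (an approximation that ignores the FP16 lm_head and per-group scale/zero-point overhead):

```python
# Rough size arithmetic for the figures above (approximation only).
params = 42e9

fp16_gb = params * 2 / 1e9    # 2 bytes per weight   -> ~84 GB
w4_gb   = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~21 GB

print(f"FP16:  ~{fp16_gb:.0f} GB")
print(f"W4A16: ~{w4_gb:.0f} GB")
print(f"Reduction: {1 - w4_gb / fp16_gb:.0%}")  # 75%
```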
Model Architecture
Technical Specifications
- Total Parameters: 42B (67 layers, 807 tensors)
- Active Parameters: 3.3B per forward pass
- Architecture: Mixture of Experts (MoE)
- Total Experts: 128
- Active Experts: 8 per token
- Context Length: 262,144 tokens (256K)
- Vocabulary Size: Standard Qwen3 tokenizer
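The figures above can be cross-checked directly from the checkpoint's config. The attribute names below assume the Qwen3-MoE config class and may differ slightly between transformers versions.

```python
# Quick sanity check of the architecture figures above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "tcclaviger/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER"
)

print(cfg.num_hidden_layers)        # expected: 67
print(cfg.num_experts)              # expected: 128
print(cfg.num_experts_per_tok)      # expected: 8
print(cfg.max_position_embeddings)  # expected: 262144
```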
Performance Benchmarks (FP16 Base Model)
- AIME25: 85.0
- HMMT25: 71.4
- LiveCodeBench: 66.0
- WritingBench: 85.0
Recommended Generation Settings
For coding tasks
- temperature: 0.3-0.6
- top_p: 0.95
- top_k: 20-40
- repetition_penalty: 1.05-1.1
- min_p: 0.05
For creative writing
- temperature: 0.6-0.9
- top_p: 0.95
- top_k: 40-100
- repetition_penalty: 1.08-1.12
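Expressed as vLLM SamplingParams, the two presets might look like the sketch below; the values are picked from the middle of the recommended ranges and are starting points rather than tuned settings.

```python
# Illustrative sampling presets built from the ranges above.
from vllm import SamplingParams

coding = SamplingParams(
    temperature=0.4, top_p=0.95, top_k=20,
    repetition_penalty=1.05, min_p=0.05, max_tokens=2048,
)

creative = SamplingParams(
    temperature=0.8, top_p=0.95, top_k=60,
    repetition_penalty=1.10, max_tokens=2048,
)
```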
Known Limitations
- Quantization may introduce minor quality degradation (2-5% perplexity increase)
- Very long context (>128K) may show reduced coherence compared to FP16
- Arithmetic and factual accuracy should be verified for critical applications
- May generate plausible but incorrect code without proper testing
Citation
Original Model
@misc{qwen3-master-coder-2025,
  author       = {DavidAU},
  title        = {Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER}}
}
Base Model
@article{qwen3-2025,
  title   = {Qwen3 Technical Report},
  author  = {Qwen Team},
  year    = {2025},
  journal = {arXiv preprint}
}
Acknowledgments
- Original Model: DavidAU for the enhanced MASTER-CODER variant
- Base Model: Qwen Team at Alibaba Cloud
- Quantization Tool: vLLM team for llm-compressor
- Brainstorm Adapter: DavidAU's custom enhancement
Model Card Authors
RobPS aka tcclaviger
Model Card Contact
For questions about this quantized version, please open an issue in the model repository.
For questions about the original model, refer to https://huggingface.co/DavidAU/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER.