
🐙 Github   |   📄 Paper

A-SINQ 3-bit Quantized Qwen3-14B model

This repository contains the official 3-bit quantized version of the Qwen3-14B model, produced with A-SINQ, the calibrated version of the SINQ (Sinkhorn-Normalized Quantization) method.
SINQ is a novel, fast, and high-quality quantization method designed to make any large language model smaller while keeping its accuracy almost intact.

To support the project, please star ⭐ the official SINQ GitHub repository.

Model Details

  • Model Name: Qwen3-14B-3bit-ASINQ
  • Base Model: Qwen/Qwen3-14B
  • Task: Text Generation
  • Framework: PyTorch / Transformers
  • License: Apache-2.0
  • Quantized By: Huawei - Computing Systems Lab

Quantization Details

  • Quantization Method: A-SINQ (Sinkhorn-Normalized Quantization)
  • Precision: INT3
  • Group Size: 64
  • Framework: PyTorch
  • Quantization Library: sinq
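As a rough illustration of what these settings imply for storage, the sketch below estimates the effective number of bits per weight under group-wise quantization. It assumes one 16-bit scale and one 16-bit zero point per group of 64 weights. That overhead is an assumption made only for this example, the metadata layout actually used by the sinq library may differ, and the function name is hypothetical.

```python
def effective_bits_per_weight(nbits: int, group_size: int, metadata_bits_per_group: int) -> float:
    """Average storage cost of one weight, including per-group metadata."""
    return nbits + metadata_bits_per_group / group_size


# Assumed overhead: a 16-bit scale plus a 16-bit zero point per group (hypothetical layout).
bits = effective_bits_per_weight(nbits=3, group_size=64, metadata_bits_per_group=32)
print(f"effective bits per weight: {bits}")                  # 3.5
print(f"compression vs. 16-bit weights: {16 / bits:.2f}x")   # 4.57x
```

Under these assumptions, smaller groups track the weights more closely but raise the per-weight overhead, which is the usual trade-off behind the group size setting.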

🚀 Usage

Prerequisite

Before running the examples below, make sure the SINQ library is installed. Installation instructions and setup details are available in the official SINQ GitHub repository.
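A quick way to confirm the prerequisite is met is to check that the required packages are importable before loading any weights. The sketch below uses only the Python standard library. The helper name `missing_packages` is hypothetical, and the package names are simply the top-level import names used in the examples on this page.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "sinq", "torch" and "transformers" are the import names used in the examples on this page.
missing = missing_packages(["sinq", "torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```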

Usage example

You can load and use the model with our wrapper based on the 🤗 Transformers library:

import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Qwen3-14B-3bit-ASINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

prompt = "Explain neural network quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out_ids = sinq_model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
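
Assuming the loaded model behaves like a standard Transformers causal LM, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded string above begins with the prompt. A minimal sketch of keeping only the continuation, using plain Python lists in place of tensors (the helper name `continuation_ids` is hypothetical and is not part of sinq or Transformers):

```python
def continuation_ids(output_ids, prompt_length):
    """Drop the first `prompt_length` ids, keeping only newly generated tokens."""
    return output_ids[prompt_length:]


# Toy ids standing in for real token ids: 3 prompt tokens, then 2 generated tokens.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8]
print(continuation_ids(output_ids, len(prompt_ids)))  # [7, 8]
```

In the usage example, this would correspond to decoding `out_ids[0][inputs["input_ids"].shape[1]:]` instead of `out_ids[0]`.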

🧩 Quantization Process

The quantized model was obtained using the SINQ quantization library, following the steps below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 3-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=3,            # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",   # tiling strategy
    method="asinq"       # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)

Reproducibility Note: This model was quantized using the SINQ implementation from commit 14ad847 of the SINQ repository.
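
To make the `nbits` and `group_size` settings more concrete, the sketch below shows plain round-to-nearest, group-wise uniform quantization on a toy list of weights. It is a generic illustration and not the SINQ algorithm: it does not implement the Sinkhorn normalization or the calibration step that give SINQ and A-SINQ their names, and all function names are hypothetical.

```python
def quantize_group(weights, nbits):
    """Uniform asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2 ** nbits - 1                  # 3 bits -> integer codes 0..7
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


def quantize_groupwise(weights, nbits=3, group_size=64):
    """Split a flat weight list into groups and quantize each one independently."""
    return [
        quantize_group(weights[i:i + group_size], nbits)
        for i in range(0, len(weights), group_size)
    ]


# Toy data: 128 weights -> two groups of 64, each with its own scale and offset.
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(128)]
groups = quantize_groupwise(weights)
restored = [w for codes, scale, zero in groups for w in dequantize_group(codes, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), "groups, max abs error:", round(max_error, 4))
```

Each group stores 64 three-bit codes plus its own scale and offset, and the reconstruction error of any weight is bounded by half of its group's scale.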



🧾 How to Cite This Work

If you find SINQ useful in your research or applications, please:

  • Star ⭐ the official SINQ GitHub repository.
  • Cite our paper:
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}