Qwen/Qwen3-30B-A3B-Thinking-2507 quantized with SINQ
- bits: 4
- group size: 128
- tiling mode: 1D
- method: sinq
Paper: SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLMs
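For reference, below is a minimal sketch of how a checkpoint with the settings listed above could be produced. It assumes the `BaseQuantizeConfig` / `AutoSINQHFModel.quantize_model` / `AutoSINQHFModel.save_quantized` API from the SINQ repository; the exact names, signatures, and output directory are assumptions, not taken from this model card.

```python
# Sketch only: assumes the quantization API from the SINQ repository
# (sinq.sinqlinear.BaseQuantizeConfig, AutoSINQHFModel.quantize_model).
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig
import torch

base = "Qwen/Qwen3-30B-A3B-Thinking-2507"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)

# Settings matching the list above: 4-bit, group size 128, 1D tiling, sinq method
quant_cfg = BaseQuantizeConfig(
    nbits=4,
    group_size=128,
    tiling_mode="1D",
    method="sinq",
)

# Quantize in place (SINQ is calibration-free, so no calibration data is needed)
AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)

# Save the quantized weights for later loading with from_quantized()
# (output directory name is hypothetical)
AutoSINQHFModel.save_quantized(model, "Qwen3-30B-A3B-Thinking-2507-sinq")
```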
inference.py:

```python
from sinq.patch_model import AutoSINQHFModel
from transformers import AutoTokenizer
import torch
import time

# Path to the local folder containing the SINQ-quantized checkpoint
model_path = r"Path\to\folder\Qwen3-30B-A3B-Thinking-2507-sinq-2bit"

# Load the quantized model onto the GPU
model = AutoSINQHFModel.from_quantized(
    model_path,
    device="cuda:0",
    compute_dtype=torch.bfloat16,
)

# The tokenizer is unchanged by quantization, so load it from the base model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Thinking-2507",
    trust_remote_code=True,
)

# Generate a completion and time it
prompt = "Describe the future of artificial intelligence."
start_time = time.time()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=1024)
end_time = time.time()

# Throughput: count only newly generated tokens, excluding the prompt
num_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
tokens_per_second = num_tokens / (end_time - start_time)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"\nGenerated {num_tokens} tokens in {end_time - start_time:.2f} seconds")
print(f"Tokens per second: {tokens_per_second:.2f}")
```