Qwen/Qwen3-30B-A3B-Thinking-2507 quantized with SINQ
- bits: 4
- group size: 128
- tiling mode: 1D
- method: sinq
Paper: SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLMs
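For reference, below is a minimal sketch of how a checkpoint with the settings listed above could be produced. It assumes the `BaseQuantizeConfig` / `AutoSINQHFModel.quantize_model` / `AutoSINQHFModel.save_quantized` API from the SINQ repository; the exact names, signatures, and output directory are assumptions, not taken from this model card.

```python
# Sketch only: assumes the quantization API from the SINQ repository
# (sinq.sinqlinear.BaseQuantizeConfig, AutoSINQHFModel.quantize_model).
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig
import torch

base = "Qwen/Qwen3-30B-A3B-Thinking-2507"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)

# Settings matching the list above: 4-bit, group size 128, 1D tiling, sinq method
quant_cfg = BaseQuantizeConfig(
    nbits=4,
    group_size=128,
    tiling_mode="1D",
    method="sinq",
)

# Quantize in place (SINQ is calibration-free, so no calibration data is needed)
AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)

# Save the quantized weights for later loading with from_quantized()
# (output directory name is hypothetical)
AutoSINQHFModel.save_quantized(model, "Qwen3-30B-A3B-Thinking-2507-sinq")
```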
inference.py:

```python
from sinq.patch_model import AutoSINQHFModel
from transformers import AutoTokenizer
import torch
import time

# Path to the local folder containing the SINQ-quantized checkpoint
model_path = r"Path\to\folder\Qwen3-30B-A3B-Thinking-2507-sinq-2bit"

# Load the quantized model onto the GPU
model = AutoSINQHFModel.from_quantized(
    model_path,
    device="cuda:0",
    compute_dtype=torch.bfloat16,
)

# The tokenizer is unchanged by quantization, so load it from the base model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-30B-A3B-Thinking-2507",
    trust_remote_code=True,
)

# Generate a completion and time it
prompt = "Describe the future of artificial intelligence."
start_time = time.time()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=1024)
end_time = time.time()

# Throughput: count only newly generated tokens, excluding the prompt
num_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
tokens_per_second = num_tokens / (end_time - start_time)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"\nGenerated {num_tokens} tokens in {end_time - start_time:.2f} seconds")
print(f"Tokens per second: {tokens_per_second:.2f}")
```