Qwen3-0.6B Chat CoreML

CoreML conversion of Qwen/Qwen3-0.6B for on-device inference on Apple Silicon (iPhone, iPad, Mac).

Model Details

| Property | Value |
|---|---|
| Parameters | 596M |
| Architecture | Qwen3 (GQA, RoPE, SwiGLU, RMSNorm) |
| Hidden size | 1024 |
| Layers | 28 |
| Attention heads | 16 (8 KV heads) |
| Head dimension | 128 |
| Vocab size | 151,936 |
| Max sequence length | 2048 |
| Quantization | INT4 (per-block, linear symmetric) |
| Model size | ~317 MB |
| CoreML target | iOS 18+ / macOS 15+ |
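
To make the "INT4 (per-block, linear symmetric)" entry concrete, here is a minimal sketch of that quantization scheme in plain Python. It is an illustration only, not the code used by the conversion script: the block size, the [-7, 7] code range, and all function names are assumptions, since the card does not state them.

```python
def quantize_block(weights):
    """Symmetric linear INT4 quantization of one block of floats.

    Returns (scale, codes). Symmetric means there is no zero point:
    each weight is approximated as scale * code.
    """
    max_abs = max(abs(w) for w in weights)
    # Assumption: map the largest magnitude to 7 so every code fits in a
    # signed 4-bit integer and the range stays symmetric around zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, codes


def quantize_per_block(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each one separately."""
    blocks = []
    for start in range(0, len(weights), block_size):
        blocks.append(quantize_block(weights[start:start + block_size]))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate float weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


if __name__ == "__main__":
    # Two blocks with very different magnitudes. Because each block has its
    # own scale, the small-valued block is not rounded away to zero.
    weights = [0.7, -0.3, 0.1, 0.0, 0.007, -0.003, 0.001, 0.0]
    blocks = quantize_per_block(weights, block_size=4)
    for original, approx in zip(weights, dequantize(blocks)):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

Each block stores one scale plus 4 bits per weight, which is why the stored model is somewhat larger than 4 bits times the parameter count.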

Usage

This model is designed for use with the speech-swift Qwen3Chat module:

import Qwen3Chat

let model = try await Qwen3ChatModel.fromPretrained()

// Single generation
let response = try model.generate(messages: [
    ChatMessage(role: .user, content: "Hello!")
])

// Streaming
let stream = model.chatStream("What is Swift?", systemPrompt: "Be brief.")
for try await chunk in stream {
    print(chunk, terminator: "")
}

Prompt caching

The chat() and chatStream() methods cache the KV state of the system prompt. Subsequent turns restore that state from the cache instead of re-prefilling the system prompt, which saves roughly 300 ms per turn.
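
The idea behind this caching can be sketched in a few lines of Python. This is a toy model of the technique, not the Qwen3Chat implementation: `PromptCache`, `CountingPrefill`, and the list-of-tokens "KV state" are hypothetical stand-ins, and the sketch ignores conversation history to focus on the system prompt.

```python
import copy


class CountingPrefill:
    """Fake prefill step that records how many tokens it had to process."""

    def __init__(self):
        self.tokens_processed = 0

    def __call__(self, tokens, kv_state):
        self.tokens_processed += len(tokens)
        # A real model would append per-layer keys and values here; the toy
        # state simply remembers which tokens it has seen.
        return kv_state + list(tokens)


class PromptCache:
    """Keeps the KV state of the most recent system prompt (hypothetical)."""

    def __init__(self, prefill):
        self._prefill = prefill
        self._cached_prompt = None
        self._cached_state = None

    def state_for(self, system_tokens):
        """Return KV state covering the system prompt, prefilling on a miss."""
        key = tuple(system_tokens)
        if key != self._cached_prompt:
            self._cached_state = self._prefill(system_tokens, [])
            self._cached_prompt = key
        # Copy so a turn can extend its state without corrupting the cache.
        return copy.deepcopy(self._cached_state)


if __name__ == "__main__":
    prefill = CountingPrefill()
    cache = PromptCache(prefill)
    system_tokens = [1, 2, 3]
    for user_tokens in ([10, 11], [12]):
        state = cache.state_for(system_tokens)   # restored on the second turn
        state = prefill(user_tokens, state)      # only the new tokens are prefilled
    print("tokens prefilled:", prefill.tokens_processed)
```

In this toy run the system prompt is prefilled once and reused on the second turn, so only the user tokens cost prefill work after the first turn.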

Files

| File | Description |
|---|---|
| Qwen3Chat.mlpackage/ | CoreML model (INT4 weights, float16 activations) |
| chat_config.json | Model architecture config |
| vocab.json | BPE vocabulary (151,936 tokens) |
| merges.txt | BPE merge rules |
| tokenizer_config.json | Tokenizer settings + added tokens |
| tokenizer.json | Full tokenizer (HuggingFace format) |

Conversion

Converted using coremltools 9.0 from the original PyTorch weights:

python scripts/convert_qwen3_chat_coreml.py \
    --hf-model Qwen/Qwen3-0.6B \
    --output models/Qwen3-0.6B-Chat-CoreML \
    --quantize int4

KV Cache Design

The CoreML model uses explicit KV cache inputs and outputs for each of its 28 layers, where {i} in the names below is the layer index:

  • Inputs: layer_{i}_key_cache, layer_{i}_value_cache (float16)
  • Outputs: layer_{i}_key_cache_out, layer_{i}_value_cache_out (float16)
  • Shape: [1, 8, seq_len, 128] (batch, kv_heads, sequence, head_dim)
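
The naming scheme and shape above can be turned into a small bookkeeping sketch that lists the cache inputs and estimates their memory footprint. The figures for layers, KV heads, head dimension, and float16 come from this card; zero-based layer indices and a cache filled to the full sequence length are assumptions, and the helper names are hypothetical.

```python
NUM_LAYERS = 28
NUM_KV_HEADS = 8          # GQA: the cache holds 8 KV heads, not 16 query heads
HEAD_DIM = 128
BYTES_PER_FLOAT16 = 2


def kv_cache_spec(seq_len):
    """Map each cache input name to its shape for a given sequence length."""
    shape = (1, NUM_KV_HEADS, seq_len, HEAD_DIM)
    spec = {}
    for i in range(NUM_LAYERS):  # assumption: layers are numbered from 0
        spec[f"layer_{i}_key_cache"] = shape
        spec[f"layer_{i}_value_cache"] = shape
    return spec


def kv_cache_bytes(seq_len):
    """Total float16 bytes needed to hold every key and value cache tensor."""
    total = 0
    for shape in kv_cache_spec(seq_len).values():
        elements = 1
        for dim in shape:
            elements *= dim
        total += elements * BYTES_PER_FLOAT16
    return total


if __name__ == "__main__":
    spec = kv_cache_spec(2048)
    print(len(spec), "cache tensors, e.g.", spec["layer_0_key_cache"])
    print(kv_cache_bytes(2048) / 2**20, "MiB at seq_len=2048")
    print(kv_cache_bytes(1) / 2**10, "KiB per cached token")
```

Under these assumptions the cache costs 112 KiB per token and 224 MiB at the maximum sequence length of 2048. Caching 8 KV heads instead of 16 query heads halves that footprint compared with standard multi-head attention.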

License

Apache 2.0 (same as the base model)
