CoreML conversion of Qwen/Qwen3-0.6B for on-device inference on Apple Silicon (iPhone, iPad, Mac).
| Property | Value |
|---|---|
| Parameters | 596M |
| Architecture | Qwen3 (GQA, RoPE, SwiGLU, RMSNorm) |
| Hidden size | 1024 |
| Layers | 28 |
| Attention heads | 16 (8 KV heads) |
| Head dimension | 128 |
| Vocab size | 151,936 |
| Max sequence length | 2048 |
| Quantization | INT4 (per-block, linear symmetric) |
| Model size | ~317 MB |
| CoreML target | iOS 18+ / macOS 15+ |
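The layer count, KV head count, head dimension and maximum sequence length in the table are enough to estimate how much memory the KV cache needs at full context. The following is a rough sketch of that arithmetic, assuming batch size 1 and float16 cache tensors (the cache dtype is stated elsewhere in this card); the function name `kv_cache_bytes` is made up for illustration.

```python
# Values taken from the property table; float16 storage is 2 bytes per element.
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_SEQ_LEN = 2048
BYTES_PER_FLOAT16 = 2


def kv_cache_bytes(seq_len: int) -> int:
    """Bytes needed for the key and value caches of all layers at a given length."""
    # One cache tensor has shape [1, kv_heads, seq_len, head_dim] (batch size 1).
    per_tensor = NUM_KV_HEADS * seq_len * HEAD_DIM * BYTES_PER_FLOAT16
    # Each layer holds one key tensor and one value tensor.
    return per_tensor * 2 * NUM_LAYERS


full = kv_cache_bytes(MAX_SEQ_LEN)
print(f"{full / 2**20:.0f} MiB at {MAX_SEQ_LEN} tokens")  # prints "224 MiB at 2048 tokens"
```

Because GQA stores 8 KV heads rather than all 16 attention heads, this figure is half of what a cache covering every attention head would need.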
This model is designed for use with the speech-swift Qwen3Chat module:
```swift
import Qwen3Chat

let model = try await Qwen3ChatModel.fromPretrained()

// Single generation
let response = try model.generate(messages: [
    ChatMessage(role: .user, content: "Hello!")
])

// Streaming
let stream = model.chatStream("What is Swift?", systemPrompt: "Be brief.")
for try await chunk in stream {
    print(chunk, terminator: "")
}
```
The `chat()` / `chatStream()` methods cache the KV state of the system prompt. Subsequent turns restore that state from the cache instead of re-prefilling the prompt, which saves roughly 300 ms per turn.
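The idea behind caching the system prompt can be shown with a toy model. This is a conceptual sketch only, not the speech-swift implementation: the class `ToyPrefixCache` and its methods are hypothetical, whitespace splitting stands in for the tokenizer, and a plain list of tokens stands in for the real KV tensors.

```python
class ToyPrefixCache:
    """Toy stand-in for a KV cache keyed by system prompt (hypothetical)."""

    def __init__(self):
        self._snapshots = {}       # system prompt -> saved "KV state"
        self.prefilled_tokens = 0  # total tokens that went through prefill

    def _prefill(self, state, text):
        tokens = text.split()      # whitespace split stands in for a tokenizer
        self.prefilled_tokens += len(tokens)
        return state + tokens

    def start_turn(self, system_prompt, user_message):
        if system_prompt not in self._snapshots:
            # First turn: pay the prefill cost for the system prompt once.
            self._snapshots[system_prompt] = self._prefill([], system_prompt)
        # Later turns: copy the snapshot instead of prefilling it again.
        state = list(self._snapshots[system_prompt])
        return self._prefill(state, user_message)


cache = ToyPrefixCache()
cache.start_turn("Be brief.", "What is Swift?")
cache.start_turn("Be brief.", "And what is CoreML?")
print(cache.prefilled_tokens)  # 9: the 2-token system prompt was prefilled only once
```

Without the snapshot, the second turn would prefill the system prompt again and the counter would reach 11. On a real model the saving scales with the length of the system prompt.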
| File | Description |
|---|---|
| `Qwen3Chat.mlpackage/` | CoreML model (INT4 weights, float16 activations) |
| `chat_config.json` | Model architecture config |
| `vocab.json` | BPE vocabulary (151,936 tokens) |
| `merges.txt` | BPE merge rules |
| `tokenizer_config.json` | Tokenizer settings + added tokens |
| `tokenizer.json` | Full tokenizer (HuggingFace format) |
Converted using coremltools 9.0 from the original PyTorch weights:
```bash
python scripts/convert_qwen3_chat_coreml.py \
    --hf-model Qwen/Qwen3-0.6B \
    --output models/Qwen3-0.6B-Chat-CoreML \
    --quantize int4
```
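The `int4` option corresponds to the "per-block, linear symmetric" scheme listed in the property table. A minimal pure-Python sketch of what such a scheme does to a weight vector is given below. It is not the conversion script's code, and two details are assumptions made for illustration: a block size of 32 and a clamped integer range of [-7, 7]. The card states neither.

```python
def quantize_block(block):
    """Map one block of floats to signed 4-bit integers plus a single scale."""
    max_abs = max(abs(w) for w in block)
    # Symmetric scheme: the zero point is fixed at 0, only a scale is stored.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in block]
    return q, scale


def quantize_per_block(weights, block_size=32):
    """Split a flat weight list into blocks and quantize each independently."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]


def dequantize(quantized):
    """Reconstruct approximate float weights from (ints, scale) pairs."""
    return [q * scale for ints, scale in quantized for q in ints]


# Synthetic weights in [-0.1, 0.1], purely for demonstration.
weights = [0.02 * (i % 11 - 5) for i in range(64)]
packed = quantize_per_block(weights, block_size=32)
restored = dequantize(packed)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(packed), "blocks, max abs error", round(max_err, 4))
```

Giving each block its own scale keeps the rounding error bounded by half of that block's scale, so a few large weights elsewhere in the tensor do not degrade the precision of small ones.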
The CoreML model uses explicit KV cache inputs/outputs per layer:
- Inputs: `layer_{i}_key_cache`, `layer_{i}_value_cache` (float16)
- Outputs: `layer_{i}_key_cache_out`, `layer_{i}_value_cache_out` (float16)
- Shape: `[1, 8, seq_len, 128]` (batch, kv_heads, sequence, head_dim)

License: Apache 2.0 (same as base model)