---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---
ANE-compatible stateful CoreML models with a maximum context length of 512. These are multifunction models that process either 1 or 64 tokens per forward pass.
The 6-bit quantized models apply a grouped per-output-channel LUT with group size 4. For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 64): one 64-entry table per group of 4 output channels. The ANE does not support per-input-channel grouping; smaller group sizes are considerably slower, while larger group sizes are barely faster.
After LUT dequantization, a per-output-channel scale is applied (which would have shape (32, 1) for the same example shapes).
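The dequantization math above can be sketched in plain Python (a toy illustration of the shapes involved, not the coremltools API; the deterministic LUT, index, and scale values are made up for the example):

```python
# Toy sketch: 6-bit LUT with groups of 4 output channels,
# followed by a per-output-channel scale.
OUT, IN, GROUP, BITS = 32, 64, 4, 6
N_GROUPS = OUT // GROUP                    # (32, 64) weights -> 8 LUT groups

# One 64-entry LUT per group of 4 output channels: overall shape (8, 1, 64).
lut = [[float(k) for k in range(2 ** BITS)] for _ in range(N_GROUPS)]
# 6-bit indices stored in place of the full-precision weights.
indices = [[(i + j) % (2 ** BITS) for j in range(IN)] for i in range(OUT)]
# Per-output-channel scale, shape (32, 1) in the example above.
scale = [2.0 for _ in range(OUT)]

def dequantize(indices, lut, scale):
    out = []
    for i, row in enumerate(indices):
        g = i // GROUP                     # the 4 channels in a group share a LUT
        out.append([lut[g][idx] * scale[i] for idx in row])
    return out

weights = dequantize(indices, lut, scale)
assert len(weights) == OUT and len(weights[0]) == IN
```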
Quantization is not applied to the first and last layers, nor to the embeddings (the head weights are shared with the input embeddings).
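This kind of selective grouped palettization can be expressed with the coremltools 8 `optimize.coreml` API; the sketch below is an assumption about how such a model could be produced (the op names and model path are hypothetical placeholders, not taken from this repo):

```python
# Hedged sketch using the coremltools 8 optimize.coreml API.
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=6,                            # 6-bit LUT -> 64 entries
    granularity="per_grouped_channel",
    group_size=4,                       # groups of 4 output channels
)
config = OptimizationConfig(
    global_config=op_config,
    # Hypothetical op names: map them to None to leave the
    # first/last layers unquantized.
    op_name_configs={"first_layer_matmul": None, "lm_head_matmul": None},
)

mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder path
compressed = palettize_weights(mlmodel, config=config)
```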
Current issues:
- Input embeddings are duplicated, once for the input and once for the prediction head. Since the ANE supports a maximum dimension size of 16_384, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
This model requires iOS 18 or macOS 15 to run, and the CoreMLTools beta if running from Python (`pip install coremltools==8.0b2`).
An example of how to use the models can be found in `coreml_example.py`. It can be run with the following command:

```shell
python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc \
  -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5
```
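The `--min_p` and `--temp` flags control sampling. A minimal sketch of min-p sampling with temperature, assuming the standard definition (keep only tokens whose probability is at least `min_p` times the top token's probability, then sample from the survivors); this illustrates the flags, it is not the script's actual implementation:

```python
import math
import random

def min_p_sample(logits, min_p=0.2, temp=1.5):
    """Sample a token index using temperature + min-p filtering."""
    scaled = [l / temp for l in logits]
    m = max(scaled)                                 # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens whose probability is at least min_p * p_max.
    p_max = max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= min_p * p_max]
    # Sample from the renormalized survivors.
    r = random.random() * sum(p for _, p in kept)
    acc = 0.0
    for i, p in kept:
        acc += p
        if r <= acc:
            return i
    return kept[-1][0]
```

With a strongly peaked distribution the filter leaves a single candidate, so e.g. `min_p_sample([10.0, 0.0, 0.0], min_p=0.5, temp=1.0)` always returns `0`.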