---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---
ANE-compatible stateful CoreML models with a maximum context length of 512.
Multifunction models that process either 1 or 64 tokens per call.
The 6-bit quantized models apply a grouped-per-output-channel LUT with group size 4.
For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 64): the 32 output channels in groups of 4 give 8 groups, each with a 64-entry (2^6) table. ANE does not support
per-input-channel grouping; smaller group sizes are considerably slower, while larger group sizes are barely faster.
After LUT dequantization, a per-output-channel scale is applied (it would have shape (32, 1) for the same example shapes).
Quantization is not applied to the first and last layers or to the embeddings (the head weights are shared with the input embeddings).
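
The dequantization described above can be sketched in NumPy. This is an illustrative helper, not the actual CoreML kernel; the array shapes follow the example above (weights `(32, 64)`, LUT `(8, 1, 64)`, scale `(32, 1)`):

```python
import numpy as np

def dequantize(indices, lut, scale):
    """Reconstruct weights from 6-bit LUT indices (illustrative sketch).

    indices: (32, 64) int array of 6-bit codes in [0, 64)
    lut:     (8, 1, 64) table -- one 64-entry palette shared by each
             group of 4 output channels (32 / 4 = 8 groups)
    scale:   (32, 1) per-output-channel scale applied after lookup
    """
    group_size = indices.shape[0] // lut.shape[0]             # 4
    group_of_row = np.arange(indices.shape[0]) // group_size  # (32,)
    per_row_lut = lut[group_of_row, 0]                        # (32, 64)
    # look up each 6-bit code in its group's palette
    weights = np.take_along_axis(per_row_lut, indices, axis=1)
    return weights * scale                                    # broadcast (32, 1)
```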
Current issues:
- Input embeddings are duplicated, once for the input and once for the prediction head. Since ANE supports a maximum tensor dimension of `16_384`, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
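
For reference, the kind of split forced by the `16_384` limit can be sketched as follows (the cap comes from the issue above; the splitting helper itself is a hypothetical illustration, not the conversion code):

```python
import numpy as np

def split_vocab_axis(weight, max_dim=16_384):
    """Split an embedding matrix along the vocab axis so that no
    chunk exceeds ANE's maximum tensor dimension."""
    n_chunks = -(-weight.shape[0] // max_dim)  # ceil division
    return np.array_split(weight, n_chunks, axis=0)
```

Each chunk becomes a separate weight tensor, which is presumably why CoreML ends up storing a second copy when the same matrix is used both as input embedding and as prediction head.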
This model requires iOS 18 or macOS 15 to run, and the CoreMLTools beta if running from Python (`pip install coremltools==8.0b2`).
An example of how to use the models can be found in `coreml_example.py`; it can be run with the following command:

```shell
python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5
```
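
Since the models only expose 1- and 64-token functions, a caller has to schedule the prompt accordingly: full 64-token chunks go to the batched function, and the remainder is fed one token at a time. A minimal sketch of that scheduling (the function names `input_64`/`input_1` are made up for illustration, not the actual CoreML function names):

```python
def schedule(prompt_tokens, chunk=64):
    """Plan calls for a prompt: full 64-token chunks go to the batched
    function; leftover tokens are fed through the 1-token function."""
    calls = []
    i = 0
    while i + chunk <= len(prompt_tokens):
        calls.append(("input_64", prompt_tokens[i:i + chunk]))
        i += chunk
    for tok in prompt_tokens[i:]:
        calls.append(("input_1", [tok]))
    return calls
```

After the prompt is consumed, generation continues with the 1-token function, with the KV cache carried in the model's state.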