---
base_model: Qwen/Qwen2-0.5B
pipeline_tag: text-generation
---

ANE-compatible stateful CoreML models. Maximum context length of 512.

Multifunction models that process either 1 or 64 tokens per call.
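
Since the model exposes two functions, a runner would batch the prompt through the 64-token function and fall back to the 1-token function for the remainder and for generation. A minimal sketch of that call planning (the helper name is hypothetical, not part of the model's API):

```python
# Hypothetical helper: split `num_tokens` prompt tokens into per-call chunk
# sizes for a multifunction model whose functions accept 64 or 1 tokens.
def plan_calls(num_tokens: int, chunk: int = 64) -> list[int]:
    calls = [chunk] * (num_tokens // chunk)  # full 64-token calls first
    calls += [1] * (num_tokens % chunk)      # remainder one token at a time
    return calls
```

A 130-token prompt would then take two 64-token calls followed by two single-token calls.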
|


6-bit quantized models apply a grouped per-output-channel LUT with group size 4.
|
For example, if the weights have shape (32, 64), the LUT has shape (8, 1, 64): 8 groups of 4 output channels, each with 2^6 = 64 LUT entries. The ANE does not support per-input-channel grouping, and smaller group sizes are considerably slower while larger group sizes are barely faster.
|


After LUT dequantization, a per-output-channel scale is applied (it would have shape (32, 1) for the same example weights).
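
The scheme can be sketched in NumPy (shapes follow the (32, 64) example above; a 6-bit LUT has 2**6 = 64 entries per group, and all tensors here are random placeholders rather than the model's actual weights):

```python
import numpy as np

# Example shapes from the text: weights (32, 64), 6-bit LUT,
# group size 4 along the output channels.
OUT, IN, GROUP, NBITS = 32, 64, 4, 6
rng = np.random.default_rng(0)

n_groups = OUT // GROUP                                    # 8 groups of output channels
lut = rng.standard_normal((n_groups, 1, 2 ** NBITS))       # one 64-entry LUT per group
indices = rng.integers(0, 2 ** NBITS, size=(OUT, IN))      # 6-bit code per weight
scales = rng.standard_normal((OUT, 1))                     # per-output-channel scale

# Each output channel looks up values in its group's LUT,
# then the per-output-channel scale is applied.
group_of_row = np.arange(OUT) // GROUP                     # (32,)
dequant = lut[group_of_row[:, None], 0, indices] * scales  # (32, 64)
```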
|


Quantization is not applied to the first and last layers, nor to the embeddings (the head weights are shared with the input embeddings).
|


Current issues:
|
- Input embeddings are duplicated, once for the input and once for the prediction head. Since the ANE supports a maximum tensor dimension of `16_384`, the weights have to be split, which causes CoreML to duplicate them. It should be possible to remove the input embeddings and read the weights directly from the `weights.bin` file.
|


This model requires iOS 18 or macOS 15 to run, and the CoreMLTools beta when running from Python (`pip install coremltools==8.0b2`).
|


An example of how to use the models can be found in `src/coreml_example.py`. It can be run with the following command:

```shell
python src/coreml_example.py --model-path ./nbs/Qwen-2-1.5B-6Bits-MF.mlmodelc -p "Write a joke in a poem of Harry Potter" --max-tokens 200 --min_p 0.2 --temp 1.5
```