Bringing SOTA quantization to mobile LLM deployment: A practical Executorch integration guide
Article: https://blacksamorez.substack.com/p/aqlm-executorch-android
Usage
- Download and install the `.apk` file on your Android phone (`llama3-aqlm.apk` for ~1.1 tok/s at low power consumption, `llama3-aqlm-4cores.apk` for ~2.7 tok/s under high load).
- Download the `.pte` and `.model` files and put them into the `/data/local/tmp/llama` folder on your Android phone (see the `adb` sketch after this list).
- When you run the app, you will see the option to load the `.pte` and `.model` files. After loading them, you can chat with the model.
Requirements
This app was tested on a Samsung S24 Ultra running Android 14.
Limitations
- Although the app looks like a chat, each generation request is independent: no conversation history is carried over between turns.
- The Llama-3 chat template is hard-coded into the app.
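For reference, the standard single-turn Llama-3 Instruct template looks like the sketch below; that the app assembles exactly this string (e.g. without a system prompt) is an assumption, not something stated in this README.

```python
# Standard Llama-3 Instruct single-turn prompt format (no system message).
# Whether the app builds exactly this string is an assumption.
def build_prompt(user_message: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_prompt("What is AQLM?"))
```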