adding export and demo instructions #4
opened by bhushans

README.md CHANGED
@@ -44,6 +44,59 @@ across various devices, can be found [here](https://aihub.qualcomm.com/models/l
| Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 104.953 ms | 316 - 4785 MB | UINT16 | NPU | Llama-TokenGenerator-KVCache-Quantized |
| Samsung Galaxy S23 Ultra (Android 13) | Snapdragon® 8 Gen 2 | QNN Model Library | 1917.811 ms | 0 - 1028 MB | UINT16 | NPU | Llama-PromptProcessor-Quantized |

## Deploying Llama 2 on-device

Large Language Models (LLMs) such as [Llama 2](https://llama.meta.com/llama2/) pose the following challenges for on-device deployment:
1. Model size is too large to fit in device memory for inference
2. Multi-Head Attention (MHA) has large activations, leading to fallback from accelerators
3. High model load and inference time

We can tackle the above constraints with the following steps:
1. Quantize weights to reduce on-disk model size, e.g., to int8 or int4 weights
2. Quantize activations to reduce inference-time memory pressure
3. Apply graph transformations to reduce inference-time memory pressure, e.g., Multi-Head to Split-Head Attention (MHA -> SHA)
4. Apply graph transformations to convert or decompose operations into more accelerator-friendly operations, e.g., Linear to Conv (see the sketch after this list)
5. For LLMs with 7B or more parameters, the above steps are still not enough on mobile, so we go one step further and split the model into sub-parts.
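As a concrete illustration of step 4, below is a minimal sketch, assuming a plain PyTorch module, of rewriting a `torch.nn.Linear` layer as an equivalent 1x1 `Conv2d`. This is not the transformation used by qai_hub_models; it only shows the general idea of recasting matmuls as convolutions, which NPU compilers tend to schedule more efficiently.

```python
# Minimal sketch (not the qai_hub_models implementation): rewrite a
# torch.nn.Linear as an equivalent 1x1 Conv2d.
import torch
import torch.nn as nn

def linear_to_conv1x1(linear: nn.Linear) -> nn.Conv2d:
    """Return a Conv2d that computes the same mapping as `linear`."""
    conv = nn.Conv2d(
        in_channels=linear.in_features,
        out_channels=linear.out_features,
        kernel_size=1,
        bias=linear.bias is not None,
    )
    with torch.no_grad():
        # Linear weight is (out, in); Conv2d expects (out, in, 1, 1).
        conv.weight.copy_(linear.weight.unsqueeze(-1).unsqueeze(-1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv

# Equivalence check on a dummy activation of shape (batch, seq, hidden).
lin = nn.Linear(64, 64)
conv = linear_to_conv1x1(lin)
x = torch.randn(1, 8, 64)
y_linear = lin(x)
y_conv = conv(x.transpose(1, 2).unsqueeze(-1)).squeeze(-1).transpose(1, 2)
assert torch.allclose(y_linear, y_conv, atol=1e-5)
```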

Here, we divide the model into 4 parts (see the sketch after this list) in order to:
1. Make the model exportable with low memory usage
2. Avoid inference-time out-of-memory errors
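The actual 4-way Llama 2 split is produced by the export tooling; the snippet below is only a hypothetical illustration, assuming a generic `nn.ModuleList` of decoder blocks that each map one tensor to one tensor, of how a transformer stack can be partitioned into sequential sub-models that are exported and run one after another.

```python
# Hypothetical illustration (not the actual Llama 2 partitioning): group a
# stack of decoder blocks into a few sequential sub-models so that each part
# can be exported and loaded within a tight memory budget.
import torch.nn as nn

def split_decoder_stack(blocks: nn.ModuleList, num_parts: int = 4) -> list:
    """Partition consecutive decoder blocks into `num_parts` sub-models."""
    per_part = -(-len(blocks) // num_parts)  # ceiling division
    return [
        nn.Sequential(*blocks[i : i + per_part])
        for i in range(0, len(blocks), per_part)
    ]

# e.g., a 32-block decoder becomes 4 sub-models of 8 blocks each; at
# inference time the sub-models are run back-to-back on the device.
```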

In order to export Llama 2, please ensure:
1. The host machine has >40GB of memory (RAM + swap space); a quick way to check this is sketched after this list
2. If you don't have enough memory, export.py will print instructions to increase swap space accordingly
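For item 1, here is a hypothetical pre-flight check (not part of export.py) that sums RAM and swap from `/proc/meminfo` on a Linux host:

```python
# Hypothetical pre-flight check (not part of export.py): confirm RAM + swap
# meets the ~40GB requirement before running the Llama 2 export on Linux.
def total_memory_gb(meminfo_path: str = "/proc/meminfo") -> float:
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key.strip()] = int(value.strip().split()[0])  # values in kB
    return (fields["MemTotal"] + fields["SwapTotal"]) / (1024 ** 2)

if __name__ == "__main__":
    total = total_memory_gb()
    print(f"RAM + swap: {total:.1f} GB")
    if total < 40:
        print("Consider increasing swap space before running the export.")
```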

## Example & Usage

Install the package via pip:
```bash
pip install "qai_hub_models[llama_v2_7b_chat_quantized]"
```

Once installed, run the following simple CLI demo:

```bash
python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo
```
More details on the CLI tool can be found with the `--help` option. See [demo.py](demo.py) for sample usage of the model, including pre/post-processing scripts. Please refer to our [general instructions on using models](../../../#getting-started) for more usage instructions.

## Export for on-device deployment

This repository contains export scripts that produce a model optimized for on-device deployment. This can be run as follows:

```bash
python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export
```
Additional options are documented with the `--help` option. Note that the above script requires access to Qualcomm® AI Hub; see the deployment instructions for Qualcomm® AI Hub for details.
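For context, the export script drives Qualcomm® AI Hub through the `qai_hub` Python client. The sketch below is only an assumption-laden outline of that flow (the model file name and device are illustrative), not what export.py literally runs:

```python
# Rough, hypothetical outline of an AI Hub submission with the qai_hub client.
# Requires an AI Hub account and a configured API token; the file name below
# is illustrative only.
import qai_hub as hub

device = hub.Device("Samsung Galaxy S23 Ultra")

# Compile one exported model part for the target device.
compile_job = hub.submit_compile_job(
    model="llama_prompt_processor_part.onnx",  # hypothetical asset name
    device=device,
)

# Profile the compiled asset on a hosted physical device.
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),
    device=device,
)
print(profile_job)
```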

## License
- The license for the original implementation of Llama-v2-7B-Chat can be found