---
base_model: Xenova/llama2.c-stories110M
inference: true
model_type: llama
quantized_by: mgoin
tags:
- nm-vllm
- sparse
---

## llama2.c-stories110M-pruned50
This repo contains model files for [llama2.c 110M tinystories](https://huggingface.co/Xenova/llama2.c-stories110M) optimized for [NM-vLLM](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was pruned to 50% sparsity with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference
Install [NM-vLLM](https://github.com/neuralmagic/nm-vllm) for fast inference:
```bash
pip install nm-vllm[sparse]
```
Run in a Python pipeline for local inference:
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/llama2.c-stories110M-pruned50", sparsity="sparse_w16a16")
prompt = "Hello my name is"

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
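Because NM-vLLM is built for throughput, you can also pass a list of prompts to `generate` and let the engine batch them. A minimal sketch (the prompt strings and sampling settings below are illustrative, not from this repo):
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/llama2.c-stories110M-pruned50", sparsity="sparse_w16a16")

# vLLM schedules and batches these prompts internally
prompts = [
    "Once upon a time",
    "The little robot",
    "One sunny morning",
]
sampling_params = SamplingParams(max_tokens=64, temperature=0.8)

for output in model.generate(prompts, sampling_params=sampling_params):
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```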

## Prompt template

N/A

## Sparsification
For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.

Install [SparseML](https://github.com/neuralmagic/sparseml):
```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
```

Adjust the recipe as needed, then run this one-shot compression script to apply SparseGPT:
```python
import sparseml.transformers

original_model_name = "Xenova/llama2.c-stories110M"
calibration_dataset = "open_platypus"
output_directory = "output/"

# Prune the decoder layers to 50% sparsity in a single calibration pass
recipe = """
test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      targets: ['re:model.layers.\d*$']
"""

# Apply SparseGPT to the model
sparseml.transformers.oneshot(
    model=original_model_name,
    dataset=calibration_dataset,
    recipe=recipe,
    output_dir=output_directory,
)
```
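To sanity-check the result, you can load the saved checkpoint with plain `transformers` and measure the fraction of zeroed weights in the pruned layers. A minimal sketch, assuming the one-shot run wrote a Hugging Face-format checkpoint to `output/`:
```python
import torch
from transformers import AutoModelForCausalLM

# Load the one-shot output as a regular dense checkpoint
model = AutoModelForCausalLM.from_pretrained("output/")

total, zeros = 0, 0
for name, param in model.named_parameters():
    # The recipe targets only the decoder layers, so the embeddings
    # and LM head stay dense and are skipped here
    if "layers" in name and param.dim() == 2:
        total += param.numel()
        zeros += (param == 0).sum().item()

print(f"Sparsity of pruned layers: {zeros / total:.2%}")  # expect roughly 50%
```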

## Slack

For further support, and for discussion of these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).