mgoin committed
Commit 958b455
1 Parent(s): 51aa29a

Update app.py

Files changed (1): app.py +10 -5
app.py CHANGED

@@ -7,12 +7,17 @@ deepsparse.cpu.print_hardware_capability()
 MODEL_ID = "hf:neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds"
 
 DESCRIPTION = f"""
-# LLM Chat on CPU with DeepSparse
-The model stub for this example is: {MODEL_ID}
+# Chat with an Efficient Sparse Llama 2 Model on CPU
 
-#### Accelerated Inference on CPUs
-The Llama 2 model runs purely on CPU courtesy of [sparse software execution by DeepSparse](https://github.com/neuralmagic/deepsparse).
-DeepSparse provides accelerated inference by taking advantage of the model's weight sparsity to deliver tokens fast!
+This demo showcases a groundbreaking [sparse Llama 2 7B model](https://huggingface.co/neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds) that has been pruned to 70% sparsity, retrained on pretraining data, and then sparse transferred for chat using the UltraChat 200k dataset. By leveraging the power of sparse transfer learning, this model delivers high-quality chat capabilities while significantly reducing computational costs and inference times.
+
+### Under the Hood
+
+- **Sparse Transfer Learning**: The model's pre-sparsified structure enables efficient fine-tuning on new tasks, minimizing the need for extensive hyperparameter tuning and reducing training times.
+- **Accelerated Inference**: Powered by the [DeepSparse CPU inference runtime](https://github.com/neuralmagic/deepsparse), this model takes advantage of its inherent sparsity to provide lightning-fast token generation on CPUs.
+- **Quantization**: 8-bit weight and activation quantization further optimizes the model's performance and memory footprint without compromising quality.
+
+By combining state-of-the-art sparsity techniques with the robustness of the Llama 2 architecture, this model pushes the boundaries of efficient generation.
 """
 
 MAX_MAX_NEW_TOKENS = 1024
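For context, the `hf:` model stub above is the kind of identifier DeepSparse's text-generation pipeline consumes. The sketch below is not part of this commit; it is a minimal, hedged illustration of loading and querying such a model with `deepsparse.TextGeneration`, and exact keyword names (`model`, `prompt`, `max_new_tokens`) may differ across deepsparse versions. The prompt string is an arbitrary example.

```python
# Minimal sketch (not from this commit): running the sparse-quantized
# Llama 2 chat model on CPU with DeepSparse. Assumes `deepsparse[llm]`
# is installed; the first call downloads the ONNX model from SparseZoo/HF.
from deepsparse import TextGeneration

MODEL_ID = "hf:neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds"

# Builds the CPU inference pipeline from the model stub.
pipe = TextGeneration(model=MODEL_ID)

# Generate a short completion; output schema may vary by version.
result = pipe(prompt="What is weight sparsity?", max_new_tokens=64)
print(result.generations[0].text)
```

Running a 70%-pruned, 8-bit-quantized model this way is what lets the Space serve chat on commodity CPU hardware instead of a GPU.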