mgoin committed
Commit 958b455
1 Parent(s): 51aa29a

Update app.py

Files changed (1): app.py +10 -5
app.py CHANGED

@@ -7,12 +7,17 @@ deepsparse.cpu.print_hardware_capability()
 MODEL_ID = "hf:neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds"
 
 DESCRIPTION = f"""
-# LLM Chat on CPU with DeepSparse
-The model stub for this example is: {MODEL_ID}
+# Chat with an Efficient Sparse Llama 2 Model on CPU
 
-#### Accelerated Inference on CPUs
-The Llama 2 model runs purely on CPU courtesy of [sparse software execution by DeepSparse](https://github.com/neuralmagic/deepsparse).
-DeepSparse provides accelerated inference by taking advantage of the model's weight sparsity to deliver tokens fast!
+This demo showcases a groundbreaking [sparse Llama 2 7B model](https://huggingface.co/neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds) that has been pruned to 70% sparsity, retrained on pretraining data, and then sparse transferred for chat using the UltraChat 200k dataset. By leveraging the power of sparse transfer learning, this model delivers high-quality chat capabilities while significantly reducing computational costs and inference times.
+
+### Under the Hood
+
+- **Sparse Transfer Learning**: The model's pre-sparsified structure enables efficient fine-tuning on new tasks, minimizing the need for extensive hyperparameter tuning and reducing training times.
+- **Accelerated Inference**: Powered by the [DeepSparse CPU inference runtime](https://github.com/neuralmagic/deepsparse), this model takes advantage of its inherent sparsity to provide lightning-fast token generation on CPUs.
+- **Quantization**: 8-bit weight and activation quantization further optimizes the model's performance and memory footprint without compromising quality.
+
+By combining state-of-the-art sparsity techniques with the robustness of the Llama 2 architecture, this model pushes the boundaries of efficient generation.
 """
 
 MAX_MAX_NEW_TOKENS = 1024
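For context, the `hf:` model stub above is the kind of identifier DeepSparse's text-generation pipeline consumes. The sketch below is not part of this commit; it is a minimal, hedged illustration of loading and querying such a model with `deepsparse.TextGeneration`, and exact keyword names (`model`, `prompt`, `max_new_tokens`) may differ across deepsparse versions. The prompt string is an arbitrary example.

```python
# Minimal sketch (not from this commit): running the sparse-quantized
# Llama 2 chat model on CPU with DeepSparse. Assumes `deepsparse[llm]`
# is installed; the first call downloads the ONNX model from SparseZoo/HF.
from deepsparse import TextGeneration

MODEL_ID = "hf:neuralmagic/Llama-2-7b-pruned70-retrained-ultrachat-quant-ds"

# Builds the CPU inference pipeline from the model stub.
pipe = TextGeneration(model=MODEL_ID)

# Generate a short completion; output schema may vary by version.
result = pipe(prompt="What is weight sparsity?", max_new_tokens=64)
print(result.generations[0].text)
```

Running a 70%-pruned, 8-bit-quantized model this way is what lets the Space serve chat on commodity CPU hardware instead of a GPU.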