---
base_model: meta-llama/Llama-2-70b-chat-hf
inference: true
model_type: llama
quantized_by: softmax
tags:
- nm-vllm
- marlin
- int4
---

## Llama-2-70b-chat-hf

This repo contains model files for [Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) optimized for [nm-vllm](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was quantized with [GPTQ](https://arxiv.org/abs/2210.17323) and saved in the Marlin format for efficient 4-bit inference. Marlin is a highly optimized inference kernel for 4-bit models.

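To give a rough sense of what 4-bit weights mean for a model of this size, the sketch below estimates weight storage at 16 and 4 bits per parameter. It is back-of-the-envelope arithmetic, not a measurement of this checkpoint: it assumes a round 70 billion parameters and ignores quantization scales, any layers left unquantized, the KV cache, and activations, so actual memory use will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a round 70e9 parameters; the exact count differs slightly.
NUM_PARAMS = 70e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights: ~{fp16_gb:.0f} GB")
print(f"int4 weights: ~{int4_gb:.0f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is the main reason a 4-bit checkpoint fits on far fewer GPUs than the 16-bit original.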
## Inference

Install [nm-vllm](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory usage:

```bash
pip install nm-vllm[sparse]
```

Run in a Python pipeline for local inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the Marlin-format checkpoint into the nm-vllm engine.
model_id = "softmax/Llama-2-70b-chat-hf-marlin"
model = LLM(model_id)

# Build the prompt with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate up to 200 new tokens and print the completion text.
sampling_params = SamplingParams(max_tokens=200)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
Synthetic data, also known as artificial data or simulated data, is data that is artificially generated using various methods, rather than being collected from real-world sources. Synthetic data can be used to augment or substitute real-world data in machine learning applications, and can be particularly useful when real-world data is limited, expensive, or difficult to obtain.

There are several ways to generate synthetic data, including:

1. Data augmentation: This involves transforming existing data, such as images or time series data, to create new data that can be used to augment a training set. For example, an image recognition model can be trained on a dataset of images that have been rotated, scaled, and flipped to create new images that the model has not seen before.
2. Generative models: These models use algorithms to generate new data that resembles real-world data. Generative adversarial networks (GAN
"""
```

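The `apply_chat_template` call above hides what the prompt string looks like. As an illustration only, the sketch below hand-builds a single-turn prompt in the commonly documented Llama 2 chat layout (`[INST]` markers with an optional `<<SYS>>` block). The helper `format_llama2_prompt` is hypothetical and not part of any library, and the tokenizer's own template remains the authoritative source; use this for inspecting or debugging prompts rather than as a replacement.

```python
from typing import Optional


def format_llama2_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    """Hypothetical helper: single-turn prompt in the Llama 2 chat layout."""
    content = user_message.strip()
    if system_prompt is not None:
        # The system prompt is wrapped in <<SYS>> markers inside the first user turn.
        content = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{content}"
    # "<s>" is the beginning-of-sequence token written out as text.
    return f"<s>[INST] {content} [/INST]"


prompt = format_llama2_prompt("What is synthetic data in machine learning?")
print(prompt)
# <s>[INST] What is synthetic data in machine learning? [/INST]
```

Comparing this string with `formatted_prompt` from the pipeline above is a quick way to check that the chat template is being applied as expected.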

## Quantization

For details on how this model was quantized and converted to the Marlin format, see this [notebook](https://github.com/neuralmagic/nm-vllm/blob/c2f8ec48464511188dcca6e49f841ebf67b97153/examples-neuralmagic/marlin_quantization_and_deploy/Performantly_Quantize_LLMs_to_4_bits_with_Marlin_and_nm_vllm.ipynb).