---
license: mit
language: en
tags:
- LLM
- LLaMA
- Baichuan
- Baichuan2
- XVERSE
---
# Model Card for lyraLLMs
## Introduction
We have released **lyraLLMs**, a highly optimized and easy-to-use inference engine for LLMs.
**lyraLLMs** supports NVIDIA GPUs across several architectures:
- Volta (V100)
- Turing (T4)
- Ampere (A100/A10)
- Ada Lovelace (RTX 4090, etc.)
**lyraLLMs** supports many popular HuggingFace models, including:
- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE)
- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM)
- LLaMA
- LLaMA 2
- XVERSE
- Baichuan 1 & 2
**lyraLLMs** is fast, memory-efficient, and easy to use, with:
- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
- Memory-efficient attention via FlashAttention2
- Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
- Easy-to-use Python API to serve LLMs
- Streaming outputs (a hedged sketch follows this list)
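As a taste of the streaming feature, here is a minimal, hypothetical sketch of consuming streamed tokens. Note that `stream_generate` is an assumed method name for illustration only; the actual streaming API is documented in [README.md](./lyrallms/README.md).
```python
# HYPOTHETICAL sketch: lyraLLMs supports streaming outputs, but
# `stream_generate` below is an ASSUMED method name, not a confirmed API.
from lyra_llama import lyraLlama

model = lyraLlama('XXX', 'fp16', 0)  # path, dtype, memopt_mode, as in the demo below

for chunk in model.stream_generate('Tell me a short story.', output_length=150):
    print(chunk, end='', flush=True)  # emit text as it arrives
```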
If you like our work and are interested in joining us, feel free to drop us a line at benbinwu@tencent.com.
## Speed
### Settings
* Throughput measured in tokens/s (input + output tokens)
* Tested on an A100 40GB with CUDA 12.0
* MEMOPT mode and KVCache Int8 enabled (a measurement sketch follows)
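For reference, a tokens/s figure like those below can be reproduced with a simple timing loop around the Python API shown in the demo later in this card. The HuggingFace tokenizer used for counting, and the assumption that `generate` returns only the generated continuations, are both assumptions of this sketch, not part of our benchmark harness.
```python
# Minimal throughput-measurement sketch, assuming the lyraLlama API from
# the Python Demo section and a HuggingFace tokenizer for token counting.
import time
from transformers import AutoTokenizer
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory of the converted model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = lyraLlama(model_path, 'fp16', 1)  # memopt_mode=1, as in our tests

prompts = ['北京的景点:故宫、天坛、万里长城等。\n深圳的景点:'] * 64  # batch size 64
start = time.time()
outputs = model.generate(prompts, output_length=150, do_sample=False,
                         top_k=30, top_p=0.85, temperature=1.0,
                         repetition_penalty=1.0)
elapsed = time.time() - start

# tokens/s counts input + output tokens (assumes `generate` returns only
# the continuations, not the echoed prompts)
n_tokens = sum(len(tokenizer.encode(t)) for t in prompts + outputs)
print(f'{n_tokens / elapsed:.1f} tokens/s')
```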
### Throughputs
#### XVERSE-13B-Chat
##### Input
北京的景点:故宫、天坛、万里长城等。\n深圳的景点:
*(English: "Beijing attractions: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nShenzhen attractions:")*
| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | - | - |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733.0 |
#### Baichuan2-7B-Base
##### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
*(English: "Beijing attractions: On the Stork Tower -> Wang Zhihuan\nNight Rain Sent North ->", a poem-to-poet completion prompt)*
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
#### Baichuan2-13B-Base
##### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北->
*(English: "Beijing attractions: On the Stork Tower -> Wang Zhihuan\nNight Rain Sent North ->", a poem-to-poet completion prompt)*
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
#### Yi-6B
##### Input
\# write the quick sort algorithm
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |
#### Yi-34B
Due to VRAM limitations, we could not profile the throughput of Yi-34B with Torch on an A100 40GB.
##### Input
Let me tell you an interesting story about cat Tom and mouse Jerry,
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |
## Usage
### Environment (Docker recommended)
- For CUDA 11.x: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`
```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v "$(pwd)":/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3
# inside the container:
cd /lyraLLMs
pip install -r requirements.txt
```
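Once inside the container, a quick sanity check with plain PyTorch (nothing lyraLLMs-specific) confirms the GPUs are visible:
```python
# Verify the container can see the GPUs before going further.
import torch

assert torch.cuda.is_available(), 'No CUDA device visible in the container'
print(torch.cuda.get_device_name(0), '| CUDA', torch.version.cuda)
```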
### Convert Models
We have released multiple optimized models converted from the original HuggingFace releases:
- ChatGLM-6B
- XVERSE-13B-Chat
- LLaMA-Ziya-13B
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base, and Baichuan2-13B-Chat
- Yi-6B, Yi-34B
Feel free to contact us if you would like to convert a fine-tuned LLM of your own.
### Inference
Refer to [README.md](./lyrallms/README.md) for inference of converted models with **lyraLLMs**.
### Python Demo
```python
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory with the converted model weights, config, and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# Prompt: "List 3 different machine learning algorithms and describe their applicable scopes."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts] * 64  # replicate the prompt into a batch of 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
```
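Note that `prompts` is replicated 64 times, so `generate` processes a single batch of 64 identical requests and returns the completions as a list in the same order. With `do_sample=False`, decoding is greedy, so the sampling parameters (`top_k`, `top_p`, `temperature`) would normally have no effect; set `do_sample=True` to enable sampling.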
## Citation
```bibtex
@Misc{lyraLLMs2024,
  author       = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title        = {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year         = {2024}
}
```
## Report bugs
- Start a discussion at https://huggingface.co/TMElyralab/lyraLLMs/discussions to report any bugs.
- Mark the title with a `[bug]` tag.