---
license: mit
language: en
tags:
- LLM
- LLaMA
- Baichuan
- Baichuan2
- XVERSE
---
# Model Card for lyraLLMs

## Introduction

We have released **lyraLLMs**, a highly optimized and easy-to-use inference engine for LLMs.

**lyraLLMs** supports the following NVIDIA GPU architectures:
- Volta (V100)
- Turing (T4)
- Ampere (A100/A10)
- Ada Lovelace (RTX 4090, etc.)

**lyraLLMs** supports many popular HuggingFace models, including:
- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE)
- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM)
- LLaMA
- LLaMA 2
- XVERSE
- Baichuan 1 & 2

**lyraLLMs** is fast, memory-efficient and easy to use, with:
- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
- Efficient memory usage of attention with FlashAttention2
- Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
- Easy-to-use Python API to serve LLMs
- Streaming outputs

If you like our work and are considering joining us, feel free to drop us a line at benbinwu@tencent.com

## Speed

### Settings
* Throughput measured in tokens/s (input + output tokens)
* Tested on A100 40G, CUDA 12.0
* MEMOPT mode and KVCache Int8 enabled
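
For reference, throughput here means the total number of tokens processed (input plus output) divided by wall-clock time. Below is a minimal sketch of such a measurement, assuming the `generate()` API shown in the Python demo later in this card; it is our own illustration, not the benchmark harness used for the tables.

```python
import time

def measure_throughput(model, prompts, output_length=150):
    """Rough tokens/s estimate: (input + output tokens) / wall-clock time.

    `model` is assumed to expose generate() as in the Python demo below.
    The token counts here are illustrative; a real harness would count
    tokens with the model's tokenizer rather than characters.
    """
    start = time.time()
    model.generate(prompts, output_length=output_length)
    elapsed = time.time() - start

    input_tokens = sum(len(p) for p in prompts)   # crude character-based proxy
    output_tokens = output_length * len(prompts)  # assumes full-length outputs
    return (input_tokens + output_tokens) / elapsed
```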

### Throughputs

### XVERSE-13B-Chat

#### Input
北京的景点:故宫、天坛、万里长城等。\n深圳的景点: (Attractions in Beijing: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nAttractions in Shenzhen:)

| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | OOM | OOM |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733.0 |

### Baichuan2-7B-Base

#### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北-> (Attractions in Beijing: "Climbing Stork Tower" -> Wang Zhihuan\n"Night Rains, Sent North" ->)

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |

### Baichuan2-13B-Base

#### Input
北京的景点:登鹳雀楼->王之涣\n夜雨寄北-> (Attractions in Beijing: "Climbing Stork Tower" -> Wang Zhihuan\n"Night Rains, Sent North" ->)

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |

### Yi-6B

#### Input
\# write the quick sort algorithm

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |

### Yi-34B

Due to VRAM limitations, we could not profile the throughput of Yi-34B on an A100 40G with Torch.

#### Input
Let me tell you an interesting story about cat Tom and mouse Jerry,

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |

## Usage

### Environment (Docker recommended)

- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Mount the working directory into the container ($PWD is more portable than a relative path)
docker run --rm -it --gpus all -v $PWD:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container:
pip install -r requirements.txt
```
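
Before converting or running models, a quick sanity check that the container can see the GPUs may save time (our own suggestion, not part of the official setup):

```bash
# Confirm the GPUs are visible inside the container
nvidia-smi

# Confirm the bundled PyTorch build can reach CUDA
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```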

### Convert Models

We have released multiple optimized models converted from original HuggingFace ones:
- ChatGLM-6B
- XVERSE-13B-Chat
- LLaMA-Ziya-13B
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
- Yi-6B, Yi-34B

Feel free to contact us if you would like to convert a fine-tuned LLM of your own.

### Inference

Refer to [README.md](./lyrallms/README.md) for running inference on converted models with **lyraLLMs**.

### Python Demo

```python
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory containing the converted model weights, config and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to enable MEMOPT-mode inference

model = lyraLlama(model_path, data_type, memopt_mode)

# Prompt: "List 3 different machine learning algorithms and explain what each is suited for."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts,] * 64  # batch of 64 identical prompts

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)

```
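
Streaming outputs are listed among the features, but this card does not document the streaming API. The sketch below shows what consuming a streamed generation might look like; the `stream_generate` generator method and its yield format are purely our assumption for illustration, so check the repository README for the actual interface.

```python
# Hypothetical streaming loop: `stream_generate` and its behaviour are
# assumptions for illustration, not a documented lyraLLMs API.
for partial_texts in model.stream_generate(prompts, output_length=150):
    # Assumed to yield the text generated so far, one entry per prompt.
    print(partial_texts[0], flush=True)
```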

## Citation
``` bibtex
@Misc{lyraLLMs2024,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}
```

## Report bugs
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraLLMs/discussions
- Mark bug reports with a `[bug]` tag in the title.