---
license: creativeml-openrail-m
language:
- en
tags:
- LLM
- tensorRT
- ChatGLM
---
## Model Card for lyraChatGLM

lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.

lyraChatGLM delivers roughly **10x** faster inference than the original implementation. We are still working hard to improve performance further.

Among its main features are:

- weights: the original ChatGLM-6B weights released by THUDM.
- device: lyraChatGLM is built on TensorRT and compiled for SM=80 GPUs (e.g. A100); see the compatibility check after this list.
- batch_size: compiled with dynamic batch size, max batch_size = 8.
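
Because the shipped engine only targets SM=80 hardware, it can help to verify the GPU before loading the plan. The check below is a minimal sketch, not part of the released code; it only assumes PyTorch is installed:

```python
import torch

# The lyraChatGLM TensorRT plan is compiled for SM=80 (e.g. A100).
# Hypothetical helper: checks compute capability only, not part of lyraChatGLM.
def check_sm80() -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA device is required to run lyraChatGLM.")
    major, minor = torch.cuda.get_device_capability(0)
    if (major, minor) != (8, 0):
        raise RuntimeError(
            f"Engine was compiled for SM=80, but this GPU reports SM={major}{minor}."
        )

check_sm80()
```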

## Speed

### Test environment

- device: Nvidia A100 40G
- batch size: 8

**Since early ChatGLM versions did not support batch inference, the `original` row in the table below was measured with batch_size=1.**


**According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), the batch-inference issue has since been fixed, and the speed at batch_size=8 reaches up to 137 tokens/s.**

|version|speed|
|:-:|:-:|
|original|30 tokens/s|
|lyraChatGLM|310 tokens/s|
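
The benchmarking script behind these numbers is not included in this repository. A rough throughput measurement could look like the sketch below, assuming the `chat`, `tokenizer`, `input_ids`, and `MAX_OUT_LEN` objects from the Uses section further down and an HF-style `generate` return value:

```python
import time

# Hypothetical throughput measurement; not the script used to produce the table above.
start = time.perf_counter()
outputs = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
elapsed = time.perf_counter() - start

# Count only newly generated tokens: total output length minus the (padded) prompt length.
prompt_len = input_ids.shape[1]
generated_tokens = sum(len(seq) - prompt_len for seq in outputs)
print(f"{generated_tokens / elapsed:.1f} tokens/s")
```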


## Model Sources

- **Repository:** [https://huggingface.co/THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)

## Try Demo in 2 fast steps

``` bash
# step 1: clone this repository (model files and demo.py)
git clone https://huggingface.co/TMElyralab/lyraChatGLM
cd lyraChatGLM

# step 2: run the demo inside the prebuilt environment image
docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
```

## Uses

```python
from transformers import AutoTokenizer
from faster_chat_glm import GLM6B, FasterChatGLM


MAX_OUT_LEN = 100

# Load the ChatGLM tokenizer shipped in ./models.
tokenizer = AutoTokenizer.from_pretrained('./models', trust_remote_code=True)
# "Why do we need to accelerate deep learning models?"
input_str = ["为什么我们需要对深度学习模型加速?", ]
inputs = tokenizer(input_str, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')


# TensorRT plan compiled for SM=80 with dynamic batch size (max 8).
plan_path = './models/glm6b-bs8.ftm'
# Kernel for the chat model; the hyper-parameters below match ChatGLM-6B.
kernel = GLM6B(plan_path=plan_path,
               batch_size=1,
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)

chat = FasterChatGLM(model_dir="./models", kernel=kernel).half().cuda()

# generate
sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
# de-tokenize model output to text
res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
print(res)
```
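
Because the plan is compiled with a dynamic batch size (max 8), several prompts can be generated in one call. Below is a minimal batched sketch that reuses the objects defined above; the repeated prompt is only illustrative:

```python
# Hypothetical batched usage, built on the GLM6B / FasterChatGLM API shown above.
prompts = ["为什么我们需要对深度学习模型加速?"] * 8  # up to the compiled max batch_size = 8

batch_inputs = tokenizer(prompts, return_tensors="pt", padding=True)
batch_ids = batch_inputs.input_ids.to('cuda:0')

# Rebuild the kernel with batch_size matching the number of prompts.
batch_kernel = GLM6B(plan_path=plan_path,
                     batch_size=8,
                     num_beams=1,
                     use_cache=True,
                     num_heads=32,
                     emb_size_per_heads=128,
                     decoder_layers=28,
                     vocab_size=150528,
                     max_seq_len=MAX_OUT_LEN)
batch_chat = FasterChatGLM(model_dir="./models", kernel=batch_kernel).half().cuda()

batch_outputs = batch_chat.generate(inputs=batch_ids, max_length=MAX_OUT_LEN)
for seq in batch_outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```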
## Demo output

### input
为什么我们需要对深度学习模型加速? *(Why do we need to accelerate deep learning models?)*

### output
为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。

以下是一些原因我们需要对深度学习模型加速:

1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。

*(English translation: Why do we need to accelerate deep learning models? Training deep learning models requires a large amount of computing resources; in particular, training requires large amounts of memory, GPUs (graphics processing units), and other compute. Training a deep learning model therefore takes time, and if a model cannot be trained quickly, training progress may be slow or training may fail altogether. Here are some reasons why we need to accelerate deep learning models: 1. Training deep neural networks requires a large amount of computing resources, and in particular requires even more compute, so a faster training speed is needed.)*



## Citation
``` bibtex
@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM by 10x+},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}
```

## Report bugs
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark bug reports with `[bug]` in the title.