---
license: creativeml-openrail-m
language:
- en
tags:
- LLM
- tensorRT
- ChatGLM
---
## Model Card for lyraChatGLM
lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
Inference with lyraChatGLM is about **10x** faster than with the original version. We are still working to improve performance further.
Among its main features are:
- weights: original ChatGLM-6B weights released by THUDM.
- device: lyraChatGLM is built mainly on TensorRT and compiled for SM=80 (for example, A100).
- batch_size: compiled with dynamic batch size, max batch_size = 8
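
Because the engine accepts at most 8 prompts per call, larger workloads have to be split before inference. The following is a minimal sketch of such a split; `chunk_prompts` and `MAX_BATCH_SIZE` are hypothetical names used for illustration and are not part of the lyraChatGLM package.

```python
from typing import Iterator, List

# Upper bound taken from the feature list above (max batch_size = 8).
MAX_BATCH_SIZE = 8


def chunk_prompts(prompts: List[str], max_batch: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts`, each with at most `max_batch` items."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(prompts), max_batch):
        yield prompts[start:start + max_batch]


# 19 prompts are split into batches of 8, 8 and 3.
prompts = [f"prompt {i}" for i in range(19)]
batch_sizes = [len(batch) for batch in chunk_prompts(prompts)]
print(batch_sizes)  # [8, 8, 3]
```

Each yielded batch can then be tokenized with `padding=True` and passed to the model in one call.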
## Speed
### test environment
- device: NVIDIA A100 40GB
- batch size: 8
**Because early ChatGLM versions did not support batch inference, `original` in the table below was measured with batch_size=1.**
**According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has since been fixed, and the speed at batch_size=8 reaches up to 137 tokens/s.**
|version|speed|
|:-:|:-:|
|original|30 tokens/s|
|lyraChatGLM|310 tokens/s|
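
The speedup implied by these figures can be checked with a few lines of arithmetic. This sketch only re-uses the numbers quoted in this section; it assumes that the 137 tokens/s figure from the linked discussion refers to the original model at batch_size=8.

```python
# Throughput figures quoted in this section, in tokens/s.
ORIGINAL_BS1 = 30   # original ChatGLM-6B, batch_size=1 (from the table)
ORIGINAL_BS8 = 137  # assumed: original ChatGLM-6B, batch_size=8 (from the linked discussion)
LYRA_BS8 = 310      # lyraChatGLM, batch_size=8 (from the table)


def speedup(fast: float, slow: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


print(f"vs. original at batch_size=1: {speedup(LYRA_BS8, ORIGINAL_BS1):.1f}x")  # 10.3x
print(f"vs. original at batch_size=8: {speedup(LYRA_BS8, ORIGINAL_BS8):.1f}x")  # 2.3x
```

The 10x headline therefore compares batch_size=8 against batch_size=1; under the assumption above, the like-for-like ratio at batch_size=8 is closer to 2.3x.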
## Model Sources
- **Repository:** [https://huggingface.co/THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
## Try the demo in 2 quick steps
``` bash
#step 1
git clone https://huggingface.co/TMElyralab/lyraChatGLM
cd lyraChatGLM
#step 2
docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
```
## Uses
```python
from transformers import AutoTokenizer
from faster_chat_glm import GLM6B, FasterChatGLM

MAX_OUT_LEN = 100

# Load the tokenizer stored in ./models.
tokenizer = AutoTokenizer.from_pretrained('./models', trust_remote_code=True)

# Prompt: "Why do we need to accelerate deep learning models?"
input_str = ["为什么我们需要对深度学习模型加速?", ]
inputs = tokenizer(input_str, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')

plan_path = './models/glm6b-bs8.ftm'

# Kernel for the chat model.
kernel = GLM6B(plan_path=plan_path,
               batch_size=1,
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)

chat = FasterChatGLM(model_dir="./models", kernel=kernel).half().cuda()

# Generate up to MAX_OUT_LEN tokens.
sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)

# De-tokenize the model output to text.
res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
print(res)
```
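
To get a rough tokens/s figure for a setup like the one above, generation can be wrapped in a simple timer. This is a minimal sketch and not the benchmark procedure used for the Speed table; `tokens_per_second` and `fake_generate` are hypothetical helpers, and `fake_generate` merely stands in for a real call such as `chat.generate`.

```python
import time
from typing import Callable, List, Sequence


def tokens_per_second(generate_fn: Callable[[Sequence[str]], List[List[int]]],
                      batch: Sequence[str],
                      n_runs: int = 3) -> float:
    """Time `n_runs` calls of `generate_fn(batch)` and return tokens per second.

    `generate_fn` must return one sequence of token ids per prompt; every
    returned token is counted, including prompt tokens if they are echoed.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        outputs = generate_fn(batch)
        total_tokens += sum(len(seq) for seq in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(batch: Sequence[str]) -> List[List[int]]:
    """Stand-in for a model call: sleeps briefly and returns 100 dummy tokens per prompt."""
    time.sleep(0.01)
    return [[0] * 100 for _ in batch]


rate = tokens_per_second(fake_generate, ["prompt a", "prompt b"])
print(f"{rate:.0f} tokens/s")
```

When timing a real GPU model, it is advisable to run a warm-up call first and to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels launch asynchronously.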
## Demo output
### input
为什么我们需要对深度学习模型加速?

(English: Why do we need to accelerate deep learning models?)
### output
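为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。
以下是一些原因我们需要对深度学习模型加速:
1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。

English translation of the output above:

Why do we need to accelerate deep learning models? Training deep learning models requires a large amount of computing resources; in particular, training a model requires large amounts of memory, GPUs (graphics processing units), and other computing resources. Training a deep learning model therefore takes a certain amount of time, and if the model cannot be trained quickly, training may progress slowly or fail altogether.
Here are some reasons why we need to accelerate deep learning models:
1. Training deep neural networks requires a large amount of computing resources; in particular, training deep neural networks requires even more computing resources, so faster training is needed.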
## Citation
``` bibtex
@Misc{lyraChatGLM2023,
author = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
title = {lyraChatGLM: Accelerating ChatGLM by 10x+},
howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
year = {2023}
}
```
## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark bug reports with `[bug]` in the title.