	---
	license: creativeml-openrail-m
	language:
	- en
	tags:
	- LLM
	- tensorRT
	- ChatGLM
	---
	## Model Card for lyraChatGLM

	lyraChatGLM is currently the fastest ChatGLM-6B available. To the best of our knowledge, it is the first accelerated version of ChatGLM-6B.

	lyraChatGLM runs inference about 10x faster than the early original version of ChatGLM-6B. We are still working to improve performance further.

	Among its main features are:

	- weights: original ChatGLM-6B weights released by THUDM.
	- device: lyraChatGLM is built mainly on TensorRT and compiled for SM=80 (compute capability 8.0, for example A100).
	- batch_size: compiled with dynamic batch size, max batch_size = 8
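
The two constraints above (SM=80 and a maximum batch size of 8) can be checked before loading the engine. The sketch below is illustrative only: `check_engine_compat` and its constants are hypothetical names, not part of lyraChatGLM, and it assumes an exact compute-capability match is required, which is the conservative reading of "compiled for SM=80". On a real machine the capability tuple could be obtained with `torch.cuda.get_device_capability()`.

```python
# Illustrative pre-flight check; these names are hypothetical, not lyraChatGLM APIs.
REQUIRED_SM = (8, 0)   # SM=80, i.e. compute capability 8.0 (A100, for example)
MAX_BATCH_SIZE = 8     # the engine is compiled with dynamic batch size up to 8


def check_engine_compat(capability, batch_size):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(capability) != REQUIRED_SM:
        problems.append(
            f"GPU compute capability {capability} does not match required {REQUIRED_SM}"
        )
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        problems.append(
            f"batch_size {batch_size} is outside the supported range 1..{MAX_BATCH_SIZE}"
        )
    return problems


if __name__ == "__main__":
    print(check_engine_compat((8, 0), 4))   # A100-class GPU, valid batch size
    print(check_engine_compat((7, 5), 16))  # wrong architecture and batch too large
```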

	## Speed

	### Test environment

	- device: NVIDIA A100 40GB
	- batch size: 8

	Because the early ChatGLM version did not support batch inference, the `original` row in the table below was measured with batch_size=1.


	According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has been fixed and the speed at batch_size=8 reaches up to 137 tokens/s. We will re-evaluate and update the performance numbers.

	| version | speed |
	|:-:|:-:|
	| original | 30 tokens/s |
	| lyraChatGLM | 310 tokens/s |
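
The headline speedup follows directly from the table. The short calculation below reproduces it and, as an assumption about how to read the note on the fixed batching bug, also compares against the 137 tokens/s figure reported for batch_size=8. That second ratio is derived arithmetic, not a measured result from this table.

```python
def speedup(accelerated_tps, baseline_tps):
    """Ratio of two throughputs, both measured in tokens per second."""
    return accelerated_tps / baseline_tps


LYRA_TPS = 310          # lyraChatGLM, batch_size=8 (from the table)
ORIGINAL_BS1_TPS = 30   # original ChatGLM-6B, batch_size=1 (from the table)
REPORTED_BS8_TPS = 137  # figure quoted from the linked discussion

print(f"vs original (batch_size=1): {speedup(LYRA_TPS, ORIGINAL_BS1_TPS):.1f}x")       # 10.3x
print(f"vs reported batch_size=8 figure: {speedup(LYRA_TPS, REPORTED_BS8_TPS):.1f}x")  # 2.3x
```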


	## Model Sources

	- Repository: [https://huggingface.co/THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)

	## Try the demo in 2 quick steps

	``` bash
	# Step 1: clone the repository
	git clone https://huggingface.co/TMElyralab/lyraChatGLM
	cd lyraChatGLM

	# Step 2: run the demo inside the Docker image
	docker run --gpus=1 --rm --net=host -v "${PWD}":/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
	```

	## Uses

	```python
	from transformers import AutoTokenizer
	from faster_chat_glm import GLM6B, FasterChatGLM

	MAX_OUT_LEN = 100

	# Tokenize the prompt and move the input IDs to the GPU.
	tokenizer = AutoTokenizer.from_pretrained('./models', trust_remote_code=True)
	# Prompt: "Why do we need to accelerate deep learning models?"
	input_str = ["为什么我们需要对深度学习模型加速？"]
	inputs = tokenizer(input_str, return_tensors="pt", padding=True)
	input_ids = inputs.input_ids.to('cuda:0')

	# Kernel for the chat model, loaded from the plan file.
	plan_path = './models/glm6b-bs8.ftm'
	kernel = GLM6B(
	    plan_path=plan_path,
	    batch_size=1,
	    num_beams=1,
	    use_cache=True,
	    num_heads=32,
	    emb_size_per_heads=128,
	    decoder_layers=28,
	    vocab_size=150528,
	    max_seq_len=MAX_OUT_LEN,
	)

	chat = FasterChatGLM(model_dir="./models", kernel=kernel).half().cuda()

	# Generate up to MAX_OUT_LEN tokens.
	sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
	# De-tokenize the model output to text.
	res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
	print(res)
	```
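
To feed more prompts than the engine's maximum batch size and get a rough throughput figure, one option is to split the prompts into batches of at most 8 and time the generation calls. The sketch below is a minimal illustration and not necessarily how the numbers in the Speed section were measured. `chunk_prompts`, `tokens_per_second`, and `fake_generate` are hypothetical helpers, and `fake_generate` stands in for a real model call. In a real GPU run you would count only newly generated tokens and call `torch.cuda.synchronize()` before stopping the timer.

```python
import time

MAX_BATCH_SIZE = 8  # maximum batch size the engine is compiled for


def chunk_prompts(prompts, max_batch_size=MAX_BATCH_SIZE):
    """Split prompts into consecutive batches of at most max_batch_size."""
    return [prompts[i:i + max_batch_size]
            for i in range(0, len(prompts), max_batch_size)]


def tokens_per_second(generate_fn, batches):
    """Time generate_fn over all batches and return tokens produced per second.

    generate_fn takes a list of prompts and returns one token list per prompt.
    """
    total_tokens = 0
    start = time.perf_counter()
    for batch in batches:
        outputs = generate_fn(batch)
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(batch):
    """Stand-in for a real model call: returns 10 dummy tokens per prompt."""
    time.sleep(0.01)
    return [[0] * 10 for _ in batch]


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(20)]
    batches = chunk_prompts(prompts)
    print([len(b) for b in batches])  # [8, 8, 4]
    print(f"{tokens_per_second(fake_generate, batches):.0f} tokens/s")
```
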
	## Demo output

	### input
	为什么我们需要对深度学习模型加速? 。

English translation: Why do we need to accelerate deep learning models?

	### output
	为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。

	以下是一些原因我们需要对深度学习模型加速:

	1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。

English translation of the output:

Why do we need to accelerate deep learning models? Training deep learning models requires a large amount of computing resources; in particular, training requires large amounts of memory, GPUs (graphics processing units), and other computing resources. Training a deep learning model therefore takes time, and if a model cannot be trained quickly, training may progress slowly or become impossible.

Here are some reasons why we need to accelerate deep learning models:

1. Training deep neural networks requires a large amount of computing resources, and training deep neural networks in particular needs even more, so faster training is required.

	### TODO

	We have implemented some of ChatGLM's special operators, such as 2D rotary embedding and alpha residual.

	We plan to add these operators to FasterTransformer to release a faster version.
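
For readers unfamiliar with the alpha residual mentioned above, the sketch below shows the idea as we understand it from the public ChatGLM-6B modeling code: the (normalized) sublayer input is scaled by `alpha = sqrt(2 * num_layers)` before the sublayer output is added. This is a plain-Python illustration on lists, not the TensorRT operator used by lyraChatGLM, and the function name is hypothetical.

```python
import math


def alpha_residual(layer_input, sublayer_output, num_layers=28):
    """Scaled residual connection: alpha * input + sublayer output.

    Assumption: alpha = sqrt(2 * num_layers), following the public
    ChatGLM-6B modeling code (28 decoder layers in ChatGLM-6B).
    """
    alpha = math.sqrt(2 * num_layers)
    return [alpha * x + y for x, y in zip(layer_input, sublayer_output)]


if __name__ == "__main__":
    # With num_layers=2, alpha is exactly 2.0, so the result is easy to verify.
    print(alpha_residual([1.0, -1.0], [0.5, 0.5], num_layers=2))  # [2.5, -1.5]
```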

	## Citation
	``` bibtex
	@Misc{lyraChatGLM2023,
	  author = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
	  title = {lyraChatGLM: Accelerating ChatGLM by 10x+},
	  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
	  year = {2023}
	}
	```

	## Report bugs
	- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
	- Mark bug reports with `[bug]` in the title.