bys0318 committed · verified
Commit d8cde7a · 1 Parent(s): 6884326

Update README.md

Files changed (1): README.md (+31 -1)
README.md CHANGED
@@ -19,6 +19,7 @@ pipeline_tag: text-generation
 
 LongWriter-glm4-9b is trained based on [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b), and is capable of generating 10,000+ words at once.
 
+Environment: same environment requirements as [glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) (`transformers>=4.44.0`).
 
 A simple demo for deployment of the model:
 ```python
@@ -31,7 +32,36 @@ query = "Write a 10000-word China travel guide"
 response, history = model.chat(tokenizer, query, history=[], max_new_tokens=32768, temperature=0.5)
 print(response)
 ```
-Environment: Same environment requirement as [glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) (`transformers>=4.44.0`).
+You can also deploy the model with [vllm](https://github.com/vllm-project/vllm), which can generate 10,000+ words within a minute. Here is an example:
+```python
+from vllm import LLM, SamplingParams
+model = LLM(
+    model="THUDM/LongWriter-glm4-9b",
+    dtype="auto",
+    trust_remote_code=True,
+    tensor_parallel_size=1,
+    max_model_len=32768,
+    gpu_memory_utilization=1,
+)
+tokenizer = model.get_tokenizer()
+stop_token_ids = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"), tokenizer.get_command("<|observation|>")]
+generation_params = SamplingParams(
+    temperature=0.5,
+    top_p=0.8,
+    top_k=50,
+    max_tokens=32768,
+    repetition_penalty=1,
+    stop_token_ids=stop_token_ids,
+)
+query = "Write a 10000-word China travel guide"
+input_ids = tokenizer.build_chat_input(query, history=[], role='user').input_ids[0].tolist()
+outputs = model.generate(
+    sampling_params=generation_params,
+    prompt_token_ids=[input_ids],
+)
+output = outputs[0]
+print(output.outputs[0].text)
+```
 
 License: [glm-4-9b License](https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/LICENSE)
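A setup sketch matching the environment line added in the diff. The exact package list is an assumption (the README only pins `transformers>=4.44.0`; `vllm` is needed only for the second deployment path):

```shell
# required by the transformers demo (torch is assumed as the backend)
pip install "transformers>=4.44.0" torch
# optional: only needed for the vllm deployment example
pip install vllm
```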
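The hunk above shows only the tail of the transformers demo; the loading lines fall outside the diff context. For orientation, here is a plausible completion, a sketch assuming the standard `trust_remote_code` loading pattern used by glm-4-9b-chat rather than the README's verbatim code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Assumed loading pattern: the glm4 custom modeling code fetched via
# trust_remote_code is what provides the model.chat() helper used below.
tokenizer = AutoTokenizer.from_pretrained("THUDM/LongWriter-glm4-9b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/LongWriter-glm4-9b",
    torch_dtype=torch.bfloat16,  # dtype choice is an assumption
    trust_remote_code=True,
    device_map="auto",
)
model = model.eval()

query = "Write a 10000-word China travel guide"
# The two lines below are the ones visible in the diff hunk.
response, history = model.chat(tokenizer, query, history=[], max_new_tokens=32768, temperature=0.5)
print(response)
```

Running this downloads the full 9B checkpoint, so it is shown for orientation only, not as a verified transcript.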