Update README.md
README.md (CHANGED)
@@ -48,6 +48,8 @@ InternLM2.5-7B-Chat-1M is the 1M-long-context version of InternLM2.5-7B-Chat. Si
 
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 
+Here is an example of 1M-long-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -57,7 +59,12 @@ You can run batch inference locally with the following python code:
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
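The snippet above imports `GenerationConfig` without using it. For context, here is a minimal sketch of passing sampling parameters to the LMDeploy pipeline; the engine settings repeat the README's values, while the generation settings are illustrative assumptions and not part of this diff:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Engine settings as in the README example: 1M session length sharded over 4 GPUs.
backend_config = TurbomindEngineConfig(
        rope_scaling_factor=2.5,
        session_len=1048576,   # 1M context length
        max_batch_size=1,
        cache_max_entry_count=0.7,
        tp=4)                  # 4xA100-80G
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)

# Illustrative sampling settings (not from the diff); tune for your workload.
gen_config = GenerationConfig(max_new_tokens=1024, top_p=0.8, temperature=0.8)
response = pipe('Use a long prompt to replace this sentence', gen_config=gen_config)
print(response)
```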
@@ -69,6 +76,7 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 ### Import from Transformers
 
+Since Transformers does not support the 1M-long context, we only show non-long-context usage here.
 To load the InternLM2 7B Chat model using Transformers, use the following code:
 
 ```python
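The Transformers code block itself falls outside the hunk context. As a rough sketch only (standard `transformers` loading with `trust_remote_code`, not the exact snippet from the README), loading and chatting with the model typically looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2_5-7b-chat-1m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model = model.eval()

# InternLM2's remote modeling code exposes a chat() helper; check the model card for details.
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
```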
@@ -114,6 +122,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
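The `bash` block with the request itself is truncated by the hunk. A minimal sketch of a chat request against vLLM's OpenAI-compatible server, assuming the default host and port (these defaults are assumptions, not taken from the diff):

```python
import requests

# vLLM's OpenAI-compatible server listens on http://localhost:8000 by default.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "internlm2_5-7b-chat-1m",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Introduce deep learning in one sentence."}],
        "max_tokens": 256,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```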
@@ -164,6 +174,8 @@ InternLM2.5-7B-Chat-1M supports ultra-long-context inference of up to 1 million characters, and its performance is on par with In
 
 LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a full-featured toolkit for LLM compression, deployment, and serving.
 
+Here is an example of 1M-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -174,8 +186,13 @@ pip install lmdeploy
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
-
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
@@ -183,6 +200,8 @@ print(response)
 
 ### Loading via Transformers
 
+Since Transformers cannot support 1M-long-context inference, only non-long-context usage is demonstrated here.
+
 Load the InternLM2.5 7B Chat 1M model with the following code:
 
 ```python
@@ -228,6 +247,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash