Update README.md
README.md (CHANGED)
@@ -48,6 +48,8 @@ InternLM2.5-7B-Chat-1M is the 1M-long-context version of InternLM2.5-7B-Chat. Si
 
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 
+Here is an example of 1M-long-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -57,7 +59,12 @@ You can run batch inference locally with the following python code:
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
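The snippet above imports `GenerationConfig` without using it. For context, here is a minimal sketch of passing sampling parameters to the LMDeploy pipeline; the engine settings repeat the README's values, while the generation settings are illustrative assumptions and not part of this diff:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Engine settings as in the README example: 1M session length sharded over 4 GPUs.
backend_config = TurbomindEngineConfig(
        rope_scaling_factor=2.5,
        session_len=1048576,   # 1M context length
        max_batch_size=1,
        cache_max_entry_count=0.7,
        tp=4)                  # 4xA100-80G
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)

# Illustrative sampling settings (not from the diff); tune for your workload.
gen_config = GenerationConfig(max_new_tokens=1024, top_p=0.8, temperature=0.8)
response = pipe('Use a long prompt to replace this sentence', gen_config=gen_config)
print(response)
```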
@@ -69,6 +76,7 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 ### Import from Transformers
 
+Since Transformers does not support the 1M-long context, we only show non-long-context usage here.
 To load the InternLM2 7B Chat model using Transformers, use the following code:
 
 ```python
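The Transformers code block itself falls outside the hunk context. As a rough sketch only (standard `transformers` loading with `trust_remote_code`, not the exact snippet from the README), loading and chatting with the model typically looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2_5-7b-chat-1m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
model = model.eval()

# InternLM2's remote modeling code exposes a chat() helper; check the model card for details.
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
```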
@@ -114,6 +122,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
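The `bash` block with the request itself is truncated by the hunk. A minimal sketch of a chat request against vLLM's OpenAI-compatible server, assuming the default host and port (these defaults are assumptions, not taken from the diff):

```python
import requests

# vLLM's OpenAI-compatible server listens on http://localhost:8000 by default.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "internlm2_5-7b-chat-1m",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Introduce deep learning in one sentence."}],
        "max_tokens": 256,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```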
@@ -164,6 +174,8 @@ InternLM2.5-7B-Chat-1M supports ultra-long-context inference of up to 1 million characters, and its performance is on par with In
 
 LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a full-featured toolkit for LLM compression, deployment, and serving.
 
+Here is an example of 1M-context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -174,8 +186,13 @@ pip install lmdeploy
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
-
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
@@ -183,6 +200,8 @@ print(response)
 
 ### Loading via Transformers
 
+Since Transformers cannot support 1M-long-context inference, only non-long-context usage is demonstrated here.
+
 Load the InternLM2.5 7B Chat 1M model with the following code:
 
 ```python
@@ -228,6 +247,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash