# Using vLLM

## 1. Start vLLM with a model of your choice

```
python -m vllm.entrypoints.openai.api_server --model /home/hmp/llm/cache/Qwen1___5-32B-Chat --tensor-parallel-size 2 --dtype=half
```

This example serves a local model stored at `/home/hmp/llm/cache/Qwen1___5-32B-Chat`; change the path to match your own setup.

## 2. Test vLLM

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/home/hmp/llm/cache/Qwen1___5-32B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "How would you implement a decentralized controller?"}
        ]
    }'
```

(A Python version of this smoke test is sketched at the end of this page.)

## 3. Configure this project

```
API_KEY = "sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789"
LLM_MODEL = "vllm-/home/hmp/llm/cache/Qwen1___5-32B-Chat(max_token=4096)"
API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "http://localhost:8000/v1/chat/completions"}
```

The `LLM_MODEL` string has three parts (a parsing sketch follows at the end of this page):

```
"vllm-/home/hmp/llm/cache/Qwen1___5-32B-Chat(max_token=4096)"
where
"vllm-" is the prefix (required)
"/home/hmp/llm/cache/Qwen1___5-32B-Chat" is the model name (required)
"(max_token=4096)" is the configuration (optional)
```

## 4. Launch!

```
python main.py
```
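As a supplement to step 2, the same smoke test can be written with the `openai` Python client (v1.x), since vLLM exposes an OpenAI-compatible endpoint. This is a minimal sketch assuming the client package is installed; the key below is a placeholder and is only validated if the server was started with an API key configured.

```
# Minimal sketch of the step 2 smoke test using the `openai` Python client.
# Assumes `pip install openai` (v1.x) and the vLLM server from step 1.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server from step 1
    api_key="sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789",  # placeholder
)

response = client.chat.completions.create(
    model="/home/hmp/llm/cache/Qwen1___5-32B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How would you implement a decentralized controller?"},
    ],
)
print(response.choices[0].message.content)
```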
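To make the `LLM_MODEL` format from step 3 concrete, here is a hypothetical parser that splits the string into its model name and optional configuration. It only illustrates the format; the project has its own parsing logic, and `parse_vllm_model_string` is not part of it.

```
import re

def parse_vllm_model_string(s: str):
    """Hypothetical illustration of the LLM_MODEL format:
    'vllm-' prefix (required) + model name (required) + '(key=value,...)' (optional)."""
    m = re.fullmatch(r"vllm-(?P<model>[^(]+?)(?:\((?P<cfg>[^)]*)\))?", s)
    if m is None:
        raise ValueError(f"not a vllm model string: {s!r}")
    cfg = {}
    if m.group("cfg"):
        for pair in m.group("cfg").split(","):
            key, value = pair.split("=", 1)
            cfg[key.strip()] = value.strip()
    return m.group("model"), cfg

model, cfg = parse_vllm_model_string(
    "vllm-/home/hmp/llm/cache/Qwen1___5-32B-Chat(max_token=4096)"
)
print(model)  # /home/hmp/llm/cache/Qwen1___5-32B-Chat
print(cfg)    # {'max_token': '4096'}
```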