Commit dcfaa25 (parent: aaf74ae)
add serving section in readme
README.md CHANGED
@@ -55,7 +55,34 @@ huggingface-cli download internlm/internlm2_5-7b-chat-gguf internlm2_5-7b-chat-f
 
 You can use `llama-cli` for conducting inference. For a detailed explanation of `llama-cli`, please refer to [this guide](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)
 ```shell
-build/bin/llama-cli -m internlm2_5-7b-chat-fp16.gguf
+build/bin/llama-cli -m internlm2_5-7b-chat-fp16.gguf -ngl 32
 ```
 
-## Serving
+## Serving
+
+`llama.cpp` provides an OpenAI API compatible server - `llama-server`. You can deploy `internlm2_5-7b-chat-fp16.gguf` into a service like this:
+
+```shell
+./build/bin/llama-server -m ./internlm2_5-7b-chat-fp16.gguf -ngl 32
+```
+
+At the client side, you can access the service through OpenAI API:
+
+```python
+from openai import OpenAI
+client = OpenAI(
+    api_key='YOUR_API_KEY',
+    base_url='http://localhost:8080/v1'
+)
+model_name = client.models.list().data[0].id
+response = client.chat.completions.create(
+    model=model_name,
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": " provide three suggestions about time management"},
+    ],
+    temperature=0.8,
+    top_p=0.8
+)
+print(response)
+```
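For reference, the `client.chat.completions.create(...)` call added in this diff corresponds to a plain HTTP POST against the server's OpenAI-compatible `/v1/chat/completions` endpoint. Below is a minimal sketch of the equivalent JSON request body; the `"model"` value here is a placeholder, since in practice the served model id is fetched from `/v1/models` as the diff's example does:

```python
import json

# JSON body equivalent to the chat.completions.create(...) call in the diff.
# Send it as: POST http://localhost:8080/v1/chat/completions
# with headers:
#   Authorization: Bearer YOUR_API_KEY
#   Content-Type: application/json
payload = {
    "model": "internlm2_5-7b-chat",  # placeholder; query /v1/models for the real id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "provide three suggestions about time management"},
    ],
    "temperature": 0.8,
    "top_p": 0.8,
}
print(json.dumps(payload, indent=2))
```

Seeing the raw payload can help when debugging the service with tools like `curl` instead of the OpenAI client library.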