Spaces:
Sleeping
Using with open/local models
Use gpte
first with OpenAI models to get a feel for the gpte
tool.
Then go play with experimental Open LLMs π support and try not to get π₯!!
At the moment the best option for coding is still the use of gpt-4
models provided by OpenAI. But open models are catching up and are a good free and privacy-oriented alternative if you possess the proper hardware.
You can integrate gpt-engineer
with open-source models by leveraging an OpenAI-compatible API.
We provide the minimal and cleanest solution below. What is described is not the only way to use open/local models, but the one we tested and would recommend to most users.
More details on why the solution below is recommended in this blog post.
Setup
For inference engine we recommend for the users to use llama.cpp with its python
bindings llama-cpp-python
.
We choose llama.cpp
because:
- 1.) It supports the largest amount of hardware acceleration backends.
- 2.) It supports the diverse set of open LLMs.
- 3.) Is written in
python
and directly on top ofllama.cpp
inference engine. - 4.) Supports the
openAI
API andlangchain
interface.
To install llama-cpp-python
follow the official installation docs and those docs for MacOS with Metal support.
If you want to benefit from proper hardware acceleration on your machine make sure to set up the proper compiler flags before installing your package.
linux
:CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
macos
with Metal support:CMAKE_ARGS="-DLLAMA_METAL=on"
windows
:$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
This will enable the pip
installer to compile the llama.cpp
with the proper hardware acceleration backend.
Then run:
pip install llama-cpp-python
For our use case we also need to set up the web server that llama-cpp-python
library provides. To install:
pip install 'llama-cpp-python[server]'
For detailed use consult the llama-cpp-python
docs.
Before we proceed we need to obtain the model weights in the gguf
format. That should be a single file on your disk.
In case you have weights in other formats check the llama-cpp-python
docs for conversion to gguf
format.
Models in other formats ggml
, .safetensors
, etc. won't work without prior conversion to gguf
file format with the solution described below!
Which open model to use?
Your best choice would be:
- CodeLlama 70B
- Mixtral 8x7B
We are still testing this part, but the larger the model you can run the better. Sure the responses might be slower in terms of (token/s), but code quality will be higher.
For testing that the open LLM gpte
setup works we recommend starting with a smaller model. You can download weights of CodeLlama-13B-GGUF by the TheBloke
choose the largest model version you can run (for example Q6_K
), since quantisation will degrade LLM performance.
Feel free to try out larger models on your hardware and see what happens.
Running the Example
To see that your setup works check test open LLM setup.
If above tests work proceed π
For checking that gpte
works with the CodeLLama
we recommend for you to create a project with prompt
file content:
Write a python script that sums up two numbers. Provide only the `sum_two_numbers` function and nothing else.
Provide two tests:
assert(sum_two_numbers(100, 10) == 110)
assert(sum_two_numbers(10.1, 10) == 20.1)
Now run the LLM in separate terminal:
python -m llama_cpp.server --model $model_path --n_batch 256 --n_gpu_layers 30
Then in another terminal window set the following environment variables:
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLLama"
export LOCAL_MODEL=true
And run gpt-engineer
with the following command:
gpte <project_dir> $MODEL_NAME --lite --temperature 0.1
The --lite
mode is needed for now since open models for some reason behave worse with too many instructions at the moment. Temperature is set to 0.1
to get consistent best possible results.
That's it.
If sth. doesn't work as expected, or you figure out how to improve the open LLM support please let us know.
Using Open Router models
In case you don't posses the hardware to run local LLM's yourself you can use the hosting on Open Router and pay as you go for the tokens.
To set it up you need to Sign In and load purchase π° the LLM credits. Pricing per token is different for (each model](https://openrouter.ai/models), but mostly cheaper then Open AI.
Then create the API key.
To for example use Meta: Llama 3 8B Instruct (extended) with gpte
we need to set:
export OPENAI_API_BASE="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="sk-key-from-open-router"
export MODEL_NAME="meta-llama/llama-3-8b-instruct:extended"
export LOCAL_MODEL=true
gpte <project_dir> $MODEL_NAME --lite --temperature 0.1
Using Azure models
You set your Azure OpenAI key:
export OPENAI_API_KEY=[your api key]
Then you call gpt-engineer
with your service endpoint --azure https://aoi-resource-name.openai.azure.com
and set your deployment name (which you created in the Azure AI Studio) as the model name (last gpt-engineer
argument).
Example:
gpt-engineer --azure https://myairesource.openai.azure.com ./projects/example/ my-gpt4-project-name