## Triton Inference Server

To get the best inference performance for h2oGPT models, we will use the [FasterTransformer Backend for Triton](https://github.com/triton-inference-server/fastertransformer_backend/).

Make sure to [Set Up GPU Docker](README_DOCKER.md#setup-docker-for-gpus) first.
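
As a quick sanity check that the GPU runtime is wired up correctly (the CUDA image tag below is just an example), you can run `nvidia-smi` from inside a container:

```bash
# Should print the GPU table; if this fails, revisit the GPU Docker setup
# before building the Triton image.
docker run --rm --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```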

### Build Docker image for Triton with FasterTransformer backend

```bash
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
git clone https://github.com/NVIDIA/FasterTransformer.git
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
```
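
The build can take a while. Once it finishes, confirm the image exists:

```bash
# Lists the freshly built image (name and tag as defined above).
docker images ${TRITON_DOCKER_IMAGE}
```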

### Create model definition files

We convert the h2oGPT model from [HF to FT format](https://github.com/NVIDIA/FasterTransformer/pull/569):

#### Fetch model from Hugging Face
|

```bash
export MODEL=h2ogpt-oig-oasst1-512-6_9b
if [ ! -d ${MODEL} ]; then
    git lfs clone https://huggingface.co/h2oai/${MODEL}
fi
```

If `git lfs` fails, make sure to install it first. For Ubuntu:

```bash
sudo apt-get install git-lfs
git lfs install  # one-time setup of the Git LFS hooks
```

#### Convert to FasterTransformer format

```bash
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
# Enter the Triton container built above; MODEL is still set from the fetch step
docker run -it --rm --runtime=nvidia --shm-size=1g \
    --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} \
    -e CUDA_VISIBLE_DEVICES=0 \
    -e MODEL=${MODEL} \
    -e WORKSPACE=${WORKSPACE} \
    -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# The remaining commands run inside the container
export PYTHONPATH=${WORKSPACE}/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
    -i_g 1 \
    -m_n gptneox \
    -i ${WORKSPACE}/${MODEL} \
    -o ${WORKSPACE}/FT-${MODEL}
```
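
If the conversion succeeded, the output directory should contain the converted weights. A quick check (still inside the container; exact file names depend on the converter, but a `config.ini` plus `*.bin` weight files is the usual FasterTransformer layout):

```bash
# Expect config.ini and a set of *.bin weight files for the 1-GPU layout.
ls ${WORKSPACE}/FT-${MODEL}/1-gpu
```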

#### Test the FasterTransformer model

FIXME

```bash
# Write two sample prompts, one per line
echo "Hi, who are you?" > gptneox_input
echo "And you are?" >> gptneox_input
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/gptneox_example.py \
    --ckpt_path ${WORKSPACE}/FT-${MODEL}/1-gpu \
    --tokenizer_path ${WORKSPACE}/${MODEL} \
    --sample_input_file gptneox_input
```

#### Update Triton configuration files

Fix a typo in the example (the preprocessing model's `config.pbtxt` is mislabeled `postprocessing`):

```bash
sed -i -e 's@postprocessing@preprocessing@' all_models/gptneox/preprocessing/config.pbtxt
```
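
To confirm the rename took effect (assuming the stock config starts with a `name:` field, as Triton configs usually do):

```bash
# The model name should now read "preprocessing".
grep -m1 'name:' all_models/gptneox/preprocessing/config.pbtxt
```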

Update the checkpoint path to point at the converted model, and set it to use 1 GPU:

```bash
sed -i -e "s@/workspace/ft/models/ft/gptneox/@${WORKSPACE}/FT-${MODEL}/1-gpu@" all_models/gptneox/fastertransformer/config.pbtxt
sed -i -e 's@string_value: "2"@string_value: "1"@' all_models/gptneox/fastertransformer/config.pbtxt
```
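
To eyeball the result (the parameter names `model_checkpoint_path` and `tensor_para_size` are taken from the stock FasterTransformer backend config; double-check them in your copy):

```bash
# The checkpoint path should point at FT-${MODEL}/1-gpu and the
# GPU count should now be "1".
grep -A2 -e 'model_checkpoint_path' -e 'tensor_para_size' all_models/gptneox/fastertransformer/config.pbtxt
```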

#### Launch Triton

```bash
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 \
    --allow-run-as-root /opt/tritonserver/bin/tritonserver \
    --model-repository=${WORKSPACE}/all_models/gptneox/ &
```

Now, you should see something like this:

```bash
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+
```

which means the pipeline is ready to make predictions!
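
Since the server runs in the background, you can also confirm readiness over HTTP (Triton serves its REST API on port 8000 by default):

```bash
# Returns HTTP 200 once all models are loaded.
curl -sf localhost:8000/v2/health/ready && echo "Triton is ready"
```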

### Run client test

Let's test the endpoint:

```bash
python3 ${WORKSPACE}/tools/gpt/identity_test.py
```

And now the end-to-end test. We first have to fix a bug in the inputs for postprocessing:

```bash
sed -i -e 's@prepare_tensor("RESPONSE_INPUT_LENGTHS", output2, FLAGS.protocol)@prepare_tensor("sequence_length", output1, FLAGS.protocol)@' ${WORKSPACE}/tools/gpt/end_to_end_test.py
```

Then run it:

```bash
python3 ${WORKSPACE}/tools/gpt/end_to_end_test.py
```
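
To explore the deployed pipeline further, the model metadata endpoint shows the exact input and output tensors the ensemble expects:

```bash
# Pretty-print the ensemble's input/output signature.
curl -s localhost:8000/v2/models/ensemble | python3 -m json.tool
```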