test / docs /TRITON.md
iblfe's picture
Upload folder using huggingface_hub
b585c7f verified

A newer version of the Gradio SDK is available: 5.6.0

Upgrade

Triton Inference Server

To get optimal performance for inference for h2oGPT models, we will be using the FastTransformer Backend for Triton.

Make sure to Set Up GPU Docker first.

Build Docker image for Triton with FasterTransformer backend:

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
git clone https://github.com/NVIDIA/FasterTransformer.git
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .

Create model definition files

We convert the h2oGPT model from HF to FT format:

Fetch model from Hugging Face

export MODEL=h2ogpt-oig-oasst1-512-6_9b
if [ ! -d ${MODEL} ]; then
    git lfs clone https://huggingface.co/h2oai/${MODEL}
fi

If git lfs fails, make sure to install it first. For Ubuntu:

sudo apt-get install git-lfs

Convert to FasterTransformer format

export WORKSPACE=$(pwd)
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
# Go into Docker
docker run -it --rm --runtime=nvidia --shm-size=1g \
       --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} \
       -e CUDA_VISIBLE_DEVICES=0 \
       -e MODEL=${MODEL} \
       -e WORKSPACE=${WORKSPACE} \
       -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
export PYTHONPATH=${WORKSPACE}/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
        -i_g 1 \
        -m_n gptneox \
        -i ${WORKSPACE}/${MODEL} \
        -o ${WORKSPACE}/FT-${MODEL}

Test the FasterTransformer model

FIXME

echo "Hi, who are you?" > gptneox_input
echo "And you are?" >> gptneox_input
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/gptneox_example.py \
         --ckpt_path ${WORKSPACE}/FT-${MODEL}/1-gpu \
         --tokenizer_path ${WORKSPACE}/${MODEL} \
         --sample_input_file gptneox_input

Update Triton configuration files

Fix a typo in the example:

sed -i -e 's@postprocessing@preprocessing@' all_models/gptneox/preprocessing/config.pbtxt

Update the path to the PyTorch model, and set to use 1 GPU:

sed -i -e "s@/workspace/ft/models/ft/gptneox/@${WORKSPACE}/FT-${MODEL}/1-gpu@" all_models/gptneox/fastertransformer/config.pbtxt
sed -i -e 's@string_value: "2"@string_value: "1"@' all_models/gptneox/fastertransformer/config.pbtxt

Launch Triton

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 \
        --allow-run-as-root /opt/tritonserver/bin/tritonserver  \
        --model-repository=${WORKSPACE}/all_models/gptneox/ &

Now, you should see something like this:

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+

which means the pipeline is ready to make predictions!

Run client test

Let's test the endpoint:

python3 ${WORKSPACE}/tools/gpt/identity_test.py

And now the end-to-end test:

We first have to fix a bug in the inputs for postprocessing:

sed -i -e 's@prepare_tensor("RESPONSE_INPUT_LENGTHS", output2, FLAGS.protocol)@prepare_tensor("sequence_length", output1, FLAGS.protocol)@' ${WORKSPACE}/tools/gpt/end_to_end_test.py
python3 ${WORKSPACE}/tools/gpt/end_to_end_test.py