## Instructions to run end-to-end demo
## Chapters
[I. Installation of KServe & its dependencies](#installation-of-kserve--its-dependencies)
[II. Setting up local MinIO S3 storage](#setting-up-local-minio-s3-storage)
[III. Setting up your OpenShift AI workbench](#setting-up-your-openshift-ai-workbench)
[IV. Train model and evaluate](#train-model-and-evaluate)
[V. Convert model to Caikit format and save to S3 storage](#convert-model-to-caikit-format-and-save-to-s3-storage)
[VI. Deploy model onto Caikit-TGIS Serving Runtime](#deploy-model-onto-caikit-tgis-serving-runtime)
[VII. Model inference](#model-inference)
**Prerequisites**
* To support training and inference, your cluster needs a node with sufficient CPUs and memory, and 4 GPUs (a quick CLI check is sketched after this list). Instructions to add GPU support to RHOAI can be found [here](https://docs.google.com/document/d/1T2oc-KZRMboUVuUSGDZnt3VRZ5s885aDRJGYGMkn_Wo/edit#heading=h.9xmhoufikqid).
* You have cluster administrator permissions
* You have installed the OpenShift CLI (`oc`)
* You have installed the `Red Hat OpenShift Service Mesh Operator`
* You have installed the `Red Hat OpenShift Serverless Operator`
* You have installed the `Red Hat OpenShift AI Operator` and created a **DataScienceCluster** object
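Before proceeding, you can sanity-check the prerequisites from the CLI. This is a sketch; operator package names and the GPU resource label can vary across versions:
```
# Look for the Service Mesh, Serverless, and OpenShift AI operators
oc get csv -A | grep -Ei 'servicemesh|serverless|rhods'
# List allocatable NVIDIA GPUs per node
oc get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```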
### Installation of KServe & its dependencies
Instructions adapted from [Manually installing KServe](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2-latest/html/serving_models/serving-large-models_serving-large-models#manually-installing-kserve_serving-large-models)
1. Git clone this repository
```
git clone https://github.com/trustyai-explainability/trustyai-detoxify-sft.git
```
2. Log in to your OpenShift cluster as a cluster administrator
```
oc login --token=<token>
```
3. Create the required namespace for Red Hat OpenShift Service Mesh
```
oc create ns istio-system
```
4. Create a `ServiceMeshControlPlane` object
```
oc apply -f manifests/kserve/smcp.yaml -n istio-system
```
5. Sanity check to verify creation of the service mesh instance
```
oc get pods -n istio-system
```
Expected output:
```
NAME READY STATUS RESTARTS AGE
istio-egressgateway-7c46668687-fzsqj 1/1 Running 0 22h
istio-ingressgateway-77f94d8f85-fhsp9 1/1 Running 0 22h
istiod-data-science-smcp-cc8cfd9b8-2rkg4 1/1 Running 0 22h
```
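You can also wait on the control plane resource directly instead of watching pods. The instance name `data-science-smcp` is an assumption inferred from the `istiod-data-science-smcp-...` pod above; it should match what `manifests/kserve/smcp.yaml` defines:
```
oc wait smcp/data-science-smcp -n istio-system --for=condition=Ready --timeout=300s
```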
6. Create the required namespace for a `KnativeServing` instance
```
oc create ns knative-serving
```
7. Create a `ServiceMeshMember` object
```
oc apply -f manifests/kserve/default-smm.yaml -n knative-serving
```
8. Create and define a `KnativeServing` object
```
oc apply -f manifests/kserve/knativeserving-istio.yaml -n knative-serving
```
9. Sanity check to validate creation of the Knative Serving instance
```
oc get pods -n knative-serving
```
Expected output:
```
NAME READY STATUS RESTARTS AGE
activator-7586f6f744-nvdlb 2/2 Running 0 22h
activator-7586f6f744-sd77w 2/2 Running 0 22h
autoscaler-764fdf5d45-p2v98 2/2 Running 0 22h
autoscaler-764fdf5d45-x7dc6 2/2 Running 0 22h
autoscaler-hpa-7c7c4cd96d-2lkzg 1/1 Running 0 22h
autoscaler-hpa-7c7c4cd96d-gks9j 1/1 Running 0 22h
controller-5fdfc9567c-6cj9d 1/1 Running 0 22h
controller-5fdfc9567c-bf5x7 1/1 Running 0 22h
domain-mapping-56ccd85968-2hjvp 1/1 Running 0 22h
domain-mapping-56ccd85968-lg6mw 1/1 Running 0 22h
domainmapping-webhook-769b88695c-gp2hk 1/1 Running 0 22h
domainmapping-webhook-769b88695c-npn8g 1/1 Running 0 22h
net-istio-controller-7dfc6f668c-jb4xk 1/1 Running 0 22h
net-istio-controller-7dfc6f668c-jxs5p 1/1 Running 0 22h
net-istio-webhook-66d8f75d6f-bgd5r 1/1 Running 0 22h
net-istio-webhook-66d8f75d6f-hld75 1/1 Running 0 22h
webhook-7d49878bc4-8xjbr 1/1 Running 0 22h
webhook-7d49878bc4-s4xx4 1/1 Running 0 22h
```
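Likewise, you can block until the Knative Serving instance reports ready, assuming the resource created by `manifests/kserve/knativeserving-istio.yaml` is named `knative-serving`:
```
oc wait knativeserving/knative-serving -n knative-serving --for=condition=Ready --timeout=300s
```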
10. From the web console, install KServe by going to **Operators -> Installed Operators** and click on the **Red Hat OpenShift AI Operator**
11. Click on the **DSC Initialization** tab and click on the **default-dsci** object
12. Click on the **YAML** tab and in the `spec` section, change the `serviceMesh.managementState` to `Unmanaged`
```
spec:
serviceMesh:
managementState: Unmanaged
```
13. Click **Save**
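If you prefer the CLI, a merge patch achieves the same edit (assuming the default resource name `default-dsci`):
```
oc patch dscinitialization default-dsci --type=merge -p '{"spec":{"serviceMesh":{"managementState":"Unmanaged"}}}'
```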
14. Click on the **Data Science Cluster** tab and click on the **default-dsc** object
15. Click on the **YAML** tab and in the `spec` section, change the `components.kserve.managementState` and the `components.kserve.serving.managementState` to `Managed`
```
spec:
components:
kserve:
managementState: Managed
serving:
managementState: Managed
```
16. Click **Save**
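The equivalent CLI patch, assuming the default resource name `default-dsc`:
```
oc patch datasciencecluster default-dsc --type=merge -p '{"spec":{"components":{"kserve":{"managementState":"Managed","serving":{"managementState":"Managed"}}}}}'
```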
### Setting up local MinIO S3 storage
1. Create a namespace for your project called "detoxify-sft"
```
oc create namespace detoxify-sft
```
2. Set up your local MinIO S3 storage in your newly created namespace
```
oc apply -f manifests/minio/setup-s3.yaml -n detoxify-sft
```
3. Run the following sanity checks
```
oc get pods -n detoxify-sft | grep "minio"
```
Expected output:
```
NAME READY STATUS RESTARTS AGE
minio-7586f6f744-nvdl 1/1 Running 0 22h
```
```
oc get route -n detoxify-sft | grep "minio"
```
Expected output:
```
NAME STATUS LOCATION SERVICE
minio-api Accepted https://minio-api... minio-service
minio-ui Accepted https://minio-ui... minio-service
```
4. Get the MinIO UI location URL and open it in a web browser
```
oc get route minio-ui -n detoxify-sft
```
5. Login using the credentials in `manifests/minio/setup-s3.yaml`
**user**: `minio`
**password**: `minio123`
6. Click **Create a Bucket**, choose a name for your bucket, and click **Create Bucket**
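Alternatively, the bucket can be created from the command line with the MinIO client `mc`, if you have it installed locally. `<bucket-name>` is a placeholder for the name you chose:
```
MINIO_HOST=$(oc get route minio-api -n detoxify-sft -o jsonpath='{.spec.host}')
# Add --insecure if the route serves a self-signed certificate
mc alias set local "https://${MINIO_HOST}" minio minio123
mc mb local/<bucket-name>
```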
### Setting up your OpenShift AI workbench
1. Go to Red Hat OpenShift AI from the web console
2. Click on **Data Science Projects** and then click on **Create data science project**
3. Give your project a name and then click **Create**
4. Click on the **Workbenches** tab and then create a workbench with a PyTorch notebook image, set the container size to Large, and select a single NVIDIA GPU. Click on **Create Workbench**
5. Click on **Add data connection** to create a matching data connection for MinIO
6. Fill out the required fields and then click on **Add data connection**
7. Once your workbench status changes from **Starting** to **Running**, click on **Open** to open JupyterHub in a web browser
8. In your JupyterHub environment, launch a terminal and clone this project
```
git clone https://github.com/trustyai-explainability/trustyai-detoxify-sft.git
```
9. Go into the `notebooks` directory
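For example, assuming the clone landed in a directory named after the repository:
```
cd trustyai-detoxify-sft/notebooks
```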
### Train model and evaluate
1. Open the `01-sft.ipynb` file
2. Run each cell in the notebook
3. Once the model is trained and uploaded to the Hugging Face Hub, open the `02-eval.ipynb` file and run each cell to compare the model trained on raw input-output pairs with the one trained on detoxified prompts
### Convert model to Caikit format and save to S3 storage
1. Open the `03-save_convert_model.ipynb` file and run each cell in the notebook to convert the model to Caikit format and save it to a MinIO bucket
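To confirm the converted model landed in S3, you can list the path from a terminal where the `mc` alias from the MinIO section is configured (replace `<bucket-name>`; the path matches the one used in the deployment step below):
```
mc ls -r local/<bucket-name>/models/opt-350m-caikit
```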
### Deploy model onto Caikit-TGIS Serving Runtime
1. In the OpenShift AI dashboard, navigate to the project details page and click the **Models** tab
2. In the **Single-model serving platform** tile, click **Deploy model**. Provide the following values:
**Model Name**: `opt-350m-caikit`
**Serving Runtime**: `Caikit-TGIS Serving Runtime`
**Model framework**: `caikit`
**Existing data connection**: `My Storage`
**Path**: `models/opt-350m-caikit`
3. Click **Deploy**
4. Increase the `initialDelaySeconds` of the readiness and liveness probes so the runtime has enough time to load the model
```
oc patch template caikit-tgis-serving-template --type=merge -p '{"spec":{"containers":[{"readinessProbe":{"initialDelaySeconds":300},"livenessProbe":{"initialDelaySeconds":300}}]}}'
```
5. Wait for the model **Status** to show a green checkmark
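The same status can be watched from the CLI, assuming the `InferenceService` takes the model name and lives in the `detoxify-sft` namespace:
```
# READY should report True once the predictor is up
oc get inferenceservice opt-350m-caikit -n detoxify-sft
```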
### Model inference
1. Return to the JupyterHub environment to test out the deployed model
2. Click on `04-inference_request.ipynb` and run each cell to make an inference request to the detoxified model
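If you want to test outside the notebook, a REST request against the Caikit runtime's `/api/v1/task/text-generation` endpoint should also work. This sketch assumes the `InferenceService` is named `opt-350m-caikit` and that the model name serves as the `model_id`:
```
ENDPOINT=$(oc get inferenceservice opt-350m-caikit -n detoxify-sft -o jsonpath='{.status.url}')
# -k skips TLS verification for self-signed certificates
curl -k "${ENDPOINT}/api/v1/task/text-generation" \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "opt-350m-caikit", "inputs": "Hello, how are you?"}'
```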