Serve a model using Triton Inference Server

Serve a BERT model using NVIDIA Triton Inference Server.

Prerequisites

An active Charmed Kubeflow deployment. For installation instructions, follow the Get started tutorial.

Refresh the knative-serving charm

Upgrade the knative-serving charm to the latest/edge channel:

juju refresh knative-serving --channel=latest/edge

Wait until the charm is in active status. You can watch the status with:

juju status --watch 5s
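
Alternatively, if your Juju client supports the wait-for command, you can block until the charm reports active instead of watching (a hedged example; the query syntax may vary between Juju versions):

juju wait-for application knative-serving --query='status=="active"'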

Create a Notebook

Create a Kubeflow Jupyter Notebook. The Notebook will be your workspace from which you run the commands. Running the commands in this guide requires in-cluster communication, so they won’t work outside of the Notebook environment.

The image for the Notebook can be anything since we will only be using the CLI. You can leave it as the default.

See more: Explore components | Create a Kubeflow Notebook

Connect to the Notebook and start a new terminal from the Launcher.

Use this terminal session to run the commands in the next sections.
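
To confirm that the terminal session can reach the cluster, you can run a quick sanity check with kubectl (replace <namespace> with your user namespace):

kubectl get pods -n <namespace>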

Create the InferenceService

Define a new InferenceService yaml for the BERT model with the following content:

cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

Disable istio sidecar

In the ISVC yaml, make sure to add the annotation "sidecar.istio.io/inject": "false" as done in the example above.

Due to issue GH 216, you will not be able to reach the ISVC without disabling istio sidecar injection.
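
Once the ISVC pods are running, you can confirm the sidecar was not injected by listing the containers of its pods (a hedged example; KServe typically labels ISVC pods with serving.kserve.io/inferenceservice):

kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=bert-v2 -o jsonpath='{.items[*].spec.containers[*].name}'

The output should not include an istio-proxy container.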

GPU Scheduling

For running on GPU, specify the GPU resources in the ISVC yaml. For example, to run the predictor on NVIDIA GPU:

cat <<EOF > "./isvc-gpu.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specify GPU limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

See more: Kubernetes | Schedule GPUs
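
Before scheduling on GPU, you can check that your nodes advertise the nvidia.com/gpu resource (this requires the NVIDIA device plugin to be installed on the cluster):

kubectl describe nodes | grep nvidia.com/gpu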

Modify the ISVC yaml to set the node selector, node affinity, or tolerations to match your GPU node.

The following is an example ISVC yaml with node scheduling attributes:

cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
spec:
  transformer:
    containers:
      - name: kserve-container      
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    nodeSelector:
      myLabel1: "true"
    tolerations:
      - key: "myTaint1"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    triton:
      runtimeVersion: 20.10-py3
      resources:      # specify GPU limits and vendor
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF

This example sets nodeSelector and tolerations for the predictor. You can set affinity in the same way.
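
The label and taint in this example are placeholders. Assuming you manage the GPU node yourself, you could label and taint it to match with something like (hypothetical node name, label, and taint):

kubectl label nodes <gpu-node-name> myLabel1=true

kubectl taint nodes <gpu-node-name> myTaint1=true:NoSchedule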

Apply the ISVC to your namespace with kubectl

kubectl apply -f ./isvc.yaml -n <namespace>

Since we are using the CLI from inside a Notebook, kubectl is using the ServiceAccount credentials of the Notebook pod.
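
If applying the ISVC fails with an authorization error, you can check what the Notebook’s ServiceAccount is allowed to do (a quick check using kubectl auth):

kubectl auth can-i create inferenceservices -n <namespace>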

Wait until the InferenceService is in Ready state. It can take a few minutes to become Ready because the Triton image is large. You can check on the state with:

kubectl get inferenceservice bert-v2 -n <namespace>

You should see an output similar to this:

NAME      URL                                           READY   AGE
bert-v2   http://bert-v2.default.10.64.140.43.nip.io   True    71s
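
Instead of polling, you can also block until the ISVC reports Ready (a minimal sketch using kubectl wait; adjust the timeout to your environment):

kubectl wait --for=condition=Ready inferenceservice/bert-v2 -n <namespace> --timeout=600s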

Perform inference

Get the ISVC’s status.address.url

URL=$(kubectl get inferenceservice bert-v2 -n <namespace> -o jsonpath='{.status.address.url}')
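
You can print the variable to confirm it was populated; for a serverless deployment this is typically a cluster-local address whose hostname depends on your namespace:

echo ${URL}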

Make a request to the ISVC’s URL

  • Prepare the inference input:

cat <<EOF > "./input.json"
{
  "instances": [
    "What President is credited with the original notion of putting Americans in space?"
  ]
}
EOF

  • Make a prediction request:

curl -v -H "Content-Type: application/json" ${URL}/v1/models/bert-v2:predict -d @./input.json

The response will contain the prediction output:

{"predictions": "John F. Kennedy", "prob": 77.91851169430718}