This guide describes how to deploy NVIDIA Inference Microservice (NIM) on Charmed Kubeflow (CKF) and serve a model with KServe, a component of CKF. A NIM is a containerized inference microservice for running LLMs, distributed from NVIDIA container registry (NGC).
The guide uses Kubeflow v1.9, Kubernetes v1.29, and Juju v3.4. See Supported versions for more details on compatibility among these.
Requirements
- An active CKF deployment and access to the Kubeflow dashboard. See the Get started tutorial for more details.
- A GPU compatible with the model downloaded from NGC. See the model details for further information. This guide is tested using an NVIDIA A100 GPU.
- An NVIDIA NGC API key. See create an NVIDIA account and create an API key for more details.
When creating an API key, ensure that “NGC Catalog” is selected from the “Services Included” dropdown.
Configure MicroK8s GPU add-on
Install NVIDIA GPU drivers
See MicroK8s Add-on: gpu documentation to install the NVIDIA drivers and verify that drivers are loaded.
This step is necessary due to this issue with the MicroK8s GPU add-on.
Enable MicroK8s GPU add-on
Enable MicroK8s GPU add-on to enable running NVIDIA GPU workloads.
microk8s enable gpu
Check MicroK8s status as follows:
microk8s status --wait-ready
Wait until the output shows microk8s is running and the gpu add-on is listed under enabled.
Create a Kubeflow notebook
Create a Kubeflow notebook. This notebook is the workspace from which you run commands.
Choose the default notebook image since you will be only using the Command Line Interface (CLI).
Running commands in this guide requires in-cluster communication and instructions won’t work outside of the notebook environment.
Connect to the notebook and start a new terminal from the launcher as shown below:
Use this terminal session to run the commands in the next sections.
Create Kubernetes secrets
From the terminal, export your NGC API key. It will be used in the next steps to create the required K8s secrets:
export NGC_API_KEY=<your_key>
Create a Docker config K8s secret with the NGC API key to pull the image from the NGC private Docker registry as follows:
kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
Create an opaque K8s secret with the NGC API key to launch NIMs:
kubectl create secret generic nvidia-nim-secret --from-literal=NGC_API_KEY=$NGC_API_KEY
Create Serving Runtime
Create the KServe Serving Runtime YAML to be used as the runtime for the NIMs as follows:
cat <<EOF > "./runtime.yaml"
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
name: nvidia-nim-llama3-8b-instruct-1.0.0
spec:
annotations:
prometheus.kserve.io/path: /metrics
prometheus.kserve.io/port: "8000"
serving.kserve.io/enable-metric-aggregation: "true"
serving.kserve.io/enable-prometheus-scraping: "true"
containers:
- env:
- name: NIM_CACHE_PATH
value: /tmp
- name: NGC_API_KEY
valueFrom:
secretKeyRef:
name: nvidia-nim-secret
key: NGC_API_KEY
image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
name: kserve-container
ports:
- containerPort: 8000
protocol: TCP
resources:
limits:
cpu: "12"
memory: 32Gi
requests:
cpu: "12"
memory: 32Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
imagePullSecrets:
- name: ngc-secret
protocolVersions:
- v2
- grpc-v2
supportedModelFormats:
- autoSelect: true
name: nvidia-nim-llama3-8b-instruct
priority: 1
version: "1.0.0"
volumes:
- emptyDir:
medium: Memory
sizeLimit: 16Gi
name: dshm
EOF
Apply the YAML file to your namespace:
kubectl apply -f runtime.yaml
The runtime above is inspired by the runtimes published in the NVIDIA/nim-deploy repository. Check it out for more details on available NIM runtimes.
This guide deviates from NVIDIA’s runtime YAMLs by setting the NIM_CACHE_PATH
to /tmp
. This enforces the NIM container to download the model in memory instead of using a PVC and avoids this issue in KServe.
Create Inference Service
Define a new Inference Service YAML file using the LLama3 runtime created in the previous step:
cat <<EOF > "./isvc.yaml"
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
annotations:
autoscaling.knative.dev/target: "10"
name: llama3-8b-instruct-1xgpu
spec:
predictor:
minReplicas: 1
model:
modelFormat:
name: nvidia-nim-llama3-8b-instruct
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
runtime: nvidia-nim-llama3-8b-instruct-1.0.0
EOF
Apply the YAML file to your namespace:
kubectl apply -f ./isvc.yaml
Wait until Inference Service is in Ready
state.
This process can take up to 10 minutes because of pulling the large-size NIMs image and model.
You can check its state with:
kubectl get inferenceservice llama3-8b-instruct-1xgpu
You should expect an output similar to this:
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
llama3-8b-instruct-1xgpu http://llama3-8b-instruct-1xgpu.admin.10.64.140.43.nip.io True 100 llama3-8b-instruct-1xgpu-predictor-00001 16m
Make a request to the Inference Service
Get the Inference Service status.address.url
and save it to a variable:
URL=$(kubectl get inferenceservice llama3-8b-instruct-1xgpu -o jsonpath='{.status.address.url}')
Make a request to the Inference Service URL:
curl $URL/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta/llama3-8b-instruct",
"messages": [{"role":"user","content":"What is Kubeflow?"}]
You should expect an output similar to this:
{"id":"cmpl-c2ec0a9bf1d64172975992f8236fc166","object":"chat.completion","created":1729157204,"model":"meta/llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Kubeflow is an open-source platform for Machine Learning (ML) on Kubernetes. It allows data scientists and developers to easily build, deploy, and manage machine learning workloads on Kubernetes, a container orchestration system...."},"logprobs":null,"finish_reason":"stop","stop_reason":128009}],"usage":{"prompt_tokens":16,"total_tokens":426,"completion_tokens":410}}