Charmed Kubeflow 1.6.1

Hi,

Is anyone able to confirm if Charmed Kubeflow will be updated to 1.6.1?

There are major bugs in 1.6.0 which make it unusable in a production environment.

The notebook culling operation is causing huge issues for us: when notebooks are culled or stopped, they cannot be relaunched. This has been patched in 1.6.1.

https://github.com/canonical/bundle-kubeflow/issues/512 https://github.com/kubeflow/kubeflow/issues/6648
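For reference, a quick way to check whether the culler (or a manual stop) has stopped a given notebook is to look for the kubeflow-resource-stopped annotation it sets; the notebook name and namespace below are placeholders:

kubectl describe notebook <notebook-name> -n <user-namespace> | grep kubeflow-resource-stopped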


Hi,

Just bumping this thread.

Hey @ollienuk,

Sorry for the delays, but we’re in the final stages of testing. We have what we think is a complete 1.6.1 now in the 1.6/edge channel (eg: juju deploy kubeflow --channel 1.6/edge --trust). I have an instance running right now, but I’m waiting for it to snooze my notebook so I can confirm the fixes are working correctly. A colleague of mine is testing the upgrade path (1.6 -> 1.6.1) right now as well.

You’re welcome to try it out from 1.6/edge or wait till hopefully Monday for us to push everything to 1.6/stable.
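In case it's useful, a minimal sketch of trying it out (the deploy command is the one quoted above; checking juju status afterwards is just a suggestion for seeing when the charms settle):

# Deploy the 1.6.1 candidate from the edge channel
juju deploy kubeflow --channel 1.6/edge --trust

# Check the model until all applications report active (suggested follow-up)
juju status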

Sorry for the wait! Thanks for your patience.

Hey @ollienuk, Charmed Kubeflow 1.6.1 has now been pushed to the charmhub 1.6/stable channel. This should fix the bugs you were facing.

Sorry for the delay in the release. Thanks for your patience!
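If you already have 1.6 deployed, my understanding is that the individual charms can be moved to the new channel with juju refresh; treat the application name below as an example rather than a full upgrade procedure:

# Hedged sketch: refresh one of the bundle's charms to 1.6/stable
# (kubeflow-profiles is just an example application; repeat for the others as appropriate)
juju refresh kubeflow-profiles --channel 1.6/stable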

Hi @ca-scribner,

Thanks for the info.

I cannot see it live on charmhub at the moment.

Update:

I was able to update regardless of the website :wink:

Thanks for your help.

Ah sorry, this is a confusing bookkeeping thing with bundles and charm channels.

Our bundles pin to charm channels (eg: kubeflow:1.6/stable pins to kubeflow-profiles:1.6/stable), so today when we pushed updates to all the charms changed in 1.6.1, the bundle implicitly got those updates too. But because we did not change the bundle file itself, the charmhub page for the bundle shows no changes to the bundle’s definition.

It definitely makes things harder to interpret.
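One way to confirm what a running deployment is actually tracking, regardless of what the bundle page shows, is juju status; the Charm, Channel and Rev columns list what each application currently has:

# Show the charm, channel and revision for every application in the model
juju status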

@ca-scribner Ah I see thanks for the clarification.

I’m afraid the issue I mentioned above is still happening on the latest version.

I’ve created a notebook and stopped it, and when I try to start it again it just loads indefinitely.


@ollienuk sorry not sure what is going on here. Can you provide:

kubectl describe notebook dead
kubectl describe pod (pod-of-dead)
kubectl get events

and maybe the logs from the notebook controller (I can’t remember what the pod name would be, but it should be visible in kubectl get pods -n kubeflow). If you see multiple pods, one will be for the charm and one for the actual controller; we want the latter, but feel free to post both.

@ca-scribner After the notebook has been stopped for the first time, it cannot be described.

Oh, I think the issue here is that the notebook object (and the pods for the notebook) would not be in kubeflow (i.e. the Kubeflow control plane namespace) but in the namespace of the user who owns that notebook.

The notebook controller would be in -n kubeflow, and its logs might help us understand why the notebook pod is not coming back online.
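Putting that together, a sketch of the diagnostics with placeholder names (the notebook and its resources are in the user's namespace, the controller is in kubeflow):

# Notebook object and related events live in the user's namespace
kubectl describe notebook <notebook-name> -n <user-namespace>
kubectl get events -n <user-namespace>

# The notebook controller runs in the kubeflow namespace; find its pod, then pull its logs
kubectl get pods -n kubeflow
kubectl logs <notebook-controller-pod> -n kubeflow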

Thanks, please see below:

ubuntu@control-1:~$ kubectl describe notebook dead -n onixon
Name:         dead
Namespace:    onixon
Labels:       access-ml-pipeline=true
              app=dead
Annotations:  kubeflow-resource-stopped: 2022-12-13T15:31:44Z
              notebooks.kubeflow.org/server-type: jupyter
API Version:  kubeflow.org/v1
Kind:         Notebook
Metadata:
  Creation Timestamp:  2022-12-13T14:03:22Z
  Generation:          2
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:template:
          f:spec:
            f:containers:
    Manager:      manager
    Operation:    Update
    Time:         2022-12-13T14:27:21Z
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubeflow-resource-stopped:
          f:notebooks.kubeflow.org/server-type:
        f:labels:
          .:
          f:access-ml-pipeline:
          f:app:
      f:spec:
        .:
        f:template:
          .:
          f:spec:
            .:
            f:serviceAccountName:
            f:volumes:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2022-12-13T15:31:44Z
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:containerState:
        f:readyReplicas:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-12-13T15:32:56Z
  Resource Version:  30879166
  UID:               b1ff7a82-8f26-42a2-a3bb-334490555b12
Spec:
  Template:
    Spec:
      Containers:
        Image:              kubeflownotebookswg/jupyter-scipy:v1.6.1
        Image Pull Policy:  IfNotPresent
        Name:               dead
        Resources:
          Limits:
            Cpu:     600m
            Memory:  1288490188800m
          Requests:
            Cpu:     500m
            Memory:  1Gi
        Volume Mounts:
          Mount Path:  /dev/shm
          Name:        dshm
          Mount Path:  /home/jovyan
          Name:        dead-volume
      Service Account Name:  default-editor
      Volumes:
        Empty Dir:
          Medium:  Memory
        Name:      dshm
        Name:      dead-volume
        Persistent Volume Claim:
          Claim Name:  dead-volume
Status:
  Conditions:
  Container State:
  Ready Replicas:  0
Events:

I cannot see any pods running for ‘dead’:

ubuntu@control-1:~$ kubectl get pods -n onixon
NAME                                               READY   STATUS        RESTARTS        AGE
ml-pipeline-ui-artifact-5cfb68f5b7-d6ll2           2/2     Running       4 (6d17h ago)   53d
ml-pipeline-visualizationserver-665bb6b8fc-qmsxb   2/2     Terminating   2 (53d ago)     53d
ml-pipeline-visualizationserver-665bb6b8fc-nrr76   2/2     Running       0               6d17h

The kubectl get events output only returns events about PVCs not being deleted in other namespaces by other users:

Warning  VolumeFailedDelete  persistentvolume/pvc-2b59a481-c08a-469f-8bf7-be9a0ed3f238  rpc error: code = DeadlineExceeded desc = context deadline exceeded

I’m not sure what is going on atm. What do the notebook controller logs say (kubectl logs (notebook-controller-pod) -n kubeflow)?

That there are no pods for dead could make sense. I should have asked for any StatefulSet or Deployment for it (I think it should exist and be scaled down to 0). A describe of that parent object might help us.
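For example, something like the following should surface the parent object (the controller normally creates a StatefulSet named after the notebook, scaled to 0 while it is stopped; names below are placeholders):

kubectl get statefulsets -n <user-namespace>
kubectl describe statefulset <notebook-name> -n <user-namespace>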

Hi @ca-scribner,

I can confirm the issue was fixed.

I had a separate permissions issue with our NFS storage class which was also preventing the restart. I’ve changed our configuration and it’s all good now.

Thanks for all your help.

You’re welcome @ollienuk, and great to hear! Thanks for following up about it.