Charmed Kubeflow 1.7 beta is here!

We are happy to announce that **Charmed Kubeflow 1.7 is now available in Beta**. Kubeflow is a foundational part of the [MLOps](https://ubuntu.com/blog/what-is-mlops) ecosystem that has been evolving over the years. With Charmed Kubeflow 1.7, users will benefit from the ability to run serverless workloads and perform model inference regardless of the machine learning framework they use.

We’re looking for data scientists, ML engineers and developers to take the Beta release for a test drive and share their feedback! Read our blog post if you want to learn more.

Join us live: tech talk on Charmed Kubeflow 1.7

By the way, if you’ve got no plans tonight, Canonical’s MLOps team will host a live stream about Charmed Kubeflow 1.7 Beta. Together with Daniela Plasencia and Noha Ihab, we will continue the tradition that started with the previous release. We will answer your questions and talk about:

  • The latest release: Kubeflow 1.7 and how our distribution handles it
  • Key features covered in Charmed Kubeflow 1.7
  • The differences between the upstream release and Canonical’s Charmed Kubeflow

The live stream will be available on both LinkedIn and YouTube, so pick your platform and meet us there.

Please be mindful that this is not a stable version, so there is always a risk that something might go wrong. Save your work and proceed with caution. If you encounter any difficulties, Canonical’s MLOps team is here to listen to your feedback and help you out. Since this is a Beta version, Canonical does not recommend running or upgrading it in any production environment.


Charmed Kubeflow 1.7 beta has the following known issues:

Pods in error state with message “too many open files”

In a microk8s environment, containers sometimes go into an error state with the message “too many open files”. If that is the case, execute the following commands on your system and restart any pods in the error state.

sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360 
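
These settings do not persist across reboots. As a minimal sketch, assuming you want to make them permanent via a sysctl drop-in file (the file name below is just an example), you could also run:

# Persist the inotify limits across reboots (the file name is arbitrary)
echo 'fs.inotify.max_user_instances=1280' | sudo tee -a /etc/sysctl.d/99-kubeflow.conf
echo 'fs.inotify.max_user_watches=655360' | sudo tee -a /etc/sysctl.d/99-kubeflow.conf
sudo sysctl --system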

tensorboard-controller stuck in waiting state

If the tensorboard-controller unit stays in waiting status with the message “Waiting for gateway info relation” for a long period of time, run the following command:

juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
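
Once the hook has re-run, the unit should leave the waiting state. As a quick check (just a status poll, nothing charm-specific), you can watch the relevant units until they settle:

# Poll the status of the affected applications every 5 seconds
watch -n 5 juju status tensorboard-controller istio-pilot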
Associated issue

#32, but the bug is attributed to istio-pilot; see below for more information.

The default gateway is not created

Errors you may encounter that could be related to this issue:

  • tensorboard-controller is stuck in “Waiting for gateway info relation”
  • The dashboard is unreachable after setting up dex-auth and oidc-gatekeeper

This issue prevents the creation of the kubeflow-gateway. To check whether it is affecting your deployment, verify that the kubeflow-gateway exists:

kubectl get gateway -n kubeflow kubeflow-gateway

If you get no output from the above command, please run the following and wait for istio-pilot to be active and idle.

juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
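
Once istio-pilot settles, re-running the earlier check should show the gateway. The expected output looks roughly like the following (the column layout may vary with your kubectl version):

kubectl get gateway -n kubeflow kubeflow-gateway
# NAME               AGE
# kubeflow-gateway   2m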
Associated issue

#113

Notebooks appear not to have access to pipelines

If you create a notebook (any framework) and you try to connect to pipelines through the API (e.g. using `create_run_from_pipeline_func()`), you may hit the following error:

ERROR:root:Failed to read a token from file '/var/run/secrets/kubeflow/pipelines/token'

If that is the case, make sure you have selected the “Allow access to Kubeflow Pipelines” option in the “Advanced” section of the notebook creation page.
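
As an additional sanity check, you can verify from outside the notebook that the pipelines token was actually mounted into the notebook pod. This is only a sketch; <user-namespace> and <notebook-pod> are placeholders for your own namespace and pod name:

# The token file should exist and be non-empty if pipelines access was enabled
kubectl exec -n <user-namespace> <notebook-pod> -- ls -l /var/run/secrets/kubeflow/pipelines/token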

Associated issue

#159. The issue has been labeled as a regression from 1.6 and will be fixed soon.

knative-serving and knative-eventing are in error state

These charms might be missing some configuration. Please run the following commands to change their state:

For knative-serving

juju config knative-serving namespace="knative-serving" istio.gateway.namespace=kubeflow istio.gateway.name=kubeflow-gateway

juju resolved knative-serving/<unit-number>

For knative-eventing

juju config knative-eventing namespace="knative-eventing"

juju resolved knative-eventing/<unit-number>

Please note this workaround only applies in the Charmed Kubeflow context; if you are deploying the charms standalone, please follow the instructions in the README.
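
After setting the config and resolving the units, both applications should eventually settle. A quick way to confirm (just a status check) is:

# Both charms should end up in active/idle once the config is applied
juju status knative-serving knative-eventing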

Some units are stuck in blocked status

If the blocked charm is one of the following:

  • KubeflowProfiles
  • KubeflowDashboard
  • Seldon
  • TrainingOperator
  • MLflow

Please refer to the associated issue below for more information.

Associated issue

#549

If you have any questions or topics to discuss, you can submit a Discourse post with the kubeflow tag.

If you find any issues when deploying the bundle, please go ahead and file a bug in the bundle-kubeflow repository.


You have more time for Charmed Kubeflow 1.7 Beta!

The general availability release has been pushed to the 29th of March. What does this mean?

  • You have more time to try Charmed Kubeflow 1.7 Beta
  • You have more time to provide your feedback
  • You have more time to share your MLOps experience with us

Is there any tutorial for how to upgrade Kubeflow from 1.6 to 1.7?


Not yet, but there will be one when we make the stable release.


Hi, I tested this new version on my platform. Everything works, except that, as in version 1.6, the Istio problem is still there :joy:. It needs commands to fix it.

Thank you Moula! The team is now working on the upgrade path for Istio. How did you find the deployment experience of the 1.7 Beta?

Hi Andreea, the deployment works well except for Istio, which always looks for the load balancer IP but never finds it :joy:

Hi @Moula,

That looks like whatever is providing your load balancers in k8s might be malfunctioning. I see you’re using microk8s; can you check `kubectl get pods -A` and look at the metallb pods? I have a feeling something is wrong there.

Did you deploy this in the past day or two? microk8s historically pulled metallb images from Docker Hub, but metallb’s Docker Hub repository was shut down just this week (see this for some context if interested). If your metallb pods are having image pull problems and their images point to Docker Hub, I think this is your problem. To work around it, microk8s 1.24/edge points to the up-to-date metallb repo, and that change will reach the other microk8s risk levels soon. Or, if you want to recover your current deployment, you should also be able to edit the metallb deployments and update the images like here.
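
For reference, here is a rough sketch of how you could check and patch this by hand. The metallb-system namespace, the controller/speaker workload and container names, and the quay.io image tag are assumptions based on the microk8s metallb addon, so adjust them to match what is actually running in your cluster:

# Look for ImagePullBackOff / ErrImagePull on the metallb pods
kubectl get pods -n metallb-system

# Point the workloads at a registry that still hosts the images, e.g. quay.io
kubectl set image -n metallb-system deployment/controller controller=quay.io/metallb/controller:v0.9.3
kubectl set image -n metallb-system daemonset/speaker speaker=quay.io/metallb/speaker:v0.9.3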


Hi @ca-scribner, I’m not the only one having this problem. See: https://github.com/canonical/bundle-kubeflow/issues/559. The problem comes from the metallb image, as you say.

I did the same deployment on my microk8s-ha cluster as with Kubeflow 1.6, which worked after a few modifications. I redid the deployment twice today and there is always the same problem.

Hey @Moula,

Yeah I too was frustrated with this yesterday :slight_smile: I feel your pain.

microk8s 1.24/edge works for me. Does that help? Manually changing those metallb images should also get it working; I just haven’t done that myself yet.


@ca-scribner Thank you very much. I will try with a microk8s 1.24/edge cluster this evening.

You’re welcome! Sorry for the frustration. Let us know how it goes.

By the time you try it, things might have moved through the release CI too. At the moment I see:

snap info microk8s
(truncated)
  1.24/stable:           v1.24.10        2023-02-02 (4561) 224MB classic
  1.24/candidate:        v1.24.10        2023-01-27 (4561) 224MB classic
  1.24/beta:             v1.24.10        2023-01-27 (4561) 224MB classic
  1.24/edge:             v1.24.11        2023-03-16 (4891) 225MB classic

where that v1.24.11 (rev 4891) works for me. Check it before you install, in case it has already been promoted through to the other channels.

@ca-scribner There is a new problem with knative-eventing

Hi @Moula,

From the messages in the status of the units, I think you need to `juju trust` these charms - that should get things going. Did you deploy this from one of the premade bundles, or did you add these in separately? If it was a premade bundle, please let me know which one, as we might have missed adding trust somewhere in our bundles.
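
In case it helps, this is the kind of command I mean (application names assumed to match the bundle defaults; adjust to whichever units are blocked):

# Grant the charms cluster-level permissions so they can create their resources
juju trust knative-serving --scope=cluster
juju trust knative-eventing --scope=cluster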

@ca-scribner I just redid the deployment from the bundle. The problem is still there. We must add trust to the charm:

Thanks @moula, just so I don’t get it wrong, can you tell me which bundle you’re using? Is the command `juju deploy kubeflow --channel 1.7/beta`?


@ca-scribner `juju deploy kubeflow --channel 1.7/beta --trust`