How to troubleshoot Charmed Kubeflow

Kubeflow is a complex stack of applications running on Kubernetes, usually on a remote cloud. While most of the time it does “just work”, there may be occasions when you encounter issues. This page outlines some general methods to find out the cause of the issue, as well as some common issues and their solutions.


Troubleshooting with Juju

Juju tracks the state of all the applications it deploys. Any issues detected by the applications will be picked up by Juju, so running:

juju status

…will show the current status of deployed software and any actions which need to be taken.

To troubleshoot applications further, you can use Juju to get a shell in the running container. This just requires the deployed application name and unit number (which you can see from the status command above). For example:

juju ssh seldon-core/0

You can then run whatever commands are required to examine the state of the application and its container.

For more information on debugging and troubleshooting with Juju, see the Juju documentation.
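As a minimal sketch of the steps above, the helper below prints the Juju status for a single application and then the model log lines that mention it. The function name `juju_app_report` is hypothetical, filtering the log with `grep` is just one simple approach, and the sketch assumes your Juju version supports the `--replay` and `--no-tail` options of `juju debug-log`.

```shell
# Hypothetical helper: summarise one application in the current Juju model.
# Usage: juju_app_report seldon-core
juju_app_report() {
    app="$1"
    # Status output filtered to the named application and its units.
    juju status "$app"
    # Replay the model log from the start, stop at the end instead of
    # following it, and keep only the lines that mention the application.
    juju debug-log --replay --no-tail | grep -- "$app" || true
}
```

Running `juju_app_report seldon-core` before opening a shell with `juju ssh` can help narrow down which unit to inspect.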

Troubleshooting with kubectl

The kubectl command can give you lots of information about the state of pods and services running on the cluster. To restrict the output to your Kubeflow deployment, specify the namespace, which has the same name as the Juju model (in this documentation we called it “kubeflow”). For example:

kubectl get pods -n kubeflow

A lot of information can be gleaned just using kubectl. Check out the kubectl documentation for more help.
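To look more closely at one pod from the list, the following sketch combines two standard kubectl subcommands. The function name `inspect_pod` is hypothetical, and the namespace defaults to “kubeflow” on the assumption that your Juju model uses the name from this documentation.

```shell
# Hypothetical helper: gather details for one pod.
# Usage: inspect_pod <pod-name> [namespace]
inspect_pod() {
    pod="$1"
    ns="${2:-kubeflow}"   # assumed default: the model name used in this documentation
    # Scheduling details, container states and recent events for the pod.
    kubectl describe pod "$pod" -n "$ns"
    # Logs from every container in the pod.
    kubectl logs "$pod" -n "$ns" --all-containers
}
```

Take the pod name from the output of `kubectl get pods -n kubeflow`.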

Common Issues

Deployment

Pods stuck in pending

If some pods are not progressing past the ‘pending’ stage after a long time, the most common cause is that they have been unable to allocate storage. Check that enough storage is available to the cluster and examine the persistent volume claims made by the pods.
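One possible way to carry out this check is sketched below: it lists only the pods still in the Pending phase, then the persistent volume claims in the same namespace. The function name `show_pending` is hypothetical, and the “kubeflow” default assumes the model name used in this documentation.

```shell
# Hypothetical helper: show pending pods and the namespace's volume claims.
# Usage: show_pending [namespace]
show_pending() {
    ns="${1:-kubeflow}"   # assumed default: the model name used in this documentation
    # Only pods whose phase is still Pending.
    kubectl get pods -n "$ns" --field-selector=status.phase=Pending
    # Persistent volume claims; a claim that is not Bound has no storage yet.
    kubectl get pvc -n "$ns"
}
```

If a claim is not bound, `kubectl describe pvc <claim-name> -n kubeflow` usually lists the reason in its events.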

Jupyter notebook server stuck creating

If your Jupyter notebook server is not progressing past the ‘creating’ stage, it may be that there is insufficient CPU available in your cluster to create the notebook server. By default, notebook servers are created using 0.5 CPU. If this is the case, you can set the CPU share to 0 when you create the notebook server instead.
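To judge whether CPU is the limiting factor, you can compare each node's allocatable CPU with what has already been requested. The sketch below is one way to do that; the function name `show_cpu_capacity` is hypothetical, and the `grep` on “Allocated resources” assumes the layout that `kubectl describe nodes` typically prints, which may differ between versions.

```shell
# Hypothetical helper: show CPU capacity and current requests per node.
show_cpu_capacity() {
    # Allocatable CPU for each node.
    kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu
    # Summary of the resource requests already placed on each node
    # (assumes the usual "Allocated resources" section in the output).
    kubectl describe nodes | grep -A 5 "Allocated resources"
}
```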

Dashboard issues

Forgotten password

The dex-auth user and password can be seen using the Juju config command if you have access to the Juju client running the model.

juju config dex-auth static-username
juju config dex-auth static-password

… will reveal the current settings. You can also set a new username/password:

juju config dex-auth static-username=admin
juju config dex-auth static-password=AxWiJjk2hu4fFga7
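Rather than reusing a sample password, you may prefer to generate one. The sketch below is an illustration only: the function name `set_dex_password` is hypothetical, and it assumes `/dev/urandom` and the usual `head`, `tr` and `cut` tools are available on the machine running the Juju client.

```shell
# Hypothetical helper: set the dex-auth credentials with a random password.
# Usage: set_dex_password [username]
set_dex_password() {
    user="${1:-admin}"
    # Keep only letters and digits from 256 random bytes, then take the
    # first 16 characters.
    password="$(head -c 256 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-16)"
    juju config dex-auth static-username="$user"
    juju config dex-auth static-password="$password"
    # Print the new password so that it can be recorded.
    echo "$password"
}
```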

DNS issues accessing Kubeflow Dashboard

If you are using dynamic hostname resolution (a hostname ending with nip.io) to evaluate Charmed Kubeflow, you may encounter issues with DNS caching. By default, Ubuntu Server uses systemd-resolved for DNS caching. The following commands install resolvconf and add 8.8.8.8 as a nameserver:

sudo apt install -y resolvconf
sudo systemctl enable --now resolvconf.service
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolvconf/resolv.conf.d/head
sudo resolvconf -u
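To confirm whether a hostname resolves on the machine, before or after making this change, a small check such as the following can help. The function name `check_resolves` is hypothetical; the sketch relies on `getent hosts`, which queries the system resolver.

```shell
# Hypothetical helper: report whether a hostname resolves on this machine.
# Usage: check_resolves <hostname>
check_resolves() {
    host="$1"
    # getent exits non-zero when the system resolver finds no address.
    if getent hosts "$host" > /dev/null; then
        echo "$host resolves"
    else
        echo "$host does not resolve" >&2
        return 1
    fi
}
```

Pass the nip.io hostname you use to reach the Kubeflow Dashboard as the argument.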

Other issues

Istio Pilot and Istio Gateway fail to start

In some cases, when the Istio Pilot and/or Istio Gateway pods do not start, the issue may lie with internet connectivity. Verify that the internet connection is stable and has enough bandwidth.
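To see what the Istio pods are currently reporting, the sketch below lists them along with the most recent events in the namespace. It assumes that a connectivity problem would show up in pod status or events (for example as image pull errors), which this page does not confirm. The function name `show_istio_state` is hypothetical, and pods are matched by name only.

```shell
# Hypothetical helper: show Istio pods and recent events in a namespace.
# Usage: show_istio_state [namespace]
show_istio_state() {
    ns="${1:-kubeflow}"   # assumed default: the model name used in this documentation
    # Pods whose name contains "istio", with their current status.
    kubectl get pods -n "$ns" | grep -i istio || true
    # Events in the namespace, most recent last.
    kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```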
