How to integrate Charmed Kubeflow with the Canonical Observability Stack (COS)

Charmed Kubeflow (CKF) and Canonical Observability Stack (COS) can be integrated using Juju. This integration enables monitoring Kubeflow deployments.

The COS integration allows CKF to collect and analyse control plane data. User namespace data, such as logs from model training, cannot be monitored.

Requirements

Integrate with COS

As per COS best practices, this guide assumes that COS and CKF are each deployed using their own controller. After deployment, there is a kubeflow model on the kf-controller controller and a cos model on the cos-controller controller. These are the default controller names used in this guide. You can set any other name when bootstrapping a controller.

Integrating CKF with COS involves adding relations to Prometheus, which provides metrics and alerts, and to Grafana, which provides dashboards. To avoid a separate cross-model relation for every charm and to ensure COS is accessible, Kubeflow components are related to COS through the Grafana agent charm. Data flows from the CKF charms through the Grafana agent to COS.

All the code examples in this guide can be run from a terminal on any host machine that has Juju installed.

Deploy Grafana Agent

Within the kubeflow model, deploy the Grafana agent as follows:

juju deploy -m kf-controller:kubeflow grafana-agent-k8s --channel=stable

Get COS URLs

You can navigate the User Interface (UI) of charms such as Grafana and Prometheus using Catalogue.

You can get their related URLs by running the following code:

juju show-unit -m cos-controller:cos catalogue/0 --format json | jq '.[]."relation-info"[]."application-data".url | select(. != null)'

You should see something like this:

"http://10.64.140.43/cos-grafana"
"http://10.64.140.43/cos-prometheus-0"
"http://10.64.140.43/cos-alertmanager"

For more information on COS URLs, see Browse dashboards.
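
When scripting the later steps, it can help to pick a single URL out of that list. The following is a minimal sketch rather than part of any official tooling: `pick_cos_url` is a hypothetical helper that reads the quoted URLs on standard input and prints the first one that contains the given component name.

```shell
# Hypothetical helper: print the first URL read from stdin that mentions
# the given component name, with the surrounding double quotes removed.
pick_cos_url() {
    component="$1"
    # Strip the quotes, keep lines that mention the component, take the first.
    tr -d '"' | grep -- "$component" | head -n 1
}
```

For example, piping the output of the catalogue command above into `pick_cos_url prometheus` would print only the Prometheus URL, which you could then store in a shell variable of your choice.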

Check connectivity

To check connectivity from the Grafana agent to COS, try to access any of the COS URLs, for example the Prometheus one, from within the Grafana agent unit:

juju exec --unit grafana-agent-k8s/0 -m kf-controller:kubeflow 'curl -s <URL>'

If successful, you will get an OK response, similar to this one:

juju exec --unit grafana-agent-k8s/0 -m kf-controller:kubeflow 'curl -s http://10.64.140.43/cos-prometheus-0/api/v1/status/runtimeinfo'
{"status":"success","data":{"startTime":"2024-07-24T16:28:30.48462767Z","CWD":"/","reloadConfigSuccess":true,"lastConfigTime":"2024-07-24T16:28:30Z","corruptionCount":0,"goroutineCount":56,"GOMAXPROCS":16,"GOMEMLIMIT":9223372036854775807,"GOGC":"","GODEBUG":"","storageRetention":"15d or 819MiB204KiB819B"}}
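
If you want to automate this check, one option is to test the response for the success marker. The sketch below assumes the compact JSON formatting shown in the sample response; `is_prometheus_ok` is a hypothetical helper name, and it uses plain pattern matching so that no JSON parser is needed.

```shell
# Hypothetical helper: succeed only if the response text reports
# "status":"success". Assumes compact JSON, as in the sample response.
is_prometheus_ok() {
    case "$1" in
        *'"status":"success"'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (replace <URL> as in the command from this guide):
#   response=$(juju exec --unit grafana-agent-k8s/0 -m kf-controller:kubeflow 'curl -s <URL>')
#   is_prometheus_ok "$response" && echo "COS is reachable from the Grafana agent"
```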

Make offers from COS

You can make offers for Prometheus, Grafana and Loki from COS as follows:

juju offer -c cos-controller cos.prometheus:receive-remote-write prometheus-receive-remote-write
juju offer -c cos-controller cos.grafana:grafana-dashboard grafana-dashboards
juju offer -c cos-controller cos.loki:logging loki-logging

If you’ve deployed COS with the offers overlay, making offers is not necessary, since they already exist.

Consume Offers in Kubeflow

Within the kubeflow model, you can consume the COS offers for Prometheus, Grafana and Loki as follows:

juju consume -m kf-controller:kubeflow cos-controller:cos.prometheus-receive-remote-write
juju consume -m kf-controller:kubeflow cos-controller:cos.grafana-dashboards
juju consume -m kf-controller:kubeflow cos-controller:cos.loki-logging
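
The offer name given to juju offer must match the one passed to juju consume. As a minimal sketch of one way to keep them in sync, the hypothetical helper below holds the three pairs from this guide in a single list and prints (without running) both sets of commands for review. The controller and model names are the defaults assumed in this guide.

```shell
# Each entry is "<application>:<endpoint> <offer-name>", copied from the
# offer commands in this guide.
OFFERS='prometheus:receive-remote-write prometheus-receive-remote-write
grafana:grafana-dashboard grafana-dashboards
loki:logging loki-logging'

# Hypothetical helper: print the matching offer and consume commands.
# Nothing is executed, so the output can be reviewed first.
print_offer_commands() {
    printf '%s\n' "$OFFERS" | while read -r endpoint offer; do
        echo "juju offer -c cos-controller cos.$endpoint $offer"
        echo "juju consume -m kf-controller:kubeflow cos-controller:cos.$offer"
    done
}
```

Calling `print_offer_commands` lists six commands. Once they look right, they can be copied into a terminal one by one.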

Connect Grafana agent to endpoints

The Grafana agent can provide metrics, alerts, dashboards and logs to COS via three relation endpoints: send-remote-write, grafana-dashboards-provider and logging-consumer.

You can integrate each of these endpoints with the corresponding consumed offer as follows:

juju integrate -m kf-controller:kubeflow grafana-agent-k8s:send-remote-write prometheus-receive-remote-write
juju integrate -m kf-controller:kubeflow grafana-agent-k8s:grafana-dashboards-provider grafana-dashboards
juju integrate -m kf-controller:kubeflow grafana-agent-k8s:logging-consumer loki-logging

Verify the relations for all offers are in place:

juju status -m cos-controller:cos grafana-agent-k8s --relations

You should see 1/1 in the Connected column under Offers.

Integrate with Prometheus

You can provide charm metrics to Prometheus in COS by linking the CKF charms to the metrics-endpoint of the Grafana agent as follows:

juju switch kf-controller:kubeflow
juju integrate argo-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate dex-auth:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate envoy:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate istio-ingressgateway:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate istio-pilot:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate jupyter-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate katib-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate kfp-api:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate knative-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate knative-eventing:otel-collector knative-operator:otel-collector
juju integrate knative-serving:otel-collector knative-operator:otel-collector
juju integrate kserve-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate kubeflow-profiles:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate metacontroller-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate minio:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate seldon-controller-manager:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate pvcviewer-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate tensorboard-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju integrate training-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint

Verify the relations are successfully added with juju status --relations.
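
Since most of these commands differ only in the application name, they can also be generated in a loop. The following is a minimal sketch, not an official script: `integrate_all` and the `DRY_RUN` variable are hypothetical names, and the function acts on the currently selected model, so run juju switch first as shown above.

```shell
# Hypothetical helper: integrate each listed application with the Grafana agent.
# usage: integrate_all <app-endpoint> <agent-endpoint> <app>...
# Set DRY_RUN=1 to print the commands instead of running them.
integrate_all() {
    app_endpoint="$1"
    agent_endpoint="$2"
    shift 2
    for app in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "juju integrate $app:$app_endpoint grafana-agent-k8s:$agent_endpoint"
        else
            juju integrate "$app:$app_endpoint" "grafana-agent-k8s:$agent_endpoint"
        fi
    done
}

# Example (prints the commands only):
#   DRY_RUN=1 integrate_all metrics-endpoint metrics-endpoint argo-controller dex-auth envoy
```

The two knative otel-collector relations do not follow this pattern and still need to be added separately. The same helper can be pointed at the other endpoint pairs used in this guide by changing its first two arguments.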

Integrate with Grafana

You can provide charm dashboards to Grafana in COS by linking the CKF charms to the grafana-dashboards-consumer endpoint of the Grafana agent as follows:

juju switch kf-controller:kubeflow
juju integrate argo-controller:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate dex-auth:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate envoy:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate istio-pilot:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate jupyter-controller:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate katib-controller:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate kfp-api:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate kubeflow-dashboard:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate metacontroller-operator:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate minio:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate seldon-controller-manager:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate pvcviewer-operator:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju integrate training-operator:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer

Verify the relations are successfully added with juju status --relations.

Integrate with Loki

Log forwarding to Loki is available from CKF 1.9.

You can provide charm logs to Loki in COS by integrating the CKF charms with the logging-provider endpoint of the Grafana agent as follows:

juju switch kf-controller:kubeflow
juju integrate admission-webhook:logging grafana-agent-k8s:logging-provider
juju integrate jupyter-ui:logging grafana-agent-k8s:logging-provider
juju integrate katib-db-manager:logging grafana-agent-k8s:logging-provider
juju integrate katib-ui:logging grafana-agent-k8s:logging-provider
juju integrate kfp-metadata-writer:logging grafana-agent-k8s:logging-provider
juju integrate kfp-persistence:logging grafana-agent-k8s:logging-provider
juju integrate kfp-profile-controller:logging grafana-agent-k8s:logging-provider
juju integrate kfp-schedwf:logging grafana-agent-k8s:logging-provider
juju integrate kfp-ui:logging grafana-agent-k8s:logging-provider
juju integrate kfp-viewer:logging grafana-agent-k8s:logging-provider
juju integrate kfp-viz:logging grafana-agent-k8s:logging-provider
juju integrate kubeflow-dashboard:logging grafana-agent-k8s:logging-provider
juju integrate kubeflow-volumes:logging grafana-agent-k8s:logging-provider
juju integrate mlmd:logging grafana-agent-k8s:logging-provider
juju integrate oidc-gatekeeper:logging grafana-agent-k8s:logging-provider
juju integrate tensorboards-web-app:logging grafana-agent-k8s:logging-provider
juju integrate argo-controller:logging grafana-agent-k8s:logging-provider
juju integrate dex-auth:logging grafana-agent-k8s:logging-provider
juju integrate envoy:logging grafana-agent-k8s:logging-provider
juju integrate jupyter-controller:logging grafana-agent-k8s:logging-provider
juju integrate katib-controller:logging grafana-agent-k8s:logging-provider
juju integrate kfp-api:logging grafana-agent-k8s:logging-provider
juju integrate knative-operator:logging grafana-agent-k8s:logging-provider
juju integrate kserve-controller:logging grafana-agent-k8s:logging-provider
juju integrate kubeflow-profiles:logging grafana-agent-k8s:logging-provider
juju integrate pvcviewer-operator:logging grafana-agent-k8s:logging-provider
juju integrate seldon-controller-manager:logging grafana-agent-k8s:logging-provider
juju integrate tensorboard-controller:logging grafana-agent-k8s:logging-provider

Verify the relations are successfully added with juju status --relations.

Access monitoring resources

Using COS URLs, you can access Prometheus and Grafana to monitor resources, including metrics, alerts and dashboards from CKF charms.

Prometheus metrics

Navigate to the Prometheus URL (see Get COS URLs for more details).

To view the metrics for a specific charm, query {juju_application="<app-name>"}.

For example, you can check the argo-controller metrics using this query:

argo_workflows_count{juju_model="kubeflow", juju_charm="argo-controller"}

It returns the matching time series for that charm.

To view all the metrics available from Prometheus, use the metrics explorer by clicking on the round icon next to Execute in that query form.

See Prometheus metrics for more details on available metrics.
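
The same kind of query can also be sent from a terminal through the Prometheus HTTP API (the /api/v1/query endpoint). The sketch below is only an illustration: `prom_query` and the `CURL` override are hypothetical names, and it assumes the Prometheus URL is reachable from the machine where you run it.

```shell
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# usage: prom_query <prometheus-base-url> <promql-expression>
# CURL can be overridden (for example, in tests); it defaults to curl.
prom_query() {
    "${CURL:-curl}" -s --get \
        --data-urlencode "query=$2" \
        "$1/api/v1/query"
}

# Example, using a Prometheus URL obtained as described in Get COS URLs:
#   prom_query "http://10.64.140.43/cos-prometheus-0" \
#     'argo_workflows_count{juju_model="kubeflow", juju_charm="argo-controller"}'
```

The response is JSON. A "status" of "success" indicates that the query was accepted, and the matching series are listed under "data".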

Prometheus alerts

Navigate to the Prometheus URL and click on Alerting under the left-hand side navigation bar. This shows all available alerts.

You can filter by Inactive, Pending and Firing alerts using the available checkboxes. To view alerts for a specific charm, type its name in the search bar at the top.

See Prometheus alerts for more details.

Grafana dashboards

Navigate to the Grafana dashboard URL (see Get COS URLs for more details). Get the admin password:

juju run -m cos-controller:cos grafana/leader get-admin-password

Using admin as the username, log in with the password returned. See the available dashboards by clicking on Dashboards in the sidebar menu.

See Grafana dashboards for more information.

Loki logs

Loki does not provide a UI. You can use the Grafana UI to check Loki logs.

Navigate to the Grafana URL and click on Explore, where you can browse all collected logs by selecting Loki as the data source.

See Visualize log data for more details on navigating Grafana Loki. For more information on forwarded logs, see Loki logs.

How should I solve this when Traefik goes into a pending state?

I already have an Istio load balancer.