I can't get Kubeflow to talk to COS

Hi,

The external network I am using is 10.70.0.0/16.

My deployment is as follows:

  • a MAAS VM with the Juju client

  • a Juju VM acting as the first controller

  • a Kubernetes cluster deployed with Charmed Kubernetes v1.30, composed of:

    • 3 VMs as control-plane
    • 4 physical nodes as workers
  • Charmed Kubeflow v1.9 deployed on top of it, accessible at 10.70.250.1

I then deployed a new VM for observability (OBS), running MicroK8s v1.30:

  • I installed MicroK8s on the VM following the Charmhub guide "Getting started on MicroK8s"
  • I used my MAAS machine to deploy a second Juju controller to manage the MicroK8s VM
  • I merged the two Kubernetes configs so that both clusters are visible as contexts
  • I deployed COS on the MicroK8s VM; it is accessible at 10.70.80.1
    • The endpoints are reachable: Prometheus, Grafana, etc.
  • I then followed the Charmed Kubeflow documentation page "Integrate with Canonical Observability Stack" to integrate Kubeflow with COS
  • The connectivity check worked and returned: success
  • I did the offer and consume steps but got errors; it seems that Kubeflow cannot talk to COS.
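
The consume and integrate steps correspond roughly to the commands printed by the sketch below. It is reconstructed from the SAAS table and the integrations shown in the `juju status --relations` output for the kubeflow model, not a transcript of what was actually run. `print_cmr_commands` is a hypothetical helper that only prints the commands, so nothing is changed on either controller.

```shell
#!/bin/sh
# Print (do not run) the consume/integrate commands implied by the status output.
CONTROLLER="juju-remote-microk8s"   # controller name as listed by 'juju controllers'
MODEL="admin/cos"                   # owner/model prefix of the offer URLs in the SAAS table

# Hypothetical helper: $1 = offer name, $2 = grafana-agent-k8s endpoint.
print_cmr_commands() {
    echo "juju consume ${CONTROLLER}:${MODEL}.$1"
    echo "juju integrate grafana-agent-k8s:$2 $1"
}

# Offer/endpoint pairs taken from the integrations in the kubeflow model.
print_cmr_commands prometheus-receive-remote-write send-remote-write
print_cmr_commands loki-logging logging-consumer
print_cmr_commands grafana-dashboards grafana-dashboards-provider
```

The printed commands are meant to be run with the kubeflow model selected on juju-controller.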

It looks like a DNS resolution problem, but I don’t see how to fix it.

This is what I am seeing:
root@maas:~# juju debug-log
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "loki-logging"
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "grafana-dashboards"
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "prometheus-receive-remote-write"
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for prometheus-receive-remote-write: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr stopped "prometheus-receive-remote-write", err: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr non-fatal error "prometheus-receive-remote-write": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr exited "prometheus-receive-remote-write": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for loki-logging: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr stopped "loki-logging", err: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr non-fatal error "loki-logging": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr restarting "prometheus-receive-remote-write" in 15s
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr exited "loki-logging": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr restarting "loki-logging" in 15s
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for grafana-dashboards: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
root@maas:~# kubectl config get-contexts
CURRENT   NAME           CLUSTER            AUTHINFO    NAMESPACE
          juju-context   juju-cluster       adminks     
*         microk8s       microk8s-cluster   adminmk8s

root@maas:~# juju controllers
Use --refresh option with this command to see the latest information.

Controller            Model     User   Access     Cloud/Region               Models  Nodes    HA  Version
juju-controller*      kubeflow  admin  superuser  maas01-cloud/default            3      8  none  3.5.4  
juju-remote-microk8s  cos       admin  superuser  remote-microk8s/localhost       2      1     -  3.6.1 

root@maas:~# juju switch juju-remote-microk8s  
juju-controller:admin/kubeflow -> juju-remote-microk8s:admin/cos
root@maas:~# juju status --relations 
Model  Controller            Cloud/Region               Version  SLA          Timestamp
cos    juju-remote-microk8s  remote-microk8s/localhost  3.6.1    unsupported  22:14:51+01:00

App           Version  Status  Scale  Charm             Channel        Rev  Address         Exposed  Message
alertmanager  0.27.0   active      1  alertmanager-k8s  latest/stable  128  10.152.183.135  no       
catalogue              active      1  catalogue-k8s     latest/stable   59  10.152.183.124  no       
grafana       9.5.3    active      1  grafana-k8s       latest/stable  117  10.152.183.61   no       
loki          2.9.6    active      1  loki-k8s          latest/stable  161  10.152.183.56   no       
prometheus    2.52.0   active      1  prometheus-k8s    latest/stable  210  10.152.183.142  no       
traefik       2.11.0   active      1  traefik-k8s       latest/stable  203  10.152.183.225  no       Serving at 10.70.80.1

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.106.154         
catalogue/0*     active    idle   10.1.106.144         
grafana/0*       active    idle   10.1.106.155         
loki/0*          active    idle   10.1.106.156         
prometheus/0*    active    idle   10.1.106.157         
traefik/0*       active    idle   10.1.106.153         Serving at 10.70.80.1

Offer                            Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager-karma-dashboard     alertmanager  alertmanager-k8s  128  0/0        karma-dashboard       karma_dashboard          provider
grafana-dashboards               grafana       grafana-k8s       117  0/0        grafana-dashboard     grafana_dashboard        requirer
loki-logging                     loki          loki-k8s          161  0/0        logging               loki_push_api            provider
prometheus-receive-remote-write  prometheus    prometheus-k8s    210  0/0        receive-remote-write  prometheus_remote_write  provider

Integration provider                Requirer                     Interface              Type     Message
alertmanager:alerting               loki:alertmanager            alertmanager_dispatch  regular  
alertmanager:alerting               prometheus:alertmanager      alertmanager_dispatch  regular  
alertmanager:grafana-dashboard      grafana:grafana-dashboard    grafana_dashboard      regular  
alertmanager:grafana-source         grafana:grafana-source       grafana_datasource     regular  
alertmanager:replicas               alertmanager:replicas        alertmanager_replica   peer     
alertmanager:self-metrics-endpoint  prometheus:metrics-endpoint  prometheus_scrape      regular  
catalogue:catalogue                 alertmanager:catalogue       catalogue              regular  
catalogue:catalogue                 grafana:catalogue            catalogue              regular  
catalogue:catalogue                 prometheus:catalogue         catalogue              regular  
catalogue:replicas                  catalogue:replicas           catalogue_replica      peer     
grafana:grafana                     grafana:grafana              grafana_peers          peer     
grafana:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
grafana:replicas                    grafana:replicas             grafana_replicas       peer     
loki:grafana-dashboard              grafana:grafana-dashboard    grafana_dashboard      regular  
loki:grafana-source                 grafana:grafana-source       grafana_datasource     regular  
loki:metrics-endpoint               prometheus:metrics-endpoint  prometheus_scrape      regular  
loki:replicas                       loki:replicas                loki_replica           peer     
prometheus:grafana-dashboard        grafana:grafana-dashboard    grafana_dashboard      regular  
prometheus:grafana-source           grafana:grafana-source       grafana_datasource     regular  
prometheus:prometheus-peers         prometheus:prometheus-peers  prometheus_peers       peer     
traefik:ingress                     alertmanager:ingress         ingress                regular  
traefik:ingress                     catalogue:ingress            ingress                regular  
traefik:ingress-per-unit            loki:ingress                 ingress_per_unit       regular  
traefik:ingress-per-unit            prometheus:ingress           ingress_per_unit       regular  
traefik:metrics-endpoint            prometheus:metrics-endpoint  prometheus_scrape      regular  
traefik:peers                       traefik:peers                traefik_peers          peer     


root@maas:~# juju switch juju-controller
juju-remote-microk8s:admin/cos -> juju-controller:admin/kubeflow
root@maas:~# juju status --relations 
Model     Controller       Cloud/Region         Version  SLA          Timestamp
kubeflow  juju-controller  kflow-cloud/default  3.5.4    unsupported  22:15:33+01:00

SAAS                             Status  Store                 URL
grafana-dashboards               error   juju-remote-microk8s  admin/cos.grafana-dashboards
loki-logging                     error   juju-remote-microk8s  admin/cos.loki-logging
prometheus-receive-remote-write  error   juju-remote-microk8s  admin/cos.prometheus-receive-remote-write

App                      Version                  Status   Scale  Charm                    Channel          Rev  Address         Exposed  Message
admission-webhook                                 active       1  admission-webhook        1.9/stable       344  10.152.183.146  no       
argo-controller                                   active       1  argo-controller          3.4/stable       617  10.152.183.20   no       
dex-auth                                          active       1  dex-auth                 2.39/stable      588  10.152.183.54   no       
envoy                                             active       1  envoy                    2.2/stable       310  10.152.183.132  no       
grafana-agent-k8s        0.40.4                   blocked      1  grafana-agent-k8s        latest/stable     80  10.152.183.236  no       Missing incoming ('requires') relation: metrics-endpoint|logging-provider|grafana-dashboards-consumer
istio-ingressgateway                              active       1  istio-gateway            1.22/stable     1280  10.152.183.93   no       
istio-pilot                                       active       1  istio-pilot              1.22/stable     1169  10.152.183.131  no       
jupyter-controller                                active       1  jupyter-controller       1.9/stable      1083  10.152.183.59   no       
jupyter-ui                                        active       1  jupyter-ui               1.9/stable       961  10.152.183.147  no       
katib-controller                                  active       1  katib-controller         0.17/stable      813  10.152.183.104  no       
katib-db                 8.0.37-0ubuntu0.22.04.3  active       1  mysql-k8s                8.0/stable       180  10.152.183.209  no       
katib-db-manager                                  active       1  katib-db-manager         0.17/stable      713  10.152.183.145  no       
katib-ui                                          active       1  katib-ui                 0.17/stable      713  10.152.183.178  no       
kfp-api                                           active       1  kfp-api                  2.3/stable      1743  10.152.183.170  no       
kfp-db                   8.0.37-0ubuntu0.22.04.3  active       1  mysql-k8s                8.0/stable       180  10.152.183.247  no       
kfp-metadata-writer                               active       1  kfp-metadata-writer      2.3/stable       825  10.152.183.144  no       
kfp-persistence                                   active       1  kfp-persistence          2.3/stable      1756  10.152.183.39   no       
kfp-profile-controller                            active       1  kfp-profile-controller   2.3/stable      1715  10.152.183.214  no       
kfp-schedwf                                       active       1  kfp-schedwf              2.3/stable      1765  10.152.183.235  no       
kfp-ui                                            active       1  kfp-ui                   2.3/stable      1752  10.152.183.105  no       
kfp-viewer                                        active       1  kfp-viewer               2.3/stable      1781  10.152.183.83   no       
kfp-viz                                           active       1  kfp-viz                  2.3/stable      1700  10.152.183.87   no       
knative-eventing                                  active       1  knative-eventing         1.12/stable      459  10.152.183.106  no       
knative-operator                                  active       1  knative-operator         1.12/stable      496  10.152.183.107  no       
knative-serving                                   active       1  knative-serving          1.12/stable      487  10.152.183.243  no       
kserve-controller                                 active       1  kserve-controller        0.13/stable      655  10.152.183.231  no       
kubeflow-dashboard                                active       1  kubeflow-dashboard       1.9/stable       659  10.152.183.171  no       
kubeflow-profiles                                 active       1  kubeflow-profiles        1.9/stable       458  10.152.183.111  no       
kubeflow-roles                                    active       1  kubeflow-roles           1.9/stable       240  10.152.183.219  no       
kubeflow-volumes                                  active       1  kubeflow-volumes         1.9/stable       348  10.152.183.227  no       
metacontroller-operator                           active       1  metacontroller-operator  3.0/stable       352  10.152.183.44   no       
minio                    res:oci-image@220b31a    active       1  minio                    ckf-1.9/stable   383  10.152.183.84   no       
mlflow-minio             res:oci-image@220b31a    active       1  minio                    ckf-1.9/stable   383  10.152.183.78   no       
mlflow-mysql             8.0.37-0ubuntu0.22.04.3  active       1  mysql-k8s                8.0/stable       180  10.152.183.250  no       
mlflow-server                                     active       1  mlflow-server            2.15/stable      762  10.152.183.207  no       
mlmd                                              active       1  mlmd                     ckf-1.9/stable   213  10.152.183.189  no       
oidc-gatekeeper                                   active       1  oidc-gatekeeper          ckf-1.9/stable   423  10.152.183.33   no       
pvcviewer-operator                                active       1  pvcviewer-operator       1.9/stable       204  10.152.183.60   no       
resource-dispatcher                               active       1  resource-dispatcher      2.0/stable       182  10.152.183.183  no       
tensorboard-controller                            active       1  tensorboard-controller   1.9/stable       355  10.152.183.165  no       
tensorboards-web-app                              active       1  tensorboards-web-app     1.9/stable       343  10.152.183.98   no       
training-operator                                 active       1  training-operator        1.8/stable       545  10.152.183.166  no       

Unit                        Workload  Agent  Address          Ports          Message
admission-webhook/0*        active    idle   192.168.93.51                   
argo-controller/0*          active    idle   192.168.93.61                   
dex-auth/0*                 active    idle   192.168.93.33                   
envoy/0*                    active    idle   192.168.152.62                  
grafana-agent-k8s/0*        blocked   idle   192.168.152.40                  Missing incoming ('requires') relation: metrics-endpoint|logging-provider|grafana-dashboards-consumer
istio-ingressgateway/0*     active    idle   192.168.93.57                   
istio-pilot/0*              active    idle   192.168.93.39                   
jupyter-controller/0*       active    idle   192.168.152.56                  
jupyter-ui/0*               active    idle   192.168.152.49                  
katib-controller/0*         active    idle   192.168.152.61                  
katib-db-manager/0*         active    idle   192.168.93.43                   
katib-db/0*                 active    idle   192.168.152.38                  Primary
katib-ui/0*                 active    idle   192.168.93.40                   
kfp-api/0*                  active    idle   192.168.152.32                  
kfp-db/0*                   active    idle   192.168.26.216                  Primary
kfp-metadata-writer/0*      active    idle   192.168.93.48                   
kfp-persistence/0*          active    idle   192.168.93.30                   
kfp-profile-controller/0*   active    idle   192.168.26.246                  
kfp-schedwf/0*              active    idle   192.168.152.51                  
kfp-ui/0*                   active    idle   192.168.152.58                  
kfp-viewer/0*               active    idle   192.168.93.53                   
kfp-viz/0*                  active    idle   192.168.93.54                   
knative-eventing/0*         active    idle   192.168.93.45                   
knative-operator/0*         active    idle   192.168.251.148                 
knative-serving/0*          active    idle   192.168.152.53                  
kserve-controller/0*        active    idle   192.168.26.248                  
kubeflow-dashboard/0*       active    idle   192.168.93.35                   
kubeflow-profiles/0*        active    idle   192.168.152.55                  
kubeflow-roles/0*           active    idle   192.168.93.50                   
kubeflow-volumes/0*         active    idle   192.168.26.242                  
metacontroller-operator/0*  active    idle   192.168.93.37                   
minio/0*                    active    idle   192.168.251.159  9000-9001/TCP  
mlflow-minio/0*             active    idle   192.168.251.150  9000-9001/TCP  
mlflow-mysql/0*             active    idle   192.168.26.213                  Primary
mlflow-server/0*            active    idle   192.168.251.190                 
mlmd/0*                     active    idle   192.168.152.11                  
oidc-gatekeeper/0*          active    idle   192.168.26.245                  
pvcviewer-operator/0*       active    idle   192.168.152.4                   
resource-dispatcher/0*      active    idle   192.168.152.8                   
tensorboard-controller/0*   active    idle   192.168.26.244                  
tensorboards-web-app/0*     active    idle   192.168.251.187                 
training-operator/0*        active    idle   192.168.93.58                   

Integration provider                                  Requirer                               Interface                 Type     Message
dex-auth:dex-oidc-config                              oidc-gatekeeper:dex-oidc-config        dex-oidc-config           regular  
grafana-agent-k8s:grafana-dashboards-provider         grafana-dashboards:grafana-dashboard   grafana_dashboard         regular  
grafana-agent-k8s:peers                               grafana-agent-k8s:peers                grafana_agent_replica     peer     
istio-pilot:gateway-info                              kserve-controller:ingress-gateway      istio-gateway-info        regular  
istio-pilot:gateway-info                              tensorboard-controller:gateway-info    istio-gateway-info        regular  
istio-pilot:ingress                                   dex-auth:ingress                       ingress                   regular  
istio-pilot:ingress                                   envoy:ingress                          ingress                   regular  
istio-pilot:ingress                                   jupyter-ui:ingress                     ingress                   regular  
istio-pilot:ingress                                   katib-ui:ingress                       ingress                   regular  
istio-pilot:ingress                                   kfp-ui:ingress                         ingress                   regular  
istio-pilot:ingress                                   kubeflow-dashboard:ingress             ingress                   regular  
istio-pilot:ingress                                   kubeflow-volumes:ingress               ingress                   regular  
istio-pilot:ingress                                   mlflow-server:ingress                  ingress                   regular  
istio-pilot:ingress                                   oidc-gatekeeper:ingress                ingress                   regular  
istio-pilot:ingress                                   tensorboards-web-app:ingress           ingress                   regular  
istio-pilot:ingress-auth                              oidc-gatekeeper:ingress-auth           ingress-auth              regular  
istio-pilot:istio-pilot                               istio-ingressgateway:istio-pilot       k8s-service               regular  
istio-pilot:peers                                     istio-pilot:peers                      istio_pilot_peers         peer     
katib-db-manager:k8s-service-info                     katib-controller:k8s-service-info      k8s-service               regular  
katib-db:database                                     katib-db-manager:relational-db         mysql_client              regular  
katib-db:database-peers                               katib-db:database-peers                mysql_peers               peer     
katib-db:restart                                      katib-db:restart                       rolling_op                peer     
katib-db:upgrade                                      katib-db:upgrade                       upgrade                   peer     
kfp-api:kfp-api                                       kfp-persistence:kfp-api                k8s-service               regular  
kfp-api:kfp-api                                       kfp-ui:kfp-api                         k8s-service               regular  
kfp-db:database                                       kfp-api:relational-db                  mysql_client              regular  
kfp-db:database-peers                                 kfp-db:database-peers                  mysql_peers               peer     
kfp-db:restart                                        kfp-db:restart                         rolling_op                peer     
kfp-db:upgrade                                        kfp-db:upgrade                         upgrade                   peer     
kfp-viz:kfp-viz                                       kfp-api:kfp-viz                        k8s-service               regular  
knative-serving:local-gateway                         kserve-controller:local-gateway        serving-local-gateway     regular  
kubeflow-dashboard:links                              jupyter-ui:dashboard-links             kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              katib-ui:dashboard-links               kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              kfp-ui:dashboard-links                 kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              kubeflow-volumes:dashboard-links       kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              mlflow-server:dashboard-links          kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              tensorboards-web-app:dashboard-links   kubeflow_dashboard_links  regular  
kubeflow-dashboard:links                              training-operator:dashboard-links      kubeflow_dashboard_links  regular  
kubeflow-profiles:kubeflow-profiles                   kubeflow-dashboard:kubeflow-profiles   k8s-service               regular  
loki-logging:logging                                  grafana-agent-k8s:logging-consumer     loki_push_api             regular  
minio:object-storage                                  argo-controller:object-storage         object-storage            regular  
minio:object-storage                                  kfp-api:object-storage                 object-storage            regular  
minio:object-storage                                  kfp-profile-controller:object-storage  object-storage            regular  
minio:object-storage                                  kfp-ui:object-storage                  object-storage            regular  
mlflow-minio:object-storage                           kserve-controller:object-storage       object-storage            regular  
mlflow-minio:object-storage                           mlflow-server:object-storage           object-storage            regular  
mlflow-mysql:database                                 mlflow-server:relational-db            mysql_client              regular  
mlflow-mysql:database-peers                           mlflow-mysql:database-peers            mysql_peers               peer     
mlflow-mysql:restart                                  mlflow-mysql:restart                   rolling_op                peer     
mlflow-mysql:upgrade                                  mlflow-mysql:upgrade                   upgrade                   peer     
mlmd:grpc                                             envoy:grpc                             k8s-service               regular  
mlmd:grpc                                             kfp-metadata-writer:grpc               k8s-service               regular  
oidc-gatekeeper:client-secret                         oidc-gatekeeper:client-secret          client-secret             peer     
oidc-gatekeeper:oidc-client                           dex-auth:oidc-client                   oidc-client               regular  
prometheus-receive-remote-write:receive-remote-write  grafana-agent-k8s:send-remote-write    prometheus_remote_write   regular  
resource-dispatcher:pod-defaults                      mlflow-server:pod-defaults             kubernetes_manifest       regular  
resource-dispatcher:secrets                           kserve-controller:secrets              kubernetes_manifest       regular  
resource-dispatcher:secrets                           mlflow-server:secrets                  kubernetes_manifest       regular  

Checking connectivity


root@maas:~# juju exec --unit grafana-agent-k8s/0 -m juju-controller:kubeflow 'curl -s http://10.70.80.1/cos-prometheus-0/api/v1/status/runtimeinfo'
{"status":"success","data":{"startTime":"2024-12-23T14:32:01.207038325Z","CWD":"/","reloadConfigSuccess":true,"lastConfigTime":"2024-12-23T14:33:19Z","corruptionCount":0,"goroutineCount":56,"GOMAXPROCS":8,"GOMEMLIMIT":9223372036854775807,"GOGC":"","GODEBUG":"","storageRetention":"15d or 819MiB204KiB819B"}}
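
The curl check above only shows that the Prometheus HTTP endpoint behind 10.70.80.1 is reachable from a Kubeflow unit. The errors in the debug-log are about something else: controller-0 cannot look up the name of the remote controller's service. That name ends in `.svc.cluster.local`, which is normally only resolvable from inside the Kubernetes cluster that owns it. Below is a minimal sketch for testing that lookup, assuming it is run on the machine hosting controller-0 and that `getent` is available there. `extract_unresolved_host` and `check_resolves` are hypothetical helper names.

```shell
#!/bin/sh
# Extract the hostname from a Juju 'cannot resolve "<name>"' log line.
# Prints nothing if the line contains no such message.
extract_unresolved_host() {
    printf '%s\n' "$1" | sed -n 's/.*cannot resolve "\([^"]*\)".*/\1/p'
}

# Report whether a hostname resolves through the local resolver.
check_resolves() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "resolves: $1"
    else
        echo "does not resolve: $1"
    fi
}

# One of the error lines from the debug-log output, copied verbatim.
line='controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr exited "loki-logging": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving'

host=$(extract_unresolved_host "$line")
check_resolves "$host"
```

If the name does not resolve there, that would match the debug-log errors. `juju show-controller juju-remote-microk8s` lists the API endpoints recorded for the remote controller, which may help when comparing against this name.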

I am new to all this; any help would be appreciated.

Regards.