Hi,
The external network I am using is 10.70.0.0/16.
My deployment is as follows:
- a MAAS VM with the juju client
- a Juju VM as a controller (the first controller)
- a Kubernetes cluster deployed with Charmed Kubernetes v1.30, composed of:
  - 3 VMs as control-plane
  - 4 physical nodes as workers
- Charmed Kubeflow (v1.9) deployed on top of it, accessible at the IP 10.70.250.1
I then deployed a new VM for observability, running MicroK8s v1.30:
- I deployed MicroK8s on the VM following: Charmhub | Getting started on MicroK8s
- I used my MAAS machine to deploy a second Juju controller to manage the MicroK8s VM
- I merged the two Kubernetes configs so that I can see both clusters (as contexts)
- I deployed COS on the MicroK8s VM; it is accessible at the IP 10.70.80.1
- The endpoints are accessible: Prometheus, Grafana, etc.
- I then followed this article to integrate Kubeflow with COS: Integrate with Canonical Observability Stack | Documentation | Charmed Kubeflow
- The connectivity check worked and returned: success
- I did the offer and consume steps, but I got errors; it seems that Kubeflow cannot talk to COS.
It looks like a DNS resolution problem, but I don’t see how to correct it.
This is what I am seeing:
root@maas:~# juju debug-log
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "loki-logging"
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "grafana-dashboards"
controller-0: 22:16:44 INFO juju.worker.remoterelations cmr start "prometheus-receive-remote-write"
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for prometheus-receive-remote-write: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr stopped "prometheus-receive-remote-write", err: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr non-fatal error "prometheus-receive-remote-write": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr exited "prometheus-receive-remote-write": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for loki-logging: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr stopped "loki-logging", err: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr non-fatal error "loki-logging": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr restarting "prometheus-receive-remote-write" in 15s
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr exited "loki-logging": cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
controller-0: 22:16:46 INFO juju.worker.remoterelations cmr restarting "loki-logging" in 15s
controller-0: 22:16:46 ERROR juju.worker.remoterelations cmr error in remote application worker for grafana-dashboards: cannot connect to external controller: opening facade to remote model: cannot resolve "controller-service.controller-juju-remote-microk8s.svc.cluster.local": lookup controller-service.controller-juju-remote-microk8s.svc.cluster.local on 127.0.0.53:53: server misbehaving
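The hostname in these errors ends in `.svc.cluster.local`, which is the form Kubernetes uses for cluster-internal service names. Such names are normally resolvable only from inside the cluster that owns them (here, presumably the MicroK8s one), not from the resolver at 127.0.0.53 that the log shows being queried. Below is a minimal shell sketch that pulls the unresolved names out of the log text and flags the cluster-internal ones. The function names are hypothetical helpers, not part of Juju.

```shell
# Hypothetical helpers (not part of Juju).

# Read `juju debug-log` text on stdin and print each distinct hostname
# that appears in a 'cannot resolve "<name>"' message.
unresolved_hosts() {
  sed -n 's/.*cannot resolve "\([^"]*\)".*/\1/p' | sort -u
}

# Succeed when the given name looks like a Kubernetes cluster-internal
# service name, i.e. it ends in .svc.cluster.local.
is_cluster_internal() {
  case "$1" in
    *.svc.cluster.local) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one line per unresolved host, tagged with its kind.
classify_unresolved() {
  unresolved_hosts | while IFS= read -r host; do
    if is_cluster_internal "$host"; then
      printf '%s cluster-internal\n' "$host"
    else
      printf '%s other\n' "$host"
    fi
  done
}
```

One way to use it would be `juju debug-log --replay --no-tail | classify_unresolved`. Running `getent hosts <name>` on a given machine then shows whether that machine can resolve the name at all.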
root@maas:~# kubectl config get-contexts
CURRENT   NAME           CLUSTER            AUTHINFO    NAMESPACE
          juju-context   juju-cluster       adminks
*         microk8s       microk8s-cluster   adminmk8s
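The two contexts listed here come from the merged kubeconfig. As a minimal sketch of one common way to produce such a merge: `kubectl` reads every file listed in the colon-separated `KUBECONFIG` variable, and `kubectl config view --flatten` writes them out as one self-contained file. `join_paths` and `merge_kubeconfigs` are illustrative helper names, and this is not necessarily how the merge was done in this deployment.

```shell
# Join the given paths with ':', the separator kubectl expects in KUBECONFIG.
# The subshell body keeps the IFS change local to this function.
join_paths() (
  IFS=':'
  printf '%s\n' "$*"
)

# merge_kubeconfigs OUT IN1 IN2 ...
# Write a single flattened kubeconfig containing every cluster, user and
# context from the input files. Requires kubectl on PATH. OUT should not be
# one of the input files, because it is truncated before kubectl reads them.
merge_kubeconfigs() {
  out="$1"
  shift
  KUBECONFIG="$(join_paths "$@")" kubectl config view --flatten > "$out"
}
```

For example, `merge_kubeconfigs ~/.kube/merged ~/.kube/ck8s-config ~/.kube/microk8s-config` (hypothetical file names), followed by `kubectl --kubeconfig ~/.kube/merged config get-contexts`.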
root@maas:~# juju controllers
Use --refresh option with this command to see the latest information.
Controller            Model     User   Access     Cloud/Region               Models  Nodes  HA    Version
juju-controller*      kubeflow  admin  superuser  maas01-cloud/default       3       8      none  3.5.4
juju-remote-microk8s  cos       admin  superuser  remote-microk8s/localhost  2       1      -     3.6.1
root@maas:~# juju switch juju-remote-microk8s
juju-controller:admin/kubeflow -> juju-remote-microk8s:admin/cos
root@maas:~# juju status --relations
Model Controller Cloud/Region Version SLA Timestamp
cos juju-remote-microk8s remote-microk8s/localhost 3.6.1 unsupported 22:14:51+01:00
App Version Status Scale Charm Channel Rev Address Exposed Message
alertmanager 0.27.0 active 1 alertmanager-k8s latest/stable 128 10.152.183.135 no
catalogue active 1 catalogue-k8s latest/stable 59 10.152.183.124 no
grafana 9.5.3 active 1 grafana-k8s latest/stable 117 10.152.183.61 no
loki 2.9.6 active 1 loki-k8s latest/stable 161 10.152.183.56 no
prometheus 2.52.0 active 1 prometheus-k8s latest/stable 210 10.152.183.142 no
traefik 2.11.0 active 1 traefik-k8s latest/stable 203 10.152.183.225 no Serving at 10.70.80.1
Unit Workload Agent Address Ports Message
alertmanager/0* active idle 10.1.106.154
catalogue/0* active idle 10.1.106.144
grafana/0* active idle 10.1.106.155
loki/0* active idle 10.1.106.156
prometheus/0* active idle 10.1.106.157
traefik/0* active idle 10.1.106.153 Serving at 10.70.80.1
Offer Application Charm Rev Connected Endpoint Interface Role
alertmanager-karma-dashboard alertmanager alertmanager-k8s 128 0/0 karma-dashboard karma_dashboard provider
grafana-dashboards grafana grafana-k8s 117 0/0 grafana-dashboard grafana_dashboard requirer
loki-logging loki loki-k8s 161 0/0 logging loki_push_api provider
prometheus-receive-remote-write prometheus prometheus-k8s 210 0/0 receive-remote-write prometheus_remote_write provider
Integration provider Requirer Interface Type Message
alertmanager:alerting loki:alertmanager alertmanager_dispatch regular
alertmanager:alerting prometheus:alertmanager alertmanager_dispatch regular
alertmanager:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
alertmanager:grafana-source grafana:grafana-source grafana_datasource regular
alertmanager:replicas alertmanager:replicas alertmanager_replica peer
alertmanager:self-metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
catalogue:catalogue alertmanager:catalogue catalogue regular
catalogue:catalogue grafana:catalogue catalogue regular
catalogue:catalogue prometheus:catalogue catalogue regular
catalogue:replicas catalogue:replicas catalogue_replica peer
grafana:grafana grafana:grafana grafana_peers peer
grafana:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
grafana:replicas grafana:replicas grafana_replicas peer
loki:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
loki:grafana-source grafana:grafana-source grafana_datasource regular
loki:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
loki:replicas loki:replicas loki_replica peer
prometheus:grafana-dashboard grafana:grafana-dashboard grafana_dashboard regular
prometheus:grafana-source grafana:grafana-source grafana_datasource regular
prometheus:prometheus-peers prometheus:prometheus-peers prometheus_peers peer
traefik:ingress alertmanager:ingress ingress regular
traefik:ingress catalogue:ingress ingress regular
traefik:ingress-per-unit loki:ingress ingress_per_unit regular
traefik:ingress-per-unit prometheus:ingress ingress_per_unit regular
traefik:metrics-endpoint prometheus:metrics-endpoint prometheus_scrape regular
traefik:peers traefik:peers traefik_peers peer
root@maas:~# juju switch juju-controller
juju-remote-microk8s:admin/cos -> juju-controller:admin/kubeflow
root@maas:~# juju status --relations
Model Controller Cloud/Region Version SLA Timestamp
kubeflow juju-controller kflow-cloud/default 3.5.4 unsupported 22:15:33+01:00
SAAS                             Status  Store                 URL
grafana-dashboards               error   juju-remote-microk8s  admin/cos.grafana-dashboards
loki-logging                     error   juju-remote-microk8s  admin/cos.loki-logging
prometheus-receive-remote-write  error   juju-remote-microk8s  admin/cos.prometheus-receive-remote-write
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook 1.9/stable 344 10.152.183.146 no
argo-controller active 1 argo-controller 3.4/stable 617 10.152.183.20 no
dex-auth active 1 dex-auth 2.39/stable 588 10.152.183.54 no
envoy active 1 envoy 2.2/stable 310 10.152.183.132 no
grafana-agent-k8s 0.40.4 blocked 1 grafana-agent-k8s latest/stable 80 10.152.183.236 no Missing incoming ('requires') relation: metrics-endpoint|logging-provider|grafana-dashboards-consumer
istio-ingressgateway active 1 istio-gateway 1.22/stable 1280 10.152.183.93 no
istio-pilot active 1 istio-pilot 1.22/stable 1169 10.152.183.131 no
jupyter-controller active 1 jupyter-controller 1.9/stable 1083 10.152.183.59 no
jupyter-ui active 1 jupyter-ui 1.9/stable 961 10.152.183.147 no
katib-controller active 1 katib-controller 0.17/stable 813 10.152.183.104 no
katib-db 8.0.37-0ubuntu0.22.04.3 active 1 mysql-k8s 8.0/stable 180 10.152.183.209 no
katib-db-manager active 1 katib-db-manager 0.17/stable 713 10.152.183.145 no
katib-ui active 1 katib-ui 0.17/stable 713 10.152.183.178 no
kfp-api active 1 kfp-api 2.3/stable 1743 10.152.183.170 no
kfp-db 8.0.37-0ubuntu0.22.04.3 active 1 mysql-k8s 8.0/stable 180 10.152.183.247 no
kfp-metadata-writer active 1 kfp-metadata-writer 2.3/stable 825 10.152.183.144 no
kfp-persistence active 1 kfp-persistence 2.3/stable 1756 10.152.183.39 no
kfp-profile-controller active 1 kfp-profile-controller 2.3/stable 1715 10.152.183.214 no
kfp-schedwf active 1 kfp-schedwf 2.3/stable 1765 10.152.183.235 no
kfp-ui active 1 kfp-ui 2.3/stable 1752 10.152.183.105 no
kfp-viewer active 1 kfp-viewer 2.3/stable 1781 10.152.183.83 no
kfp-viz active 1 kfp-viz 2.3/stable 1700 10.152.183.87 no
knative-eventing active 1 knative-eventing 1.12/stable 459 10.152.183.106 no
knative-operator active 1 knative-operator 1.12/stable 496 10.152.183.107 no
knative-serving active 1 knative-serving 1.12/stable 487 10.152.183.243 no
kserve-controller active 1 kserve-controller 0.13/stable 655 10.152.183.231 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.9/stable 659 10.152.183.171 no
kubeflow-profiles active 1 kubeflow-profiles 1.9/stable 458 10.152.183.111 no
kubeflow-roles active 1 kubeflow-roles 1.9/stable 240 10.152.183.219 no
kubeflow-volumes active 1 kubeflow-volumes 1.9/stable 348 10.152.183.227 no
metacontroller-operator active 1 metacontroller-operator 3.0/stable 352 10.152.183.44 no
minio res:oci-image@220b31a active 1 minio ckf-1.9/stable 383 10.152.183.84 no
mlflow-minio res:oci-image@220b31a active 1 minio ckf-1.9/stable 383 10.152.183.78 no
mlflow-mysql 8.0.37-0ubuntu0.22.04.3 active 1 mysql-k8s 8.0/stable 180 10.152.183.250 no
mlflow-server active 1 mlflow-server 2.15/stable 762 10.152.183.207 no
mlmd active 1 mlmd ckf-1.9/stable 213 10.152.183.189 no
oidc-gatekeeper active 1 oidc-gatekeeper ckf-1.9/stable 423 10.152.183.33 no
pvcviewer-operator active 1 pvcviewer-operator 1.9/stable 204 10.152.183.60 no
resource-dispatcher active 1 resource-dispatcher 2.0/stable 182 10.152.183.183 no
tensorboard-controller active 1 tensorboard-controller 1.9/stable 355 10.152.183.165 no
tensorboards-web-app active 1 tensorboards-web-app 1.9/stable 343 10.152.183.98 no
training-operator active 1 training-operator 1.8/stable 545 10.152.183.166 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 192.168.93.51
argo-controller/0* active idle 192.168.93.61
dex-auth/0* active idle 192.168.93.33
envoy/0* active idle 192.168.152.62
grafana-agent-k8s/0* blocked idle 192.168.152.40 Missing incoming ('requires') relation: metrics-endpoint|logging-provider|grafana-dashboards-consumer
istio-ingressgateway/0* active idle 192.168.93.57
istio-pilot/0* active idle 192.168.93.39
jupyter-controller/0* active idle 192.168.152.56
jupyter-ui/0* active idle 192.168.152.49
katib-controller/0* active idle 192.168.152.61
katib-db-manager/0* active idle 192.168.93.43
katib-db/0* active idle 192.168.152.38 Primary
katib-ui/0* active idle 192.168.93.40
kfp-api/0* active idle 192.168.152.32
kfp-db/0* active idle 192.168.26.216 Primary
kfp-metadata-writer/0* active idle 192.168.93.48
kfp-persistence/0* active idle 192.168.93.30
kfp-profile-controller/0* active idle 192.168.26.246
kfp-schedwf/0* active idle 192.168.152.51
kfp-ui/0* active idle 192.168.152.58
kfp-viewer/0* active idle 192.168.93.53
kfp-viz/0* active idle 192.168.93.54
knative-eventing/0* active idle 192.168.93.45
knative-operator/0* active idle 192.168.251.148
knative-serving/0* active idle 192.168.152.53
kserve-controller/0* active idle 192.168.26.248
kubeflow-dashboard/0* active idle 192.168.93.35
kubeflow-profiles/0* active idle 192.168.152.55
kubeflow-roles/0* active idle 192.168.93.50
kubeflow-volumes/0* active idle 192.168.26.242
metacontroller-operator/0* active idle 192.168.93.37
minio/0* active idle 192.168.251.159 9000-9001/TCP
mlflow-minio/0* active idle 192.168.251.150 9000-9001/TCP
mlflow-mysql/0* active idle 192.168.26.213 Primary
mlflow-server/0* active idle 192.168.251.190
mlmd/0* active idle 192.168.152.11
oidc-gatekeeper/0* active idle 192.168.26.245
pvcviewer-operator/0* active idle 192.168.152.4
resource-dispatcher/0* active idle 192.168.152.8
tensorboard-controller/0* active idle 192.168.26.244
tensorboards-web-app/0* active idle 192.168.251.187
training-operator/0* active idle 192.168.93.58
Integration provider Requirer Interface Type Message
dex-auth:dex-oidc-config oidc-gatekeeper:dex-oidc-config dex-oidc-config regular
grafana-agent-k8s:grafana-dashboards-provider grafana-dashboards:grafana-dashboard grafana_dashboard regular
grafana-agent-k8s:peers grafana-agent-k8s:peers grafana_agent_replica peer
istio-pilot:gateway-info kserve-controller:ingress-gateway istio-gateway-info regular
istio-pilot:gateway-info tensorboard-controller:gateway-info istio-gateway-info regular
istio-pilot:ingress dex-auth:ingress ingress regular
istio-pilot:ingress envoy:ingress ingress regular
istio-pilot:ingress jupyter-ui:ingress ingress regular
istio-pilot:ingress katib-ui:ingress ingress regular
istio-pilot:ingress kfp-ui:ingress ingress regular
istio-pilot:ingress kubeflow-dashboard:ingress ingress regular
istio-pilot:ingress kubeflow-volumes:ingress ingress regular
istio-pilot:ingress mlflow-server:ingress ingress regular
istio-pilot:ingress oidc-gatekeeper:ingress ingress regular
istio-pilot:ingress tensorboards-web-app:ingress ingress regular
istio-pilot:ingress-auth oidc-gatekeeper:ingress-auth ingress-auth regular
istio-pilot:istio-pilot istio-ingressgateway:istio-pilot k8s-service regular
istio-pilot:peers istio-pilot:peers istio_pilot_peers peer
katib-db-manager:k8s-service-info katib-controller:k8s-service-info k8s-service regular
katib-db:database katib-db-manager:relational-db mysql_client regular
katib-db:database-peers katib-db:database-peers mysql_peers peer
katib-db:restart katib-db:restart rolling_op peer
katib-db:upgrade katib-db:upgrade upgrade peer
kfp-api:kfp-api kfp-persistence:kfp-api k8s-service regular
kfp-api:kfp-api kfp-ui:kfp-api k8s-service regular
kfp-db:database kfp-api:relational-db mysql_client regular
kfp-db:database-peers kfp-db:database-peers mysql_peers peer
kfp-db:restart kfp-db:restart rolling_op peer
kfp-db:upgrade kfp-db:upgrade upgrade peer
kfp-viz:kfp-viz kfp-api:kfp-viz k8s-service regular
knative-serving:local-gateway kserve-controller:local-gateway serving-local-gateway regular
kubeflow-dashboard:links jupyter-ui:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links katib-ui:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links kfp-ui:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links kubeflow-volumes:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links mlflow-server:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links tensorboards-web-app:dashboard-links kubeflow_dashboard_links regular
kubeflow-dashboard:links training-operator:dashboard-links kubeflow_dashboard_links regular
kubeflow-profiles:kubeflow-profiles kubeflow-dashboard:kubeflow-profiles k8s-service regular
loki-logging:logging grafana-agent-k8s:logging-consumer loki_push_api regular
minio:object-storage argo-controller:object-storage object-storage regular
minio:object-storage kfp-api:object-storage object-storage regular
minio:object-storage kfp-profile-controller:object-storage object-storage regular
minio:object-storage kfp-ui:object-storage object-storage regular
mlflow-minio:object-storage kserve-controller:object-storage object-storage regular
mlflow-minio:object-storage mlflow-server:object-storage object-storage regular
mlflow-mysql:database mlflow-server:relational-db mysql_client regular
mlflow-mysql:database-peers mlflow-mysql:database-peers mysql_peers peer
mlflow-mysql:restart mlflow-mysql:restart rolling_op peer
mlflow-mysql:upgrade mlflow-mysql:upgrade upgrade peer
mlmd:grpc envoy:grpc k8s-service regular
mlmd:grpc kfp-metadata-writer:grpc k8s-service regular
oidc-gatekeeper:client-secret oidc-gatekeeper:client-secret client-secret peer
oidc-gatekeeper:oidc-client dex-auth:oidc-client oidc-client regular
prometheus-receive-remote-write:receive-remote-write grafana-agent-k8s:send-remote-write prometheus_remote_write regular
resource-dispatcher:pod-defaults mlflow-server:pod-defaults kubernetes_manifest regular
resource-dispatcher:secrets kserve-controller:secrets kubernetes_manifest regular
resource-dispatcher:secrets mlflow-server:secrets kubernetes_manifest regular
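The three SAAS entries and their integrations in this status correspond to consume and integrate commands of roughly the following shape. This is a sketch reconstructed from the offer URLs and endpoint names in the output, not a record of the exact commands that were run. `print_cmr_commands` is a hypothetical helper that only prints the commands, so they can be reviewed before anything is executed.

```shell
# Controller and model that host the offers, taken from the Store and URL
# columns of the SAAS table.
OFFER_CONTROLLER="juju-remote-microk8s"
OFFER_MODEL="admin/cos"

# print_cmr_commands OFFER OFFER_ENDPOINT AGENT_ENDPOINT
# Print (do not run) the consume and integrate commands for one offer.
print_cmr_commands() {
  printf 'juju consume %s:%s.%s\n' "$OFFER_CONTROLLER" "$OFFER_MODEL" "$1"
  printf 'juju integrate grafana-agent-k8s:%s %s:%s\n' "$3" "$1" "$2"
}

# Endpoint pairs as they appear in the integration table of the kubeflow model.
print_all_cmr_commands() {
  print_cmr_commands prometheus-receive-remote-write receive-remote-write send-remote-write
  print_cmr_commands loki-logging logging logging-consumer
  print_cmr_commands grafana-dashboards grafana-dashboard grafana-dashboards-provider
}
```

Calling `print_all_cmr_commands` prints six lines, one consume and one integrate per offer, which would be run with the kubeflow model selected.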
Checking connectivity from the grafana-agent-k8s unit to Prometheus in COS:
root@maas:~# juju exec --unit grafana-agent-k8s/0 -m juju-controller:kubeflow 'curl -s http://10.70.80.1/cos-prometheus-0/api/v1/status/runtimeinfo'
{"status":"success","data":{"startTime":"2024-12-23T14:32:01.207038325Z","CWD":"/","reloadConfigSuccess":true,"lastConfigTime":"2024-12-23T14:33:19Z","corruptionCount":0,"goroutineCount":56,"GOMAXPROCS":8,"GOMEMLIMIT":9223372036854775807,"GOGC":"","GODEBUG":"","storageRetention":"15d or 819MiB204KiB819B"}}
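This check exercises the HTTP path from the grafana-agent-k8s unit to Prometheus through the address 10.70.80.1. It does not touch the controller-to-controller connection that the `cmr` errors in the log refer to, which may be why it succeeds while the SAAS entries stay in error. Below is a small sketch for scripting the same check, assuming `curl` and `sed` are available where it runs. `json_status` and `check_prometheus` are hypothetical helper names.

```shell
# Print the value of the top-level "status" field from a Prometheus API
# response read on stdin. This is a simple pattern match that relies on
# "status" being the first key, not a full JSON parser.
json_status() {
  sed -n 's/^{"status":"\([^"]*\)".*/\1/p'
}

# check_prometheus BASE_URL
# Succeed only if the runtimeinfo endpoint answers with status "success".
check_prometheus() {
  [ "$(curl -s "$1/api/v1/status/runtimeinfo" | json_status)" = "success" ]
}
```

With the base URL from the command above, this would be `check_prometheus http://10.70.80.1/cos-prometheus-0 && echo reachable`.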
I am new to all this; any help will be appreciated.
Regards.