I’ve deployed CDK as a bundle to a MAAS cloud. The deployment went well, but after grabbing the kubeconfig and starting to interact with the cluster, I noticed all is not well. I initially had trouble accessing the dashboard, and have since found that several core services are crashlooping. (Edited with the actual issue in post #2.)
I feel like I’m missing something trivial here, but the mire of K8s hoops I’ve jumped through to reach this point is making it hard to trace back to the core problem. Can someone please offer any pointers?
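For reference, this is roughly how I’m checking on the failing pods. The heapster deployment name below is an assumption based on the version in its logs, so check the real name in your cluster first:

```
# List the kube-system pods to spot anything stuck in CrashLoopBackOff
kubectl get pods -n kube-system

# Pull the log of the previously crashed container; the deployment name
# heapster-v1.6.0-beta.1 is an assumption -- verify with `kubectl -n kube-system get deploy`
kubectl -n kube-system logs deploy/heapster-v1.6.0-beta.1 --previous
```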
EDIT: so the heapster deployment is crashlooping:
I1226 10:41:15.694885 1 heapster.go:79] /heapster --source=kubernetes.summary_api:https://kubernetes.default?kubeletPort=10250&kubeletHttps=true --sink=influxdb:http://monitoring-influxdb:8086
I1226 10:41:15.694985 1 heapster.go:80] Heapster version v1.6.0-beta.1
I1226 10:41:15.695452 1 configs.go:61] Using Kubernetes client with master "https://kubernetes.default" and version v1
I1226 10:41:15.695488 1 configs.go:62] Using kubelet port 10250
I1226 10:41:15.796324 1 influxdb.go:312] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I1226 10:41:15.796380 1 heapster.go:203] Starting with InfluxDB Sink
I1226 10:41:15.796396 1 heapster.go:203] Starting with Metric Sink
I1226 10:41:16.298315 1 heapster.go:113] Starting heapster on port 8082
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1724540]
goroutine 108 [running]:
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeEphemeralStorageStatsForContainer(0xc4202e2f00, 0xc420769490, 0xc42069d4f0, 0xc42069d540)
/go/src/k8s.io/heapster/metrics/sources/summary/summary.go:276 +0x90
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeContainerStats(0xc4202e2f00, 0xc420b52930, 0xc420af9a68, 0x0, 0x298ce20)
/go/src/k8s.io/heapster/metrics/sources/summary/summary.go:238 +0x537
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodePodStats(0xc4202e2f00, 0xc420b526c0, 0xc420b526f0, 0xc420af9bd8)
/go/src/k8s.io/heapster/metrics/sources/summary/summary.go:212 +0x91e
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeSummary(0xc4202e2f00, 0xc4201d3500, 0x0)
/go/src/k8s.io/heapster/metrics/sources/summary/summary.go:132 +0x351
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).ScrapeMetrics(0xc4202e2f00, 0xed3b551bc, 0xc400000000, 0x296aa60, 0xed3b551f8, 0xc400000000, 0x296aa60, 0x296c008, 0x0, 0xc42000e913)
/go/src/k8s.io/heapster/metrics/sources/summary/summary.go:102 +0x116
k8s.io/heapster/metrics/sources.scrape(0x27bdb00, 0xc4202e2f00, 0xed3b551bc, 0xc400000000, 0x296aa60, 0xed3b551f8, 0x0, 0x296aa60, 0x0, 0x0, ...)
/go/src/k8s.io/heapster/metrics/sources/manager.go:176 +0x11f
k8s.io/heapster/metrics/sources.(*sourceManager).ScrapeMetrics.func1(0xc420272990, 0x27bdb00, 0xc4202e2f00, 0xc420268960, 0xed3b551bc, 0x0, 0x296aa60, 0xed3b551f8, 0x0, 0x296aa60, ...)
/go/src/k8s.io/heapster/metrics/sources/manager.go:99 +0x155
created by k8s.io/heapster/metrics/sources.(*sourceManager).ScrapeMetrics
/go/src/k8s.io/heapster/metrics/sources/manager.go:120 +0x387
I don’t see any log entries that hint at what’s going on.
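Other places I’ve been looking, in case it helps (the label selector and unit name here are examples/assumptions rather than values confirmed from the charm):

```
# Pod events sometimes say more than the container log does;
# k8s-app=heapster is the usual upstream label, but check your manifests
kubectl -n kube-system describe pod -l k8s-app=heapster

# Charm/agent logs on a worker, in case the problem is on the Juju side
juju debug-log --include kubernetes-worker/0 --replay | tail -n 100
```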
The juju status output looks like this:
seffyroff@Ame:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
base maas maas 2.5-rc1 unsupported 02:58:53-08:00
App Version Status Scale Charm Store Rev OS Notes
ceph-fs 13.2.1+dfsg1 active 1 ceph-fs jujucharms 36 ubuntu
ceph-mon 13.2.1+dfsg1 active 3 ceph-mon jujucharms 354 ubuntu
ceph-osd 13.2.1+dfsg1 active 4 ceph-osd jujucharms 380 ubuntu
easyrsa 3.0.1 active 1 easyrsa jujucharms 199 ubuntu
etcd 3.2.10 active 3 etcd jujucharms 352 ubuntu
flannel 0.10.0 active 5 flannel jujucharms 360 ubuntu
kubeapi-load-balancer 1.14.0 active 1 kubeapi-load-balancer jujucharms 538 ubuntu exposed
kubernetes-master 1.13.1 active 2 kubernetes-master jujucharms 542 ubuntu
kubernetes-worker 1.13.1 active 3 kubernetes-worker jujucharms 414 ubuntu exposed
Unit Workload Agent Machine Public address Ports Message
ceph-fs/0* active idle 2/lxd/1 10.0.10.219 Unit is ready (1 MDS)
ceph-mon/0 active idle 0/lxd/0 10.0.10.221 Unit is ready and clustered
ceph-mon/1* active idle 1/lxd/0 10.0.10.220 Unit is ready and clustered
ceph-mon/2 active idle 2/lxd/0 10.0.10.214 Unit is ready and clustered
ceph-osd/0 active idle 2 10.0.10.217 Unit is ready (1 OSD)
ceph-osd/1* active idle 1 10.0.10.222 Unit is ready (1 OSD)
ceph-osd/2 active idle 0 10.0.10.218 Unit is ready (1 OSD)
ceph-osd/3 active idle 3 10.0.10.211 Unit is ready (1 OSD)
easyrsa/0* active idle 0/lxd/1 10.0.10.223 Certificate Authority connected.
etcd/0 active idle 0/lxd/2 10.0.10.215 2379/tcp Healthy with 3 known peers
etcd/1 active idle 1/lxd/1 10.0.10.216 2379/tcp Healthy with 3 known peers
etcd/2* active idle 2/lxd/2 10.0.10.213 2379/tcp Healthy with 3 known peers
kubeapi-load-balancer/0* active idle 3/lxd/0 10.0.10.212 443/tcp Loadbalancer ready.
kubernetes-master/0* active idle 2 10.0.10.217 6443/tcp Kubernetes master running.
flannel/1 active idle 10.0.10.217 Flannel subnet 10.1.64.1/24
kubernetes-master/1 active idle 0 10.0.10.218 6443/tcp Kubernetes master running.
flannel/4 active idle 10.0.10.218 Flannel subnet 10.1.26.1/24
kubernetes-worker/0 active idle 1 10.0.10.222 80/tcp,443/tcp Kubernetes worker running.
flannel/0* active idle 10.0.10.222 Flannel subnet 10.1.85.1/24
kubernetes-worker/1* active idle 0 10.0.10.218 80/tcp,443/tcp Kubernetes worker running.
flannel/3 active idle 10.0.10.218 Flannel subnet 10.1.26.1/24
kubernetes-worker/2 active idle 2 10.0.10.217 80/tcp,443/tcp Kubernetes worker running.
flannel/2 active idle 10.0.10.217 Flannel subnet 10.1.64.1/24
Machine State DNS Inst id Series AZ Message
0 started 10.0.10.218 m3yhap bionic default Deployed
0/lxd/0 started 10.0.10.221 juju-c0ce59-0-lxd-0 bionic default Container started
0/lxd/1 started 10.0.10.223 juju-c0ce59-0-lxd-1 bionic default Container started
0/lxd/2 started 10.0.10.215 juju-c0ce59-0-lxd-2 bionic default Container started
1 started 10.0.10.222 ywywnt bionic default Deployed
1/lxd/0 started 10.0.10.220 juju-c0ce59-1-lxd-0 bionic default Container started
1/lxd/1 started 10.0.10.216 juju-c0ce59-1-lxd-1 bionic default Container started
2 started 10.0.10.217 n4wbbg bionic default Deployed
2/lxd/0 started 10.0.10.214 juju-c0ce59-2-lxd-0 bionic default Container started
2/lxd/1 started 10.0.10.219 juju-c0ce59-2-lxd-1 bionic default Container started
2/lxd/2 started 10.0.10.213 juju-c0ce59-2-lxd-2 bionic default Container started
3 started 10.0.10.211 g43qk3 bionic default Deployed
3/lxd/0 started 10.0.10.212 juju-c0ce59-3-lxd-0 bionic default Container started
Taking all other variables out of the equation, I deployed kubernetes-core from the charm bundle directly to a MAAS controller, using default placement and config, and Heapster still segfaults on deployment. I imagine something has changed somewhere upstream in the stack to cause this, but I welcome being proven wrong on that. Reading the upstream docs, it appears that Heapster itself is deprecated and no longer supported going forward?
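For the record, the clean-room test was roughly the following; the model name is just what I happened to pick, and fetching the kubeconfig via juju scp is the documented CDK approach:

```
# Fresh model on the same MAAS cloud, default constraints and placement
juju add-model k8s-core-test maas
juju deploy kubernetes-core

# Once the bundle settles, grab the kubeconfig and check the addon pods
juju scp kubernetes-master/0:config ~/.kube/config
kubectl -n kube-system get pods | grep -i heapster
```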
Hmm, wonder if @tvansteenburgh or @kos.tsakalozos have any insight into this.
Hi @seffyroff. You’re right, Heapster is no longer supported upstream, having been replaced by metrics-server (which we do ship in CDK). But the default Kubernetes dashboard still relies on Heapster. The dashboard will switch to metrics-server eventually (hopefully soon), but until then, if you want the dashboard, you need Heapster. All that to say: if you don’t need or want the dashboard, you can run
juju config kubernetes-master enable-dashboard-addons=false
and that will get rid of the dashboard and heapster pods. If you want to keep the dashboard, you could try changing the heapster pod to use the older 1.5.4 container image. From looking at the source, it doesn’t appear to have the bug you’re hitting.
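Something along these lines should do it. The deployment and container names below are assumptions (they may differ in your cluster), so check them first, and bear in mind the charm manages the addon manifests, so a manual change like this may not stick forever:

```
# Find the actual heapster deployment name
kubectl -n kube-system get deployments | grep -i heapster

# Pin it to the older image; deployment/container names here are assumptions
kubectl -n kube-system set image deployment/heapster-v1.6.0-beta.1 \
  heapster=k8s.gcr.io/heapster-amd64:v1.5.4
```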
Thanks for the clarification; I appreciate you giving up your time to reply.