CDK-Addons and custom charm

Hi, we used a custom cdk-addons snap because we needed to change some settings for the metrics module to avoid OOMKills.
Here is what we changed:

\-        "base_metrics_server_cpu": "40m",
\-        "base_metrics_server_memory": "40Mi",
\-        "metrics_server_memory_per_node": "4",
\+        "base_metrics_server_cpu": "200m",
\+       "base_metrics_server_memory": "256Mi",
\+        "metrics_server_memory_per_node": "16",

This is what I get when listing snaps:

ubuntu@juju-806d1a-52-lxd-1:~$ snap list
Name                     Version   Rev    Tracking       Publisher   Notes
cdk-addons               1.16.15   x3     -              -           -
core                     16-2.49   10859  latest/stable  canonical✓  core
core18                   20210128  1988   latest/stable  canonical✓  base
kube-apiserver           1.16.15   1789   1.16/stable    canonical✓  in-cohort
kube-controller-manager  1.16.15   1685   1.16/stable    canonical✓  in-cohort
kube-proxy               1.16.15   1673   1.16/stable    canonical✓  classic
kube-scheduler           1.16.15   1639   1.16/stable    canonical✓  in-cohort
kubectl                  1.16.15   1639   1.16/stable    canonical✓  classic,in-cohort
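Note the x3 revision with no tracking or publisher for cdk-addons: that is how snapd lists a sideloaded (locally installed) snap. You can double-check with snap info (the exact output varies by snapd version, so take the comment as an assumption):

snap info cdk-addons   # a sideloaded snap shows an "installed:" xN revision and no store tracking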

When trying to upgrade the kubernetes-master charm to revision 808, we get:

File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-kubernetes-master-15/charm/reactive/kubernetes_master.py", line 434, in join_or_update_cohorts
    snap.join_cohort_snapshot(snapname, cohort_key)
  File "lib/charms/layer/snap.py", line 425, in join_cohort_snapshot
    '--cohort', cohort_key])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['snap', 'refresh', 'cdk-addons', '--cohort', 'MSBlZUVJQ25EaUIzdGo5NHBLUU0xQnd0RWN2VmdIZTk1biAxNjEzNDU0NzQ4IDhmYmI0NzJmMmZhZjBkNmIyYWY2MzM4Yjk3OWJiZWY1NzRhMmJlODliMjg3ZWI2NWMyODNjYTY4ODdkMDk5Y2Y=']' returned non-zero exit status 1.

unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm Traceback (most recent call last):
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/charm/hooks/upgrade-charm", line 22, in <module>
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     main()
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm bus.dispatch(restricted=restricted_mode)
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm _invoke(other_handlers)
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     handler.invoke()
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     self._action(*args)
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/var/lib/juju/agents/unit-kubernetes-master-15/charm/reactive/kubernetes_master.py", line 434, in join_or_update_cohorts
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm snap.join_cohort_snapshot(snapname, cohort_key)
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "lib/charms/layer/snap.py", line 425, in join_cohort_snapshot
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     '--cohort', cohort_key])
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     **kwargs).stdout
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm   File "/usr/lib/python3.6/subprocess.py", line 438, in run
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm     output=stdout, stderr=stderr)
unit-kubernetes-master-15: 15:36:09 WARNING unit.kubernetes-master/15.upgrade-charm subprocess.CalledProcessError: Command '['snap', 'refresh', 'cdk-addons', '--cohort', 'MSBlZUVJQ25EaUIzdGo5NHBLUU0xQnd0RWN2VmdIZTk1biAxNjEzNDU0NzQ4IDhmYmI0NzJmMmZhZjBkNmIyYWY2MzM4Yjk3OWJiZWY1NzRhMmJlODliMjg3ZWI2NWMyODNjYTY4ODdkMDk5Y2Y=']' returned non-zero exit status 1.
unit-kubernetes-master-15: 15:36:09 ERROR juju.worker.uniter.operation hook "upgrade-charm" (via explicit, bespoke hook script) failed: exit status 1

Running the command manually gives this:

root@juju-806d1a-52-lxd-1:~# snap refresh cdk-addons --cohort MSBlZUVJQ25EaUIzdGo5NHBLUU0xQnd0RWN2VmdIZTk1biAxNjEzNDU0NzQ4IDhmYmI0NzJmMmZhZjBkNmIyYWY2MzM4Yjk3OWJiZWY1NzRhMmJlODliMjg3ZWI2NWMyODNjYTY4ODdkMDk5Y2Y=
error: local snap "cdk-addons" is unknown to the store, use --amend to proceed anyway 
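The error message itself points at a workaround: presumably the same refresh would go through with --amend (we verify this further down), but the charm’s snap layer does not pass that flag, so the hook keeps failing:

snap refresh cdk-addons --amend --cohort MSBlZUVJQ25EaUIzdGo5NHBLUU0xQnd0RWN2VmdIZTk1biAxNjEzNDU0NzQ4IDhmYmI0NzJmMmZhZjBkNmIyYWY2MzM4Yjk3OWJiZWY1NzRhMmJlODliMjg3ZWI2NWMyODNjYTY4ODdkMDk5Y2Y=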

So, what’s the solution here?
Is there a way to modify those settings while keeping the original charm?
Can we somehow bypass this one, since we manage it internally?

Patrick

(edited to turn copy blocks into code blocks)

It looks like we can’t use a custom snap anymore because of this. Is there anyone here who can help?

@Canonical, maybe you can contact me about support options?

@patrickd75
The Kubernetes charms added the use of cohorts to ensure point releases are consistent across the cluster. This was added in the 1.17 release, which you’re upgrading to (Release notes | Ubuntu).

The team is actually working to remove the addons and move them to Operators, which will provide more flexibility to configure and manage them during day-2 operations. Until there is an operator for managing metrics-server, I recommend you disable the addon-installed metrics-server and install the manifest with your customizations yourself, as sketched below.
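Something along these lines (a sketch; the enable-metrics config option and the manifest filename are assumptions, adjust to your setup):

juju config kubernetes-master enable-metrics=false   # stop cdk-addons from managing metrics-server
kubectl apply -f metrics-server-custom.yaml          # your own manifest with the larger requests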

I’m concerned that any other approach to this is going to leave you in a situation where you could again get failures on upgrade. Forking the snap or the charms will leave you outside the normally tested paths and will likely end in broken edge cases.

@chris.sanders

Ok, so how do you recommend we switch back to the “official” snap resource for cdk-addons, to be able to proceed with the upgrade?

We ran this to switch to our own snap:

juju attach-resource kubernetes-master cdk-addons=cdk-addons_1.16.15_amd64.snap

And now we need to go back to the mainstream snap, without causing issues in the production cluster.

Thanks for the help !

@patrickd75 first, you should definitely test such a procedure in a test cluster before doing it in production, since we’re outside the bounds of normalcy. Also, if you have support through Canonical, using your support portal is the best way to get help with a situation like this.

I’m happy to make some recommendations, with the above caveat: this isn’t a support portal, and these are my personal thoughts.

I presume you’ve completed the upgrade; have you upgraded to 1.17 on all of the units?

I think the safest thing to try would be to ssh onto the master nodes, uninstall the locally installed snap, and install it from the snap store (do this on all masters). Then you can clear the error and see if the snap can refresh and receive a cohort as expected.
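Roughly, per master, something like this (an untested sketch; the 1.16/stable channel is an assumption, match whatever your other Kubernetes snaps track):

juju ssh kubernetes-master/15
sudo snap remove cdk-addons
sudo snap install cdk-addons --channel=1.16/stable
exit
juju resolved kubernetes-master/15   # then clear the hook error so the charm retries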

It’s possible you could downgrade the master charm, remove the resource, and then upgrade, but that’s risky. Downgrades aren’t intended, and while Juju will let you do it, I absolutely wouldn’t do that without testing it first with your configuration to see how it handles (or doesn’t handle) being downgraded between those versions.

While I know you edited the snap, I don’t know that it would have been any different if someone had uploaded the official snap. Would you mind opening a bug with this information at Bugs : Kubernetes Master Charm?

I’d like to review this and see if it’s an upgrade bug. This is outside our support window, but we can at least take a look and see if there’s anything we can do with it. Again, not as an official support effort, but if there’s a bug in the charm that we can fix, I’d like to fix it.

Thanks for the help, @chris.sanders

So, we built a test cluster and put it in exactly the same state as the prod cluster.
We then installed the “new” snap and ran “juju resolve”. But Juju then proceeds to reinstall the custom snap, retries the “snap refresh --cohort”, and fails the same way. We then tried to catch it in a race condition by running the “snap refresh --cohort” with --amend as fast as possible. It worked, but we would need to do this on every master, and there seem to be other events that re-apply the old snap (like a leadership change in Juju for the master); see the loop sketched below.
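The race we ran per master looked roughly like this (a sketch of what we did; COHORT_KEY is the cohort key from the charm log above):

COHORT_KEY='MSBlZUVJQ25EaUIzdGo5NHBLUU0xQnd0RWN2VmdIZTk1biAxNjEzNDU0NzQ4IDhmYmI0NzJmMmZhZjBkNmIyYWY2MzM4Yjk3OWJiZWY1NzRhMmJlODliMjg3ZWI2NWMyODNjYTY4ODdkMDk5Y2Y='
# retry until the amended refresh wins the race against the charm re-installing the local snap
until sudo snap refresh cdk-addons --amend --cohort "$COHORT_KEY"; do sleep 1; done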

So, the problem remains. We initially ran “juju attach-resource kubernetes-master cdk-addons=cdk-addons_1.16.15_amd64.snap”, and there is no way to run this command and point the resource back at the Canonical store; it looks like it only accepts local files. Once you have run it, we didn’t find a way to reset it with something like “juju attach-resource kubernetes-master cdk-addons=”.

As you said, using the official snap manually would probably create the same problem. I will file the bug report.

For now, the only way we can think of doing it would be to go into Juju’s MongoDB (and any other places where this information is stored) and replace it there, then run a manual “snap refresh --amend” or something like that.

Thanks again for the help

Solution:

touch zero.snap
juju attach kubernetes-master cdk-addons=zero.snap

A zero-byte snap resource is what the charm normally deploys with; when it detects one, it installs from the snap store instead.
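To verify the switch afterwards on each master (hedged; exact revisions and channels will differ):

juju resolved kubernetes-master/15   # if a unit is still stuck in the hook error
snap list cdk-addons                 # should now show a store revision instead of xN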
