One thing that we need to work on as a product is figuring out how to make it easy for users to fix problems they encounter. At the moment, something breaks and they immediately think “ffs juju”. Or perhaps “juju 🤦”, as used in the extract below.
Currently, Juju just throws its hands up, says something like “hook failed”, and leaves the user with no options. We kill any sense of control and leave users frustrated and stuck.
It’s a multi-faceted problem. Users think that Juju is misbehaving, but from Juju’s point of view, charms are misbehaving. Charm authors, in turn, usually depend on upstream layers, so from a charm’s point of view it is other charms or layers that are misbehaving. As a result, it’s far too difficult for users to identify the actual issue and report a bug to the right place, let alone fix problems they encounter in production.
Here’s a snippet from one of Canonical’s internal IRC channels. I think it’s representative.
[11:33:47]<tartley> my first error from 'juju status' in staging, pre-deploy:
[11:33:50]<tartley> snapdevicegw-r050f610/1 error idle 81 10.50.78.177 hook failed: "store-services-relation-changed"
[11:33:58]<tartley> No idea what I should do about this. Any clues?
[11:34:19]<roadmr> hm
[11:35:00]<tartley> doesn't go away when I 'juju status' again. :-)
[11:35:29]<roadmr> tartley: it has to do with verterok's change for the port thingy
[11:35:30]<roadmr> memcache_port = memcache.get_remote_all('port')[0]
[11:35:38]<roadmr> IndexError: list index out of range
[11:35:56]<verterok> tartley, roadmr back
[11:35:58]<verterok> looking
[11:36:25]<roadmr> thanks, I was about to go back to the MP to figure this out
[11:38:10]<verterok> the weird part, only one unit failed
[11:38:14]<verterok> thanks juju
[11:38:42]<verterok> tartley: taking a closer look, will free the env for you in a bit
[11:40:03]<roadmr> juju 🤦
[11:40:22]<tartley> verterok, okdokes, thanks for the assist, no hurry on my part.
[11:40:34]<verterok> will just destroy it and land some defensive code...
[11:40:58]<tartley> in fact, I'll step away a minute then, get kiddo dinner started, back soon...
[11:41:12]* tartley is now known as tartley|BRB
[11:41:20]<timclicks> roadmr: which charm/layer is this? is it a public memcache layer?
[11:42:47]<roadmr> timclicks: it's our (snapdevicegw) charm trying to get the port for those memcache units
[11:44:37]<verterok> timclicks: hi
[11:44:50]<verterok> timclicks: is using memcached layer
[11:44:59]<verterok> but is a pinned version
[11:45:01]* verterok checks
[11:45:24]<verterok> timclicks: the weird part is that 1/2 units got the port just fine
[11:45:47]<timclicks> :/
[11:46:01]<verterok> the charm is using memcache.get_remote_all('port')
[11:46:15]<verterok> as memcache_hosts() only return the IPs and not the ports
[11:46:56]<verterok> we are using memcache interface 5ccddd9
[11:49:05]<timclicks> I wonder if the call could (optionally?) block until a port is available
[11:49:40]<verterok> actually b3245f167e9cfbe10de1f88553420347c588eec7
[11:49:48]<timclicks> if one of the units gets the right info, then it's likely the 2nd one will eventually right itself
[11:49:58]<timclicks> once the data propagates through the system
[11:50:04]<verterok> it could...but it's relying in the available state
[11:50:16]<verterok> :)
[11:50:25]<timclicks> going back to tartley|BRB's question, perhaps waiting for a minute, then "juju resolve --all"
[11:51:00]<verterok> tried that, still failed
[11:51:04]<timclicks> sign
[11:51:09]<timclicks> *sigh
[11:51:36]<verterok> the code is: https://git.launchpad.net/snapdevicegw/tree/charm/reactive/snapdevicegw.py#n99
[11:52:06]<timclicks> this is one of those cases where the juju team abstains responsibility because it's a charming bug, but the charm writers don't know what to do because juju makes state management too hard
[11:52:18]<verterok> tartley|BRB: staging is clear now
[11:52:49]<timclicks> meanwhile users just see a borked system
[11:54:29]<timclicks> this definitely feels like a bug in the memcache layer. If memcache.available is True, then there should be a remote port to access
[11:54:55]<verterok> indeed
[11:55:08]<verterok> the interface is buggy
[11:55:10]<timclicks> that's outside of my area of expertise
[11:55:28]<verterok> https://github.com/omnivector-solutions/interface-memcache
[11:55:36]<verterok> I can probably fix it
[11:55:55]<verterok> instead of workaround it in our charm
[11:56:00]<timclicks> james beedy (bdx) on freenode, is a very active member of the community
[11:56:09]<timclicks> and he would accept patches very quickly
[11:56:19]<verterok> will send him a PR
[11:56:32]<verterok> thx for the pointer
[11:57:13]<timclicks> hopefully we can reduce the level of juju-facepalm in the future!
[11:57:20]<timclicks> (please don't make that an actual juju plugin)
[11:57:56]<verterok> roadmr: 2 ^
[11:58:01]<verterok> s/2//
[11:58:13]<verterok> your next plugin?
[11:58:47]* verterok EODs
[11:58:49]<roadmr> I don't recall if emoju has a 🤦 command, checking
[11:59:16]<roadmr> no, no facepalm, i'll add it :)
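
For readers wondering what the “defensive code” mentioned in the log might look like, here is a minimal sketch. It assumes, based only on the `IndexError` quoted above, that `get_remote_all('port')` can return an empty list while the remote unit has not yet published its port. `FakeMemcacheEndpoint`, `first_remote_port`, and `configure` are hypothetical names made up for illustration; they are not part of the memcache interface or the snapdevicegw charm.

```python
from typing import Optional, Sequence


def first_remote_port(ports: Sequence[Optional[str]]) -> Optional[int]:
    """Return the first published port as an int, or None if none has arrived."""
    for port in ports:
        if port:  # skip empty or missing values from units that have not published yet
            return int(port)
    return None


class FakeMemcacheEndpoint:
    """Hypothetical stand-in for the relation endpoint, used only for this sketch."""

    def __init__(self, ports):
        self._ports = list(ports)

    def get_remote_all(self, key):
        # Assumed behaviour: return every value that remote units published for `key`.
        return list(self._ports) if key == "port" else []


def configure(memcache) -> str:
    """Use the port if it is there; otherwise report 'waiting' instead of raising."""
    port = first_remote_port(memcache.get_remote_all("port"))
    if port is None:
        # Returning normally lets the hook succeed, so a later
        # relation-changed event can try again once the data has propagated.
        return "waiting for memcache port"
    return f"memcache port {port}"
```

With this shape, a unit that runs its hook before the port has propagated reports that it is waiting, rather than ending up in a “hook failed” error state that the user has to resolve by hand.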