Know What References Cost

Part of https://discourse.jujucharms.com/t/read-before-contributing/47, an opinionated guide by William Reade

Know What References Cost

…and remember that maps and slices are references too. This is really
an extension of the attitude that informs no-globals: every reference
that leaves your control, whether as a parameter or a result, should be
considered suspect – anyone can change it, from any goroutine, from
arbitrarily distant code. So, especially for raw/pure data: the cost of
copying structs is utterly negligible anyway, and is vastly cheaper
than the dev time wasted tracking down -race reports (if you’re lucky)
or weird unreproducible runtime corruption (if you’re not).

Evidently, you often do need to pass references around, and probably
most of your methods will still take pointer receivers: don’t twist your
code to avoid pointers. But be acutely aware of the costs of radically
enlarging the scope that can affect your data. If you’re dealing with
dumb data – maps, slices, plain structs – strongly prefer to copy
them rather than allow any other component to share access to them.

Errors Are Important

Really, they are. You’ll see a lot of variations on a theme of:

if err != nil {
    return errors.Trace(err)
} 

…but those variations are important, and often quite interesting. For
example, that form implies that it’s seeing a “can’t-happen” error –
one completely outside the code’s capacity to handle. It’s so common
precisely because most code is not designed to handle every possible
error (and it shouldn’t be!). But what it does need is to be mindful
of everything that could go wrong; and those error stanzas punctuate the
flow of code by delineating the points at which things could go wrong,
and thus leaving you in the best possible position to design robust,
resumable algorithms.

And, of course, it’s in the variation that you see actual error
handling. One particular error in this context implies inconsistent
state: push that up a level for possible retry. Another means that we’ve
already completed the task, continue to skip ahead to the next entry.
Etc etc. Yes: errors are for flow control. Everyone says not to use
exceptions for flow control, conveniently forgetting that that’s exactly
what they do; the trouble is only that it’s such an opaque mechanism to
casual inspection that sane use of exceptions requires that you
tightly proscribe how they’re used (just like sane use of panic/recover,
it would seem…).

By putting the error-handling front and centre, Golang puts you in a
position where you both can and must decide what to do based on your
errors; the error stanza is an intentional evocation of “I cannot
handle this condition”, where a mere failure to try: may or may not be
deliberate.

And yeah, it’s repetitive, nobody likes that: but a lot of what’s great
about Golang comes from its steadfast refusal to add magic just to
reduce boilerplate.

Nobody Understands Unknown Errors

When you get an unknown error, that signals one thing and one thing
only: that you do not know what happened. I’ll say it again, for
emphasis, because understanding this is so desperately critical to
writing code that behaves properly in challenging circumstances:

An unknown error means you DO NOT KNOW what happened.

In particular: an operation that returns an unknown error has NOT
(necessarily) FAILED. Seriously! This is important! When you get an
error you don’t recognise, all you know is that you do not know what’s
going on
– to proceed assuming either success or failure will
sometimes be wrong. (Yes, usually it’s harmless to assume failure –
that is certainly the common case. But the weirdest bugs live in the
shadows of the assumptions you don’t even know you’re making, and this
is a really common mistake.)

The only safe thing to do with an unknown error is to stop what
you’re doing
and report it up the chain. If you’re familiar with
languages that use exceptions, you should already be comfortable with
this approach; you know that, for example, except: pass is a Bad Idea,
and you know that as soon as an error happens you’ll stop what you’re
doing because the exception machinery will enforce that.

The thing further up the chain then gets to decide whether to retry or
to report it further; this is situational, but you generally shouldn’t
have to worry about it while you’re writing the code that might be
retried. (If you follow the juju-specific advice below then your workers
will always be retried anyway. You might at some point decide you need
finer-grained retry logic; this might be OK, but be sure you need it
before you dive in, it’s hard to write and hard to test.)

Nobody Understands Transient Errors

It is tediously common for developers, faced with a reliability problem,
to start off by thinking “well, we need to detect transient errors”.

If you do that, you’ve already screwed up: you’re slicing the set of all
possible errors along the wrong line, and implicitly placing the
infinitude of unknown errors in the set of “permanent” errors; you thus
doom yourself to a literally Sisyphean task of repeatedly discovering
that there’s one more error characteristic that you should treat as a
signal of transience.

What you actually have to do is assume that all errors are transient
until/unless you know otherwise: that is, your job is to detect and
handle or report the permanent errors, the ones that stand no chance
of being addressed without out-of-band intervention.

There will probably still be cases you miss, so it doesn’t guarantee
that your error handling will stand unchanged forever; but what it
does mean is that your component can be fairly resilient in the face
of reality, and might not develop a reputation for collapsing in a
useless heap every time it encounters a new situation.

As noted above, juju has mechanisms for taking these issues off your
plate: see discussion of dependency.Engine in particular.

Trace Errors With Abandon

Tracebacks are a crutch to work around unsophisticated error design.
Sadly nobody really seems to know how to design errors so nicely as to
work around the need for something to tell you what code generated the
error; so, pretty much always, return the errors.Trace of an error.

(There are situations when a Trace is not quite ideal: consider
worker.Worker.Wait, which is reporting an error from another
context rather than returning an error from a failed operation. But
the annoyance of extra traces pales into insignificance beside that of
missing traces, so the two-word advice is “always Trace”.)

Annotate Errors With Caution

Imagine you’re a user, and you try to do something, and you see
something like this.

ERROR: cannot add unit riak/3: cannot assign "riak/3" to machine "7": machine 7 no longer alive: not found: machine 7: machine not found.

Horrible, innit? Inconsistent, stuttery, verbose… alarmingly familiar.
We have a couple of ways to avoid this; one is to always test exact
error messages instead of hiding the gunk away behind .*s, but that
only goes so far, because our tests are less and less full-stacky these
days and many of the contributing components are not participating in
the unit tests. It can save individual components from the shame of
contributing, at least, and that counts for something.

The other option, which can feel terribly unnatural, but seems to have
good results in practice, is to staunchly resist the temptation to
include any context that is known to your client. So, say you’re
implementing AssignUnitToMachine(*Unit, *Machine), it is not your
job
to tell the client that you “cannot assign” anything: the client
already knows what you’re trying to do. For this to work, though, you
have to recognise the precise seams in your application where you do
need to insert context, and this is essentially the same “good error
design” problem that humans don’t really seem to have cracked yet.

It also suffers badly from needing the whole chain of error
handling to be written in that style: i.e. the choice between Trace and
Annotate is not verifiable by local inspection alone. The important
thing is that you think about the errors you’re producing, and remain
aware that eventually some poor user will see it and have to figure out
what they need to do.

React To Specific Causes

The github.com/juju/errors package is great at what it does, but it
includes a few pre-made error types (NotFound, NotValid, AlreadyExists,
etc) that are potential traps. They’re all fine to use, but you need
to be aware that by choosing such a generic error you’re effectively
saying “nobody but a user can usefully react to this”.

Why’s that? Because they’re so general, and anyone can feel reasonably
justified in creating them, heedless of the chaos they can cause if
misinterpreted. If you’re seeing a CodeNotFound out of the apiserver,
does that necessarily mean that the entity you were operating on was
not found? Or does it mean that some other associated entity was not
found? Or even, why not, some useful intermediate layer lost track of
some implementation detail and informed you via a NotFound? You don’t
know, and you can’t know, what your future collaborators will do: your
best bet, then, is to react only to the most precise errors you can.
errors.IsNotFound(err) could mean anything; but if you see an
errors.Cause(err) == foobar.ErrNotBazQuxed you can be as certain as
it’s actually possible to be that the real problem is as stated. (It’s
true that collaborators could still bamboozle you with inappropriate
errors – but an errors.NotFoundf in package foo is never going to
ring alarms the way a totallydifferentpackage.ErrSomethingOrOther
will. At least you’ve got a chance of spotting the latter.)

Note also that the errors.IsFoo funcs are subtly misnamed: they’re
really IsCausedByFoo. This is nice and helpful but some packages
declare their own IsFoo funcs, for their own errors, and don’t
necessarily check causes. Therefore, helpful as the errors package’s
default behaviour may be, don’t rely on it – it’s a habit that will
eventually steer you invisibly wrong. When handling errors, always
extract the Cause and react to that; only if no special handling is
appropriate should you fall back to returning a Trace of the original
error.