Distributed tracing

Distributed tracing is something that we’ve chatted about a few times as a team. I thought it would be worthwhile to start a discussion to record notes and ideas. I feel that Juju could have a role to play facilitating traces/spans, but am quite murky on what might be best.

The ecosystem seems to have settled on the OpenTelemetry standard.

An excerpt from the standard’s overview about what a distributed trace actually is:

Traces in OpenTelemetry are defined implicitly by their Spans. In particular, a Trace can be thought of as a directed acyclic graph (DAG) of Spans, where the edges between Spans are defined as parent/child relationship.

For example, the following is an example Trace made up of 6 Spans:

Causal relationships between Spans in a single Trace

        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a `child` of Span A)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F]

Sometimes it’s easier to visualize Traces with a time axis as in the diagram below:

Temporal relationships between Spans in a single Trace

––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

[Span A···················································]
  [Span B··············································]
     [Span D··········································]
   [Span C········································]
        [Span E·······]        [Span F··]
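
To make the parent/child idea concrete, here is a minimal sketch of the same six-span trace as plain Python data. It does not use the OpenTelemetry SDK; the `Span` class, the `render` helper and all of the timings are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Span:
    """A toy span: a named, timed operation with an optional parent."""
    name: str
    start: float
    end: float
    parent: Optional["Span"] = field(default=None, repr=False)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, start: float, end: float) -> "Span":
        # The parent/child edge is what links spans into a single trace.
        span = Span(name, start, end, parent=self)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, walking the trace depth-first."""
    lines = ["  " * depth + f"{span.name} [{span.start}..{span.end}]"]
    for sub in span.children:
        lines.extend(render(sub, depth + 1))
    return lines


# Same shape as the diagram above; the timings are invented.
span_a = Span("A", 0, 10)
span_b = span_a.child("B", 1, 9)
span_d = span_b.child("D", 2, 8)
span_c = span_a.child("C", 1, 7)
span_e = span_c.child("E", 2, 4)
span_f = span_c.child("F", 5, 6)

print("\n".join(render(span_a)))
```

The root span is simply the one with no parent, and both views (the tree and the timeline) fall out of the same records.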

There are 2 use cases here that are worth distinguishing:

  • using distributed traces within Juju (e.g. across the API server and the facades) to facilitate debugging
  • using distributed traces within a model: charm authors could perhaps opt in (via a hook?) to begin or annotate a span and leave Juju to do the data management

Just to check that I understand correctly: traces are sort of like logs, except with a beginning time and an end time, right?

Yes, but much more comprehensive. Let’s say that a user clicks a button on a webpage. The result is an HTTP 500 error. A distributed trace is able to collect information from everything that was triggered when serving the request… microservices, databases, etc.
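
What ties those pieces together is that every hop passes a trace ID along with the request. Below is a simplified sketch of that propagation using the header layout from the W3C Trace Context format (`version-traceid-spanid-flags`). The helper names and the three hops are assumptions made for the example, and real implementations handle more cases than this (for instance rejecting all-zero IDs).

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace: fresh trace ID, fresh span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(header):
    """Keep the caller's trace ID, but mint a new span ID for this hop."""
    match = TRACEPARENT.match(header)
    if match is None:
        # Unparseable context: start a new trace instead of guessing.
        return new_traceparent()
    trace_id, _caller_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical request path: web frontend -> API service -> database client.
frontend = new_traceparent()
api = child_traceparent(frontend)
database = child_traceparent(api)

for hop, header in [("frontend", frontend), ("api", api), ("database", database)]:
    print(f"{hop:9} {header}")
```

All three headers share the same trace ID, which is what lets a backend stitch the spans from separate services into one trace when the button click ends in a 500.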


Ah, so it spans applications and services. That would be cool.