Low-level hook tool and pebble command inspection with `jhack replay`

Recently I thought it would be useful to see which hook tool and pebble calls a charm makes throughout its lifecycle, as it handles the events Juju sends to it.

I was about to start hacking together a new little jhack script, but it turns out I’d had this idea already, and not even that long ago. That is, a jhack tool for this purpose already exists (although it does a bit more than just recording the facts).

jhack replay

jhack replay is a little-known tool, and for good reason, as it’s :bomb: incredibly destructive :bomb: and should probably never be used in anything remotely close to production. But for your local dev/staging/testing setup, it’s :star2: fantastic :star2:.

How to use it

First, you need a model with some applications in it.

Next, you need to run:

jhack replay install myapp/0

For example:

[screenshot: output of `jhack replay install`]

Note the ominous output, which gives a good overview of what’s going on. Jhack is:

  • pushing a python module to the unit
  • rewriting the charm source (charm.py), using some obscure ast trickery, to insert a call to that module
  • rewriting several ops modules (pebble.py, model.py), also with ast, to inject more calls to that module every time the charm interacts with pebble or the hook tools

What does that module do? In a nutshell, it memoizes the inputs and outputs of every hook tool and pebble call. So if you want to see what the charm called, and what the response was, you can open a file and read it.
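Conceptually, that boils down to a wrapper around each backend method that runs the real call and appends the arguments and result to a file on the unit. Here is a minimal sketch of the idea, not jhack’s actual code; the file path and serialization format below are made up:

```python
import functools
import json

RECORD_FILE = "/tmp/replay-records.jsonl"  # hypothetical path, not the one jhack uses


def record(fn):
    """Run the wrapped backend method and append its (input, output) pair to a file.

    jhack's real recorder is more elaborate: it groups calls per event, copes with
    values that aren't JSON-serializable, knows about caching policies, and so on.
    """
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        result = fn(self, *args, **kwargs)
        entry = {
            "call": fn.__qualname__,                                 # e.g. "_ModelBackend.is_leader"
            "input": json.dumps([list(args), kwargs], default=str),  # e.g. "[[], {}]"
            "output": json.dumps(result, default=str),               # e.g. "true"
        }
        with open(RECORD_FILE, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return result

    return wrapper
```

The ast rewriting listed above is, roughly speaking, what applies such a wrapper to the relevant hook tool and pebble calls for you at install time.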

From this point onwards, every event the charm receives will generate a new record, and every hook tool call and every pebble interaction (exec, push, pull…) will be attached to it.

How to view what’s been recorded

Type:

jhack replay list myapp/0

to view a full list of all events that have been recorded. For example:

[screenshot: output of `jhack replay list`]

Note the first column of the output: those IDs are assigned by the recorder itself, and we need to use them when interacting with it further.

Type:

jhack replay dump myapp/0 0

to view the captured event with ID 0. This will output a big json data structure that contains:

  • an event data structure containing the full environment the charm was run with, including the hook name.
  • a context data structure, containing a bunch of memos.
    • each memo contains, among other things that are much more ‘meta’ than this post allows, a list of calls.
      • each call represents a single hook tool call or pebble interaction that the charm made while handling that event (a sketch of how to walk this structure follows the list).
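To get a feel for the shape without wading through the raw json, here is a rough sketch of how you might load a dump and walk it from Python. The key names (`event`, `env`, `context`, `memos`, `calls`) are my paraphrase of the structure described above, so double-check them against an actual dump before relying on them:

```python
import json
import subprocess

# Fetch the dump for the event with ID 0 (assumes jhack is on the PATH).
raw = subprocess.run(
    ["jhack", "replay", "dump", "myapp/0", "0"],
    capture_output=True, text=True, check=True,
).stdout
dump = json.loads(raw)

# Key names below are illustrative; inspect an actual dump for the exact schema.
env = dump["event"]["env"]
print("hook:", env.get("JUJU_DISPATCH_PATH"))

for name, memo in dump["context"]["memos"].items():
    print(f"{name}: {len(memo['calls'])} call(s)")
```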

Reading a call

An example of a call recorded on prometheus/0 during an ingress-relation-changed event is:

"_ModelBackend.is_leader": {  
  "calls": [                  
    [                         
      "[[], {}]",             
      "true"                  
    ],                        
    [                         
      "[[], {}]",             
      "true"                  
    ]                         
  ],                          

This means that the is-leader hook tool has been called twice while handling that event, both times without arguments, and both times it has returned "true".

  "_ModelBackend.config_get": {                                                                                
    "calls": {                                                                                                 
      "[[], {}]": "{\"evaluation_interval\": \"1m\", \"log_level\": \"info\", \"maximum_retention_size\": \"80%
\", \"metrics_retention_time\": \"15d\", \"metrics_wal_compression\": false, \"web_external_url\": \"\"}"      
    },                                                                                                         

This means that the config-get hook tool has been called only once, and on that occasion it returned that big json-encoded payload you see there.
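Note that the recorded output is itself a json-encoded string, so if you want to inspect it programmatically you need a second decode. For example, with the payload above:

```python
import json

# The memo stores the hook tool's raw output, which is JSON in its own right:
recorded = (
    '{"evaluation_interval": "1m", "log_level": "info", '
    '"maximum_retention_size": "80%", "metrics_retention_time": "15d", '
    '"metrics_wal_compression": false, "web_external_url": ""}'
)
config = json.loads(recorded)
print(config["metrics_retention_time"])  # -> 15d
```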

Other hook tools and pebble method calls follow the same pattern, and it is usually quite intuitive to map the name of the method called on the operator framework to the name of the underlying hook tool.
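For example, here is a toy charm with comments showing the mapping you’d expect to see in the memos for each ops call (the two mappings shown are the ones from the dump above):

```python
import ops


class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event: ops.ConfigChangedEvent):
        # ops call                           recorded as                  hook tool
        self.unit.is_leader()                # _ModelBackend.is_leader   -> is-leader
        self.model.config.get("log_level")   # _ModelBackend.config_get  -> config-get


if __name__ == "__main__":
    ops.main(MyCharm)
```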


Just out of curiosity, I modified the dispatch script:

JUJU_DISPATCH_PATH="${JUJU_DISPATCH_PATH:-$0}" PYTHONPATH=lib:venv \
-  exec ./src/charm.py
+  exec strace -f --trace execve ./src/charm.py 2>&1 | tee /tmp/execve.log.$$

and… I was quite surprised by the volume of output.

Adding an alertmanager:alerting relation to prometheus generated 6775 lines totaling 563328 bytes of output. Here’s a sample:

[pid 36060] execve("/var/lib/juju/tools/unit-prom-0/relation-ids", ["/var/lib/juju/tools/unit-prom-0/"..., "metrics-endpoint", "--format=json"], 0x20f08b0 /* 43 vars */) = 0
[pid 36065] execve("/var/lib/juju/tools/unit-prom-0/relation-list", ["/var/lib/juju/tools/unit-prom-0/"..., "-r", "3", "--format=json"], 0x20f08b0 /* 43 vars */) = 0
[pid 36070] execve("/var/lib/juju/tools/unit-prom-0/relation-get", ["/var/lib/juju/tools/unit-prom-0/"..., "-r", "3", "-", "am", "--app", "--format=json"], 0x20f08b0 /* 43 vars */) = 0
[pid 36075] execve("/var/lib/juju/tools/unit-prom-0/relation-get", ["/var/lib/juju/tools/unit-prom-0/"..., "-r", "3", "-", "am/0", "--format=json"], 0x20f08b0 /* 43 vars */) = 0
[pid 36080] execve("/var/lib/juju/agents/unit-prom-0/charm/cos-tool-amd64", ["/var/lib/juju/agents/unit-prom-0"..., "validate-config", "/tmp/tmporim3b7x"], 0x20f08b0 /* 43 vars */) = 0
[pid 36087] execve("/var/lib/juju/tools/unit-prom-0/juju-log", ["/var/lib/juju/tools/unit-prom-0/"..., "--log-level", "INFO", "--", "Pushed new configuration"], 0x20f08b0 /* 43 vars */) = 0

Is this something that we’ll be able to do away with once ops is set up for sending tracing to Tempo?

This is partly because the Juju API that ops uses is all environment variables and hook tools, right? If it were an HTTP API, you’d see a similarly large volume of HTTP requests, etc.

I assume it’s also because ops very rarely caches, relying instead on the unit agent to do that (except, weirdly, for secret content). One day, when I find some time, I’d like to compare the performance of caching at least some of these calls, to see whether it actually makes a significant difference.
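One quick-and-dirty way to try that, without touching ops itself, would be to monkey-patch a cache onto a backend method from the charm’s entry point and compare dispatch times. This is purely a sketch of the experiment, leaning on private ops internals, and not something ops does today:

```python
import functools
import time

from ops.model import _ModelBackend  # private API; its location may vary between ops versions


def timed_cache(fn):
    """Cache a backend method for the lifetime of one dispatch and print per-call timings."""
    cached = functools.lru_cache(maxsize=None)(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = cached(*args, **kwargs)
        print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")  # near-zero on cache hits
        return result

    return wrapper


# Patch one hook tool wrapper and compare dispatch duration with and without it.
_ModelBackend.config_get = timed_cache(_ModelBackend.config_get)
```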

In general, I do think there are lots of places where performance can be improved - I had hoped that after the Madrid sprint there would be a centralised collection of profiling data from charms running on production servers, but that doesn’t seem to have come to pass.

> Is this something that we’ll be able to do away with once ops is set up for sending tracing to Tempo?

That depends entirely on what you folks choose to add to the trace! But in principle, that could very well be the case.
