Quick Benchmark: juju-log Seems to be the Cause of Slow Charm Operations!

zicklag · 7 March 2020 01:44

When I first got started with charms, I noticed that reactive charms, and the PostgreSQL charm in particular, seemed to spend a lot of time doing nothing while it logged information about the flags that the framework was setting. I wrote a post about some of my findings:

After writing a new charming framework called Lucky, though, I noticed a similar issue. Charms seemed to spend a whole lot of time doing things that should have taken no time at all. There may still be other factors to consider but I believe that I’ve found the single largest cause of the performance hit: logging.

I’ve had this suspicion for about a month and today I finally did a quick test with a very basic charm. The script looks like this:

# First step: echo 1000 times to a file
to_file_time=$(/usr/bin/time -po /dev/stdout bash -c 'for i in $(seq 1000); do echo "Item $i" >> /tmp/charm-logging1.log; done')

# Next step: echo 1000 times to juju-log
juju_log_time=$(/usr/bin/time -po /dev/stdout bash -c 'for i in $(seq 1000); do juju-log "Item $i"; done')

juju-log "============================================================================================"

juju-log "To file time ( 1000 records ): $to_file_time"
juju-log "Juju log time ( 1000 records ): $juju_log_time"

All I do is log 1000 lines to a file on the host, and compare that to logging 1000 lines to juju-log. The result:

To file time ( 1000 records ): real 0.08 user 0.03 sys 0.04
Juju log time ( 1000 records ): real 149.08 user 111.32 sys 71.33

In other words:

	Total time ( 1000 records )	Average time per log entry
`juju-log`	149 seconds	0.15 seconds
Local file	0.08 seconds	0.00008 seconds

When it comes to just logging, juju-log is 1,863 times slower than just writing to a file!

Obviously some amount of being slower is understandable, but that is extremely slow. That was the cause of PostgreSQL just sitting there spitting stuff into the log when it should have been doing something. It wasn’t the fault of the reactive framework, it was the fault of juju-log.

In my eyes, juju-log should not block the execution of the charm like it does today. I’m not sure exactly how you might design it differently, but one idea is to have a local service to handle logging. juju-log would push messages to this local service, something that would finish very quickly, and then exit immediately to avoid blocking the charm any longer than necessary. The local service would be responsible for queuing and sending the log messages to the Juju Server.

I would really like to get some feedback from Juju developers on this because it is having a large, noticeable impact on real-life charms in the wild. I had a noticeable increase in the speed of the Let’s Encrypt Proxy charm just by removing the verbose logging that I had configured to help with debugging.

With the reactive framework enabling trace logging by default, this will effect almost every major charm in the charm store in a way that directly effects user experience. One of the first things that me and my partner noticed about Juju when we started using it was that it took a long time to spin up a Kubernetes cluster. Yes, it did it and it was amazing, but it took about an hour or more. If we sped up logging it could really help these kinds of situations. I mean, it even seems to have an impact on a simple CodiMD+PostgreSQL charm deployment.

jameinel · 8 March 2020 04:54

That does seem surprising. 150ms for a juju-log request does seem quite slow.
We have talked about doing logging only locally and then having ways of pulling data from logs after the fact/integrating with other logging systems.
I’m curious if you change the default logging config (juju model-config logging-config=“unit=warning”) if that affects deploy times.

A step up would be to force an alternative ‘juju-log’ script into PATH for all charms but that is a bit harder to try.

zicklag · 8 March 2020 16:38

I’ll try that later with the CodiMD + Postgres deployment.

Also, a note about the environment I benchmarked in, I was doing the benchmarking on my local development VM with 4 CPUs and 8GB of RAM, but it was all on the same machine with the LXD coud, so there isn’t any “real” network traffic being made. I imagine that it could be even slower if it had to make a hop over the network.

I don’t know what the implementation looks like, but I would guess that the problem is that every call to Juju log has to cross over to the Juju server, then the Juju server has to write the log entry to the MongoDB database, the database has to write it to disk, and then confirm the write, all before the juju-log command exits, with what seems to be no buffering or queuing.

timClicks · 8 March 2020 16:49

Side note: if juju-log is causing delays, that would explain why reactive charms are slow. They’re very noisy.

zicklag · 9 March 2020 17:46

Exactly, and it is going to effect all charming frameworks, Operator, Lucky, or otherwise that want any info to go to the juju debug-log.

For now I suppose I’m just going to limit the logging and maybe put a feature into Lucky that allows you to disable juju-log for anything below WARNING and have it log everything to /var/log/lucky/unit_name. This is hardly a good workaround, though because it makes it much more difficult to get to the logs: instead of juju debug-log you have to juju ssh unit_name/3; tail -f /var/log/lucky/unit_name_3.log. It would also mean that the log files could be different for every different framework and each charm would have to document where to get extra logging. On top of that you can’t view the logs for more than one unit/app at one time.

The unfortunate part of this is that all of that extra pain is worth it for the performance gains that you get when the charm does work.

jameinel · 11 March 2020 13:15

So as a quick test of this, I did updated my ‘ubuntu-lite’ charm with a simple action:

logging-test:
    additionalProperties: false
    properties:
        count:
            type: integer
            default: 100

$ cat actions/logging-test
#!/usr/bin/env python3

import subprocess
import time


def count():
    output = subprocess.check_output(['action-get', 'count'])
    return int(output.strip())

def output(val):
    subprocess.check_call(['action-set', 'output=' + str(val)])


def log(msg):
    subprocess.check_call(['juju-log', str(msg)])


def main():
    c = count()
    start = time.clock_gettime(time.CLOCK_MONOTONIC)
    last = start
    for i in range(c):
        now = time.clock_gettime(time.CLOCK_MONOTONIC)
        log('%8.3fs %8.3fs count: %d' % (now - start, now-last, i))
        last = now
    now = time.clock_gettime(time.CLOCK_MONOTONIC)
    total = now - start
    output(total / c)


if __name__ == '__main__':
    main()

Then you can just do

$ juju run-action ubuntu-lite/0 --wait logging-test count=100
unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "16"
  results:
    output: "0.06549006290035322"
  status: completed

That puts juju-log around 65ms. Which is still quite a bit slower than the ‘echo >/tmp/out’ variant

unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "18"
  results:
    output: "0.000776869960827753"

It does seem worthwhile to look at more of a pipelining approach, and having a call to juju-log queue up a message to the controller, but not block waiting for the message to be sent (and thus being able to batch up messages).

jameinel · 11 March 2020 13:21

I should also note that I did try setting:

juju model-config logging-config='<root>=INFO:unit=WARNING'

And I did see that you no longer see the INFO messages in juju debug-log, the logging-test itself was still:

$ juju run-action ubuntu-lite/0 --wait logging-test
unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "24"
  results:
    output: "0.06558336023008451"

So the overhead isn’t in the ‘write it do the database’ step, but probably in all the other stages.

jameinel · 11 March 2020 14:00

Given the interest, I tried other commands to see how they responded. An interesting one is ‘config-get’, which actually should be caching the config settings in the Unit agent for the lifetime of the hook/action, so it doesn’t involve a request back to the API Server at all. (It does talk to the controller for the first call, but should be serving the information from the cache for all other calls.) With that I get:

unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "31"
  results:
    config: 'b''foo: foo\n'''
    output: "0.06460508618038148"

That hints to me that all the overhead is in the time from the ‘config-get’ command, talking to the local agent, and dealing with the response, not in talking to the API server at all. In fact, I started adding tracing in ‘jujud’ itself, which shows that it only takes 1ms or so for it to connect to the agent and exit.
I then changed the jujud main script to be:

--- a/cmd/jujud/main.go
+++ b/cmd/jujud/main.go
@@ -304,6 +313,7 @@ func Main(args []string) int {
        case jujunames.JujuIntrospect:
                code = cmd.Main(&introspect.IntrospectCommand{}, ctx, args[1:])
        default:
+                return 1
                code, err = hookToolMain(commandName, ctx, args)
        }
        if err != nil {

So that if it is about to execute a hook tool, it just exits with an error. I then compiled jujud, and just ran it in a loop with a different name (so that it runs to that bit of code).

$ time notjujud

real    0m0.066s
user    0m0.063s
sys     0m0.056s

So that 60ms is not overhead in talking to the Juju Unit agent, nor is it overhead in sending the log message to mongo. It is overhead in ‘starting up the juju-log’ process to do those things.

We should rethink having a single binary if the startup time for that binary is that slow.

jameinel · 11 March 2020 15:58

I filed Bug #1867021 “jujuc commands are slow to start” : Bugs : juju to track this.

zicklag · 11 March 2020 17:36

@jameinel Great investigation! Thanks for looking into this. Very interesting that it boils down to the start time of the binary and in that case all of the Juju tools. It is just juju-log that my charms called more than anything else ( that’s what happens when you set -x in a bash script and log the output ).

At least that seems like a rather simple and not majorly architectural change to Juju if we had to split it out to a separate command.

jameinel · 11 March 2020 18:24

So I poked and poked, and stripped until I got down to truly minimal. It turns out that we had a dependency between the ‘hookToolMain’ and ‘worker/uniter/runner/jujuc’ because we wanted the Request struct that defined how we passed the args to the command.
But that actually means we have lots of init functionality defining types/commands/etc that we never use, because it is all client side.
After finally getting a ‘jujuc’ that is minimized to my liking (8MB), I got the exec time down to 1-5ms as a no-op. Then replacing ‘juju-log’ with that minimal ‘just connect to the socket and talk to it’, the time goes down to:

I then took the stripped down jujuc, scp it to the unit machine, and put it into /var/lib/juju/tools/2.7.4.2-bionic-amd64 and replaced the symlink of juju-log to jujud to a symlink to the new jujuc.

With that change:
$ juju run-action ubuntu-lite/0 --wait logging-test count=100
unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "34"
  results:
    output: "0.005814908350585029"
  status: completed
  timing:
    completed: 2020-03-11 18:20:18 +0000 UTC
    enqueued: 2020-03-11 18:20:14 +0000 UTC
    started: 2020-03-11 18:20:17 +0000 UTC

So that is down to 5-6ms instead of 60ms. Seems worthwhile.

jameinel · 11 March 2020 18:26

IIRC, if you take the history of juju tools, the reason we merged jujuc into jujud and just called it, was because ‘jujuc’ was nearly the size of jujud, and we didn’t want to waste the upload/download bandwidth. From this analysis, it seems the reason jujuc got so big was because we had a bad import that caused us to pull in a lot of stuff we didn’t use.
We do already have a tarball for agent binaries, we can probably go back to shipping with both a jujud and a jujuc and see some pretty massive wins.

zicklag · 11 March 2020 18:28

That’s great. That’s a 10x increase in the speed of every Juju hook tool, without a massive amount of work or development. I think that is a great payoff. Good job!

I think that speed for logging should be reasonable for even rather verbose logging. ~200 lines of logs per second.

jameinel · 11 March 2020 18:29

If you want to try this at home, you could test out my branch here:
https://github.com/jameinel/juju/tree/2.7-minimal-jujuc-1867021

If you go install github.com/juju/juju/cmd/jujuc you can then put it into /var/lib/juju/tools/$VERSION/ and replace the symlinks on the target machine and see what you get.

zicklag · 11 March 2020 18:30

Sweet, I’ll try that if I get the chance.

zicklag · 11 March 2020 18:50

I cloned the repo. What do I do after that? I tried running go install github.com/juju/juju/cmd/jujuc in the repo, but that didn’t work. I’ve never built go before.

can't load package: package github.com/juju/juju/cmd/jujuc: cannot find package "github.com/juju/juju/cmd/jujuc" in any of:
        /snap/go/5365/src/github.com/juju/juju/cmd/jujuc (from $GOROOT)
        /home/vagrant/go/src/github.com/juju/juju/cmd/jujuc (from $GOPATH)

jameinel · 12 March 2020 04:45

Ah, if you haven’t build Juju before, setup isn’t too bad, but it does take a couple of steps.
I take it you installed it to ‘/home/vagrant/go/src/github.com/juju/juju’ ? or where did you clone it? (If you didn’t clone it there, that would be a good place to put it.)
go needs to find things in GOPATH according to their import name, so it likely needs to be there.

jameinel · 12 March 2020 07:17

So I did a few more steps to actually wire in the various bits. You should be able to update my branch (git pull -v) and then just make install to have it build juju normally (pulling in dependencies, etc), and then juju bootstrap will install the updated juju with jujud and jujuc and extract it on the controller, and symlink jujuc instead of jujud for the tools.

If you want to test it on an existing controller, you’d need to do juju upgrade-controller --build-agent which should do the steps (the one caveat is that --build-agent doesn’t strip the binaries that it builds, as it is meant for development testing.)

But with that:

$ juju run-action --wait ubuntu-lite/0 logging-test
unit-ubuntu-lite-0:
  UnitId: ubuntu-lite/0
  id: "3"
  results:
    output: "0.006335625179344788"

zicklag · 12 March 2020 16:54

Ah, I cloned it to ~/git/other/test_juju so that explains that.

Awesome, I’ll try it out and let you know how it goes.