Juju run takes a long time

I have a small model of about 150 nodes. In an HPC context, this is small.

I need to perform some admin tasks for my test, so I thought I would use “juju run” for this.

So I did:

$ time juju run --application hpc --timeout=10m0s 'sudo mkdir -p /scratch; sudo chmod 1777 /scratch'
- Message: action terminated
  UnitId: hpc/49
- Stdout: ""
  UnitId: hpc/98
- Stdout: ""
  UnitId: hpc/99
- Message: action terminated
  UnitId: hpc/54
- Message: action terminated
  UnitId: hpc/55
- Message: action terminated
  UnitId: hpc/86
- Message: action terminated
  UnitId: hpc/59
- Message: action terminated
  UnitId: hpc/89
- Message: action terminated
  UnitId: hpc/74

ERROR timed out waiting for result from: unit hpc/11

real	10m14.241s
user	0m0.959s
sys	0m0.347s

This not only takes a long time (10 minutes before the timeout), it also times out for some units, and I’m not able to easily determine whether these commands succeeded.

I’m curious about your thoughts on how juju will be able to handle a larger environment of a few thousand servers and perhaps some 10,000 units.

I’m not sure yet exactly how juju executes this (in serial or in parallel), and it would be good to get some idea of the progress when running on multiple targets. The current output gives no indication of how many of these commands have completed, how many are executing, how many are waiting, etc.

I’m using juju 2.8.1


It certainly looks like there’s lots of opportunity to improve. Each agent should be able to execute that command completely in parallel.

How long does something like time juju run --application hpc 'hostname' take? Is it possible that the filesystem underneath the units is the bottleneck here? I’ve seen problems with distributed file systems before when their metadata servers become overloaded. I doubt that’s the problem here, but it may be worthwhile to eliminate.


It’s a local file system on each server, so this command should return in under a second.

I think the command is perhaps waiting on an offline agent or similar, which would be normal in a large cluster… but I’m not sure.

Something to keep in mind is that only one Juju agent daemon (whether an application jujud-unit-myapp-X or a machine jujud-machine-X) on any given juju machine/container can be running a hook at any one time. Juju run commands are treated as hooks and will wait for the machine lock before executing, even for a command as simple as this one.

One thing I can recommend to determine if this is the issue is connecting to the units that are not returning and running “juju_machine_lock” from the command line. That will show whether a long-running hook is holding the machine lock hostage, so you can then investigate what that unit/hook is stuck on.

If you know the command should run in ~30 seconds or less, you could definitely shorten your juju run timeout to something like '--timeout 60s' in your juju run arguments. This will let the units that are not held captive by machine locks or dead machines time out sooner than the 10-minute default.

The point about knowing which have succeeded or failed will just require an after-run audit command to determine if the actions were successful by taking inventory of the results.
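For instance, here is a rough sketch of such an audit in shell (a toy example: the sample file mirrors the output shape shown in the original post, where you would normally save the real “juju run” output to a file instead):

```shell
# Sample results mirroring the output shape from the original post;
# in practice: juju run --application hpc '...' > results.yaml
cat > results.yaml <<'EOF'
- Message: action terminated
  UnitId: hpc/49
- Stdout: ""
  UnitId: hpc/98
- Message: action terminated
  UnitId: hpc/54
EOF

# List the units whose run timed out ("action terminated"),
# i.e. the ones that may need a re-run.
awk '/Message: action terminated/ {bad=1; next}
     /UnitId:/ {if (bad) print $2; bad=0}' results.yaml
```

For this sample it prints hpc/49 and hpc/54, the units to investigate or retry.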

The juju unit logs will tell you if your juju run command was ever attempted as you try to determine if this is juju agents not being responsive or if it is indeed an i/o blocking issue.

Hope this helps,
-Drew


Also, since juju run is treated as a hook, it gets put on the queue for the agent to execute. If you run it early enough in the deployment of your model, it could get stuck waiting for the start hook, the install hook, the config-changed hook, and any number of relation-joined/relation-changed hooks. I’d suggest that if you need to run something during deployment, either code it into the charm that requires it, or wait for the model to settle, run the command, and then deploy any other parts of the bundle that depend on that juju run after it settles.


@afreiberger thanks for the advice.

I would however be much helped by some way to see a “per-unit” status while the command executes. That would leave me a lot less in the dark; today the state can only be assessed through a debug session, which will be very difficult in models with 1000+ nodes.

I have experience with “Rocks Clusters”, which executes commands in parallel and outputs results that easily let me know which nodes have completed the command in time and which remain or have somehow failed. This gives me a workflow where I immediately know which nodes need a “re-run” or “fix” applied because they did not complete the execution.

This would absolutely be a killer feature if it were reliable and fast on many, many units/nodes.

Here is how rocks does it: https://cheatography.com/brie/cheat-sheets/rocks-cluster-commands/

… and more docs on that: https://www.rocksclusters.org/assets/usersguides/roll-documentation/base/6.2/x6396.html

You may be interested in the juju commands “show-action-status” and “show-action-output” to be able to identify the status of each unit’s run.


I did some looping, sed and awk which feels primitive and wrong but I can’t find a better way atm.

juju status… sed, awk, grep

for i in $(cat machines.txt); do juju ssh $i hostname; done
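As a stopgap, that serial loop can at least be fanned out in parallel with xargs (a sketch: “echo” stands in for “juju ssh” here so the example runs standalone; machines.txt holds one target per line):

```shell
# Build a sample target list; in practice this would come from juju status.
printf '%s\n' 0 1 2 3 > machines.txt

# -P 8 runs up to 8 invocations at once; replace the echo with
# something like: juju ssh {} hostname
xargs -P 8 -I{} sh -c 'echo "machine {}: ok"' < machines.txt
```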

I’ll see if I can find a better way later on. Thanks for helping out.

I’m back at this again. I’m just ending up in a “blocked” state for my “juju run --application foobar 'sudo systemctl restart autofs'”

So, the MAJOR issue for me is that I can’t determine WHICH nodes may or may not have done what I asked for. Ultimately this leaves me in a very dangerous situation, unable to re-run some commands without risking my entire environment.

I don’t have a good way at the moment to manage large sets of machines, which is worrisome…

The “juju show-action-status” command is not applicable since it’s not an action…

Just so you’re aware, in juju 2.7 and later, show-action-status also records juju run invocations as logged “actions”.

Here’s a simple example. What you can’t tell is which command was run, but you could likely use the action status timing to get a good idea of which command it was that you ran during that timeframe if you’re keeping good track of the processes run and timestamps for each.

drew@grimoire:~$ juju switch controller
lxd:admin/default -> lxd:admin/controller
drew@grimoire:~$ juju show-action-status
{}
ERROR no actions found
drew@grimoire:~$ juju run --machine 0 hostname
juju-b8f347-0
drew@grimoire:~$ juju show-action-status
actions:
- action: juju-run
  completed at: "2020-09-08 19:22:58"
  id: f40b71a0-8118-4b50-8d28-f36bf07e0faa
  status: completed
  unit: machine-0
drew@grimoire:~$ juju show-action-output f40b71a0-8118-4b50-8d28-f36bf07e0faa
id: f40b71a0-8118-4b50-8d28-f36bf07e0faa
results:
  Stdout: |
    juju-b8f347-0
status: completed
timing:
  completed: 2020-09-08 19:22:58 +0000 UTC
  enqueued: 2020-09-08 19:22:56 +0000 UTC
  started: 2020-09-08 19:22:58 +0000 UTC

Thanks for looking into this @afreiberger, I appreciate it.

But this process of tracking and inspecting the execution of a juju run is, in its present form, not useful in a large model consisting of hundreds of units. It’s even dangerous, since the workflow leaves an admin such as myself in a position where evaluating the outcome of a run command is difficult, or so slow as to be impossible.

My ‘juju run --application compute hostname’ takes several minutes to complete, where the equivalent using Rocks Cluster takes seconds on several hundred nodes, likely thanks to its parallel execution model.

Parsing the output of the juju run commands is also non-trivial, which further complicates matters.

I know I can get JSON output from the commands (after the 5 or 10 minutes of execution), but piping it into jq also prevents me from using easy standard tools such as sed, awk, grep, etc. for simple tasks. That leaves me with complicated querying in the jq language, which is itself a challenge. Every command becomes a hurdle and does not help at all.

I would like to propose that the juju run would:

  1. Execute in parallel.
  2. Deliver results as soon as a unit has completed the execution, explicitly not waiting for nodes that are stuck or timing out.
  3. Offer an option to return raw output, so it can be used with awk, sed, etc.
  4. Return quickly for nodes that can’t execute the command immediately or within seconds, to allow quick assessment of command results from thousands of units.

This would be a killer feature if implemented with the ambition of letting admins issue commands across many, many units without the problems I highlight in my posts above.

@jamesbeedy can probably fill in here, as he has seen the implications first hand, as has @hallback.


We may have erred in tying ‘juju run’ into the hook serialisation pattern. A command being run this way is no more likely to interfere with a hook execution than a cronjob going off, or an admin using SSH to get to the machine and run the command.

Perhaps juju run should ‘just do it’? Then it could be as fast as any sort of parallel SSH or rocks-cluster method. I would think carefully about the architectural implications, and maybe go back to the source of hook serialisation in the first place, but my gut feeling is that juju run is orthogonal to hook executions and so doesn’t need to be serialised.


My experience with “Rocks Clusters” was that it is extremely attractive, expressive and efficient when working with large numbers of hosts.

Have a look here for reference:

http://galileo.graphycs.cegepsherbrooke.qc.ca/roll-documentation/base/5.4.3/x6003.html

rocks run host [host…] {command} [collate= string ] [command= string ] [delay= string ] [managed= boolean ] [num-threads= string ] [stats= string ] [timeout= string ] [x11= boolean ]

That tool allowed me to work on many hundreds of nodes to determine consistency across machines, execute commands on many nodes, and quickly find nodes with “deltas” or that were in one way or another different from what was expected. I really miss that tool. It was fantastic, and I thought for a long time that “juju run” could replace it well enough. It turns out that it doesn’t.

Now I have to perform a massive amount of looping with ssh, etc. This becomes very cumbersome, if not impossible, when every login takes about 2 seconds × 200 nodes = 400 seconds per command, which is not practically useful.

“juju run” is too dangerous to even execute because of the issues I raised above.

So, perhaps a new command could be introduced

juju run-host

which would rip off the “rocks run host” command model :wink: There is no shame in ctrl-c ctrl-v of others’ great work! Lots of attribution to the people who built that thing.

Just look at the beautiful “delay” parameter. When executing a command in parallel on a large number of hosts, you can basically “DDoS” your own system: for example, if you issue an “id ubuntu” and have an LDAP server that suddenly gets slammed with 2000 sub-second LDAP queries coming from 200 servers. The delay parameter gives the person issuing the commands some help in avoiding massive problems in other parts of the system caused by this very efficient tool. Beautiful.
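The idea behind “delay” can be sketched in plain shell (a toy version: “echo” stands in for the real remote command, and the 0.2-second delay is an arbitrary choice):

```shell
# Staggered fan-out: a small pause between launches so a shared backend
# (e.g. an LDAP server) is not hit by every node at the same instant.
delay=0.2
for host in node1 node2 node3; do
  echo "querying $host" &   # real version: juju ssh "$host" 'id ubuntu' &
  sleep "$delay"
done
wait   # collect all background jobs before moving on
```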

One thing that’s been on the todo list is to enhance the juju run CLI UX to adopt the new Actions V2 workflow.

This provides amongst other things:

  • running the entire set of scripts in the background, grouped as an operation
  • a non-YAML output option which is just stdout/stderr from the script
  • results are recorded as scripts finish
  • better query capability for completed and pending operations

The backend implementation is pretty much done (as it’s already used for Actions V2) - the main effort is some facade work to expose the relevant methods and CLI changes to use the new API calls. Maybe we can get something done soon once other 2.9 work gets finished.


Please take into consideration the features of Rocks Clusters. Those developers got many things right, drawing on long experience with the complexities of running large-scale systems, which is reflected in the command-line options for “rocks run host”.

For example the ‘collate’ option, which adds a header telling you which host returned which result, for easy parsing and comparison of results. Extremely useful when I need to check whether a setting or command returns consistent results across the whole system.


Just wanted to share that I totally agree with @erik-lonroth; I also really miss something similar to the rocks run command. In OpenStack and HPC deployments the models tend to be quite big, and running commands targeting many or all machines or units based on some criteria is part of my daily work.

Many times this doesn’t really have anything to do with the charms deployed at all; instead it is just “unrelated” run-once operational stuff to be carried out or checked on a subset of the model, like:

  • Install package xxx on all units of a specific application
  • Run some command on all non-LXD machines to verify InfiniBand link status

I have usually ended up creating a set of small bash scripts that loop through the output of juju status and then issue commands with juju ssh. It works, is not elegant, and is slow on 100+ units, but tracking progress and results is easy.
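For what it’s worth, the non-LXD filtering step itself is simple (a sketch: the sample id list stands in for machine ids scraped from “juju status”):

```shell
# Sample machine ids; containers look like 2/lxd/3, bare machines are
# plain numbers. In practice the list comes from juju status output.
printf '%s\n' 0 1 1/lxd/0 2 2/lxd/3 > all-machines.txt

# Keep only the non-LXD machines.
grep -v '/lxd/' all-machines.txt
```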
