This not only takes a long time (10 min before timeout) but also times out for some units, and I'm not able to easily determine whether these commands succeeded.
I'm curious about your thoughts on how juju will be able to handle a larger environment of a few thousand servers and perhaps some 10,000 units.
I'm not sure yet exactly how juju executes this, in serial or in parallel, and it would be good to get some idea of how a run across multiple targets is progressing. The current situation gives no indication of how many of these commands have completed, how many are executing, how many are waiting, etc.
It certainly looks like there’s lots of opportunity to improve. Each agent should be able to execute that command completely in parallel.
How long does something like time juju run --application hpc 'hostname' take? Is it possible that the filesystem underneath the units is the bottleneck here? I've seen problems with distributed file systems before when their metadata servers become overloaded. I doubt that's the problem here, but it may be worth ruling out.
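For comparison, something along these lines (assuming Juju 2.x juju run syntax and the hpc application from this thread) can help show whether the time is going into each unit or into the fan-out itself:

```
# Compare a run against the whole application with a run against a single unit.
time juju run --application hpc 'hostname'
time juju run --unit hpc/0 'hostname'
```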
Something to keep in mind is that only one Juju agent daemon (whether a unit agent jujud-unit-myapp-X or a machine agent jujud-machine-X) on any given juju machine/container can be running a hook at any one time. Juju run commands are treated as hooks and will wait for the machine lock before executing, even for a command as simple as this.
One thing I can recommend to determine whether this is the issue is connecting to the units that are not returning and running "juju_machine_lock" from the command line to see whether a long-running hook is holding the machine lock hostage, and then investigating what the unit/hook holding the lock is stuck on.
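For example (hpc/42 below is just a placeholder unit name; juju_machine_lock is part of the agent introspection tooling available on the machine):

```
juju ssh hpc/42      # pick a unit that isn't returning
juju_machine_lock    # run on the machine; shows who holds or is waiting on the hook lock
```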
If you know the command should run in ~30 seconds or less, you could definitely shorten your juju run timeout to something like '--timeout 60s' in your juju run arguments. This will let those units that are held captive by machine locks or dead machines time out sooner than the 10-minute auto-timeout.
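For example, again assuming the hpc application:

```
juju run --application hpc --timeout 60s 'hostname'
```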
As for knowing which units have succeeded or failed, that will just require an after-run audit to take inventory of the results and determine whether the actions were successful.
The juju unit logs will tell you whether your juju run command was ever attempted, which helps you determine whether this is juju agents not being responsive or indeed an I/O blocking issue.
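Something along these lines should show whether the agent ever picked the command up (the unit name is a placeholder, and the exact log messages vary between Juju versions):

```
# Controller-side log stream for one unit:
juju debug-log --replay --include unit-hpc-42
# Or look at the agent log on the machine itself:
juju ssh hpc/42 -- sudo tail -n 200 /var/log/juju/unit-hpc-42.log
```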
Also, because juju run is treated as a hook, it gets piled onto the queue for the agent to execute, which, if you run it early enough in the deployment of your model, means it could get stuck waiting for the start hook, the install hook, the config-changed hook, and any number of relation-joined/relation-changed hooks. I'd suggest that if you need to run something during deployment, it either gets coded into the charm that requires it, or you wait for the model to settle, run the command, and then deploy any other parts of the bundle that depend on that juju run after it settles.
I would, however, be much helped by some way to see a "per-unit" status while the command executes. That would leave me a lot less in the dark, a state that otherwise can only be assessed through a debug session, which is very difficult in models with 1000+ nodes.
I have experience with "Rocks Clusters", which executes commands in parallel and outputs results in a way that easily lets me know which nodes have completed the command in time and which are still running or have somehow failed. This gives me a workflow where I immediately know which nodes need a "re-run" or "fix" applied because they have not completed the execution.
This would absolutely be a killer feature if it were reliable and fast on many, many units/nodes.
I'm back at this again. Just ending up in a "blocked" state for my "juju run --application foobar 'sudo systemctl restart autofs'".
So, the MAJOR issue for me is that I can't now determine WHICH nodes may or may not have done what I've asked for. Ultimately this leaves me in a very dangerous situation, unable to re-run some commands without risking my entire environment.
I don't have a good way at the moment to manage large sets of machines, which is worrisome…
The "juju show-action-status" is not applicable since it's not an action…
Just so you’re aware, in juju 2.7 and later, show-action-status does also reflect juju run as logged “actions”.
Here’s a simple example. What you can’t tell is which command was run, but you could likely use the action status timing to get a good idea of which command it was that you ran during that timeframe if you’re keeping good track of the processes run and timestamps for each.
Thanks for looking into this @afreiberger, I appreciate it.
But this process of tracking and inspecting the execution of a juju run is, in its present form, not useful in a large model consisting of hundreds of units. It's even dangerous, since the workflow would leave an admin such as myself in a position where evaluating the outcome of a run command would be difficult and/or so slow that it would not be possible.
My 'juju run --application compute hostname' takes several minutes to complete, whereas the equivalent using Rocks Clusters takes seconds on several hundred nodes, possibly thanks to its parallel execution model.
Parsing the output of the juju run commands is also non-trivial, which further complicates matters.
I know I can get JSON output from the commands (after the 5 or 10 minutes of execution), but piping it into jq also keeps me from using simple standard tools such as sed, awk, grep, etc. for easy things, leaving me with complicated querying in the jq language, which is in itself a challenge. Each command becomes a hurdle, which does not help at all.
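To give a feel for the kind of jq gymnastics involved, flattening the JSON back into something awk/grep-friendly looks roughly like this (the field names UnitId, ReturnCode and Stdout are from memory of the 2.x output and may differ between versions):

```
juju run --application compute --format=json 'hostname' \
  | jq -r '.[] | [.UnitId, ((.ReturnCode // 0) | tostring), ((.Stdout // "") | rtrimstr("\n"))] | @tsv' \
  | awk -F'\t' '$2 != 0 {print $1, "failed"}'
```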
I would like to propose that juju run would:
Execute in parallel
Deliver results as soon as a unit has completed the execution, explicitly not waiting for nodes that are stuck or timing out.
Have an option to return raw output, so it can be used with awk, sed, etc.
Return quickly for nodes that won't be able to execute the command immediately or within seconds, to allow quick assessment of command results from thousands of units.
This would be a killer feature if implemented with the ambition of letting admins issue commands across many, many units without the problems I highlight in my posts above.
@jamesbeedy can probably fill in here, as he has seen the implications first hand, as has @hallback.
We may have erred in tying ‘juju run’ into the hook serialisation pattern. A command being run this way is no more likely to interfere with a hook execution than a cronjob going off, or an admin using SSH to get to the machine and run the command.
Perhaps juju run should 'just do it'? Then it could be as fast as any sort of parallel SSH or rocks-cluster method. I would think carefully about the architectural implications, maybe go back to the source of hook serialisation in the first place, but my gut feel is that juju run is orthogonal to hook executions and so doesn't need to be serialized.
That tool allowed me to work on many hundreds of nodes to determine consistency across machines, execute commands on many nodes, and quickly find nodes with "deltas" or that were in some way or other different from what was expected. I really miss that tool. It was fantastic, and I thought for a long time that "juju run" would be the tool that could replace it well enough. It turns out that it doesn't.
Now I have to perform a massive amount of looping with ssh, etc. This becomes very cumbersome, if not impossible, when every login takes about 2 seconds × 200 nodes = 400 seconds per command, which is not practically useful.
“juju run” is too dangerous to even execute because of the issues I raised above.
So, perhaps a new command could be introduced
juju run-host
which would rip off the "rocks run host" command model. There is no shame in ctrl-c ctrl-v of others' great work! Lots of attribution to those guys who worked up that thing.
Just look at the beautiful "delay" parameter. When executing a command in parallel on a large number of hosts, you can easily "DDoS" your own system: for example, if you issue an "id ubuntu" and have an LDAP server that suddenly gets slammed with 2000 sub-second LDAP queries coming from 200 servers. The delay parameter gives the person issuing the commands some help in avoiding massive problems in other parts of the system caused by this very efficient tool. Beautiful.
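In juju terms today, the closest crude approximation of that delay I can think of is to stagger your own fan-out, something like this (the foobar application and the 0.5 s delay are just examples):

```
# Launch the command on each unit in the background, half a second apart,
# so a backing service (e.g. LDAP) isn't hit by every node at the same instant.
for u in $(juju status foobar --format=json | jq -r '.applications.foobar.units | keys[]'); do
  juju ssh "$u" -- id ubuntu > "out.${u//\//-}" 2>&1 &
  sleep 0.5
done
wait
```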
One thing that’s been on the todo list is to enhance the juju run CLI UX to adopt the new Actions V2 workflow.
This provides amongst other things:
running the entire set of scripts in the background, grouped as an operation
a non-YAML output option which is just the stdout/stderr from the script
results are recorded as scripts finish
better query capability for completed and pending operations
The backend implementation is pretty much done (as it’s already used for Actions V2) - the main effort is some facade work to expose the relevant methods and CLI changes to use the new API calls. Maybe we can get something done soon once other 2.9 work gets finished.
Please take into consideration the stuff from Rocks Clusters. Those developers did get many things right, from long-time experience with the complexities of running on large-scale systems, which is reflected in the command-line options for "rocks run host".
For example the 'collate' option, which adds a header showing which host returned which result, for easy parsing and comparison of results.
Extremely useful when I need to check whether a setting or command returns consistent results across the whole system, etc.
Just wanted to share that I totally agree with @erik-lonroth; I also really miss something similar to the rocks run command. In OpenStack and HPC deployments the models tend to be quite big, but running commands targeting many or all machines or units based on some criteria is still part of my daily work.
Many times this doesn't really have anything to do with the deployed charms at all; instead it is just "unrelated" run-once operational stuff to be carried out or checked on a subset of the model, like:
Install package xxx on all units of a specific application
Run some command on all non-LXD machines to verify InfiniBand link status
I have usually ended up creating a set of small bash scripts that loop through the output of juju status and then issue commands with juju ssh, roughly like the sketch below. It works, is not elegant, and is also slow on 100+ units, but tracking progress and results is easy.
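A simplified sketch of that approach (the application name and command are placeholders, and the jq path assumes the current juju status JSON layout):

```
#!/usr/bin/env bash
# Run one command per unit of an application via juju ssh,
# keeping a result file per unit so progress and failures are easy to track.
app=myapp          # placeholder application name
cmd='hostname'     # placeholder command
for u in $(juju status "$app" --format=json \
           | jq -r ".applications[\"$app\"].units | keys[]"); do
  if juju ssh "$u" -- "$cmd" > "result.${u//\//-}" 2>&1; then
    echo "OK   $u"
  else
    echo "FAIL $u"
  fi
done
```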