Upgrade Series Feature Development

externalreality · 7 August 2018 22:43

@veebers, your thinking would be correct but the actual prevention of machine operations during the upgrade process has not yet been implemented.

freyes · 21 August 2018 11:59

Hi,

I have a few questions about upgrade-series.

Will upgrade-series work on a charm that doesn’t implement those hooks?
Will the juju agent take care of running “do-release-upgrade”?

What’s the behavior for a machine that has more than one application running on it? Are the *-series-upgrade hooks executed in a random order?

Thanks,

rick_h · 21 August 2018 12:04

Yes, the feature will work for a charm that does not implement the hooks. The hooks simply won’t be executed; Juju will update the agents and such, then hand the machine to the user to perform the manual do-release-upgrade steps as usual.

The agent will not take care of do-release-upgrade. At the request of the stakeholders, the feature is designed so that the operator handles that step. do-release-upgrade can ask questions about config file changes and more, and automating that does not seem smooth enough to rely on Juju to JFDI. It can be scripted with juju run --machine, so if it turns out that it can be done smoothly, the operator can script that part away.
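As a rough illustration of the scripting option, here is a small Python sketch that only builds the command line and does not run anything. The helper names are hypothetical, and the non-interactive frontend flag passed to do-release-upgrade is an assumption to verify on your Ubuntu release; only juju run --machine comes from the discussion above.

```python
import shlex

# Assumption: do-release-upgrade accepts a non-interactive frontend via -f.
# Check this on the target release before relying on it.
DEFAULT_UPGRADE_CMD = "do-release-upgrade -f DistUpgradeViewNonInteractive"


def release_upgrade_command(machine_id, upgrade_cmd=DEFAULT_UPGRADE_CMD):
    """Build (but do not run) the juju invocation for one machine."""
    return ["juju", "run", "--machine", str(machine_id), upgrade_cmd]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(as_shell(release_upgrade_command(3)))
```

Printing the command first, rather than executing it, makes it easy to review what would run on each machine before handing it to subprocess or a shell loop.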

When there’s more than one unit per machine, the hooks will be triggered asynchronously and waited on. In the final version of the feature the command will be interactive, and you’ll see each step progress, including the status of each unit’s hook execution. Once the hooks complete successfully and Juju is ready, it hands control to the user to perform any manual/scripted steps with juju run, and then the user hands control back to Juju with the upgrade-series complete command.
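The "trigger asynchronously, then wait" behaviour can be pictured with a minimal Python model. This is not Juju's implementation; run_hook is a hypothetical callable standing in for a unit's pre-series-upgrade hook, and the function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_hooks_and_wait(units, run_hook):
    """Start run_hook(unit) for every unit at once, then block until all finish.

    Returns a dict mapping unit name to True (hook returned) or False (hook raised).
    """
    if not units:
        return {}
    results = {}
    with ThreadPoolExecutor(max_workers=len(units)) as pool:
        # Submit everything first so the hooks run concurrently.
        futures = {unit: pool.submit(run_hook, unit) for unit in units}
        # Then wait on each one and record whether it succeeded.
        for unit, future in futures.items():
            try:
                future.result()
                results[unit] = True
            except Exception:
                results[unit] = False
    return results


def ready_for_manual_steps(results):
    """Control passes to the operator only if every unit's hook succeeded."""
    return all(results.values())
```

In this model the order in which hooks finish does not matter; what matters is that none of them failed before the operator takes over.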

rick_h · 28 August 2018 12:13

Heads up that @externalreality landed a change today that cleans up the lock file after you run complete, so you can rerun the prepare/complete process additional times. This only applies if things go smoothly. Since you don’t currently need to actually change the series, this allows for quick triggering of the pre/post hooks for the charms.

This should be in the devel snap in a couple of hours.

gnuoy · 25 September 2018 16:55

The reactive charms based on layer-basic are currently failing after machine reboot (when the juju agent should still be disabled) due to the bug where the config-changed hook fires prior to post-series-upgrade. Is there any update on the juju fix to stop hooks firing on reboot?

rick_h · 25 September 2018 18:06

Our understanding is that this was corrected from this branch during the sprint.

https://github.com/juju/juju/commit/6aff096f2af23742a6cbf47acbeebe52f85f805f

It’s part of the latest dev snap. Can you verify which snap was used, that a fresh bootstrap was used, and that you’re still hitting the issue?

Thanks

jameinel · 26 September 2018 12:52

I’m not sure what you mean by “any interactions”.
It is intended that the machine goes into a “manual” mode at that point, and that unit-agents on the machine stop engaging (they don’t notice config/relation changes, etc.).

However, since the intent is that the Operator is then responsible for doing any actual upgrade steps, things like “juju run --machine X” should still work, as should ‘juju ssh X’ etc.

thedac · 27 September 2018 17:24

Update:

Liam and I independently are seeing the following using the most up to date edge snap version:

After reboot, hooks are held off from firing, as desired.

However, after the juju upgrade-series complete command, the first hook to execute is not guaranteed to be post-series-upgrade. We consistently see leader-settings-changed and config-changed execute before post-series-upgrade.

This is most significant for reactive charms which need the opportunity to re-create their virtual python environments. When any hook other than post-series-upgrade executes it will fail due to the venv being out of date.

rick_h · 27 September 2018 18:35

Thanks, we’re looking into it and will get an update out as soon as we can.

thedac · 28 September 2018 15:41

One more issue I see occasionally is: txn-queue for $MACHINE_ID in “machines” has too many transactions (1001)

Example:

machine-10 complete phase started
machine-10 starting all unit agents after series upgrade
ceph-osd/0 post-series-upgrade hook running
ceph-osd/0 post-series-upgrade completed
neutron-openvswitch/2 post-series-upgrade hook running
neutron-openvswitch/2 post-series-upgrade completed
nova-compute/0 post-series-upgrade hook running
ERROR txn-queue for 4fe90b0a-d454-46c1-8579-cfb2d6e11476:10 in "machines" has too many transactions (1001)

Once that occurs no operations against that machine will work including removing it by force:

juju remove-machine 10 --force

removing machine 10 failed: failed to run transaction: []txn.Op{
    {
        C:      "machines",
        Id:     "4fe90b0a-d454-46c1-8579-cfb2d6e11476:10",
        Assert: bson.D{
            {
                Name:  "jobs",
                Value: bson.D{
                    {
                        Name:  "$nin",
                        Value: []state.MachineJob{2},
                    },
                },
            },
        },
        Insert: nil,
        Update: nil,
        Remove: false,
    },
    {
        C:      "cleanups",
        Id:     "5bae4b0d16abcc10eec88b57",
        Assert: nil,
        Insert: &state.cleanupDoc{
            DocID:  "5bae4b0d16abcc10eec88b57",
            Kind:   "machine",
            Prefix: "10",
            Args:   nil,
        },
        Update: nil,
        Remove: false,
    },
}: txn-queue for 4fe90b0a-d454-46c1-8579-cfb2d6e11476:10 in "machines" has too many transactions (1001)

thedac · 28 September 2018 18:36

I am trying to be as responsive as possible. Let me know if this is too much noise.

Just tested 2.5-beta1+develop-03d5fc8 and we seem to have regressed. Note that after the reboot, but before the post-series-upgrade hook, leader-settings-changed and config-changed executed.

28 Sep 2018 11:03:03-07:00 juju-unit executing running pre-series-upgrade hook
28 Sep 2018 11:13:00-07:00 juju-unit idle <---- *** Reboot happened here ***
28 Sep 2018 11:13:19-07:00 juju-unit executing running leader-settings-changed hook
28 Sep 2018 11:13:29-07:00 juju-unit executing running config-changed hook
28 Sep 2018 11:13:38-07:00 workload blocked Ready for do-release-upgrade and reboot. Set complete when finished.
28 Sep 2018 11:13:38-07:00 juju-unit executing running post-series-upgrade hook
28 Sep 2018 11:13:44-07:00 juju-unit idle
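A status history like the one above can be checked mechanically. The following Python sketch is only an illustration: it assumes hook lines contain the text "running <name> hook" exactly as in this excerpt, and the function names are hypothetical.

```python
import re

# Matches the hook name in lines such as "... executing running config-changed hook".
HOOK_RE = re.compile(r"running (\S+) hook")

# Copied from the status history in this post.
SAMPLE = [
    "28 Sep 2018 11:03:03-07:00 juju-unit executing running pre-series-upgrade hook",
    "28 Sep 2018 11:13:00-07:00 juju-unit idle <---- *** Reboot happened here ***",
    "28 Sep 2018 11:13:19-07:00 juju-unit executing running leader-settings-changed hook",
    "28 Sep 2018 11:13:29-07:00 juju-unit executing running config-changed hook",
    "28 Sep 2018 11:13:38-07:00 workload blocked Ready for do-release-upgrade and reboot. Set complete when finished.",
    "28 Sep 2018 11:13:38-07:00 juju-unit executing running post-series-upgrade hook",
    "28 Sep 2018 11:13:44-07:00 juju-unit idle",
]


def hooks_in_order(log_lines):
    """Return the hook names in the order they appear in the log."""
    hooks = []
    for line in log_lines:
        match = HOOK_RE.search(line)
        if match:
            hooks.append(match.group(1))
    return hooks


def hooks_before_post_upgrade(log_lines):
    """Return hooks that ran after pre-series-upgrade but before post-series-upgrade."""
    hooks = hooks_in_order(log_lines)
    if "pre-series-upgrade" not in hooks:
        return []
    after = hooks[hooks.index("pre-series-upgrade") + 1:]
    if "post-series-upgrade" in after:
        after = after[:after.index("post-series-upgrade")]
    return after


if __name__ == "__main__":
    print(hooks_before_post_upgrade(SAMPLE))
```

An empty result is the desired outcome; any hook listed ran while the charm's environment may still have been out of date.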

rick_h · 28 September 2018 18:47

Appreciate it. The change that was trying to land hung up in landing and only now hit trunk. I’m working on getting the snap builds to be manually forced through. I shouldn’t have reached out about the PR with the fix until the snap had been built.

thedac · 28 September 2018 23:42

Initial tests with the newest snap look promising. More data to come.

jameinel · 29 September 2018 04:28

That generally means a txn is broken, preventing other txns from being run. Usually you need to run “mgopurge” with the controllers shut down in order to fix the broken txn before you can do other changes to the document.

It would be good to understand what other txns are listed in the machine record, in case something is wrong with the new code causing the corruption.

thedac · 5 October 2018 20:38

Saw the too many transactions problem again today on today’s snap.

juju remove-machine --force 12
removing machine 12 failed: failed to run transaction: []txn.Op{
    {
        C:      "machines",
        Id:     "2b82c210-cb6f-4f9b-8552-b234fc707068:12",
        Assert: bson.D{
            {
                Name:  "jobs",
                Value: bson.D{
                    {
                        Name:  "$nin",
                        Value: []state.MachineJob{2},
                    },
                },
            },
        },
        Insert: nil,
        Update: nil,
        Remove: false,
    },
    {
        C:      "cleanups",
        Id:     "5bb7cb2debf2fa10c9816832",
        Assert: nil,
        Insert: &state.cleanupDoc{
            DocID:  "5bb7cb2debf2fa10c9816832",
            Kind:   "machine",
            Prefix: "12",
            Args:   nil,
        },
        Update: nil,
        Remove: false,
    },
}: txn-queue for 2b82c210-cb6f-4f9b-8552-b234fc707068:12 in "machines" has too many transactions (1001)

jameinel · 7 October 2018 13:59

Is it possible to get access to the controller? While we know which txn was the “one too many”, it would be good to know what the other 1000 txns on that machine were. Likely something else is broken and this is just the visible fallout.

manadart · 8 October 2018 14:18

The following changes are landed in edge and should be available in the latest Snap build shortly:

The unit agents are no longer shut down prior to re-writing their service files.
Prior to commencing the prepare phase, applications represented by units on the machine have their leadership “frozen”.
After the complete phase has run, applications have their leadership “unfrozen”.

manadart · 29 October 2018 13:48

Some bug-fixes and enhancements are available in the edge Snap:

The “too many transactions” issue has been addressed by changes both to Mongo transaction handling and to logic around the upgrade-series machine lock.
Previously, if two machines were prepared, and those machines shared units of a common application, completing one machine would unpin leadership even though another prepared machine still had units of the same application. This has been rectified: normal leadership expiry for any given application is only restored when the last vested machine asks to unpin leadership.
Pinning and unpinning application leaders no longer happens in the client. The responsibility is now handled by the upgrade-series worker on the machine. This means that CLI feedback for each pin/unpin operation no longer occurs and is in the machine logs instead. A list of applications that will be pinned is reported when running the prepare command.
CLI feedback wording is changed slightly, and there is no message regarding units and pinned applications when the machine has no units running.
Upgrade-series commands are prevented from running on controllers.
It is no longer necessary to set the “upgrade-series” feature flag in order to use this functionality.
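The leadership rule described above (expiry is restored only when the last vested machine unpins) behaves like a per-application set of pin holders. Here is a minimal Python model of that idea; it is a sketch, not the worker's actual code, and the class and method names are hypothetical.

```python
from collections import defaultdict


class LeadershipPins:
    """Track which machines hold a leadership pin for each application."""

    def __init__(self):
        # application name -> set of machine ids that asked for the pin
        self._pins = defaultdict(set)

    def pin(self, application, machine):
        self._pins[application].add(machine)

    def unpin(self, application, machine):
        """Drop this machine's pin.

        Returns True when no machine holds a pin any more, meaning normal
        leadership expiry applies again; False while other machines remain.
        """
        holders = self._pins.get(application)
        if holders is None:
            return True
        holders.discard(machine)
        if not holders:
            del self._pins[application]
            return True
        return False

    def is_pinned(self, application):
        return application in self._pins
```

With machines 10 and 12 both prepared and both hosting ceph-osd units, completing machine 10 leaves the application pinned; only completing machine 12 as well releases it.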

jameinel · 1 November 2018 04:07

Thanks for the update. It’s good to keep visibility on what we’ve been working on.

manadart · 2 November 2018 16:10

Note: The order of upgrade-series command arguments will change in a patch landing imminently.

The sub-command and machine arguments are now swapped, so what was:

juju upgrade-series prepare 3 bionic

will become:

juju upgrade-series 3 prepare bionic

Likewise for completion:

juju upgrade-series complete 3

becomes:

juju upgrade-series 3 complete
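For anyone with existing wrapper scripts, the swap can be expressed as a small helper. This Python sketch only rearranges the arguments that follow "juju upgrade-series"; the function name is hypothetical and it assumes the sub-commands are exactly prepare and complete, as in the examples above.

```python
SUBCOMMANDS = ("prepare", "complete")


def to_new_order(args):
    """Convert old-style arguments to the new order.

    ['prepare', '3', 'bionic'] becomes ['3', 'prepare', 'bionic'].
    Arguments that do not start with a sub-command are returned unchanged.
    """
    if len(args) >= 2 and args[0] in SUBCOMMANDS:
        return [args[1], args[0]] + list(args[2:])
    return list(args)


if __name__ == "__main__":
    print(["juju", "upgrade-series"] + to_new_order(["prepare", "3", "bionic"]))
```

Because arguments already in the new order pass through untouched, the helper can be applied to scripts without first checking which form they use.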