Trouble restoring a controller

knobby · 17 May 2020 21:35

I backed up a 2.6.10 controller and tried to move it to new hardware by restoring it over a newly provisioned controller. The original hardware is failing and has network trouble, so my time is limited. The restore failed, and after manually attempting to fix that half-restore a few times I found the documentation “Manual steps to Restore a Backup” on how to do the restore process by hand. This appeared to work, but I got to the very end and now my API is failing to respond. I only took a single backup, and that was before attempting a restore, so my assumption is that there is a file somewhere, left over from the failed restore attempts, telling juju that a restore is in progress.

I get “ERROR juju restore is in progress - API is disabled to prevent data loss” when I try to contact my controller, though it was responding for a short period and I was able to see the models and machines. I’m on step 26 of the doc, so I’m trying to update my model’s machines to point to the new controller. I feel like if I can just flip the restoring bit and restart the machine service I will be good to go, but I just can’t find it.

timClicks · 17 May 2020 22:00

Any thoughts @babbageclunk? (Hopefully you won’t mind the ping - but I know that this is an area of interest for you at the moment)

knobby · 17 May 2020 22:21

Looking into this “it worked for a while” led me into the mongo db on the host, and I can’t find anything inside db.restoreInfo. So I took a step back and found that the IP for my new controller in ~/.local/share/juju/controllers.yaml keeps getting smashed with the old controller’s IP. I update it and run juju status and things work, but then the IP is changed back to the old controller’s. Did I miss a step somewhere in my restore? Does the controllers.yaml get updated when talking with the server based on an IP in the server response?
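One way to watch which address the client has cached is to print the api-endpoints entry for a single controller before and after running juju status. This is only a minimal sketch: the function name show_api_endpoints is hypothetical, and it assumes controller names sit at two-space indentation under "controllers:" with api-endpoints on a single line, which may not hold for every Juju version.

```shell
# show_api_endpoints FILE NAME (hypothetical helper, not part of juju)
# Print the api-endpoints line recorded for controller NAME in FILE.
# Assumes controller names are indented by two spaces and that
# api-endpoints fits on one line.
show_api_endpoints() {
    awk -v name="$2" '
        # A line indented by exactly two spaces starts a new controller entry.
        /^  [^ ]/ { in_ctrl = ($1 == name ":") }
        # Inside the requested entry, print the endpoints line without indentation.
        in_ctrl && /api-endpoints:/ { sub(/^ +/, ""); print }
    ' "$1"
}
```

For example, show_api_endpoints ~/.local/share/juju/controllers.yaml mycontroller (where mycontroller is a placeholder name) prints the cached endpoints. If the output changes after a juju status, the client has rewritten the entry.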

babbageclunk · 17 May 2020 22:30

Hi @knobby -
It sounds like you tried running juju restore-backup and then tried the manual process on the same target controller, is that right? The error you’re seeing indicates that there’s a document in the restoreInfo collection in the db, which would only be created by the restore command. The manual instructions are a replacement for the restore-backup command; they assume you’re starting with a clean controller.

I think the safest option would be to destroy the new target controller and recreate it so that it’s pristine and working normally when you start the restore steps. Sorry, I realise that’s a hassle! You might be able to delete that restoreInfo document and bounce the controller agents instead - it’s maybe worth trying but my worry would be that there’s some other partially-done state in the database that would cause more issues down the line.

You might have already tried this, but would migrating from the old controller to the new one work? (I know that cross-model relations would prevent this.)

babbageclunk · 17 May 2020 22:34

Sorry, just saw your reply - sounds like you’re way ahead of me!

Do your controllers both have the same name somehow? (Is that a consequence of the restore? I don’t think so, controllers typically don’t know their own name.)

knobby · 17 May 2020 22:36

I originally wanted to migrate, but the new controller can’t reach the old controller over the network due to this hardware issue. I believe the NIC is letting go. I did try to use restore-backup first and then tried to poke it manually after that. If I can’t get this to go soon I will nuke the machine, go straight to the manual restore, and see if I can get it going.

Going back into the DB and looking at the collections ip.addresses and controllers I saw the old and new IP in there. I’ve removed the old again, while juju was running for better or worse, and it seems to be responding to requests. Now I’m working through trying to point the agents at the new controller.

babbageclunk · 17 May 2020 22:40

Ok, that makes sense. Hope it goes smoothly - keep us posted!

(I mean, smoothly from here at least. It doesn’t sound like the route to this point was bump-free!)

knobby · 17 May 2020 23:05

It looks like it did go smoothly after purging the old controller’s IP. I would point out that the lxd containers are not handled by the script:

$ for m in `juju status --format=json | jq -r '.machines | keys | join("\n")'`; do
    echo machine-$m
    juju ssh $m 'cd /var/lib/juju/agents; for a in `ls -d *`; do echo jujud-$a; sudo sed -i "s/apiaddresses:/apiaddresses:\n- 10.210.24.104:17070/" $a/agent.conf; sudo systemctl restart jujud-$a; done'
  done

I was able to just manually jump to my lxd hosts and run the inner for loop over agents and that brought things back around for the lxd containers. I think that I’m out of the woods now as things are showing up green in status and things seem to be working ok.
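The container gap could in principle be closed by feeding container ids into the same loop, since the status JSON nests them under each machine's "containers" key. The sketch below is an illustration under assumptions, not the documented procedure: both function names are hypothetical, list_machine_ids only looks one level deep, and add_api_address assumes the unindented "apiaddresses:" layout that the sed expression above relies on. Unlike that sed, it skips the insert when the address is already present, so rerunning it does not pile up duplicate entries.

```shell
# Hypothetical helpers; neither is part of juju.

# list_machine_ids: read `juju status --format=json` on stdin and print every
# machine id followed by the ids of its containers (for example 0/lxd/0).
list_machine_ids() {
    jq -r '.machines | to_entries[] | .key, ((.value.containers // {}) | keys[])'
}

# add_api_address CONF ADDR: insert "- ADDR" right under "apiaddresses:"
# in CONF unless that exact entry already exists, so reruns are harmless.
add_api_address() {
    conf=$1
    addr=$2
    # -x matches the whole line, -F treats the address as a literal string.
    if grep -qxF -- "- $addr" "$conf"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    status=0
    # Write through a temp file, then copy back so the original file keeps
    # its owner and permissions.
    awk -v addr="$addr" '
        { print }
        $0 == "apiaddresses:" { print "- " addr }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf" || status=1
    rm -f "$tmp"
    return $status
}
```

With these, juju status --format=json | list_machine_ids could replace the keys | join("\n") filter in the outer loop. add_api_address would need to run as root against each agent.conf under /var/lib/juju/agents, followed by the same systemctl restart as before.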

I hate that I brought you guys out on a Sunday to help me and then I ended up stumbling through it myself. That’s always how it works though, you post a question and it gets you thinking enough that you didn’t need to post the question to begin with. My hope is that this will help someone else if they run into issues like this and go down the same path.

I do wonder why the restore-backup command failed though. I wouldn’t expect it to talk back to the original controller to migrate, so I would have expected it to work. Are there any logs that might be of assistance? I didn’t run with --debug on the correct controller unfortunately. Looking back in my history I notice that the --debug runs were actually using the IP of my old controller, which is why it was saying it was doing a restore when my IP pointed me back there. I tried, at one point, to get the new controller back up enough to respond to juju status and then run a restore, but the juju status rewrote the controllers.yaml to point at my old machine, so the restore operation ended up smashing that one instead.

Anyway, I ramble. Thank you both for your help on a weekend restore journey.