Manually Restoring a Backup from an HA Controller
Currently there are a couple of bugs around restoring a backup that was generated from an HA controller. They mostly revolve around the fact that the restored controller needs to change its identity to that of the old controller, which may be a different machine agent.
This is a suitable set of steps for testing out taking a backup that was created from an HA controller and restoring it on a newly bootstrapped controller. The steps are written up for a test environment, but it should be fairly obvious how to adapt them for a production environment. We will likely keep this document around for people who want the manual steps for older controllers, but we should end up incorporating all of these steps into "juju restore-backup" in the future.
Setting up the initial HA controller
- Bootstrap
$ juju bootstrap --no-gui lxd lxd-original --bootstrap-series=xenial \
--debug --model-default "logging-config=<root>=INFO"
- HA
$ juju enable-ha
- Deploy
$ juju deploy cs:~jameinel/ubuntu-lite
- Create backup
$ juju create-backup -m controller
- Stop the controllers
$ lxc list
$ lxc stop juju-????-{0,1,2}
Explicit manual steps
- Bootstrap a similar controller, note that it needs a different name:
$ juju bootstrap --no-gui lxd lxd-backup --bootstrap-series=xenial \
--debug --model-default "logging-config=<root>=INFO"
- Copy the backup to the new machine
$ juju scp -m controller juju-backup-*.tar.gz 0:/tmp
- SSH onto the new machine
$ juju ssh -m controller 0
$$ sudo su -
- Grab a copy of the new agent.conf, look up the state password, and make sure you can connect to mongo.
# cp /var/lib/juju/agents/machine-0/agent.conf /tmp
# grep "statepassword" /tmp/agent.conf
# /usr/lib/juju/mongo3.2/bin/mongo --ssl \
-u `grep tag: /tmp/agent.conf | cut -d' ' -f2` \
-p `grep statepassword: /tmp/agent.conf | cut -d' ' -f2` \
--authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates \
localhost:37017/juju
- Unpack the backup and the root files
# cd /var/lib/juju
# tar xzf /tmp/juju-backup*.tar.gz
# cd juju-backup
# tar xf root.tar
- Stop the new agents, copy the old agent files into their place, and clean up any "raft" folder that the new agent created (only relevant for Juju 2.5+)
# cd /var/lib/juju
# systemctl stop jujud-machine-0
# systemctl stop juju-db
# rm -rf agents raft
# cp -r juju-backup/var/lib/juju/{agents,nonce.txt,server.pem,shared-secret,system-identity} .
- Update agent.conf for the new IP address (and upgradedToVersion, if you are changing the Juju version); see the example snippet below the edit commands.
# ls /var/lib/juju/agents
machine-2
# vi agents/machine-2/agent.conf
# <edit> apiaddresses:
# <edit> upgradedToVersion:
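For reference, the relevant portion of agent.conf might look roughly like this after the edit (the address and version here are illustrative; use your replacement controller's IP and the version you are actually running):
apiaddresses:
- 10.210.24.139:17070
upgradedToVersion: 2.5.4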
- Make sure the tools symlink is created for the restored agent
# ls /var/lib/juju/agents
machine-2
# (cd tools; ln -s 2.5.* machine-2)
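A quick listing is a reasonable sanity check that the symlink landed where the agent expects its tools (the exact version directory will match whatever the backup was running):
# ls -l /var/lib/juju/tools/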
- Change the systemd configuration to point to the restored agent
# cd /lib/systemd/system
# mv jujud-machine-0 jujud-machine-2
# cd jujud-machine-2
# mv jujud-machine-0.service jujud-machine-2.service
- Edit exec_start.sh and jujud-machine-2.service, replacing "machine-0" with "machine-2". Also make sure to change the --machine-id 0 field to --machine-id 2.
# vi exec_start.sh jujud-machine-2.service
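If you prefer a non-interactive edit, a sed one-liner along these lines should cover both files (assuming the only occurrences of "machine-0" and "--machine-id 0" in them are the ones being renamed):
# sed -i -e 's/machine-0/machine-2/g' -e 's/--machine-id 0/--machine-id 2/g' exec_start.sh jujud-machine-2.service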
- Update /etc to point to the new service.
Note: juju restore-backup seems to set up the new service but not get rid of the old one
# cd /etc/systemd/system
# rm jujud-machine-0.service
# ln -s /lib/systemd/system/jujud-machine-2/jujud-machine-2.service
- Check that the service is wired up. It is expected to show Active: inactive (dead); we just want to make sure it is a known service.
# systemctl status jujud-machine-2
● jujud-machine-2.service - juju agent for machine-2
Loaded: loaded (/lib/systemd/system/jujud-machine-2/jujud-machine-2.service; linked; vendor preset: enabled)
Active: inactive (dead)
- Restart mongo, and confirm it is using the right certificate. Read agent.conf and copy the contents of cacert: to a file ca.pem. Then connect to mongo using that file as the sslCAFile along with the old password, and make sure mongo is happy.
Note: we should always create ca.pem rather than just server.pem. It would mean we could redo our scripts to connect to mongo to use --sslCAFile rather than --sslAllowInvalidCertificates.
# cd /var/lib/juju
# systemctl start juju-db
# cp /var/lib/juju/agents/machine-2/agent.conf ca.pem
# edit ca.pem, removing everything except the -----BEGIN to -----END block under the cacert: section; also dedent the certificate.
# cat /var/lib/juju/agents/machine-2/agent.conf \
| sed -n -e '/cacert:/,/ -----END/p; / -----END/q' \
| sed -e 's/^[ ]*//' \
| tail -n+2 > ca.pem
# /usr/lib/juju/mongo3.2/bin/mongo --ssl \
-u `grep tag: /tmp/agent.conf | cut -d' ' -f2` \
-p `grep statepassword: /tmp/agent.conf | cut -d' ' -f2` \
--authenticationDatabase admin --sslCAFile ./ca.pem \
localhost:37017/juju
- Get the new InstanceID for the newly bootstrapped machine. You will use this later when we update the information about the restored controller. Note that the _id here is the <model-uuid>:<machine-id> of the newly bootstrapped controller.
juju:PRIMARY> db.machines.find({"jobs": 2}, {_id: 1})
juju:PRIMARY> db.instanceData.find({_id: "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:0"}, {"instanceid": 1})
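The output should look something like the following (values are from this walkthrough; yours will differ):
{ "_id" : "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:0" }
{ "_id" : "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:0", "instanceid" : "juju-c6a309-0" }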
- Restore the mongo dump using the replacement credentials
# /usr/lib/juju/mongo3.2/bin/mongorestore --ssl \
-u `grep tag: /tmp/agent.conf | cut -d' ' -f2` \
-p `grep statepassword: /tmp/agent.conf | cut -d' ' -f2` \
--authenticationDatabase admin --sslAllowInvalidCertificates \
--drop --oplogReplay --host localhost:37017 --batchSize 10 \
/var/lib/juju/juju-backup/dump
Note: I tried to use --sslCAFile ca.pem, but I consistently got:
morebuf={pc:0x671a6d sp:0xc820053640 lr:0x0}
sched={pc:0x6677e0 sp:0xc820053638 lr:0x0 ctxt:0x0}
runtime: gp=0xc8200de600, gp->status=0x4
runtime: split stack overflow: 0xc820053630 < 0x7f1b1c014200
fatal error: runtime: split stack overflow
Even though that line worked fine with "mongo" itself.
-
Connect as the replacement agent, to recreate the restored agent's password in the DB. You can crib the roles for the new user from the existing user by looking at db.getUsers().
Note that if the backup uses the same agent username as the newly bootstrapped controller (machine-0), then you'll want to just update the password rather than creating the new user and deleting the old one (see https://docs.mongodb.com/manual/reference/method/db.changeUserPassword/ and the example after the createUser output below).
Note: juju restore-backup seems to get this wrong. It creates the user, but doesn't give it any roles
# grep statepassword /var/lib/juju/agents/machine-2/agent.conf
# /usr/lib/juju/mongo3.2/bin/mongo --ssl \
-u `grep tag: /tmp/agent.conf | cut -d' ' -f2` \
-p `grep statepassword: /tmp/agent.conf | cut -d' ' -f2` \
--authenticationDatabase admin --sslCAFile ./ca.pem \
localhost:37017/juju
juju:PRIMARY> use admin
juju:PRIMARY> db.getUsers()
juju:PRIMARY> db.createUser({
user: "machine-2",
pwd: "$PASSWORD_FROM_RESTORED_AGENT_CONF",
roles: [{
"role" : "dbAdminAnyDatabase",
"db" : "admin"
}, {
"role" : "userAdminAnyDatabase",
"db" : "admin"
}, {
"role" : "clusterAdmin",
"db" : "admin"
}, {
"role" : "readWriteAnyDatabase",
"db" : "admin"
}]})
Successfully added user: {
...
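If instead the restored agent reuses the machine-0 username, a password update along these lines (rather than createUser/dropUser) should be all that is needed; the placeholder is the statepassword from the restored agent.conf:
juju:PRIMARY> db.changeUserPassword("machine-0", "$PASSWORD_FROM_RESTORED_AGENT_CONF")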
- Exit out and make sure you can connect as the new database user, then remove the rebootstrapped user from the database. Note: you won't do this step if the username is reused.
# /usr/lib/juju/mongo3.2/bin/mongo --ssl \
-u `grep tag: /var/lib/juju/agents/machine*/agent.conf | cut -d' ' -f2` \
-p `grep statepassword: /var/lib/juju/agents/machine*/agent.conf | cut -d' ' -f2` \
--authenticationDatabase admin --sslCAFile ./ca.pem \
localhost:37017/juju
juju:PRIMARY> use admin
juju:PRIMARY> db.dropUser("machine-0", {"w": "majority"})
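As a sanity check, db.getUsers() should now list the restored agent's user (machine-2 in this walkthrough) and no machine-0 entry:
juju:PRIMARY> db.getUsers()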
- Update rs.conf so that we properly flag the identity of this machine agent.
You can use the output of rs.conf() to build the value passed to rs.reconfig().
You only really need to change the juju-machine-id tag of the member with "_id": 1, so that the
peergrouper won't think it is a different machine it doesn't know about.
Note: juju restore-backup does not do this
juju:PRIMARY> rs.conf()
{
"_id" : "juju",
"version" : 1,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 1,
"host" : "10.210.24.139:37017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
"juju-machine-id" : "0"
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5c9caaa7dd578d49f351d0e9")
}
}
juju:PRIMARY> rs.reconfig({
"_id" : "juju",
"version" : 1,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 1,
"host" : "10.210.24.139:37017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
"juju-machine-id" : "2"
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5c9caaa7dd578d49f351d0e9")
}
})
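Equivalently, rather than retyping the whole document, you can fetch the current config, tweak just the tag, and feed it back to rs.reconfig() (a sketch; the members[0] index assumes the single-member replica set shown above):
juju:PRIMARY> cfg = rs.conf()
juju:PRIMARY> cfg.members[0].tags["juju-machine-id"] = "2"
juju:PRIMARY> rs.reconfig(cfg)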
-
Demote the HA controller machines that aren't being restored, so that the peergrouper and raft don't think they should be waiting for other agents to come up before they are happy. Look for machines with job 2 (JobManageModel) that aren't the machine you are restoring. Then clean up the "controllers" document, which records what machines are part of the controller.
Note: juju restore-backup does not do this
juju:PRIMARY> use juju
juju:PRIMARY> db.machines.find({"jobs": 2}).pretty()
juju:PRIMARY> db.machines.find({"jobs": 2}, {"_id": 1})
juju:PRIMARY> db.machines.update({"_id": "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:0"},
{$set: {jobs: [1], hasvote: false}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
juju:PRIMARY> db.machines.update({"_id": "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:1"},
{$set: {jobs: [1], hasvote: false}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
juju:PRIMARY> db.controllers.find({"_id": "e"}).pretty()
juju:PRIMARY> db.controllers.update({"_id": "e"}, {$set: {"machineids": ["2"]}})
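Re-running the job query is a quick way to confirm the demotion took effect; only the machine being restored should still carry JobManageModel (output shown is from this walkthrough):
juju:PRIMARY> db.machines.find({"jobs": 2}, {"_id": 1})
{ "_id" : "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2" }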
-
Clean up the IP addresses for the restored agent. The Machiner is supposed to do this, but doesn't seem to correctly remove IP addresses that used to exist for a machine. Theoretically these should all be determined automatically, so some of these steps may not be necessary. Note that 10.210.24.139 is the IP of the replacement controller. First update the controllers collection for apiHostPorts and apiHostPortsForAgents (these should get updated automatically when the controller starts up and sees that machine-0/1 are no longer listed as controller machines). Also update the "e" (environ) document to list only the new controller, as done in the previous step.
Note: juju restore-backup does not do this, but it shouldn't need to be done
juju:PRIMARY> db.controllers.find({"_id": "apiHostPorts"}).pretty()
juju:PRIMARY> db.controllers.update({"_id": "apiHostPorts"},
{$set: {"apihostports": [{
"value" : "10.210.24.139",
"addresstype" : "ipv4",
"networkscope" : "local-cloud",
"port" : 17070
}, {
"value" : "127.0.0.1",
"addresstype" : "ipv4",
"networkscope" : "local-machine",
"port" : 17070
}, {
"value" : "::1",
"addresstype" : "ipv6",
"networkscope" : "local-machine",
"port" : 17070
}
]}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
juju:PRIMARY> db.controllers.update({"_id": "apiHostPortsForAgents"},
{$set: {"apihostports": [{
"value" : "10.210.24.139",
"addresstype" : "ipv4",
"networkscope" : "local-cloud",
"port" : 17070
}, {
"value" : "127.0.0.1",
"addresstype" : "ipv4",
"networkscope" : "local-machine",
"port" : 17070
}, {
"value" : "::1",
"addresstype" : "ipv6",
"networkscope" : "local-machine",
"port" : 17070
}
]}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
- Update the instance-id for the restored controller. Set the instanceData document for the machine-id we are restoring to the instance id we looked up earlier.
juju:PRIMARY> db.machines.find({"jobs": 2}, {_id: 1})
juju:PRIMARY> db.instanceData.update({_id: "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2"}, {$set: {"instanceid": "juju-c6a309-0"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
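A quick find should confirm the restored machine's document now carries the replacement instance id (values from this walkthrough):
juju:PRIMARY> db.instanceData.find({_id: "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2"}, {"instanceid": 1})
{ "_id" : "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2", "instanceid" : "juju-c6a309-0" }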
-
Next update the machine record with its provider addresses (addresses), the addresses found by probing the machine directly (machineaddresses), and the preferred public and private addresses. You can do a find() first, to make sure that the IP address to be updated is the first entry in the array (offset ".0"); it should generally be true, though.
Note: juju restore-backup does not do this, but it shouldn't need to be done
juju:PRIMARY> db.machines.find({"jobs": 2}).pretty()
juju:PRIMARY> db.machines.update({"_id": "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2"},
{$set: {
"addresses.0.value": "10.210.24.139",
"machineaddresses.0.value": "10.210.24.139",
"preferredpublicaddress.value": "10.210.24.139",
"preferredprivateaddress.value": "10.210.24.139"
}})
juju:PRIMARY> db.machines.find({"jobs": 2}).pretty()
-
Now clean up the ip.addresses collection. This is the main step that the Machiner gets wrong: it will create the new IP address, but it doesn't remove the old addresses (which then confuses the peergrouper, because it sees multiple possible addresses for setting up the peer group). Note the translation of the ID <model-uuid>:<machine-id> into a regex of the form /^<model-uuid>:m#<machine-id>#.*/. We just delete the IP address that should no longer apply to this machine.
Note: juju restore-backup does not do this, but it shouldn't need to be done
juju:PRIMARY> db.machines.find({"jobs": 2}, {"_id": 1})
{ "_id" : "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:2" }
juju:PRIMARY> db.ip.addresses.find({"_id": /^6cbbd092-0950-43fc-8482-7fdf42ff3cc2:m#2#.*/}).pretty()
juju:PRIMARY> db.ip.addresses.remove({"_id": "6cbbd092-0950-43fc-8482-7fdf42ff3cc2:m#2#d#eth0#ip#10.210.24.146"})
WriteResult({ "nRemoved" : 1 })
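Re-running the regex query should now list only addresses that are actually present on the replacement machine:
juju:PRIMARY> db.ip.addresses.find({"_id": /^6cbbd092-0950-43fc-8482-7fdf42ff3cc2:m#2#.*/}).pretty()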
- You should be done with DB manipulations, and can now start the restored controller.
juju:PRIMARY> exit
# systemctl start jujud-machine-2
# tail -f /var/log/juju/machine*.log
- On your client machine, edit ~/.local/share/juju/controllers.yaml to add the new IP address to the original controller (see the example entry below). You can also unregister the "lxd-backup" controller at this time:
$ vim ~/.local/share/juju/controllers.yaml
$ juju unregister lxd-backup
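The controllers.yaml entry you are editing should look roughly like this (structure from a typical Juju client install; only api-endpoints needs to change, and the other keys are left as-is):
controllers:
  lxd-original:
    api-endpoints: ['10.210.24.139:17070']
    ...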
- At this point, we will need to go to all the workloads and update their agent.conf to point to the new controller. It should be sufficient to just add the new IP address to the list of apiaddresses (once the agent successfully connects, it should then clean out the other addresses from that list).
$ juju switch default
$ for m in `juju status --format=json | jq -r '.machines | keys | join("\n")'`; do
echo machine-$m
juju ssh $m 'cd /var/lib/juju/agents; for a in `ls -d *`; do echo jujud-$a; sudo sed -i "s/apiaddresses:/apiaddresses:\n- 10.210.24.104:17070/" $a/agent.conf; sudo systemctl restart jujud-$a; done'
done
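Once the agents restart they should reconnect to the restored controller; a plain status check from the client is a reasonable way to confirm the machines come back up:
$ juju status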
-
Note: apicertupdater still has a bogus address that needs cleanup; only one of the addresses below is valid.
2019-03-31 08:12:15 INFO juju.worker.apiservercertwatcher manifold.go:182 new certificate addresses: 10.210.24.104, 10.210.24.146, 127.0.0.1, ::1
- Note: the instancepoller won't work correctly without renaming the instance. provider/lxd won't find instances that don't start with the right model prefix: lxd.Provider.AllInstances filters by juju-MODEL-* before building the list of instances. So after rebootstrap the new machine is, e.g., juju-NEWMODEL-0, while the controller being restored came from juju-OLDMODEL-2. lxc rename should be sufficient to get the container into the right model, and then update the database to match. Just setting the instance-id in the controller won't do the right thing; e.g. step ~18 for the lxd provider would be better served by using lxc rename rather than updating the database.