Long story short:
PROBLEM: Juju suddenly and silently stops communicating with the vCenter server. There are no errors in the logs, and it reports “datacenter not found” when I try to upgrade the controller.
WORKAROUND: Solved by updating the credentials on the controller (to the same values as before). The problem affected all models on one controller.
Not sure if this is the correct forum for this, but maybe it will help someone. I still have models with this problem left in my controller if someone wants more info.
Scenario:
Juju version 2.5.1 on the controller and other models in this scenario.
ESXi 6.5, build 11925212
vCenter 6.7.0 build 10244857
Possibly important: We also use Candid.
For full context: @erik-lonroth and I rapidly deployed close to 100 machines with Juju on a vSphere cloud, by deploying about 25 copies of the slurm-core bundle (https://jujucharms.com/u/omnivector/slurm-core/). All deployments were successful, but after we walked through the 25 models and added one more unit of the slurm-node charm to each, the controller seemed to lose all means of communication with the vCenter server.
As a result, we could no longer add units (machines), and we could not delete models or machines. As an example, the following output shows the juju status at some point for a model that I had attempted to destroy. Machines 4-5 have failed to deploy, but machines 0-3 are actually up and running just fine.
$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-u  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  13:57:02+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn
4        pending                  pending        disco
5        pending                  pending        disco
I logged in to the controller and dug around in /var/log/juju, but found nothing that was easy to interpret. There were a lot of TCP connections between the Juju controller and the vCenter server in the TIME_WAIT state. In desperation I restarted the Juju services on the controller and even rebooted the controller machine itself, to no avail.
No errors are visible to me in the vCenter web interface.
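For anyone who wants to check for the same pile-up of TIME_WAIT connections, here is a minimal sketch. It assumes the `ss` tool is available on the controller machine and that you know the vCenter server's IPv4 address; the function name `count_time_wait` and the address 192.0.2.10 are placeholders of mine, not something from our setup.

```shell
#!/bin/sh
# Count TCP connections in TIME_WAIT towards one peer address.
# Reads `ss -tan` style output on stdin (ss prints the state as "TIME-WAIT").
count_time_wait() {
    peer="$1"   # IPv4 address of the peer, e.g. the vCenter server (placeholder)
    awk -v peer="$peer" '
        # Field 1 is the connection state, field 5 is "peer-address:port".
        $1 == "TIME-WAIT" && index($5, peer ":") == 1 { n++ }
        END { print n + 0 }
    '
}

# Usage on the controller machine (192.0.2.10 stands in for the vCenter IP):
#   ss -tan | count_time_wait 192.0.2.10
```

A large and growing number here would at least confirm that the controller keeps opening short-lived connections to vCenter, which is what I saw. It does not by itself explain why the API calls fail.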
As I noticed we were not completely up to date, I tried to upgrade juju. First:
$ juju upgrade-juju
best version:
2.5.4
ERROR model cannot be upgraded to 2.5.4 while the controller is 2.5.1: upgrade 'controller' model first
Ok, fair enough. On the controller model:
$ juju upgrade-juju
best version:
2.5.4
ERROR cannot make API call to provider: datacenter 'Sodertalje-HPC' not found
This is weird, since this is indeed the name of our datacenter in vCenter. I started to suspect something fishy with the credentials after all, even though nothing had changed.
hallback@t1000:~$ juju list-credentials
Cloud Credentials
vmware01-prod johanh*
hallback@t1000:~$ juju show-credential vmware01-prod johanh
controller-credentials:
  vmware01-prod:
    johanh:
      content:
        auth-type: userpass
        user: johanh@domain.company.com
      models: {}
hallback@t1000:~$ juju set-credential -m controller vmware01-prod johanh
Found credential remotely, on the controller. Not looking locally...
Changed cloud credential on model "controller" to "johanh".
hallback@t1000:~$ juju show-credential vmware01-prod johanh
controller-credentials:
  vmware01-prod:
    johanh:
      content:
        auth-type: userpass
        user: johanh@domain.company.com
      models:
        controller: admin
hallback@t1000:~$ juju upgrade-juju
best version:
2.5.4
started upgrade to 2.5.4
Ok! Finally something works.
hallback@t1000:~$ juju status
Model       Controller   Cloud/Region                  Version  SLA          Timestamp
controller  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.4    unsupported  14:03:20+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.212  juju-83eb7f-0  bionic      poweredOn
Now my idea was to go back to one of the faulty models and continue my work with the original problem:
hallback@t1000:~$ juju switch slurm-u
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-u
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-u  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  14:04:38+02:00

Machine  State    DNS             Inst id        Series  AZ  Message
0        stopped  10.104.129.171  juju-3ae1b0-0  xenial      poweredOn
1        stopped  10.104.129.44   juju-3ae1b0-1  disco       poweredOn
2        stopped  10.104.129.45   juju-3ae1b0-2  disco       poweredOn
3        stopped  10.104.129.46   juju-3ae1b0-3  disco       poweredOn
4        pending                  pending        disco
5        pending                  pending        disco
Still no sign of life. Ok, so let’s upgrade this one too:
hallback@t1000:~$ juju upgrade-juju
best version:
2.5.4
ERROR some agents have not upgraded to the current model version 2.5.1: machine-4, machine-5
hallback@t1000:~$ juju remove-machine 4 --force
removing machine 4
hallback@t1000:~$ juju remove-machine 5 --force
removing machine 5
(The machines would not disappear according to the output of juju status)
hallback@t1000:~$ juju add-credential --replace vmware01-prod
Enter credential name: johanh
A credential "johanh" already exists locally on this client.
Replace local credential? (y/N): y
Using auth-type "userpass".
Enter user: johanh@domain.company.com
Enter password:
Credential "johanh" updated locally for cloud "vmware01-prod".
hallback@t1000:~$ juju update-credential vmware01-prod johanh
Credential valid for:
slurm-f
slurm-b
slurm-p
slurm-k
slurm-e
slurm-c
slurm-t
slurm-g
slurm-l
slurm-a
slurm-h
slurm-i
slurm-q
slurm-m
slurm-u
slurm-n
slurm-d
slurm-s
slurm-o
slurm-r
slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.
After this step, everything starts to work!
hallback@t1000:~$ juju status
ERROR model iuba-vmware:JHALLBACK@domain/slurm-u not found
The model is now deleted! This should have happened hours ago. The machines (VMs) were also immediately deleted in VMware.
Now, retrying this method on some other model that I haven’t issued a “destroy” on yet:
hallback@t1000:~$ juju switch slurm-m
iuba-vmware:admin/controller -> iuba-vmware:JHALLBACK@domain/slurm-m
hallback@t1000:~$ juju add-unit slurm-node
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  11:58:58+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco
…and we wait more than ten minutes, nothing happens…
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:09:37+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  pending        disco
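With around 25 models, spotting stuck machines by eye gets tedious. The following is a rough sketch that picks out machines whose State is "pending" from the tabular `juju status` output shown above. The function name `pending_machines` is my own, and the parsing assumes the default tabular layout where the Machine table starts with a "Machine State ..." header row and ends at a blank line or end of output.

```shell
#!/bin/sh
# Print the IDs of machines whose State column is "pending".
# Reads tabular `juju status` output on stdin.
pending_machines() {
    awk '
        # The header row marks the start of the Machine table.
        $1 == "Machine" && $2 == "State" { in_machines = 1; next }
        # A blank line ends the current table.
        NF == 0 { in_machines = 0 }
        # Inside the Machine table, field 1 is the ID and field 2 the State.
        in_machines && $2 == "pending" { print $1 }
    '
}

# Usage for a single model:
#   juju status -m slurm-m | pending_machines
```

Note that a machine is also briefly "pending" during a normal deployment, so this only becomes interesting when the same machine keeps showing up after several minutes.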
hallback@t1000:~$ juju whoami
Controller: iuba-vmware
Model: slurm-m
User: JHALLBACK@domain
hallback@t1000:~$ juju list-credentials vmware01-prod
Cloud Credentials
vmware01-prod johanh*
hallback@t1000:~$ juju show-credential vmware01-prod johanh
controller-credentials:
  vmware01-prod:
    johanh:
      content:
        auth-type: userpass
        user: johanh@domain.company.com
      models:
        slurm-a: admin
        slurm-b: admin
        slurm-c: admin
        slurm-d: admin
        slurm-e: admin
        slurm-f: admin
        slurm-g: admin
        slurm-h: admin
        slurm-i: admin
        slurm-j: admin
        slurm-k: admin
        slurm-l: admin
        slurm-m: admin
        slurm-n: admin
        slurm-o: admin
        slurm-p: admin
        slurm-t: admin
Ok, everything SHOULD be fine. Now, at 12:13:33+02:00, I issue the following command:
hallback@t1000:~$ juju update-credential vmware01-prod johanh
Credential valid for:
slurm-f
slurm-b
slurm-p
slurm-k
slurm-e
slurm-c
slurm-t
slurm-g
slurm-l
slurm-a
slurm-h
slurm-i
slurm-m
slurm-n
slurm-d
slurm-o
slurm-j
Controller credential "johanh" for user "JHALLBACK@domain" on cloud "vmware01-prod" updated.
For more information, see ‘juju show-credential vmware01-prod johanh’.
The machine starts to deploy within 10 seconds, and its status is set to poweredOn:
hallback@t1000:~$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:14:59+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu
slurm-node                   waiting    1/2  slurm-node        jujucharms    6  ubuntu

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         waiting   allocating  4                                  waiting for machine

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        pending                  juju-9048b9-4  disco       poweredOn
After some more minutes the unit is up and is related:
$ juju status
Model    Controller   Cloud/Region                  Version  SLA          Timestamp
slurm-m  iuba-vmware  vmware01-prod/Sodertalje-HPC  2.5.1    unsupported  12:17:41+02:00

App               Version    Status   Scale  Charm             Store       Rev  OS      Notes
mysql             5.7.25     active       1  mysql             jujucharms   58  ubuntu
slurm-controller  18.08.5.2  active       1  slurm-controller  jujucharms    4  ubuntu
slurm-dbd         18.08.5.2  active       1  slurm-dbd         jujucharms    1  ubuntu
slurm-node        18.08.6.2  active       2  slurm-node        jujucharms    6  ubuntu

Unit                 Workload  Agent       Machine  Public address  Ports     Message
mysql/0*             active    idle        0        10.104.129.243  3306/tcp  Ready
slurm-controller/0*  active    idle        2        10.104.129.206            Ready
slurm-dbd/0*         active    idle        1        10.104.129.169            Ready
slurm-node/0*        active    idle        3        10.104.129.133            Ready
slurm-node/1         active    idle        4        10.104.129.23             Ready

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.104.129.243  juju-9048b9-0  xenial      poweredOn
1        started  10.104.129.169  juju-9048b9-1  disco       poweredOn
2        started  10.104.129.206  juju-9048b9-2  disco       poweredOn
3        started  10.104.129.133  juju-9048b9-3  disco       poweredOn
4        started  10.104.129.23   juju-9048b9-4  disco       poweredOn
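To recap, the workaround boiled down to the two credential commands shown earlier in this post. Below is a minimal sketch that wraps them in one function. The function name `refresh_credential` and the `DRY_RUN` variable are my own inventions for illustration; the two juju commands are exactly the ones I ran above, on Juju 2.5.x. I have not verified that this helps in every case, only that it did for the models on this controller.

```shell
#!/bin/sh
# Re-push an unchanged credential to the controller.
# Arguments: cloud name and credential name as known to Juju.
# With DRY_RUN=1 (hypothetical switch) the commands are only printed.
refresh_credential() {
    cloud="$1"
    cred="$2"
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"

    # Step 1: re-enter the (same) user and password locally.
    # This prompts interactively for the credential name and secrets.
    $run juju add-credential --replace "$cloud" || return 1

    # Step 2: upload the local credential to the controller, which
    # re-validates it against every model that uses it.
    $run juju update-credential "$cloud" "$cred"
}

# Usage with the names from this post:
#   refresh_credential vmware01-prod johanh
```

Running it first with `DRY_RUN=1` just prints the two commands, which is a cheap way to double-check the cloud and credential names before touching the controller.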