vSphere clusters can't create new machines after upgrade to 2.7.0

merlijn-sebrechts · 17 January 2020 16:46

Looks like we started trusting Juju upgrades a bit too much and an upgrade to 2.7.0 broke vSphere support.

Due to a recent change, Juju once again contacts the vSphere hosts directly to upload images instead of going through the API. These hosts are on their own separate network, which is not accessible to Juju. All communication should happen through the API. So now, when we try to create a new VM, we get this error:

failed to start machine 47 (creating template VM: streaming http://cloud-images.ubuntu.com/releases/server/releases/xenial/release-20200108/ubuntu-16.04-server-cloudimg-amd64.ova to https://vnode1.test/nfc/521dec15-9a16-7856-f55f-8c92a328e84a/disk-0.vmdk: Post https://vnode1.test/nfc/521dec15-9a16-7856-f55f-8c92a328e84a/disk-0.vmdk: dial tcp 172.16.254.1:443: connect: no route to host), retrying in 10s (10 more attempts)
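
The `no route to host` at the end of that error is a plain TCP connection failure, so it can be checked outside Juju. Here is a minimal Python sketch, assuming it is run from the controller machine. The helper name `can_connect` is made up for illustration and is not part of Juju; the address and port in the usage note are the ones from the error above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", refused connections, timeouts and
        # failed name resolution.
        return False
```

Calling `can_connect("172.16.254.1", 443)` from the controller should return `False` in a setup like ours, where the ESX hosts sit on a network the controller cannot reach.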

Another bug report from 2017 mentions the same issue, saying that Juju should use the API instead of contacting the vSphere hosts. That was fixed at the time, but a recent change somehow brought the behavior back. A quick look through the source code suggests the issue was reintroduced in this PR: https://github.com/juju/juju/pull/10461

@timClicks am I correct that that PR is the cause of our issue? Is there anything we can do to work around this issue?

kelvin.liu · 20 January 2020 06:20

Hi @merlijn-sebrechts

The PR you mentioned introduced vSAN support and changed the way a new VM is created.
The step that failed was ensuring the template VM exists (loading the VMDK failed because the host it is uploaded to is not reachable over the network).
I don’t have the context for this change, so I am not sure why we no longer upload the VMDK to the datastore.
Our vSphere experts @babbageclunk and @timClicks are on leave now.
I will discuss a solution with @babbageclunk once he is back this Wednesday.

Sorry that the latest upgrade broke your cluster.

merlijn-sebrechts · 20 January 2020 13:01

Ok, let me know when you know more.

As a workaround for now, I manually attached the controller VMs to the storage network and added DNS rules so the hostnames of the VMware hosts resolve to their IPs on the storage network. For new controllers we’ll just stay on the previous series.
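
To double-check a workaround like this, it helps to confirm from the controller that the ESX hostnames really resolve to storage-network addresses. Below is a rough Python sketch. The subnet `172.16.254.0/24` is only a guess based on the address in the error message, and `resolves_into` is an illustrative helper name, not anything that ships with Juju.

```python
import ipaddress
import socket


def resolves_into(hostname: str, cidr: str) -> bool:
    """Return True if every address hostname resolves to lies inside cidr."""
    network = ipaddress.ip_network(cidr)
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name does not resolve at all.
        return False
    # info[4][0] is the address string; drop any IPv6 scope suffix ("%eth0").
    addresses = {ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos}
    return bool(addresses) and all(addr in network for addr in addresses)
```

For example, `resolves_into("vnode1.test", "172.16.254.0/24")` should return `True` on a controller where the DNS rules are in place, assuming that subnet matches your storage network.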

babbageclunk · 22 January 2020 05:43

Hi @merlijn-sebrechts, really sorry about the breakage.

I’m not sure what the right solution is yet, but some backstory:
That PR is definitely the cause of the issue. It fixes a problem people were seeing that prevented bootstrapping on vSAN datastores. This turned out to be because the VMDK file uploaded to the datastore was in the wrong format - VMs created referring to the disk wouldn’t boot. The only way we could find to get vSphere to rewrite the disk file was to upload it using the HttpNfcLease.

We thought this was OK because we could mark that VM as a template and clone it to create other machines (for the same series) in the controller. That meant that it still avoided the cost of getting the VMDK from cloud-images each time. Unfortunately @timclicks and I didn’t know about the other requirement that we not talk to the host machines directly (which using HttpNfcLease.Upload does). The developer who fixed the linked bug left the company over a year before the vSAN issues came up, otherwise I’m sure he would have pointed it out.

As it stands we’re trying to find a different way to get the disk file rewritten correctly that can avoid using lease.Upload.

merlijn-sebrechts · 22 January 2020 11:59

Ok, thanks for the update! I understand the difficulty; vSphere is a very complicated beast. We’re happy Juju abstracts a lot of the complexity away so that we don’t have to mess too much with vSphere directly…

timClicks · 28 January 2020 01:14

I should also apologise. I’ve been on leave for a few weeks and wasn’t able to respond straight away.

Sorry to hear about your difficulties. Would you be able to help us test a fix when it’s available?

timClicks · 28 January 2020 01:26

@babbageclunk, @kelvin.liu, @merlijn-sebrechts here is some proposed wording that we could include in our vSphere documentation until the issue is resolved:

Known issue affecting VM creation: In 2.7.0, networking rules must allow direct access to the ESX host for the Juju client and the controller VM. The Juju client’s access is required to upload disk images and the controller requires access to finalise the bootstrap process. If this access is not permitted by your site administrator, remain with Juju 2.6.9. This was an inadvertent regression and will be fixed in a future release.

Edit: added to the docs

wallyworld · 29 January 2020 07:10

It could be argued that it’s currently a necessary requirement to enable vSAN support rather than a regression per se. We’re currently looking for an alternative approach to enabling vSAN support, but so far we are limited by the APIs we have available as to what can be done. We’ll definitely try to remove the need for the firewall hole.