I have a working model of kubernetes-core. I’ve successfully added workers to the model using the ‘add-unit’ command. I tried to add one before I had a machine prepared, and it’s hung in the ‘pending’ state with:
No available machine matches constraints: [('agent_name', ['a45e8291-6a31-4a84-825e-170f4f0ac9a7']), ('cpu_count', ['4']), ('mem', ['4096']), ('storage', ['root:16']), ('zone', ['default'])] (resolved to "cpu_count=4.0 mem=4096.0 storage=root:16 zone=default")
After I prepared a machine I tried ‘juju retry-provisioning #’ on the machine in the pending state, but it still won’t provision. I think it’s still looking for a host with that ‘agent_name’.
Is there a way to restart the provisioning process that will reset the constraints, or do I just need to back out and try again?
The best option at the moment is to remove the hung unit (probably requiring --force) and add-unit again.
(IIANM there’s currently no way to trigger Juju to retry the provisioning)
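For example, if the hung unit were a worker (the unit name here is illustrative, not from the report above):

$ juju remove-unit kubernetes-worker/2 --force
$ juju add-unit kubernetes-worker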
(resolved to "cpu_count=4.0 mem=4096.0 storage=root:16 zone=default")
Those are the final constraints that we are applying to MAAS, and it is MAAS telling us that no matching machine is available.
I would think that retry-provisioning would be able to find a new machine recorded in MAAS, unless it had somehow also been acquired for a different purpose?
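To see what constraints Juju has recorded on its side, something like this should work (a sketch against the Juju 2.x CLI; the machine number and application name are placeholders):

$ juju get-model-constraints
$ juju get-constraints kubernetes-worker
$ juju show-machine 0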
$ juju add-model sandbox
$ juju set-model-constraints mem=16G   # there is no machine available with 16G
$ juju add-machine
# wait until it fails; status shows:
#   No available machine matches constraints: [('agent_name', ['62a5f18a-7086-4157-83b1-c2fc962aa04c']), ('mem', ['16384']), ('zone', ['default'])] (resolved to "mem=16384.0 zone=default")
# create and commission a 24G machine, wait until 'ready' in MAAS
$ juju retry-provisioning 0
# no change; status still shows:
#   No available machine matches constraints: [('agent_name', ['62a5f18a-7086-4157-83b1-c2fc962aa04c']), ('mem', ['16384']), ('zone', ['default'])] (resolved to "mem=16384.0 zone=default")
The only way I’ve been able to reprovision is wallyworld’s suggestion, or removing and re-adding the machine.
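For the repro above, that means (machine number from the example):

$ juju remove-machine 0 --force
$ juju add-machine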
That particular error string, “No available machine matches constraints”, is coming from MAAS.
My understanding of how agent_name works is that we aren’t requesting a machine tagged with the agent_name; we are requesting a machine with these constraints and saying “when you give it to us, please attach this agent name”. (We use the agent_name field to associate that machine with the model where it is being used.)
So I’m pretty sure that agent_name is not the issue. I’m not sure what the issue is wrt retry-provisioning not actually being able to find the new machine, while remove and add is.
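For illustration, the request Juju makes is roughly equivalent to this MAAS 2.x CLI allocation (a sketch using the constraints from the original error; <profile> stands for your logged-in MAAS CLI profile):

$ maas <profile> machines allocate cpu_count=4 mem=4096 zone=default agent_name=a45e8291-6a31-4a84-825e-170f4f0ac9a7

Note that agent_name is recorded on whatever machine comes back; it is not a filter on which machines qualify.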
That particular issue is different, in that it was a case where MAAS did assign a machine to the request, but then they wanted to reject the provisioned one and provision a new one.
Though they are potentially related, and you could certainly include your information there.
It seems like it’s doing something before the OS deployment is complete that locks it to a given machine.
I tried a full k8s charm deployment, came back, and found Juju with two machines ‘down’ with ‘failed deployment’. There were other identical VMs ‘ready’ in MAAS, but it didn’t try provisioning those after the failures. I tried ‘retry-provisioning’ on one of them but nothing happened. Then I tried ‘remove-machine’ and got: removing machine 6 failed: machine 6 has a unit “kubernetes-master/1” assigned.
Interestingly, I then set those failed machines to ‘test’ in MAAS, and Juju status showed them PXE booting and loading ephemeral. But then the machines powered off in MAAS and got stuck in “Ready: Loading ephemeral”.
So then I just ran ‘remove-unit’ for the units on those machines, after which I could ‘remove-machine --force’.
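In other words (machine and unit numbers taken from my deployment):

$ juju remove-unit kubernetes-master/1
$ juju remove-machine 6 --force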
Then I was able to ‘add-unit’ and spin up another master and worker, but somehow it picked the same MAAS VM for both:
Machine  State    DNS           Inst id     Series  AZ       Message
10       pending  10.120.9.137  maas-3-125  focal   default  Deploying: Powering on
11       pending  10.120.9.137  maas-3-125  focal   default  Deploying: Configuring OS
And it actually finished deploying the OS and installing the master on machine 11, but 10 just stuck at ‘Deploying: Powering on’. Then I ran ‘remove-unit’ again on the worker on machine 10, then ‘remove-machine’ on 10, but that also powered off machine 11, so I removed the master and that machine too.
Then I re-added another master unit, let it spin up, then added a worker, and now everything is hunky dory. No idea what happened with the double-machine thing. Guessing I need to let things settle after a ‘remove’ before more ‘adds’.
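Something like this pacing is what I’ll try next time (a sketch; the unit name and watch interval are arbitrary):

$ juju remove-unit kubernetes-worker/3   # hypothetical unit name
$ watch -n 10 juju status                # wait until the unit and its machine are gone
$ juju add-unit kubernetes-master
$ watch -n 10 juju status                # wait until it is fully deployed
$ juju add-unit kubernetes-worker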
Sorry this is so jumbled; I’m learning as I go, but I thought it might be good to describe this experience here. FYI, I’m working with Proxmox as the hypervisor and using a MAAS power driver script I found online.