Cannot run juju add-machine: machines stay in a pending state

Hello Everyone,

We’ve been using Juju for the past year and a half, since our OpenStack deployment, with absolutely no problems, and it has been working great.

Recently we had a storage issue on the host where our Juju controller lives. The Mongo database was not starting and I had to carry out some work to get it running again.

After that, I was able to run juju status and it was working.

Today I tried to add a new machine to our deployment and it just sits in a pending state. We are using MAAS as the cloud, and usually that command starts the deployment of a physical machine.

Digging through the logs of machine-0 on our Juju controller, I can see this:

2021-10-13 11:32:29 ERROR juju.worker.dependency engine.go:663 "raft" manifold worker returned unexpected error: timed out waiting for worker loop

This happens straight after I run the command. If I add --debug to it, it gives me this output:

12:32:07 INFO juju.cmd supercommand.go:56 running juju [2.9.0 0 gc go1.16.3]

12:32:07 DEBUG juju.cmd supercommand.go:57 args: []string{"juju", "add-machine", "--debug"}

12:32:07 INFO juju.juju api.go:78 connecting to API addresses: [x.x.x.x:17070]

12:32:07 DEBUG juju.api apiclient.go:1132 successfully dialed "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:07 INFO juju.api apiclient.go:664 connection established to "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:08 INFO juju.juju api.go:78 connecting to API addresses: [x.x.x.x:17070]

12:32:08 DEBUG juju.api apiclient.go:1132 successfully dialed "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:08 INFO juju.api apiclient.go:664 connection established to "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:08 INFO juju.cmd.juju.machine add.go:291 load config

12:32:08 INFO juju.juju api.go:78 connecting to API addresses: [x.x.x.x:17070]

12:32:08 DEBUG juju.api apiclient.go:1132 successfully dialed "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:08 INFO juju.api apiclient.go:664 connection established to "wss://x.x.x.x:17070/model/2989a1f5-c738-4e03-86ac-0f586a5d989a/api"

12:32:08 INFO juju.cmd.juju.machine add.go:316 model provisioning

12:32:08 INFO cmd add.go:363 created machine 10

12:32:08 DEBUG juju.api monitor.go:35 RPC connection died

12:32:08 DEBUG juju.api monitor.go:35 RPC connection died

12:32:08 DEBUG juju.api monitor.go:35 RPC connection died

12:32:08 INFO cmd supercommand.go:544 command finished

I am not sure exactly what the problem could be. It really feels like I no longer have control over my cloud. In the audit logs I get this:

{"errors":{"conversation-id":"c8169ed29b0b74ba","connection-id":"58","request-id":2,"when":"2021-10-13T11:32:08Z","errors":[null]}}

I'm not sure if it's connected, but when I run the command

juju run --unit mysql/0 leader-get

I usually get output, but since the storage issue it hasn't been working; it just hangs. That keeps me thinking I might have lost control of my cloud, but juju status seems to be reporting properly, so I am not sure.

I'm not sure if anyone can point me in the right direction for more troubleshooting.

Kind Regards,

Ejike

@simonrichardson @manadart does this sound like a raft issue to you?

Since you refer to the controller in the singular above, I am assuming that you are not running in HA.

There might be trouble starting Raft here, which would explain both the leader-get and provisioning hangs.

If you are running Juju 2.9+, you can introspect Raft’s view of leases by SSHing directly to the controller machine and running juju_leases.
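
For example, assuming your controller agent is machine 0 in the controller model:

juju ssh -m controller 0

juju_leases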

If you have had storage issues, it may be that the Raft log/snapshots have been corrupted. These are stored in /var/lib/juju/raft. If this is the case, you can resolve this by:

  • Stopping the controller.
  • Deleting the contents of /var/lib/juju/raft.
  • Starting the controller.

Note that this will effectively expire all current leases, leading to new claims and potentially some churn.
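
A rough sketch of those steps, run on the controller machine itself, assuming a single controller whose agent is machine 0 (the jujud-machine-0 systemd unit name is an assumption; check with systemctl list-units 'jujud-*' if yours differs):

sudo systemctl stop jujud-machine-0

sudo rm -rf /var/lib/juju/raft/*

sudo systemctl start jujud-machine-0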