[Fixed] Can't create new instances after deploying Octavia charm

I doubt it too… or at least, not only that. It might be something more widespread, and I suspect my kind of setup (multiple subnets with HA and VIPs) is related to the issue, but it still looks like a bug to me.

Nope, for now it’s using the defaults. I wanted to give use-internal-endpoints=true a try, but last time I did, it didn’t have any visible effect on the bugs I encountered before (especially this one: How do you use hacluster?). That previous one, however, really was a bug, and if I understood correctly, a fix is being applied to every charm (Bug #1893847 “Certificates are not created” : Bugs : OpenStack Placement Charm).

Here is my topology in detail:

  • 3 bare metal control plane nodes provisioned by MaaS with these subnets:
    • “pxe space” : 192.168.203.0/24, non routed, untagged
    • “public/admin space” : 192.168.210.0/24, routed, VLAN
    • “internal space” : 192.168.211.0/24, routed, VLAN
  • 4 bare metal compute nodes provisioned by MaaS with these subnets:
    • “pxe space” : 192.168.203.0/24, non routed, untagged
    • “public/admin space” : 192.168.210.0/24, routed, VLAN
    • “internal space” : 192.168.212.0/24, routed, VLAN

All of these machines are connected through an L2/L3 switch; every routed VLAN can communicate with the others, and there is no firewall involved. This is just L2/L3 routing done at the switch level.
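For reference, here is a quick sanity check of that inter-VLAN routing (just a sketch; the .212 addresses below are hypothetical examples from my compute internal subnet):

```
# From a control plane node (internal space 192.168.211.0/24), check the
# path to a compute node's internal space (192.168.212.0/24).
# 192.168.212.10 is a hypothetical compute node address.
ip route get 192.168.212.10   # which route/interface the kernel would use
ping -c 3 192.168.212.10      # confirm L3 reachability through the switch
```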

I can’t check right now because the reboot fixed the issue, but I’ll check again next time I redeploy. However, as I said, on the Placement unit itself the API was answering on every interface (localhost, .211.x and .210.x), while from a remote machine (a compute node), only .210.x answered; for .211.x I was getting curl: (7) Failed to connect to 192.168.211.229 port 8778: Connection refused. After rebooting the unit, the problem was solved.
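Next time, before rebooting, I plan to narrow it down with something like this (just a sketch; the port 8778 and the two VIPs come from my setup above):

```
# On the Placement unit: what is actually listening on the Placement port?
sudo ss -tlnp | grep 8778

# Which unit currently holds the VIPs? (run on each control plane node)
ip -br addr | grep -E '192\.168\.21[01]\.229'

# From a remote compute node: probe both VIPs.
curl -sv http://192.168.210.229:8778/ -o /dev/null
curl -sv http://192.168.211.229:8778/ -o /dev/null
```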

Tbh, I would suspect a problem with Corosync, because 192.168.210.229/192.168.211.229 are the VIPs, not the unit’s own IPs, and this never happens in a non-HA setup since there is no hacluster charm involved (and thus no Corosync/Pacemaker). I’m still not very confident in hacluster because most of my issues seem to be related to the HA setup, and since hacluster is subordinate to Placement, rebooting Placement in fact reboots hacluster too.
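If it happens again, I’ll also look at the cluster state before rebooting. A rough sketch of what I’d check on the Placement/hacluster unit, assuming the usual crmsh/Corosync tooling the hacluster charm sets up:

```
# One-shot view of cluster state and where the resources are running
# (the VIPs typically show up as ocf:heartbeat:IPaddr2 resources).
sudo crm_mon -1

# Detailed resource and node status.
sudo crm status

# Corosync ring/membership health.
sudo corosync-cfgtool -s
```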

We had something similar recently in our infrastructure, where we had to change the default gateway on a cluster with a VIP (managed by Keepalived, not Corosync): as soon as we changed the default route, all access to the VIP stopped working. As soon as we restarted Keepalived, the VIP was working again. I suspect some kind of route caching at the VIP level (at least for that last Keepalived issue), and only restarting the service forced the network to work properly again.
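If route caching really is the culprit, flushing cached routes might be a lighter-weight fix than restarting the service; a hedged sketch of what I’d try next time (we only verified the restart, not the flush):

```
# Flush cached routing decisions (on modern kernels the old IPv4 route
# cache is gone, but cached route exceptions can still linger).
sudo ip route flush cache

# What we actually did: restart Keepalived so it re-announces the VIP.
sudo systemctl restart keepalived

# Confirm the VIP is back on an interface (VIP address is setup-specific).
ip -br addr
```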

Maybe there is something similar in this issue.