[Fixed] Can't create new instances after deploying Octavia charm

Hybrid512 · 4 January 2021 16:10

Hi,

I have a very weird issue with the Octavia charm.
I have a fairly complete bundle with HA and VIPs almost everywhere and, as far as I can tell, it works.
I can create instances, attach floating IPs, migrate, resize, … Even Masakari works.
Then I wanted to add Octavia so that I can create LBs.
My Octavia deployment is based on the Octavia overlay you can find in the Octavia Charm documentation except that it is HA.
Deployment goes well and I followed the instructions from the documentation (creating certificates, executing actions to create resources, amphorae images, …).
Everything is green everywhere except that, as soon as I add Octavia to my model, I can’t create new instances in OpenStack anymore.
In Horizon, when creating a new instance, after launching it, I get this error message :

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1462, in schedule_and_build_instances
    host_lists = self._schedule_instances(context, request_specs[0],
  File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 868, in _schedule_instances
    host_lists = self.query_client.select_destinations(
  File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations
    return self.scheduler_rpcapi.select_destinations(context, spec_obj,
  File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 177, in call
    self.transport._send(self.target, msg_ctxt, msg,
  File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 124, in _send
    return self._driver.send(target, ctxt, message,
  File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 652, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 644, in _send
    raise result
nova.exception_Remote.NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 200, in select_destinations
    raise exception.NoValidHost(reason="")
nova.exception.NoValidHost: No valid host was found.

I tried to pinpoint what’s wrong but this is far from easy …
The first odd thing I found was this, on a nova-compute machine :

2021-01-04 15:05:35.478 1043113 ERROR nova.compute.manager keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.211.229:8778/resource_providers?in_tree=e4495b7d-5692-45db-9524-1ca39335d4c8: HTTPSConnectionPool(host='192.168.211.229', port=8778): Max retries exceeded with url: /resource_providers?in_tree=e4495b7d-5692-45db-9524-1ca39335d4c8 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fcfb0055730>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))

192.168.211.229 is my Placement VIP on the “internal space”.

So I tried a simple test: curl this IP:PORT combo and see whether I get an answer :

root@lab-mrs-worker3:~# curl https://192.168.211.229:8778
curl: (7) Failed to connect to 192.168.211.229 port 8778: Connection refused

root@lab-mrs-worker3:~# curl https://192.168.210.229:8778
{"versions": [{"id": "v1.0", "max_version": "1.36", "min_version": "1.0", "status": "CURRENT", "links": [{"rel": "self", "href": ""}]}]}

As you can see, 192.168.211.229 (the “internal space”) doesn’t answer but 192.168.210.229 (my “public and admin space”) does. This is weird because those 2 subnets are routed and there is no firewall between them, so it looks as if Placement is not listening on the “internal space”.

So I decided to ssh to the Placement unit which was supporting the VIP and tried to repeat my curl locally on “127.0.0.1”, “localhost” and my 2 VIPs on my 2 spaces : 192.168.210.229 and 192.168.211.229.
All those curls worked.
So wtf !?!
There is neither iptables nor any firewall involved, so why ??

Then I had the idea to reboot all 3 of my Placement units because I suspected Corosync or HAProxy of some weird behavior.
After the reboot (one unit after the other, leaving 30s between them to let the status settle back to green each time), I tried again and guess what ? It worked.
Rebooting the unit fixed the issue and now, I can create new instances from Horizon.

I must say that this is the 4th time I have deployed Octavia with this bundle and I have ALWAYS encountered this issue.
I’m pretty confident that Placement is NOT the only charm that has this kind of problem but the other ones are not that easy to pinpoint so, for now, I have decided to reboot all my units once the deployment is finished, until there is a proper fix.

My guess is that my configuration is causing this and that I might have hit some kind of corner case (just like the one I hit with certificate symlinks a few months ago).

Here are my specs :

3 units per control plane component (for HA)
2 subnets for 2 spaces (“internal” + “public/admin”)
HA with VIP/hacluster (1 VIP per subnet)

If it can help … anyway, I’d love to see this fixed.

routergod · 5 January 2021 11:31

Hi @Hybrid512. This is intriguing, I suspect not specifically an Octavia problem.

I’m wondering about the topology you have here, can you confirm/elaborate a few points please? You have use-internal-endpoints=true on all charms? Are the machines dual-homed on both subnets, sharing a single L2, or is it actually a /23 which you have subdivided logically for the spaces you have defined?

Can you confirm that the Placement API is not bound to the .211.x address prior to your reboot?

ss -tlpn | grep 8778

If it is not, and a reboot fixes that then perhaps a bug in the Placement charm?

Hybrid512 · 5 January 2021 13:56

I doubt it too … or well … at least, not only.
It might be something more widespread and I suspect my kind of setup (multiple subnets with HA and VIPs) to be related to the issue but that still looks like a bug to me.

Nope, for now it’s using the defaults. I wanted to try use-internal-endpoints=true but the last time I did, it didn’t have any visible effect on the bugs I had encountered before (especially this one : How do you use hacluster?).
However, that previous one really was a bug, and a fix is being applied to every charm if I understood correctly (Bug #1893847 “Certificates are not created” : Bugs : OpenStack Placement Charm)

Here is my topology in details :

3 bare metal control plane nodes provisioned by MaaS with these subnets :
- “pxe space” : 192.168.203.0/24, non routed, untagged
- “public/admin space” : 192.168.210.0/24, routed, VLAN
- “internal space” : 192.168.211.0/24, routed, VLAN
4 bare metal compute nodes provisioned by MaaS with these subnets :
- “pxe space” : 192.168.203.0/24, non routed, untagged
- “public/admin space” : 192.168.210.0/24, routed, VLAN
- “internal space” : 192.168.212.0/24, routed, VLAN

All of these machines are connected through an L2/L3 switch. Every routed VLAN can communicate with the others and no firewall is involved: this is just L2/L3 routing done at the switch level.

Can you confirm that the Placement API is not bound to the .211.x address prior to your reboot?
ss -tlpn | grep 8778
If it is not, and a reboot fixes that then perhaps a bug in the Placement charm?

I can’t right now because the reboot fixed the issue but I’ll check again next time I redeploy.
However, as I said, on the Placement unit itself the API was answering on every interface (localhost, .211.x and .210.x), while from a remote machine (a compute node) only .210.x answered. For .211.x I was getting curl: (7) Failed to connect to 192.168.211.229 port 8778: Connection refused
After rebooting the unit, the problem was solved.

Tbh, I would suspect a problem with Corosync because 192.168.210.229/192.168.211.229 are the VIPs, not the units’ own IPs, and this never happens in a non-HA setup because there is no hacluster charm involved (and thus no Corosync/Pacemaker).
I’m still not very confident in hacluster because most of my issues seem to be related to the HA setup. Since hacluster is a subordinate of Placement, rebooting Placement also reboots hacluster.

We had something similar recently in our infrastructure: we had to change the default gateway on a cluster that had a VIP (using Keepalived, not Corosync), and as soon as we changed the default route, all access to the VIP stopped working.
As soon as we restarted Keepalived, the VIP was working again.
I suspect some kind of route caching done at the VIP level (at least for this last issue we had with Keepalived) and only restarting the service forced the network to work properly again.

Maybe there is something similar in this issue.

routergod · 5 January 2021 15:12

Ok thanks for clarifying. I think your suggestion about the VIP is worth pursuing, but the connection refused has me puzzled. If it were a routing problem I would expect a timeout.
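To make the refused-versus-timeout distinction concrete, here is a minimal Python sketch that could be run from a compute node. The helper name `probe` and the 3 second default timeout are my own choices for illustration, not something from the charms; the address and port in the comment are the ones quoted above. "refused" means some host answered for that address but nothing was accepting connections on the port, whereas "timeout" would point more towards packets being dropped or not routed.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout" or "error: <details>".
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A reset came back: the address is reachable, the port is closed.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or "Network is unreachable".
        return f"error: {exc}"


# Example call with the VIP and port from this thread:
#   probe("192.168.211.229", 8778)
```

Running it against both VIPs from the same machine gives a quick side-by-side comparison without TLS getting in the way.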

Hybrid512 · 8 March 2021 10:55

Oh, I see I didn’t reply to this one.
It was my fault in the end: I had misconfigured the Placement and Octavia VIPs (a bad copy/paste, they had the same VIP).
It’s fixed but I still have some other issues with Octavia that are referenced in other posts/bug reports.

Thanks.
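For anyone hitting the same thing, a small check like the following could catch a duplicated VIP before deploying. This is only a sketch under assumptions: it takes a plain mapping of application name to the charm's `vip` option (a space-separated list of addresses), and how you extract that mapping from your bundle is up to you. The function name `find_shared_vips` is made up for this example, and the sample values are illustrative: the Placement VIPs are the ones from this thread, the other addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_vips(vips_by_app: Dict[str, str]) -> Dict[str, List[str]]:
    """Return every VIP that is claimed by more than one application."""
    owners = defaultdict(set)
    for app, vip_option in vips_by_app.items():
        # The vip option holds one or more addresses separated by spaces.
        for vip in vip_option.split():
            owners[vip].add(app)
    return {vip: sorted(apps) for vip, apps in owners.items() if len(apps) > 1}


# Illustrative data only: octavia repeats the placement VIPs (the copy/paste
# mistake described above); the keystone addresses are hypothetical.
example = {
    "placement": "192.168.210.229 192.168.211.229",
    "octavia": "192.168.210.229 192.168.211.229",
    "keystone": "192.168.210.230 192.168.211.230",
}

for vip, apps in find_shared_vips(example).items():
    print(f"{vip} is used by: {', '.join(apps)}")
```

An empty result means no two applications in the mapping share an address.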