Hi,
I have a very weird issue with the Octavia charm.
I have a fairly complete bundle, with HA and VIPs just about everywhere, and as far as I can tell, it works.
I can create instances, attach floating IPs, migrate, resize, … Even Masakari works.
Then I wanted to add Octavia so that I can create LBs.
My Octavia deployment is based on the Octavia overlay from the Octavia charm documentation, except that it is deployed in HA.
Deployment goes well, and I followed the instructions from the documentation (creating certificates, running the actions to create resources, building the amphora image, …).
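For context, the post-deployment steps were roughly the following; this is an approximate reconstruction from the charm documentation (the certificate file names are the ones from the docs’ example), not an exact transcript of what I typed:

# certificates used by Octavia to talk to the amphorae
juju config octavia \
    lb-mgmt-issuing-cacert="$(base64 controller_ca.pem)" \
    lb-mgmt-issuing-ca-private-key="$(base64 controller_ca_key.pem)" \
    lb-mgmt-issuing-ca-key-passphrase=foobar \
    lb-mgmt-controller-cacert="$(base64 controller_ca.pem)" \
    lb-mgmt-controller-cert="$(base64 controller_cert_bundle.pem)"
# create the load balancer management network, security groups, etc.
juju run-action --wait octavia/0 configure-resources
# the amphora image itself was built and uploaded as described in the documentation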
Everything is green everywhere except that, from the moment I add Octavia to my model, I can’t create new instances in OpenStack anymore.
In Horizon, when creating a new instance, after launching it, I get this error message:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1462, in schedule_and_build_instances
    host_lists = self._schedule_instances(context, request_specs[0],
  File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 868, in _schedule_instances
    host_lists = self.query_client.select_destinations(
  File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations
    return self.scheduler_rpcapi.select_destinations(context, spec_obj,
  File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 177, in call
    self.transport._send(self.target, msg_ctxt, msg,
  File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 124, in _send
    return self._driver.send(target, ctxt, message,
  File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 652, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 644, in _send
    raise result
nova.exception_Remote.NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 200, in select_destinations
    raise exception.NoValidHost(reason="")
nova.exception.NoValidHost: No valid host was found.
I tried to pinpoint what’s wrong but this is far from easy …
The first weird thing I discovered is this, on a nova-compute machine:
2021-01-04 15:05:35.478 1043113 ERROR nova.compute.manager keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.211.229:8778/resource_providers?in_tree=e4495b7d-5692-45db-9524-1ca39335d4c8: HTTPSConnectionPool(host='192.168.211.229', port=8778): Max retries exceeded with url: /resource_providers?in_tree=e4495b7d-5692-45db-9524-1ca39335d4c8 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fcfb0055730>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))
192.168.211.229 is my Placement VIP on the “internal space”.
So I tried a simple test: curl this IP:port combo and see if I get an answer:
root@lab-mrs-worker3:~# curl https://192.168.211.229:8778
curl: (7) Failed to connect to 192.168.211.229 port 8778: Connection refused
root@lab-mrs-worker3:~# curl https://192.168.210.229:8778
{"versions": [{"id": "v1.0", "max_version": "1.36", "min_version": "1.0", "status": "CURRENT", "links": [{"rel": "self", "href": ""}]}]}
As you can see, 192.168.211.229, the VIP on the “internal space”, doesn’t answer, while 192.168.210.229, the VIP on my “public and admin space”, does. This is weird because those two subnets are routed and there is no firewall between them, so it simply means Placement is not listening on the “internal space”.
So I SSHed into the Placement unit that was holding the VIP and repeated my curls locally against “127.0.0.1”, “localhost” and the two VIPs of my two spaces: 192.168.210.229 and 192.168.211.229.
All those curls worked.
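With hindsight, what I should probably have compared at that point is where the VIP traffic actually goes, from both sides; something along these lines (generic commands, adapt addresses and units to your own deployment):

# on the compute node: which MAC does the internal VIP resolve to?
ip neigh show 192.168.211.229
# on each Placement unit: who really holds the VIP, and what is listening on 8778?
ip addr | grep 192.168.211.229
sudo ss -tlnp | grep 8778
sudo crm status    # hacluster/corosync view of the VIP resources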
So wtf!?
There is neither iptables nor any other firewall involved, so why??
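(To be clear, this is the kind of check I mean, run on both the compute node and the Placement units:)

sudo iptables -L -n -v    # only empty/ACCEPT chains here
sudo nft list ruleset     # nothing filtering 8778 either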
Then I had the idea of rebooting all three of my Placement units, because I suspected Corosync or HAProxy of some weird behavior.
After the reboots (one unit after the other, leaving 30s between each to let the status settle back to green every time), I tried again and guess what? It worked.
Rebooting the units fixed the issue and now I can create new instances from Horizon.
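For the record, the reboot sequence was essentially this (unit names are from my model, and it is a crude sketch rather than a polished procedure):

# reboot the Placement units one by one, letting things settle in between
for u in placement/0 placement/1 placement/2; do
    juju run --unit "$u" -- sudo reboot || true
    sleep 30    # then wait for juju status to go green again before the next one
done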
I must say that this is the 4th time I have deployed Octavia with this bundle and I ALWAYS hit this issue.
I’m pretty confident that Placement is NOT the only charm with this kind of problem, but the other ones are not as easy to pinpoint, so for now I have decided to reboot all my units once the deployment is finished, until there is a proper fix.
My guess is that my configuration is triggering this and that I have hit some kind of corner case (just like the one I hit with certificate symlinks a few months ago).
Here are my specs:
- 3 units per control plane component (for HA)
- 2 subnets for 2 spaces (“internal” + “public/admin”)
- HA with VIP/hacluster (1 VIP per subnet), wired up roughly as in the sketch below
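For reference, the Placement part of my bundle is roughly equivalent to the following commands; the space names are made up for the example and the real configuration lives in the bundle YAML, so treat this as an illustration only:

# bind the API endpoints to both spaces, then give hacluster one VIP per subnet
juju deploy -n 3 placement --bind "public=public-admin-space admin=public-admin-space internal=internal-space"
juju deploy hacluster placement-hacluster
juju config placement vip="192.168.210.229 192.168.211.229"
juju add-relation placement placement-hacluster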
If it can help … anyway, I’d love to see this fixed.