Network outage (OVS/BFD failures), potentially after enabling QoS in the neutron-api charm. I cannot confirm 100% that it started when I enabled QoS and set a 1 Gbps QoS policy on a public provider network; the policy itself applied fine, and instances were rate limited to 1 Gbps during iperf3 tests.
juju config neutron-api enable-qos=true
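For reference, the policy was applied roughly along these lines (the policy and network names here are placeholders, not the actual ones I used):
# illustrative sketch only: names are placeholders
openstack network qos policy create bw-limit-1g
openstack network qos rule create --type bandwidth-limit --max-kbps 1000000 bw-limit-1g
openstack network set --qos-policy bw-limit-1g <public-provider-network>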
A while after enabling and testing QoS, I noticed that I lost the SSH session to my test instances a few times. Sometimes I could reconnect immediately; other times it took 30-60s. I then had reports of unreachable instances attached to a different public provider network, with no QoS policy applied. I jumped onto the instance consoles and started pinging 8.8.8.8: 0% packet loss for a few minutes, then 100% loss for a few minutes, alternating over and over.
First I disabled QoS again (juju config neutron-api enable-qos=false), with no improvement. Then, after noting OVN controller log errors on all compute nodes, I restarted openvswitch on all compute nodes. All log errors and connectivity issues were gone, and things remain OK days later.
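For completeness, restarting Open vSwitch on every chassis node can be rolled out in one go via Juju; a sketch only (the juju exec syntax is Juju 3.x, use juju run --application on 2.9; per node it is simply a service restart):
# per node
sudo systemctl restart openvswitch-switch
# or across all chassis units at once (Juju 3.x)
juju exec --application ovn-chassis -- sudo systemctl restart openvswitch-switch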
Timeline:
- ~13:26: BFD sessions failed (“Control Detection Time Expired”) and BFD state went DOWN.
- 13:53-14:12: OVN controller lost its connection to br-int.
- 14:03: Multiple ovsdb-server connection drops.
- 15:56: Recovery after openvswitch-switch restart; BFD state back UP.
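In case it helps anyone comparing notes, the live BFD and ovn-controller connection state can be inspected on a chassis node roughly like this (a sketch; the appctl targets shown are the defaults for a charm deployment, and availability of connection-status depends on the OVN version):
sudo ovs-appctl bfd/show                                   # per-tunnel BFD session state
sudo ovn-appctl -t ovn-controller connection-status        # ovn-controller -> southbound DB
sudo ovs-vsctl get open_vswitch . external_ids:ovn-remote  # configured SB endpoints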
The logs below occurred on all compute/network (converged) nodes, repeatedly, until the service was restarted (/etc/init.d/openvswitch-switch restart).
# ovs-vswitchd logs BFD failure
2025-01-30T13:26:38.786Z|bfd(monitor107)|INFO|ovn0-os-region0-0: BFD state change: up->down "Control Detection Time Expired"
2025-01-30T15:52:23.283Z|01662|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:31.287Z|01663|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:39.287Z|01664|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:47.291Z|01665|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:55.293Z|01666|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:03.295Z|01667|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:11.299Z|01668|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:19.300Z|01669|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:27.304Z|01670|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
# OVSDB logs connection drops
2025-01-30T14:03:41.101Z|jsonrpc|WARN|unix#415: receive error: Connection reset by peer
# OVN controller logs connection failures
2025-01-30T13:53:56.493Z|rconn|ERR|unix:/var/run/openvswitch/br-int.mgmt: no response to inactivity probe after 5 seconds, disconnecting
2025-01-30T14:10:50.828Z|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
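One thing I have been wondering about (purely a guess on my part): the "no response to inactivity probe after 5 seconds" message is ovn-controller's OpenFlow inactivity probe towards br-int, which is controlled by an Open_vSwitch external_ids key. A sketch of checking and raising it, assuming the ovn-chassis charm does not already manage this key (it may re-assert settings, so treat this as a diagnostic experiment rather than a fix):
sudo ovs-vsctl --if-exists get open_vswitch . external_ids:ovn-openflow-probe-interval  # blank if unset
sudo ovs-vsctl set open_vswitch . external_ids:ovn-openflow-probe-interval=30           # seconds; 0 disables the probe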
# Syslog
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set-ssl /etc/ovn/key_host /etc/ovn/cert_host /etc/ovn/ovn-chassis.crt
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- set open-vswitch . external-ids:ovn-encap-type=geneve -- set open-vswitch . external-ids:ovn-encap-ip=X.X.X.134 -- set open-vswitch . external-ids:system-id=hostname-1.domain -- set open-vswitch . external-ids:ovn-remote=ssl:X.X.X.179:6642,ssl:X.X.X.175:6642,ssl:X.X.X.169:6642 -- set open-vswitch . external_ids:ovn-match-northd-version=False
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-int -- set bridge br-int external-ids:charm-ovn-chassis=managed -- set bridge br-int protocols=OpenFlow13,OpenFlow15 -- set bridge br-int datapath-type=system -- set bridge br-int fail-mode=secure -- set bridge br-int other-config:disable-in-band=true
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-ex -- set bridge br-ex external-ids:charm-ovn-chassis=managed -- set bridge br-ex protocols=OpenFlow13,OpenFlow15 -- set bridge br-ex datapath-type=system -- set bridge br-ex fail-mode=standalone
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-port br-ex bond0 -- set Interface bond0 type=system -- set Interface bond0 external-ids:charm-ovn-chassis=br-ex -- set Port bond0 external-ids:charm-ovn-chassis=br-ex
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set open_vswitch . external_ids:ovn-bridge-mappings=physnet1:br-ex
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove open_vswitch . external_ids ovn-cms-options
- Switching: Nexus (no reported issues, physical changes, or port status changes)
- LACP bond, 2x 10G: bond0 (a quick way to check bond/LACP state is sketched below)
- Open vSwitch bridge: br-ex on bond0
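Since bond0 appears to be a kernel LACP bond attached to br-ex as a single system port (which is what the add-port call in the syslog above suggests), the bond state is visible on the Linux side rather than via OVS bonding commands; a quick check looks something like:
cat /proc/net/bonding/bond0    # 802.3ad state, per-slave MII status and LACP partner details
ip -d link show bond0          # bond mode and monitoring settings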
App                     Version  Status  Scale  Charm                   Channel           Rev  Exposed  Message
neutron-api             23.1.0   active      1  neutron-api             2024.1/candidate  603  no       Unit is ready
neutron-api-hacluster   2.1.2    active      0  hacluster               2.4/stable        131  no       Unit is ready and clustered
neutron-api-plugin-ovn  23.1.0   active      0  neutron-api-plugin-ovn  2024.1/candidate  137  no       Unit is ready
ovn-central             22.09.1  active      1  ovn-central             23.09/stable      234  no       Unit is ready (leader: ovnsb_db)
ovn-chassis             23.09.3  active      4  ovn-chassis             23.09/stable      296  no       Unit is ready
nova-compute            28.2.0   active      4  nova-compute            2024.1/candidate  771  no       Unit is ready
default-base: ubuntu@22.04/stable
series: jammy
Besides attempting to recreate the issue, is there anything else I should consider or check? Given that all hosts were up and the physical network was fine, shouldn't OVS/OVN have been able to recover from this on its own?
Thanks,
A