Network outage (OVS, BFD failures)

Network outage (OVS, BFD failures), potentially after enabling QoS in the neutron-api charm. I cannot confirm 100% that it started when I enabled QoS and set a 1 Gbps QoS policy on a public provider network; the policy itself applied OK, with instances rate-limited to 1 Gbps during iperf3 tests.

juju config neutron-api enable-qos=true
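
For reference, the policy was created and attached roughly like this (policy and network names here are illustrative, not my exact commands):

openstack network qos policy create bw-limit-1g
openstack network qos rule create --type bandwidth-limit --max-kbps 1000000 --max-burst-kbits 100000 bw-limit-1g
openstack network set --qos-policy bw-limit-1g public-net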

A while after enabling and testing QoS, I noticed that I lost my SSH session to the test instances a few times. Sometimes I could reconnect immediately; other times it took 30-60s before I could reconnect. I then had reports of down instances attached to a different public provider network, with no policy applied. I jumped onto the instance consoles and started pinging 8.8.8.8. This gave 0% packet loss for a few minutes, then 100% loss for a few minutes, over and over: periods of working and periods of not.
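
If anyone wants to reproduce that flapping pattern with timestamps, iputils ping can prefix each reply with one (illustrative, not what I originally ran):

# -D prefixes each line with a Unix timestamp, so up/down windows can be correlated with the OVS logs
ping -D 8.8.8.8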

First I disabled QoS again (juju config neutron-api enable-qos=false), with no improvement. Then finally, after noting OVN controller log errors on all compute nodes, I restarted Open vSwitch on all compute nodes. All errors in the logs were gone, all connectivity issues were gone, and everything remains OK days later.
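
The restart itself was just the stock service restart on each node; with Juju 3.x it can be done fleet-wide with something like the below (older Juju uses juju run instead of juju exec):

juju exec --application ovn-chassis "systemctl restart openvswitch-switch"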

Timeline:

  • ~13:26: BFD sessions failed ("Control Detection Time Expired") and BFD state went DOWN.
  • 13:53-14:12: OVN controller lost its connection to br-int.
  • 14:03: Multiple ovsdb-server connection drops.
  • 15:56: Recovery after openvswitch-switch restart; BFD state back UP.
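
For reference, the tunnel BFD state and ovn-controller's connection state can be checked on each chassis with e.g.:

# BFD status of the tunnel interfaces (the ovn-* geneve ports)
ovs-appctl bfd/show
# ovn-controller's view of its southbound DB connection
ovn-appctl -t ovn-controller connection-status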

The logs below occurred on all (converged) compute/network nodes, repeatedly, until the service was restarted (/etc/init.d/openvswitch-switch restart):

# ovs-vswitchd logs BFD failure and br-int connection errors
2025-01-30T13:26:38.786Z|bfd(monitor107)|INFO|ovn0-os-region0-0: BFD state change: up->down "Control Detection Time Expired"
2025-01-30T15:52:23.283Z|01662|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:31.287Z|01663|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:39.287Z|01664|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:47.291Z|01665|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:52:55.293Z|01666|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:03.295Z|01667|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:11.299Z|01668|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:19.300Z|01669|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
2025-01-30T15:53:27.304Z|01670|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)

# OVSDB logs connection drops
2025-01-30T14:03:41.101Z|jsonrpc|WARN|unix#415: receive error: Connection reset by peer

# OVN controller logs connection failures
2025-01-30T13:53:56.493Z|rconn|ERR|unix:/var/run/openvswitch/br-int.mgmt: no response to inactivity probe after 5 seconds, disconnecting
2025-01-30T14:10:50.828Z|rconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: connection failed (Protocol error)
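
As I understand it, the 5-second inactivity probe on the br-int OpenFlow connection is governed by ovn-controller's ovn-openflow-probe-interval setting (in seconds; 0 disables it). If the node was simply too busy to answer probes, it could in principle be raised per chassis, though I haven't confirmed whether the ovn-chassis charm manages or overwrites this key:

ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30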


# Syslog
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set-ssl /etc/ovn/key_host /etc/ovn/cert_host /etc/ovn/ovn-chassis.crt
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- set open-vswitch . external-ids:ovn-encap-type=geneve -- set open-vswitch . external-ids:ovn-encap-ip=X.X.X.134 -- set open-vswitch . external-ids:system-id=hostname-1.domain -- set open-vswitch . external-ids:ovn-remote=ssl:X.X.X.179:6642,ssl:X.X.X.175:6642,ssl:X.X.X.169:6642 -- set open-vswitch . external_ids:ovn-match-northd-version=False
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-int -- set bridge br-int external-ids:charm-ovn-chassis=managed -- set bridge br-int protocols=OpenFlow13,OpenFlow15 -- set bridge br-int datapath-type=system -- set bridge br-int fail-mode=secure -- set bridge br-int other-config:disable-in-band=true
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-ex -- set bridge br-ex external-ids:charm-ovn-chassis=managed -- set bridge br-ex protocols=OpenFlow13,OpenFlow15 -- set bridge br-ex datapath-type=system -- set bridge br-ex fail-mode=standalone
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-port br-ex bond0 -- set Interface bond0 type=system -- set Interface bond0 external-ids:charm-ovn-chassis=br-ex -- set Port bond0 external-ids:charm-ovn-chassis=br-ex
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl set open_vswitch . external_ids:ovn-bridge-mappings=physnet1:br-ex
Jan 30 14:03:41 hostname-1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl remove open_vswitch . external_ids ovn-cms-options
Physical network:
  • Switching: Nexus (no reported issues, physical changes, or port status changes)
  • LACP bonded 2x 10G: bond0
  • Open vSwitch bridge: br-ex on bond0
# juju status (trimmed)
neutron-api             23.1.0   active  1  neutron-api             2024.1/candidate  603  no  Unit is ready
neutron-api-hacluster   2.1.2    active  0  hacluster               2.4/stable        131  no  Unit is ready and clustered
neutron-api-plugin-ovn  23.1.0   active  0  neutron-api-plugin-ovn  2024.1/candidate  137  no  Unit is ready
ovn-central             22.09.1  active  1  ovn-central             23.09/stable      234  no  Unit is ready (leader: ovnsb_db)
ovn-chassis             23.09.3  active  4  ovn-chassis             23.09/stable      296  no  Unit is ready
nova-compute            28.2.0   active  4  nova-compute            2024.1/candidate  771  no  Unit is ready

default-base: ubuntu@22.04/stable
series: jammy

Besides attempting to recreate the issue, is there anything else I should consider or check? Given that all hosts were up and the physical network was fine, surely it should have been able to recover on its own?

Thanks,

A