Facing issues with multiple k8s pods that stop working every now and then

Hi,

First post in this forum.

I have a physical machine with a 4-core CPU, 32 GB RAM, a 250 GB SSD for the system/snaps, and a 500 GB disk for MicroCeph, for testing.

I am running 2023.2/edge as I need Magnum.

At times, all the routers stop working (I can see this with juju status), causing the system to become unavailable for 10-20 minutes before it sorts itself out again. I see errors related to DB connectivity in the mysql-0 pod, the magnum pod and others. Name resolution was the first thing I suspected could cause issues like this, but I cannot find any errors in the CoreDNS logs.

I cannot nail it down to anything specific apart from the errors I see in various logs. Any help on where to start is appreciated.
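For reference, this is roughly how I have been trying to rule out DNS so far (just a sketch; I am assuming the MicroK8s CoreDNS pods carry the usual k8s-app=kube-dns label, and busybox:1.36 is only a throwaway image for the lookup):

manager@node1:~$ sudo microk8s.kubectl run dnscheck -n openstack --rm -it --restart=Never --image=busybox:1.36 -- nslookup magnum-mysql-router.openstack.svc.cluster.local
manager@node1:~$ sudo microk8s.kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100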

Excerpts from various log files:

Magnum log

2023-10-24T19:09:05.724256759Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 WARNING magnum.service.periodic [None req-1f5a5635-4483-4e36-8253-53d58f3e0085 - - - - - -] Ignore error [(pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'magnum-mysql-router.openstack.svc.cluster.local' ([Errno 111] ECONNREFUSED)")
2023-10-24T19:09:05.724288613Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] (Background on this error at: https://sqlalche.me/e/14/e3q8)] when syncing up cluster status.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'magnum-mysql-router.openstack.svc.cluster.local' ([Errno 111] ECONNREFUSED)")
2023-10-24T19:09:05.724291669Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-10-24T19:09:05.724294183Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic Traceback (most recent call last):
2023-10-24T19:09:05.724296948Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 613, in connect
2023-10-24T19:09:05.724298757Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic     sock = socket.create_connection(
2023-10-24T19:09:05.724300829Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic   File "/usr/lib/python3/dist-packages/eventlet/green/socket.py", line 63, in create_connection
2023-10-24T19:09:05.724303013Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic     raise err
2023-10-24T19:09:05.724304895Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic   File "/usr/lib/python3/dist-packages/eventlet/green/socket.py", line 53, in create_connection
2023-10-24T19:09:05.724306711Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic     sock.connect(sa)
2023-10-24T19:09:05.724312434Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic   File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 270, in connect
2023-10-24T19:09:05.724314459Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic     socket_checkerr(fd)
2023-10-24T19:09:05.724316248Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic   File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 54, in socket_checkerr
2023-10-24T19:09:05.724318297Z stdout F 2023-10-24T19:09:05.724Z [magnum-conductor] 2023-10-24 19:09:05.723 14 ERROR magnum.service.periodic     raise socket.error(err, errno.errorcode[err])
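
The ECONNREFUSED above made me check whether the Services behind the router still had endpoints at that moment (a quick sketch using the names from the error; not sure this is the right place to look):

manager@node1:~$ sudo microk8s.kubectl get endpoints -n openstack magnum-mysql-router mysql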

Mysql-0 log

2023-10-24T19:14:21.410730139Z stdout F 2023-10-24T19:14:21.410Z [container-agent] 2023-10-24 19:14:21 INFO juju-log Unit workload member-state is online with member-role primary
2023-10-24T19:14:23.355791345Z stdout F 2023-10-24T19:14:23.355Z [container-agent] 2023-10-24 19:14:23 WARNING juju-log No relation: certificates
2023-10-24T19:14:23.668300156Z stdout F 2023-10-24T19:14:23.668Z [container-agent] 2023-10-24 19:14:23 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2023-10-24T19:18:53.859476036Z stdout F 2023-10-24T19:18:53.859Z [container-agent] 2023-10-24 19:18:53 INFO juju-log Unit workload member-state is online with member-role primary
2023-10-24T19:18:55.832370583Z stdout F 2023-10-24T19:18:55.832Z [container-agent] 2023-10-24 19:18:55 WARNING juju-log No relation: certificates
2023-10-24T19:18:56.136698527Z stdout F 2023-10-24T19:18:56.136Z [container-agent] 2023-10-24 19:18:56 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2023-10-24T19:24:37.240198768Z stdout F 2023-10-24T19:24:37.240Z [container-agent] 2023-10-24 19:24:37 INFO juju-log Unit workload member-state is online with member-role primary
2023-10-24T19:24:39.210484049Z stdout F 2023-10-24T19:24:39.210Z [container-agent] 2023-10-24 19:24:39 WARNING juju-log No relation: certificates
2023-10-24T19:24:39.520496286Z stdout F 2023-10-24T19:24:39.520Z [container-agent] 2023-10-24 19:24:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2023-10-24T19:29:07.969921048Z stdout F 2023-10-24T19:29:07.969Z [container-agent] 2023-10-24 19:29:07 INFO juju-log Unit workload member-state is online with member-role primary
2023-10-24T19:29:11.124980548Z stdout F 2023-10-24T19:29:11.124Z [container-agent] 2023-10-24 19:29:11 WARNING juju-log Failed to get node count
2023-10-24T19:29:11.124992934Z stdout F 2023-10-24T19:29:11.124Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.124995481Z stdout F 2023-10-24T19:29:11.124Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 672, in _run_mysqlsh_script
2023-10-24T19:29:11.124997897Z stdout F 2023-10-24T19:29:11.124Z [container-agent]     stdout, _ = process.wait_output()
2023-10-24T19:29:11.125000003Z stdout F 2023-10-24T19:29:11.124Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1359, in wait_output
2023-10-24T19:29:11.125002533Z stdout F 2023-10-24T19:29:11.124Z [container-agent]     raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
2023-10-24T19:29:11.125004978Z stdout F 2023-10-24T19:29:11.124Z [container-agent] ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2023-10-24T19:29:10Z: Loading startup files...\nverbose: 2023-10-24T19:29:10Z: Loading plugins...\nverbose: 2023-10-24T19:29:10Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2003): Shell.connect: Can\'t connect to MySQL server on \'mysql-0.mysql-endpoints:3306\' (111)\n'
2023-10-24T19:29:11.125007216Z stdout F 2023-10-24T19:29:11.124Z [container-agent] 
2023-10-24T19:29:11.12500909Z stdout F 2023-10-24T19:29:11.124Z [container-agent] During handling of the above exception, another exception occurred:
2023-10-24T19:29:11.125027965Z stdout F 2023-10-24T19:29:11.124Z [container-agent] 
2023-10-24T19:29:11.125030389Z stdout F 2023-10-24T19:29:11.124Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.125043427Z stdout F 2023-10-24T19:29:11.124Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1383, in get_cluster_node_count
2023-10-24T19:29:11.12505083Z stdout F 2023-10-24T19:29:11.124Z [container-agent]     output = self._run_mysqlsh_script("\n".join(size_commands))
2023-10-24T19:29:11.125052836Z stdout F 2023-10-24T19:29:11.124Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 675, in _run_mysqlsh_script
2023-10-24T19:29:11.125054668Z stdout F 2023-10-24T19:29:11.124Z [container-agent]     raise MySQLClientError(e.stderr)
2023-10-24T19:29:11.125056493Z stdout F 2023-10-24T19:29:11.124Z [container-agent] charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
2023-10-24T19:29:11.125068677Z stdout F 2023-10-24T19:29:11.124Z [container-agent] verbose: 2023-10-24T19:29:10Z: Loading startup files...
2023-10-24T19:29:11.125070944Z stdout F 2023-10-24T19:29:11.124Z [container-agent] verbose: 2023-10-24T19:29:10Z: Loading plugins...
2023-10-24T19:29:11.125072771Z stdout F 2023-10-24T19:29:11.124Z [container-agent] verbose: 2023-10-24T19:29:10Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints
2023-10-24T19:29:11.125074639Z stdout F 2023-10-24T19:29:11.124Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.125076446Z stdout F 2023-10-24T19:29:11.124Z [container-agent]   File "<string>", line 1, in <module>
2023-10-24T19:29:11.125078559Z stdout F 2023-10-24T19:29:11.124Z [container-agent] mysqlsh.DBError: MySQL Error (2003): Shell.connect: Can't connect to MySQL server on 'mysql-0.mysql-endpoints:3306' (111)
2023-10-24T19:29:11.125082929Z stdout F 2023-10-24T19:29:11.124Z [container-agent] 
2023-10-24T19:29:11.760481789Z stdout F 2023-10-24T19:29:11.760Z [container-agent] 2023-10-24 19:29:11 WARNING juju-log Failed to get cluster primary addresses
2023-10-24T19:29:11.760492754Z stdout F 2023-10-24T19:29:11.760Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.760495295Z stdout F 2023-10-24T19:29:11.760Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 672, in _run_mysqlsh_script
2023-10-24T19:29:11.760497675Z stdout F 2023-10-24T19:29:11.760Z [container-agent]     stdout, _ = process.wait_output()
2023-10-24T19:29:11.760500009Z stdout F 2023-10-24T19:29:11.760Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1359, in wait_output
2023-10-24T19:29:11.760501995Z stdout F 2023-10-24T19:29:11.760Z [container-agent]     raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
2023-10-24T19:29:11.760504579Z stdout F 2023-10-24T19:29:11.760Z [container-agent] ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2023-10-24T19:29:11Z: Loading startup files...\nverbose: 2023-10-24T19:29:11Z: Loading plugins...\nverbose: 2023-10-24T19:29:11Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2003): Shell.connect_to_primary: Can\'t connect to MySQL server on \'mysql-0.mysql-endpoints:3306\' (111)\n'
2023-10-24T19:29:11.760506842Z stdout F 2023-10-24T19:29:11.760Z [container-agent] 
2023-10-24T19:29:11.76050894Z stdout F 2023-10-24T19:29:11.760Z [container-agent] During handling of the above exception, another exception occurred:
2023-10-24T19:29:11.760510813Z stdout F 2023-10-24T19:29:11.760Z [container-agent] 
2023-10-24T19:29:11.760512644Z stdout F 2023-10-24T19:29:11.760Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.760514924Z stdout F 2023-10-24T19:29:11.760Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1640, in get_cluster_primary_address
2023-10-24T19:29:11.760525688Z stdout F 2023-10-24T19:29:11.760Z [container-agent]     output = self._run_mysqlsh_script("\n".join(get_cluster_primary_commands))
2023-10-24T19:29:11.760527755Z stdout F 2023-10-24T19:29:11.760Z [container-agent]   File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 675, in _run_mysqlsh_script
2023-10-24T19:29:11.760529993Z stdout F 2023-10-24T19:29:11.760Z [container-agent]     raise MySQLClientError(e.stderr)
2023-10-24T19:29:11.760531871Z stdout F 2023-10-24T19:29:11.760Z [container-agent] charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
2023-10-24T19:29:11.760533726Z stdout F 2023-10-24T19:29:11.760Z [container-agent] verbose: 2023-10-24T19:29:11Z: Loading startup files...
2023-10-24T19:29:11.760535787Z stdout F 2023-10-24T19:29:11.760Z [container-agent] verbose: 2023-10-24T19:29:11Z: Loading plugins...
2023-10-24T19:29:11.760537709Z stdout F 2023-10-24T19:29:11.760Z [container-agent] verbose: 2023-10-24T19:29:11Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints
2023-10-24T19:29:11.760539595Z stdout F 2023-10-24T19:29:11.760Z [container-agent] Traceback (most recent call last):
2023-10-24T19:29:11.760541453Z stdout F 2023-10-24T19:29:11.760Z [container-agent]   File "<string>", line 1, in <module>
2023-10-24T19:29:11.760543282Z stdout F 2023-10-24T19:29:11.760Z [container-agent] mysqlsh.DBError: MySQL Error (2003): Shell.connect_to_primary: Can't connect to MySQL server on 'mysql-0.mysql-endpoints:3306' (111)
2023-10-24T19:29:11.760545179Z stdout F 2023-10-24T19:29:11.760Z [container-agent] 
2023-10-24T19:29:12.450387302Z stdout F 2023-10-24T19:29:12.450Z [container-agent] 2023-10-24 19:29:12 ERROR juju-log Failed to get cluster status for cluster-ee5714d7b407da0f362949385c684ea3
2023-10-24T19:29:12.478681506Z stdout F 2023-10-24T19:29:12.478Z [container-agent] 2023-10-24 19:29:12 INFO juju-log Kubernetes pod label error created
2023-10-24T19:29:12.488770606Z stdout F 2023-10-24T19:29:12.488Z [container-agent] 2023-10-24 19:29:12 WARNING juju-log No relation: certificates
2023-10-24T19:29:12.789121732Z stdout F 2023-10-24T19:29:12.789Z [container-agent] 2023-10-24 19:29:12 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
manager@node1:~$ juju status -m openstack
Model      Controller          Cloud/Region                Version  SLA          Timestamp
openstack  sunbeam-controller  sunbeam-microk8s/localhost  3.1.6    unsupported  20:06:51Z

SAAS       Status  Store  URL
microceph  active  local  admin/controller.microceph

App                       Version                  Status  Scale  Charm                     Channel        Rev  Address          Exposed  Message
barbican                                           active      1  barbican-k8s              2023.2/edge     10  10.152.183.42    no       
barbican-mysql-router     8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.174   no       
certificate-authority                              active      1  self-signed-certificates  latest/beta     33  10.152.183.45    no       
cinder                                             active      1  cinder-k8s                2023.2/edge     51  10.152.183.109   no       
cinder-ceph                                        active      1  cinder-ceph-k8s           2023.2/edge     44  10.152.183.145   no       
cinder-ceph-mysql-router  8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.125   no       
cinder-mysql-router       8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.251   no       
glance                                             active      1  glance-k8s                2023.2/edge     65  10.152.183.184   no       
glance-mysql-router       8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.249   no       
heat                                               active      1  heat-k8s                  2023.2/edge     26  10.152.183.118   no       
heat-cfn                                           active      1  heat-k8s                  2023.2/edge     26  10.152.183.203   no       
heat-cfn-mysql-router     8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.242   no       
heat-mysql-router         8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.216   no       
horizon                                            active      1  horizon-k8s               2023.2/edge     60  10.152.183.202   no       http://192.168.100.214/openstack-horizon
horizon-mysql-router      8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.212   no       
keystone                                           active      1  keystone-k8s              2023.2/edge    142  10.152.183.65    no       
keystone-mysql-router     8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.126   no       
magnum                                             active      1  magnum-k8s                2023.2/edge      6  10.152.183.141   no       
magnum-mysql-router       8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.64    no       
mysql                     8.0.34-0ubuntu0.22.04.1  active      1  mysql-k8s                 8.0/candidate   99  10.152.183.132   no       
neutron                                            active      1  neutron-k8s               2023.2/edge     57  10.152.183.96    no       
neutron-mysql-router      8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.225   no       
nova                                               active      1  nova-k8s                  2023.2/edge     54  10.152.183.159   no       
nova-api-mysql-router     8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.236   no       
nova-cell-mysql-router    8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.131   no       
nova-mysql-router         8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.222   no       
octavia                                            active      1  octavia-k8s               2023.2/edge     10  10.152.183.252   no       
octavia-mysql-router      8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.25    no       
ovn-central                                        active      1  ovn-central-k8s           23.09/edge      62  10.152.183.21    no       
ovn-relay                                          active      1  ovn-relay-k8s             23.09/edge      50  192.168.100.211  no       
placement                                          active      1  placement-k8s             2023.2/edge     46  10.152.183.237   no       
placement-mysql-router    8.0.34-0ubuntu0.22.04.1  active      1  mysql-router-k8s          8.0/candidate   69  10.152.183.173   no       
rabbitmq                  3.12.1                   active      1  rabbitmq-k8s              3.12/edge       33  192.168.100.212  no       
traefik                   2.10.4                   active      1  traefik-k8s               1.0/candidate  148  192.168.100.213  no       
traefik-public            2.10.4                   active      1  traefik-k8s               1.0/candidate  148  192.168.100.214  no       
vault                                              active      1  vault-k8s                 latest/edge     32  10.152.183.147   no       

Unit                         Workload  Agent  Address       Ports  Message
barbican-mysql-router/0*     active    idle   10.1.166.185         
barbican/0*                  active    idle   10.1.166.186         
certificate-authority/0*     active    idle   10.1.166.135         
cinder-ceph-mysql-router/0*  active    idle   10.1.166.138         
cinder-ceph/0*               active    idle   10.1.166.146         
cinder-mysql-router/0*       active    idle   10.1.166.159         
cinder/0*                    active    idle   10.1.166.162         
glance-mysql-router/0*       active    idle   10.1.166.156         
glance/0*                    active    idle   10.1.166.171         
heat-cfn-mysql-router/0*     active    idle   10.1.166.178         
heat-cfn/0*                  active    idle   10.1.166.180         
heat-mysql-router/0*         active    idle   10.1.166.177         
heat/0*                      active    idle   10.1.166.179         
horizon-mysql-router/0*      active    idle   10.1.166.155         
horizon/0*                   active    idle   10.1.166.161         
keystone-mysql-router/0*     active    idle   10.1.166.158         
keystone/0*                  active    idle   10.1.166.172         
magnum-mysql-router/0*       active    idle   10.1.166.187         
magnum/0*                    active    idle   10.1.166.188         
mysql/0*                     active    idle   10.1.166.173         Primary
neutron-mysql-router/0*      active    idle   10.1.166.151         
neutron/0*                   active    idle   10.1.166.152         
nova-api-mysql-router/0*     active    idle   10.1.166.136         
nova-cell-mysql-router/0*    active    idle   10.1.166.149         
nova-mysql-router/0*         active    idle   10.1.166.150         
nova/0*                      active    idle   10.1.166.157         
octavia-mysql-router/0*      active    idle   10.1.166.174         
octavia/0*                   active    idle   10.1.166.176         
ovn-central/0*               active    idle   10.1.166.148         
ovn-relay/0*                 active    idle   10.1.166.139         
placement-mysql-router/0*    active    idle   10.1.166.142         
placement/0*                 active    idle   10.1.166.153         
rabbitmq/0*                  active    idle   10.1.166.143         
traefik-public/0*            active    idle   10.1.166.145         
traefik/0*                   active    idle   10.1.166.144         
vault/0*                     active    idle   10.1.166.184         

Offer                  Application            Charm                     Rev  Connected  Endpoint              Interface             Role
certificate-authority  certificate-authority  self-signed-certificates  33   1/1        certificates          tls-certificates      provider
cinder-ceph            cinder-ceph            cinder-ceph-k8s           44   1/1        ceph-access           cinder-ceph-key       provider
keystone               keystone               keystone-k8s              142  1/1        identity-credentials  keystone-credentials  provider
ovn-relay              ovn-relay              ovn-relay-k8s             50   1/1        ovsdb-cms-relay       ovsdb-cms             provider
rabbitmq               rabbitmq               rabbitmq-k8s              33   1/1        amqp                  rabbitmq              provider
manager@node1:~$ sudo microk8s.kubectl get pods -n openstack
NAME                             READY   STATUS    RESTARTS         AGE
modeloperator-57b6686f86-hszpw   1/1     Running   0                3d9h
certificate-authority-0          1/1     Running   0                3d9h
rabbitmq-0                       2/2     Running   1 (3d9h ago)     3d9h
ovn-relay-0                      2/2     Running   1 (3d9h ago)     3d9h
cinder-ceph-0                    2/2     Running   0                3d9h
placement-0                      2/2     Running   0                3d9h
horizon-0                        2/2     Running   0                3d9h
ovn-central-0                    4/4     Running   1 (3d8h ago)     3d9h
cinder-0                         3/3     Running   0                3d9h
neutron-0                        2/2     Running   0                3d9h
glance-0                         2/2     Running   0                3d9h
keystone-0                       2/2     Running   0                3d9h
nova-0                           4/4     Running   0                3d9h
octavia-0                        4/4     Running   0                3d6h
heat-cfn-0                       3/3     Running   0                3d6h
heat-0                           3/3     Running   0                3d6h
vault-0                          2/2     Running   0                3d6h
barbican-0                       3/3     Running   0                3d6h
magnum-0                         3/3     Running   0                3d6h
traefik-0                        2/2     Running   4 (22h ago)      3d9h
heat-cfn-mysql-router-0          2/2     Running   10 (8h ago)      3d6h
keystone-mysql-router-0          2/2     Running   5 (7h49m ago)    3d9h
placement-mysql-router-0         2/2     Running   7 (7h41m ago)    3d9h
nova-api-mysql-router-0          2/2     Running   9 (6h41m ago)    3d9h
nova-mysql-router-0              2/2     Running   7 (6h33m ago)    3d9h
traefik-public-0                 2/2     Running   6 (6h24m ago)    3d9h
heat-mysql-router-0              2/2     Running   8 (6h8m ago)     3d6h
cinder-mysql-router-0            2/2     Running   14 (5h29m ago)   3d9h
glance-mysql-router-0            2/2     Running   8 (5h5m ago)     3d9h
barbican-mysql-router-0          2/2     Running   10 (4h3m ago)    3d6h
octavia-mysql-router-0           2/2     Running   8 (3h6m ago)     3d6h
cinder-ceph-mysql-router-0       2/2     Running   8 (79m ago)      3d9h
magnum-mysql-router-0            2/2     Running   9 (76m ago)      3d6h
nova-cell-mysql-router-0         2/2     Running   11 (58m ago)     3d9h
neutron-mysql-router-0           2/2     Running   9 (58m ago)      3d9h
horizon-mysql-router-0           2/2     Running   8 (51m ago)      3d9h
mysql-0                          2/2     Running   17 (46m ago)     3d9h

I am using the comments below to document what I find.

I managed to capture what is happening when everything becomes unstable. It seems the k8s health checks fail and k8s restarts the pods. Worth pointing out: I had stressed the node by creating a "coe cluster" when this occurred.

manager@node1:~$ sudo microk8s.kubectl get events --all-namespaces
NAMESPACE   LAST SEEN   TYPE      REASON         OBJECT                         MESSAGE
openstack   3m59s       Warning   Unhealthy      pod/nova-api-mysql-router-0    Readiness probe failed: Get "http://10.1.166.136:38813/v1/health?level=ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   3m59s       Warning   Unhealthy      pod/nova-api-mysql-router-0    Liveness probe failed: Get "http://10.1.166.136:38813/v1/health?level=alive": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   3m59s       Normal    Killing        pod/nova-api-mysql-router-0    Container mysql-router failed liveness probe, will be restarted
openstack   3m58s       Warning   Unhealthy      pod/nova-cell-mysql-router-0   Readiness probe failed: Get "http://10.1.166.149:38813/v1/health?level=ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   3m58s       Warning   Unhealthy      pod/nova-cell-mysql-router-0   Liveness probe failed: Get "http://10.1.166.149:38813/v1/health?level=alive": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   3m58s       Normal    Killing        pod/nova-cell-mysql-router-0   Container mysql-router failed liveness probe, will be restarted
openstack   3m57s       Normal    Pulled         pod/nova-api-mysql-router-0    Container image "registry.jujucharms.com/charm/g78qli3013qicvevb3oj4z8u0zhjod1agws2d/mysql-router-image@sha256:3d665bce5076c13f430d5ab2e0864b3677698a33b4f635fc829ecbe14089ae45" already present on machine
openstack   3m57s       Normal    Created        pod/nova-api-mysql-router-0    Created container mysql-router
openstack   3m57s       Normal    Started        pod/nova-api-mysql-router-0    Started container mysql-router
openstack   3m55s       Normal    Pulled         pod/nova-cell-mysql-router-0   Container image "registry.jujucharms.com/charm/g78qli3013qicvevb3oj4z8u0zhjod1agws2d/mysql-router-image@sha256:3d665bce5076c13f430d5ab2e0864b3677698a33b4f635fc829ecbe14089ae45" already present on machine
openstack   3m54s       Warning   Unhealthy      pod/nova-cell-mysql-router-0   Readiness probe failed: Get "http://10.1.166.149:38813/v1/health?level=ready": dial tcp 10.1.166.149:38813: connect: connection refused
openstack   3m54s       Normal    Created        pod/nova-cell-mysql-router-0   Created container mysql-router
openstack   3m54s       Normal    Started        pod/nova-cell-mysql-router-0   Started container mysql-router
openstack   12s         Warning   Unhealthy      pod/traefik-public-0           Readiness probe failed: Get "http://10.1.166.145:38813/v1/health?level=ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   12s         Warning   Unhealthy      pod/traefik-public-0           Liveness probe failed: Get "http://10.1.166.145:38813/v1/health?level=alive": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
openstack   12s         Normal    Killing        pod/traefik-public-0           Container traefik failed liveness probe, will be restarted
openstack   9s          Normal    nodeAssigned   service/traefik-public         announcing from node "node1" with protocol "layer2"
openstack   8s          Normal    Pulled         pod/traefik-public-0           Container image "registry.jujucharms.com/charm/3cs9bo6mym14ler2zuvzsnqcd8pvucf6cxvlw/traefik-image@sha256:f808c7e1feaa41613e1bb4ea06556a73f686f18d7cb484f5da371c9cb403eb47" already present on machine
openstack   8s          Normal    Created        pod/traefik-public-0           Created container traefik
openstack   8s          Normal    Started        pod/traefik-public-0           Started container traefik
manager@node1:~$ 

The mysql pod is restarted by k8s because of memory pressure (exit code 137). I don't have any monitoring on this, but so far it seems to happen once or twice per day.
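To double-check that it really is the kernel/kubelet killing the container rather than a crash, I have been looking at the container's last termination state and the kernel log before pulling the full describe output below (a sketch; the container name filter is just what applies on my setup):

manager@node1:~$ sudo microk8s.kubectl get pod mysql-0 -n openstack -o jsonpath='{.status.containerStatuses[?(@.name=="mysql")].lastState.terminated}'
manager@node1:~$ sudo journalctl -k | grep -iE 'out of memory|oom'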

sudo microk8s.kubectl describe pod mysql-0 -n openstack

Name:             mysql-0
Namespace:        openstack
Priority:         0
Service Account:  mysql
Node:             node1/192.168.100.117
Start Time:       Sat, 21 Oct 2023 11:21:13 +0000
Labels:           app.juju.is/created-by=mysql
                  app.kubernetes.io/name=mysql
                  apps.kubernetes.io/pod-index=0
                  cluster-name=cluster-ee5714d7b407da0f362949385c684ea3
                  controller-revision-hash=mysql-57b899c857
                  role=primary
                  statefulset.kubernetes.io/pod-name=mysql-0
Annotations:      cni.projectcalico.org/containerID: 07cccf9192b50d65581207addb167a2cc60fcda71c20125de7f851084994bfcc
                  cni.projectcalico.org/podIP: 10.1.166.173/32
                  cni.projectcalico.org/podIPs: 10.1.166.173/32
                  controller.juju.is/id: 140221ab-adf4-413d-8bc9-773c132014be
                  juju.is/version: 3.1.6
                  model.juju.is/id: e5108146-d8e7-41dc-8ff9-ccc7e8074b70
                  unit.juju.is/id: mysql/0
Status:           Running
IP:               10.1.166.173
IPs:
  IP:  10.1.166.173
Controlled By:  StatefulSet/mysql
Init Containers:
  charm-init:
    Container ID:  containerd://9756758dae663c8741e86ee6e1b41d1667f8132398be02e443c3a53fc0c2a897
    Image:         jujusolutions/jujud-operator:3.1.6
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:2a7dd57026c959124eaf13a4fc59e90a5b59f1fa57a019bf3e2871971023bd1b
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/containeragent
    Args:
      init
      --containeragent-pebble-dir
      /containeragent/pebble
      --charm-modified-version
      0
      --data-dir
      /var/lib/juju
      --bin-dir
      /charm/bin
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 21 Oct 2023 11:21:16 +0000
      Finished:     Sat, 21 Oct 2023 11:21:17 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      mysql-application-config  Secret  Optional: false
    Environment:
      JUJU_CONTAINER_NAMES:  mysql
      JUJU_K8S_POD_NAME:     mysql-0 (v1:metadata.name)
      JUJU_K8S_POD_UUID:      (v1:metadata.uid)
    Mounts:
      /charm/bin from charm-data (rw,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /containeragent/pebble from charm-data (rw,path="containeragent/pebble")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lffmr (ro)
Containers:
  charm:
    Container ID:  containerd://7b275572220abab4a20af51ed09abdb4c2c764c8d66c81e105ee30d34427da77
    Image:         jujusolutions/charm-base:ubuntu-22.04
    Image ID:      docker.io/jujusolutions/charm-base@sha256:8b3f6bbfdd2f03575f53b493594d3ee2cc488edc506e070848594a030d7c76f5
    Port:          <none>
    Host Port:     <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --http
      :38812
      --verbose
    State:          Running
      Started:      Sat, 21 Oct 2023 11:21:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      memory:  2Gi
    Liveness:   http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness:  http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Startup:    http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAMES:  mysql
      HTTP_PROBE_PORT:       3856
    Mounts:
      /charm/bin from charm-data (ro,path="charm/bin")
      /charm/containers from charm-data (rw,path="charm/containers")
      /var/lib/juju from charm-data (rw,path="var/lib/juju")
      /var/lib/juju/storage/database/0 from mysql-database-e7e74ab4 (rw)
      /var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lffmr (ro)
  mysql:
    Container ID:  containerd://fd497da33c8059ee1c652ee92ddf2027a929ac3bb9a699bd364521ded435463a
    Image:         registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:3d665bce5076c13f430d5ab2e0864b3677698a33b4f635fc829ecbe14089ae45
    Image ID:      registry.jujucharms.com/charm/62ptdfbrjpw3n9tcnswjpart30jauc6wc5wbi/mysql-image@sha256:3d665bce5076c13f430d5ab2e0864b3677698a33b4f635fc829ecbe14089ae45
    Port:          <none>
    Host Port:     <none>
    Command:
      /charm/bin/pebble
    Args:
      run
      --create-dirs
      --hold
      --http
      :38813
      --verbose
    State:          Running
      Started:      Thu, 26 Oct 2023 08:37:54 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 26 Oct 2023 04:06:51 +0000
      Finished:     Thu, 26 Oct 2023 08:37:54 +0000
    Ready:          True
    Restart Count:  27
    Limits:
      memory:  2Gi
    Requests:
      memory:  2Gi
    Liveness:   http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Readiness:  http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
    Environment:
      JUJU_CONTAINER_NAME:  mysql
      PEBBLE_SOCKET:        /charm/container/pebble.socket
    Mounts:
      /charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
      /charm/container from charm-data (rw,path="charm/containers/mysql")
      /var/lib/mysql from mysql-database-e7e74ab4 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lffmr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  mysql-database-e7e74ab4:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mysql-database-e7e74ab4-mysql-0
    ReadOnly:   false
  charm-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-lffmr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
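
Since I don't have proper monitoring, I have started logging pod memory with a crude loop (a rough sketch; it assumes the MicroK8s metrics-server addon is enabled, and mysql-mem.log is just an arbitrary file name):

manager@node1:~$ sudo microk8s enable metrics-server
manager@node1:~$ while true; do date; sudo microk8s.kubectl top pod mysql-0 -n openstack; sleep 300; done >> mysql-mem.log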