mysql-innodb-cluster unable to reboot-cluster-from-complete-outage

After a power outage, our OpenStack cluster is unable to start again due to a problem with mysql-innodb-cluster. I have run

juju run-action --wait mysql-innodb-cluster/1 reboot-cluster-from-complete-outage

without success. I am attaching the output from juju status and from reboot-cluster-from-complete-outage.

I hope someone can help me get my cluster back online.

innodb.pdf (24.9 KB) juju-status.pdf (33.0 KB)

Mario, I’m sorry to hear about this. Can you pastebin* the actual MySQL logs? They should be found under /var/log. I would look at the mysql-innodb-cluster/0 unit since it is apparently the leader unit based on your juju status output:

juju ssh mysql-innodb-cluster/0

* pastebin, because the PDF documents are not very legible.

@pmatulis thank you very much again for your quick reply.
I am attaching the MySQL error log from mysql-innodb-cluster/0.

error.log.pdf (110.2 KB)

@pmatulis here it is again on pastebin: https://pastebin.com/zCuYDuWp

The logs provide some useful clues and troubleshooting direction. This stood out for me:

Error on opening a connection to peer node 10.2.101.149:33061

Can you confirm that network connectivity has been restored?

That address is mysql-innodb-cluster/1; it is accessible with juju ssh:

geoint@MAAS-01:~$ ping 10.2.101.149
PING 10.2.101.149 (10.2.101.149) 56(84) bytes of data.
64 bytes from 10.2.101.149: icmp_seq=1 ttl=64 time=0.657 ms

But can the three database units reach one another?

Yes, they can:

mysql-innodb-cluster/0

geoint@MAAS-01:~$ juju ssh mysql-innodb-cluster/0
Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-122-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Jul 27 16:29:10 UTC 2022

  System load:  1.14             Temperature:           58.0 C
  Usage of /:   2.9% of 1.72TB   Processes:             39
  Memory usage: 1%               Users logged in:       0
  Swap usage:   0%               IPv4 address for eth0: 10.2.101.153

 * Super-optimized for small spaces - read how we shrank the memory
   footprint of MicroK8s to make it the smallest full K8s around.

   https://ubuntu.com/blog/microk8s-memory-optimisation

54 updates can be applied immediately.
To see these additional updates run: apt list --upgradable


Last login: Wed Jul 27 18:46:27 2022 from 10.2.101.2
ubuntu@juju-5025f7-0-lxd-3:~$ ping 10.2.101.149
PING 10.2.101.149 (10.2.101.149) 56(84) bytes of data.
64 bytes from 10.2.101.149: icmp_seq=1 ttl=64 time=0.426 ms
64 bytes from 10.2.101.149: icmp_seq=2 ttl=64 time=0.499 ms

--- 10.2.101.149 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1017ms
rtt min/avg/max/mdev = 0.426/0.462/0.499/0.036 ms

ubuntu@juju-5025f7-0-lxd-3:~$ ping 10.2.101.181
PING 10.2.101.181 (10.2.101.181) 56(84) bytes of data.
64 bytes from 10.2.101.181: icmp_seq=1 ttl=64 time=0.693 ms
64 bytes from 10.2.101.181: icmp_seq=2 ttl=64 time=0.606 ms

mysql-innodb-cluster/1

geoint@MAAS-01:~$ juju ssh mysql-innodb-cluster/1
Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-122-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Jul 27 20:15:25 UTC 2022

  System load:  1.28             Temperature:           61.0 C
  Usage of /:   2.7% of 1.72TB   Processes:             40
  Memory usage: 0%               Users logged in:       0
  Swap usage:   0%               IPv4 address for eth0: 10.2.101.149



Last login: Wed Jul 27 20:11:40 2022 from 10.2.101.2
ubuntu@juju-5025f7-1-lxd-3:~$ ping 10.2.101.153
PING 10.2.101.153 (10.2.101.153) 56(84) bytes of data.
64 bytes from 10.2.101.153: icmp_seq=1 ttl=64 time=0.351 ms
64 bytes from 10.2.101.153: icmp_seq=2 ttl=64 time=0.364 ms

--- 10.2.101.153 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1032ms
rtt min/avg/max/mdev = 0.351/0.357/0.364/0.006 ms
ubuntu@juju-5025f7-1-lxd-3:~$ ping 10.2.101.181
PING 10.2.101.181 (10.2.101.181) 56(84) bytes of data.
64 bytes from 10.2.101.181: icmp_seq=1 ttl=64 time=0.813 ms

mysql-innodb-cluster/2

geoint@MAAS-01:~$ juju ssh mysql-innodb-cluster/2
Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-122-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Jul 27 20:16:24 UTC 2022

  System load:  0.89                Temperature:           39.0 C
  Usage of /:   25.5% of 217.95GB   Processes:             41
  Memory usage: 0%                  Users logged in:       0
  Swap usage:   0%                  IPv4 address for eth0: 10.2.101.181



Last login: Wed Jul 27 20:13:22 2022 from 10.2.101.2
ubuntu@juju-5025f7-2-lxd-3:~$ ping 10.2.101.153
PING 10.2.101.153 (10.2.101.153) 56(84) bytes of data.
64 bytes from 10.2.101.153: icmp_seq=1 ttl=64 time=0.569 ms
64 bytes from 10.2.101.153: icmp_seq=2 ttl=64 time=0.908 ms

--- 10.2.101.153 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1016ms
rtt min/avg/max/mdev = 0.569/0.738/0.908/0.169 ms
ubuntu@juju-5025f7-2-lxd-3:~$ ping 10.2.101.149
PING 10.2.101.149 (10.2.101.149) 56(84) bytes of data.
64 bytes from 10.2.101.149: icmp_seq=1 ttl=64 time=0.592 ms
64 bytes from 10.2.101.149: icmp_seq=2 ttl=64 time=0.591 ms

Port 33061 does not seem to be open. Could that be the problem?

ubuntu@juju-5025f7-1-lxd-3:~$  sudo ss -tulpn | grep LISTEN
tcp     LISTEN   0        1200        10.2.101.149:3306           0.0.0.0:*      users:(("mysqld",pid=6057,fd=24))                                              
tcp     LISTEN   0        4096       127.0.0.53%lo:53             0.0.0.0:*      users:(("systemd-resolve",pid=192,fd=13))                                      
tcp     LISTEN   0        128              0.0.0.0:22             0.0.0.0:*      users:(("sshd",pid=317,fd=3))                                                  
tcp     LISTEN   0        128                 [::]:22                [::]:*      users:(("sshd",pid=317,fd=4))                                                  
tcp     LISTEN   0        70                     *:33060                *:*      users:(("mysqld",pid=6057,fd=22))
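To double-check this from one of the other units, a quick way to probe the group-replication port without installing netcat is bash's built-in `/dev/tcp` redirection. `check_port` here is a hypothetical helper (not part of the charm); the address is mysql-innodb-cluster/1's, taken from the thread:

```shell
# check_port: hypothetical helper that reports whether a TCP port on a
# host accepts connections, using bash's built-in /dev/tcp redirection
# (no netcat required). Times out after 2 seconds.
check_port() {
    host="$1"; port="$2"
    if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Probe the group replication port on mysql-innodb-cluster/1
# (address from the thread). Prints "open" or "closed".
check_port 10.2.101.149 33061
```

If this prints "closed" while 3306 is reachable, that would be consistent with the `ss` output above, where mysqld is listening on 3306 and 33060 but not 33061.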

I need some help here; I am really struggling to solve this problem :frowning:

Mario, we may want to consider this a pure MySQL cluster problem (i.e. not a charm issue). That’s certainly what the log messages seem to indicate.

I was able to bring one node up by following these instructions on the leader node: https://forums.percona.com/t/mysql-group-replication-error-on-opening-a-connection-to-192-168-2-1-33061-on-local-port-33061/8087

SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

But I am unable to bring the other two nodes up. The current status is:

Sorry you’re having trouble; it’s not always easy to bring a MySQL cluster up after a full outage. The charm generally does not attempt to start up in this scenario, as it’s not always safe to do so and someone needs to be able to make some decisions.

The link that you’ve pasted from the forum is for Percona cluster, which is a bit different from the mysql-innodb-cluster charm that’s been deployed here. The official MySQL docs have a bit more detail on how to recover a cluster. It’s similar to what you are doing, but you may want to log into one of the units and use the mysql-shell to do the recovery manually.

You can get the password for connecting to your mysql instances by running:

juju run --unit mysql-innodb-cluster/0 -- leader-get cluster-password

You can then access the mysql-shell by logging into one of the units and running the mysql-shell.mysqlsh command. I’ve included an example using the IP address for your mysql-innodb-cluster/0 unit (though it may change if you are using space bindings, etc):

> juju ssh mysql-innodb-cluster/0
...
ubuntu@juju-0ec1c5-foo-11:~$ sudo mysql-shell.mysqlsh 
mysql-py> shell.connect('clusteruser:<cluster-password-above>@10.2.101.153')
mysql-py> cluster = dba.get_cluster('jujuCluster')
mysql-py> cluster.status()
...

Hope this helps a bit.
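For completeness, MySQL Shell's AdminAPI also provides a built-in recovery helper, `dba.reboot_cluster_from_complete_outage()`, which does the equivalent of the charm action from within the shell. A sketch of such a session, assuming the same cluster name and address as above (output elided):

```
> juju ssh mysql-innodb-cluster/0
...
ubuntu@juju-0ec1c5-foo-11:~$ sudo mysql-shell.mysqlsh
mysql-py> shell.connect('clusteruser:<cluster-password-above>@10.2.101.153')
mysql-py> cluster = dba.reboot_cluster_from_complete_outage('jujuCluster')
mysql-py> cluster.status()
```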

Thanks @billy-olsen and @pmatulis for your replies. I was able to get the leader back online with

SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

and the other two units with

STOP GROUP_REPLICATION;
RESET SLAVE;
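For future readers, a sketch of how those statements would be run on each of the two secondary units (the clusteruser credentials come from the leader-get command earlier in the thread; RESET SLAVE discards the stale relay-log metadata so the node can rejoin the group cleanly):

```
ubuntu@juju-5025f7-1-lxd-3:~$ sudo mysql -u clusteruser -p
mysql> STOP GROUP_REPLICATION;
mysql> RESET SLAVE;
```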

Great to hear Mario. Well done!

An update on this issue: we had a power outage again, and the backup generator failed, so the machines shut down when the UPS ran out of power. I was unable to get the cluster back with reboot-cluster-from-complete-outage.

What worked for me was to run on the leader unit:

SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

and then pause and resume the remaining units.
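Assuming "pause and resume" refers to the charm's pause/resume actions, the commands would presumably look like this (unit numbers assumed to be the two non-leader units):

```
juju run-action --wait mysql-innodb-cluster/1 pause
juju run-action --wait mysql-innodb-cluster/1 resume
juju run-action --wait mysql-innodb-cluster/2 pause
juju run-action --wait mysql-innodb-cluster/2 resume
```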