MySQL - failed to recover cluster

Hey there,

After rebooting my single-node OpenStack, I cannot get the cluster back up. Many containers never come back up, apparently because the mysql instance fails to start.

It seems to be trying to recover, but mysqld_safe exits unexpectedly. When I start a plain docker.io/library/mysql container with that very hostPath volume, it works totally fine. How can this be resolved? There seems to be no recover option like in the old charmed MySQL operator…

unit-mysql-0: 21:27:32 ERROR unit.mysql/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-0/charm/./src/charm.py", line 724, in <module>
    main(MySQLOperatorCharm)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mysql-0/charm/./src/charm.py", line 525, in _on_mysql_pebble_ready
    self._reconcile_pebble_layer(container)
  File "/var/lib/juju/agents/unit-mysql-0/charm/./src/charm.py", line 364, in _reconcile_pebble_layer
    container.replan()
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/model.py", line 1984, in replan
    self._pebble.replan_services()
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1686, in replan_services
    return self._services_action('replan', [], timeout, delay)
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1767, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "mysqld_safe" (cannot start service: exited quickly with code 0)
----- Logs from task 0 -----
2023-11-16T20:27:32Z INFO Most recent service output:
    2023-11-16T20:27:32.460327Z mysqld_safe Logging to '/var/log/mysql/error.log'.
    2023-11-16T20:27:32.474107Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    2023-11-16T20:27:32.707547Z mysqld_safe mysqld from pid file /var/lib/mysql/mysql-0.pid ended
2023-11-16T20:27:32Z ERROR cannot start service: exited quickly with code 0
-----

Sorry, the log in the first comment was wrong. I made a silly mistake; this is the real problem:

2023-11-16T21:13:08.288Z [container-agent] verbose: 2023-11-16T21:13:08Z: Dba.reboot_cluster_from_complete_outage: tid=51: CONNECTED: mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306
2023-11-16T21:13:08.288Z [container-agent] No PRIMARY member found for cluster 'cluster-7ed1e5bb9cf519c113646d7c9d8d08c7'
2023-11-16T21:13:08.288Z [container-agent] verbose: 2023-11-16T21:13:08Z: ClusterSet info: member, primary, not primary_invalidated, not removed from set, primary status: UNKNOWN
2023-11-16T21:13:08.288Z [container-agent] Restoring the Cluster 'cluster-7ed1e5bb9cf519c113646d7c9d8d08c7' from complete outage...
2023-11-16T21:13:08.288Z [container-agent] 
2023-11-16T21:13:08.288Z [container-agent] ERROR: RuntimeError: The current session instance does not belong to the Cluster: 'cluster-7ed1e5bb9cf519c113646d7c9d8d08c7'.
2023-11-16T21:13:08.288Z [container-agent] Traceback (most recent call last):
2023-11-16T21:13:08.288Z [container-agent]   File "<string>", line 2, in <module>
2023-11-16T21:13:08.288Z [container-agent] RuntimeError: Dba.reboot_cluster_from_complete_outage: The current session instance does not belong to the Cluster: 'cluster-7ed1e5bb9cf519c113646d7c9d8d08c7'.
2023-11-16T21:13:08.288Z [container-agent] 
2023-11-16T21:13:08.288Z [container-agent] 
2023-11-16T21:13:08.291Z [container-agent] 2023-11-16 21:13:08 ERROR juju-log Failed to reboot cluster from complete outage.
2023-11-16T21:13:08.613Z [container-agent] 2023-11-16 21:13:08 ERROR juju-log Failed to get cluster status for cluster-7ed1e5bb9cf519c113646d7c9d8d08c7
2023-11-16T21:13:08.627Z [container-agent] 2023-11-16 21:13:08 WARNING juju-log No relation: certificates
    def reboot_from_complete_outage(self) -> None:
        """Wrapper for reboot_cluster_from_complete_outage command."""
        reboot_from_outage_command = (
            f"shell.connect('{self.cluster_admin_user}:{self.cluster_admin_password}@{self.instance_address}')",
            f"dba.reboot_cluster_from_complete_outage('{self.cluster_name}')",
        )

        try:
            self._run_mysqlsh_script("\n".join(reboot_from_outage_command))
        except MySQLClientError as e:
            logger.exception("Failed to reboot cluster")
            raise MySQLRebootFromCompleteOutageError(e.message)
https://github.com/canonical/mysql-k8s-operator/blob/main/lib/charms/mysql/v0/mysql.py#L2006
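For a manual recovery attempt, the tuple above joins into a two-line mysqlsh Python script that you can feed to mysqlsh yourself. This is a sketch for illustration only; the credentials and endpoint below are placeholders, not values taken from the charm:

```python
# Reproduce the two-line script the wrapper above generates.
# The user, password, and host are placeholders for your deployment;
# the cluster name is taken from the error log above.
cluster_name = "cluster-7ed1e5bb9cf519c113646d7c9d8d08c7"
commands = (
    "shell.connect('clusteradmin:<password>@mysql-0.mysql-endpoints.openstack.svc.cluster.local')",
    f"dba.reboot_cluster_from_complete_outage('{cluster_name}')",
)
script = "\n".join(commands)
# Save `script` to a file and run it with:
#   mysqlsh --no-wizard --py -f recover.py
print(script)
```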

Who executes the dba.reboot_cluster_from_complete_outage Python function? I can't seem to find an implementation of the _run_mysqlsh_script function.
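My best guess is that the helper ultimately hands the script to the mysqlsh binary (inside the workload container) in Python mode, where shell.connect and dba.* are mysqlsh built-ins. A minimal stand-in might look like this; the function names and the exact invocation are my assumptions, not the charm's actual code:

```python
import subprocess
import tempfile

def build_mysqlsh_command(script_path: str) -> list:
    # --no-wizard disables interactive prompts; --py selects the Python
    # dialect used by shell.connect()/dba.* in the snippet above.
    return ["mysqlsh", "--no-wizard", "--py", "-f", script_path]

def run_mysqlsh_script(script: str, timeout: int = 30) -> str:
    """Hypothetical stand-in for the charm's _run_mysqlsh_script helper.

    Assumption: the real helper writes the script somewhere the workload
    container can read and executes mysqlsh there (e.g. via Pebble exec);
    this sketch just runs it locally.
    """
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py") as f:
        f.write(script)
        f.flush()
        cmd = build_mysqlsh_command(f.name)
        return subprocess.check_output(cmd, text=True, timeout=timeout)
```

So "who executes it" would be the mysqlsh process itself, not the charm's Python interpreter.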