Ceph-osd charm stuck while upgrading Nautilus->Octopus

hallback · 10 February 2021 18:21

Hi all,

I was about to upgrade OpenStack Train to Ussuri today and got stuck upgrading the ceph-osd charm. Five of my twelve ceph-osd units have had their disks marked out, purged, zapped, and re-added in order to resize the bluestore db, which worked. This is the current status of ceph-osd:

The seven units at the bottom have finished upgrading, while the first five are stuck. They are still up, though: the daemons have not been restarted yet. The system is working, for now.

The problem is that the five units in maintenance are stuck trying to get the status from an admin daemon on a socket that does not exist anymore:

unit-ceph-osd-11: 18:14:05 WARNING unit.ceph-osd/11.config-changed admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
unit-ceph-osd-11: 18:14:05 DEBUG unit.ceph-osd/11.juju-log Command '['ceph', 'daemon', '/var/run/ceph/ceph-osd.233.asok', 'status']' returned non-zero exit status 22

These are the sockets I have on the particular unit ceph-osd/11:

root@osd11:/var/run/ceph# ls -la
total 0
drwxrwx---  2 ceph ceph  280 Feb 10 00:51 .
drwxr-xr-x 30 root root 1000 Feb 10 17:10 ..
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.34.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.61.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.7.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.72.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.73.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.74.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.77.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.78.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.79.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.80.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.81.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:51 ceph-osd.82.asok

Earlier during the lifetime of this unit, it had OSDs with some other numbers, like 233, 258, 280 and so on.

I guess I can’t easily fake the output of what such a socket puts out, or can I?

The only idea I have at the moment is to osd-out everything on these five problematic units and remove them, if it is at all possible to run an action while the charm is in the config-changed state.

Does anyone have an idea?

Thanks a lot in advance!

Johan Hallbäck

hallback · 10 February 2021 20:09

Sorry if this is the wrong forum; advice on where to go with this is welcome.

I think I have located where I am stuck in the code, on my system it is here:

/var/lib/juju/agents/unit-ceph-osd-11/charm/lib/charms_ceph/utils.py

As shown above, I’m stuck in the config-changed hook, here is the process:

root      179235  3.7  0.0 145876 38204 ?        S    16:29   7:07 python3 /var/lib/juju/agents/unit-ceph-osd-11/charm/hooks/config-changed

In the function get_all_osd_states(), the call to get_local_osd_ids() creates a list of the local OSDs:

def get_all_osd_states(osd_goal_states=None):
    """Get all OSD states or loop until all OSD states match OSD goal states.

    If osd_goal_states is None, just return a dictionary of current OSD states.
    If osd_goal_states is not None, loop until the current OSD states match
    the OSD goal states.

    :param osd_goal_states: (Optional) dict indicating states to wait for
                            Defaults to None
    :returns: Returns a dictionary of current OSD states.
    :rtype: dict
    """
    osd_states = {}
    for osd_num in get_local_osd_ids():
        if not osd_goal_states:
            osd_states[osd_num] = get_osd_state(osd_num)
        else:
            osd_states[osd_num] = get_osd_state(
                osd_num,
                osd_goal_state=osd_goal_states[osd_num])
    return osd_states

Note the docstring there: “loop until the current OSD states match the OSD goal states”.

Then I found that get_local_osd_ids() produces a list of local OSDs by looking at the contents of /var/lib/ceph/osd/*:

def get_local_osd_ids():
    """This will list the /var/lib/ceph/osd/* directories and try
    to split the ID off of the directory name and return it in
    a list.

    :returns: list. A list of osd identifiers
    :raises: OSError if something goes wrong with listing the directory.
    """
    osd_ids = []
    osd_path = os.path.join(os.sep, 'var', 'lib', 'ceph', 'osd')
    if os.path.exists(osd_path):
        try:
            dirs = os.listdir(osd_path)
            for osd_dir in dirs:
                osd_id = osd_dir.split('-')[1]
                if _is_int(osd_id):
                    osd_ids.append(osd_id)
        except OSError:
            raise
    return osd_ids

The problem is that empty directories of previously purged and zapped OSDs still exist on some of my ceph-osd units. On my example unit ceph-osd/11, the empty directories dated July 20 2020 belong to OSDs that do not exist anymore:

root@osd11:/var/lib/ceph/osd# ls -lart
total 48
-rw-------  1 ceph ceph   69 Jul 20  2020 ceph.client.osd-upgrade.keyring
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-86
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-110
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-136
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-158
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-184
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-210
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-233
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-258
drwxr-xr-x  2 ceph ceph 4096 Jul 20  2020 ceph-280
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:48 ceph-7
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:48 ceph-34
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:48 ceph-61
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:48 ceph-72
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:49 ceph-73
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:49 ceph-74
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:49 ceph-77
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:50 ceph-78
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:50 ceph-79
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:50 ceph-80
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:50 ceph-81
drwxr-xr-x 23 ceph ceph 4096 Feb 10 00:50 .
drwxrwxrwt  2 ceph ceph  340 Feb 10 00:51 ceph-82
drwxr-x--- 15 ceph ceph 4096 Feb 10 16:37 ..

Because of this, get_osd_state() loops forever for each such directory: for ceph-233, for example, there is no socket /var/run/ceph/ceph-osd.233.asok, so the ceph daemon call fails and is retried without end. This is the function:

def get_osd_state(osd_num, osd_goal_state=None):
    """Get OSD state or loop until OSD state matches OSD goal state.

    If osd_goal_state is None, just return the current OSD state.
    If osd_goal_state is not None, loop until the current OSD state matches
    the OSD goal state.

    :param osd_num: the osd id to get state for
    :param osd_goal_state: (Optional) string indicating state to wait for
                           Defaults to None
    :returns: Returns a str, the OSD state.
    :rtype: str
    """
    while True:
        asok = "/var/run/ceph/ceph-osd.{}.asok".format(osd_num)
        cmd = [
            'ceph',
            'daemon',
            asok,
            'status'
        ]
        try:
            result = json.loads(str(subprocess
                                    .check_output(cmd)
                                    .decode('UTF-8')))
        except (subprocess.CalledProcessError, ValueError) as e:
            log("{}".format(e), level=DEBUG)
            continue
        osd_state = result['state']
        log("OSD {} state: {}, goal state: {}".format(
            osd_num, osd_state, osd_goal_state), level=DEBUG)
        if not osd_goal_state:
            return osd_state
        if osd_state == osd_goal_state:
            return osd_state
        time.sleep(3)
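
In the function above, the except branch hits continue before reaching time.sleep(3), so a missing socket is retried immediately and without any exit condition. A minimal sketch of one way such a loop could be bounded follows. It is not the charm's code or its eventual fix: the name get_osd_state_bounded, the OsdSocketMissing exception, and the max_attempts, delay, run_dir, run_cmd and sleep parameters are all hypothetical and exist only for illustration.

```python
import json
import os
import subprocess
import time


class OsdSocketMissing(Exception):
    """Hypothetical error: no usable OSD state within the attempt budget."""


def get_osd_state_bounded(osd_num, osd_goal_state=None, max_attempts=5,
                          delay=3, run_dir="/var/run/ceph",
                          run_cmd=subprocess.check_output, sleep=time.sleep):
    """Like get_osd_state(), but give up after max_attempts tries."""
    asok = os.path.join(run_dir, "ceph-osd.{}.asok".format(osd_num))
    for _ in range(max_attempts):
        # Only query the daemon if its admin socket is present at all.
        if os.path.exists(asok):
            try:
                output = run_cmd(["ceph", "daemon", asok, "status"])
                state = json.loads(output.decode("UTF-8"))["state"]
                if not osd_goal_state or state == osd_goal_state:
                    return state
            except (subprocess.CalledProcessError, ValueError):
                pass  # fall through to the sleep below instead of spinning
        sleep(delay)
    raise OsdSocketMissing(
        "no usable state from {} after {} attempts".format(asok, max_attempts))
```

With this shape, a stale directory such as ceph-233 would surface as an exception after a few attempts rather than hanging the hook. Whether raising is the right behaviour for the charm is a separate design question.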

On unit ceph-osd/11, the first three iterations succeeded for three OSDs before the hook got stuck on OSD 233:

unit-ceph-osd-11-2021-02-10T17-41-05.303.log.gz:2021-02-10 16:36:53 DEBUG config-changed Reading state information...
unit-ceph-osd-11-2021-02-10T17-41-05.303.log.gz:2021-02-10 16:37:01 DEBUG config-changed Reading state information...
unit-ceph-osd-11-2021-02-10T17-41-05.303.log.gz:2021-02-10 16:37:41 DEBUG juju-log OSD 78 state: active, goal state: None
unit-ceph-osd-11-2021-02-10T17-41-05.303.log.gz:2021-02-10 16:37:41 DEBUG juju-log OSD 73 state: active, goal state: None
unit-ceph-osd-11-2021-02-10T17-41-05.303.log.gz:2021-02-10 16:37:41 DEBUG juju-log OSD 72 state: active, goal state: None

The directory /var/run/ceph only has the 12 valid sockets:

root@osd11:/var/run/ceph# ls -la
total 0
drwxrwx---  2 ceph ceph  280 Feb 10 00:51 .
drwxr-xr-x 30 root root 1000 Feb 10 19:07 ..
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.34.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.61.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.7.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:48 ceph-osd.72.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.73.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.74.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.77.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:49 ceph-osd.78.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.79.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.80.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:50 ceph-osd.81.asok
srwxr-xr-x  1 ceph ceph    0 Feb 10 00:51 ceph-osd.82.asok

I have figured out that I can’t run juju actions (like osd-out) on the problematic nodes because the config-changed hook blocks them. Maybe I could do it manually, though?

Since this is “just” the config-changed hook that has been stuck during the upgrade, maybe this process can be killed?

root      179235  3.7  0.0 145876 38204 ?        S    16:29   7:51 python3 /var/lib/juju/agents/unit-ceph-osd-11/charm/hooks/config-changed

The status is that the ceph packages have been upgraded, but the 12 ceph-osd processes haven’t been reloaded yet:

root@osd11:/var/run/ceph# ps auxwww | grep -i ceph
root        2244  0.0  0.0  40584 11772 ?        Ss   00:43   0:00 /usr/bin/python3 /usr/bin/ceph-crash
ceph        7978  9.0  0.9 4654580 3904056 ?     Ssl  00:48 103:40 /usr/bin/ceph-osd -f --cluster ceph --id 7 --setuser ceph --setgroup ceph
ceph        8775  9.4  1.0 4894340 4144852 ?     Ssl  00:48 109:13 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph --setgroup ceph
ceph        9573  9.3  1.0 4921552 4170208 ?     Ssl  00:48 107:12 /usr/bin/ceph-osd -f --cluster ceph --id 61 --setuser ceph --setgroup ceph
ceph       10350  9.1  0.9 4665432 3914972 ?     Ssl  00:48 105:08 /usr/bin/ceph-osd -f --cluster ceph --id 72 --setuser ceph --setgroup ceph
ceph       11113  8.1  0.9 4358148 3606248 ?     Ssl  00:49  93:06 /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
ceph       11857  9.5  1.0 4901808 4152348 ?     Ssl  00:49 109:56 /usr/bin/ceph-osd -f --cluster ceph --id 74 --setuser ceph --setgroup ceph
ceph       12611  8.7  0.9 4620984 3869780 ?     Ssl  00:49 100:16 /usr/bin/ceph-osd -f --cluster ceph --id 77 --setuser ceph --setgroup ceph
ceph       13360  8.7  1.0 4731300 3980628 ?     Ssl  00:49 100:41 /usr/bin/ceph-osd -f --cluster ceph --id 78 --setuser ceph --setgroup ceph
ceph       14078  8.4  0.9 4531084 3779904 ?     Ssl  00:50  97:32 /usr/bin/ceph-osd -f --cluster ceph --id 79 --setuser ceph --setgroup ceph
ceph       14815  9.2  0.9 4659480 3908848 ?     Ssl  00:50 105:54 /usr/bin/ceph-osd -f --cluster ceph --id 80 --setuser ceph --setgroup ceph
ceph       15529  9.0  0.9 4669668 3918880 ?     Ssl  00:50 104:11 /usr/bin/ceph-osd -f --cluster ceph --id 81 --setuser ceph --setgroup ceph
ceph       16257  8.9  0.9 4548732 3797292 ?     Ssl  00:51 102:38 /usr/bin/ceph-osd -f --cluster ceph --id 82 --setuser ceph --setgroup ceph
root      126062  0.0  0.0  21776  3508 ?        Ss   11:27   0:00 bash /etc/systemd/system/jujud-unit-ceph-osd-11-exec-start.sh
root      126084  0.5  0.0 835820 91400 ?        Sl   11:27   2:57 /var/lib/juju/tools/unit-ceph-osd-11/jujud unit --data-dir /var/lib/juju --unit-name ceph-osd/11 --debug
root      179235  3.7  0.0 145876 38204 ?        S    16:29   7:51 python3 /var/lib/juju/agents/unit-ceph-osd-11/charm/hooks/config-changed
root      852063  0.0  0.0  14864  2680 pts/0    S+   19:58   0:00 grep --color=auto -i ceph

erik-lonroth · 11 February 2021 05:30

Looks like a charm bug causing you to get stuck in maintenance state in your upgrade.

I know very little about Ceph or OpenStack, but given that this is a production-grade system:

manually move all data off the stuck units (osd out?)
once complete, nuke the units to unstick your OpenStack upgrade from the maintenance state
wait for juju to complete the OpenStack upgrade, which should happen once the maintenance state goes away
add back one osd unit at a time

… is probably a path forward not risking your data.

This seems like a serious bug. @Dmitrii @hallback @jamesbeedy

How can one reach out to the CEPH charmers? @wallyworld

hallback · 11 February 2021 13:52

First of all I’d like to thank @erik-lonroth and @jamesbeedy for helping out last night choosing a path out of this mess. I’m in the process of doing as Erik described above. Getting some perspective and direction was my initial reason for the post.

I think the proper place for a bug report is here: https://bugs.launchpad.net/charm-ceph-osd/+filebug - according to https://jaas.ai/ceph-osd. I’ll let them know about this soon.

The basic problem as I see it is that in the ceph-osd charm, running the zap-disk action does not clean up after itself. I have found two places where remains of an old OSD are left on disk:

The empty directory /var/lib/ceph/osd/ceph-NNN (causing an endless loop in get_osd_state())
There is a service remaining: /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-NNN-46fc6892-86ca-41e8-8b23-346b500e4409.service
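
The second leftover can be spotted by parsing the unit file names. The following is a minimal sketch that assumes the names always follow the pattern quoted above (ceph-volume@lvm-, then the OSD number, then a 36-character UUID). The function name leftover_volume_units and the regular expression are illustrative and not part of the charm.

```python
import re

# Pattern inferred from the unit name quoted above:
# ceph-volume@lvm-<osd number>-<uuid>.service
UNIT_RE = re.compile(r"^ceph-volume@lvm-(\d+)-([0-9a-f-]{36})\.service$")


def leftover_volume_units(unit_names, live_osd_ids):
    """Return (osd_id, unit_name) pairs whose OSD id is not in live_osd_ids."""
    live = {str(osd_id) for osd_id in live_osd_ids}
    leftovers = []
    for name in unit_names:
        match = UNIT_RE.match(name)
        # Names that do not look like ceph-volume units are ignored.
        if match and match.group(1) not in live:
            leftovers.append((match.group(1), name))
    return leftovers
```

On a unit, unit_names could come from os.listdir("/etc/systemd/system/multi-user.target.wants") and live_osd_ids from the OSDs that actually have a running daemon. The sketch only reports candidates and does not disable or remove anything.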

If a disk is zapped and then re-added, it may receive the same number, in which case this corner case doesn’t happen. But if the disk gets an “osd number” that is new to the current unit, my situation will occur.

The problem can be triggered by having directories in /var/lib/ceph/osd/ that have no socket in /var/run/ceph, and then running a ceph upgrade. A normal config-changed does not trigger it, but I’ll try to reproduce it in a test environment somehow. When the two directories on a ceph-osd unit differ like this, the upgrade will get stuck:

# ls -l /var/lib/ceph/osd/  
total 40
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-110
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-136
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-158
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-184
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-210
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-233
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-258
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-280
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:48 ceph-34
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:48 ceph-61
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:48 ceph-7
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:48 ceph-72
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:49 ceph-73
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:49 ceph-74
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:49 ceph-77
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:50 ceph-78
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:50 ceph-79
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:50 ceph-80
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:50 ceph-81
drwxrwxrwt 2 ceph ceph  340 Feb 10 00:51 ceph-82
drwxr-xr-x 2 ceph ceph 4096 Jul 20  2020 ceph-86
-rw------- 1 ceph ceph   69 Jul 20  2020 ceph.client.osd-upgrade.keyring

# ls -l /var/run/ceph/
total 0
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:48 ceph-osd.34.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:48 ceph-osd.61.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:48 ceph-osd.7.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:48 ceph-osd.72.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:49 ceph-osd.73.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:49 ceph-osd.74.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:49 ceph-osd.77.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:49 ceph-osd.78.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:50 ceph-osd.79.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:50 ceph-osd.80.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:50 ceph-osd.81.asok
srwxr-xr-x 1 ceph ceph 0 Feb 10 00:51 ceph-osd.82.asok
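
The mismatch between the two listings above can be checked before an upgrade. The following is a rough sketch of such a check; the function names ids_matching and stale_osd_ids are made up for illustration, and the approach assumes that every healthy local OSD has a running daemon and therefore an admin socket. A deliberately stopped OSD would be reported as well.

```python
import os
import re

DIR_RE = re.compile(r"^ceph-(\d+)$")
SOCK_RE = re.compile(r"^ceph-osd\.(\d+)\.asok$")


def ids_matching(path, pattern):
    """Collect the numeric OSD ids from directory entries matching pattern."""
    if not os.path.isdir(path):
        return set()
    found = set()
    for entry in os.listdir(path):
        match = pattern.match(entry)
        if match:
            found.add(int(match.group(1)))
    return found


def stale_osd_ids(osd_dir="/var/lib/ceph/osd", run_dir="/var/run/ceph"):
    """Return sorted OSD ids that have a data directory but no admin socket."""
    dirs = ids_matching(osd_dir, DIR_RE)
    sockets = ids_matching(run_dir, SOCK_RE)
    return sorted(dirs - sockets)
```

Run as root on the unit shown above, stale_osd_ids() should report 86, 110, 136, 158, 184, 210, 233, 258 and 280, which are the directories dated Jul 20 2020.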

Just some final thoughts on this:

When units don’t work, try to add a new one that does and destroy the old one, never fall in love with them
Don’t try to outsmart charms and create possible corner cases that no one ever tried
Know your application well enough to be able to replace units if they don’t behave

Thanks again Erik and James!

james-page · 11 February 2021 15:53

Hi Johan

Please do raise a bug - I can see several ways the code that determines the local OSDs could be improved to exclude OSDs that are no longer actually present on the server. That would help the upgrade process on clouds that have been through changes like this.

afreiberger · 7 July 2021 20:58

I just filed this as Bug #1934938 “After replacing ceph-osd disk, blank directories /...” : Bugs : OpenStack ceph-osd charm

A workaround I’ve found for this issue is to run rmdir /var/lib/ceph/osd/ceph-<non-existent-ID>, then systemctl restart jujud-unit-ceph-osd-$UNIT_NUMBER, and finally juju resolved ceph-osd/$UNIT_NUMBER.
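
The rmdir step of that workaround can be scripted when several units are affected. Below is a cautious sketch, assuming that an empty ceph-<id> directory always means a stale OSD; that holds for the listings in this thread but is not guaranteed in general (an OSD directory that is simply not mounted could also be empty). The function name remove_empty_osd_dirs and its dry_run parameter are hypothetical.

```python
import os
import re

DIR_RE = re.compile(r"^ceph-(\d+)$")


def remove_empty_osd_dirs(osd_dir="/var/lib/ceph/osd", dry_run=True):
    """rmdir empty ceph-<id> directories; return the paths (to be) removed."""
    removed = []
    for entry in sorted(os.listdir(osd_dir)):
        path = os.path.join(osd_dir, entry)
        if not DIR_RE.match(entry) or not os.path.isdir(path):
            continue  # skip keyrings and anything not named ceph-<number>
        if os.listdir(path):
            continue  # still has contents, so it looks like a live OSD
        if not dry_run:
            os.rmdir(path)  # rmdir refuses to delete non-empty directories
        removed.append(path)
    return removed
```

By default the function only lists candidates. Passing dry_run=False performs the removal, after which the jujud restart and juju resolved steps from the post above still have to be run by hand.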