Remove Ceph OSD from a server with down state

mario-chirinos · 30 July 2022 14:41

After a power outage, one of our servers, which hosts an OSD, shows as down in Juju even though it is powered on and listed as deployed in MAAS. I then ran a test on the server from MAAS, and it now shows as ready instead of deployed.

I would like to remove this OSD from the cluster, and I was wondering what the proper way to do it is.

I am attaching the Juju and Ceph status:

geoint@MAAS-01:~$ juju show-status-log ceph-osd/6
Time                        Type      Status  Message
24 Jul 2022 15:52:53-05:00  workload  active  Unit is ready (1 OSD)

geoint@MAAS-01:~$  juju ssh ceph-mon/0 sudo ceph status
  cluster:
    id:     bf2cbfe8-9b3c-11ec-81ad-3fc481233260
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
 
  services:
    mon: 3 daemons, quorum juju-5025f7-0-lxd-0,juju-5025f7-1-lxd-0,juju-5025f7-2-lxd-0 (age 2d)
    mgr: juju-5025f7-0-lxd-0(active, since 3d), standbys: juju-5025f7-2-lxd-0, juju-5025f7-1-lxd-0
    osd: 9 osds: 8 up (since 2d), 8 in (since 3d)
    rgw: 3 daemons active (3 hosts, 1 zones)
 
  data:
    pools:   19 pools, 197 pgs
    objects: 2.02M objects, 7.6 TiB
    usage:   23 TiB used, 160 TiB / 183 TiB avail
    pgs:     197 active+clean
 
  io:
    client:   60 KiB/s rd, 20 KiB/s wr, 60 op/s rd, 41 op/s wr
 
Connection to 10.2.101.140 closed.

juju-status.pdf (29.5 KB)

pmatulis · 1 August 2022 16:29

There is an operation for removing an OSD in the Charmed Ceph documentation:

mario-chirinos · 1 August 2022 23:40

Step 3 won't work, because $OSD_UNIT is down:

juju run-action --wait $OSD_UNIT remove-disk osd-ids=$OSD purge=true

mario-chirinos · 2 August 2022 00:09

I am attaching my OSD tree:

geoint@MAAS-01:~$ juju ssh ceph-mon/leader sudo ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME               STATUS  REWEIGHT  PRI-AFF
 -1         204.70752  root default
 -7          21.38379      host clean-hog
  4    hdd   21.38379          osd.4               up   1.00000  1.00000
-13          21.38379      host exotic-goblin
  5    hdd   21.38379          osd.5               up   1.00000  1.00000
 -5          21.38379      host key-ox
  2    hdd   21.38379          osd.2               up   1.00000  1.00000
 -9          21.38379      host liked-hermit
  1    hdd   21.38379          osd.1               up   1.00000  1.00000
-17          21.83060      host pumped-bat
  7    hdd   21.83060          osd.7               up   1.00000  1.00000
-15          21.83060      host sharp-grouse
  6    hdd   21.83060          osd.6             down         0  1.00000
-19          32.74359      host sharp-heron
  8    hdd   32.74359          osd.8               up   1.00000  1.00000
-11          21.38379      host stable-liger
  0    hdd   21.38379          osd.0               up   1.00000  1.00000
 -3          21.38379      host star-koala
  3    hdd   21.38379          osd.3               up   1.00000  1.00000
Connection to 10.2.101.167 closed.
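
On a larger cluster, picking the down OSD out of the tree by eye gets tedious. A minimal sketch of filtering for it, assuming the plain-text `ceph osd tree` layout posted here (OSD rows have the name `osd.<id>` in the fourth field and the status in the fifth). The function name `down_osds` is hypothetical, and the sample rows are copied from this thread rather than queried live:

```shell
# List the IDs of OSDs reported as "down" in `ceph osd tree` plain output.
# Assumption: OSD rows look like "<id> <class> <weight> osd.<id> <status> ...";
# header, root and host rows do not match the osd.<id> pattern and are skipped.
down_osds() {
    awk '$4 ~ /^osd\.[0-9]+$/ && $5 == "down" { print $1 }'
}

# Sample rows copied from the tree in this thread (not a live query).
sample_tree='-17 21.83060 host pumped-bat
7 hdd 21.83060 osd.7 up 1.00000 1.00000
-15 21.83060 host sharp-grouse
6 hdd 21.83060 osd.6 down 0 1.00000
-19 32.74359 host sharp-heron
8 hdd 32.74359 osd.8 up 1.00000 1.00000'

printf '%s\n' "$sample_tree" | down_osds
```

Against a live cluster, the same filter could presumably be fed from `juju ssh ceph-mon/leader sudo ceph osd tree | down_osds`.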

mario-chirinos · 2 August 2022 14:20

What I need to know is the proper way to remove both the unit and the OSD when the unit is down.

chris.macnaughton · 3 August 2022 15:35

The purge-osd action on the ceph-mon should help you with this situation!

mario-chirinos · 8 August 2022 13:56

Should I change the current weight of the OSD?

6 hdd 21.83060 osd.6 down 0 1.00000

chris.macnaughton · 8 August 2022 14:08

If that host has been down for a while already, it doesn’t really matter, as the cluster will already have routed around that loss.

mario-chirinos · 8 August 2022 19:30

geoint@MAAS-01:~$ juju run-action --wait ceph-mon/leader purge-osd osd=6  i-really-mean-it=yes
unit-ceph-mon-2:
  UnitId: ceph-mon/2
  id: "256"
  message: OSD has weight 21.830596923828125, must have zero weight before this operation
  results: {}
  status: failed
  timing:
    completed: 2022-08-08 19:30:05 +0000 UTC
    enqueued: 2022-08-08 19:30:04 +0000 UTC
    started: 2022-08-08 19:30:04 +0000 UTC
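
The failure can look surprising, since the tree posted earlier lists 0 for osd.6. In `ceph osd tree` output, the WEIGHT column is the CRUSH weight and REWEIGHT is the override weight. The value in the error message (21.83...) matches the WEIGHT column, which suggests the action checks the CRUSH weight. That is an inference from the numbers in this thread, not something confirmed from the charm itself. A small sketch that prints both values for one OSD, where `osd_weights` is a hypothetical helper name and the input row is the one posted above:

```shell
# Print the WEIGHT (field 3) and REWEIGHT (field 6) of one OSD from
# `ceph osd tree` plain output. Assumes the row layout shown in this thread.
osd_weights() {
    awk -v name="osd.$1" '$4 == name { print "crush_weight=" $3, "reweight=" $6 }'
}

# The osd.6 row as posted in this thread (not a live query).
printf '%s\n' '6 hdd 21.83060 osd.6 down 0 1.00000' | osd_weights 6
```

For the posted row, the CRUSH weight is non-zero while the reweight is already 0, which is consistent with the error message.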

chris.macnaughton · 8 August 2022 19:48

Looks like you should reweight the OSD to 0.

mario-chirinos · 9 August 2022 15:17

I was able to remove the OSD with the commands below. I am now trying to remove the unit with juju remove-unit --wait ceph-osd/6, but it seems to do nothing; the unit is still there.

juju run-action --wait ceph-mon/leader change-osd-weight osd=6 weight=0
juju run-action --wait ceph-mon/leader purge-osd osd=6  i-really-mean-it=yes
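
For anyone repeating this, the two actions can be wrapped so the commands are reviewed before anything is purged. This is only a sketch: the helper names `run` and `purge_down_osd` and the `DRY_RUN` variable are hypothetical, and the action names and parameters are the ones used in this thread. It defaults to printing the commands instead of running them. Note that `juju run-action --wait` is Juju 2.9 syntax; on Juju 3.x the equivalent command is `juju run`.

```shell
# Print the command when DRY_RUN is 1 (the default in this sketch);
# execute it only when DRY_RUN is set to something else, e.g. DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Zero the weight of the given OSD, then purge it, using the ceph-mon
# actions from this thread. The purge step is skipped if reweighting fails.
purge_down_osd() {
    run juju run-action --wait ceph-mon/leader change-osd-weight "osd=$1" weight=0 &&
    run juju run-action --wait ceph-mon/leader purge-osd "osd=$1" i-really-mean-it=yes
}

# Dry run for osd.6: only prints the two commands.
purge_down_osd 6
```

Running the weight change first mirrors the order that worked here, since purge-osd refused to run while the OSD still had a non-zero weight.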

mario-chirinos · 15 August 2022 20:17

Should I use --force?