juju storage never attaches when using pre-created MAAS LXD Pod VM, but does with identical auto-composed VM

lathiat · 6 March 2024 09:37

I’m trying to use juju storage to provision a block device for the ceph-osd charm on juju 3.3.3/3.4.0, against MAAS 3.4 using an LXD Pod to hold VMs for all machines in the MAAS. This is in my home lab, where I’m testing how attached storage behaves in different configurations (e.g. as bcache, an LVM PV, raw disk, etc) as part of adding juju storage support to the microceph charm. For now I am testing with the existing ‘ceph-osd’ charm to see how it behaves.

If I pre-define a VM in MAAS that is then used by the model, juju fails to see the storage on the machine (which is there). The unit stays stuck waiting for the storage to appear, so the charm never starts installing and the machine status keeps showing “agent initialising”.

However, if I let juju/MAAS auto-compose a seemingly identical VM with the same bundle, the storage is seen and gets attached. The MAAS and LXD config look the same, the disks are the same size, etc. In my example I’m attaching two different disks, but it also fails if I attach only one of them.

I have not tried this with “real” hardware machines at this stage due to lack of such resources. Traditionally in actual field deployments, the juju storage functionality is not usually used with MAAS; raw disk devices are provisioned in MAAS and passed directly to the osd-devices config item. So this functionality in MAAS is perhaps not exercised as often as with the openstack provider etc, where it’s more commonly used. I haven’t used it against MAAS in the past, so I don’t know if it worked at some point on an older juju version. Trying to use juju 2.9 gives me other problems around MAAS/LXD Pods that make it non-trivial to test that.

I’ve crawled over the logs and made a very poor attempt to reckon with the code that handles the storage attachment, but have not managed to figure out what the difference is and why it sees the disk in one case and not the other. The storage shows as attached and alive in “juju storage --format json”, but that state doesn’t seem to make it to the machine.

I’ve created a juju-crashdump once it hits steady-state with logging-config=“<root>=TRACE;unit=TRACE”. You can find it here: https://drive.google.com/file/d/1amPePTqV8Ea6blVtFP2SeizHwlJ76hw-/view?usp=drive_link

ceph-osd/{0,1,2} on hostname ceph1 (machine 1)/ceph2 (machine 2)/ceph3 (machine 3) are pre-created - you can see they are stuck in “agent initialising”

ceph-osd/3 on hostname merry-mullet (Machine 4) is the auto-composed machine - you can see the charm deployed and used the storage here

You can compare {1,2,3}/baremetal/var/log/juju/*.log with 4/baremetal/var/log/juju/*.log

On the broken unit the unit-ceph-osd-3.log prints “still pending [storage-bluestore-db-4 storage-osd-devices-5]”. On the working unit we see “got storage change for ceph-osd/3: [bluestore-db/6 osd-devices/7] ok=true”.

I can’t see any obvious lines in machine-N.log about identifying the storage, other than when it gets the info about the attachment being “alive” for the ceph-osd unit’s storage-attached hook.

We can see the broken units keep repeating udevadm calls. They query the relevant disks (/dev/sdb and /dev/sdc) every 30 seconds, but then log “no changes to block devices detected” and do nothing further.

I am composing the MAAS machines with this command:

for i in 1 2 3; do maas admin vm-host compose 1 hostname=ceph${i} cores=4 memory=4096 storage="0:24,1:32,2:8"; done

The bundle I’m deploying is like so:

series: jammy
applications:
  ceph-mon:
    charm: ch:ceph-mon
    channel: quincy/edge
    series: jammy
    num_units: 1
    constraints: mem=2G
    options:
      monitor-count: 1
      expected-osd-count: 3
  ceph-osd:
    charm: ch:ceph-osd
    channel: quincy/edge
    series: jammy
    num_units: 4
    constraints: mem=4G root-disk=16G
    options:
      osd-devices: ''  # must be empty string when using juju storage
      bluestore-block-db-size: 1900000000
    storage:
      osd-devices: maas,32GB,1
      bluestore-db: maas,8GB,1
relations:
  - [ ceph-mon, ceph-osd ]

juju status

lathiat@zlab:~/src/stsstack-bundles/ceph$ juju status
Model   Controller  Cloud/Region  Version  SLA          Timestamp
quincy  maas        maas/default  3.4.0    unsupported  09:36:47Z

App       Version  Status   Scale  Charm     Channel        Rev  Exposed  Message
ceph-mon  17.2.6   waiting      1  ceph-mon  quincy/stable  201  no       Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-osd  17.2.6   waiting    1/4  ceph-osd  quincy/stable  576  no       agent initialising

Unit         Workload  Agent       Machine  Public address  Ports  Message
ceph-mon/0*  waiting   idle        0        172.16.0.45            Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-osd/0*  waiting   allocating  1        172.16.0.39            agent initialising
ceph-osd/1   waiting   allocating  2        172.16.0.75            agent initialising
ceph-osd/2   waiting   allocating  3        172.16.0.55            agent initialising
ceph-osd/3   active    idle        4        172.16.0.53            Unit is ready (1 OSD)

Machine  State    Address      Inst id       Base          AZ       Message
0        started  172.16.0.45  hardy-eagle   ubuntu@22.04  default  Deployed
1        started  172.16.0.39  ceph1         ubuntu@22.04  default  Deployed
2        started  172.16.0.75  ceph2         ubuntu@22.04  default  Deployed
3        started  172.16.0.55  ceph3         ubuntu@22.04  default  Deployed
4        started  172.16.0.53  merry-mullet  ubuntu@22.04  default  Deployed

juju storage

lathiat@zlab:~/src/stsstack-bundles/ceph$ juju storage
Unit        Storage ID      Type   Pool  Size     Status    Message
ceph-osd/0  bluestore-db/0  block  maas  7.5 GiB  attached
ceph-osd/0  osd-devices/1   block  maas  30 GiB   attached
ceph-osd/1  bluestore-db/2  block  maas  7.5 GiB  attached
ceph-osd/1  osd-devices/3   block  maas  30 GiB   attached
ceph-osd/2  bluestore-db/4  block  maas  7.5 GiB  attached
ceph-osd/2  osd-devices/5   block  maas  30 GiB   attached
ceph-osd/3  bluestore-db/6  block  maas  7.5 GiB  attached
ceph-osd/3  osd-devices/7   block  maas  30 GiB   attached

lathiat · 8 March 2024 06:10

OK, I have partially tracked down the cause of this by comparing Machine #3 and #4 from the same deployment as the status info above.

For some reason, these disks show up with two IDs in /dev/disk/by-id (on both machines). On both machines, “udevadm info /dev/sdb” shows: ID_SERIAL=0QEMU_QEMU_HARDDISK_lxd_disk1

However, there are two symlinks in /dev/disk/by-id, one with a leading “0” and one with a leading “S”. On the pre-created machine #3 the “0” version is listed first. On the auto-composed machine #4 the “S” version is listed first:

root@ceph3:~# udevadm info /dev/sdb|grep -E "[0S]QEMU"
S: disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1
S: disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_lxd_disk1
E: DEVLINKS=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1 /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:1:1 /dev/disk/by-dname/sdb /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1

root@sweet-tetra:~# udevadm info /dev/sdb|grep -E "[0S]QEMU"
S: disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1
S: disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_lxd_disk1
E: DEVLINKS=/dev/disk/by-id/lvm-pv-uuid-HLh32H-paTA-Vc02-Ulyj-9JIC-v9q8-tr4cXh /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:1:1 /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1 /dev/disk/by-dname/sdb /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1
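
To make the comparison above easier to repeat, here is a minimal Python sketch that splits `udevadm info` output into its symlinks (the `S:` lines) and properties (the `E:` lines), then reports which by-id links end with the ID_SERIAL value. The function name `parse_udevadm_info` and the sample text are my own for illustration; the sample is trimmed from the ceph3 output above.

```python
def parse_udevadm_info(text):
    """Split `udevadm info` output into symlinks (S:) and properties (E:)."""
    links, props = [], {}
    for line in text.splitlines():
        prefix, _, rest = line.partition(": ")
        if prefix == "S":
            # S: lines are relative to /dev
            links.append("/dev/" + rest.strip())
        elif prefix == "E" and "=" in rest:
            key, _, value = rest.partition("=")
            props[key] = value.strip()
    return links, props


# Trimmed from the ceph3 output in this post.
SAMPLE = """\
S: disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1
S: disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_lxd_disk1
"""

links, props = parse_udevadm_info(SAMPLE)
by_id = [l for l in links if l.startswith("/dev/disk/by-id/")]
# Only the link whose name ends with ID_SERIAL agrees with the serial.
agreeing = [l for l in by_id if l.endswith(props["ID_SERIAL"])]
print("by-id links:", by_id)
print("links agreeing with ID_SERIAL:", agreeing)
```

On a live machine the input could come from `subprocess.run(["udevadm", "info", "/dev/sdb"], capture_output=True, text=True).stdout` instead of the sample string.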

In both cases, juju seems to be reading this from the ID_SERIAL line, which is consistent. However, MAAS seems to have read it from the symlinks instead and returns a different result for each machine:

# broken machine 3
lathiat@zlab:~$ maas admin node read sqwwfw|jq '.blockdevice_set[1].id_path'
"/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1"
lathiat@zlab:~$ maas admin node read sqwwfw|jq '.blockdevice_set[1]' -c
{"firmware_version":"2.5+","partition_table_type":null,"available_size":31994150912,"storage_pool":"srv","id":1467,"partitions":[],"id_path":"/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_disk1","used_size":0,"size":32000049152,"block_size":512,"type":"physical","model":"QEMU HARDDISK","path":"/dev/disk/by-dname/sdb","filesystem":null,"used_for":"Unused","uuid":null,"numa_node":0,"system_id":"sqwwfw","serial":"lxd_disk1","tags":["rotary","1rpm"],"name":"sdb","resource_uri":"/MAAS/api/2.0/nodes/sqwwfw/blockdevices/1467/"}

# working machine 4
lathiat@zlab:~$ maas admin node read thp7xp|jq '.blockdevice_set[1].id_path'
"/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1"
lathiat@zlab:~$ maas admin node read thp7xp|jq '.blockdevice_set[1]' -c
{"firmware_version":null,"uuid":null,"storage_pool":"srv","partition_table_type":null,"available_size":7994343424,"used_for":"Unused","type":"physical","name":"sdb","size":8000000000,"system_id":"thp7xp","partitions":[],"numa_node":0,"serial":"lxd_disk1","tags":[],"model":"QEMU HARDDISK","used_size":0,"block_size":512,"id":1477,"path":"/dev/disk/by-dname/sdb","filesystem":null,"id_path":"/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_lxd_disk1","resource_uri":"/MAAS/api/2.0/nodes/thp7xp/blockdevices/1477/"}

I think there is an imperfection in the MAAS provider code here: https://github.com/juju/juju/blob/55fb5d03683b70dea08d1e54687e0f90bb06a63a/provider/maas/volumes.go#L246

We see that juju assumes /dev/disk/by-id/* can be converted to a “HardwareId”, which is later compared with ID_SERIAL. I guess that is often or usually true, but as we see here, it is not always true. If we had used the third code path there to store the DeviceLink instead of the HardwareId, it would have worked.
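
As a rough illustration of the mismatch, here is a simplified Python model of that matching. It is not the actual juju code. The function and field names are hypothetical, the prefixes that select each code path are my reading of the linked file, and the machine-side identifier being built as `<bus>-<ID_SERIAL>` is an assumption.

```python
BY_ID = "/dev/disk/by-id/"


def volume_from_id_path(id_path):
    """Provider side (model): turn a MAAS id_path into a match field."""
    if id_path.startswith(BY_ID + "wwn-"):
        return {"wwn": id_path[len(BY_ID + "wwn-"):]}
    if id_path.startswith(BY_ID):
        # This is the path taken for scsi-... links.
        return {"hardware_id": id_path[len(BY_ID):]}
    return {"device_link": id_path}


def machine_block_device(bus, id_serial, devlinks):
    """Machine side (model). ASSUMPTION: hardware id is '<bus>-<ID_SERIAL>'."""
    return {"hardware_id": f"{bus}-{id_serial}", "device_links": devlinks}


def matches(volume, device):
    if "hardware_id" in volume:
        return volume["hardware_id"] == device["hardware_id"]
    if "device_link" in volume:
        return volume["device_link"] in device["device_links"]
    return False  # WWN matching is left out of this sketch


device = machine_block_device(
    "scsi",
    "0QEMU_QEMU_HARDDISK_lxd_disk1",
    [BY_ID + "scsi-0QEMU_QEMU_HARDDISK_lxd_disk1",
     BY_ID + "scsi-SQEMU_QEMU_HARDDISK_lxd_disk1"],
)

broken = volume_from_id_path(BY_ID + "scsi-SQEMU_QEMU_HARDDISK_lxd_disk1")
working = volume_from_id_path(BY_ID + "scsi-0QEMU_QEMU_HARDDISK_lxd_disk1")
by_link = {"device_link": BY_ID + "scsi-SQEMU_QEMU_HARDDISK_lxd_disk1"}

print("machine 3 id_path matches:", matches(broken, device))
print("machine 4 id_path matches:", matches(working, device))
print("machine 3 matched by device link:", matches(by_link, device))
```

Under these assumptions the “S” id_path from machine 3 never equals the serial-derived identifier, while the same path compared as a device link is found, because both symlinks exist on the machine.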

I’ll look a little further into:

Why we have these two IDs and where the S/0 comes from and whether that should change
Whether MAAS could or should use ID_SERIAL instead of /dev/disk/by-id
Whether Juju should be trying to convert a /dev/disk/by-id to “HardwareId”. Given /dev/disk/by-id is a unique path anyway, it seems possibly pointless.

And file some bugs based on my findings. Thanks @wallyworld for your help debugging.