@szeestraten
As a member of Canonical's BootStack team, which provides day 2 managed service operations for our cloud products, I have quite a bit of experience with replacing failed disk devices backing the ceph-osd charm.
While I am not familiar with Juju storage beyond the documentation which can be found here, the basic principles of the underlying technology are the same across Juju, MAAS, and the Ceph charms.
While the MAAS storage provider can supply pools of storage to your ceph-osd units based on storage tags, it does not provide the "cloudy" style replacement of failing OSD storage devices that you may find with the AWS EBS storage provider for Juju.
Juju's basic operating model is such that if a unit fails, you should be able, in a high-availability application configuration, to juju remove-unit <app-name>/<unit#> and juju add-unit to replace the failed component. If you only have one or two OSD disks installed in each metal server, this would likely be the ideal replacement scenario.
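For that simple case, the flow would look roughly like the following (the unit name here is a placeholder for whichever unit holds the failed disk):

    # Remove the unit with the failed storage (safe in an HA configuration)
    juju remove-unit ceph-osd/3

    # After the hardware is repaired and the machine is available again,
    # add a fresh unit to take its place
    juju add-unit ceph-osd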
However, in practice, most ceph-osd nodes have between 3 and 20 disks per node (depending on packing factors, performance requirements, etc.), and ungracefully evacuating the storage from the entire unit just to replace one drive is not ideal.
In this situation, you must turn to the Ceph documentation, along with the ceph-mon and ceph-osd charm actions, for a process to replace the single failed disk on the running metal.
The high-level process is described in one of the bugs filed about the single-disk-replacement process not yet being entirely action-based.
For reference: Bug #1813360, "[wishlist] Action to 'purge-osd' and 'set-osd-out'..." (OpenStack ceph-mon charm).
You will mostly be performing Ceph operations via ssh to the ceph-mon leader and the affected ceph-osd unit, with a couple of charm actions along the way.
The process for cleanly removing the failing OSD is covered in the Ceph documentation (the "Adding/Removing OSDs" page of the RADOS operations guide).
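As a rough sketch only (the OSD ID, unit names, and device paths below are placeholders, and the exact commands vary somewhat by Ceph release), the removal generally looks like this:

    # On the ceph-mon leader (identify it with `juju status ceph-mon`; the leader is marked with *)
    juju ssh ceph-mon/0
    # Mark the failing OSD out and let the cluster rebalance (watch progress with `ceph -w`)
    sudo ceph osd out osd.12

    # On the ceph-osd unit that owns the disk, stop the daemon for that OSD
    juju ssh ceph-osd/3
    sudo systemctl stop ceph-osd@12

    # Back on the ceph-mon leader, remove the OSD from the CRUSH map, auth, and the OSD map
    sudo ceph osd crush remove osd.12
    sudo ceph auth del osd.12
    sudo ceph osd rm 12
    # On Luminous and newer, `sudo ceph osd purge 12 --yes-i-really-mean-it`
    # replaces the last three commands.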
Once those bits are completed, and you have replaced the drive and performed any low-level tasks needed to present the disk to the running OS (RAID configuration, disk probing, partitioning, and adding bcache if you so choose), you can use the add-disk action on the ceph-osd unit to have the disk added back into Ceph.
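For example, something along these lines (the unit name and device path are placeholders; if the replacement disk carries stale partition or OSD metadata, the charm's zap-disk action can clear it first):

    # Optional: wipe any leftover metadata from the replacement disk
    juju run-action --wait ceph-osd/3 zap-disk devices=/dev/sdf i-really-mean-it=true

    # Have the charm bring the disk back into the cluster as an OSD
    juju run-action --wait ceph-osd/3 add-disk osd-devices=/dev/sdf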
That being said, the add-disk action may come with some limitations if you are using advanced or multiple Ceph pool definitions.
If you are using bcache, there is also no way to re-add the bcache caching device to the replaced disk via charm actions. This is noted in Bug #1813359, "[wishlist] Action to 'bcachify-disk' needed for en..." (OpenStack ceph-osd charm), where the manual process is outlined as well.
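In the meantime, the manual bcache step looks roughly like this (device names, the cache set UUID, and the bcache device number are placeholders for your environment):

    # Make the replacement disk a bcache backing device
    sudo make-bcache -B /dev/sdf

    # Find the cache set UUID of the existing caching (SSD/NVMe) device
    sudo bcache-super-show /dev/nvme0n1p1 | grep cset.uuid

    # Attach the new backing device to that cache set
    echo <cset-uuid> | sudo tee /sys/block/bcache2/bcache/attach

    # Then point the add-disk action at the resulting /dev/bcacheN device
    # rather than the raw disk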
If you are interested in our managed cloud services to cover day 2 operations of this environment, please feel free to click the "Talk to Us" button at this link:
Managed OpenStack on Ubuntu | OpenStack | Ubuntu.
Best of luck,
-Drew