Back up control plane

dnplas · 23 April 2024 07:23

This guide describes how to back up the Charmed Kubeflow (CKF) control plane data to S3-compatible storage.

Follow all of these steps in one go, so that the databases, the pipelines MinIO bucket, and the ML Metadata database are backed up at the same time. Failing to do so may result in data loss.

Running Kubeflow pipelines and Katib experiments can affect the outcome of the backup. Make sure all pipelines and experiments are stopped and that no other processes, such as Jupyter Notebooks, are calling them.

User workloads in user namespaces are not backed up.

Requirements

Access to S3-compatible storage, such as RadosGW, AWS S3, or MinIO, for the backup data.
Admin access to the Kubernetes cluster where CKF is deployed.
Juju admin access to the kubeflow model.
yq binary.
Ensure the local storage is big enough to back up the data.
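
The requirements above can be checked from the shell before starting. The following is a minimal sketch, not part of the official procedure; the function names `missing_tools` and `free_kib` are hypothetical helpers introduced here for illustration.

```shell
# Print the name of every listed command that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Print the free space, in KiB, of the filesystem that holds the given
# directory (defaults to the current directory).
free_kib() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}
```

For example, `missing_tools kubectl juju yq rclone` prints nothing and succeeds when every tool is installed, and `free_kib .` can be compared against the expected size of the backup.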

Configure `rclone`

rclone is a command-line tool for managing files in cloud storage. This guide uses it to copy the backup files, and it can be installed as a snap:

sudo snap install rclone

Connect to a shared S3 storage

Configure rclone to connect to the shared S3 storage. The following can be used as reference:

[remote-s3]
type = s3
provider = AWS
env_auth = true
access_key_id = ...
secret_access_key = ...
region = eu-central-1
acl = private
server_side_encryption = AES256

You can check where this configuration file is located with rclone config file.

Save the name of the S3 remote in an ENV variable:

RCLONE_S3_REMOTE=remote-s3
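
Before continuing, it may be worth confirming that rclone actually sees the remote defined above. A minimal sketch, assuming the configuration file has been saved; `remote_exists` is a hypothetical helper name.

```shell
# Succeed only if the named remote appears in the output of
# `rclone listremotes`, which prints one remote per line with a
# trailing colon (for example "remote-s3:").
remote_exists() {
    rclone listremotes | grep -Fxq "$1:"
}
```

For example, `remote_exists "$RCLONE_S3_REMOTE" && rclone lsd "$RCLONE_S3_REMOTE:"` lists the buckets only when the remote is configured.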

Connect to CKF MinIO

The following steps require an accessible MinIO endpoint, which you can expose by port-forwarding the minio service:

kubectl port-forward -n kubeflow svc/minio 9000:9000
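
The port-forward command blocks the terminal while it runs. If you prefer to keep working in the same shell, one option is to run it in the background and stop it once the backup is done. This is only a sketch; the function names and the `MINIO_FORWARD_PID` variable are hypothetical.

```shell
# Start the port-forward in the background and remember its PID.
start_minio_forward() {
    kubectl port-forward -n kubeflow svc/minio 9000:9000 >/dev/null 2>&1 &
    MINIO_FORWARD_PID=$!
}

# Stop the port-forward started above, if there is one.
stop_minio_forward() {
    if [ -n "${MINIO_FORWARD_PID:-}" ]; then
        kill "$MINIO_FORWARD_PID" 2>/dev/null || true
        wait "$MINIO_FORWARD_PID" 2>/dev/null || true
        MINIO_FORWARD_PID=
    fi
}
```

Call `start_minio_forward` before the rclone steps that read from MinIO and `stop_minio_forward` after the `mlpipeline` bucket has been synced.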

Get minio’s secret-key value:

juju show-unit kfp-ui/0 \
        | yq '.kfp-ui/0.relation-info.[] | select (.endpoint == "object-storage") | .application-data.data' \
        | yq '.secret-key'

Get minio’s access-key:

juju config minio access-key

Configure rclone to connect to CKF MinIO. The following can be used as reference:

[minio-ckf]
type = s3
provider = Minio
access_key_id = minio
secret_access_key = ...
endpoint = http://localhost:9000
acl = private

Save the name of the MinIO remote in an ENV variable:

RCLONE_MINIO_REMOTE=minio-ckf

Back up CKF databases to S3 storage

CKF uses katib-db and kfp-db as databases for Katib and Kubeflow pipelines respectively.

Deploy and configure the s3-integrator to connect to the shared S3 storage.

See the S3 AWS and S3 RadosGW configuration guides for this step.

Scale up kfp-db and katib-db.

This step prevents the primary database from becoming unavailable during the backup.

juju scale-application kfp-db 2
juju scale-application katib-db 2
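
Scaling is not instantaneous, so the new units may need some time to settle before a backup is started. A generic polling helper such as the one below can be used to wait. It is a sketch only; `wait_until` is a hypothetical name, and the readiness check you pass to it is up to you.

```shell
# Re-run a command every STEP seconds until it succeeds or LIMIT
# seconds have passed.
# Usage: wait_until LIMIT STEP command [args...]
wait_until() {
    limit=$1
    step=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep "$step"
        elapsed=$((elapsed + step))
    done
    return 0
}
```

For example, `wait_until 600 10 db_ready kfp-db` retries for up to ten minutes, where `db_ready` stands for a check you write yourself, for instance around the output of `juju status`.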

Create a backup for each database.

In the commands from that guide, replace mysql-k8s with the name of the database you are backing up (kfp-db or katib-db).

Back up ML metadata using `sqlite3`

The mlmd charm uses a SQLite database to store ML metadata generated from Kubeflow pipelines.

Install the required tools inside the application container:

This step expects the mlmd application container to have Internet access.

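# Set only the pair that matches your deployment.
# Running both pairs leaves the CKF 1.8 values in effect.

# MLMD > 1.14, CKF 1.9
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd-grpc-server"

# MLMD 1.14, CKF 1.8
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd"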

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
    /bin/bash -c "apt update && apt install sqlite3 -y"
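
Because the container name differs between releases, it can also be detected from the pod itself instead of being set by hand. The sketch below assumes the pod only ever uses one of the two names listed above; `pick_mlmd_container` is a hypothetical helper.

```shell
# Pick the MLMD container from a space-separated list of container
# names, preferring the newer name and falling back to the older one.
pick_mlmd_container() {
    for candidate in mlmd-grpc-server mlmd; do
        for name in $1; do
            if [ "$name" = "$candidate" ]; then
                echo "$name"
                return 0
            fi
        done
    done
    return 1
}

# Possible usage (requires access to the cluster):
#   names=$(kubectl get pod -n kubeflow "$MLMD_POD" \
#       -o jsonpath='{.spec.containers[*].name}')
#   MLMD_CONTAINER=$(pick_mlmd_container "$names")
```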

Scale down kfp-metadata-writer. This is done to prevent any additional writes to mlmd.

juju scale-application kfp-metadata-writer 0

Perform a database backup.

This dumps all the database contents into a compressed text file inside the mlmd-0 container:

MLMD_BACKUP=mlmd-$(date -d "today" +"%Y-%m-%d-%H-%M").dump.gz

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
	/bin/bash -c \
	"sqlite3 /data/mlmd.db .dump | gzip -c >/tmp/$MLMD_BACKUP"

Copy the backup file to local storage:

kubectl cp -n kubeflow -c $MLMD_CONTAINER \
	$MLMD_POD:/tmp/$MLMD_BACKUP \
	./$MLMD_BACKUP
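
Before uploading, you may want to confirm that the local copy is intact. A complete `sqlite3 .dump` ends with a `COMMIT;` line, so a truncated or failed dump can be caught with a check like the following. This is a sketch; `verify_dump` is a hypothetical helper.

```shell
# Check a gzip-compressed `sqlite3 .dump` file: the archive must
# decompress cleanly and the last line of the SQL must be COMMIT;
verify_dump() {
    gzip -t "$1" 2>/dev/null || return 1
    [ "$(gzip -dc "$1" | tail -n 1)" = "COMMIT;" ]
}
```

For example, `verify_dump "./$MLMD_BACKUP" || echo "backup looks incomplete"`.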

Copy the mlmd backup data to the S3 storage:

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3
RCLONE_BWIDTH_LIMIT=20M

rclone --size-only copy \
	--bwlimit $RCLONE_BWIDTH_LIMIT \
	./$MLMD_BACKUP \
	$RCLONE_S3_REMOTE:$S3_BUCKET
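
To make sure the file really reached the bucket, the copy can be followed by a listing of the destination. The wrapper below is a sketch under the same variable names used above; `upload_and_confirm` is a hypothetical helper.

```shell
# Copy one local file to DEST (for example "remote-s3:backup-bucket-2024")
# and confirm that a file with the same name is listed there afterwards.
# Usage: upload_and_confirm FILE DEST
upload_and_confirm() {
    rclone copy --size-only --bwlimit "${RCLONE_BWIDTH_LIMIT:-20M}" \
        "$1" "$2" || return 1
    rclone lsf "$2" | grep -Fxq "$(basename "$1")"
}
```

For example, `upload_and_confirm "./$MLMD_BACKUP" "$RCLONE_S3_REMOTE:$S3_BUCKET"`. It is safer to remove the local copy only after this succeeds.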

Optionally, you can remove the mlmd data from your local machine:

rm -rf $MLMD_BACKUP

Scale up kfp-metadata-writer:

juju scale-application kfp-metadata-writer 1
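
If any step between scaling down and scaling up fails, kfp-metadata-writer stays at zero units until someone notices. One way to reduce that risk is to wrap the backup in a helper that always scales the application back up. This is a sketch, not the documented procedure; `with_writer_stopped` is a hypothetical name.

```shell
# Run a command with kfp-metadata-writer scaled to 0, then scale it
# back to 1 whether or not the command succeeded.
# Returns the exit status of the wrapped command.
with_writer_stopped() {
    juju scale-application kfp-metadata-writer 0 || return 1
    rc=0
    "$@" || rc=$?
    juju scale-application kfp-metadata-writer 1
    return "$rc"
}
```

For example, `with_writer_stopped run_mlmd_backup`, where `run_mlmd_backup` stands for a function of your own that contains the dump, copy, and upload steps.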

Back up `mlpipeline` MinIO bucket

Sync all files from minio to the shared S3 storage:

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3
RCLONE_BWIDTH_LIMIT=20M

rclone --size-only sync \
	--bwlimit $RCLONE_BWIDTH_LIMIT \
	$RCLONE_MINIO_REMOTE:mlpipeline \
	$RCLONE_S3_REMOTE:$S3_BUCKET/mlpipeline
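
After the sync, `rclone check` can compare the source and the destination and report any mismatch. The sketch below chains the two commands and uses the same size-only comparison as the sync; `sync_and_check` is a hypothetical helper.

```shell
# Sync SRC to DEST, then compare both sides by file size.
# Usage: sync_and_check SRC DEST
sync_and_check() {
    rclone sync --size-only --bwlimit "${RCLONE_BWIDTH_LIMIT:-20M}" \
        "$1" "$2" || return 1
    rclone check --size-only "$1" "$2"
}
```

For example, `sync_and_check "$RCLONE_MINIO_REMOTE:mlpipeline" "$RCLONE_S3_REMOTE:$S3_BUCKET/mlpipeline"`.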

Back up ML metadata with `kubectl`

As an alternative to the sqlite3 method, you can copy the database file directly using kubectl.

Scale down kfp-metadata-writer. This is done to prevent any additional writes to mlmd:

juju scale-application kfp-metadata-writer 0

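Copy the database file to local storage:

# Set only the pair that matches your deployment.
# Running both pairs leaves the CKF 1.8 values in effect.

# MLMD > 1.14, CKF 1.9
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd-grpc-server"

# MLMD 1.14, CKF 1.8
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd"

# Name of the local copy (the .db suffix is only a suggestion)
MLMD_BACKUP=mlmd-$(date -d "today" +"%Y-%m-%d-%H-%M").db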

kubectl cp -n kubeflow -c $MLMD_CONTAINER \
	$MLMD_POD:/data/mlmd.db \
	./$MLMD_BACKUP
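
A quick sanity check on the copied file is to look at its header: every SQLite 3 database file starts with the string "SQLite format 3" followed by a NUL byte. The following sketch tests for that; `is_sqlite_file` is a hypothetical helper, and passing the check does not prove the database is free of corruption.

```shell
# Succeed if the file starts with the SQLite 3 header string.
# Only the first 15 bytes are compared, leaving out the trailing NUL.
is_sqlite_file() {
    [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}
```

For example, `is_sqlite_file "./$MLMD_BACKUP" || echo "copy does not look like a SQLite database"`.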

Copy the mlmd backup data to the S3 storage:

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3
RCLONE_BWIDTH_LIMIT=20M

rclone --size-only copy \
	--bwlimit $RCLONE_BWIDTH_LIMIT \
	./$MLMD_BACKUP \
	$RCLONE_S3_REMOTE:$S3_BUCKET

Optionally, you can remove the mlmd backup data from your local machine:

rm -rf $MLMD_BACKUP

Scale up kfp-metadata-writer:

juju scale-application kfp-metadata-writer 1

birru2 · 13 December 2024 07:01

Hi,

It is mentioned in the document in the Requirements part:

Access to a S3 storage used for the backup data - only AWS S3 and S3 RadosGW are supported.

I think a separate Minio is also an alternative. An organization might not get an S3 service from AWS and might not have a Ceph radosgw deployment. In such cases, Minio is an alternative to back up all the data.

Thanks

dnplas · 18 December 2024 14:00

Thanks for noticing! Yeah, it can be any S3 storage available to the user/customer. The guide suggests AWS S3 and S3 RadosGW, but the configuration for Minio would be exactly the same! Here’s an example of using the s3-integrator with Minio.

I’ll change the message to make it clear for everyone.

mykola.marzhan · 11 June 2025 06:14

Good document. For the database backup process, I’d like to suggest an alternative that might be more efficient. Instead of the current method:

“sqlite3 /data/mlmd.db .dump | gzip -c >/tmp/$MLMD_BACKUP”

We could consider using the native .backup command:

“sqlite3 /data/mlmd.db .backup /tmp/$MLMD_BACKUP”

Such a method has a number of advantages:

The .backup operation is atomic and doesn’t require taking the kfp-metadata-writer service offline.
The backup is smaller than a text-based dump, making gzip unnecessary and avoiding decompression overhead during recovery.
If the target filesystem supports deduplication (like ZFS or Btrfs) and both the source and backup are on the same volume, shared data can be stored only once, saving space.
Restoration is much faster as it’s a direct file copy, not a re-execution of SQL statements.
