How to Use Advanced Pebble Features in Juju Charms: Custom Notices and Health Checks

At the Charm-Tech Office Hours on Sep 10 (08:00 AM UTC, 10:00 AM CEST), we will do a live demo of the features mentioned in this post. Join us! Google Meet: https://meet.google.com/uiv-tmrh-jnp

0 Background

Starting from Juju 3.4.0 (released on Feb 15, 2024), Pebble notices are supported. Notices are a subsystem that allows users to introspect events that occur in the Pebble server, as well as record custom client events with optional data. For more information, read Pebble notices on the official Pebble documentation website.

As of Juju 2.9.26, Pebble has included an HTTP /v1/health endpoint that allows users to query the health of configured checks. Starting from Juju 3.6 (still in beta as of Sept 2024), Juju fully supports Pebble health checks by adding support for the <container>-pebble-check-failed and <container>-pebble-check-recovered events. For a full list of Juju events, see List of events. For more information on Pebble health checks, refer to Pebble’s official documentation website.

In this post, I will walk you through the steps required to use these newly added features in Juju Charms. But first, let’s have a look at where they can make our lives easier.


1 A Brief Introduction to Custom Notices and Health Checks

1.1 Custom Notices

Imagine you have a database-type Charm running something like a PostgreSQL database as the workload. The database might need periodic backups, which create a local file in the pod; but for data backup and recovery reasons, you want to store the backups someplace else, like in an object store provided by a public cloud provider.

Theoretically, you could add some code in your workload container to call cloud APIs to back up the data, but as the team that owns the workload rather than its operations, you might not know where the backups should go.

What’s more, multicloud has become more and more popular today: according to research done by 451 Research (part of S&P Global Market Intelligence; read the full report here), multicloud is no longer a choice; we are already living in it: 98% of enterprises (the research studied 1,500 companies) using or planning to use public cloud in the next six months have already adopted a multicloud strategy.

Given these circumstances, the application/workload team might not know where to do the operations. What if the workload could send a notice when the backup finishes, and the Charm could catch that notice and then handle the backup, i.e., the operations part, for you?

This is precisely one of the problems that Pebble custom notices can help solve.

1.2 Health Checks

Health checks are a critical part of the application lifecycle in the DevOps culture, because you want to automatically take measures in response to certain check failures. Similar to K8s health checks, you can configure health checks for your workload with Pebble.

For example, you can run some commands to check if the disk is full, and if it is, you might want to automagically run some operation code to handle the situation.

For another example, you can also configure an HTTP endpoint health check to detect whether a certain service is still up, and if not, run some other operation code to remediate: for example, set the unit status in Juju, or execute something else to try to recover it.

Starting from Juju 3.6, these examples are not assumptions anymore; they are a reality: Juju will send events about failed and recovered health checks so that you can take measures accordingly.

Next, let’s have a look at a simple workload and Charm, then try to add these features to them.


2 Review the Initial Setup

2.1 Dev ENV

If you haven’t set up your local development environment yet, follow the “Set up your development environment” how-to guide in the Juju SDK documentation to do so. TL;DR: these are the prerequisites:

  • An Ubuntu VM
  • Python 3.8+
  • MicroK8s
  • Charmcraft
  • Juju (3.6 and above, use --channel=3.6/beta to install)

Note: Juju 3.6 or above is mandatory because, as mentioned above, Pebble health-check-related events are supported from version 3.6 on.

2.2 The Sample Service

Next, let’s have a look at the sample service we will use. Here’s the code.

Note: If you want to clone the repo and have a look in your favourite IDE, run the following command:

git clone -b 0-a-sample-service https://github.com/IronCore864/my-sample-service.git
cd my-sample-service
# do with it what you will

The core part of the app:

func main() {
	router := gin.Default()

	router.GET("/", homePage)

	router.Run()
}

func homePage(c *gin.Context) {
	c.String(http.StatusOK, "This is my home page")
}

As you can see, it’s a simple hello-world app written in Go using the Gin web framework, listening on port 8080 with only one route, /, serving as a mock of our workload.

You can have a look at what else is included in the repo, but it’s OK to skip it: the Docker image has already been built, and we will use it directly in the following sections without running any Docker commands, so that you can follow this guide smoothly and focus on what’s more important (Juju and Charms, of course).

2.3 The Charm

A bare-minimum Charm is prepared to deploy the above sample service as our workload; see the Charm here.

The core part of the Charm:

class MySampleServiceCharm(ops.CharmBase):
    def __init__(self, framework):
        super().__init__(framework)
        # ...
        framework.observe(
            self.on.my_sample_service_pebble_ready, self._update_layer_and_restart
        )

    def _update_layer_and_restart(self, event) -> None:
        self.unit.status = ops.MaintenanceStatus("Assembling Pebble layers")
        new_layer = self._pebble_layer.to_dict()
        try:
            # ...
            self.container.add_layer(
                "my_sample_service", self._pebble_layer, combine=True
            )

        # ...
        self.unit.status = ops.ActiveStatus()

    @property
    def _pebble_layer(self):
        pebble_layer = {
            # ...
        }
        return ops.pebble.Layer(pebble_layer)

It observes the pebble-ready event and configures a layer to deploy the sample service.
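For reference, the elided _pebble_layer dict typically looks something like the following sketch; the summary, description, and command strings here are assumptions for illustration, not the repo's exact values:

```python
# A minimal sketch of the elided Pebble layer definition; the summary,
# description, and command values are assumptions, not the repo's exact ones.
pebble_layer = {
    "summary": "my sample service layer",
    "description": "Pebble config layer for my-sample-service",
    "services": {
        "my-sample-service": {
            "override": "replace",
            "summary": "my sample service",
            "command": "/usr/local/bin/my-sample-service",
            "startup": "enabled",
        }
    },
}
```

A dict shaped like this is what ops.pebble.Layer() receives in the _pebble_layer property above.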

If you want to clone the repo and read it in your favourite IDE, run the following command:

git clone -b 0-a-sample-charm https://github.com/IronCore864/my-sample-service-operator.git

2.4 Deploy the Charm

We can pack and deploy the Charm by running:

charmcraft pack
juju deploy ./my-sample-service_ubuntu-22.04-arm64.charm --resource my-sample-service=ironcore864/my-sample-service:1.0.0

Note: Since I’m on an arm64 machine, the packed charm is named my-sample-service_ubuntu-22.04-arm64.charm. Replace the name with my-sample-service_ubuntu-22.04-amd64.charm if you are on the amd64 architecture; same below.

After deployment, the sample service will be up and running, which can be verified by curling the unit’s :8080/ endpoint.


3 Pebble Custom Notices

To use Pebble custom notices, there are two parts: a party that sends the notice (here, the workload) and a party that receives and then handles it (here, the Charm).

Next, let’s add exactly these to our sample service and Charm.

3.1 Sending a Custom Notice

Since the workload and Pebble run in the same Pod, the workload can call pebble notify directly (using the full path /charm/bin/pebble).

Imagine our sample service has an endpoint /backup-db that, when called, backs up the database and creates a local backup file. After the backup finishes, we want the workload to notify Pebble that the backup is done. Let’s modify our sample service to do so.

First, add a /backup-db route:

func main() {
	router := gin.Default()

	router.GET("/", homePage)
	router.GET("/backup-db", backupDB)

	router.Run()
}

Then, let’s implement the backupDB function:

func backupDB(c *gin.Context) {
	log.Println("DB Backup started ...")

	time.Sleep(1 * time.Second)
	err := sendPebbleNotification()
	if err != nil {
		log.Println("DB Backup failed!")
		c.String(http.StatusInternalServerError, "DB Backup failed")
		return
	}

	log.Println("DB Backup finished!")
	c.String(http.StatusOK, "DB Backup finished successfully")
}

// sendPebbleNotification records a custom Pebble notice with an optional data field.
func sendPebbleNotification() error {
	cmd := exec.Command("/charm/bin/pebble", "notify", "guotiexin.com/db/backup", "path=/tmp/mydb.sql")
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("couldn't execute pebble notify: %w", err)
	}
	return nil
}

This function mocks a DB backup using sleep, and after it’s done, it executes the /charm/bin/pebble notify command with the key guotiexin.com/db/backup and the optional data path=/tmp/mydb.sql.
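If your workload happened to be written in Python instead of Go, the same notification could be sent with a small helper like this (a sketch; only the pebble binary path, the notice key, and the key=value data format come from the example above):

```python
import subprocess

def build_notify_command(key, **data):
    """Build the argv for `pebble notify <key> [k=v ...]`."""
    return ["/charm/bin/pebble", "notify", key] + [
        f"{k}={v}" for k, v in data.items()
    ]

def send_pebble_notification(key, **data):
    # Raises CalledProcessError if the notice could not be recorded.
    subprocess.run(build_notify_command(key, **data), check=True)
```

For example, send_pebble_notification("guotiexin.com/db/backup", path="/tmp/mydb.sql") mirrors the Go sendPebbleNotification above.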

Note: You can view the full code here. If you want to view it locally, clone it with this command: git clone -b 1-custom-notice https://github.com/IronCore864/my-sample-service.git. Or, if you have already cloned it in the previous step, switch to the 1-custom-notice branch in the repo by running git checkout 1-custom-notice. Alternatively, you can view the diff here.

3.2 Respond to a Custom Notice

Next, let’s watch the pebble_custom_notice event in the Charm:

    def __init__(self, framework):
        super().__init__(framework)
        self.pebble_service_name = "my-sample-service"
        self.container = self.unit.get_container("my-sample-service")
        framework.observe(
            self.on.my_sample_service_pebble_ready, self._update_layer_and_restart
        )
        framework.observe(
            self.on["my_sample_service"].pebble_custom_notice,
            self._on_pebble_custom_notice,
        )

Then we can define the _on_pebble_custom_notice handler:

    def _on_pebble_custom_notice(self, event: ops.PebbleCustomNoticeEvent) -> None:
        if event.notice.key == "guotiexin.com/db/backup":
            path = event.notice.last_data["path"]
            logger.info("Backup finished. Backup file: %s", path)
            logger.info("Uploading backup file to ...")

Here we switch on the notice’s key, and if it matches what we sent from the workload, we take some actions (here we get the path from the data and log something; in the real world, you can respond to the event however you like, for example, by detecting which cloud the app is running in and then backing the file up to that cloud provider’s object store).
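If the Charm grows to handle several notice keys, a small dispatch table keeps the handler flat. Here is a plain-Python sketch (the handler and table names are hypothetical, not part of ops):

```python
from typing import Callable, Dict, Optional

def handle_db_backup(data: Dict[str, str]) -> str:
    # Hypothetical handler: in a real charm this would upload the file.
    return f"uploading {data['path']} to object storage"

# Map each expected notice key to its handler.
NOTICE_HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "guotiexin.com/db/backup": handle_db_backup,
}

def dispatch_notice(key: str, last_data: Dict[str, str]) -> Optional[str]:
    """Route a notice to its handler; ignore keys we don't know about."""
    handler = NOTICE_HANDLERS.get(key)
    return handler(last_data) if handler else None
```

Inside _on_pebble_custom_notice, this would be called as dispatch_notice(event.notice.key, event.notice.last_data).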

Note: You can see the full code here. If you want to view it locally, clone it with this command: git clone -b 1-custom-notice https://github.com/IronCore864/my-sample-service-operator.git. Or, if you have already cloned it in the previous step, switch to the 1-custom-notice branch in the repo by running git checkout 1-custom-notice. Alternatively, you can view the diff here.

3.3 Pack and Redeploy

After the above code change, let’s repack the Charm and refresh the deployment:

charmcraft pack
juju refresh --path="./my-sample-service_ubuntu-22.04-arm64.charm" my-sample-service --force-units --resource my-sample-service=ironcore864/my-sample-service:1.0.0

After juju status shows active, find the unit address and access the /backup-db endpoint:

curl 10.1.226.144:8080/backup-db

If we check the workload container’s log (kubectl logs -f -c my-sample-service my-sample-service-0), we will see similar logs:

2024-08-28T13:52:40.719Z [my-sample-service] 2024/08/28 13:52:40 DB Backup started ...
2024-08-28T13:52:41.733Z [pebble] POST /v1/notices 10.60448ms 200
2024-08-28T13:52:41.733Z [pebble] GET /v1/notices?after=2024-08-28T13%3A52%3A06.113711524Z&timeout=30s 5.295186415s 200
2024-08-28T13:52:41.734Z [my-sample-service] 2024/08/28 13:52:41 DB Backup finished!
2024-08-28T13:52:41.734Z [my-sample-service] [GIN] 2024/08/28 - 13:52:41 | 200 |   1.01457296s |   192.168.64.18 | GET      "/backup-db"
2024-08-28T13:52:41.822Z [pebble] GET /v1/notices/289 61.458µs 200

This means the workload container has successfully recorded a notice by running the pebble notify command after the DB backup finishes.

If we check juju debug-log, we will see logs like:

unit-my-sample-service-0: 21:52:06 INFO unit.my-sample-service/0.juju-log Backup finished. Backup file: /tmp/mydb.sql
unit-my-sample-service-0: 21:52:06 INFO unit.my-sample-service/0.juju-log Uploading backup file to ...

As we can see, these log lines come from the Charm, meaning the Charm has observed the custom notice event sent by Juju, and responded to it successfully.

With the custom notices feature sorted out, next, let’s add health checks for our sample service charm.


4 Pebble Health Checks

To use Pebble health checks, these are the requirements:

  • The workload should have a health check endpoint (or, we can define some command to execute health checks).
  • The health checks should be configured in a Pebble layer.
  • The Charm should watch for health-check-related events and handle them accordingly.

Next, let’s work them out one by one.

4.1 Adding a /health Endpoint for Our Sample Service

First, let’s define a /health HTTP endpoint for our sample service that can be used as a health check endpoint in Pebble.

Add a new route:

func main() {
	router := gin.Default()

	router.GET("/", homePage)
	router.GET("/backup-db", backupDB)
	router.GET("/health", healthCheck)

	router.Run()
}

Then, implement the healthCheck function as follows:

func healthCheck(c *gin.Context) {
	if rand.Intn(10) == 0 {
		c.String(http.StatusInternalServerError, "Health check failed")
		return
	}
	c.String(http.StatusOK, "Health check passed")
}

Note: Here we use a random number generator to simulate a 10% chance of a failed health check.

You can view the full code here, or clone it by running git clone -b 2-add-health-check https://github.com/IronCore864/my-sample-service.git. If you have already cloned the repo in previous steps, switch to the branch by running git checkout 2-add-health-check. Alternatively, view the diff here.

4.2 Configure Health Checks in Pebble

To configure a health check, update the Pebble layer config with the following:

	# ...
    "checks": {
        "health": {
            "override": "replace",
            "threshold": 1,
            "http": {
                "url": "http://localhost:8080/health",
            },
        },
    },
    # ...

Explanations:

  • The replace value in the override field means that this check will entirely override an existing check spec with the same name in the plan, making the check defined here the single source of truth.
  • threshold: Number of times in a row the check must error to be considered a failure (which triggers the on-check-failure action). It defaults to 3, but here we change it to 1 so that it fails faster and we can observe the result quicker.
  • The http keyword configures an HTTP check, which succeeds if a GET request to the specified URL returns a 20x status code. We define the check’s url but leave its other optional field, headers, empty.
  • Here we left some other fields empty to use their default values. For example:
    • period: defaults to 10 seconds; this is the interval between checks.
    • timeout: defaults to 3 seconds. If this time elapses before a single check operation has finished, it is cancelled and considered an error.

For more information on how to configure health checks, read Pebble health checks and the Layer specification.
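Spelling those defaults out explicitly, the same check config could be written like this (the period and timeout values are the defaults described above; other optional fields are still omitted):

```python
# The same "health" check, with the default period and timeout written
# out explicitly instead of relying on Pebble's defaults.
checks_config = {
    "checks": {
        "health": {
            "override": "replace",
            "period": "10s",    # default: run the check every 10 seconds
            "timeout": "3s",    # default: each attempt must finish in 3s
            "threshold": 1,     # default is 3; lowered to fail faster
            "http": {
                "url": "http://localhost:8080/health",
            },
        },
    },
}
```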

4.3 Observe and Respond to Health Check Events

To make the Charm respond to health check events, the pebble_check_failed and pebble_check_recovered events should be watched:

    def __init__(self, framework):
        super().__init__(framework)
        self.pebble_service_name = "my-sample-service"
        self.container = self.unit.get_container("my-sample-service")
        framework.observe(
            self.on.my_sample_service_pebble_ready, self._update_layer_and_restart
        )
        framework.observe(
            self.on["my_sample_service"].pebble_custom_notice,
            self._on_pebble_custom_notice,
        )
        framework.observe(
            self.on["my_sample_service"].pebble_check_failed,
            self._on_pebble_check_failed,
        )
        framework.observe(
            self.on["my_sample_service"].pebble_check_recovered,
            self._on_pebble_check_recovered,
        )

Then we can define the _on_pebble_check_failed and _on_pebble_check_recovered handlers:

    def _on_pebble_check_failed(self, event: ops.PebbleCheckFailedEvent):
        if event.info.name == "health":
            logger.info("The http health check failed!")
            self.unit.status = ops.ActiveStatus("Degraded functionality ...")

    def _on_pebble_check_recovered(self, event: ops.PebbleCheckRecoveredEvent):
        if event.info.name == "health":
            logger.info("The http health check has recovered!")
            self.unit.status = ops.ActiveStatus()

Here we switch on the check’s name, then log something and set the unit status. In the real world, you can implement remediations here according to the failed checks.
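As a charm accumulates checks, the pair of handlers above can share a lookup table of per-check status messages. A plain-Python sketch (the table and function names are hypothetical; the message string is the one used above):

```python
from typing import Optional

# Hypothetical table: the status message to set when a named check
# fails. Only the "health" check from the layer above is listed.
FAILED_CHECK_MESSAGES = {
    "health": "Degraded functionality ...",
}

def status_message_for(check_name: str) -> Optional[str]:
    """Return the failure message for a check, or None if unknown."""
    return FAILED_CHECK_MESSAGES.get(check_name)
```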

Note: You can view the full code here, or clone it by running git clone -b 2-add-health-check https://github.com/IronCore864/my-sample-service-operator.git. If you have already cloned the repo in previous steps, switch to the branch by running git checkout 2-add-health-check. Alternatively, view the diff here.

4.4 Pack and Redeploy

After the above code change, let’s repack the Charm and refresh the deployment:

charmcraft pack
juju refresh --path="./my-sample-service_ubuntu-22.04-arm64.charm" my-sample-service --force-units --resource my-sample-service=ironcore864/my-sample-service:1.0.0

After juju status shows active, if we check the workload container’s logs, we will see logs like:

2024-08-28T14:11:51.239Z [my-sample-service] [GIN] 2024/08/28 - 14:11:51 | 200 |      15.208µs |             ::1 | GET      "/health"
2024-08-28T14:12:01.236Z [my-sample-service] [GIN] 2024/08/28 - 14:12:01 | 200 |      22.124µs |             ::1 | GET      "/health"
2024-08-28T14:12:11.241Z [my-sample-service] [GIN] 2024/08/28 - 14:12:11 | 200 |      18.208µs |             ::1 | GET      "/health"
2024-08-28T14:12:21.238Z [my-sample-service] [GIN] 2024/08/28 - 14:12:21 | 200 |      15.959µs |             ::1 | GET      "/health"
2024-08-28T14:12:31.235Z [my-sample-service] [GIN] 2024/08/28 - 14:12:31 | 200 |      15.458µs |             ::1 | GET      "/health"
2024-08-28T14:12:41.235Z [my-sample-service] [GIN] 2024/08/28 - 14:12:41 | 200 |      17.042µs |             ::1 | GET      "/health"
2024-08-28T14:12:51.239Z [my-sample-service] [GIN] 2024/08/28 - 14:12:51 | 500 |      14.333µs |             ::1 | GET      "/health"
2024-08-28T14:12:51.248Z [pebble] Check "health" failure 1/1: non-20x status code 500
2024-08-28T14:12:51.248Z [pebble] Check "health" threshold 1 hit, triggering action and recovering
2024-08-28T14:12:51.248Z [pebble] Change 294 task (Perform HTTP check "health") failed: non-20x status code 500
2024-08-28T14:13:01.269Z [my-sample-service] [GIN] 2024/08/28 - 14:13:01 | 500 |       5.667µs |             ::1 | GET      "/health"
2024-08-28T14:13:01.277Z [pebble] Check "health" failure 2/1: non-20x status code 500
2024-08-28T14:13:11.267Z [my-sample-service] [GIN] 2024/08/28 - 14:13:11 | 200 |        5.75µs |             ::1 | GET      "/health"

We can see that sometimes the health check fails (we mocked a 10% chance failure rate), and when it does, we can see from juju debug-log that the Charm has received and responded to events on the failed then recovered checks:

unit-my-sample-service-0: 22:12:51 INFO unit.my-sample-service/0.juju-log The http health check failed!
unit-my-sample-service-0: 22:12:51 INFO juju.worker.uniter.operation ran "my-sample-service-pebble-check-failed" hook (via hook dispatching script: dispatch)
unit-my-sample-service-0: 22:13:11 INFO unit.my-sample-service/0.juju-log The http health check has recovered!
unit-my-sample-service-0: 22:13:11 INFO juju.worker.uniter.operation ran "my-sample-service-pebble-check-recovered" hook (via hook dispatching script: dispatch)

When the health check fails, we can also see from juju status that the workload has the message “Degraded functionality …”, which is set by our Charm.


5 Summary

In the examples above, we implemented some simple mock operations based on custom notices and health checks.

In the real world, the applications could be endless! Just a few examples on the custom notices:

  • The backup example above is quite useful in many cases, especially if you need credentials to put the backup somewhere; in that case, the credentials are best stored as a user secret, and it’s better to confine them to the charm container rather than let the workload have access to them.
  • Charms sometimes want to do things when the workload is ready (setting the status, the workload version, etc.). Custom notices can potentially be used for this.
  • Sometimes charms want to do something on a schedule, or at a specific time. Juju doesn’t support this natively, but you can schedule custom notices in the workload. You might also want to change status or coordinate with other units when they fire.
  • Rolling ops: each unit needs to release the lock when its part (e.g., an upgrade) is done. You might not want to block for the entire time the operation is running, so a custom notice lets you “resume” the charm code when the operation is finished.

Custom notices extend the capabilities of the workload, like Iron Man’s suit giving you extra superpowers; health checks take the lifecycle management of the workload to another level.

With these newly added capabilities, the Charm acts like a second runtime for the workload container, to which traditional middleware functions are migrated, achieving loose coupling between business logic and distributed-system concerns. It’s a bit like the Multi-Runtime Microservices Architecture (aka Mecha architecture). Powerful stuff, right? I know.

If you enjoyed reading this post, please like, comment, and subscribe. See you in the next piece!

