Charm development best practices

0x12b · 29 November 2021 11:52

This document describes the current best practices around developing and contributing to a charm.

Contents:

Conventions
Documentation
Custom events
Backward compatibility
Dependency management
Code style
Patterns
- Fetching network information
- Random values
Testing
- Unit tests
- Functional tests
- Integration tests
Recommended tooling
- Continuous integration
- Linters
Common integrations
- Observability
Security considerations
Example repositories

Conventions

Programming languages and frameworks

The primary programming language for writing charms is Python, and the primary framework for developing charms is the Python Operator Framework, or ops.

The recommended way to import the library is to write import ops and use it similar to:

import ops
class MyCharm(ops.CharmBase):
    ...
    def _my_event_handler(self, event: ops.EventBase):  # or a more specific type
        ...
        self.unit.status = ops.ActiveStatus()
...
if __name__ == "__main__":
    ops.main(MyCharm)

Naming

Use clear and consistent naming. For example, prometheus includes multiple charms that cover different scenarios:

Running on bare metal as a machine charm, under the name prometheus
Running in kubernetes as a k8s charm, under the name prometheus-k8s

See more: Charm naming guidelines

When naming configuration items or actions, prefer lowercase alphanumeric names, separated with dashes if required. For example, timeout or enable-feature. For charms that have already standardized on underscores, it is not necessary to change them, and it is better to be consistent within a charm than to have some values use dashes and some use underscores.

State

Write your charm to be stateless. If a charm needs to track state between invocations it should be done as described in the guide on uses and limitations of stored state.

For sharing state between units of the same application, use peer relation data bags.

Do not track the emission of events, or elements relating to the charm’s lifecycle, in stored state. Where possible, construct this information by accessing the model, i.e. self.model, and the charm’s relations (peer relations or otherwise).

Revisions

If your charm’s workload is delivered by a snap: Pin the snap revision to the charm revision by hard-coding it in the charm. Examples:

mysql-router-operator (the snap revision is hard-coded directly; note that this charm also keeps it in sync with the workload version – something database charms like to do because, to minimize risk and cost, they favor in-place upgrades, and this helps them check compatibility prior to the upgrade)
postgresql-operator (the snap revision is hard-coded in a format suitable for when you have multiple workload snaps; because the charm is a multi-architecture charm, the revision is also pinned to a particular architecture, and will be selected depending on the current architecture)

Resources

Resources can either be of the type oci-image or file. When providing binary files as resources, provide binaries for all CPU architectures your binary might end up being run on. An example of this can be found here.

Implement the usage of these resources in such a way that the user may build a binary for their architecture of choice and supply it themselves. An example of this can be found here.

Integrations

Use Charm Libraries to distribute code that simplifies implementing any integration for people who wish to integrate with your application. Name charm libraries after the integration interfaces they manage (example).

Implement a separate class for each side of the relation in the same library, for instance:

class MetricsEndpointProvider(ops.Object):
    ...

class MetricsEndpointRequirer(ops.Object):
    ...

These classes should do whatever is necessary to handle any relation events specific to the relation interface you are implementing, throughout the lifecycle of the application. By passing the charm object into the constructor to either the Provider or Requirer, you can gain access to the on attribute of the charm and register event handlers on behalf of the charm, as required.

Application and unit statuses

Only make changes to the charm’s application or unit status directly within an event handler.

An example:

class MyCharm(ops.CharmBase):

    # This is an event handler, and can therefore set status
    def _on_config_changed(self, event):
        if self._some_helper():
            self.unit.status = ops.ActiveStatus()

    # This is a helper method, not an event handler, so don't set status here
    def _some_helper(self):
        # do stuff
        return True

Libraries should never mutate the status of a unit or application. Instead, use return values, or raise exceptions and let them bubble back up to the charm for the charm author to handle as they see fit.

In cases where the library has a suggested default status to be raised, use a custom exception with a .status property containing the suggested charm status as shown here or here. The calling charm can then choose to accept the default by setting self.unit.status to raised_exception.status or do something else.
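
As a rough illustration of the exception pattern described above, here is a minimal sketch. To keep it runnable without ops installed, a small stand-in dataclass (SuggestedStatus) takes the place of a real status class such as ops.BlockedStatus. The names DatabaseNotReadyError, fetch_endpoint and handle_event are hypothetical and are not taken from any existing charm library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestedStatus:
    """Stand-in for an ops status object (assumption: keeps the sketch dependency-free)."""

    name: str
    message: str


class DatabaseNotReadyError(Exception):
    """Hypothetical library exception that carries a suggested status."""

    @property
    def status(self) -> SuggestedStatus:
        # The library only suggests a status; it never sets one itself.
        return SuggestedStatus("blocked", str(self))


def fetch_endpoint(relation_data: dict) -> str:
    """Library code: raise instead of mutating unit or application status."""
    if "endpoint" not in relation_data:
        raise DatabaseNotReadyError("waiting for database endpoint")
    return relation_data["endpoint"]


def handle_event(relation_data: dict) -> SuggestedStatus:
    """Charm-side code: decides whether to accept the suggested status."""
    try:
        fetch_endpoint(relation_data)
    except DatabaseNotReadyError as err:
        # In a real charm this would be: self.unit.status = err.status
        return err.status
    return SuggestedStatus("active", "")
```

The charm stays in control: it could just as well ignore err.status and pick a different status or defer the event.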

Logging

Templating

Use the default Python logging module. The default charmcraft init template will set this up for you. Do not build strings for the logger. This avoids the string formatting if it’s not needed at the given log level.

Prefer

logger.info("something %s", var)

over

logger.info("something {}".format(var))

# or

logger.info(f"something {var}")

Due to logging features, using f-strings or str.format is a security risk (see issue46200) when creating log messages and also causes the string formatting to be done even if the log level for the message is disabled.
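
The performance half of this argument can be seen in a small, self-contained sketch. The Expensive class is a hypothetical helper that counts how often it is converted to a string; the logger name is arbitrary.

```python
import logging


class Expensive:
    """Hypothetical object that counts how often it is rendered to a string."""

    def __init__(self) -> None:
        self.renders = 0

    def __str__(self) -> str:
        self.renders += 1
        return "expensive value"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.INFO)  # debug messages are disabled

lazy = Expensive()
# %-style arguments are only formatted if the message is actually emitted.
logger.debug("something %s", lazy)

eager = Expensive()
# The f-string is built before logger.debug is even called.
logger.debug(f"something {eager}")
```

After running this, lazy.renders is still 0 while eager.renders is 1, even though neither message was emitted.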

Frequency

Avoid spurious logging. Ensure that log messages are clear and meaningful, and that they provide the information a user would need to rectify any issues.

Avoid excess punctuation or capital letters.

logger.error("SOMETHING WRONG!!!")

is significantly less useful than

logger.error("configuration failed: '8' is not valid for field 'enable_debug'.")

Sensitive information

Never log credentials or other sensitive information. If you really have to log something that could be considered sensitive, use the trace log level.

Charm configuration option description

The description of configuration in config.yaml is a string type (scalar). YAML supports two types of formats for that: block scalar and flow scalar (more information in YAML Multiline). Prefer to use the block style (using |) to keep new lines. Using > will replace new lines with spaces and make the result harder to read on Charmhub.io.

When to use Python or Shell

Limit the use of shell scripts and commands as much as possible in favour of writing Python for charm code. There needs to be a good reason to use a shell command rather than Python. Examples where it could be reasonable to use a script include:

Extracting data from a machine or container which can’t be obtained through Python
Issuing commands to applications that do not have Python bindings (e.g., starting a process on a machine)

Documentation

Documentation should be considered the user’s handbook for using a charmed application safely and successfully.

It should apply to the charm, and not to the application that is being charmed. Assume that the user already has basic competency in the use of the application. Documentation should include:

on the home page: what this is, what it does, what problem it solves, who it’s useful for
an introductory tutorial that gives the new user an experience of what it’s like to use the charm, and an insight into what they’ll be able to do with it - by the end of this, they should have deployed the charm and had a taste of success with it
how-to guides that cover common tasks/problems/application cases
reference detailing what knobs and controls the charm offers
guides that explain the bigger picture, advise on best practice, offer context

A good rule of thumb when testing your documentation is to ask yourself whether it provides a means for “guaranteed getting started”. You only get one chance at a first impression, so your quick start should be rock solid.

The front page of your documentation should not carry information about how to build, test or deploy the charm from the local filesystem: put this information in other documentation pages specific to the development of and contribution to your charm. This information can live as part of your Charm Documentation, or in the version control repository for your charm (example).

If you’d like some feedback or advice on your Charm’s documentation, ask in our Mattermost Charmhub Docs channel.

Custom events

Charms should never define custom events themselves. They have no need for emitting events (custom or otherwise) for their own consumption, and as they lack consumers, they don’t need to emit any for others to consume either. Instead, custom events should only be defined in a library.

Backward compatibility

When authoring your charm, consider the target Python runtime. Kubernetes charms will have access to the default Python version on the Ubuntu version they are running.

Your code should be compatible with the operating system and Juju versions it will be executed on. For example, if your charm is to be deployed with Juju 2.9, its Python code should be compatible with Python 3.5.

Compatibility checks for Python 3.5 can be automated in your CI or using mypy.

Dependency management

External dependencies must be specified in a requirements.txt file. If your charm depends on other charm libraries, you should vendor and version the library you depend on (see the prometheus-k8s-operator). This is the default behaviour when using charmcraft fetch-lib. For more information see the docs on Charm Libraries.

Including an external dependency in a charm is a significant choice. It can help with reducing the complexity and development cost. However, a poor dependency pick can lead to critical issues, such as security incidents around the software supply chain. The Our Software Dependency Problem article describes how to assess dependencies for a project in more detail.

Code style

Error Handling

Use the following mapping of errors that can occur to the state the charm should enter:

Automatically recoverable error: the charm should go into maintenance status until the error is resolved and then back to active status. Examples of automatically recoverable errors are those where the operation that resulted in the error can be retried.
Operator recoverable error: The charm should go into the blocked state until the operator resolves the error. An example is that a configuration option is invalid.
Unexpected/unrecoverable error: the charm should enter the error state. The operator will need to file a bug and potentially downgrade to a previous version of the charm that doesn’t have the bug.

The charm should not catch the parent Exception class and instead only catch specific exceptions. When the charm is in error state, the event that caused the error will be retried by juju until it can be processed without an error. More information about charm statuses is in the juju documentation.
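
The mapping above could be sketched roughly as follows. This is not ops code: statuses are represented as plain (name, message) tuples so the example runs on its own, and TransientError and InvalidConfigError are hypothetical exception types standing in for whatever specific exceptions a charm's workload can raise.

```python
from typing import Callable, Tuple


class TransientError(Exception):
    """Hypothetical: the failed operation can simply be retried."""


class InvalidConfigError(Exception):
    """Hypothetical: the operator has to fix a configuration option."""


def status_after(operation: Callable[[], None]) -> Tuple[str, str]:
    """Run an operation and map known failures to a (status, message) pair."""
    try:
        operation()
    except TransientError as err:
        # Automatically recoverable: stay in maintenance until a retry succeeds.
        return ("maintenance", f"retrying: {err}")
    except InvalidConfigError as err:
        # Operator recoverable: block until the operator changes the config.
        return ("blocked", f"invalid configuration: {err}")
    # Anything else is deliberately not caught, so the hook fails and the
    # charm enters the error state.
    return ("active", "")
```

Note that there is no bare except Exception clause: unexpected errors propagate out of the handler.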

Clarity

Charm authors should choose clarity over cleverness when writing code. A lot more time is spent reading code than writing it, so opt for clear code that is easily maintained by anyone. For example, don’t write nested, multi-line list comprehensions, and don’t overuse itertools.

User experience / UX

Charms should aim to keep the user experience of the operator as simple and obvious as possible. If it is harder to use your charm than to set up the application from scratch, why should the user even bother with your charm?

Ensure that the application can be deployed without providing any further configuration options, e.g.

juju deploy foo

is preferable over

juju deploy foo --config something=value

This will not always be possible, but will provide a better user experience where applicable. Also consider if any of your configuration items could instead be automatically derived from a relation.

A key consideration here is which of your application’s configuration options you should initially expose. If your chosen application has many config options, it may be prudent to provide access to a select few, and add support for more as the need arises.

For very complex applications, consider providing “configuration profiles” which can group values for large configs together. For example, “profile: large” that tweaks multiple options under the hood to optimise for larger deployments, or “profile: ci” for limited resource usage during testing.
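
One possible way to implement such profiles is a simple lookup table that is merged with any explicitly set options. The sketch below is only an illustration: the profile names come from the paragraph above, but the option names (worker-count, cache-size-mb), their values and the effective_config helper are made up.

```python
from typing import Any, Dict

# Hypothetical option names and values, for illustration only.
PROFILES: Dict[str, Dict[str, Any]] = {
    "ci": {"worker-count": 1, "cache-size-mb": 64},
    "large": {"worker-count": 16, "cache-size-mb": 4096},
}


def effective_config(profile: str, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Expand a profile name into concrete options, letting explicit values win."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    merged = dict(PROFILES[profile])  # copy, so the table itself is never mutated
    merged.update(overrides)
    return merged
```

Letting explicit options override the profile keeps the single "profile" knob simple while still allowing fine-tuning.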

Event handler visibility

Charms should make event handlers private: _on_install, not on_install. There is no need for any other code to directly access the event handlers of a charm or charm library.

Subprocess calls within Python

For simple interactions with an application or service, or when a high-quality Python binding is not available, a Python binding may not be worth the maintenance overhead. In these cases, use subprocess or the shell to perform the required operations on the application or service.

For complex use cases where a high quality Python binding is available, using subprocess or the shell for interactions with an application or service will carry a higher maintenance burden than using the Python binding. In these cases, the Python binding for the application or service should be used.

When using subprocess or the shell:

Log exit_code and stderr when errors occur.
Use absolute paths to prevent security issues.
Prefer subprocess over the shell
Prefer array-based execution over building strings to execute

For example:

import logging
import subprocess

logger = logging.getLogger(__name__)

try:
    # Comment to explain why subprocess is used.
    result = subprocess.run(
        # Array based execution, with an absolute path to the binary.
        ["/usr/bin/echo", "hello world"],
        capture_output=True,
        check=True,
    )
    logger.debug("Command output: %s", result.stdout)
except subprocess.CalledProcessError as err:
    # Log the exit code and stderr, then let the error propagate.
    logger.error("Command failed with code %i: %s", err.returncode, err.stderr)
    raise

Linting

Use linters to make sure the code has a consistent style regardless of the author. An example configuration can be found in the pyproject.toml of the charmcraft init template.

This config makes some decisions about code style on your behalf. At the time of writing, it configures type checking using Pyright, code formatting using Black, and uses ruff to keep imports tidy and to watch for common coding errors.

In general, run these tools inside a tox environment named lint, and one called fmt alongside any testing environments required. See the Recommended tooling section for more details.

Docstrings

Charms should have docstrings. Use the Google docstring format when writing docstrings for charms. To enforce this, use ruff as part of your linter suite. See this example from the Google style guide.

Class layout

The class layout of a charm should be organised in the following order:

Constructor (inside which events are subscribed to, roughly in the order they would be activated)
Factory methods (classmethods), if any
Event handlers, placed in order that they’re subscribed to
Public methods
Private methods

Further, the use of nested functions is discouraged; instead, use either private methods or module-level functions. Likewise, the use of static methods that could be functions defined near the class in the same module is also discouraged.

String formatting

f-strings are the preferred way of including variables in a string. For example:

foo = "substring"

# .format is not preferred
bar = "string {}".format(foo)

# string concatenation is not preferred
bar = "string " + foo

# f-strings are preferred
bar = f"string {foo}"

The only exception to this is logging, where %-formatting should be used. See above.

Note: f-strings are supported as of Python 3.6. Charms that are based on pre-Bionic Ubuntu versions or libraries needing to support these versions will not have access to f-strings.

Type hints

Declare type hints on function parameters, return values, and class and instance variables.

Type hints should be checked during development and CI using Pyright. Although there are other options, Pyright is the recommended one, as it is what is used in ops itself (see an example Pyright config). More information on type hints can be found in PEP 484 and related PEPs.

This will help users know what functions expect as parameters and return and catch more bugs earlier.
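
As a small illustration of hints on parameters, return values, and class and instance variables, consider the sketch below. EndpointRegistry is a hypothetical helper, not part of ops. It uses the typing module's generic aliases so that it also works on older Python versions.

```python
from typing import Dict, List, Optional


class EndpointRegistry:
    """Hypothetical helper that tracks host/port pairs."""

    default_port: int = 8080  # class variable with a type hint

    def __init__(self) -> None:
        # Instance variable with a type hint.
        self._endpoints: Dict[str, int] = {}

    def register(self, host: str, port: Optional[int] = None) -> None:
        """Record a host, falling back to the default port if none is given."""
        self._endpoints[host] = port if port is not None else self.default_port

    def urls(self) -> List[str]:
        """Return the registered endpoints as URLs, sorted by host."""
        return [f"http://{host}:{port}" for host, port in sorted(self._endpoints.items())]
```

With these hints in place, a type checker can flag a call such as registry.register("host", "80") before the code ever runs.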

Note that there are some cases when type hints might be impractical, for example:

dictionaries with many nested dictionaries
decorator functions

Patterns

Fetching network information

As a majority of workloads integrate through the means of network traffic, it is common that a charm needs to share its network address over any established relations, or use it as part of its own configuration.

Depending on timing, routing, and topology, some approaches might make more sense than others. Likewise, charms in Kubernetes won’t be able to communicate their cluster FQDN externally, as this address won’t be routable outside of the cluster.

Below you’ll find a couple of different alternatives.

Using the bind address

This alternative has the benefit of not relying on name resolution to work. Trying to get a bind_address too early after deployment might result in a None if the DHCP has yet to assign an address.

from typing import Optional
...

@property
def _address(self) -> Optional[str]:
    binding = self.model.get_binding(self._peer_relation_name)
    address = binding.network.bind_address

    # bind_address may be None if no address has been assigned yet.
    return str(address) if address else None

Using FQDN

This alternative has the benefit of being available immediately after the charm is deployed, which eliminates the possible race of the previous example. However, it will in most cases only work for deployments that share a DNS provider (for instance inside a Kubernetes cluster), while in a cross-substrate deployment it most likely won’t resolve.

import socket
...

    @property
    def address(self) -> str:
        """Unit's hostname."""
        return socket.getfqdn()

Using the ingress URL

This alternative has the benefit of working in most Kubernetes deployment scenarios, even if the opposite side of the relation is not within the same cluster, or even the same substrate.

This does however add a dependency, as it requires an ingress to be available. In the example below, the traefik ingress is used, falling back to the FQDN if it isn’t available.

Further, keep in mind that unless you use ingress-per-unit, the ingress URL will not point to the individual unit; what it points to depends on the ingress strategy (i.e. per app, or the leader).

@property
def external_url(self) -> str:
    try:
        if ingress_url := self.ingress.url:
            return ingress_url
    except ModelError as e:
        logger.error("Failed obtaining external url: %s. Shutting down?", e)
    return f"http://{socket.getfqdn()}:{self._port}"

Random values

While creating tests, sometimes you need to assign values to variables or parameters in order to simulate a user behaviour, for example. In this case, instead of using constants or fixed values, consider using random ones generated by secrets.token_hex(). This is preferred because:

If you use the same fixed values in your tests every time, your tests may pass even if there are underlying issues with your code. This can lead to false positives and make it difficult to identify and fix real issues in your code.
Using random values generated by secrets.token_hex() can help to prevent collisions or conflicts between test data.
In the case of sensitive data, if you use fixed values in your tests, there is a risk that they may be exposed or leaked, especially if your tests are run in a shared environment.

For example:

from secrets import token_hex

email = token_hex(16)

Testing

Charms should have tests to verify that they are functioning correctly. These tests should cover the behaviour of the charm both in isolation (unit tests) and when used with other charms (integration tests). Charm authors should use tox to run these automated tests.

The unit and integration tests should be run on the same minor Python version as is shipped with the OS as configured under the charmcraft.yaml bases.run-on key. With tox, for Ubuntu 22.04, this can be done using:

[testenv]

basepython = python3.10

Unit tests

Unit tests are written using the unittest library shipped with Python or pytest. To facilitate unit testing of charms, use the testing harness specifically designed for charms which is available in the Charm SDK. An example of charm unit tests can be found here.

Functional tests

Functional tests in charms often take the form of integration-, performance- and/or end-to-end tests.

Use the pytest library for integration and end-to-end tests. Pytest-operator is a testing library for interacting with Juju and your charm in integration tests. Examples of integration tests for a charm can be found in the prometheus-k8s-operator repo.

Integration tests

Integration tests ensure that the charm operates as expected when deployed by a user. Integration tests should cover:

Charm actions
Charm integrations
Charm configurations
That the workload is up and running, and responsive

When writing an integration test, it is not sufficient to simply check that Juju reports that running the action was successful. Additional checks need to be executed to ensure that whatever the action was intended to achieve worked.

Recommended tooling

Continuous integration

The quality assurance pipeline of a charm should be automated using a continuous integration (CI) system. The CI should be configured to use the same version of Ubuntu as configured under the charmcraft.yaml bases.run-on key.

For repositories on GitHub, use the actions-operator, which will take care of setting up all dependencies needed to be able to run charms in a CI workflow. You can see an example configuration for linting and testing a charm using GitHub Actions here.

The charming-actions repo provides GitHub actions for common workflows, such as publishing a charm or library to Charmhub.

The automation should also allow the maintainers to easily see whether the tests failed or passed for any available commit. Provide enough data for the reader to be able to take action, i.e. dumps from juju status, juju debug-log, kubectl describe and similar. To have this done for you, you may integrate charm-logdump-action into your CI workflow.

Linters

At the time of writing, linting modules commonly used by charm authors include black, ruff, and codespell. bandit can be used to statically check for common security issues. Pyright is recommended for static type checking (though MyPy can be used as well).

Common integrations

Observability

Charms need to be observable, meaning that they need to allow the Juju administrator to reason about their internal state and health from the outside. This means that charm authors need to ensure that their charms expose appropriate telemetry, alert rules and dashboards.

Metrics

Metrics should be provided in a format compatible with Prometheus. This means that the metrics either should be provided using the prometheus_remote_write or the prometheus_scrape relation interface.

Some charm workloads have native capabilities of exposing metrics, while others might rely on external tooling in the form of exporters. For Kubernetes charms, these exporters should be placed in a separate container in the same pod as the workload itself.

Logs

Any logs relevant for the charm should also be forwarded to Loki over the loki_push_api relation interface. For Kubernetes charms, this is usually accomplished using the loki_push_api library, while machine charms will want to integrate with the Grafana Agent subordinate charm.

Alert rules

Based on the telemetry exposed above, charms should also include opinionated, generic, and robust alert rules. See the how-to article on Charmhub for more information.

Grafana Dashboards

Just as with alert rules, charms should be shipped with good, opinionated Grafana dashboards. The goal of these dashboards should be to provide an at-a-glance image of the health and performance of the charm. See the grafana_dashboard library for more information.

Security considerations

Use Juju secrets to share secrets between units or applications.

Do not log secret or private information: no passwords, no private keys, and even be careful with logging “unexpected” keys in json or yaml dictionaries. Logs are likely to be pasted to public pastebin sites in the process of troubleshooting problems.

If you need to generate a password or secret token, prefer the Python secrets library over the random library or writing your own tool. 16 bytes of random data (32 hex chars) is a reasonable minimum; longer may be useful, but 32 bytes (64 hex chars) is probably a reasonable upper limit for “tokens”.
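
A minimal sketch of this recommendation, using only the standard library. The generate_token helper and its bounds check are illustrative assumptions that simply encode the 16 to 32 byte guidance from the paragraph above.

```python
import secrets


def generate_token(num_bytes: int = 16) -> str:
    """Return a random hex token; each byte becomes two hex characters."""
    if not 16 <= num_bytes <= 32:
        raise ValueError("num_bytes should be between 16 and 32")
    return secrets.token_hex(num_bytes)
```

Calling generate_token() yields a 32-character hex string, and generate_token(32) yields 64 characters.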

Enforce a ‘chain of trust’ for all executable content: Ubuntu’s apt configuration defaults to a Merkle-tree of sha-512 hashes, rooted with GPG keys. This is preferred. Simplestreams can be used with CA-verified HTTPS transfers and sha-512 hashes. This is acceptable but much more permissive. If pulling content from third-party sites, consider embedding sha-384 or sha-512 hashes into the charm, and verifying the hash before operating on the file in any way. Verifying GPG signatures would work but is more challenging to do correctly. Use gpgv for GPG signature verification.
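
For the case of embedding a hash in the charm, verification could look roughly like the sketch below. The verify_sha512 function is a hypothetical helper; in a real charm the expected hash would be a constant shipped with the charm code, and the file would only be unpacked or executed after the check passes.

```python
import hashlib
import hmac


def verify_sha512(path: str, expected_hex: str) -> bool:
    """Return True only if the file at path matches the expected sha-512 hash."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large downloads do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())
```

If the function returns False, the safest reaction is to delete the file and fail loudly rather than continue.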

Make use of standard Unix user accounts to the extent that it is possible: programs should only run as root when absolutely necessary. Beware of race conditions: it is safer to create or modify files as the target account with the correct permissions in comparison to creating or modifying files as root and then changing ownership or permissions.
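
One way to avoid the create-then-chmod race is to set the permissions at creation time. The sketch below assumes the process is already running as the target account; it does not switch users itself, and write_private_file is a hypothetical helper name.

```python
import os


def write_private_file(path: str, content: bytes, mode: int = 0o600) -> None:
    """Create a new file with restrictive permissions from the very first moment."""
    # O_EXCL makes the call fail if the path already exists (including a
    # pre-planted symlink), instead of silently writing through it.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, mode)
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)
```

Because the mode is passed to os.open, there is no window in which the file exists with broader permissions.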

Consider using systemd security features:

Seccomp SystemCallArchitectures=native and SystemCallFilter=
User= and Group= are often safer than program’s built-in privilege dropping
CapabilityBoundingSet= can limit the capabilities a service can use
AmbientCapabilities= can provide limited capabilities without requiring root – this can be helpful for e.g. CAP_NET_BIND_SERVICE. (Many capabilities can be leveraged to full root, so it’s not perfect. Binding low ports, for when systemd socket-activation doesn’t work, is probably the best use case.)
Systemd’s various ProtectHome= or ProtectSystem= may not play nicely with AppArmor policies.

Consider writing AppArmor profiles for services: Juju ‘owns’ the configuration of services, so there should be minimal conflict with local configurations.

Example repositories

There are a number of sample repositories you could use for inspiration and a demonstration of good practice.

Kubernetes charms:

Machine charms:

Contributors: @0x12b , @benhoyt , @carlcsaposs , @jameinel , @jdkandersson , @jnsgruk , @tony-meyer , @tmihoc

erik-lonroth · 30 November 2021 23:40

Broken link?

mthaddon · 1 December 2021 14:52

Does anyone have a reference for this? I know there’s a reason why but would like to have the reference if possible.

mthaddon · 1 December 2021 15:07

Sorry, I’ve answered my own question here… Per https://github.com/google/styleguide/blob/gh-pages/pyguide.md#3101-logging

“Always call them with a string literal (not an f-string!) as their first argument with pattern-parameters as subsequent arguments. Some logging implementations collect the unexpanded pattern-string as a queryable field. It also prevents spending time rendering a message that no logger is configured to output.”

pedroleaoc · 7 April 2022 08:35

jose · 30 May 2022 14:53

In some charms we need to get unit’s address to share over relation data with other charms. Getting unit IPs is error prone, since sometimes bind_address may return None, for instance:

    @property
    def private_address(self) -> Optional[str]:
        """Get the unit's ip address.

        Technically, receiving a "joined" event guarantees an IP address is available. If this is
        called beforehand, a None would be returned.
        When operating a single unit, no "joined" events are visible so obtaining an address is a
        matter of timing in that case.

        This function is still needed in Juju 2.9.5 because the "private-address" field in the
        data bag is being populated by the app IP instead of the unit IP.
        Also in Juju 2.9.5, ip address may be None even after RelationJoinedEvent, for which
        "ops.model.RelationDataError: relation data values must be strings" would be emitted.

        Returns:
          None if no IP is available (called before unit "joined"); unit's ip address otherwise
        """
        # if bind_address := check_output(["unit-get", "private-address"]).decode().strip()
        if bind_address := self.model.get_binding(self._peer_relation_name).network.bind_address:
            bind_address = str(bind_address)
        return bind_address

A better approach is getting the hostname with the socket module:

import socket

...

    @property
    def hostname(self) -> str:
        """Unit's hostname."""
        return socket.getfqdn()

0x12b · 2 June 2022 11:00

Thanks José! I’ve updated the doc to include (a somewhat altered version of) your suggestion.

kian-parvin · 26 July 2022 09:09

Broken link here?

0x12b · 26 July 2022 09:22

Fixed it. Thank you!

0x12b · 6 July 2023 12:09

Dropped reference to Mypy as it is something that we intentionally decided we don’t want to push as an option.