Best practices for testing charms

This post is a summary of chapters 11, 12, 13, 14 from the Google SWE book, focused on / interpreted for charmed operators.

Motivation

We write tests because we want early feedback on design choices, the confidence to refactor, and to catch many bugs without relying on manual testing.

Principles

Be mindful of the testing pyramid (PYRAMID)

The most important qualities we want from our test suite are speed and determinism.

Larger, system-scale tests are slower, less reliable and more difficult to debug than smaller tests.

Ensure that larger tests are valuable assets and not resource sinks.

  • The majority of tests (~80%) should be small unit tests. (SMALL)
    • “Executable documentation” of business logic.
    • Must run on a single thread.
    • Not allowed to sleep.
    • Not allowed to perform IO (unless using a hermetic in-memory implementation).
    • Not allowed to make blocking calls (e.g. network).
  • Charm integration tests (~15%) should be limited to a single node. (MEDIUM)
    • Allowed to make blocking calls only to localhost.
    • Test inter-component interaction (juju – charm – workload).
    • Should have a strict timeout, e.g. 15 minutes.
    • The best way to speed up a test is often to reduce its scope or to split a large test into two smaller tests that can run in parallel.
  • Bundle tests (~5%) can span multiple models/machines/nodes. (LARGE)
    • Single system/solution validation (functional/e2e).
    • Cross-solution tests - these could be run by dedicated QA teams.
    • Large tests should have documented owners.
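As a sketch of what the "small" constraints look like in practice (the function and field names here are hypothetical, not from any particular charm), a small unit test exercises pure logic with no sleeps, no IO, and no network:

```python
# Hypothetical {ops,juju}-independent helper under test: builds a Pebble
# layer dict from plain inputs. No IO, no network, no sleeping.
def build_pebble_layer(service_name: str, command: str) -> dict:
    return {
        "summary": f"{service_name} layer",
        "services": {
            service_name: {
                "override": "replace",
                "command": command,
                "startup": "enabled",
            }
        },
    }


def test_layer_enables_service_on_startup():
    layer = build_pebble_layer("app", "/bin/app --port 8080")
    service = layer["services"]["app"]
    assert service["startup"] == "enabled"
    assert service["command"] == "/bin/app --port 8080"
```

Because the helper takes plain values and returns a plain dict, the test runs in microseconds and is fully deterministic.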

Metrics (METRICS)

  • About 80% of the tests (by count, not runtime) should be small unit tests.
  • The rate of flaky tests should be very small (at Google it’s 0.15%).
  • Track code coverage, but only from small tests, to avoid coverage inflation.

Be mindful of {ops,juju}-dependent vs. independent logic (DEPS)

If new code is difficult to test, it is often because the code being tested has too many responsibilities or difficult-to-manage dependencies.

Charms have many parts:

Component  | {ops,juju}-independent                    | {ops,juju}-dependent
Charm libs | Lib utils, config builders                | Custom event emission, event handlers, pebble operations, relation data r/w
charm.py   | Charm utils, config builders              | Workload manager (pebble operations), event handlers
Rocks      | Workload (binary), default pebble service | N/A

The better we isolate these parts, the easier they are to test.
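One way to make the split concrete (all names here are hypothetical): keep the config-building logic free of ops/juju imports, and let the workload manager be a thin adapter around it:

```python
# {ops,juju}-independent: plain inputs, plain outputs, trivially unit-testable.
def build_workload_config(host: str, port: int) -> str:
    return f"listen {host}:{port}\n"


# {ops,juju}-dependent: a thin adapter that only wires charm config into
# the pure builder and pushes the result to the container (pebble).
class WorkloadManager:
    def __init__(self, container):
        self._container = container

    def configure(self, host: str, port: int) -> None:
        self._container.push("/etc/app.conf", build_workload_config(host, port))
```

The pure builder needs no mocks at all; only the thin adapter needs a test double for the container.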

Tests should be hermetic (HERMETIC)

  • Make your tests complete (self-contained) and concise (containing only relevant information) (ref).
    • A test should contain all of the information necessary to set up, execute, and tear down its environment.
    • A test should contain only the information required to exercise the behavior in question.
  • Test methods in a unit test module should not assume the order in which they are run.
  • Test methods in integration tests can be chained, such that the output of one test is used as the input to another test. Separate test modules should not depend on each other.
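A minimal sketch of order-independent unit tests (the in-memory "databag" stand-in is hypothetical): each test builds its own state, so they pass in any order:

```python
def make_store() -> dict:
    # Hypothetical in-memory stand-in for, e.g., a relation databag.
    return {}


def test_write_then_read():
    store = make_store()  # complete: all setup is local to this test
    store["db-uri"] = "postgres://localhost/app"
    assert store["db-uri"] == "postgres://localhost/app"


def test_missing_key_reads_as_absent():
    store = make_store()  # concise: only what this behavior needs
    assert "db-uri" not in store
```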

A test should be obvious upon inspection (DAMP)

When you are called to fix a broken test that you have never seen before, you will be thankful someone took the time to make it easy to understand.

A clear test is one whose purpose for existing and reason for failing is immediately clear to the engineer diagnosing a failure.

If you feel like you need to write a test to verify your test, something has gone wrong!

  • It can often be worth violating the DRY (Don’t Repeat Yourself) principle if it leads to clearer tests. (ref)
  • Don’t put logic in tests. Stick to straight-line code. (ref)
    • Control flow statements (conditionals, loops) are strongly discouraged (ref).
    • “Write the test you’d like to read”.
  • Tests should demonstrate: code correctness, edge cases, error conditions / failure modes.
  • Tests should clearly express cause and effect.
  • Instead of sharing data, use helper methods for constructing data in each test. (ref)
  • Include in-code documentation where appropriate.
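A sketch of the helper-method approach (the config keys are hypothetical): instead of a shared fixture, each test constructs its own data and overrides only the field that matters to it, keeping cause and effect visible:

```python
# Constructs test data per test, instead of sharing a fixture between
# tests. Defaults are explicit; each test overrides only what it cares about.
def make_config(**overrides) -> dict:
    config = {"host": "localhost", "port": 8080, "tls": False}
    config.update(overrides)
    return config


def test_tls_disabled_by_default():
    config = make_config()
    assert config["tls"] is False


def test_custom_port_is_kept():
    config = make_config(port=9090)
    assert config["port"] == 9090
```

Repeating `make_config(...)` in every test is deliberate DRY violation: the reader never has to hunt for a shared fixture to understand a failure.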

Test behaviors, not methods (BDD)

The mapping between methods and behaviors is many-to-many (some behaviors rely on the interaction of multiple methods). The structure of the tests does not need to match the structure of the methods.

The ideal test is unchanging after it’s written. A good test should need to change only if user-facing behavior of an API changes.

  • Prefer testing public APIs (refs: 1, 2).
  • Test state, not interactions (ref).
    • Example: assert on unit status rather than that a method was called.
    • Interaction testing leads to tests that are brittle because it exposes implementation details of the system under test.
  • Write BDD-style tests: GIVEN (a controlled environment), WHEN (a specific input), THEN (a specific single behavior produces an observable output).
  • Name tests after the behavior being tested (ref).
    • If you’re stuck, start with “shouldX”.
    • If you find yourself needing to use the word “and”, it means you’re trying to test too much in one place.
    • A behavioral test can be read at three levels of granularity: test method name, the given/when/then comments, and the actual code.
  • Each test should cover only a single behavior.
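A sketch of a BDD-style, state-asserting test (the `decide_status` function is a hypothetical pure decision helper extracted from a charm, not an ops API):

```python
# Hypothetical pure decision function: given the current config,
# decide what the unit status should be.
def decide_status(config: dict) -> str:
    if not config.get("db-uri"):
        return "blocked: db-uri config is required"
    return "active"


def test_should_block_when_db_uri_is_missing():
    # GIVEN a charm configured without a database URI
    config = {}
    # WHEN the status is computed
    status = decide_status(config)
    # THEN the unit reports blocked -- asserting on state (the status),
    # not on which internal methods were called along the way
    assert status.startswith("blocked")
```

Note the three levels of granularity: the name (`should_block_when_db_uri_is_missing`), the GIVEN/WHEN/THEN comments, and the code itself.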

Write clear failure messages (MSG)

A test must provide a clear pass/fail signal and meaningful error output that helps triage the source of failure.

A good error message anticipates the reader’s unfamiliarity with the code and provides enough context to diagnose the failure.

In an ideal world, an engineer could diagnose a problem just from reading its failure message without ever having to look at the test itself.
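A small sketch of the difference (the values are hypothetical): the assertion message states the expectation, the observation, and the inputs, so the failure output alone tells the story:

```python
def test_unit_goes_active_after_valid_config():
    config = {"port": 8080}
    actual = "active"  # stand-in for the observed unit status
    # Bad:  assert actual == "active"          -> "AssertionError"
    # Good: the message carries expectation, observation, and inputs.
    assert actual == "active", (
        f"expected unit status 'active' but got {actual!r} "
        f"after applying config {config!r}"
    )
```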

Ship test-doubles/fakes together with your charm libs (FAKE)

Stubbing leaks implementation details of your code into your test.

  • If your charm lib performs IO or makes blocking calls (e.g. network), it is appropriate to use a test double / fake for unit-testing (otherwise engineers should prefer to use real implementations instead).
  • Write a fake if productivity boost outweighs the cost of writing and maintaining it.
  • The team that owns the real implementation of the charm lib should write and maintain the fake. Without up-to-date test doubles, other engineers will pollute their charm tests with mocks, which could quickly get out of sync with the real implementation.
  • A fake must have its own tests.
    • A fake should maintain fidelity to the API contracts of the real implementation, but only from the perspective of the test.
    • Run the same public API tests against both the real implementation and the fake (contract tests).
  • To use test doubles, a codebase needs to be designed to be testable (refs: 1, 2).
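A sketch of shipping a fake next to the real client (the client classes and the contract are hypothetical): both expose the same public API, and one contract test is run against each:

```python
# Real implementation (sketch): talks to the workload over the network,
# so small unit tests should not use it directly.
class HttpVersionClient:
    def __init__(self, url: str):
        self._url = url

    def get_version(self) -> str:
        raise NotImplementedError("network call; exercised in medium tests")


# Fake shipped alongside the charm lib: in-memory, same API contract,
# maintained by the same team that owns the real client.
class FakeVersionClient:
    def __init__(self, version: str = "1.0.0"):
        self._version = version

    def get_version(self) -> str:
        return self._version


# Contract test: any implementation must return an x.y.z version string.
def check_version_contract(client) -> None:
    version = client.get_version()
    assert version and version.count(".") == 2
```

In CI, `check_version_contract` would run against the fake in small tests and against `HttpVersionClient` in medium tests, which keeps the fake faithful to the real implementation.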

Use test data captured in production (REPLAY)

Test data copied from production is much more faithful to reality (having been captured there); the big challenge is generating realistic test traffic before the new code has launched.

First implement high-value generic tests (GENERIC)

  • Single charm:
    • Isolated upgrade of multi-unit charm (parametrized over charmhub channel)
    • Isolated load test (potentially together with a test charm).
    • Cross-controller CMR for every relation the charm has.
  • Single bundle:
    • All charms active/idle after enabling/disabling TLS
    • Update status every 10 sec for 2 min.
    • Upgrade all charms concurrently.

Include load tests in your charm’s quality gates (LOAD)

  • Built-in load-tests can serve as verifiable guarantees.

Include probers with your charm (PROBER)

Probers perform well-known and deterministic read-only actions.
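A minimal sketch of such a probe (the health-check shape is hypothetical): it is read-only, deterministic for a given response, and never mutates the system. The fetcher is injected so the probe stays free of transport details:

```python
def probe_ready(fetch_health) -> bool:
    # Read-only probe: inspects well-known health state and reports a
    # boolean; any transport failure is itself a negative result.
    try:
        return fetch_health().get("status") == "ok"
    except Exception:
        return False
```

In production the injected `fetch_health` would perform the actual read (e.g. an HTTP GET against a health endpoint); in tests it is a plain function returning canned data.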

Process

  • As part of issue triage, add a comment with the BDD/Gherkin tests that you think would cover the issue.
  • Do not approve PRs that add new code without tests. The PR’s description should match its test changes/additions.
  • Test coverage shouldn’t decrease.
  • Treat your tests like production code.
