Abstract
For modern infrastructure, resilience and reliability are critical. In Canonical’s valkey-operator charm, achieving high availability (HA) means coordinating complex, interdependent processes.
This post is the first in a series covering concepts we encountered while developing the Valkey Operator. It explores the core architecture that enables high availability for standalone Charmed Valkey deployments, including how the operator handles replication, failover, and scaling, and how it automates these processes to ensure resilient, production-ready deployments.
Introduction
When deploying fast in-memory data stores like Valkey, organizations expect speed, but they must actively build reliability. Maintaining High Availability (HA) requires engineering teams to manage data replication, handle dynamic scaling, and prevent cluster fragmentation.
While Valkey provides the essential tools for resilience, managing these features manually introduces significant risk. Recovering seamlessly from a node failure without losing data or fracturing the system demands a robust architectural design.
This post breaks down the core principles of Valkey’s HA architecture. It also explores how Canonical’s valkey-operator automates these complex tasks to maintain reliable, production-ready deployments. To understand how the operator achieves this, we must first examine how Valkey distributes data across multiple nodes.
The Foundation of Valkey High Availability
At its core, Valkey Replication allows for the creation of exact copies of primary database instances. Valkey operates as an in-memory data store, prioritizing extreme speed and low latency by keeping data in RAM. Because RAM is inherently volatile, Valkey relies on an asynchronous replication model to provide high availability across a primary-replica topology.
The data replication flow operates as follows:
- The Primary Node: Handles all write operations and immediately confirms them to the client. It does not wait for replicas to acknowledge receipt.
- The Replica Nodes: Receive the data stream asynchronously from the primary, updating their own in-memory state slightly behind the primary.
This non-blocking approach guarantees high throughput, but it introduces a Replication Gap. This is a brief window during which data exists on the primary but has not yet been replicated to the replicas.
Should the primary node crash before this gap closes, the system risks data loss. To mitigate this and automate the recovery process, Valkey requires a secondary management layer to monitor the cluster.
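A toy Python model can make the replication gap concrete. This is purely illustrative (byte offsets stand in for the replication stream; it is not Valkey's implementation):

```python
# Toy model of asynchronous replication: the primary acknowledges writes
# immediately, so its replication offset can run ahead of the replica's.

class Primary:
    def __init__(self):
        self.offset = 0   # bytes of replication stream produced
        self.stream = []  # commands not yet consumed by the replica

    def write(self, command):
        # The primary confirms the write without waiting for replicas.
        self.offset += len(command)
        self.stream.append(command)
        return "OK"

class Replica:
    def __init__(self):
        self.offset = 0   # bytes of replication stream applied

    def consume(self, primary, max_commands):
        # Replicas apply the stream asynchronously, possibly lagging behind.
        for command in primary.stream[:max_commands]:
            self.offset += len(command)
        del primary.stream[:max_commands]

primary, replica = Primary(), Replica()
primary.write("SET k1 v1")
primary.write("SET k2 v2")
replica.consume(primary, max_commands=1)

# The replication gap: data acknowledged on the primary, absent on the replica.
gap = primary.offset - replica.offset
```

If the primary crashes while `gap` is non-zero, the unreplicated bytes are lost with it.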
Valkey Sentinel: Distributed Monitoring and Consensus
Valkey Sentinel is the official high-availability solution for Valkey. It is a distributed system designed to monitor database instances, detect failures, and automatically reconfigure the cluster to promote a new primary when necessary.
Sentinel functions as a collective rather than a single agent. This design overcomes the “Observer Problem” where a single monitor might mistakenly attribute a localized network issue to a node failure. By running as a cluster, Sentinels collaborate to reach a consensus, preventing unnecessary failovers.
Zero-Config Discovery
Rather than relying on manual configuration for peer discovery, Valkey Sentinel uses a “Zero-Config” mechanism based on a gossip protocol, treating the primary node as a central hub. Each Sentinel broadcasts its presence on a dedicated Pub/Sub channel (`__sentinel__:hello`) hosted by the primary, subscribes to the same channel, and queries the primary for its list of replicas. This allows Sentinels to dynamically map the entire network topology and establish direct peer-to-peer connections.
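A minimal sketch of this gossip discovery (simplified: real Sentinels exchange richer hello messages that also carry run IDs and configuration epochs):

```python
# Toy sketch of Sentinel's gossip discovery via the __sentinel__:hello channel.
# Each Sentinel announces itself on a Pub/Sub channel hosted by the primary;
# every subscriber learns the full topology without any static peer config.

class HelloChannel:
    """Stands in for the primary's __sentinel__:hello Pub/Sub channel."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, sentinel):
        self.subscribers.append(sentinel)

    def publish(self, message):
        for sentinel in self.subscribers:
            sentinel.on_hello(message)

class Sentinel:
    def __init__(self, address):
        self.address = address
        self.known_peers = set()

    def on_hello(self, message):
        # Ignore our own announcements; record everyone else's address.
        if message["address"] != self.address:
            self.known_peers.add(message["address"])

    def announce(self, channel):
        channel.publish({"address": self.address})

channel = HelloChannel()
sentinels = [Sentinel(f"10.0.0.{i}:26379") for i in range(1, 4)]
for s in sentinels:
    channel.subscribe(s)
for s in sentinels:
    s.announce(channel)
# Every Sentinel now knows the other two without manual peer configuration.
```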
SDOWN vs. ODOWN
As Sentinels monitor the network, they evaluate the cluster against two distinct failure states to ensure stability before taking action:
| State | Full Name | Trigger | Consequence |
|---|---|---|---|
| SDOWN | Subjective Down | A local observation of a timeout by a single Sentinel. | None. Treated as a potential false positive; does not trigger failover. |
| ODOWN | Objective Down | A configured quorum of Sentinels confirms the failure via the `SENTINEL is-master-down-by-addr` command. | Action required. This collective agreement is the mandatory trigger for leader election. |
Once an ODOWN state is confirmed by the collective, the cluster must act quickly to restore read/write capabilities. This initiates the failover mechanism itself.
The Failover Mechanism
Failover is the process of choosing a new primary node from the existing replicas when the current primary is considered down. This process is critical for maintaining availability, but must be executed with precision to avoid data loss or split-brain scenarios.
Triggering a failover in Valkey is a two-step process that distinctly separates detecting the failure from executing the recovery:
- Detection (The Quorum): The minimum number of Sentinels required to agree that a primary is objectively down (ODOWN). Meeting the quorum initiates the failover attempt.
- Authorization (The Majority): To execute the failover, a Sentinel must be elected as the leader. This requires a majority vote from all configured Sentinel processes.
Even if the quorum detects a failure, the failover will abort if a majority of Sentinels are unreachable. This is a vital safeguard: it ensures a minority group partitioned from the main network cannot mistakenly promote a “rogue” primary, causing a split-brain scenario. (Note: to keep a stable majority achievable, it is recommended to deploy an odd number of Sentinel nodes.)
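The two thresholds can be sketched as simple predicates (assumed semantics, greatly simplified from the real leader-election protocol):

```python
# Sketch of the two distinct thresholds in a Sentinel failover.

def can_flag_odown(sdown_votes: int, quorum: int) -> bool:
    # Detection: `quorum` Sentinels must agree the primary is down (ODOWN).
    return sdown_votes >= quorum

def can_execute_failover(leader_votes: int, total_sentinels: int) -> bool:
    # Authorization: the would-be leader needs a strict majority of ALL
    # configured Sentinels, regardless of how low the quorum is set.
    return leader_votes >= total_sentinels // 2 + 1

# Five Sentinels with a quorum of 2: two votes are enough to flag ODOWN...
assert can_flag_odown(sdown_votes=2, quorum=2)
# ...but a partitioned minority of 2 can never authorize the failover.
assert not can_execute_failover(leader_votes=2, total_sentinels=5)
```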
How the valkey-operator automates this:
Managing this quorum manually during scaling operations is tedious and prone to errors. The operator automatically handles this calculation. On scale-up, it waits for the new Sentinel to be discovered by all existing ones before safely incrementing the quorum value. On scale-down, it gracefully reduces the quorum on `peer_relation_departed`, taking into account only active Sentinels.
Because the failover process relies on a precise mathematical majority, modifying the Sentinel topology requires deliberate orchestration.
Managing Sentinel Nodes Safely
Maintaining the delicate balance of the cluster during scaling operations is one of the primary reasons infrastructure teams rely on automated operators.
Scaling Up (Adding Sentinels)
When adding nodes, it is critical to introduce them sequentially. If performed manually, administrators must wait until all existing Sentinels are aware of the newly added node before introducing the next, ensuring a quorum majority is always possible.
How the valkey-operator automates this:
The operator manages this by scaling nodes one at a time, preventing network and CPU saturation. The designated unit acquires a lock upon the start event, initializes its workload, and continuously polls until replication is connected. It verifies that all active Sentinels have discovered the new unit before releasing the lock for the next unit in sequence.
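The flow above can be sketched in Python. The lock, unit, and sentinel objects here are illustrative stand-ins for the charm's real primitives, not its actual API:

```python
# Hypothetical sketch of the operator's one-at-a-time scale-up flow.
import threading
import time

def join_cluster(unit, lock, sentinels, poll_interval=0.0):
    with lock:                                   # one unit initializes at a time
        unit.start_workload()
        # Poll until the new replica's replication link is established...
        while not unit.replication_connected():
            time.sleep(poll_interval)
        # ...and until every active Sentinel has discovered the new unit.
        while not all(s.discovered(unit) for s in sentinels):
            time.sleep(poll_interval)
    # Lock released: the next unit in the sequence may now join.

class FakeUnit:
    def __init__(self):
        self.started, self.polls = False, 0
    def start_workload(self):
        self.started = True
    def replication_connected(self):
        self.polls += 1
        return self.polls >= 3                   # connects after a few polls

class FakeSentinel:
    def discovered(self, unit):
        return unit.started                      # discovers the running unit

unit = FakeUnit()
join_cluster(unit, threading.Lock(), [FakeSentinel(), FakeSentinel()])
```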
Scaling Down (Removing Sentinels)
Scaling down is equally sensitive because Sentinels intentionally “remember” old peers to maintain a stable majority count. To safely remove a Sentinel manually, an administrator must shut down the specific process and then send a `SENTINEL RESET *` command to every remaining instance sequentially, pausing in between to let the cluster stabilize.
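The manual procedure can be sketched as a small helper; `send` here is a stand-in for issuing a command against one Sentinel (for example via valkey-cli), an assumption of this sketch rather than a real client API:

```python
# Sketch of the sequential reset step of a manual Sentinel scale-down.
import time

def reset_remaining_sentinels(sentinels, send, settle_seconds=0.0):
    # Resets must be issued SEQUENTIALLY, with a pause between instances,
    # so the cluster can stabilize and re-discover the live topology.
    for sentinel in sentinels:
        send(sentinel, "SENTINEL RESET *")
        time.sleep(settle_seconds)

issued = []
reset_remaining_sentinels(
    ["sentinel-0", "sentinel-1"],
    send=lambda sentinel, cmd: issued.append((sentinel, cmd)),
)
```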
How the valkey-operator automates this:
When scaling down, the operator intercepts the `storage_detaching` event. If the unit designated for removal is the current primary, the operator explicitly triggers a failover first to prevent data loss. Afterward, it stops the workload and issues a synchronous `SENTINEL RESET` with the primary’s name across all remaining Sentinels, actively polling until the cluster reflects the new, reduced replica count.
Whether executing an emergency failover or safely scaling nodes, the cluster’s integrity ultimately depends on how the underlying data is synchronized during these shifts.
Deep Dive: Valkey Replication States
During any topology change, Valkey’s asynchronous replication operates in one of three states to keep data aligned across nodes:
| Replication State | Scenario | Mechanism |
|---|---|---|
| Continuous Replication | Normal Operation | The primary sends a constant stream of commands to the replica, mirroring every write, expiration, or eviction. |
| Partial Resynchronization | Brief Disconnection | The replica reconnects and requests only the specific commands it missed, minimizing network overhead. |
| Full Resynchronization | New Node / Missing Data | The primary creates a complete RDB snapshot, transfers it, and streams any buffered commands made during the transfer. |
Tracking Synchronization
Valkey identifies the appropriate synchronization state by tracking data with a fingerprint, composed of a Replication ID (the dataset’s history) and an Offset (a byte counter).
When a replica reconnects, it presents these credentials. If the primary has the missing data in its backlog buffer, Partial Resynchronization occurs. If the data is too old or the node is entirely new, a full resynchronization is triggered.
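This decision can be sketched as a small function (modeled on PSYNC semantics; a simplification, not Valkey's actual code):

```python
# Simplified sketch of how a primary chooses between partial and full
# resynchronization based on the replica's Replication ID and offset.

def choose_sync(primary_replid, backlog_start, backlog_end,
                replica_replid, replica_offset):
    same_history = replica_replid == primary_replid
    in_backlog = backlog_start <= replica_offset <= backlog_end
    if same_history and in_backlog:
        # Replay only the missed bytes from the backlog buffer.
        return "partial"
    # New node, diverged history, or data aged out of the backlog:
    # ship a full RDB snapshot instead.
    return "full"

# A replica that briefly disconnected, with its offset still in the backlog:
assert choose_sync("abc123", 1000, 5000, "abc123", 4200) == "partial"
# A brand-new replica with no shared history:
assert choose_sync("abc123", 1000, 5000, "?", 0) == "full"
```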
While these consensus and replication mechanics are incredibly robust, they were fundamentally designed for traditional server environments with static IP addresses. Adapting this architecture to dynamic, cloud-native infrastructure introduces a distinct set of challenges.
Kubernetes Integration: Hostname-Based Topologies
Deploying Valkey on Kubernetes requires a strict hostname-based setup to prevent the instability caused by ephemeral, constantly changing Pod IP addresses.
To successfully run Valkey HA on Kubernetes:
- Nodes must bind to specific hostnames, and replicas must target the primary using its Fully Qualified Domain Name (FQDN).
- The Sentinel quorum must explicitly process hostnames via
sentinel resolve-hostnames yesandsentinel announce-hostnames yes. - To prevent TCP IP leakage during auto-discovery, database nodes must define
replica-announce-ipwith their FQDN.
How the valkey-operator automates this:
The operator configures Kubernetes deployments to handle this translation seamlessly. For Valkey, it sets the `bind`, `replicaof`, and `replica-announce-ip` configuration options using the container’s stable hostname. For Sentinel, it sets `sentinel monitor` with the primary’s hostname instead of its IP and enables hostname resolution. This keeps Kubernetes deployments stable even when Pods are rescheduled and receive new IPs.
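Put together, the hostname requirements correspond to configuration along these lines (the FQDNs below are illustrative, not the operator's actual generated files):

```conf
# valkey.conf on a replica
bind valkey-1.valkey-endpoints.mymodel.svc.cluster.local
replicaof valkey-0.valkey-endpoints.mymodel.svc.cluster.local 6379
replica-announce-ip valkey-1.valkey-endpoints.mymodel.svc.cluster.local

# sentinel.conf on any Sentinel
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster valkey-0.valkey-endpoints.mymodel.svc.cluster.local 6379 2
```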
Conclusion
Achieving true High Availability with Valkey goes far beyond simply initializing a few replica nodes. It requires a delicate orchestration of quorum consensus, precise scaling routines, and strict hostname-based discovery to survive the unpredictable nature of modern environments.
By understanding how replication states and failover mechanics interact and by leveraging tools like Canonical’s valkey-operator to automate them, infrastructure teams can abstract away the manual complexities of Sentinel management. The result is a data tier that remains as resilient as it is fast.
Further Reading
For more detailed information on Valkey HA architecture and commands, consult the official documentation:
For more information about Canonical’s valkey-operator charm, visit the GitHub repository.