Skip to main content

Replication and failover architecture

cloudnative-mysql builds a primary-replica topology with Percona Server for MySQL GTID replication. The operator owns topology policy, while each instance manager owns local mysqld role changes. This split keeps primary changes declarative: the operator writes the intended primary in Cluster status, and the pods converge themselves toward that target.

Replication model

Replicas are created from a physical XtraBackup stream taken from the current primary. After prepare and copy-back, the instance manager configures MySQL with GTID auto-positioning so it follows the primary from the restored GTID point.

MySQL transport TLS is rendered for every instance. Replication uses the per-instance certificate material and a dedicated replication account requiring X509. The application-facing require_secure_transport setting remains a user choice through spec.mysql.parameters.

Semi-synchronous replication can be enabled with:

spec:
instances: 3
minSyncReplicas: 1
maxSyncReplicas: 1
mysql:
semiSync:
enabled: true
timeoutMillis: 1000

Semi-sync improves failover durability when configured so acknowledged commits reach at least one replica. Without that guarantee, an acknowledged write that only existed on a lost primary can be lost before a replica or the object store sees it.

Semi-sync self-healing (data durability)

With semi-sync on, the primary blocks each commit until minSyncReplicas replicas acknowledge it. If a synchronous replica becomes unhealthy, that floor can stall writes. spec.mysql.semiSync.dataDurability decides how the operator responds:

spec:
minSyncReplicas: 2
mysql:
semiSync:
enabled: true
dataDurability: preferred # or "required"

With preferred (the default), the operator keeps lowering the primary's required acknowledgement count (rpl_semi_sync_source_wait_for_replica_count) to the number of healthy replicas, never below one, so writes keep flowing during a replica outage. It raises the count back to minSyncReplicas as replicas recover. This favours availability over strict durability.

With required, the count stays pinned to minSyncReplicas. When fewer healthy replicas can acknowledge, writes block until timeoutMillis elapses and replication falls back to async. This favours durability over availability.

The operator applies the change on the primary over the mTLS control API during its steady-state reconcile. It is a runtime adjustment only; the static my.cnf floor does not change.

Liveness isolation check

Each cluster-managed instance keeps probing the Kubernetes API server. If it cannot reach the API server for 30 seconds it treats itself as network-isolated and fails its liveness probe, so the kubelet restarts the container. This is a last-resort guard against a partitioned primary the operator can no longer coordinate, which is a split-brain risk. The check runs locally inside the instance, so a genuinely isolated node still restarts itself even while it is unreachable from the control plane.

Fencing an instance

Fencing takes a single instance out of service without deleting it or its data. Use the plugin:

kubectl cloudnative-mysql fence on <cluster> <cluster>-2
kubectl cloudnative-mysql fence off <cluster> <cluster>-2

The operator drops the Pod from every routing Service (rw, ro, r, and any user-defined ones) by clearing its routable label, and records it under status.fencedInstances. The instance's in-Pod reconciler reads that list and holds the instance read-only, so it accepts no writes and its continuous archiver stands down. A fenced instance is also skipped as a failover candidate, so the operator never promotes it. Unfencing reverses all of this and the instance rejoins normal routing and role reconciliation.

Fencing the primary stops writes for the whole cluster, because the rw Service loses its only endpoint. That is deliberate: use it to freeze an instance for inspection or maintenance rather than as a failover trigger.

Primary lease fencing

Each cluster creates a standard coordination.k8s.io/v1 Lease named {cluster}-primary. Before an instance promotes itself to primary, it must acquire this lease. While it is primary, it renews the lease every 30 seconds. When it demotes or shuts down, it releases (deletes) the lease.

During automatic failover, the operator checks whether the old primary's lease is still held before promoting a candidate. If the lease is still active, failover waits for the lease TTL (15 seconds) before proceeding. This prevents the operator from promoting a new primary while the old one might still be accepting writes on a network-partitioned node.

Combined with Pod deletion, this gives two independent timeouts before the old primary can be replaced: the lease TTL and the kubelet Pod termination grace period. The split-brain window narrows to whichever expires first.

Feature gate

spec:
enablePrimaryLease: false

The feature is on by default. Set it to false to skip all lease management. This is safe for single-instance or test clusters where the extra API call is unnecessary.

Lifecycle

EventLease action
Instance promotes itselfAcquire (create or take over the lease, set holderIdentity)
Instance is already primaryRenew renewTime
Instance demotes or is fencedDelete the lease
Instance shuts downDelete the lease (graceful shutdown handler)
Lease acquisition failsInstance does not promote; retries on next reconcile
API server unreachableLease cannot be renewed. The instance's liveness probe eventually fails (isolation detector) and kubelet restarts the container.

Failover interaction

The lease adds a 15-second wait to failover when the old primary is still renewing. The operator sequence becomes:

  1. Detect primary failure, wait for failoverDelay.
  2. Select a candidate.
  3. Check the old primary's lease: if still held, requeue for 15 seconds.
  4. Fence the old primary Pod.
  5. Set targetPrimary and let the candidate promote.

If the old primary's Pod is deleted quickly (step 4), the lease is also released because the in-Pod shutdown handler fires. In that case the wait in step 3 is negligible. The lease guard is a backstop for when the Pod takes longer than 15 seconds to terminate.

Monitoring

The lease exists in the same namespace as the Cluster. Inspect it:

kubectl get lease <cluster>-primary -o yaml

The holderIdentity, renewTime, and leaseTransitions fields show which instance holds the lease and when it last renewed. An expired lease (renewTime older than 15 seconds with a different holderIdentity) means the holder stopped renewing and the lease is safe to take.

Deleting a cluster

Deleting a Cluster tears down its instances. Like CloudNativePG, the operator does not hold a finalizer to block the delete: kubectl delete cluster <cluster> removes the Pods immediately, and owned resources are garbage-collected via owner references.

To protect the data, rely on the standard Kubernetes mechanisms — set a Retain reclaim policy on the StorageClass (or retain the PVCs) so the volumes survive the Cluster deletion, and use RBAC to restrict who can delete Clusters.

Role services

cloudnative-mysql creates three default Services:

  • <cluster>-rw: selects the current primary.
  • <cluster>-ro: selects ready replicas.
  • <cluster>-r: selects any ready instance.

Default Services can be disabled by name (the rw service cannot be disabled):

spec:
managed:
services:
disabledDefaultServices:
- ro

Customising default services

A shared template is merged onto the three default Services, letting you change their type and add labels and annotations. The operator always owns the selector and the mysql:3306 port.

spec:
managed:
services:
template:
metadata:
labels:
app.kubernetes.io/part-of: my-app
annotations:
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
type: LoadBalancer

Additional services

You can declare extra Services routed to a role (rw, ro, or r). Each entry is rendered as <cluster>-<name> and carries its own template. updateStrategy: patch (default) merges your template onto the role defaults; replace swaps them entirely, keeping only the operator-owned selector, ports, and owner-tracking labels. Additional service names must be unique and must not collide with the default rw/ro/r names.

spec:
managed:
services:
additional:
- name: mysql-lb
selectorType: rw
serviceTemplate:
spec:
type: LoadBalancer
- name: mysql-internal-read
selectorType: ro
updateStrategy: replace
serviceTemplate:
metadata:
labels:
pool: reporting
spec:
type: ClusterIP

The user-customisable spec fields are type, externalTrafficPolicy, sessionAffinity, loadBalancerSourceRanges, externalName, and healthCheckNodePort. The selector, ports, and clusterIP are operator-managed and cannot be overridden. Per-instance headless Services remain internal ClusterIP: None and are not user-configurable.

Service routing is driven by Pod labels. The operator updates labels only after the database role change is considered safe, so client traffic follows the observed MySQL topology.

Dynamic role reconciliation

Every instance starts read only. The in-pod role reconciler watches the owning Cluster and compares its own name with status.targetPrimary and status.currentPrimary.

If the pod is the target primary, it drains replication state, promotes itself, clears read-only mode, and writes status.currentPrimary.

If the pod is not the target primary, it stays or becomes read only and follows the current primary. A diverged instance is kept read only and is not silently re-cloned over its retained PVC.

Planned switchover

A planned switchover promotes a named healthy replica. cloudnative-mysql models this like CloudNativePG: the request is a status transition rather than a spec change. The normal trigger is to set status.targetPrimary to a replica name.

The operator then:

  1. validates that the target exists, is ready, is a replica, and has healthy replication threads;
  2. checks that the target GTID set contains the old primary's observed GTID set;
  3. demotes the old primary while it is still reachable;
  4. lets the target promote itself;
  5. reconfigures remaining replicas to follow the new primary;
  6. updates role labels, Services, currentPrimary, and conditions.

spec.maxSwitchoverDelay bounds how long the target may take to catch up before the switchover is aborted and surfaced as blocked.

Automatic failover

Automatic failover begins when the established primary is unreachable or not healthy. The operator records when the primary first started failing and waits for spec.failoverDelay. A value of 0 means immediate failover.

After the delay, the operator selects a candidate only if it can prove the move is safe enough:

  • the candidate must be ready and a replica;
  • replication SQL state must be healthy and free of a last error;
  • GTID sets must be comparable;
  • the chosen candidate must contain the best known GTID history among candidates.

If the GTID sets are divergent or no safe candidate exists, failover is blocked instead of risking data loss.

During failover, the old primary is fenced by removing its primary role and deleting its Pod while retaining the PVC. The promoted replica becomes currentPrimary, and surviving replicas follow it.

Former primary rejoin

When a fenced or crashed primary returns, it does not automatically become primary again. It boots read only, observes Cluster status, and attempts to follow the promoted primary.

If its GTID set is contained in the new primary's GTID set, it can safely rejoin as a replica. If it contains errant transactions that the promoted primary never saw, cloudnative-mysql marks it diverged and keeps it out of service. The retained PVC is left for deliberate human recovery instead of being destroyed.

Status and events

Useful status fields during topology changes:

  • currentPrimary: the instance currently accepted as primary.
  • targetPrimary: the desired primary.
  • currentPrimaryTimestamp: when the current primary became primary.
  • targetPrimaryTimestamp: when a primary change was requested.
  • primaryFailingSince: when the current primary became unhealthy.
  • divergedInstances: instances excluded because their GTID set is unsafe.
  • fencedInstances: instances fenced out of routing and held read-only.
  • gtidExecutedByInstance: last observed GTID state per instance.

Watch Events for phase transitions such as switchover, failover, fencing, and blocked operations. The operator also reports Ready, Progressing, and Degraded conditions on the Cluster.

Operational notes

  • Use three instances for meaningful automatic failover.
  • Prefer semi-sync when the recovery objective requires acknowledged writes to survive primary loss.
  • Keep failover delay low for availability, but high enough to avoid promoting during transient node or network blips.
  • Do not manually write to a replica or recovered former primary. Errant GTIDs intentionally block automatic rejoin.
  • Do not delete retained PVCs until you have decided the data is no longer needed for diagnosis or manual recovery.

Verification coverage

Unit tests cover GTID parsing and containment, candidate selection, switchover validation, failover delay, blocked failover, role-label updates, and divergence detection. Kind e2e coverage validates planned switchover, automatic failover by primary Pod deletion, service rerouting, writes after promotion, and compatible former-primary rejoin.