Troubleshooting

This page is organized by symptom and, for each one, points to the first places to inspect. cloudnative-mysql surfaces most issues through Cluster and Backup status, Kubernetes Events, and the instance-manager logs.

First commands

kubectl cloudnative-mysql status <cluster>
kubectl cloudnative-mysql logs <cluster>
kubectl describe cluster <cluster>
kubectl get events --sort-by=.lastTimestamp
kubectl get backup
kubectl get scheduledbackup

Operator logs:

kubectl logs -n cloudnative-mysql-system deployment/cloudnative-mysql-controller-manager -c manager

Instance logs:

kubectl logs pod/<cluster>-1 -c manager

Cluster is not Ready

Check:

kubectl cloudnative-mysql status <cluster>
kubectl cloudnative-mysql logs <cluster>
kubectl describe pod <pod>

Common causes:

cert-manager has not produced TLS Secrets yet;
PVC is Pending due to storage class or capacity;
image pull failed;
unsupported Cluster shape is blocked by the controller;
instance-manager /status endpoint is unavailable;
initdb, restore, or join init container failed.

Look at status.phase, status.phaseReason, and Events first.
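
Both status fields can be read in a single call with kubectl's jsonpath output. The following is a minimal sketch, not part of the plugin: the helper name cluster_phase is hypothetical, and it assumes the Cluster and its Events are in the current namespace.

```shell
# Hypothetical helper: print status.phase and status.phaseReason for a Cluster,
# then the Events whose involved object has the same name as the Cluster.
cluster_phase() {
  kubectl get cluster "$1" \
    -o jsonpath='{.status.phase}{"\t"}{.status.phaseReason}{"\n"}'
  # Filter Events by involved object name and sort them by last timestamp.
  kubectl get events \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

Call it as cluster_phase my-cluster. Events for individual Pods are attached to the Pod name, so kubectl describe pod is still needed for those.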

Replica will not join

Check the replica init container logs:

kubectl logs pod/<replica-pod> -c initdb

Common causes:

primary is not Ready yet;
mTLS material is missing or invalid;
source manager endpoint is unreachable;
XtraBackup stream failed;
target PVC already contains incompatible data;
MySQL version/image is incompatible with the source backup.

Replica provisioning uses XtraBackup over the existing instance-manager mTLS port. Network policies or service DNS issues can break the join path.

Primary change is stuck

Inspect:

kubectl cloudnative-mysql status <cluster>

Common causes:

target replica is not healthy;
target GTID set does not contain the old primary's observed GTID set;
spec.maxSwitchoverDelay expired;
old primary could not be demoted or fenced;
a former primary returned with errant transactions.

Check status.currentPrimary, status.targetPrimary, status.targetPrimaryTimestamp, status.divergedInstances, and Events.
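
The GTID containment cause can be verified by hand with MySQL's built-in GTID_SUBSET(set1, set2) function, which returns 1 when every GTID in set1 is also present in set2. The sketch below only builds the SQL statement. The helper name is hypothetical, and how you collect the two GTID sets and connect to an instance (client, credentials, container) depends on your deployment and is not covered here.

```shell
# Hypothetical helper: build a query that checks whether the old primary's
# GTID set ($1) is fully contained in the target replica's GTID set ($2).
gtid_subset_sql() {
  # GTID_SUBSET(a, b) returns 1 when every GTID in a is also present in b.
  printf "SELECT GTID_SUBSET('%s', '%s') AS contained;\n" "$1" "$2"
}
```

For example, gtid_subset_sql '<old-primary-gtid-set>' '<target-gtid-set>' prints a statement you can run with any MySQL client. A result of 0 means the target is missing transactions the old primary had observed.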

Automatic failover did not happen

cloudnative-mysql blocks failover when it cannot prove that a candidate is safe to promote.

Check:

kubectl cloudnative-mysql status <cluster>

Likely explanations:

failover delay has not elapsed;
Kubernetes still reports the primary Pod as Ready;
no ready replica exists;
replication SQL state is unhealthy;
GTID sets are incomparable or divergent;
the only candidate is being deleted.

A temporary failure of the manager status endpoint should not, on its own, trigger failover while Kubernetes still reports the primary Pod as Ready.
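
To see what Kubernetes currently reports for the primary Pod, you can print its Ready condition directly. This is a minimal sketch: the helper name primary_ready is hypothetical, and it relies on the cluster and role labels listed under Useful labels.

```shell
# Hypothetical helper: print each primary Pod of a Cluster together with the
# status of its Kubernetes Ready condition (True, False, or Unknown).
primary_ready() {
  kubectl get pods \
    -l "mysql.cloudnative-mysql.io/cluster=$1,mysql.cloudnative-mysql.io/role=primary" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}
```

If primary_ready my-cluster prints True for the primary, the "Kubernetes still reports the primary Pod as Ready" explanation applies.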

Backup failed

Inspect:

kubectl describe backup <backup>
kubectl get job <backup>-backup
kubectl logs job/<backup>-backup

Common causes:

missing object-store configuration;
missing or invalid S3 credentials;
no healthy backup source;
source instance-manager stream failed;
XtraBackup failed;
object-store upload failed.

The controller writes the backup phase, error, Job name, selected source instance, destination path, and conditions into Backup status.

ScheduledBackup did not create a Backup

Inspect:

kubectl describe scheduledbackup <scheduledbackup>
kubectl get backup -l mysql.cloudnative-mysql.io/scheduled-backup=<scheduledbackup>

Common causes:

spec.suspend: true;
invalid six-field cron expression;
a child Backup is still running, so the concurrency guard is deferring;
deterministic Backup name collision with a non-owned Backup;
first scheduled time has not arrived and immediate is false.

The schedule is a six-field cron expression that includes a seconds field.
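
A quick way to catch a five-field expression pasted from a standard crontab is to count the fields before applying the manifest. The helper below is a hypothetical sketch that only counts whitespace-separated fields; it does not validate the values in each field.

```shell
# Hypothetical helper: count the whitespace-separated fields of a schedule.
# The body runs in a subshell so that disabling globbing stays local.
cron_field_count() (
  set -f        # keep '*' from expanding to file names
  set -- $1     # split the expression on whitespace
  echo "$#"
)
```

cron_field_count '0 0 2 * * *' prints 6, while cron_field_count '0 2 * * *' prints 5. Assuming the seconds field comes first, which this page does not confirm, the first expression would fire at 02:00:00 every day.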

Continuous archiving is degraded

Inspect:

kubectl get cluster <cluster> -o jsonpath='{.status.continuousArchiving}'
kubectl describe cluster <cluster>

Common causes:

object-store endpoint or credentials are wrong;
primary cannot upload objects;
active binlog has not rotated yet;
object-store outage;
archiver cannot update manifests or _index.json;
purge guard is detecting lag.

PITR depends on the archive index and manifests, not just raw binlog objects.

PITR target is unsatisfiable

Common causes:

recovery target is before the base backup anchor;
target GTID or target time is beyond archived coverage;
_index.json is missing or stale;
required binlog segment or manifest was deleted;
archive has a forked or incoherent timeline.

Prefer targetGTID for exact recovery boundaries. targetTime depends on binlog event timestamps and server clocks.

Object-store data remains after deleting Backup

This is expected today. Deleting a Backup object does not delete backup.xbstream or metadata.json from the object store. Remote cleanup is a planned finalizer/retention feature.

Useful labels

mysql.cloudnative-mysql.io/cluster=<cluster>
mysql.cloudnative-mysql.io/instance=<instance>
mysql.cloudnative-mysql.io/role=primary|replica
mysql.cloudnative-mysql.io/scheduled-backup=<scheduledbackup>

These labels make it easier to list Pods, PVCs, Services, and generated Backups for one Cluster or schedule.
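
As a minimal sketch of label-based listing, the hypothetical helper below uses a standard kubectl label selector to show the Pods, PVCs, and Services that belong to one Cluster in the current namespace.

```shell
# Hypothetical helper: list the Pods, PVCs, and Services labeled for one Cluster.
cluster_resources() {
  kubectl get pods,pvc,svc -l "mysql.cloudnative-mysql.io/cluster=$1"
}
```

Call it as cluster_resources my-cluster. Selectors can be combined with commas, for example adding mysql.cloudnative-mysql.io/role=replica to list only replicas.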
