In kURL installations, the migration from OpenEBS (non-HA) storage to Rook-Ceph (HA) is automatically handled by the kURL EKCO operator. This migration process is triggered when:
- At least 3 nodes are available (meeting the `minimumNodeCount` requirement)
- The user accepts the migration prompt during the third node join operation
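For illustration, a kURL installer spec that enables this migration path might look like the following sketch; the name and version fields are placeholders, and `minimumNodeCount` is the setting referenced above:

```yaml
apiVersion: cluster.kurl.sh/v1beta1
kind: Installer
metadata:
  name: ha-storage          # placeholder name
spec:
  openebs:
    version: latest         # placeholder; pin a concrete version in practice
    isLocalPVEnabled: true
    localPVStorageClassName: local
  rook:
    version: latest         # placeholder; pin a concrete version in practice
    minimumNodeCount: 3     # EKCO starts the migration once the cluster reaches 3 nodes
```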
If these conditions are met but the migration doesn’t complete successfully (e.g., no monitor or OSD pods are created), follow these troubleshooting steps:
- Check for errors in the Rook operator logs:

```bash
kubectl logs deploy/rook-ceph-operator -n rook-ceph
```
- Check for errors in the EKCO operator logs:

```bash
kubectl logs deploy/ekc-operator -n kurl | grep rook
```
- Describe the `CephCluster` custom resource to check for errors:

```bash
kubectl describe cephclusters/rook-ceph -n rook-ceph
```
For example, describing the `CephCluster` CR here shows a reconcile error: the mon pods are failing to reach quorum.
```
Events:
  Type     Reason           Age                 From                          Message
  ----     ------           ----                ----                          -------
  Warning  ReconcileFailed  24h (x10 over 45h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum b: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum
  Warning  ReconcileFailed  23h (x15 over 44h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum a: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum
```
From these events we can identify the affected rook-ceph mon pods and troubleshoot further to find out why they are restarting.
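Assuming the standard Rook label for monitor pods, they can be listed with:

```bash
# list the Ceph monitor pods (app=rook-ceph-mon is the standard Rook mon label)
kubectl get pods -n rook-ceph -l app=rook-ceph-mon
```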
```
NAME                               READY   STATUS    RESTARTS      AGE
rook-ceph-mon-b-b588cf9d6-skmn9    2/2     Running   6 (23h ago)   45h
rook-ceph-mon-c-5b94dc7558-rhfbr   2/2     Running   6 (23h ago)   45h
```
In this case, describing the pod shows an OOMKilled event that prevents the mon pods from reaching quorum.
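For example, using the first pod name from the listing above:

```bash
kubectl describe pod rook-ceph-mon-b-b588cf9d6-skmn9 -n rook-ceph
```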
```
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
```
We will have to update the Rook cluster settings with higher resource limits for the mon pods.
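As a sketch, assuming the standard Rook `CephCluster` resources API, the mon limits can be raised with a direct patch; the memory values below are illustrative only:

```bash
# raise the mon memory request/limit; adjust the values to your environment
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"resources":{"mon":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}}}'
```

The Rook operator should then redeploy the mon pods with the new limits on its next reconcile.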