Troubleshoot OpenEBS to Rook-Ceph storage migrations

In kURL installations, the migration from OpenEBS (non-HA) storage to Rook-Ceph (HA) is automatically handled by the kURL EKCO operator. This migration process is triggered when:

  • At least 3 nodes are available (meeting the minimumNodeCount requirement)
  • The user accepts the migration prompt during the third node join operation
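Before troubleshooting further, a quick sanity check that the node count requirement is actually met can save time:

kubectl get nodes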

If these conditions are met but the migration doesn’t complete successfully (e.g., no monitor or OSD pods are created), follow these troubleshooting steps:

  1. Check for errors in the Rook operator logs:
kubectl logs deploy/rook-ceph-operator -n rook-ceph
  2. Check for errors in the EKCO operator logs:
kubectl logs deploy/ekc-operator -n kurl | grep rook
  3. Describe the CephCluster CR to check for errors:
kubectl describe cephclusters/rook-ceph -n rook-ceph

For example, describing the CephCluster CR here shows a reconcile error caused by the mon pods failing to reach quorum:

Events:
  Type     Reason           Age                 From                          Message
  ----     ------           ----                ----                          -------
  Warning  ReconcileFailed  24h (x10 over 45h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum b: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum
  Warning  ReconcileFailed  23h (x15 over 44h)  rook-ceph-cluster-controller  failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to check mon quorum a: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum

From this we can identify the rook-ceph mon pods and troubleshoot further to find out why they are restarting.
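To list the mon pods, filter on the mon label that Rook applies (assuming the default app=rook-ceph-mon label):

kubectl get pods -n rook-ceph -l app=rook-ceph-mon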

NAME                                            READY   STATUS    RESTARTS      AGE
rook-ceph-mon-b-b588cf9d6-skmn9                 2/2     Running   6 (23h ago)   45h
rook-ceph-mon-c-5b94dc7558-rhfbr                2/2     Running   6 (23h ago)   45h

In this case, describing the pod shows an OOM event that prevents the mon pods from reaching quorum.
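For example, using one of the mon pod names from the listing above:

kubectl describe pod rook-ceph-mon-b-b588cf9d6-skmn9 -n rook-ceph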

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

We will have to update the Rook configuration to give the mon pods higher resource limits.
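As a minimal sketch, assuming the CephCluster is named rook-ceph in the rook-ceph namespace (as shown above), the mon limits can be raised through the resources.mon section of the CephCluster CR; the memory values below are only an illustration, so pick limits appropriate for your nodes:

kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"resources":{"mon":{"requests":{"memory":"1Gi"},"limits":{"memory":"2Gi"}}}}}'

Keep in mind that in kURL installations the CephCluster spec may be re-applied by the installer on upgrades, so the same change may also need to be reflected in the installer's Rook add-on settings.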

Reference: CephCluster CRD - Rook Ceph Documentation