Managing nodes when an older Rook version is in use might leave Ceph in an unhealthy state where mon pods are not rescheduled

After following the steps to remove a node from the hosts and then add a new one, you might face the following scenario when the installation is using an old Rook version (e.g. Ceph 14.2.0 installed with Rook 1.0.4).

NOTE: This specific scenario can occur if you are using Rook 1.0.4, but it is not reproducible with the latest versions. It is recommended to upgrade the Rook version in use, as described in the kURL docs.

First, check whether Ceph is unhealthy and reports an out of quorum message:

By running the command kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status.
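
A sketch of that check (the output itself is not reproduced here, and the exact warning wording varies by Ceph version):

```bash
# Query the Ceph status through the Rook operator deployment
kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status

# In this scenario the health section reports HEALTH_WARN with a message
# indicating that one of the mons is out of quorum.
```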

Then, check whether the Rook mon pod for the purged node is still present:

By running kubectl -n rook-ceph get pod -l app=rook-ceph-mon.
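
A sketch of that check, assuming the mon that lived on the purged node can no longer be scheduled:

```bash
# List the Rook mon pods
kubectl -n rook-ceph get pod -l app=rook-ceph-mon

# The mon pod that belonged to the purged node typically shows up
# in Pending state because there is no longer a node to place it on.
```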

To work around this scenario and bring Ceph back to a healthy state, you can perform the manual steps described by the Rook maintainers in Mon is never rescheduled · Issue #2262 · rook/rook · GitHub. Please make sure you review the following guidance and steps.

How to resolve the Ceph warning state in order to upgrade

In this case, you can verify that the info about the purged node is still present in the mapping by running kubectl -n rook-ceph describe configmaps rook-ceph-mon-endpoints | grep <node-name>.
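
A sketch of that check is below; the jsonpath variant assumes the node mapping is stored under the mapping key of the configmap data, which is the layout used by Rook 1.x:

```bash
# Check whether the purged node is still referenced in the endpoints configmap
kubectl -n rook-ceph describe configmaps rook-ceph-mon-endpoints | grep <node-name>

# Alternatively, print the full mon-to-node mapping
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.mapping}'
```

If the purged node still shows up there, you can: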

  1. Stop the Rook Operator by running: kubectl -n rook-ceph scale --replicas=0 deployment.apps/rook-ceph-operator
  2. Edit the configmap rook-ceph-mon-endpoints to (carefully) remove the purged node info from the mapping with the command kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints. For example:

In this example, we purged the node example-ubuntu-2204-node-c, but it still appears in the mapping:

To work around the scenario and bring Ceph back to a healthy state, we will remove the entry "c":{"Name":"example-ubuntu-2204-node-c","Hostname":"example-ubuntu-2204-node-c","Address":"10.154.0.21"} from the mapping.
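
For illustration only, the mapping field in the configmap might look roughly like the excerpt below. Only the "c" entry (its name and address) comes from this example; the "a" and "b" entries, their addresses, and the surrounding "node" structure are hypothetical placeholders based on the Rook 1.x layout:

```yaml
# Hypothetical excerpt of: kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints
data:
  mapping: '{"node":{"a":{"Name":"example-ubuntu-2204-node-a","Hostname":"example-ubuntu-2204-node-a","Address":"10.154.0.19"},"b":{"Name":"example-ubuntu-2204-node-b","Hostname":"example-ubuntu-2204-node-b","Address":"10.154.0.20"},"c":{"Name":"example-ubuntu-2204-node-c","Hostname":"example-ubuntu-2204-node-c","Address":"10.154.0.21"}}}'
  # Delete the "c" entry above so that only the nodes that still exist remain in the mapping.
```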

  3. Delete the mon pod in Pending state. You can find its name by running kubectl -n rook-ceph get pod -l app=rook-ceph-mon and remove it by running kubectl -n rook-ceph delete pod <mon-pod-name-in-pending-state>
  4. Then, scale the Rook Operator back up by running: kubectl -n rook-ceph scale --replicas=1 deployment.apps/rook-ceph-operator (the full command sequence is sketched below)
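
Put together, the workaround is a short sequence of commands; this is only a sketch, and <mon-pod-name-in-pending-state> must be replaced with the actual pod name found in step 3:

```bash
# 1. Stop the Rook operator so it does not interfere with the manual changes
kubectl -n rook-ceph scale --replicas=0 deployment.apps/rook-ceph-operator

# 2. Carefully remove the purged node entry from the mapping
kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints

# 3. Delete the mon pod that is stuck in Pending state
kubectl -n rook-ceph delete pod <mon-pod-name-in-pending-state>

# 4. Scale the Rook operator back up
kubectl -n rook-ceph scale --replicas=1 deployment.apps/rook-ceph-operator
```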

After performing the above fix, please ensure that Ceph comes back to a healthy state:

  1. Check that all Rook mon pods are Running with kubectl -n rook-ceph get pod -l app=rook-ceph-mon

  2. Verify the Ceph status with kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status to ensure that it is healthy (see the sketch below)
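
A sketch of this verification, with the expected result summarized in comments rather than reproduced verbatim:

```bash
# All mon pods should now be in Running state
kubectl -n rook-ceph get pod -l app=rook-ceph-mon

# Ceph should report a healthy state again, with no mon out of quorum
kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status
```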
