Managing nodes when an older Rook version is in use might leave Ceph in an unhealthy state where mon pods are not rescheduled

After following the steps to remove a node from the hosts and then add a new one, you might face the following scenario when the installation is using an old Rook version (e.g. Ceph 14.2.0 installed with Rook 1.0.4).

NOTE: This specific scenario can occur if you are using Rook 1.0.4, but it is not reproducible with the latest versions. It is recommended to upgrade the Rook version in use, as described in the kURL docs.

First, check whether Ceph is unhealthy and reports an out of quorum message:

By running the command kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status.
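
A sketch of that check (the output itself is not reproduced here, and the exact warning wording varies by Ceph version):

```bash
# Query the Ceph status through the Rook operator deployment
kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status

# In this scenario the health section reports HEALTH_WARN with a message
# indicating that one of the mons is out of quorum.
```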

Then, check whether the Rook mon pod for the purged node is still present:

By running kubectl -n rook-ceph get pod -l app=rook-ceph-mon.
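
A sketch of that check, assuming the mon that lived on the purged node can no longer be scheduled:

```bash
# List the Rook mon pods
kubectl -n rook-ceph get pod -l app=rook-ceph-mon

# The mon pod that belonged to the purged node typically shows up
# in Pending state because there is no longer a node to place it on.
```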

To work around this scenario and bring Ceph back to a healthy state, you can perform the manual steps described by the Rook maintainers in Mon is never rescheduled · Issue #2262 · rook/rook · GitHub. Please make sure you review the following guidance and steps.

How to resolve the Ceph warning state in order to upgrade

In this case, you can verify that the info about the purged node is still present in the mapping by running kubectl -n rook-ceph describe configmaps rook-ceph-mon-endpoints | grep <node-name>.
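
A sketch of that check is below; the jsonpath variant assumes the node mapping is stored under the mapping key of the configmap data, which is the layout used by Rook 1.x:

```bash
# Check whether the purged node is still referenced in the endpoints configmap
kubectl -n rook-ceph describe configmaps rook-ceph-mon-endpoints | grep <node-name>

# Alternatively, print the full mon-to-node mapping
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.mapping}'
```

If the purged node still shows up there, you can: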

  1. Stop the Rook Operator by running: kubectl -n rook-ceph scale --replicas=0 deployment.apps/rook-ceph-operator
  2. Edit the configmap rook-ceph-mon-endpoints to (carefully) remove the purged node info from the mapping with the command kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints. For example:

In this example, we purged the node example-ubuntu-2204-node-c, but it still appears in the mapping:

To work around the scenario and bring Ceph back to a healthy state, we will remove the entry "c":{"Name":"example-ubuntu-2204-node-c","Hostname":"example-ubuntu-2204-node-c","Address":"10.154.0.21"} from the mapping.
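
For illustration only, the mapping field in the configmap might look roughly like the excerpt below. Only the "c" entry (its name and address) comes from this example; the "a" and "b" entries, their addresses, and the surrounding "node" structure are hypothetical placeholders based on the Rook 1.x layout:

```yaml
# Hypothetical excerpt of: kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints
data:
  mapping: '{"node":{"a":{"Name":"example-ubuntu-2204-node-a","Hostname":"example-ubuntu-2204-node-a","Address":"10.154.0.19"},"b":{"Name":"example-ubuntu-2204-node-b","Hostname":"example-ubuntu-2204-node-b","Address":"10.154.0.20"},"c":{"Name":"example-ubuntu-2204-node-c","Hostname":"example-ubuntu-2204-node-c","Address":"10.154.0.21"}}}'
  # Delete the "c" entry above so that only the nodes that still exist remain in the mapping.
```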

  3. Delete the mon pod in Pending state. You can find its name by running kubectl -n rook-ceph get pod -l app=rook-ceph-mon and remove it by running kubectl -n rook-ceph delete pod <mon-pod-name-in-pending-state>
  4. Then, scale the Rook Operator back up by running: kubectl -n rook-ceph scale --replicas=1 deployment.apps/rook-ceph-operator (the full command sequence is sketched below)
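
Put together, the workaround is a short sequence of commands; this is only a sketch, and <mon-pod-name-in-pending-state> must be replaced with the actual pod name found in step 3:

```bash
# 1. Stop the Rook operator so it does not interfere with the manual changes
kubectl -n rook-ceph scale --replicas=0 deployment.apps/rook-ceph-operator

# 2. Carefully remove the purged node entry from the mapping
kubectl -n rook-ceph edit configmaps rook-ceph-mon-endpoints

# 3. Delete the mon pod that is stuck in Pending state
kubectl -n rook-ceph delete pod <mon-pod-name-in-pending-state>

# 4. Scale the Rook operator back up
kubectl -n rook-ceph scale --replicas=1 deployment.apps/rook-ceph-operator
```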

After performing the above fix, please ensure that Ceph comes back to a healthy state:

  1. Check that all Rook mon pods are Running with kubectl -n rook-ceph get pod -l app=rook-ceph-mon

  2. Verify the Ceph status with kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status to ensure that it is healthy (see the sketch below)
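
A sketch of this verification, with the expected result summarized in comments rather than reproduced verbatim:

```bash
# All mon pods should now be in Running state
kubectl -n rook-ceph get pod -l app=rook-ceph-mon

# Ceph should report a healthy state again, with no mon out of quorum
kubectl -n rook-ceph exec deployment.apps/rook-ceph-operator -- ceph status
```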
