How to recover a Rook-Ceph cluster with missing files under /var/lib/rook/exporter/

Once a Rook-Ceph cluster is running, it keeps several files under /var/lib/rook/exporter/ on the host. If someone deletes those files accidentally, some of your pods in the rook-ceph namespace will end up in CrashLoopBackOff.

For example, rook-ceph-mgr-b-*** may be crashing, while rook-ceph-osd-0-*** and rook-ceph-mon-a-*** are not ready.

When you use kubectl to describe those failing pods, you will find this error:

admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
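To confirm you are hitting this issue, describe one of the failing pods. The pod name suffix below is a placeholder; substitute the actual name from your cluster:

```shell
# List the pods in the rook-ceph namespace to find the failing ones
kubectl get pod -n rook-ceph

# Describe a failing pod; replace the suffix with your actual pod name
kubectl describe pod rook-ceph-mon-a-abcde -n rook-ceph
```

The admin_socket error typically appears in the pod's event list, reported by the liveness probe.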

You can recover those pods and the missing files under /var/lib/rook/exporter/ by following the instructions below.

Recovery Steps

First, run kubectl get deployment -n rook-ceph to list all the rook-ceph-osd-* deployments.

For example, you may have:

rook-ceph-osd-0                             0/1     1            0           3h36m

Then you can run:

kubectl scale deployment -n rook-ceph rook-ceph-osd-0 --replicas 0

If you have multiple OSD nodes, make sure all of the rook-ceph-osd-* deployments are scaled to 0 first.
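With multiple OSDs, scaling each deployment by hand is tedious. Here is a sketch of a loop that scales every rook-ceph-osd-* deployment to zero, assuming bash and a kubectl context pointing at the right cluster:

```shell
# Scale every OSD deployment in the rook-ceph namespace down to 0 replicas
for d in $(kubectl get deployment -n rook-ceph -o name | grep rook-ceph-osd); do
  kubectl scale -n rook-ceph "$d" --replicas 0
done

# Verify: every rook-ceph-osd-* deployment should now show 0/0
kubectl get deployment -n rook-ceph | grep rook-ceph-osd
```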

After that, run kubectl get pod -n rook-ceph | grep rook-ceph-mon-a to get the name of the running rook-ceph-mon-a pod.

Use the pod name from the previous command to restart the rook-ceph-mon-a pod:

kubectl delete pod rook-ceph-mon-a-******** -n rook-ceph
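Rather than waiting blindly, you can have kubectl block until the replacement mon pod is ready. The label selector here assumes Rook's usual app=rook-ceph-mon and mon=a pod labels; check your pod labels with kubectl get pod --show-labels if it does not match:

```shell
# Block until the new rook-ceph-mon-a pod reports Ready (up to 5 minutes)
kubectl wait pod -n rook-ceph -l app=rook-ceph-mon,mon=a \
  --for=condition=Ready --timeout=300s
```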

After a few minutes, the missing files under /var/lib/rook/exporter/ will be recreated and the pods will return to normal.
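To verify the recovery, check both the exporter directory and the pod status. Note that the directory check must run on the host node itself, not through kubectl:

```shell
# On the host: the exporter files should be back
ls -l /var/lib/rook/exporter/

# From any machine with cluster access: pods should be Running and ready
kubectl get pod -n rook-ceph
```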

Rook-Ceph will automatically scale your OSD deployments back to their original replica counts.