Once a Rook-Ceph cluster is running, it keeps several files under /var/lib/rook/exporter/ on the host. If someone deletes those files accidentally, some of your pods in the rook-ceph namespace will go into CrashLoopBackOff.
For example, you will see rook-ceph-mgr-b-xxx crashing, and rook-ceph-osd-0-*** and rook-ceph-mon-a-*** not ready.
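A quick way to see which pods are affected is to list everything in the namespace (which pods show up as failing will vary with your cluster):
kubectl get pods -n rook-ceph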
When you use kubectl to describe those failing pods, you will find this error:
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
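For instance, describing one of the failing pods should surface the error above in its events. The pod name suffix here is only a placeholder; substitute one of your own failing pods:
kubectl describe pod rook-ceph-mgr-b-xxxxxxxxxx -n rook-ceph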
You can recover those pods and the missing files under /var/lib/rook/exporter/ by following the instructions below.
Recovery Steps
First, run kubectl get deployment -n rook-ceph to list all the rook-ceph-osd-* deployments.
For example, you may see:
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
rook-ceph-osd-0   0/1     1            0           3h36m
Then scale it down to zero:
kubectl scale deployment -n rook-ceph rook-ceph-osd-0 --replicas 0
If you have multiple OSD nodes, make sure all of the rook-ceph-osd-* deployments are scaled to 0 first, for example with the loop below.
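A small shell loop is one way to scale every OSD deployment down at once. This sketch assumes the deployments follow Rook's default rook-ceph-osd-<id> naming; adjust the grep pattern if yours differ.
for d in $(kubectl get deployment -n rook-ceph -o name | grep rook-ceph-osd-); do
  kubectl scale "$d" -n rook-ceph --replicas 0
done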
After that, run kubectl get pod -n rook-ceph | grep rook-ceph-mon-a to get the name of the running rook-ceph-mon-a pod.
Use the pod name from the previous command to restart the rook-ceph-mon-a pod:
kubectl delete pod rook-ceph-mon-a-******** -n rook-ceph
After a few minutes, the missing files under /var/lib/rook/exporter/ will be back and the pods will return to normal.
Your OSD deployments will be scaled back to their original replica counts by Rook-Ceph automatically.
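To confirm the recovery, you can list the pods again and check the exporter directory on the host. The last command is optional and assumes the rook-ceph-tools toolbox deployment is installed in your cluster:
kubectl get pods -n rook-ceph
ls /var/lib/rook/exporter/   (run this on the host)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status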