Recovering when Ceph OSDs are too full to operate

If you’re working with Rook Ceph and face issues related to backfill, this guide will walk you through the steps to resolve them.

You can tell this is the problem because Ceph reports itself unhealthy, with messages like `OSD_BACKFILLFULL: 1 backfillfull osd(s)` and/or `OSD_NEARFULL: 2 nearfull osd(s)`.

Note that by default, Ceph warns when OSD devices reach 85% full (nearfull), stops backfill at 90% (backfillfull), and refuses writes entirely at 95% (full), at which point the cluster is effectively read-only.
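To see exactly which OSDs have crossed a threshold, ask for the health detail:

ceph health detail

The output lists each warning together with the affected OSD IDs, along these lines (illustrative output, matching the messages above):

OSD_BACKFILLFULL 1 backfillfull osd(s)
    osd.2 is backfill full
OSD_NEARFULL 2 nearfull osd(s)
    osd.0 is near full
    osd.1 is near full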

1. Access the Rook Ceph Tools:

Exec into the toolbox pod:

kubectl exec -ti rook-ceph-tools-xxxxxxx -n rook-ceph -- bash
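If the toolbox was deployed the usual way, you can skip looking up the exact pod name and exec through its Deployment instead (assuming the default rook-ceph-tools deployment name):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash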

2. Check the Current Full Ratios:
Use the following command to check the current values:

ceph osd dump | grep -i full

You might see output similar to this:

full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

Make a note of these ratios, as you'll need to revert to them later.
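It's also worth checking how close each individual OSD is to those thresholds:

ceph osd df

The %USE column shows the fill level of each OSD, which tells you how much headroom a temporary ratio bump will actually buy.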

3. Temporarily Increase the Full Ratios:
To temporarily allow more headroom, raise the ratios. Avoid going much higher than the values below, as an OSD that fills completely is very difficult to recover:

ceph osd set-full-ratio 0.97
ceph osd set-nearfull-ratio 0.95
ceph osd set-backfillfull-ratio 0.97
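Rerun the dump from step 2 to confirm the new ratios took effect:

ceph osd dump | grep -i full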

4. Confirm the OSDs Are No Longer Full:
Verify that none of the OSDs remain in a full state:

ceph osd status
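With the ratios raised, previously blocked backfill should resume. You can watch overall cluster health and recovery progress with:

ceph -s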

5. Check Pool Usage:

To determine which pool is consuming the most space, you can use:

ceph df

or

rados df

Depending on the results, you can decide if cleanup is necessary.
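For example, if an RBD pool turns out to be the biggest consumer, you can list its images with their provisioned sizes to spot cleanup candidates (replicapool is an assumed pool name here; substitute the pool from your ceph df output):

rbd ls -l replicapool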

6. Cleanup (if needed):
Here’s an example command that purges the RGW log pool, which might free up some space (be aware that rados purge deletes every object in the named pool):

rados purge rook-ceph-store.rgw.log --yes-i-really-really-mean-it
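After the purge, re-check usage to confirm the space was actually reclaimed:

ceph df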

7. Restart csi-rbdplugin or csi-cephfsplugin (if needed):

If you see an error like `GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-8d0ba728-0e17-11eb-a680-ce6eecc894de already exists`, the root cause is usually in the Ceph cluster itself or in network connectivity. If the failure happens while provisioning a PVC, restarting the provisioner pods can help: restart `csi-cephfsplugin-provisioner-xxxxxx` for CephFS issues, or `csi-rbdplugin-provisioner-xxxxxx` for RBD issues. If the failure happens while mounting a PVC, restarting `csi-rbdplugin-xxxxx` (RBD) or `csi-cephfsplugin-xxxxx` (CephFS) sometimes helps, though not always.
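A quick sketch of those restarts, assuming the standard Rook CSI object names (the provisioners run as Deployments, the per-node plugins as DaemonSets):

kubectl -n rook-ceph rollout restart deployment/csi-rbdplugin-provisioner
kubectl -n rook-ceph rollout restart deployment/csi-cephfsplugin-provisioner
kubectl -n rook-ceph rollout restart daemonset/csi-rbdplugin
kubectl -n rook-ceph rollout restart daemonset/csi-cephfsplugin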

8. Reset the Full Ratios to their Original Values:
Once you’ve addressed the space issue, revert the full ratios to their original values:

ceph osd set-full-ratio [previous_value]
ceph osd set-nearfull-ratio [previous_value]
ceph osd set-backfillfull-ratio [previous_value]
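With the example values recorded in step 2, that would be:

ceph osd set-full-ratio 0.95
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.9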

By following these steps, you should be able to resolve backfill and full-OSD issues in Rook Ceph.

See the upstream docs for more information.