Recovering when Ceph OSDs are too full to operate

If you’re working with Rook Ceph and face issues related to backfill, this guide will walk you through the steps to resolve them.

You can tell this is the problem because Ceph reports itself unhealthy, with messages like `OSD_BACKFILLFULL: 1 backfillfull osd(s)` and/or `OSD_NEARFULL: 2 nearfull osd(s)`.

Note that by default, Ceph warns when OSD devices reach 85% full (nearfull), stops backfill at 90% (backfillfull), and refuses writes entirely at 95% (full), at which point the cluster is effectively read-only.
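To see exactly which OSDs have crossed a threshold, ask for the health detail:

ceph health detail

The output lists each warning together with the affected OSD IDs, along these lines (illustrative output, matching the messages above):

OSD_BACKFILLFULL 1 backfillfull osd(s)
    osd.2 is backfill full
OSD_NEARFULL 2 nearfull osd(s)
    osd.0 is near full
    osd.1 is near full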

1. Access the Rook Ceph Tools:

Exec into the toolbox pod:

kubectl exec -ti rook-ceph-tools-xxxxxxx -n rook-ceph -- bash
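If the toolbox was deployed the usual way, you can skip looking up the exact pod name and exec through its Deployment instead (assuming the default rook-ceph-tools deployment name):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash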

2. Check the Current Full Ratios:
Use the following command to check the current values:

ceph osd dump | grep -i full

You might see output similar to this:

full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

Make a note of these ratios, as you'll need to revert to them later.
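It's also worth checking how close each individual OSD is to those thresholds:

ceph osd df

The %USE column shows the fill level of each OSD, which tells you how much headroom a temporary ratio bump will actually buy.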

3. Temporarily Increase the Full Ratios:
To temporarily allow more headroom, raise the ratios. Avoid going much higher than the values below, as an OSD that fills completely is very difficult to recover:

ceph osd set-full-ratio 0.97
ceph osd set-nearfull-ratio 0.95
ceph osd set-backfillfull-ratio 0.97
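Rerun the dump from step 2 to confirm the new ratios took effect:

ceph osd dump | grep -i full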

4. Confirm the OSDs Are No Longer Full:
Verify that none of the OSDs remain in a full state:

ceph osd status
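With the ratios raised, previously blocked backfill should resume. You can watch overall cluster health and recovery progress with:

ceph -s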

5. Check Pool Usage:

To determine which pool is consuming the most space, you can use:

ceph df

or

rados df

Depending on the results, you can decide if cleanup is necessary.
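For example, if an RBD pool turns out to be the biggest consumer, you can list its images with their provisioned sizes to spot cleanup candidates (replicapool is an assumed pool name here; substitute the pool from your ceph df output):

rbd ls -l replicapool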

6. Cleanup (if needed):
Here’s an example command that purges the RGW log pool, which might free up some space (be aware that rados purge deletes every object in the named pool):

rados purge rook-ceph-store.rgw.log --yes-i-really-really-mean-it
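After the purge, re-check usage to confirm the space was actually reclaimed:

ceph df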

7. Restart csi-rbdplugin or csi-cephfsplugin (if needed):

If you see an error like `GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-8d0ba728-0e17-11eb-a680-ce6eecc894de already exists`, the root cause is usually in the Ceph cluster itself or in network connectivity. If the failure happens while provisioning a PVC, restarting the provisioner pods can help: restart `csi-cephfsplugin-provisioner-xxxxxx` for CephFS issues, or `csi-rbdplugin-provisioner-xxxxxx` for RBD issues. If the failure happens while mounting a PVC, restarting `csi-rbdplugin-xxxxx` (RBD) or `csi-cephfsplugin-xxxxx` (CephFS) sometimes helps, though not always.
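A quick sketch of those restarts, assuming the standard Rook CSI object names (the provisioners run as Deployments, the per-node plugins as DaemonSets):

kubectl -n rook-ceph rollout restart deployment/csi-rbdplugin-provisioner
kubectl -n rook-ceph rollout restart deployment/csi-cephfsplugin-provisioner
kubectl -n rook-ceph rollout restart daemonset/csi-rbdplugin
kubectl -n rook-ceph rollout restart daemonset/csi-cephfsplugin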

8. Reset the Full Ratios to their Original Values:
Once you’ve addressed the space issue, revert the full ratios to their original values:

ceph osd set-full-ratio [previous_value]
ceph osd set-nearfull-ratio [previous_value]
ceph osd set-backfillfull-ratio [previous_value]
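With the example values recorded in step 2, that would be:

ceph osd set-full-ratio 0.95
ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.9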

By following these steps, you should be able to resolve backfill and full-OSD issues in Rook Ceph.

See the upstream docs for more information.