Safely Resizing a Rook-Ceph Cluster
Once a Rook-Ceph cluster has grown to more than three OSDs, typically by adding additional manager or worker nodes to the Kubernetes cluster, it does not tolerate being reduced to fewer than two OSDs: the Ceph cluster will lose quorum, and it can become difficult to repair without knowledge of Ceph internals.
The most reliable way to change the membership of a Rook-Ceph storage cluster is to grow the cluster, re-weight the OSDs, and then shrink the cluster. This way, all of the data held in Ceph is replicated to the new nodes safely.
This guide outlines that procedure for Rook-Ceph 1.4+.
Scenario
Imagine a scenario where you have three Kubernetes nodes, each with one block device attached and given to Rook-Ceph. The Rook-Ceph cluster will contain OSDs [osd.0, osd.1, osd.2].
Let’s imagine that two of the nodes in the cluster need to be replaced due to hardware maintenance, but we want to maintain at least three OSDs in the Ceph cluster. Let’s identify node01 and node02 as the nodes that should be removed.
This is the current layout of our Kubernetes and Ceph cluster:
- `node00.foo.com` has OSD `osd.0`
- `node01.foo.com` has OSD `osd.1`
- `node02.foo.com` has OSD `osd.2`
First, we will move to an intermediate configuration:
- `node00.foo.com` has OSD `osd.0`
- `node01.foo.com` has OSD `osd.1`
- `node02.foo.com` has OSD `osd.2`
- `node03.foo.com` has OSD `osd.3`
- `node04.foo.com` has OSD `osd.4`
and we eventually want to get the cluster to our desired configuration:
- `node00.foo.com` has OSD `osd.0`
- `node03.foo.com` has OSD `osd.3`
- `node04.foo.com` has OSD `osd.4`
Execution
Important Information
- It is assumed that the advanced option `isBlockStorageEnabled` is `true` in your kURL spec. This is the default for Rook-Ceph versions 1.4+.
- This procedure has not been tested on versions of Rook-Ceph prior to 1.4.
- You’ll need access to the `ceph` CLI commands. This can be achieved by using `kubectl exec` to enter the `rook-ceph-operator` pod, where the `ceph` CLI is available, but for best results you might want to use the `rook-ceph-tools` pod instead. Here is a link to instructions for creating a `rook-ceph-tools` pod for Rook-Ceph 1.5. Make sure you’re using the same version of the Rook-Ceph toolbox as what’s installed in the cluster (see the sketch after this list for one way to check).
- Commands using the `ceph` CLI must be executed from inside a pod that can communicate with the Ceph cluster. You can either create an interactive shell to run multiple commands in a row, or precede each command with `kubectl exec`:

```sh
# a persistent interactive shell
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- /bin/bash
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# exit

# one-off command invocation
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd tree
```

- Kubernetes cluster node management is not covered by this doc. It is assumed that you already know how to add and remove nodes from a cluster managed by kURL. The documentation for adding and removing nodes is here.
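As a quick way to check that the toolbox matches the installed Rook version, you could compare the container images used by the operator and tools deployments. This is only a sketch: it assumes the default `rook-ceph` namespace and deployment names used elsewhere in this guide, and that the operator and toolbox run the same `rook/ceph` image (as they typically do for the Rook-Ceph versions installed by kURL).

```sh
# Sketch: the operator and toolbox should report the same rook/ceph image tag.
# Assumes the default rook-ceph namespace and deployment names used in this guide.
kubectl -n rook-ceph get deployment rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n rook-ceph get deployment rook-ceph-tools \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```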
Procedure
Note: before each step that makes any change to the Ceph cluster, make sure to check `ceph status` to ensure that the Ceph cluster is healthy. Do not attempt to make changes to an unhealthy Rook cluster or you risk data loss.
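For example, a minimal pre-flight check looks like this (the pod name is the `rook-ceph-tools` pod from this example cluster; substitute your own):

```sh
# Only proceed with the next step if this reports HEALTH_OK.
# Pod name is from this example cluster; substitute your own rook-ceph-tools pod.
kubectl exec -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph health detail
```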
1) We have installed a kURL cluster using the following kURL spec and joined 2 additional nodes:
```yaml
apiVersion: "cluster.kurl.sh/v1beta1"
kind: "Installer"
metadata:
  name: "ada-sentry-rook"
spec:
  kubernetes:
    version: "1.23.x"
  weave:
    version: "2.6.x"
  rook:
    version: "1.5.x"
  containerd:
    version: "1.5.x"
  kotsadm:
    version: "latest"
  ekco:
    version: "latest"
```
And we have created a `rook-ceph-tools` pod using the manifests from the Rook docs. Currently, `ceph status` reports healthy:
```sh
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- /bin/bash
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 33m)
mgr: a(active, since 62m)
mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 35m), 3 in (since 35m)
rgw: 1 daemon active (rook.ceph.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 293 objects, 77 MiB
usage: 3.3 GiB used, 597 GiB / 600 GiB avail
pgs: 177 active+clean
io:
client: 1.2 KiB/s rd, 7.0 KiB/s wr, 2 op/s rd, 0 op/s wr
```
Notice `health: HEALTH_OK` and that there are three OSDs in the `up and in` state: `osd: 3 osds: 3 up (since 35m), 3 in (since 35m)`. In addition, `ceph osd tree` shows that we have 3 nodes, each with 1 OSD:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.58590 root default
-3 0.19530 host node00.foo.com
0 hdd 0.19530 osd.0 up 1.00000 1.00000
-7 0.19530 host node01.foo.com
1 hdd 0.19530 osd.1 up 1.00000 1.00000
-5 0.19530 host node02.foo.com
2 hdd 0.19530 osd.2 up 1.00000 1.00000
```
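For reference, the `rook-ceph-tools` pod used above was created from the Rook toolbox manifest. A minimal sketch of that step is shown below; the manifest URL and version tag are assumptions, so substitute the `toolbox.yaml` that matches your installed Rook-Ceph version:

```sh
# Sketch: create the rook-ceph-tools deployment from the Rook toolbox manifest.
# The URL and version tag are assumptions; use the toolbox.yaml matching your Rook version.
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.5.12/cluster/examples/kubernetes/ceph/toolbox.yaml
kubectl -n rook-ceph rollout status deployment/rook-ceph-tools
kubectl -n rook-ceph get pods -l app=rook-ceph-tools
```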
2) Add 2 additional nodes to the cluster and run `ceph status` to check the progress of the replication to the new OSDs. For convenience, you can use the `watch` command to watch `ceph status` every few seconds. Some output has been omitted in this example to highlight the important information:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# watch ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_WARN
Degraded data redundancy: 34/879 objects degraded (3.868%), 5 pgs degraded, 5 pgs undersized
...
progress:
Rebalancing after osd.4 marked in (2m)
[===========================.] (remaining: 4s)
Rebalancing after osd.3 marked in (2m)
[===========================.] (remaining: 4s)
```
Eventually, all the existing placement groups will be balanced to the new OSDs and `ceph status` will report healthy again:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_OK
...
```
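Before re-weighting anything, it is worth confirming that all five OSDs are up and in. A quick check, reusing the example pod name from above:

```sh
# All 5 OSDs should be reported as up and in before proceeding.
kubectl exec -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd stat
kubectl exec -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd tree
```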
3) Re-weight the OSDs on the two nodes that will be removed from the cluster (`osd.1` and `osd.2`, on node01 and node02) using the `ceph osd reweight <osd> 0` command. Remember that you can use `ceph osd tree` to examine the current layout of the OSDs:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.97649 root default
-3 0.19530 host node00.foo.com
0 hdd 0.19530 osd.0 up 1.00000 1.00000
-7 0.19530 host node01.foo.com
1 hdd 0.19530 osd.1 up 1.00000 1.00000
-5 0.19530 host node02.foo.com
2 hdd 0.19530 osd.2 up 1.00000 1.00000
-9 0.19530 host node03.foo.com
3 hdd 0.19530 osd.3 up 1.00000 1.00000
-11 0.19530 host node04.foo.com
4 hdd 0.19530 osd.4 up 1.00000 1.00000
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd reweight osd.2 0
reweighted osd.2 to 0 (0)
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd reweight osd.1 0
reweighted osd.1 to 0 (0)
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.97649 root default
-3 0.19530 host node00.foo.com
0 hdd 0.19530 osd.0 up 1.00000 1.00000
-7 0.19530 host node01.foo.com
1 hdd 0.19530 osd.1 up 0 1.00000
-5 0.19530 host node02.foo.com
2 hdd 0.19530 osd.2 up 0 1.00000
-9 0.19530 host node03.foo.com
3 hdd 0.19530 osd.3 up 1.00000 1.00000
-11 0.19530 host node04.foo.com
4 hdd 0.19530 osd.4 up 1.00000 1.00000
```
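While the re-weighted OSDs drain, `ceph osd df` gives a per-OSD view; the placement-group and data columns for `osd.1` and `osd.2` should trend toward zero. A quick sketch using the same example pod:

```sh
# Per-OSD usage: the PGS and DATA columns for osd.1 and osd.2 should trend toward 0.
kubectl exec -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd df tree
```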
4) The Ceph cluster will begin rebalancing placement groups off of the two OSDs that we want to remove. Observe `watch ceph status` until the rebalancing is complete:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# watch ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_WARN
Degraded data redundancy: 1280/879 objects degraded (145.620%), 53 pgs degraded
...
progress:
Rebalancing after osd.2 marked out (15s)
[=====================.......] (remaining: 4s)
Rebalancing after osd.1 marked out (5s)
[=============...............] (remaining: 5s)
```
Once `ceph status` reports `HEALTH_OK`, we can proceed to removing the OSDs from the Ceph cluster.
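As an extra guard, you can also ask Ceph directly whether the OSDs can be removed without risking data. This is a sketch; `safe-to-destroy` takes numeric OSD IDs and should be available in the Ceph releases shipped with Rook-Ceph 1.4+:

```sh
# Extra guard: Ceph confirms whether OSDs 1 and 2 (osd.1 and osd.2) are safe to remove.
kubectl exec -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd safe-to-destroy 1 2
```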
5) Mark `osd.1` and `osd.2` “down” in the Ceph cluster:
```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd down osd.1
marked down osd.1.
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd down osd.2
marked down osd.2.
```
Note: If you are using the interactive shell inside the `rook-ceph-tools` pod, now is a good time to exit back to your workstation’s shell, because we will alternate between `ceph` and `kubectl` commands.
6) Scale the OSD deployments to 0 replicas:
```sh
ada@node00.foo.com:~$ kubectl scale deployment -n rook-ceph rook-ceph-osd-1 --replicas 0
deployment.apps/rook-ceph-osd-1 scaled
ada@node00.foo.com:~$ kubectl scale deployment -n rook-ceph rook-ceph-osd-2 --replicas 0
deployment.apps/rook-ceph-osd-2 scaled
```
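You can confirm that only the OSD pods for `osd.0`, `osd.3`, and `osd.4` are still running (a sketch; `app=rook-ceph-osd` is the label Rook applies to its OSD pods):

```sh
# Only the pods for osd.0, osd.3, and osd.4 should remain in Running state.
kubectl -n rook-ceph get pods -l app=rook-ceph-osd
```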
7) Check `ceph status` and `ceph osd tree` one final time before purging the OSDs from Ceph:
```sh
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 64m)
mgr: a(active, since 93m)
mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
osd: 5 osds: 3 up (since 6s), 3 in (since 7m)
rgw: 1 daemon active (rook.ceph.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 294 objects, 80 MiB
usage: 3.3 GiB used, 597 GiB / 600 GiB avail
pgs: 177 active+clean
io:
client: 1023 B/s rd, 1 op/s rd, 0 op/s wr
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.97649 root default
-3 0.19530 host node00.foo.com
0 hdd 0.19530 osd.0 up 1.00000 1.00000
-7 0.19530 host node01.foo.com
1 hdd 0.19530 osd.1 down 0 1.00000
-5 0.19530 host node02.foo.com
2 hdd 0.19530 osd.2 down 0 1.00000
-9 0.19530 host node03.foo.com
3 hdd 0.19530 osd.3 up 1.00000 1.00000
-11 0.19530 host node04.foo.com
4 hdd 0.19530 osd.4 up 1.00000 1.00000
```
If everything checks out, purge the OSDs from the Ceph cluster:
```sh
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd purge osd.1 --yes-i-really-mean-it
purged osd.1
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd purge osd.2 --yes-i-really-mean-it
purged osd.2
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
cluster:
id: 5f0d6e3f-7388-424d-942b-4bab37f94395
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 65m)
mgr: a(active, since 95m)
mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 87s), 3 in (since 8m)
rgw: 1 daemon active (rook.ceph.store.a)
task status:
data:
pools: 11 pools, 177 pgs
objects: 294 objects, 80 MiB
usage: 3.3 GiB used, 597 GiB / 600 GiB avail
pgs: 177 active+clean
io:
client: 6.0 KiB/s rd, 682 B/s wr, 7 op/s rd, 4 op/s wr
```
And now node01 and node02 can be removed from the Kubernetes cluster. The kURL reset script can be used to remove kURL and Kubernetes assets from each node, preparing it to re-join the cluster at a later time, or the VMs can be destroyed and recreated in the future.
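A hedged sketch of the Kubernetes side of that removal is below. The exact, supported procedure for kURL clusters is in the add/remove-node documentation referenced earlier, and the drain flags are assumptions that may need adjusting for your workloads:

```sh
# Sketch: drain each node and remove it from the Kubernetes cluster.
# Flags are assumptions; follow the kURL add/remove-node docs for the supported procedure.
kubectl drain node01.foo.com --ignore-daemonsets --delete-emptydir-data
kubectl delete node node01.foo.com
kubectl drain node02.foo.com --ignore-daemonsets --delete-emptydir-data
kubectl delete node node02.foo.com

# Afterwards, run the kURL reset script on each removed node to clean up kURL/Kubernetes assets.
```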