How to safely resize a kURL cluster containing rook-ceph nodes

Safely Resizing a Rook-Ceph cluster

Once a Rook-Ceph cluster has grown to three or more OSDs, typically by adding additional manager or worker nodes to the Kubernetes cluster, it does not tolerate being reduced to fewer than three OSDs: the Ceph cluster can lose quorum and become difficult to repair without knowledge of Ceph internals.

The most reliable way to change the membership of a Rook-Ceph storage cluster is to grow the cluster, re-weight the OSDs, and then shrink the cluster. This way, all of the data held in Ceph is safely replicated to the new nodes before the old ones are removed.

This guide outlines that procedure for Rook-Ceph 1.4+.

Scenario

Imagine a scenario where you have three Kubernetes nodes, each with one block device given to Rook-Ceph. The Rook-Ceph cluster will contain three OSDs: osd.0, osd.1, and osd.2.

Now imagine that two of the nodes in the cluster need to be replaced due to hardware maintenance, but we want to maintain at least 3 OSDs in the Ceph cluster. Let’s identify node01 and node02 as the nodes that should be removed.

This is the current layout of our Kubernetes and Ceph cluster:

node00.foo.com has OSD osd.0
node01.foo.com has OSD osd.1
node02.foo.com has OSD osd.2

First, we will move to an intermediate configuration:

node00.foo.com has OSD osd.0
node01.foo.com has OSD osd.1
node02.foo.com has OSD osd.2
node03.foo.com has OSD osd.3
node04.foo.com has OSD osd.4

and we eventually want to get the cluster to our desired configuration:

node00.foo.com has OSD osd.0
node03.foo.com has OSD osd.3
node04.foo.com has OSD osd.4

Execution

Important Information

  • It is assumed that the advanced option isBlockStorageEnabled is true in your kURL spec. This is the default for rook-ceph versions 1.4+.

  • This procedure has not been tested on versions of rook-ceph prior to 1.4.

  • You’ll need access to the ceph CLI commands.

    • This can be achieved by using kubectl exec to enter the rook-ceph-operator pod, where the ceph CLI is available, but for best results use the rook-ceph-tools pod instead. The Rook documentation includes instructions for creating a rook-ceph-tools pod for rook-ceph 1.5 (a minimal sketch is also shown after this list). Make sure you’re using the same version of the rook-ceph toolbox as what’s installed in the cluster.
    • Commands using the ceph CLI must be executed from inside a pod that can communicate with the Ceph cluster. You can either create an interactive shell to run multiple commands in a row, or precede each command with kubectl exec:
       # a persistent interactive shell
      ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- /bin/bash
      [root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
      [root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
      [root@rook-ceph-tools-54ff78f9b6-gqsfm /]# exit
      
      # one-off command invocation
      ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
      ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd tree
      
      
  • Kubernetes cluster node management is not covered by this doc. It is assumed that you already know how to add and remove nodes from a cluster managed by kURL.
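
For reference, here is a minimal sketch of creating the toolbox pod. The manifest path assumes the release-1.5 branch layout of the Rook repository, and the pod label assumes the upstream toolbox deployment; verify both against the Rook documentation for the version you are running:

# deploy the rook-ceph-tools pod from the upstream manifest (path assumed for Rook 1.5)
ada@node00.foo.com:~$ kubectl apply -f https://raw.githubusercontent.com/rook/rook/release-1.5/cluster/examples/kubernetes/ceph/toolbox.yaml

# find the generated pod name to use with kubectl exec
ada@node00.foo.com:~$ kubectl get pods -n rook-ceph -l app=rook-ceph-tools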

Procedure

Note: before each step that makes any change to the Ceph cluster, check ceph status to ensure that the Ceph cluster is healthy. Do not attempt to make changes to an unhealthy Rook cluster or you risk data loss.
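
If you want a quick pass/fail check before each step, `ceph health detail` lists any active warnings alongside the overall status. A one-off invocation through the tools pod (reusing the pod name from the examples below) might look like:

# quick health gate before making any change; expect HEALTH_OK with no warnings listed
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph health detail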

1) We have installed a kURL cluster using the following kURL spec and joined 2 additional nodes, for a total of three nodes (a rough sketch of the install and join flow follows the spec):

apiVersion: "cluster.kurl.sh/v1beta1"
kind: "Installer"
metadata: 
  name: "ada-sentry-rook"
spec: 
  kubernetes: 
    version: "1.23.x"
  weave: 
    version: "2.6.x"
  rook: 
    version: "1.5.x"
  containerd: 
    version: "1.5.x"
  kotsadm: 
    version: "latest"
  ekco: 
    version: "latest"


And we have created a `rook-ceph-tools` pod using the manifests from the Rook docs.  Currently, `ceph status` reports healthy:

ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- /bin/bash
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
cluster:
  id:     5f0d6e3f-7388-424d-942b-4bab37f94395
  health: HEALTH_OK

services:
  mon: 3 daemons, quorum a,b,c (age 33m)
  mgr: a(active, since 62m)
  mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
  osd: 3 osds: 3 up (since 35m), 3 in (since 35m)
  rgw: 1 daemon active (rook.ceph.store.a)

task status:

data:
  pools:   11 pools, 177 pgs
  objects: 293 objects, 77 MiB
  usage:   3.3 GiB used, 597 GiB / 600 GiB avail
  pgs:     177 active+clean

io:
  client:   1.2 KiB/s rd, 7.0 KiB/s wr, 2 op/s rd, 0 op/s wr


Notice `health: HEALTH_OK` and that there are three OSDs in the `up` and `in` state: `osd: 3 osds: 3 up (since 35m), 3 in (since 35m)`. Also, `ceph osd tree` shows that we have 3 nodes, each with 1 OSD:

[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                       STATUS  REWEIGHT  PRI-AFF
-1         0.97649  root default
-3         0.19530      host node00.foo.com
  0    hdd  0.19530          osd.0                       up   1.00000  1.00000
-7         0.19530      host node01.foo.com
  1    hdd  0.19530          osd.1                       up   1.00000  1.00000
-5         0.19530      host node02.foo.com
  2    hdd  0.19530          osd.2                       up   1.00000  1.00000

2) Add 2 additional nodes to the cluster and run ceph status to check the progress of the replication to the new OSDs. For convenience, you can use the watch command to watch ceph status every few seconds. Some output has been omitted in this example to highlight the important information:

[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# watch ceph status
  cluster:
    id:     5f0d6e3f-7388-424d-942b-4bab37f94395
    health: HEALTH_WARN
            Degraded data redundancy: 34/879 objects degraded (3.868%), 5 pgs degraded, 5 pgs undersized
...
  progress:
    Rebalancing after osd.4 marked in (2m)
      [===========================.] (remaining: 4s)
    Rebalancing after osd.3 marked in (2m)
      [===========================.] (remaining: 4s)


Eventually, all the existing placement groups will be balanced to the new OSDs and `ceph status` will report healthy again:

[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph status
  cluster:
    id:     5f0d6e3f-7388-424d-942b-4bab37f94395
    health: HEALTH_OK
  ...
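
To confirm that data is actually spreading onto osd.3 and osd.4 (rather than just the health flag clearing), `ceph osd df` shows per-OSD usage and placement-group counts. A quick check from the tools pod shell:

# per-OSD utilization and PG counts; osd.3 and osd.4 should show non-zero values in the PGS column
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd df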

3) Re-weight the OSDs on the two nodes that will be removed from the cluster (osd.1 and osd.2, on node01 and node02) to 0, using the ceph osd reweight <osd> 0 command. Remember, you can use ceph osd tree to examine the current layout of the OSDs:

[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                       STATUS  REWEIGHT  PRI-AFF
-1         0.97649  root default
-3         0.19530      host node00.foo.com
  0    hdd  0.19530          osd.0                       up   1.00000  1.00000
-7         0.19530      host node01.foo.com
  1    hdd  0.19530          osd.1                       up   1.00000  1.00000
-5         0.19530      host node02.foo.com
  2    hdd  0.19530          osd.2                       up   1.00000  1.00000
-9         0.19530      host node03.foo.com
  3    hdd  0.19530          osd.3                       up   1.00000  1.00000
-11         0.19530      host node04.foo.com
  4    hdd  0.19530          osd.4                       up   1.00000  1.00000


[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd reweight osd.2 0
reweighted osd.2 to 0 (0)
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd reweight osd.1 0
reweighted osd.1 to 0 (0)
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                       STATUS  REWEIGHT  PRI-AFF
-1         0.97649  root default
-3         0.19530      host node00.foo.com
  0    hdd  0.19530          osd.0                       up   1.00000  1.00000
-7         0.19530      host node01.foo.com
  1    hdd  0.19530          osd.1                       up         0  1.00000
-5         0.19530      host node02.foo.com
  2    hdd  0.19530          osd.2                       up         0  1.00000
-9         0.19530      host node03.foo.com
  3    hdd  0.19530          osd.3                       up   1.00000  1.00000
-11         0.19530      host node04.foo.com
  4    hdd  0.19530          osd.4                       up   1.00000  1.00000

4) The Ceph cluster will begin rebalancing placement groups off of the two OSDs that we want to remove. Observe watch ceph status until the rebalancing is complete:

```sh
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# watch ceph status
  cluster:
    id:     5f0d6e3f-7388-424d-942b-4bab37f94395
    health: HEALTH_WARN
            Degraded data redundancy: 1280/879 objects degraded (145.620%), 53 pgs degraded
  ...
  progress:
    Rebalancing after osd.2 marked out (15s)
      [=====================.......] (remaining: 4s)
    Rebalancing after osd.1 marked out (5s)
      [=============...............] (remaining: 5s)
```

Once `ceph status` reports `HEALTH_OK`, we can proceed to removing the OSDs from the Ceph cluster.
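
Optionally, before touching the OSDs, you can ask Ceph directly whether they can be removed without risking data. `ceph osd df` should show 0 placement groups on the re-weighted OSDs, and `ceph osd safe-to-destroy` (available in recent Ceph releases, including the version shipped with Rook 1.5) reports whether the OSDs still hold data that is not replicated elsewhere:

# the PGS column for osd.1 and osd.2 should read 0 after rebalancing completes
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd df

# Ceph reports whether these OSDs can be removed without reducing data durability
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd safe-to-destroy 1 2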

5) Mark osd.1 and osd.2 “down” in the Ceph cluster:

[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd down osd.1
marked down osd.1.
[root@rook-ceph-tools-54ff78f9b6-gqsfm /]# ceph osd down osd.2
marked down osd.2.

Note: If you are using the interactive shell inside the rook-ceph-tools pod, now is a good time to exit back to your workstation’s shell, because we will alternate between ceph and kubectl commands.

6) Scale the OSD deployments to 0 replicas:

ada@node00.foo.com:~$ kubectl scale deployment -n rook-ceph rook-ceph-osd-1 --replicas 0
deployment.apps/rook-ceph-osd-1 scaled
ada@node00.foo.com:~$ kubectl scale deployment -n rook-ceph rook-ceph-osd-2 --replicas 0
deployment.apps/rook-ceph-osd-2 scaled
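
You can confirm that the two OSD pods have actually terminated before moving on; the scaled deployments should report 0/0 ready, and no pods for osd.1 or osd.2 should remain (the label below is the standard one Rook applies to its OSD pods):

# both scaled deployments should show READY 0/0, and only osd.0, osd.3, and osd.4 pods should be running
ada@node00.foo.com:~$ kubectl get deployment -n rook-ceph rook-ceph-osd-1 rook-ceph-osd-2
ada@node00.foo.com:~$ kubectl get pods -n rook-ceph -l app=rook-ceph-osd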

7) Check ceph status and ceph osd tree one final time before purging the OSDs from Ceph:

ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
  cluster:
    id:     5f0d6e3f-7388-424d-942b-4bab37f94395
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 64m)
    mgr: a(active, since 93m)
    mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
    osd: 5 osds: 3 up (since 6s), 3 in (since 7m)
    rgw: 1 daemon active (rook.ceph.store.a)

  task status:

  data:
    pools:   11 pools, 177 pgs
    objects: 294 objects, 80 MiB
    usage:   3.3 GiB used, 597 GiB / 600 GiB avail
    pgs:     177 active+clean

  io:
    client:   1023 B/s rd, 1 op/s rd, 0 op/s wr

ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                       STATUS  REWEIGHT  PRI-AFF
-1         0.97649  root default
-3         0.19530      host node00.foo.com
  0    hdd  0.19530          osd.0                       up   1.00000  1.00000
-7         0.19530      host node01.foo.com
  1    hdd  0.19530          osd.1                     down         0  1.00000
-5         0.19530      host node02.foo.com
  2    hdd  0.19530          osd.2                     down         0  1.00000
-9         0.19530      host node03.foo.com
  3    hdd  0.19530          osd.3                       up   1.00000  1.00000
-11         0.19530      host node04.foo.com
  4    hdd  0.19530          osd.4                       up   1.00000  1.00000

If everything checks out, purge the OSDs from the Ceph cluster:

ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd purge osd.1 --yes-i-really-mean-it
purged osd.1
ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph osd purge osd.2 --yes-i-really-mean-it
purged osd.2

ada@node00.foo.com:~$ kubectl exec -it -n rook-ceph rook-ceph-tools-54ff78f9b6-gqsfm -- ceph status
cluster:
  id:     5f0d6e3f-7388-424d-942b-4bab37f94395
  health: HEALTH_OK

services:
  mon: 3 daemons, quorum a,b,c (age 65m)
  mgr: a(active, since 95m)
  mds: rook-shared-fs:1 {0=rook-shared-fs-a=up:active} 1 up:standby-replay
  osd: 3 osds: 3 up (since 87s), 3 in (since 8m)
  rgw: 1 daemon active (rook.ceph.store.a)

task status:

data:
  pools:   11 pools, 177 pgs
  objects: 294 objects, 80 MiB
  usage:   3.3 GiB used, 597 GiB / 600 GiB avail
  pgs:     177 active+clean

io:
  client:   6.0 KiB/s rd, 682 B/s wr, 7 op/s rd, 4 op/s wr
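
With the OSDs purged, the scaled-down rook-ceph-osd-1 and rook-ceph-osd-2 deployments no longer manage anything. Optionally, you can delete them so they do not linger in the rook-ceph namespace:

# optional cleanup: remove the now-empty OSD deployments
ada@node00.foo.com:~$ kubectl delete deployment -n rook-ceph rook-ceph-osd-1 rook-ceph-osd-2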

Now node01 and node02 can be removed from the Kubernetes cluster. The kURL reset script can then be used to remove kURL and Kubernetes assets from each node so it can re-join the cluster at a later time, or the VMs can simply be destroyed and recreated in the future.
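
Node removal itself is outside the scope of this doc, but as a rough sketch (exact flags and the reset procedure vary by Kubernetes and kURL version):

# drain each node being removed, then delete it from the cluster; repeat for node02
ada@node00.foo.com:~$ kubectl drain node01.foo.com --ignore-daemonsets --delete-emptydir-data
ada@node00.foo.com:~$ kubectl delete node node01.foo.com

# then, on the removed node itself, run the kURL reset task (see the kURL docs for your version)
# before re-joining it later or destroying the VM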
