Embedded installer system requirements depending on selected add-ons

Philipp_Lukan · September 7, 2022, 2:28pm

We are currently supporting a customer installing an embedded cluster setup.
The kURL installer declaration we are using includes some outdated add-ons which we want to upgrade.

Specifically, we are investigating replacing Rook with Longhorn. However, Longhorn utilizes disk space differently and requests more CPU upfront.

Since hardware resource usage appears to differ depending on which kURL add-ons are selected, such as Longhorn or Rook, is there an official document or guideline that goes deeper into the required system specs? That would help us provide some clarification to our customer.

dex · September 7, 2022, 2:39pm

I don’t have a super hard-line answer here, but I don’t believe we have a way to calculate this today in advance. There’s the brute force “add up all the CPU and Memory requests across all pods” or even just “spin up an instance with some headroom and see what gets used”
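For anyone wanting to try the brute-force approach, here is a minimal sketch of adding up the CPU and memory requests across all pods. It assumes you have saved the output of `kubectl get pods --all-namespaces -o json`; the function names are made up for illustration and are not part of kURL or any library. It only handles the common quantity suffixes, ignores init containers, and gives you the sum of requests rather than actual usage, so treat the result as a rough floor and add headroom.

```python
import re
from typing import Tuple

# Suffix multipliers for Kubernetes memory quantities (binary and decimal).
_MEM_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    match = re.fullmatch(r"([0-9.]+)([A-Za-z]*)", quantity)
    if match is None:
        raise ValueError(f"unrecognized memory quantity: {quantity!r}")
    number, unit = match.groups()
    multiplier = 1 if unit == "" else _MEM_UNITS[unit]
    return int(float(number) * multiplier)


def sum_requests(pod_list: dict) -> Tuple[float, int]:
    """Return (total CPU cores, total memory bytes) requested by all pods.

    pod_list is the parsed JSON of a Kubernetes PodList. Pods that have
    already finished (Succeeded or Failed) are skipped, since they no
    longer hold their requests. Init containers are ignored here.
    """
    total_cpu = 0.0
    total_mem = 0
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue
        for container in pod.get("spec", {}).get("containers", []):
            requests = container.get("resources", {}).get("requests", {})
            total_cpu += parse_cpu(requests.get("cpu", "0"))
            total_mem += parse_memory(requests.get("memory", "0"))
    return total_cpu, total_mem
```

To use it, load the saved file with `json.load` and pass the result to `sum_requests`. Running it once against a cluster built with Rook and once against one built with Longhorn would give a like-for-like comparison of what each add-on set requests upfront.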

As an aside though, I’m interested to learn why you’re looking to switch from Rook to Longhorn – my understanding is that more of our development effort is going into Rook these days and it has more recent supported versions available.

Philipp_Lukan · September 7, 2022, 3:36pm

Thanks for the input, we weren’t aware that Rook is receiving more focus.

Initially we were planning to upgrade Rook from 1.0.4, but found that newer versions (1.7.x) require an additional disk mount to be added specifically for Rook, otherwise the installer fails. We started looking into using Longhorn since it offers similar functionality to Rook but without this specific requirement, as well as the Longhorn Dashboard working out of the box.

Are there any significant advantages or recommendations we should be aware of to stick with Rook? Are there any long-term plans to discontinue Longhorn support?
Thank you for any feedback!

Chris_Sanders · September 19, 2022, 2:09pm

Philipp,

While Longhorn does still allow you to use folders as opposed to block devices, the project recommends block devices. The reality is that distributed storage is best run on block devices, for both data stability and performance. Rook (Ceph) removed the folder option specifically because it generated more failures and support issues than the project deemed responsible.

We do not currently expect to maintain long-term support for Longhorn. Of course that could change in the future, and in the interim we will support environments that have used it. For a bit of insight, our experience after deploying and using some Longhorn environments is that it has a number of failure conditions that cause support issues and have no identified root cause. Here are two examples of cases we’ve been working on with upstream, for which we have not received a particularly impressive response.

github.com/longhorn/longhorn

[BUG] Filesystem corruption in Longhorn

opened 08:27PM - 02 Jun 22 UTC

diamonwiggins

kind/bug area/engine priority/0 investigation-needed area/stability area/data-integrity

## Describe the bug

We're working with a user who suffered filesystem corruption and data loss as a result when using Longhorn persistent volumes in their K8s cluster. At times the filesystem is mounted read only and requires a `fsck` for it be remounted. At other times there is data loss and a restore of the app's data needs to be done. Whenever the issue occurs we see the following in the kernel log on the host:

```
May 23 08:22:24 rbcdevesh593 kernel: sd 4:0:0:1: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=8s
May 23 08:22:24 rbcdevesh593 kernel: sd 4:0:0:1: [sde] Sense Key : Medium Error [current]
May 23 08:22:24 rbcdevesh593 kernel: sd 4:0:0:1: [sde] Add. Sense: Unrecovered read error
May 23 08:22:24 rbcdevesh593 kernel: sd 4:0:0:1: [sde] CDB: Write(10) 2a 00 00 c0 02 50 00 00 08 00
May 23 08:22:24 rbcdevesh593 kernel: blk_update_request: critical medium error, dev sde, sector 12583504
May 23 08:22:24 rbcdevesh593 kernel: Buffer I/O error on dev sde, logical block 1572938, lost async page write
```

In the Engine Manager we see a number of:

```
2022-05-23T17:36:20.394183717Z time="2022-05-23T17:36:20Z" level=error msg="Setting replica tcp://10.32.10.15:10015 to ERR due to: r/w timeout"
2022-05-23T17:36:20.394188199Z time="2022-05-23T17:36:20Z" level=info msg="Ignore set replica tcp://10.32.10.15:10015 to mode ERR due to it's ERR"
2022-05-23T17:36:20.394192727Z time="2022-05-23T17:36:20Z" level=error msg="Ignoring error because tcp://10.32.12.7:10000 is mode RW: tcp://10.32.10.15:10015: r/w timeout"
```

It's worth pointing out that in at least one instance we're able to correlate a spike in disk latency along with the volume becoming inoperable. Right now we think that the customer is suffering from these issues as a result of low disk performance. A benchmark we use for disk performance put p99 latency at around 13ms over a 2 minute period. No meaningful spikes in CPU/Mem have been observed.

The customer is running Hadoop/HDFS in this cluster. We noticed in both outages that the node with the issue is where the `hdfs` pod was running, so that could be relevant here as well.

If disk performance is agreed upon to be at the root cause, we'd love to know if there are any more detailed requirements beyond what is listed here - https://longhorn.io/docs/1.2.4/best-practices/#minimum-recommended-hardware. It would also be helpful to know if there's a better way to remediate and/or mitigate this. For Longhorn volumes with multiple replicas, should a single replica having a corruption issue impact the workload from resuming elsewhere? If no, how can we better configure our deployments to ensure this is the case?

We'd love to get a second opinion as well as any help you can provide to determine RCA. A Longhorn support bundle is attached.

## To Reproduce

Have not been able to reproduce yet

## Expected behavior

For volumes configured with multiple replicas, we'd expect that the workload would automatically detach and restart on a node with a healthy replica

## Log or Support bundle

[longhorn-support-bundle_c9abc40c-f534-41d1-b4c8-9b946beea73c_2022-06-01T12-35-09Z.zip](https://github.com/longhorn/longhorn/files/8827594/longhorn-support-bundle_c9abc40c-f534-41d1-b4c8-9b946beea73c_2022-06-01T12-35-09Z.zip)

## Environment

- Longhorn version: 1.2.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl(kURL addon - https://kurl.sh/docs/add-ons/longhorn)
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: https://kurl.sh/docs/install-with-kurl/
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
  - OS type and version: Red Hat Enterprise Linux Server 7.9
  - CPU per node: 8
  - Memory per node: 32GB
  - Disk type(e.g. SSD/NVMe): User said they are in a class of "Performance", but we're still getting this information
  - Network bandwidth between the nodes: TBD
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
- Number of Longhorn volumes in the cluster: 10

github.com/longhorn/longhorn

[BUG] GET error for volume attachment on node reboot

opened 07:09PM - 29 Jun 22 UTC

diamonwiggins

kind/bug area/kubernetes investigation-needed area/upstream-issue area/csi area/stability

## Describe the bug

After a reboot of a node in a 4 node cluster a user is seeing the following:

```
Warning FailedMount 48s (x3 over 4m52s) kubelet MountVolume.WaitForAttach failed for volume "pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1" : volume pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1 has GET error for volume attachment csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172: volumeattachments.storage.k8s.io "csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172" not found
```

To recover, the user had to create the volumeattachment object manually for the Pod to mount its storage again

## To Reproduce

I have not been able to reproduce this yet unfortunately

## Expected behavior

A pod can successfully mount its storage despite a node reboot in the cluster

## Log or Support bundle

[longhorn-support-bundle_a8118729-480f-4d38-9b91-26a755d2e0cc_2022-06-28T20-34-47Z.zip](https://github.com/longhorn/longhorn/files/9013424/longhorn-support-bundle_a8118729-480f-4d38-9b91-26a755d2e0cc_2022-06-28T20-34-47Z.zip)

## Environment

- Longhorn version: 1.1.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl(kURL addon - https://kurl.sh/docs/add-ons/longhorn)
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: https://kurl.sh/docs/install-with-kurl/
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 1
- Node config
  - OS type and version: Red Hat Enterprise Linux Server 7.9 (Maipo)
  - CPU per node: 8
  - Memory per node: 64GB
  - Disk type(e.g. SSD/NVMe): (Unsure, but can gather this info if needed)
  - Network bandwidth between the nodes: (Unsure, but can gather this info if needed)
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): (Unsure, but can gather this info if needed)
- Number of Longhorn volumes in the cluster: 5

Currently rebooting nodes deployed with Longhorn in the spec shows a high rate of failure after the reboot. Additionally we’ve seen volume corruption after excessive writes or reboots. We aren’t currently able to reliably recommend a resolution to either.

Our new default recommendation is OpenEBS local-pv for local storage and Rook for distributed storage, with Rook requiring dedicated block devices as you have stated. This is still a work in progress: the new default spec has already made this change, but we haven’t yet recommended migrations for existing users.

I hope that helps clarify our experience with Longhorn and why we’ve decided to return to Rook, although with increased requirements to ensure that it is stable when it is used.
