Pre-upgrade steps

Hi,

Problem
Several times we have found ourselves asking customers to execute manual steps (Bash/shell) on their node(s) before they take the latest release (upgrade).

Question
Is there any way in KOTS to define a piece of code that executes when a customer hits ‘Deploy’ on the latest version, and that would run simple Bash on specified nodes (or at least on all nodes) right before the upgrade process starts?

A real-life example (we all love good examples):
When rook-ceph released v1.10.0, they disabled the metrics service, but there was no functionality inside rook-ceph’s operator to remove the existing Service. This left us (in our test environments) and our customers with an extra, unused Service hanging around in their environments.
As a result, we had to instruct our customers to execute the following command on the primary node:
kubectl delete service/csi-rbdplugin-metrics -n <AWESOME_NAMESPACE>

Ideal solution (from my perspective)
The ability to define pre- and post-upgrade Bash commands to run on specified nodes (e.g. selected by a label selector). This would cover 99% of our cases, as all we need is kubectl and some fancy Bash, which we can come up with ourselves.

Thanks in advance,


Hi @Dom, just a quick thought: how about using a Kubernetes Job for this?
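
A very rough sketch of what I mean, assuming a ServiceAccount exists that is allowed to delete Services (the namespace, names, and image below are placeholders, not a known-good KOTS pattern):

```yaml
# Sketch only: a Job shipped with the release that deletes the leftover metrics Service
# from inside the cluster. Namespace and ServiceAccount name are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: delete-csi-rbdplugin-metrics
  namespace: awesome-namespace            # stand-in for <AWESOME_NAMESPACE>
spec:
  backoffLimit: 1
  template:
    spec:
      serviceAccountName: pre-upgrade-cleanup   # needs RBAC that allows deleting Services
      restartPolicy: Never
      containers:
        - name: cleanup
          image: bitnami/kubectl:1.27            # any image that ships kubectl
          command:
            - kubectl
            - delete
            - service/csi-rbdplugin-metrics
            - --namespace=awesome-namespace
            - --ignore-not-found=true            # keeps the Job idempotent across upgrades
```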

You would have to deploy a Job as part of the upgrade process. If you need something to run before the upgrade, then a Job won’t be sufficient.

Hey,
Any updates on this?

Hi @Dom - did you have a chance to look at Josh’s suggestion? If this is a true feature request, you should submit it to our product team via Replicated

Sorry for taking a while to reply.

As @Rob_Clenshaw said, the Job wouldn’t be sufficient, because there is no way to specify an order of deployment in cases where the Job needs to run prior to the actual upgrade.

Update - reading the ticket again, the control loop idea below probably won’t work, but I’m leaving it up for posterity.

Is there a reason this job needs to run before everything else? What is the concern with deleting the Metrics service either before, during, or even after the upgrade rolls out? Why does this need to be sequenced?

Running a Bash command on a node feels like the wrong solution here, when really all you need is a service account that can talk to K8s - it can run in-cluster, would be less brittle, and is probably more secure. Is that right? Am I missing something?
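
For example, here is a hedged sketch of the sort of narrowly-scoped service account I mean - the names and namespace are placeholders, and something like the cleanup Job suggested earlier in the thread could run as it:

```yaml
# Sketch only: a ServiceAccount that is allowed to get and delete Services in one namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pre-upgrade-cleanup
  namespace: awesome-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pre-upgrade-cleanup
  namespace: awesome-namespace
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pre-upgrade-cleanup
  namespace: awesome-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pre-upgrade-cleanup
subjects:
  - kind: ServiceAccount
    name: pre-upgrade-cleanup
    namespace: awesome-namespace
```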



Control Loops

I’ve always been an advocate of control loops here - that is, each service knows when it’s safe to shut down, or when it’s safe to start up, either via a shell wrapper or something like an initContainer in every workload. For example (and I know this is quite a buildout), I’ve seen vendors implement this as:

  • Every DB service reports the version of the schema that it’s on (a git SHA or semver)
  • Everything that talks to the DB knows which schema version it’s meant to run against (or which version ranges) - workloads or migration Jobs could fall into this category
  • Those downstream things sit in a watch loop before starting, waiting for the DB to be at the right version (you can build a single initContainer to use in many workloads, with a few params like expected version and Service name - see the sketch below)
  • Components further downstream from the direct-DB consumers do the same: a user-facing API might poll a backend API service to find out what version it’s running, and sit in a watch loop until that backend API is ready and running the expected version (or version range)

This makes your app more resilient, almost like duck-typing, instead of relying on a single point of failure that sequences the rollouts.
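
To make the initContainer idea from the list above concrete, here is a hedged sketch. The /version endpoint, Service name, image, and version value are all assumptions for illustration, not anything your app necessarily exposes today:

```yaml
# Sketch only: a reusable initContainer that blocks startup until an upstream
# service reports the expected schema/app version.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  replicas: 1
  selector:
    matchLabels: { app: backend-api }
  template:
    metadata:
      labels: { app: backend-api }
    spec:
      initContainers:
        - name: wait-for-db-schema
          image: curlimages/curl:8.5.0
          env:
            - name: UPSTREAM_URL
              value: "http://db-version-svc/version"   # hypothetical version endpoint
            - name: EXPECTED_VERSION
              value: "1.10.0"                          # schema version this release expects
          command:
            - sh
            - -c
            - |
              # Poll until the upstream reports the version this workload was built against.
              until [ "$(curl -fsS "$UPSTREAM_URL" || true)" = "$EXPECTED_VERSION" ]; do
                echo "waiting for $UPSTREAM_URL to report $EXPECTED_VERSION"
                sleep 5
              done
      containers:
        - name: backend-api
          image: example.com/backend-api:1.10.0        # placeholder image
```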

The rook-ceph example above probably wasn’t the best one I could have given to you.

Here is another one, which hopefully will be more useful.
We moved Ceph to a different namespace in one of our releases. This is obviously rare and shouldn’t happen often, but we had to ask our customers to execute a Bash script we produced (on the node), which shuts Ceph down gracefully (i.e. removes any CRDs that are currently using it and completely resets the disks by zeroing them out). Graceful Ceph shutdown is a known “issue” and involves multiple steps, which have to be done in an exact order.

It would have been nice if we could have just told customers “When you take this release, Ceph is going to be restarted and all your data will be lost”, instead of “Copy and paste this code onto the node, execute the Bash script, etc.”.

Obviously the rook-ceph teardown had to happen before the new YAML was applied; otherwise the new rook-ceph wouldn’t be able to mount the disks, which were actively being zeroed out.

Again, this isn’t something that happens with every release, but we’ve run into it at least a couple of times.

It’s not a blocker for us, but if you have any suggestions for this, that would be great. Otherwise we can continue with the manual steps.

How about a Required Release plus a “Privileged nsenter” container or an SSH key mount?
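
Roughly, for the nsenter option - a heavily hedged sketch, where the node selector label, image, and host command are all placeholders, and whether a privileged container is acceptable is something to weigh with customers:

```yaml
# Sketch only: a privileged Job that enters the host's namespaces via nsenter
# and runs a host-level command. Node selector, image, and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: pre-upgrade-host-prep
spec:
  template:
    spec:
      hostPID: true                      # share the host PID namespace so nsenter can target PID 1
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""   # assumption: run on the primary node
      restartPolicy: Never
      containers:
        - name: host-prep
          image: busybox:1.36            # busybox ships an nsenter applet
          securityContext:
            privileged: true
          command:
            - nsenter
            - -t
            - "1"                        # target PID 1, i.e. the host's namespaces
            - -m
            - -u
            - -i
            - -n
            - -p
            - sh
            - -c
            - "echo 'run the host-level prep script here'"
```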

Or, slightly more security-conscious, maybe a strict preflight check that knows whether Ceph is running, prints instructions for the Ceph teardown, and prevents installing the new version until Ceph is torn down?
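
A rough sketch of that preflight idea, using a deploymentStatus analyzer against the rook-ceph operator - please double-check the analyzer fields and the `when` syntax against the Troubleshoot docs, and note that strict/blocking behaviour depends on how the preflight is configured in the release:

```yaml
# Sketch only: a preflight that fails while the rook-ceph operator still has ready replicas,
# pointing the customer at the teardown instructions before they deploy the new version.
apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: ceph-teardown-required
spec:
  analyzers:
    - deploymentStatus:
        name: rook-ceph-operator
        namespace: rook-ceph
        outcomes:
          - fail:
              when: "> 0"    # operator still has ready replicas, so block the upgrade
              message: Ceph is still running. Please follow the Ceph teardown instructions before deploying this release.
          - pass:
              message: Ceph has been torn down. It is safe to deploy this release.
```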