Embedded Cluster Upgrade Process and Troubleshooting Guide

This guide explains how Embedded Cluster upgrades work and provides step-by-step troubleshooting when things go wrong.

How Upgrades Work

Understanding the upgrade process helps you troubleshoot issues more effectively. Here’s what happens behind the scenes:

Step 1: Initiating the Upgrade

When you start an upgrade through the Admin Console or CLI, the system creates a new Installation CustomResource with your target version. The Embedded Cluster operator detects this change and launches a dedicated upgrade job to handle the process.
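
If you want to watch this step happen, you can observe Installation resources as they are created using the standard kubectl --watch flag:

kubectl get installations --watch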

Step 2: Preparing All Nodes

Before making any changes, the operator runs preparation jobs on every node in your cluster. These jobs ensure each node has everything needed for the upgrade:

  • The new application binary
  • All required Helm charts
  • All container images (pulled from upstream or copied from your local mirror in air-gapped environments)

These preparation jobs clean themselves up automatically once completed.
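
While this step is running, you can list the preparation jobs directly. The exact job names and namespace can vary by version, so the command below is only a sketch: it lists jobs across all namespaces and filters on a guessed name pattern.

kubectl get jobs -A | grep -i prepare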

Step 3: Running the Upgrade

A single Kubernetes job pod performs the actual upgrade work. This job is designed to be fault-tolerant: it restarts and retries if it encounters failures.
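
To follow the upgrade job while it runs, you can tail its pod logs. The job name here follows the embedded-cluster-upgrade-<timestamp> pattern described later in this guide; substitute the actual name from your cluster:

kubectl logs -f job/embedded-cluster-upgrade-<timestamp> -n kotsadm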

Step 4: Core System Updates

The upgrade job handles several critical updates in sequence:

  • Kubernetes Cluster Upgrade: Uses k0s Autopilot to safely roll out the new Kubernetes version, updating control plane nodes first, then worker nodes.

  • Cluster Configuration: Updates internal settings like image references and network configurations to match the new version.

  • Add-ons and Extensions: Runs helm upgrade for built-in components (Calico, Velero, OpenEBS) and any custom extensions you’ve added.
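
Because the add-ons and extensions are managed as Helm releases, a quick way to see what was (or was not) upgraded is to list all releases in the cluster, assuming you have the helm CLI available:

helm list -A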

Step 5: Completion and Error Handling

The upgrade job uses exponential backoff retry logic. If successful, your Installation CustomResource gets marked as Installed. If all retries fail, it’s marked as Failed so you can investigate the issue.
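
Retry behavior for a Kubernetes Job is ultimately bounded by its backoffLimit. Assuming the upgrade job relies on this standard Job mechanism, you can inspect the configured limit and the current failure count (job name follows the pattern used later in this guide):

kubectl get job embedded-cluster-upgrade-<timestamp> -n kotsadm -o jsonpath='{.spec.backoffLimit}{"\n"}{.status.failed}'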

Troubleshooting Failed Upgrades

When an upgrade doesn’t complete successfully, work through these steps in order to identify the root cause.

1. Check the Embedded Cluster Operator

Start by verifying the operator itself is healthy:

kubectl get deploy -n embedded-cluster

You should see the operator deployment is ready and available. If not, check the operator logs:

kubectl logs deploy/embedded-cluster-operator -n embedded-cluster

Look for reconciliation errors or other issues that might prevent the operator from processing upgrade requests.
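
On a busy operator, it can help to filter the logs down to likely errors first; this is plain grep, so adjust the pattern to suit:

kubectl logs deploy/embedded-cluster-operator -n embedded-cluster | grep -i error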

2. Examine the Installation CustomResource

Check if your desired version was set correctly and view the current installation state:

kubectl get installations

This shows all installation attempts, their states, and versions. To get detailed information about a specific installation:

kubectl describe installation <installation-name>

Pay attention to the Status field and any Events listed at the bottom. If the state shows Failed, proceed to check the upgrade job.

Key states to understand:

  • Installing: Upgrade is in progress
  • Installed: Upgrade completed successfully
  • Failed: Upgrade failed and requires investigation
  • Obsolete: Previous installation that has been superseded
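
If you want to check the state from a script rather than reading describe output, you can pull just the state field with jsonpath. The field path shown here (.status.state) is an assumption about where the state is surfaced; verify it against the describe output above:

kubectl get installation <installation-name> -o jsonpath='{.status.state}'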

3. Investigate the Upgrade Job

Find and examine the upgrade job that corresponds to your failed installation:

kubectl get jobs -n kotsadm

Look for jobs with names prefixed with embedded-cluster-upgrade-<timestamp>. Check the job’s logs:

kubectl logs job/embedded-cluster-upgrade-<timestamp> -n kotsadm

The logs will show you exactly where the upgrade process failed. Job pods from all upgrade attempts are preserved, making them an excellent source for understanding failures.
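
Because pods from earlier attempts are preserved, you can list every pod the job created via the standard job-name label that Kubernetes applies, then pull logs from a specific attempt:

kubectl get pods -n kotsadm -l job-name=embedded-cluster-upgrade-<timestamp>
kubectl logs <pod-name> -n kotsadm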

4. Check the k0s Autopilot Plan

If the upgrade job logs mention timeouts or issues with the Kubernetes cluster upgrade, examine the k0s Autopilot plan:

kubectl get plan autopilot -o yaml

Focus on the status field, which shows the progress of the Kubernetes cluster upgrade. Look for any nodes that failed to upgrade or are stuck in a particular state.
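
For a quick scripted check of overall progress, you can extract just the plan state with jsonpath. The field path (.status.state) is an assumption about where Autopilot reports plan progress; confirm it against the full YAML output above:

kubectl get plan autopilot -o jsonpath='{.status.state}'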

Common Issues and Solutions

Helm Upgrade Failures: Check for conflicting custom configurations or resources that might prevent add-on upgrades.
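
For a failing Helm release, the release history usually shows which revision failed and why. These are standard helm commands; substitute your release name and namespace:

helm history <release-name> -n <namespace>
helm status <release-name> -n <namespace>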

Upstream Connectivity Issues: For online (non-air-gapped) installations, ensure that nodes can reach the Replicated upstream throughout the upgrade process.
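
As a rough connectivity check from a node, you can request headers from Replicated endpoints; the exact endpoints your installation uses may vary, so treat these as examples. Any HTTP response (even an auth error) indicates the endpoint is reachable:

curl -sI https://replicated.app
curl -sI https://proxy.replicated.com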

Getting Additional Help

If you’ve worked through all troubleshooting steps and still can’t resolve the issue, gather the following information before contacting Replicated support:

  • The version you upgraded from and the version you targeted
  • Operator logs from the embedded-cluster-operator deployment (step 1)
  • The output of kubectl describe installation for the failed installation (step 2)
  • Logs from the upgrade job and its preserved pods (step 3)
  • The k0s Autopilot plan YAML, if the Kubernetes upgrade was involved (step 4)

Having this information ready will help Replicated engineers diagnose your issue more quickly.