Embedded Cluster Upgrade Process and Troubleshooting Guide

This guide explains how Embedded Cluster upgrades work and provides step-by-step troubleshooting when things go wrong.

How Upgrades Work

Understanding the upgrade process helps you troubleshoot issues more effectively. Here’s what happens behind the scenes:

Step 1: Initiating the Upgrade

When you start an upgrade through the Admin Console or CLI, the system creates a new Installation CustomResource with your target version. The Embedded Cluster operator detects this change and launches a dedicated upgrade job to handle the process.
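
If you want to watch this step happen, you can observe Installation resources as they are created using the standard kubectl --watch flag:

kubectl get installations --watch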

Step 2: Preparing All Nodes

Before making any changes, the operator runs preparation jobs on every node in your cluster. These jobs ensure each node has everything needed for the upgrade:

  • The new application binary
  • All required Helm charts
  • All container images (pulled from upstream or copied from your local mirror in air-gapped environments)

These preparation jobs clean themselves up automatically once completed.
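
While this step is running, you can list the preparation jobs directly. The exact job names and namespace can vary by version, so the command below is only a sketch: it lists jobs across all namespaces and filters on a guessed name pattern.

kubectl get jobs -A | grep -i prepare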

Step 3: Running the Upgrade

A single Kubernetes job pod performs the actual upgrade work. This job is designed to be fault-tolerant: it restarts and retries if it encounters failures.
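
To follow the upgrade job while it runs, you can tail its pod logs. The job name here follows the embedded-cluster-upgrade-<timestamp> pattern described later in this guide; substitute the actual name from your cluster:

kubectl logs -f job/embedded-cluster-upgrade-<timestamp> -n kotsadm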

Step 4: Core System Updates

The upgrade job handles several critical updates in sequence:

  • Kubernetes Cluster Upgrade: Uses k0s Autopilot to safely roll out the new Kubernetes version, updating control plane nodes first, then worker nodes.

  • Cluster Configuration: Updates internal settings like image references and network configurations to match the new version.

  • Add-ons and Extensions: Runs helm upgrade for built-in components (Calico, Velero, OpenEBS) and any custom extensions you’ve added.
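
Because the add-ons and extensions are managed as Helm releases, a quick way to see what was (or was not) upgraded is to list all releases in the cluster, assuming you have the helm CLI available:

helm list -A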

Step 5: Completion and Error Handling

The upgrade job uses exponential backoff retry logic. If successful, your Installation CustomResource gets marked as Installed. If all retries fail, it’s marked as Failed so you can investigate the issue.
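
Retry behavior for a Kubernetes Job is ultimately bounded by its backoffLimit. Assuming the upgrade job relies on this standard Job mechanism, you can inspect the configured limit and the current failure count (job name follows the pattern used later in this guide):

kubectl get job embedded-cluster-upgrade-<timestamp> -n kotsadm -o jsonpath='{.spec.backoffLimit}{"\n"}{.status.failed}'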

Troubleshooting Failed Upgrades

When an upgrade doesn’t complete successfully, work through these steps in order to identify the root cause.

1. Check the Embedded Cluster Operator

Start by verifying the operator itself is healthy:

kubectl get deploy -n embedded-cluster

You should see the operator deployment is ready and available. If not, check the operator logs:

kubectl logs deploy/embedded-cluster-operator -n embedded-cluster

Look for reconciliation errors or other issues that might prevent the operator from processing upgrade requests.
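
On a busy operator, it can help to filter the logs down to likely errors first; this is plain grep, so adjust the pattern to suit:

kubectl logs deploy/embedded-cluster-operator -n embedded-cluster | grep -i error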

2. Examine the Installation CustomResource

Check if your desired version was set correctly and view the current installation state:

kubectl get installations

This shows all installation attempts, their states, and versions. To get detailed information about a specific installation:

kubectl describe installation <installation-name>

Pay attention to the Status field and any Events listed at the bottom. If the state shows Failed, proceed to check the upgrade job.

Key states to understand:

  • Installing: Upgrade is in progress
  • Installed: Upgrade completed successfully
  • Failed: Upgrade failed and requires investigation
  • Obsolete: Previous installation that has been superseded
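
If you want to check the state from a script rather than reading describe output, you can pull just the state field with jsonpath. The field path shown here (.status.state) is an assumption about where the state is surfaced; verify it against the describe output above:

kubectl get installation <installation-name> -o jsonpath='{.status.state}'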

3. Investigate the Upgrade Job

Find and examine the upgrade job that corresponds to your failed installation:

kubectl get jobs -n kotsadm

Look for jobs with names prefixed with embedded-cluster-upgrade-<timestamp>. Check the job’s logs:

kubectl logs job/embedded-cluster-upgrade-<timestamp> -n kotsadm

The logs will show you exactly where the upgrade process failed. Job pods from all upgrade attempts are preserved, making them an excellent source for understanding failures.
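
Because pods from earlier attempts are preserved, you can list every pod the job created via the standard job-name label that Kubernetes applies, then pull logs from a specific attempt:

kubectl get pods -n kotsadm -l job-name=embedded-cluster-upgrade-<timestamp>
kubectl logs <pod-name> -n kotsadm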

4. Check the k0s Autopilot Plan

If the upgrade job logs mention timeouts or issues with the Kubernetes cluster upgrade, examine the k0s Autopilot plan:

kubectl get plan autopilot -o yaml

Focus on the status field, which shows the progress of the Kubernetes cluster upgrade. Look for any nodes that failed to upgrade or are stuck in a particular state.
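
For a quick scripted check of overall progress, you can extract just the plan state with jsonpath. The field path (.status.state) is an assumption about where Autopilot reports plan progress; confirm it against the full YAML output above:

kubectl get plan autopilot -o jsonpath='{.status.state}'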

Common Issues and Solutions

Helm Upgrade Failures: Check for conflicting custom configurations or resources that might prevent add-on upgrades.
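
For a failing Helm release, the release history usually shows which revision failed and why. These are standard helm commands; substitute your release name and namespace:

helm history <release-name> -n <namespace>
helm status <release-name> -n <namespace>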

Upstream Connectivity Issues: For online (non-air-gapped) installations, ensure that nodes can reach the Replicated upstream throughout the upgrade process.
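
As a rough connectivity check from a node, you can request headers from Replicated endpoints; the exact endpoints your installation uses may vary, so treat these as examples. Any HTTP response (even an auth error) indicates the endpoint is reachable:

curl -sI https://replicated.app
curl -sI https://proxy.replicated.com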

Getting Additional Help

If you’ve worked through all troubleshooting steps and still can’t resolve the issue, gather the following information before contacting Replicated support:

  • The version you upgraded from and the version you targeted
  • Operator logs from the embedded-cluster-operator deployment (step 1)
  • The output of kubectl describe installation for the failed installation (step 2)
  • Logs from the upgrade job and its preserved pods (step 3)
  • The k0s Autopilot plan YAML, if the Kubernetes upgrade was involved (step 4)

Having this information ready will help Replicated engineers diagnose your issue more quickly.