This guide explains how Embedded Cluster upgrades work and provides step-by-step troubleshooting when things go wrong.
How Upgrades Work
Understanding the upgrade process helps you troubleshoot issues more effectively. Here’s what happens behind the scenes:
Step 1: Initiating the Upgrade
When you start an upgrade through the Admin Console or CLI, the system creates a new Installation CustomResource with your target version. The Embedded Cluster operator detects this change and launches a dedicated upgrade job to handle the process.
Step 2: Preparing All Nodes
Before making any changes, the operator runs preparation jobs on every node in your cluster. These jobs ensure each node has everything needed for the upgrade:
- The new application binary
- All required Helm charts
- All container images (pulled from upstream or copied from your local mirror in air-gapped environments)
These preparation jobs clean themselves up automatically once completed.
Step 3: Running the Upgrade
A single Kubernetes job pod performs the actual upgrade work. This job is designed to be fault-tolerant: it restarts and retries if it encounters failures.
Step 4: Core System Updates
The upgrade job handles several critical updates in sequence:
- Kubernetes Cluster Upgrade: Uses k0s Autopilot to safely roll out the new Kubernetes version, updating control plane nodes first, then worker nodes.
- Cluster Configuration: Updates internal settings like image references and network configurations to match the new version.
- Add-ons and Extensions: Runs helm upgrade for built-in components (Calico, Velero, OpenEBS) and any custom extensions you’ve added.
Step 5: Completion and Error Handling
The upgrade job uses exponential backoff retry logic. If the upgrade succeeds, your Installation CustomResource is marked as Installed. If all retries fail, it is marked as Failed so you can investigate the issue.
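The retry behavior can be pictured with a small shell sketch. It is illustrative only: the 2-second base delay, the 60-second cap, and both function names are assumptions made for this example, not the values or code the upgrade job actually uses.

```shell
# Illustrative exponential backoff. Base delay and cap are assumed values,
# not the settings used by the Embedded Cluster upgrade job.

# backoff_delay ATTEMPT [BASE] [CAP]
# Prints the wait in seconds before retrying after failed attempt ATTEMPT (1-based).
backoff_delay() {
  local attempt="$1"
  local base="${2:-2}"
  local cap="${3:-60}"
  local delay=$(( base * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# retry_with_backoff MAX_ATTEMPTS COMMAND...
# Reruns COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping longer after each failure.
retry_with_backoff() {
  local max="$1"
  shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1  # retries exhausted; a caller would record this as Failed
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
  return 0
}
```

With these assumed defaults the waits grow as 2, 4, 8, 16, 32 seconds and then stay at 60. Doubling the delay gives transient problems, such as a briefly unreachable registry, time to clear without hammering the cluster.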
Troubleshooting Failed Upgrades
When an upgrade doesn’t complete successfully, work through these steps in order to identify the root cause.
1. Check the Embedded Cluster Operator
Start by verifying the operator itself is healthy:
kubectl get deploy -n embedded-cluster
The operator deployment should be ready and available. If it is not, check the operator logs:
kubectl logs deploy/embedded-cluster-operator -n embedded-cluster
Look for reconciliation errors or other issues that might prevent the operator from processing upgrade requests.
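Operator logs can be long, so it may help to narrow them to lines that look like problems. The helper below is a minimal sketch: the function name and the keyword list are assumptions, not a documented log format, so adjust the pattern to what your logs actually contain.

```shell
# filter_problem_lines
# Reads log text on stdin and prints only lines that mention common failure
# keywords (case-insensitive). The keyword list is an assumption.
filter_problem_lines() {
  # "|| true" keeps the exit status at 0 when nothing matches
  grep -i -E 'error|failed|timeout|reconcil' || true
}

# Example (requires cluster access):
#   kubectl logs deploy/embedded-cluster-operator -n embedded-cluster --since=1h | filter_problem_lines
```

An empty result does not prove the operator is healthy; it only means none of the assumed keywords appeared in the selected time window.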
2. Examine the Installation CustomResource
Check if your desired version was set correctly and view the current installation state:
kubectl get installations
This shows all installation attempts, their states, and versions. To get detailed information about a specific installation:
kubectl describe installation <installation-name>
Pay attention to the Status field and any Events listed at the bottom. If the state shows Failed, proceed to check the upgrade job.
Key states to understand:
- Installing: Upgrade is in progress
- Installed: Upgrade completed successfully
- Failed: Upgrade failed and requires investigation
- Obsolete: Previous installation that has been superseded
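The states above map onto next actions, which can be sketched as a small lookup. The function name and the wording of the suggestions are this example's own. The commented usage line assumes the state is exposed at .status.state, so confirm the field path against the kubectl describe output first.

```shell
# next_step_for_state STATE
# Prints a suggested next action for an Installation state, following the
# troubleshooting order in this guide.
next_step_for_state() {
  case "$1" in
    Installing) echo "wait: upgrade is still in progress" ;;
    Installed)  echo "done: upgrade completed successfully" ;;
    Failed)     echo "investigate: check the upgrade job logs" ;;
    Obsolete)   echo "ignore: superseded by a newer installation" ;;
    *)          echo "unknown state: $1" ;;
  esac
}

# Example (requires cluster access; .status.state is an assumed field path):
#   state="$(kubectl get installation <installation-name> -o jsonpath='{.status.state}')"
#   next_step_for_state "$state"
```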
3. Investigate the Upgrade Job
Find and examine the upgrade job that corresponds to your failed installation:
kubectl get jobs -n kotsadm
Look for jobs named embedded-cluster-upgrade-<timestamp>, then check the job’s logs:
kubectl logs job/embedded-cluster-upgrade-<timestamp> -n kotsadm
The logs will show you exactly where the upgrade process failed. Job pods from all upgrade attempts are preserved, making them an excellent source for understanding failures.
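When several upgrade attempts exist, picking out the most recent job by hand is error-prone. The sketch below assumes job names arrive oldest first, which kubectl can provide by sorting on creation time. The function name is hypothetical.

```shell
# latest_upgrade_job
# Reads job names on stdin (one per line, oldest first) and prints the last
# one whose name starts with embedded-cluster-upgrade-.
latest_upgrade_job() {
  sed 's|^job\.batch/||' | grep '^embedded-cluster-upgrade-' | tail -n 1
}

# Example (requires cluster access):
#   job="$(kubectl get jobs -n kotsadm --sort-by=.metadata.creationTimestamp -o name | latest_upgrade_job)"
#   kubectl logs "job/$job" -n kotsadm
```

The sed step strips the job.batch/ prefix that kubectl adds with -o name, so the result can be passed straight to kubectl logs.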
4. Check the k0s Autopilot Plan
If the upgrade job logs mention timeouts or issues with the Kubernetes cluster upgrade, examine the k0s Autopilot plan:
kubectl get plan autopilot -o yaml
Focus on the status field, which shows the progress of the Kubernetes cluster upgrade. Look for any nodes that failed to upgrade or are stuck in a particular state.
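The full plan YAML can be verbose. As a rough aid, the helper below prints only the name and state lines so that a stuck node stands out. It matches field names textually and does not assume a particular Plan schema; the function name is hypothetical.

```shell
# plan_states
# Reads Plan YAML on stdin and prints each "name:" and "state:" line with
# leading indentation and list dashes removed.
plan_states() {
  grep -E '^[[:space:]-]*(name|state):' | sed -E 's/^[[:space:]-]+//'
}

# Example (requires cluster access):
#   kubectl get plan autopilot -o yaml | plan_states
```

Because this is a plain text filter, it also prints unrelated name fields such as the plan's own metadata name. Fall back to the full YAML whenever the summary is ambiguous.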
Common Issues and Solutions
Helm Upgrade Failures: Check for conflicting custom configurations or resources that might prevent add-on upgrades.
Upstream Connectivity Issues: Ensure online nodes can communicate with Replicated upstream during the upgrade process.
Getting Additional Help
If you’ve worked through all troubleshooting steps and still can’t resolve the issue, gather the following information before contacting Replicated support:
- Output from all steps above
- A support bundle
Having this information ready will help Replicated engineers diagnose your issue more quickly.
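To capture the output from the steps above in one pass, a helper along these lines may be useful. It is a sketch: the function name, default directory, and file names are arbitrary choices, and it runs only the commands already shown in this guide. It does not generate the support bundle.

```shell
# collect_diagnostics [DIR]
# Saves the output of the troubleshooting commands from this guide into DIR,
# one file per command, then prints DIR.
collect_diagnostics() {
  local dir="${1:-ec-upgrade-diagnostics}"
  mkdir -p "$dir" || return 1
  # "|| true" keeps going when one command fails, so a partial collection is still saved
  kubectl get deploy -n embedded-cluster > "$dir/operator-deploy.txt" 2>&1 || true
  kubectl logs deploy/embedded-cluster-operator -n embedded-cluster > "$dir/operator-logs.txt" 2>&1 || true
  kubectl get installations > "$dir/installations.txt" 2>&1 || true
  kubectl get jobs -n kotsadm > "$dir/upgrade-jobs.txt" 2>&1 || true
  kubectl get plan autopilot -o yaml > "$dir/autopilot-plan.yaml" 2>&1 || true
  echo "$dir"
}
```

Error output is written to the same files as normal output, so a failed command still leaves a record of why it failed. Add the upgrade job logs from step 3 manually, since the job name differs for each attempt.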