Recovering an Embedded Cluster with an expired kubelet client certificate

Issue

An embedded cluster node (Replicated Embedded Cluster, k0s-based) fails to become Ready after the host was powered off or unreachable during the window in which its kubelet client certificate should have rotated.

Symptoms

  • The node reports its Ready condition as Unknown, and the MemoryPressure, DiskPressure, and PIDPressure conditions are also Unknown:
    • Reason: NodeStatusUnknown
    • Message: Kubelet stopped posting node status.
  • The kubelet logs show authentication errors similar to:
Unable to authenticate the request
err="x509: certificate has expired or is not yet valid: current time 2026-06-05T16:01:23Z is after 2026-05-14T18:09:59Z"
  • systemctl restart k0scontroller does not resolve the issue.
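
To confirm the authentication symptom, the expiry errors can be filtered out of the service journal. The following is a minimal sketch: it assumes the kubelet output is captured in the k0scontroller unit journal (the same unit used later in this article), and the function name find_expired_cert_errors is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and prints the most recent
# x509 expiry errors (up to 5), so the "is after <date>" part is easy to spot.
find_expired_cert_errors() {
  grep -F 'x509: certificate has expired or is not yet valid' | tail -n 5
}

# Usage on the affected node (assumes kubelet logs land in this unit's journal):
#   sudo journalctl -u k0scontroller --no-pager | find_expired_cert_errors
```

The date after "is after" in a matching line is the certificate expiry time; the date after "current time" is the host clock at the moment of the failed request.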

Root cause

The kubelet’s client certificate is stored at:

/var/lib/embedded-cluster/k0s/kubelet/pki/kubelet-client-current.pem

This certificate has a 1-year lifetime. Under normal operation, the kubelet rotates the certificate before it expires. However, if the host is offline when the certificate expires, automatic rotation cannot occur afterward, because the kubelet must authenticate the renewal request with the certificate that has already expired.

Restarting k0scontroller only regenerates the server-side certificates. The kubelet authentication kubeconfig (/var/lib/embedded-cluster/k0s/kubelet.conf) and the expired kubelet client certificate are left unchanged.
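
Before changing anything, the root cause can be confirmed by reading the expiry date directly from the certificate file. The sketch below uses the standard openssl x509 options -enddate and -checkend; the helper name check_cert_expiry and the CERT variable are hypothetical, and it assumes openssl is installed on the host.

```shell
# Path taken from this article; set CERT to inspect a different file.
CERT="${CERT:-/var/lib/embedded-cluster/k0s/kubelet/pki/kubelet-client-current.pem}"

# Hypothetical helper: prints the notAfter date and reports whether it has passed.
# Returns 0 if still valid, 1 if expired, 2 if the file cannot be parsed.
check_cert_expiry() {
  cert="$1"
  openssl x509 -in "$cert" -noout -enddate || return 2
  # -checkend 0 exits non-zero when the certificate has already expired.
  if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
    echo "certificate is still valid"
  else
    echo "certificate has expired"
    return 1
  fi
}

# Usage (run as root; the PKI directory is typically not world-readable):
#   check_cert_expiry "$CERT"
```

If the helper reports that the certificate is still valid, the node failure likely has a different cause and the procedure in this article may not apply.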

Resolution

The following procedure regenerates the kubelet client certificate. All existing kubelet configuration and PKI files are backed up first, so they can be restored if needed.

1. Stop the k0s controller

sudo systemctl stop k0scontroller

2. Move the expired kubelet kubeconfig aside and back up the PKI directory

sudo mv /var/lib/embedded-cluster/k0s/kubelet.conf /tmp/kubelet.conf.expired
sudo cp -a /var/lib/embedded-cluster/k0s/kubelet/pki /tmp/kubelet-pki.expired-bak

3. Remove the expired kubelet client certificates

sudo rm -f /var/lib/embedded-cluster/k0s/kubelet/pki/kubelet-client-*

4. Restart the k0s controller and monitor the logs

sudo systemctl start k0scontroller
sudo journalctl -u k0scontroller --no-pager -f

Expected behavior

  • The node should return to Ready status within approximately 45 seconds.
  • All pods should return to Running with every container ready (for example, 1/1) within 3 to 4 minutes.
  • A small number of pods may restart once while their service account tokens refresh. This is expected after a long outage and resolves automatically.
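
If the node does not recover and the original files are needed again (for inspection or before retrying), the backup taken in step 2 can be copied back. This is only a sketch: the helper name restore_kubelet_backup is hypothetical, the default paths are the ones used in this article, and certificate files generated since step 4 are left in place rather than removed.

```shell
# Hypothetical helper: puts back the files saved in step 2.
# Arguments default to the paths used in this article.
restore_kubelet_backup() {
  k0s_dir="${1:-/var/lib/embedded-cluster/k0s}"
  backup_conf="${2:-/tmp/kubelet.conf.expired}"
  backup_pki="${3:-/tmp/kubelet-pki.expired-bak}"

  # Refuse to continue if either backup is missing.
  [ -f "$backup_conf" ] && [ -d "$backup_pki" ] || {
    echo "backup not found; nothing restored" >&2
    return 1
  }

  # Copy rather than move, so the backup itself stays intact.
  cp -a "$backup_conf" "$k0s_dir/kubelet.conf" &&
    mkdir -p "$k0s_dir/kubelet/pki" &&
    cp -a "$backup_pki/." "$k0s_dir/kubelet/pki/"
}

# Usage (as root), with the service stopped while files are replaced:
#   systemctl stop k0scontroller
#   restore_kubelet_backup
#   systemctl start k0scontroller
```

Restoring returns the node to its state before the procedure, including the expired certificate, so it does not by itself fix the node.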

5. Verify clock synchronization before resuming normal operations

After the cluster is healthy, ensure the host’s system clock is accurate and NTP is enabled before returning the node to production use.
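
One way to check clock synchronization is sketched below. It assumes a systemd-based host where timedatectl show is available; the helper name clock_sync_ok is hypothetical. Hosts that use a different time service (for example chrony without systemd integration) may need a different check.

```shell
# Hypothetical helper: reads `timedatectl show` output on stdin and succeeds
# only if the NTP service is enabled and the clock is reported as synchronized.
clock_sync_ok() {
  awk -F= '
    $1 == "NTP"             { ntp = $2 }
    $1 == "NTPSynchronized" { synced = $2 }
    END {
      printf "NTP enabled: %s, synchronized: %s\n", ntp, synced
      exit !(ntp == "yes" && synced == "yes")
    }'
}

# Usage on the host:
#   timedatectl show | clock_sync_ok || echo "fix time sync before going live"
# If NTP is disabled, `sudo timedatectl set-ntp true` enables it on hosts
# that use a systemd-managed time service.
```

An accurate clock matters here because certificate validity is evaluated against the host time, as the "current time" field in the x509 error shows.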

Verification

Run the following commands to confirm the cluster is healthy:

sudo kubectl get nodes
sudo kubectl get pods -A

The node should report Ready and all pods should be Running.
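
On clusters with many pods, scanning the full output by eye is error-prone. The sketch below filters the output of the two commands above down to anything that is not healthy. The helper names unready_nodes and unhealthy_pods are hypothetical, and the column positions assume the default kubectl table output with the --no-headers flag.

```shell
# Hypothetical helpers: each reads `kubectl get ... --no-headers` output on
# stdin, prints anything unhealthy, and fails if it printed at least one line.

# Nodes whose STATUS column (field 2) does not start with "Ready".
unready_nodes() {
  awk '$2 !~ /^Ready/ { print; bad = 1 } END { exit bad + 0 }'
}

# Pods (from `get pods -A`) that are neither Completed nor Running with all
# containers ready. Fields: NAMESPACE NAME READY STATUS ...
unhealthy_pods() {
  awk '{
    split($3, r, "/")
    if ($4 == "Completed") next
    if ($4 != "Running" || r[1] != r[2]) { print; bad = 1 }
  } END { exit bad + 0 }'
}

# Usage:
#   sudo kubectl get nodes --no-headers | unready_nodes
#   sudo kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output with a zero exit status from both helpers corresponds to the healthy state described above. Pods that finished as Completed (such as one-off jobs) are treated as healthy in this sketch.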

Applies to

  • Replicated Embedded Cluster