A Hands-On Guide to RKE2 and Cluster API for On-Premise Kubernetes

Sigrid Jin — Sun, 22 Feb 2026 20:01:04 GMT

If you’ve ever managed Kubernetes clusters on bare-metal or on-premise infrastructure, you know the pain: installing, upgrading, securing, and scaling clusters is a lot of manual work. Two tools from the Kubernetes ecosystem tackle these challenges head-on — RKE2 and Cluster API (CAPI).

In this post, we’ll get our hands dirty with both. But first, let’s understand why they exist and what problems they solve. RKE2 (also known as RKE Government) is a Kubernetes distribution developed by Rancher (now part of SUSE).

Here’s how simple the installation looks: a single script installs the RKE2 binary and the rke2-server (or rke2-agent) systemd service, and the required container images are pulled when the service first starts. Compare that to the dozens of steps needed for a traditional kubeadm setup, and the difference is night and day.

# Download and run the installer — that's it!
curl -sfL https://get.rke2.io --output install.sh
chmod +x install.sh
INSTALL_RKE2_CHANNEL=v1.33 ./install.sh

# Verify
rke2 --version
# rke2 version v1.33.7+rke2r3 (...)

RKE2 Architecture at a Glance

RKE2 follows a familiar server + agent model.

📖 For the official architecture diagram, see docs.rke2.io/architecture.

Boot sequence:

The agent (kubelet) starts up
Static pod manifests are written to disk
Control-plane pods (etcd, apiserver, etc.) are launched by kubelet

┌──────────────────────────────────────────────┐
│              RKE2 Server Node                │
│           (= Control Plane Node)             │
│                                              │
│  ┌─────────────────────────────────────────┐ │
│  │         RKE2 Supervisor Process         │ │
│  │  ┌──────────┐  ┌────────────────────┐   │ │
│  │  │ kubelet  │  │    containerd      │   │ │
│  │  └────┬─────┘  └────────────────────┘   │ │
│  │       │                                 │ │
│  │       ▼  watches pod-manifests/         │ │
│  │  ┌──────────────────────────────────┐   │ │
│  │  │  Static Pods:                    │   │ │
│  │  │  • etcd                          │   │ │
│  │  │  • kube-apiserver                │   │ │
│  │  │  • kube-controller-manager       │   │ │
│  │  │  • kube-scheduler                │   │ │
│  │  └──────────────────────────────────┘   │ │
│  └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘

┌──────────────────────────────────────────────┐
│              RKE2 Agent Node                 │
│            (= Worker Node)                   │
│                                              │
│  ┌─────────────────────────────────────────┐ │
│  │         RKE2 Agent Process              │ │
│  │  ┌──────────┐  ┌────────────────────┐   │ │
│  │  │ kubelet  │  │    containerd      │   │ │
│  │  └──────────┘  └────────────────────┘   │ │
│  └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘

What is Cluster API?

Now imagine you don’t just have one cluster to manage — you have tens or hundreds. Installing each one manually (even with RKE2) becomes a bottleneck. This is where Cluster API (CAPI) comes in.

Cluster API is a Kubernetes sub-project that lets you manage Kubernetes clusters themselves as Kubernetes resources. In other words: you use Kubernetes to create, configure, and upgrade Kubernetes clusters.

It introduces custom resources such as the following.

# These are just regular Kubernetes objects!
apiVersion: cluster.x-k8s.io/v1beta2
kind: Cluster          # Represents an entire K8s cluster
---
apiVersion: cluster.x-k8s.io/v1beta2
kind: Machine          # Represents a single node (VM or bare-metal)
---
apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineDeployment  # Like a Deployment, but for Machines

The key concept is the separation between two types of clusters.

Management Cluster: The cluster where Cluster API is deployed and its controllers run. Think of it as the “control tower.”
Workload Cluster: A cluster that was provisioned through Cluster API. This is where your applications actually run.

┌─────────────────────────────────┐      ┌─────────────────────────────────┐
│      Management Cluster         │      │       Workload Cluster(s)       │
│                                 │      │                                 │
│  • Runs Cluster API controllers │─────▶│  • Provisioned BY Cluster API   │
│  • Stores cluster definitions   │      │  • Runs your actual workloads   │
│  • The "brain" of operations    │      │  • Can be upgraded declaratively│
│                                 │      │                                 │
└─────────────────────────────────┘      └─────────────────────────────────┘

So, here is the bottom line.

RKE2 solves the problem of installing and running a single, secure Kubernetes cluster with minimal effort.
Cluster API solves the problem of managing the lifecycle of many clusters at scale — creation, upgrades, scaling — all declaratively.

In this hands-on guide, we’ll walk through:

RKE2: Install a server + agent cluster from scratch, deploy a sample app, manage certificates, and perform both manual and automated upgrades.
Cluster API: Set up a management cluster, provision a full workload cluster (3 control-plane + 3 worker nodes), deploy apps, upgrade the cluster version with a single kubectl patch, and tear it all down.

RKE2 Overview — Architecture, Security-First Design, and How It Works

RKE2 is a Kubernetes distribution developed by Rancher (now part of SUSE). If you have ever worked with K3s, you will immediately notice a familiar philosophy here — keep things simple, ship a single binary, and get out of the operator’s way.

But RKE2 takes a slightly different path. While K3s was designed to be ultra-lightweight and perfect for edge and IoT scenarios, RKE2 was built with security and compliance as the top priority from day one. The name itself sometimes appears as “RKE Government” in older documentation, which gives you a hint about its original target audience — environments where passing security audits is not optional.

So what exactly makes RKE2 special? There are three pillars that define it.

The first pillar is security by default. RKE2 ships with defaults and configuration options that are designed to let a cluster pass the CIS (Center for Internet Security) Kubernetes Benchmark with minimal operator intervention. If you have ever gone through a CIS hardening exercise on a vanilla kubeadm cluster, you know how painful that can be: dozens of flags to set, file permissions to tighten, audit policies to write. RKE2 handles most of this automatically. It ships with sane defaults for etcd encryption, Pod Security Standards, audit logging, and network policies. You still have the freedom to customize everything, but the starting point is already hardened.

The second pillar is operational simplicity. Just like K3s, RKE2 is distributed as a single binary. You do not need to separately install kubelet, kube-proxy, etcd, or any other control plane component by hand. A single install script pulls everything you need, registers a systemd service, and you are ready to go. Upgrades follow the same pattern — run the installer again with a newer channel, restart the service, and the cluster rolls forward. This dramatically reduces the surface area for human error during both initial deployment and ongoing maintenance.

The third pillar is a container-first architecture. RKE2 minimizes its dependency on the host operating system. As with kubeadm, the control plane components run as static pods rather than as bare processes on the host. The difference is that RKE2 also bundles containerd and the kubelet, so you do not install or configure a container runtime yourself. The kubelet, which is managed by the RKE2 supervisor process, watches a static pod manifest directory and launches etcd, kube-apiserver, kube-controller-manager, and kube-scheduler as static pods. This means the host OS needs to provide little more than a Linux kernel, and the control plane itself lives inside containers.

Sources: https://docs.rke2.io/architecture and https://blog03038.tistory.com/14

Let us talk about the architecture in more detail. If you look at the official architecture diagram at https://docs.rke2.io/architecture, you will see that an RKE2 cluster is composed of two types of nodes: RKE2 Server Nodes and RKE2 Agent Nodes.

A Server Node is what you would normally call a control plane node — it runs etcd, the API server, the controller manager, and the scheduler, in addition to a kubelet and kube-proxy. An Agent Node is simply a worker node — it runs a kubelet and kube-proxy, and its sole job is to run your application workloads.

The interesting part is what happens inside each node. There is a single RKE2 process (called the RKE2 Supervisor) that orchestrates everything. On a server node, this supervisor process manages containerd and the kubelet internally. It writes static pod manifests into a well-known directory, and the kubelet picks them up and starts the corresponding pods. So the boot sequence looks like this: the RKE2 supervisor starts, it brings up containerd and then the kubelet, the kubelet detects the static pod manifests that the supervisor has written, and then the control plane pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) come to life one by one. On an agent node the process is simpler: the supervisor starts, brings up containerd, the kubelet, and kube-proxy, and the node registers itself with the API server running on the server nodes.

This design has a subtle but important consequence. Because the control plane components run as static pods managed by the kubelet, you can inspect them with standard Kubernetes tooling. You can run kubectl get pods -n kube-system and see etcd-k8s-node1, kube-apiserver-k8s-node1, and so on, just like you would on a kubeadm cluster. But unlike kubeadm, you did not have to manually bootstrap any of this — the RKE2 supervisor took care of it.
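
If you want to see the on-disk side of this, a small helper along these lines can list the manifests the supervisor has written. Treat it as a sketch: the function name is made up for illustration, and the default path is the location I would expect on a default install, so verify it on your own node.

```shell
# Hypothetical helper: print the component name for each static pod manifest.
# Assumes the default RKE2 data dir; pass another directory as $1 if yours differs.
list_static_pods() {
  dir="${1:-/var/lib/rancher/rke2/agent/pod-manifests}"
  for manifest in "$dir"/*.yaml; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -e "$manifest" ] || continue
    basename "$manifest" .yaml
  done
}

# Usage on a server node (as root):
#   list_static_pods
```

The names it prints should line up with the pod name prefixes you see in kubectl get pods -n kube-system.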

Now let us see what this looks like in practice. When you install RKE2 on a fresh server, the install script is remarkably straightforward. You download it, make it executable, and run it with the INSTALL_RKE2_CHANNEL environment variable set to pin your desired Kubernetes minor version.

curl -sfL https://get.rke2.io --output install.sh
chmod +x install.sh
INSTALL_RKE2_CHANNEL=v1.33 ./install.sh

After the script finishes, you will see output confirming that several RPM packages have been installed (on RHEL/Rocky-based systems). For example, on a Rocky Linux 9 machine you might see something like below.

Installed:
  rke2-common-1.33.7~rke2r3-0.el9.aarch64
  rke2-selinux-0.22-1.el9.noarch
  rke2-server-1.33.7~rke2r3-0.el9.aarch64

You can verify the installed version immediately:

rke2 --version

rke2 version v1.33.7+rke2r3 (7e4fd1a82edf497cab91c220144619bbad659cf4)
go version go1.24.11 X:boringcrypto

Notice the “boringcrypto” tag in the Go version. This is not accidental. RKE2 is compiled with BoringCrypto, Google’s FIPS 140-2 validated cryptographic module. This means that all TLS operations within the RKE2 binary use FIPS-compliant cryptography, which is another nod to the security-first philosophy.
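
If you want to check this in a script rather than by eye, something like the following could work. It is only a sketch that parses the two-line output shown above; the function names are hypothetical, and the output format may differ between releases.

```shell
# Hypothetical helpers that read `rke2 --version` output on stdin.

# Print the RKE2 version from the first line, e.g. the third word of
# "rke2 version <version> (<commit>)".
rke2_version() {
  awk '$1 == "rke2" && $2 == "version" { print $3; exit }'
}

# Succeed only if the Go toolchain line mentions boringcrypto.
has_boringcrypto() {
  grep -q '^go version .*boringcrypto'
}

# Usage on a node with RKE2 installed:
#   rke2 --version | rke2_version
#   rke2 --version | has_boringcrypto && echo "BoringCrypto build"
```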

The install script also sets up YUM/DNF repositories so that future updates can be pulled in cleanly. You can confirm this by checking the repo list:

dnf repolist

rancher-rke2-1.33-stable       Rancher RKE2 1.33 (v1.33)
rancher-rke2-common-stable     Rancher RKE2 Common (v1.33)

And if you peek inside the repo configuration file, you will see that it points to Rancher’s official RPM mirror:

# /etc/yum.repos.d/rancher-rke2.repo

[rancher-rke2-common-stable]
name=Rancher RKE2 Common (v1.33)
baseurl=https://rpm.rancher.io/rke2/stable/common/centos/9/noarch
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key

[rancher-rke2-1.33-stable]
name=Rancher RKE2 1.33 (v1.33)
baseurl=https://rpm.rancher.io/rke2/stable/1.33/centos/9/aarch64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key

Both package signature checking (gpgcheck) and repository metadata signature checking (repo_gpgcheck) are enabled by default. This ensures that neither the packages nor the repository metadata have been tampered with in transit, which is yet another security detail that RKE2 gets right without you having to think about it.
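
To confirm that nobody has loosened these settings later, a quick audit could look roughly like this. It is a sketch with a hypothetical function name; it assumes the simple key=value layout shown above with no spaces around the equals sign.

```shell
# Hypothetical helper: succeed only if every [section] in a .repo file
# sets both gpgcheck=1 and repo_gpgcheck=1.
repo_gpg_enforced() {
  awk '
    function check_section() {
      # A section passes only if it enabled both checks.
      if (section != "" && !(pkg && meta)) bad = 1
    }
    /^\[/               { check_section(); section = $0; pkg = 0; meta = 0 }
    /^gpgcheck=1$/      { pkg = 1 }
    /^repo_gpgcheck=1$/ { meta = 1 }
    END                 { check_section(); exit bad + 0 }
  ' "$1"
}

# Usage:
#   repo_gpg_enforced /etc/yum.repos.d/rancher-rke2.repo && echo "GPG checks enforced"
```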

At this point, RKE2 is installed but not yet running. The binary supports two primary subcommands, which directly correspond to the two node roles we discussed earlier:

rke2 --help

   server    Run management server
   agent     Run node agent

The “server” subcommand starts the full control plane (API server, etcd, scheduler, controller manager) along with a kubelet and kube-proxy. The “agent” subcommand starts only a kubelet and kube-proxy, which is all a worker node needs.

Before starting the server, you typically write a configuration file at /etc/rancher/rke2/config.yaml. This is where you customize the behavior of RKE2 — things like which CNI plugin to use, which IP addresses to bind to, and which built-in add-ons to disable. Here is an example from the hands-on lab:

# /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: "0644"
debug: true
cni: canal
bind-address: 192.168.10.11
advertise-address: 192.168.10.11
node-ip: 192.168.10.11
disable-cloud-controller: true
disable:
  - servicelb
  - rke2-coredns-autoscaler
  - rke2-ingress-nginx
  - rke2-snapshot-controller
  - rke2-snapshot-controller-crd
  - rke2-snapshot-validation-webhook

A few things worth calling out here. The “write-kubeconfig-mode” setting controls the file permissions of the generated kubeconfig file at /etc/rancher/rke2/rke2.yaml. Setting it to 0644 makes it world-readable, which is convenient in a lab but something you would lock down in production.
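
If you are not sure how a given node was configured, a check along these lines can flag a world-readable kubeconfig. This is a minimal sketch; the function name is hypothetical, and a tighter mode such as "0600" is my suggestion for production rather than something the lab uses.

```shell
# Hypothetical helper: succeed if the file is readable by "other" users.
is_world_readable() {
  # find prints the path only when the o+r permission bit is set.
  [ -n "$(find "$1" -prune -perm -004 -print 2>/dev/null)" ]
}

# Usage on a server node:
#   if is_world_readable /etc/rancher/rke2/rke2.yaml; then
#     echo "kubeconfig is world-readable; consider a tighter write-kubeconfig-mode"
#   fi
```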

The “cni” field lets you choose between canal (the default, which combines Calico for network policy and Flannel for overlay networking), cilium, calico, or none. The “disable” list is particularly useful — RKE2 ships with several built-in Helm charts (ingress-nginx, metrics-server, CoreDNS autoscaler, etc.), and you can selectively disable any of them if you plan to bring your own.

RKE2 also supports per-chart customization through a mechanism called HelmChartConfig. If you want to override the default values of a built-in Helm chart, you simply drop a HelmChartConfig manifest into /var/lib/rancher/rke2/server/manifests/ before starting the server. For example, to tell the Canal CNI to use a specific network interface:

# /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-canal
  namespace: kube-system
spec:
  valuesContent: |-
    flannel:
      iface: "enp0s9"

This is a very elegant pattern. Instead of requiring you to manually run helm install with custom values after the cluster is up, RKE2 lets you declare your desired configuration before the cluster even boots. The RKE2 Helm controller will pick up these manifests and apply them automatically during the initial startup sequence.

Once everything is configured, starting the server is a single systemctl command:

systemctl enable --now rke2-server.service

This typically takes about two minutes. During that time, the RKE2 supervisor starts containerd, generates TLS certificates for all control plane components, writes static pod manifests, and waits for the kubelet to bring up etcd and the API server. You can watch the progress in real time with:

journalctl -u rke2-server -f

After the server is up, you will find all the necessary binaries tucked inside /var/lib/rancher/rke2/bin/:

tree /var/lib/rancher/rke2/bin/
├── containerd
├── containerd-shim-runc-v2
├── crictl
├── ctr
├── kubectl
├── kubelet
└── runc

These are not installed into your system PATH by default. RKE2 keeps them isolated to avoid conflicts with any existing binaries on the host. If you want to use kubectl and other tools directly, the recommended approach is to create symbolic links:

ln -s /var/lib/rancher/rke2/bin/kubectl /usr/local/bin/kubectl
ln -s /var/lib/rancher/rke2/bin/crictl /usr/local/bin/crictl
ln -s /var/lib/rancher/rke2/bin/containerd /usr/local/bin/containerd
ln -s /var/lib/rancher/rke2/bin/runc /usr/local/bin/runc
ln -s /var/lib/rancher/rke2/bin/ctr /usr/local/bin/ctr
ln -s /var/lib/rancher/rke2/agent/etc/crictl.yaml /etc/crictl.yaml

Then you copy the generated kubeconfig to your home directory:

mkdir -p ~/.kube
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config

And you are ready to interact with your cluster:

kubectl cluster-info

Kubernetes control plane is running at https://192.168.10.11:6443

kubectl get node -owide

NAME        STATUS   ROLES                AGE   VERSION          INTERNAL-IP     OS-IMAGE                      CONTAINER-RUNTIME
k8s-node1   Ready    control-plane,etcd   15m   v1.33.7+rke2r3   192.168.10.11   Rocky Linux 9.6 (Blue Onyx)   containerd://2.1.5-k3s1

If you check the running pods, you will see the control plane components we discussed — all running as pods in the kube-system namespace, exactly as the architecture diagram predicts:

kubectl get pod -A

NAMESPACE     NAME                                         READY   STATUS      AGE
kube-system   etcd-k8s-node1                               1/1     Running     2m
kube-system   kube-apiserver-k8s-node1                     1/1     Running     2m
kube-system   kube-controller-manager-k8s-node1            1/1     Running     2m
kube-system   kube-scheduler-k8s-node1                     1/1     Running     2m
kube-system   kube-proxy-k8s-node1                         1/1     Running     2m
kube-system   rke2-canal-dkw2n                             2/2     Running     2m
kube-system   rke2-coredns-rke2-coredns-784bcb7f4d-tpt2d   1/1     Running     2m
kube-system   rke2-metrics-server-7b59bd8854-m5w2c         1/1     Running     1m

You will also notice that RKE2 deployed several add-ons as Helm releases — Canal for networking, CoreDNS for DNS resolution, metrics-server for resource metrics, and a set of runtime classes:

helm list -A

NAME                NAMESPACE    STATUS    CHART                              APP VERSION
rke2-canal          kube-system  deployed  rke2-canal-v3.31.3-build2026011900 v3.31.3
rke2-coredns        kube-system  deployed  rke2-coredns-1.45.008              1.13.1
rke2-metrics-server kube-system  deployed  rke2-metrics-server-3.13.006       0.8.0
rke2-runtimeclasses kube-system  deployed  rke2-runtimeclasses-0.1.000        0.1.0

This Helm-based add-on management is one of the things that makes RKE2 so pleasant to operate. Every built-in component is a standard Helm chart, which means you can inspect its values, override them with HelmChartConfig, or disable them entirely and replace them with your own preferred solution. There is no magic, no proprietary packaging — just Helm charts running on your cluster.

To summarize the architecture: RKE2 gives you a production-grade, CIS-hardened Kubernetes cluster through a single binary and a single configuration file. The RKE2 supervisor process manages containerd and kubelet internally, launches control plane components as static pods, and deploys networking and DNS through Helm charts. The result is a cluster that is easy to install, easy to understand, and secure by default — without sacrificing any of the flexibility that Kubernetes operators expect.

RKE2 Hands-On

Before we get into the actual installation, let’s talk about what we’re building. Our lab environment consists of two nodes: k8s-node1 (which will serve as the control plane) and k8s-node2 (which will be our worker node). Both are running Rocky Linux 9.6, and we’re provisioning them with Vagrant. If you want to follow along, you can grab the Vagrantfile and init script from the repository and spin up the environment like this:

mkdir k8s-rke2
cd k8s-rke2

curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-rke2/Vagrantfile
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-rke2/init_cfg.sh

vagrant up
vagrant status

Once both VMs are up and running, we’re ready to begin.

Installing the Server (Control Plane)

Let’s SSH into our first node, which will become the RKE2 server, also known as the control plane node.

vagrant ssh k8s-node1

RKE2 provides a convenient installation script that sets everything up as a systemd service. We’re going to download this script, make it executable, and run it while specifying the v1.33 release channel. This tells the installer to pull the latest stable version from the 1.33 line.

curl -sfL https://get.rke2.io --output install.sh
chmod +x install.sh
INSTALL_RKE2_CHANNEL=v1.33 ./install.sh

When the installation completes, you’ll see output confirming that the rke2-common, rke2-selinux, and rke2-server packages have been installed. At this point, nothing is running yet. The installer simply placed the binary and the systemd unit files on your system.

You can verify the installed version by running the following command. You should see something like v1.33.7+rke2r3 along with the Go version used to compile it.

rke2 --version

The installer also added a couple of Rancher RPM repositories to your system. You can confirm this by checking the repo list and the repo configuration file. This is how future upgrades will be pulled.

dnf repolist
cat /etc/yum.repos.d/rancher-rke2.repo

Now comes the important part: configuring RKE2 before we start the service. RKE2 reads its configuration from /etc/rancher/rke2/config.yaml.

We’re going to create this file and set a few key options. We’ll make the kubeconfig file world-readable for convenience, enable debug logging to get more visibility, choose Canal as our CNI plugin, bind the API server and advertise address to our node’s static IP, disable the cloud controller since we’re on-prem, and also disable a handful of built-in add-ons that we either don’t need or want to manage ourselves (the built-in service load balancer, CoreDNS autoscaler, ingress-nginx, and the snapshot-related components).

mkdir -p /etc/rancher/rke2/
cat << EOF > /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: "0644"

debug: true

cni: canal

bind-address: 192.168.10.11
advertise-address: 192.168.10.11
node-ip: 192.168.10.11

disable-cloud-controller: true

disable:
  - servicelb
  - rke2-coredns-autoscaler
  - rke2-ingress-nginx
  - rke2-snapshot-controller
  - rke2-snapshot-controller-crd
  - rke2-snapshot-validation-webhook
EOF

Since we’re using Canal as the CNI, and our Vagrant VMs have multiple network interfaces, we need to tell Flannel (which is the networking backend inside Canal) which interface to use. RKE2 manages its built-in add-ons through Helm charts, and we can customize them by placing HelmChartConfig manifests in the server manifests directory.

Let’s create that directory first and then drop in our Canal configuration.

mkdir -p /var/lib/rancher/rke2/server/manifests/

cat << EOF > /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-canal
  namespace: kube-system
spec:
  valuesContent: |-
    flannel:
      iface: "enp0s9"
EOF
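
If you are not sure which interface name to put in iface on your own machines, you can look up the interface that carries the node IP. The helper below is a sketch with a hypothetical name; it assumes the one-line-per-address format printed by ip -o -4 addr show.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# the name of the interface that carries the given IPv4 address.
iface_for_ip() {
  awk -v want="$1" '
    $3 == "inet" {
      split($4, cidr, "/")   # strip the prefix length, e.g. "/24"
      if (cidr[1] == want) { print $2; exit }
    }
  '
}

# Usage on the node:
#   ip -o -4 addr show | iface_for_ip 192.168.10.11
```

In this lab the result should match the interface name used in the HelmChartConfig above.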

We’ll also create a config for CoreDNS to explicitly disable its autoscaler, since we already disabled the autoscaler add-on in the main config and we want to make sure both sides agree.

cat << EOF > /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    autoscaler:
      enabled: false
EOF

Before starting the service, it’s a great idea to open a second terminal and set up some monitoring so you can watch the boot process in real time. In that second terminal, run the following two commands. The first watches the process tree so you can see RKE2’s child processes spawn, and the second tails the journal for the rke2-server unit.

watch -d pstree -a
journalctl -u rke2-server -f

Now, back in your main terminal, let’s enable and start the RKE2 server service. This single command both enables the service to start on boot and starts it immediately. The initial startup takes roughly two minutes as it downloads container images and bootstraps the control plane. After the core components are up, you’ll need to wait another minute or two for CoreDNS pods to become healthy.

systemctl enable --now rke2-server.service
systemctl status rke2-server --no-pager
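
If you are scripting the install rather than watching it, you can poll until the API server answers instead of sleeping for a fixed time. The retry helper below is a generic sketch with a hypothetical name; the timeout and interval in the usage comment are arbitrary choices, not recommended values.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    # Give up once the accumulated sleep time reaches the timeout.
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Usage on the server node, before kubectl is linked into the PATH:
#   wait_until 300 5 /var/lib/rancher/rke2/bin/kubectl \
#     --kubeconfig /etc/rancher/rke2/rke2.yaml get --raw=/readyz
```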

If you flip over to your monitoring terminal, you’ll see the process tree come alive. The rke2 process starts containerd and kubelet, and then kubelet picks up the static pod manifests and launches the etcd, kube-apiserver, kube-controller-manager, and kube-scheduler pods.

This is one of the elegant things about RKE2: the entire control plane boots up through the same static-pod mechanism that kubeadm uses, but it’s all orchestrated by a single RKE2 binary.

Once the service is running, we need to set up our kubeconfig so we can use kubectl. RKE2 writes its kubeconfig to /etc/rancher/rke2/rke2.yaml. Let’s copy it to the standard location.

mkdir -p ~/.kube
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config

RKE2 installs its own set of binaries under /var/lib/rancher/rke2/bin/, including kubectl, crictl, containerd, ctr, and runc. Rather than modifying the PATH, a clean approach is to create symbolic links in /usr/local/bin/ so that these tools are available system-wide.

ln -s /var/lib/rancher/rke2/bin/containerd /usr/local/bin/containerd
ln -s /var/lib/rancher/rke2/bin/kubectl /usr/local/bin/kubectl
ln -s /var/lib/rancher/rke2/bin/crictl /usr/local/bin/crictl
ln -s /var/lib/rancher/rke2/bin/runc /usr/local/bin/runc
ln -s /var/lib/rancher/rke2/bin/ctr /usr/local/bin/ctr
ln -s /var/lib/rancher/rke2/agent/etc/crictl.yaml /etc/crictl.yaml

Let’s also set up shell completion and an alias for kubectl. This might seem minor, but trust me, it saves a lot of typing over the course of a long lab session.

source <(kubectl completion bash)
alias k=kubectl
complete -F __start_kubectl k
echo 'source <(kubectl completion bash)' >> /etc/profile
echo 'alias k=kubectl' >> /etc/profile
echo 'complete -F __start_kubectl k' >> /etc/profile

Now let’s verify everything is working. First, check that the cluster is reachable.

kubectl cluster-info -v=6

You should see “Kubernetes control plane is running at https://192.168.10.11:6443”. Then check the node status and all the pods in the kube-system namespace.

kubectl get node -owide

You’ll see a single node, k8s-node1, in the Ready state with the roles control-plane and etcd, running the version v1.33.7+rke2r3 on Rocky Linux 9.6 with containerd as the container runtime.

If you check the Helm releases, you’ll see that RKE2 automatically deployed Canal, CoreDNS, the metrics server, and runtime classes as Helm charts. This is another nice feature of RKE2: it uses Helm under the hood to manage its add-ons, which gives you a familiar and powerful way to customize them.

helm list -A

And finally, let’s look at all the pods running in the cluster.

kubectl get pod -A

You should see etcd, kube-apiserver, kube-controller-manager, kube-scheduler, and kube-proxy running as static pods on k8s-node1, along with the Canal DaemonSet pod, the CoreDNS deployment pod, and the metrics server. There will also be a few completed Helm install jobs.

Everything should be in a Running or Completed state. At this point, your single-node RKE2 control plane is fully operational.
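
Scanning a long pod list by eye gets tedious, so here is a small filter that prints only the pods that are not in one of those two states. It is a sketch with a hypothetical function name, and it assumes the column order of kubectl get pod -A --no-headers (namespace, name, ready, status).

```shell
# Hypothetical helper: read `kubectl get pod -A --no-headers` output on stdin
# and print any pod whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage:
#   kubectl get pod -A --no-headers | unhealthy_pods
```

Empty output means every pod is either Running or Completed.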

Joining an Agent (Worker Node)

Now that our server node is up, let’s add a worker. The first thing we need is the join token. RKE2 generates this token during the server bootstrap and stores it on disk. On k8s-node1, retrieve it like this:

cat /var/lib/rancher/rke2/server/node-token

You’ll get a long string that looks something like K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749. Copy this value because we’ll need it on the worker node.
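
Because the token is long, it is easy to lose part of it when copying. A rough shape check like the one below can catch a truncated paste. It is a sketch based only on the sample token above (a K10 prefix, a double colon, then a name and a secret separated by a colon); the function name is hypothetical and the check says nothing about whether the token is actually valid.

```shell
# Hypothetical helper: succeed if the value has the same overall shape as the
# sample token above. This checks shape only, not validity.
looks_like_join_token() {
  case "$1" in
    K10?*::?*:?*) return 0 ;;
    *)            return 1 ;;
  esac
}

# Usage on the worker, after setting TOKEN:
#   looks_like_join_token "$TOKEN" || echo "token looks truncated" >&2
```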

You can also verify that the supervisor port is listening. This is port 9345, and it’s the dedicated management and bootstrap API that new nodes use to join the cluster.

ss -tnlp | grep 9345

Before switching to the worker node, let’s set up a monitoring loop on k8s-node1 so we can watch the new node appear in real time.

watch -d 'kubectl get node; echo; kubectl get pod -n kube-system'

Now SSH into k8s-node2 and let’s get the agent installed. The process is very similar to the server installation, but we pass INSTALL_RKE2_TYPE="agent" to tell the installer we want the agent role, not the server role.

curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" INSTALL_RKE2_CHANNEL=v1.33 sh -

Next, we create the agent configuration file. This is much simpler than the server config. It only needs two things: the URL of the server’s supervisor API (on port 9345) and the join token we copied earlier.

TOKEN=K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749

mkdir -p /etc/rancher/rke2/
cat << EOF > /etc/rancher/rke2/config.yaml
server: https://192.168.10.11:9345
token: $TOKEN
EOF

Then, just like the server, we enable and start the agent service. You can also tail the journal to watch the join process.

systemctl enable --now rke2-agent.service
journalctl -u rke2-agent -f

Flip back to your monitoring terminal on k8s-node1 and within a few seconds you’ll see k8s-node2 appear as a new node. Let’s verify the cluster now has two nodes.

kubectl get node -owide

You should see both k8s-node1 (with roles control-plane and etcd) and k8s-node2 (with no specific role label, meaning it’s a pure worker). Both should be in the Ready state. If you check the pods in kube-system filtered by k8s-node2, you’ll see that only kube-proxy and the Canal DaemonSet pod were automatically deployed there, which is exactly what you’d expect for a worker node.

kubectl get pod -n kube-system -owide | grep k8s-node2
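
If you would rather assert the node count in a script, a counter like this one could do it. It is a sketch with a hypothetical function name, and it deliberately counts only nodes whose STATUS is exactly Ready, so a cordoned node (Ready,SchedulingDisabled) is not counted.

```shell
# Hypothetical helper: read `kubectl get node --no-headers` output on stdin
# and print how many nodes report exactly "Ready".
ready_node_count() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage:
#   [ "$(kubectl get node --no-headers | ready_node_count)" -eq 2 ] && echo "both nodes Ready"
```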

And that’s it. Two commands on the server (install + start), two commands on the agent (install + start), and a tiny config file on each side. The cluster is fully operational with a control plane and a worker node.

Deploying a Sample App

With our two-node cluster ready, let’s deploy a simple application to prove that everything is working end to end, including pod scheduling, networking, and service exposure.

We’re going to deploy a simple HTTP echo server (traefik/whoami) as a Deployment with two replicas, fronted by a NodePort Service on port 30000. We also add a pod anti-affinity rule to ensure the two replicas land on different nodes, which lets us verify that cross-node networking works.

cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30000
  type: NodePort
EOF

Let’s check on the deployed resources to make sure everything is healthy.

kubectl get deploy,pod,svc,ep -owide

You should see the Deployment with 2/2 replicas available, two pods in the Running state (one on k8s-node1 and one on k8s-node2), the Service of type NodePort with port 30000 mapped, and the Endpoints object listing both pod IPs.
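
To confirm the spread without reading the table, you can count the distinct nodes the replicas landed on. The helper is a sketch with a hypothetical name; the usage line relies on the app=webpod label from the manifest above.

```shell
# Hypothetical helper: read one node name per line on stdin and print the
# number of distinct, non-empty names.
distinct_nodes() {
  sort -u | awk 'NF { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get pod -l app=webpod --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | distinct_nodes
```

With two replicas and a working anti-affinity rule, the count should equal the number of replicas.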

Now let’s hit the service from the host machine to verify connectivity. We’ll use a loop that calls the NodePort on k8s-node2 and prints the hostname returned by each response.

while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done

If everything is configured correctly, you’ll see responses from both pod hostnames, though not necessarily in strict alternation. This confirms that the NodePort Service is load-balancing traffic across both pods, and that the Canal overlay network is successfully routing packets between nodes.
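
The endless loop is handy for watching, but a bounded run with a tally makes the distribution easier to read. This is a minimal sketch that assumes the same NodePort address as above. The helper name tally_hostnames is hypothetical, and the request count of 20 in the usage comment is arbitrary.

```shell
# Hypothetical helper: read whoami responses on stdin and
# print one "<pod-hostname> <count>" line per pod.
tally_hostnames() {
  grep '^Hostname:' | sort | uniq -c | awk '{ print $3, $1 }'
}

# Example run against the lab cluster (needs network access to the node):
#   for i in $(seq 1 20); do curl -s http://192.168.10.12:30000; done | tally_hostnames
```

Seeing two hostnames with non-zero counts is the signal to look for. The counts themselves will typically be close but not identical.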

One interesting thing to notice here is that pods got scheduled on both nodes, including the control plane node. If you check the taints on k8s-node1, you’ll find that there are none.

kubectl describe node k8s-node1 | grep -i taints

This is a deliberate RKE2 design choice. Unlike kubeadm, which applies a NoSchedule taint to control plane nodes by default, RKE2 leaves them untainted so that workloads can run on all nodes out of the box. This is especially useful for smaller clusters or edge deployments where you want to maximize resource utilization. Of course, in a production environment with dedicated control plane nodes, you’d want to add taints yourself to keep workloads off the control plane.
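
If you do want kubeadm-style behavior on a dedicated control plane node, one option is to add the taint by hand. The sketch below wraps the standard kubectl taint command in two helpers. The function names are made up for this example, and the taint key mirrors the one kubeadm applies.

```shell
# Hypothetical helper: keep regular workloads off a control plane node by adding
# the same NoSchedule taint that kubeadm applies by default.
taint_control_plane() {
  kubectl taint nodes "$1" node-role.kubernetes.io/control-plane:NoSchedule
}

# Hypothetical helper: remove that taint again (note the trailing "-").
untaint_control_plane() {
  kubectl taint nodes "$1" node-role.kubernetes.io/control-plane:NoSchedule-
}
```

Calling taint_control_plane k8s-node1 only affects future scheduling decisions; pods already running on the node stay where they are. RKE2's config file also offers a node-taint option if you would rather set taints when the node registers.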

At this point, we have a fully functional two-node RKE2 cluster running a sample application with verified cross-node networking. The installation was remarkably straightforward: a single binary, a simple config file, one systemd command on each node, and we were up and running.

RKE2 Day-2 Operations

Once your RKE2 cluster is up and running with a server node and an agent node joined, the real work begins.

Day-2 operations are about keeping your cluster healthy, secure, and up to date over time. In this section, we will walk through three essential tasks that every RKE2 operator needs to know: managing and rotating TLS certificates, performing a manual version upgrade from v1.33 to v1.34, and setting up fully automated upgrades using the Rancher System Upgrade Controller to take the cluster all the way to v1.35. We will go step by step, show every command, and explain what is happening behind the scenes so you can confidently apply these procedures to your own clusters.

Certificate Management and Rotation

Every component in your RKE2 cluster communicates over TLS. The API server, etcd, the kubelet, the scheduler, the controller manager — they all rely on certificates to authenticate and encrypt traffic between each other. RKE2 issues these certificates automatically during installation, and each client or server certificate is valid for 365 days from the date it was issued. The Certificate Authority (CA) certificates that sign them have a much longer lifetime of 10 years.

RKE2 has a built-in mechanism for certificate renewal. Every time the rke2-server or rke2-agent service starts, it checks the expiration dates of all certificates. If any certificate is expired or within 120 days of expiring, RKE2 will automatically renew it during startup. This renewal process reuses the existing private keys, so the certificate identity stays the same and only the validity period gets extended. Additionally, when a certificate’s expiration date falls within 120 days, a Kubernetes Warning event with the reason CertificateExpirationWarning is emitted, so you can be aware of upcoming expirations even if you have not manually checked.
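
To make the renewal window concrete, here is a small arithmetic sketch. It is not RKE2's actual implementation, only the rule described above written as shell. Both helper names are hypothetical, timestamps are Unix epoch seconds, and treating exactly 120 days as "inside the window" is an assumption.

```shell
# Hypothetical helper: whole days between "now" and a certificate's expiry,
# both given as Unix epoch seconds.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Hypothetical helper: succeed when the certificate is expired or inside the
# renewal window (120 days by default; the inclusive boundary is an assumption).
# Usage: renewal_due <expiry-epoch> <now-epoch> [window-days]
renewal_due() {
  [ "$(days_left "$1" "$2")" -le "${3:-120}" ]
}
```

A freshly issued 365-day certificate is outside the window, so restarting the service leaves it alone. Roughly 245 days later it crosses the threshold and the next restart would renew it.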

Let us start by inspecting the certificates on both the server and agent nodes to understand what is installed and when everything expires.

On the server node (k8s-node1), run the following command to display all certificates in a table format:

rke2 certificate check --output table

You will see output like this, showing every certificate file, its subject, usage type, expiration date, residual time, and current status:

FILENAME                           SUBJECT                             USAGES                  EXPIRES                  RESIDUAL TIME   STATUS
--------                           -------                             ------                  -------                  -------------   ------
client-kube-apiserver.crt          system:apiserver                    ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-kube-apiserver.crt          rke2-client-ca@1770647369           CertSign                Feb 07, 2036 14:29 UTC   10 years        OK
serving-kube-apiserver.crt         kube-apiserver                      ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK
serving-kube-apiserver.crt         rke2-server-ca@1770647369           CertSign                Feb 07, 2036 14:29 UTC   10 years        OK
client-rke2-cloud-controller.crt   rke2-cloud-controller-manager       ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-scheduler.crt               system:kube-scheduler               ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
kube-scheduler.crt                 kube-scheduler                      ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-kube-proxy.crt              system:kube-proxy                   ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-kubelet.crt                 system:node:k8s-node1               ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
serving-kubelet.crt                k8s-node1                           ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-rke2-controller.crt         system:rke2-controller              ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-admin.crt                   system:admin                        ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-auth-proxy.crt              system:auth-proxy                   ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client-controller.crt              system:kube-controller-manager      ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
kube-controller-manager.crt        kube-controller-manager             ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK
client.crt                         etcd-client                         ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK
server-client.crt                  etcd-server                         ServerAuth,ClientAuth   Feb 09, 2027 14:29 UTC   1 year          OK
peer-server-client.crt             etcd-peer                           ServerAuth,ClientAuth   Feb 09, 2027 14:29 UTC   1 year          OK
client-supervisor.crt              system:rke2-supervisor              ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK

Notice how the server node has a large number of certificates. You have client authentication certificates for every major component (API server, scheduler, controller manager, kubelet, kube-proxy, etcd, and the RKE2 supervisor), as well as server authentication certificates for the API server, scheduler, controller manager, kubelet, and etcd.

Each leaf certificate expires in 1 year, while the CA certificates that signed them expire in 10 years.

On the agent node (k8s-node2), the certificate inventory is much smaller because the agent only needs certificates for its own kubelet, kube-proxy, and the RKE2 controller:

rke2 certificate check --output table

FILENAME                     SUBJECT                     USAGES       EXPIRES                  RESIDUAL TIME   STATUS
--------                     -------                     ------       -------                  -------------   ------
client-kubelet.crt           system:node:k8s-node2       ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK
serving-kubelet.crt          k8s-node2                   ServerAuth   Feb 14, 2027 14:21 UTC   1 year          OK
client-rke2-controller.crt   system:rke2-controller      ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK
client-kube-proxy.crt        system:kube-proxy           ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK

Now, there are situations where you might want to rotate certificates manually rather than waiting for the automatic renewal window. Perhaps you suspect a certificate has been compromised, or you are performing a planned security rotation as part of your organization’s compliance requirements. RKE2 provides a straightforward command for this.

The process requires three steps: stop the RKE2 server, rotate the certificates, and then start the server again. Here is the full procedure on the server node (k8s-node1):

First, stop the RKE2 server service:

systemctl stop rke2-server

Next, run the certificate rotation command:

rke2 certificate rotate

You will see output confirming that every certificate is being rotated:

INFO[0000] Server detected, rotating agent and server certificates
INFO[0000] Rotating dynamic listener certificate
INFO[0000] Rotating certificates for admin
INFO[0000] Rotating certificates for controller-manager
INFO[0000] Rotating certificates for kube-proxy
INFO[0000] Rotating certificates for kubelet
INFO[0000] Rotating certificates for rke2-controller
INFO[0000] Rotating certificates for api-server
INFO[0000] Rotating certificates for auth-proxy
INFO[0000] Rotating certificates for cloud-controller
INFO[0000] Rotating certificates for etcd
INFO[0000] Rotating certificates for scheduler
INFO[0000] Rotating certificates for supervisor
INFO[0000] Successfully backed up certificates to /var/lib/rancher/rke2/server/tls-1770651290, please restart rke2 server or agent to rotate certificates

Notice that RKE2 automatically backs up the old certificates to a timestamped directory under /var/lib/rancher/rke2/server/. This is a nice safety net — if anything goes wrong, you can restore the previous certificates.

You can verify the new certificate expiration dates before restarting:

rke2 certificate check --output table

Now start the server back up:

systemctl start rke2-server

After the server is running, verify everything looks good once more:

rke2 certificate check --output table

There is one important thing to remember after certificate rotation: the kubeconfig file at /etc/rancher/rke2/rke2.yaml gets regenerated with the new certificate data.

If you previously copied this file to ~/.kube/config, you need to update your copy. You can check if they differ and then copy the updated file:

diff /etc/rancher/rke2/rke2.yaml ~/.kube/config
yes | cp /etc/rancher/rke2/rke2.yaml ~/.kube/config ; echo
kubectl cluster-info
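
If you rotate certificates regularly, the compare-and-copy step can be folded into one helper that only touches your kubeconfig when it actually changed. A minimal sketch: the function name sync_kubeconfig is hypothetical, and the default paths are the ones used in this post.

```shell
# Hypothetical helper: copy the RKE2 kubeconfig over the user's copy only if they differ.
# Usage: sync_kubeconfig [source] [destination]
sync_kubeconfig() {
  src=${1:-/etc/rancher/rke2/rke2.yaml}
  dst=${2:-$HOME/.kube/config}
  if cmp -s "$src" "$dst" 2>/dev/null; then
    echo "unchanged"
  else
    mkdir -p "$(dirname "$dst")"
    # Keep the copy private: it embeds the admin client certificate and key.
    cp -f "$src" "$dst" && chmod 600 "$dst" && echo "updated"
  fi
}
```

Running sync_kubeconfig with no arguments after a rotation should print updated once and unchanged on every later call until the next rotation.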

A great thing about RKE2’s certificate rotation is that you do not need to do anything special on the worker nodes. When the rke2-agent service on a worker node detects that the connection to the server has been interrupted (because the server’s certificates changed), it will automatically reconnect and receive new certificates from the server. This behavior is very similar to how kubeadm-based clusters handle certificate renewal on worker nodes — the worker simply re-bootstraps its credentials upon reconnection.

Manual Version Upgrade (v1.33 to v1.34)

Upgrading RKE2 manually is refreshingly simple compared to many other Kubernetes distributions. The process involves running the same installation script you used initially, but pointing it to a newer version channel. RKE2 will upgrade the binaries and restart the control plane components, often with minimal disruption to running workloads.

Before starting, let us set up some monitoring so we can observe what happens during the upgrade. Open several terminal windows on the server node and run these commands:

In one terminal, continuously curl your sample application to verify it stays accessible throughout the upgrade:

while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done

In another terminal, watch the kube-system pods to see which ones get recreated:

watch -d "kubectl get pod -n kube-system -owide --sort-by=.metadata.creationTimestamp | tac"

In a third terminal, watch the node status:

watch -d "kubectl get node"

And optionally, if you want to monitor etcd cluster health during the upgrade:

watch -d etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/client.key \
  member list --write-out=table

Now confirm your current version:

rke2 --version

rke2 version v1.33.7+rke2r3 (7e4fd1a82edf497cab91c220144619bbad659cf4)
go version go1.24.11 X:boringcrypto

You can also check what versions are available across different release channels by querying the RKE2 update API:

curl -s https://update.rke2.io/v1-release/channels | jq .data

This returns a JSON array with all available channels including stable, latest, and version-specific channels like v1.34 and v1.35. Each entry shows the latest version available in that channel.

Now let us upgrade the server node. On k8s-node1, run the RKE2 installer with the v1.34 channel specified:

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.34 sh -

The installer will download and install the new RPM packages. You will see the transaction output showing the old v1.33 packages being replaced by v1.34:

Running transaction
  Preparing        :                                                        1/1
  Upgrading        : rke2-common-1.34.3~rke2r3-0.el9.aarch64               1/4
  Upgrading        : rke2-server-1.34.3~rke2r3-0.el9.aarch64               2/4
  Running scriptlet: rke2-server-1.34.3~rke2r3-0.el9.aarch64               2/4
  Running scriptlet: rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4
  Cleanup          : rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4
  Running scriptlet: rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4
  Running scriptlet: rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4
  Cleanup          : rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4
  Running scriptlet: rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4
  Verifying        : rke2-common-1.34.3~rke2r3-0.el9.aarch64               1/4
  Verifying        : rke2-common-1.33.7~rke2r3-0.el9.aarch64               2/4
  Verifying        : rke2-server-1.34.3~rke2r3-0.el9.aarch64               3/4
  Verifying        : rke2-server-1.33.7~rke2r3-0.el9.aarch64               4/4

What is remarkable here is that just running the installer script is enough to trigger the upgrade of the control plane components. During the scriptlet phase, the RKE2 process detects the new binaries and recreates the static pod manifests, which causes the kubelet to restart the control plane pods.

If you are watching your monitoring terminals, you will see the etcd, kube-apiserver, and kube-proxy pods get recreated first, followed by the scheduler and controller manager. And throughout this entire process, if you are watching your application curl loop, you should see that application traffic continues to flow without interruption.

Verify the new version:

rke2 --version

rke2 version v1.34.3+rke2r3 (7598946e0086a9131564ccbb3c142b3fa54516ad)
go version go1.24.11 X:boringcrypto

You can also confirm the yum repository was updated to point to the v1.34 channel:

dnf repolist

rancher-rke2-1.34-stable       Rancher RKE2 1.34 (v1.34)
rancher-rke2-common-stable     Rancher RKE2 Common (v1.34)

To see which container images each kube-system pod is running after the upgrade:

kubectl get pods -n kube-system \
  -o custom-columns=\
POD:.metadata.name,\
CONTAINERS:.spec.containers[*].name,\
IMAGES:.spec.containers[*].image

Now restart the RKE2 server service to ensure everything is fully reconciled:

systemctl restart rke2-server

Check the node versions to see the server is upgraded but the worker is still on the old version:

kubectl get node -owide

NAME        STATUS   ROLES                       AGE     VERSION          INTERNAL-IP     ...
k8s-node1   Ready    control-plane,etcd,master   18m     v1.34.3+rke2r3   192.168.10.11   ...
k8s-node2   Ready    <none>                      7m23s   v1.33.7+rke2r3   192.168.10.12   ...

This is the expected intermediate state. The Kubernetes version skew policy allows a kubelet to be up to three minor versions older than the API server (but never newer), so having the server on v1.34 and the worker on v1.33 is perfectly fine temporarily.
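
The skew rule is easy to check mechanically from the version strings that kubectl reports. The sketch below is illustrative only. Both helper names are made up, it assumes the two versions share the same major number, and the maximum skew is a parameter whose default of 3 reflects the upstream policy for current Kubernetes releases.

```shell
# Hypothetical helper: extract the minor number from a version like v1.34.3+rke2r3.
minor_of() {
  v=${1#v}         # strip the leading "v"  -> 1.34.3+rke2r3
  v=${v#*.}        # drop the major number  -> 34.3+rke2r3
  echo "${v%%.*}"  # keep only the minor    -> 34
}

# Hypothetical helper: succeed when the node is not newer than the control plane
# and is at most <max-skew> minor versions behind it.
# Usage: skew_ok <control-plane-version> <node-version> [max-skew]
skew_ok() {
  skew=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$skew" -ge 0 ] && [ "$skew" -le "${3:-3}" ]
}
```

For the intermediate state above, skew_ok v1.34.3+rke2r3 v1.33.7+rke2r3 succeeds, while swapping the arguments fails because a worker must never be newer than the control plane.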

Now let us upgrade the worker node. Switch to k8s-node2 and run the installer, this time specifying the agent type:

rke2 --version
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh -

Verify the new version on the worker:

rke2 --version

rke2 version v1.34.3+rke2r3 (7598946e0086a9131564ccbb3c142b3fa54516ad)
go version go1.24.11 X:boringcrypto

Confirm the repository was updated:

dnf repolist

rancher-rke2-1.34-stable       Rancher RKE2 1.34 (v1.34)
rancher-rke2-common-stable     Rancher RKE2 Common (v1.34)

Restart the agent service:

systemctl restart rke2-agent

Now go back to the server node (k8s-node1) and verify that both nodes are running the same version:

kubectl get node -owide

NAME        STATUS   ROLES                       AGE   VERSION          INTERNAL-IP     ...
k8s-node1   Ready    control-plane,etcd,master   22m   v1.34.3+rke2r3   192.168.10.11   ...
k8s-node2   Ready    <none>                      11m   v1.34.3+rke2r3   192.168.10.12   ...

If you check the kube-system pods sorted by creation time, you will notice that only the kube-proxy pod on the worker node was recreated during the worker upgrade. The rest of the worker node pods (like canal) continued running unchanged:

kubectl get pod -n kube-system --sort-by=.metadata.creationTimestamp | tac

The entire manual upgrade process is remarkably smooth. The key takeaway is the order of operations: always upgrade the server (control plane) first, then upgrade the agent (worker) nodes one by one. This ensures the control plane is always at the same or newer version than the workers, which is a fundamental Kubernetes requirement.
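
When scripting this order of operations, it helps to block until one node reports the new version before touching the next. Here is a sketch of such a gate. The helper name, the default of 60 attempts, and the 5-second interval are all assumptions for illustration; the jsonpath field is the kubelet version that kubectl get node shows in its VERSION column.

```shell
# Hypothetical helper: poll until a node's kubelet reports the expected version.
# Usage: wait_for_version <node> <version> [attempts] [interval-seconds]
wait_for_version() {
  node=$1; want=$2; attempts=${3:-60}; interval=${4:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    got=$(kubectl get node "$node" \
      -o jsonpath='{.status.nodeInfo.kubeletVersion}' 2>/dev/null)
    if [ "$got" = "$want" ]; then
      echo "$node is on $want"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$node did not reach $want (last seen: ${got:-unknown})" >&2
  return 1
}
```

In this lab you might run wait_for_version k8s-node1 v1.34.3+rke2r3 after upgrading the server, and only start the agent upgrade on k8s-node2 once it returns successfully.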

Automated Upgrade with System Upgrade Controller

Manual upgrades work well for small clusters, but when you are managing many nodes or want to ensure upgrades happen reliably without human intervention, the Rancher System Upgrade Controller is the way to go.

This controller watches for custom Plan resources that declaratively specify which nodes should be upgraded, to which version, and in what order. It then creates Jobs that run on each target node to perform the actual upgrade.

Before we begin, let us set up our monitoring terminals again. On the server node (k8s-node1):

Watch the upgrade plans and their associated jobs and pods:

watch -d 'kubectl -n system-upgrade get plans -o wide; echo ; kubectl -n system-upgrade get jobs,pods'

Watch the node versions:

watch -d "kubectl get node"

Watch the kube-system pods:

watch -d "kubectl get pod -n kube-system -owide --sort-by=.metadata.creationTimestamp | tac"

And keep your application health check running:

while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done

The first step is to install the System Upgrade Controller itself. It consists of a CRD (the Plan resource definition) and a controller deployment that watches for Plan resources:

kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml \
  -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml

This creates several resources: a system-upgrade namespace, a ServiceAccount, RBAC roles and bindings (including a special “drainer” ClusterRole that allows the controller to cordon and drain nodes), a ConfigMap with default environment variables, and the controller Deployment itself.

Verify the installation:

kubectl get deploy,pod,cm -n system-upgrade

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/system-upgrade-controller   1/1     1            1           12m

NAME                                             READY   STATUS    RESTARTS   AGE
pod/system-upgrade-controller-6f9f9b8cf4-46n82   1/1     Running   0          12m

NAME                               DATA   AGE
configmap/default-controller-env   10     12m

Also confirm the CRD was registered:

kubectl get crd | grep upgrade

plans.upgrade.cattle.io   2026-02-09T18:08:08Z

You can tail the controller logs to watch what happens in real time:

kubectl logs -n system-upgrade -l app.kubernetes.io/name=system-upgrade-controller -f

Now comes the interesting part — defining the upgrade plans. We will create two Plan resources: one for server nodes (control plane) and one for agent nodes (workers). The key design here is that the agent plan has a prepare step that references the server plan. This ensures that all server nodes are upgraded before any agent node begins its upgrade, which respects the Kubernetes version skew policy.

cat << EOF | kubectl apply -f -
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  channel: https://update.rke2.io/v1-release/channels/latest
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  channel: https://update.rke2.io/v1-release/channels/latest
EOF

Let us break down what each field does. The concurrency field is set to 1, meaning only one node will be upgraded at a time. This is the safest approach — you can increase it if you have a large cluster and want faster rollouts, but for most on-premise environments, upgrading one node at a time is the prudent choice. The cordon field is set to true, which means each node will be cordoned (marked as unschedulable) before the upgrade begins, preventing new pods from being scheduled on it during the process.

The nodeSelector is what differentiates the two plans. The server plan targets nodes that have the node-role.kubernetes.io/control-plane label set to “true”, while the agent plan targets nodes where that label does not exist at all. This cleanly separates control plane nodes from worker nodes.

The channel field points to the RKE2 update API, specifically the “latest” channel. The controller will query this URL to determine what the latest available version is. You could alternatively use a version field with a specific version string like “v1.35.0+rke2r3” if you want to pin to an exact version rather than following the latest channel.
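
For comparison, here is what a pinned variant of the server plan could look like. This sketch only writes the manifest to a local file so you can review it first. The file name and the plan name server-plan-pinned are made up for this example, and the version string is the one quoted above. It is meant as an alternative to the channel-based server-plan, not an addition to it.

```shell
# Write a server Plan that pins one release instead of following a channel.
# File name and plan name are hypothetical; adjust the version to the release you want.
cat > server-plan-pinned.yaml << 'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan-pinned
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.35.0+rke2r3
EOF
```

After reviewing the file, kubectl apply -f server-plan-pinned.yaml would submit it. If you rename the server plan like this, remember that the agent plan's prepare arguments refer to the server plan by name and would need the same change.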

The prepare section in the agent plan is crucial for ordering. It tells the controller to run a preparation step using the rancher/rke2-upgrade image with the argument “prepare server-plan”. This step will wait until the server-plan has completed successfully on all matching nodes before proceeding with the agent upgrade. This is how the controller enforces the correct upgrade order.

As soon as you apply these Plan resources, the controller springs into action. It resolves the latest version from the channel URL, creates Jobs for each target node, and begins the upgrade process. You can watch the progress by checking the plan status:

kubectl -n system-upgrade get plans -o wide

NAME          IMAGE                  CHANNEL                                             VERSION   COMPLETE   MESSAGE   APPLYING
agent-plan    rancher/rke2-upgrade   https://update.rke2.io/v1-release/channels/latest             True
server-plan   rancher/rke2-upgrade   https://update.rke2.io/v1-release/channels/latest             True

Check the jobs that were created:

kubectl -n system-upgrade get jobs

NAME                                                              STATUS     COMPLETIONS   DURATION   AGE
apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-7dc30   Complete   1/1           2m5s       4m16s
apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-28ad1   Complete   1/1           55s        4m16s

And the pods that performed the actual upgrades:

kubectl get pod -n system-upgrade -owide

NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE        ...
apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-54frb   0/1     Unknown     0          4m30s   192.168.10.12   k8s-node2   ...
apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-jrsnq   0/1     Completed   0          2m37s   192.168.10.12   k8s-node2   ...
apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-xg4jl   0/1     Unknown     0          4m30s   192.168.10.11   k8s-node1   ...
apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-msrt2   0/1     Completed   0          3m43s   192.168.10.11   k8s-node1   ...
system-upgrade-controller-6f9f9b8cf4-zksgq                        1/1     Running     0          8m20s   10.42.0.8       k8s-node1   ...

You might notice some pods with an “Unknown” status. These are the initial pods that were running on the node when it was being restarted during the upgrade — they lost connection and could not report their final status. The “Completed” pods are the ones that actually finished the upgrade successfully on a second attempt.

One very important thing to understand about these upgrade pods is the level of access they require. Because they need to modify the host operating system (installing new RKE2 binaries, restarting systemd services), they run with significant privileges. You can inspect this:

kubectl describe pod -n system-upgrade | grep ^Volumes: -A4

Volumes:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory

The upgrade pods mount the entire host root filesystem at /host with read and write permissions. They also use the host IPC, NET, and PID namespaces, and have the CAP_SYS_BOOT capability. This level of access is necessary for the upgrade to work, but it is something you should be aware of from a security perspective, especially in production environments. Make sure you review and understand these permissions before deploying the System Upgrade Controller in a sensitive environment.

You can also check the controller logs to see the full sequence of events:

kubectl logs -n system-upgrade -l app.kubernetes.io/name=system-upgrade-controller

object="system-upgrade/server-plan" ... type="Normal" reason="Resolved" message="Resolved latest version from Spec.Channel: v1.35.0-rke2r3"
object="system-upgrade/server-plan" ... type="Normal" reason="SyncJob" message="Jobs synced for version v1.35.0-rke2r3 on Nodes k8s-node1. Hash: db1bffd09b601fca4c7c067d987c4d368f9237f8219289438d8678e8"
object="system-upgrade/server-plan" ... type="Normal" reason="JobComplete" message="Job completed on Node k8s-node1"
object="system-upgrade/server-plan" ... type="Normal" reason="Complete" message="Jobs complete for version v1.35.0-rke2r3. Hash: db1bffd09b601fca4c7c067d987c4d368f9237f8219289438d8678e8"

The excerpt above shows the server-plan sequence: the controller resolved the latest version as v1.35.0-rke2r3, synced an upgrade job to the server node, and waited for it to complete. The agent-plan entries follow the same pattern once the server plan has finished.

Finally, verify that both nodes are now running the latest version:

kubectl get node -owide

NAME        STATUS   ROLES                       AGE    VERSION          INTERNAL-IP     ...
k8s-node1   Ready    control-plane,etcd,master   108m   v1.35.0+rke2r3   192.168.10.11   ...
k8s-node2   Ready    <none>                      97m    v1.35.0+rke2r3   192.168.10.12   ...

Both nodes are now on v1.35.0. The automated upgrade completed successfully, upgrading the server first and then the agent, exactly as specified in the Plan resources.

The beauty of this approach is that it is entirely declarative. You define your desired state (which version, which nodes, in what order), and the controller makes it happen. If you add new nodes to the cluster later, the controller will automatically apply the upgrade plan to them as well, as long as they match the nodeSelector. This makes the System Upgrade Controller an excellent choice for clusters that need to stay current with minimal operational overhead.

Cluster API Overview: Management vs. Workload Clusters

Before we continue the hands-on part, let’s take a step back and understand what Cluster API actually is and why it exists. If you’ve ever managed more than a handful of Kubernetes clusters, you know the pain. Every cluster needs to be provisioned, configured, upgraded, and eventually decommissioned. Doing all of this manually or with a patchwork of scripts quickly becomes a nightmare as the number of clusters grows. Cluster API, often abbreviated as CAPI, was born out of this exact problem.

At its core, Cluster API is a Kubernetes-native tool that lets you create, configure, and manage Kubernetes clusters using Kubernetes itself. Yes, you read that right. You use Kubernetes custom resources to describe what your cluster should look like, and a set of controllers running inside a Kubernetes cluster takes care of making that a reality. Think of it as applying the declarative, reconciliation-driven model that Kubernetes uses for Pods and Deployments, but now applied to entire clusters, nodes, and control planes.

This brings us to the two most important concepts you need to understand before going any further: the Management Cluster and the Workload Cluster.

The Management Cluster is the Kubernetes cluster where Cluster API components are installed and running. It hosts all the CAPI controllers, the custom resource definitions (CRDs), and the provider-specific controllers. This is the cluster that watches your declarative cluster definitions and does the heavy lifting of provisioning infrastructure, bootstrapping nodes, and orchestrating upgrades. You can think of it as the “brain” that controls everything. In our hands-on lab, we use a simple KinD (Kubernetes in Docker) cluster as the management cluster, but in a real production environment, this would typically be a dedicated, highly available cluster that you treat with extra care.

The Workload Cluster, on the other hand, is the cluster that gets created and managed by the Management Cluster. This is where your actual applications run. You don’t install Cluster API components on workload clusters. Instead, the management cluster provisions them from scratch, including spinning up the underlying infrastructure (VMs, containers, or bare metal), bootstrapping each node with kubeadm, forming the control plane, and joining worker nodes. From the workload cluster’s perspective, it’s just a normal Kubernetes cluster. It has no idea that it was created and is being managed by another cluster.

Now let’s talk about how this is structured internally. Cluster API is designed around a pluggable provider model, and there are four types of providers that work together to make everything happen.

The first is the Core Provider, which lives in the capi-system namespace. This is the main Cluster API controller that manages the fundamental CRDs like Cluster, Machine, MachineSet, and MachineDeployment. It orchestrates the overall reconciliation loop. When you check the installed providers in our lab, you'll see it like this:

kubectl get providers.clusterctl.cluster.x-k8s.io -A

NAMESPACE                           NAME                    TYPE                     PROVIDER      VERSION
capi-system                         cluster-api             CoreProvider             cluster-api   v1.12.2

The second is the Bootstrap Provider. In our case, this is the Kubeadm Bootstrap Provider, running in the capi-kubeadm-bootstrap-system namespace. Its job is to generate the cloud-init or user-data configuration that each node needs to bootstrap itself into a functioning Kubernetes node. It creates the kubeadm init and kubeadm join configurations so that control plane nodes and worker nodes know how to set themselves up. You can verify it's running:

kubectl get providers -n capi-kubeadm-bootstrap-system bootstrap-kubeadm -o yaml

providerName: kubeadm
type: BootstrapProvider
version: v1.12.2

The third is the Control Plane Provider. This one, running in capi-kubeadm-control-plane-system, is specifically responsible for managing the lifecycle of control plane nodes. It handles the KubeadmControlPlane resource, which means it takes care of scaling the control plane up or down, rolling out upgrades to control plane nodes one at a time, and making sure etcd membership is correctly managed throughout. This is what keeps the API server available while the control plane is being upgraded. You can inspect its provider entry the same way:

kubectl get providers -n capi-kubeadm-control-plane-system control-plane-kubeadm -o yaml

providerName: kubeadm
type: ControlPlaneProvider
version: v1.12.2

The fourth and final piece is the Infrastructure Provider. This is where things get environment-specific. The infrastructure provider is responsible for actually creating the underlying resources, whether that’s VMs on vSphere, instances on AWS, or in our lab’s case, Docker containers that simulate nodes. Our lab uses the Docker provider (CAPD), which runs in the capd-system namespace. The Docker provider is intended for development and testing only, not for production. But it's perfect for learning because it lets you see the full Cluster API workflow without needing access to a cloud or hypervisor. Its provider entry looks like this:

kubectl get providers -n capd-system infrastructure-docker -o yaml

providerName: docker
type: InfrastructureProvider
version: v1.12.2

When you put all four providers together and look at the pods running across the management cluster, you get a clear picture of the architecture:

kubectl get pod -A

NAMESPACE                           NAME                                                            READY   STATUS
capd-system                         capd-controller-manager-7c9d67ffdf-7npsd                        1/1     Running
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-bd5f89bbd-9c9ng       1/1     Running
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55c48d9b5-bckj5   1/1     Running
capi-system                         capi-controller-manager-6cc7b949c4-rmd7h                        1/1     Running
cert-manager                        cert-manager-598d877b78-9lkmd                                   1/1     Running
cert-manager                        cert-manager-cainjector-6b5777d564-7mfzz                        1/1     Running
cert-manager                        cert-manager-webhook-5d9fc6b4ff-slscg                           1/1     Running

You’ll also notice cert-manager is installed. clusterctl init installs it automatically when it is not already present, because the Cluster API controllers rely on it to issue the TLS certificates for their webhooks.

Another concept worth understanding is ClusterClass, which was introduced as an experimental feature and is what we use in our lab. ClusterClass lets you define a reusable “template” or “blueprint” for clusters. Instead of specifying every detail for each individual cluster, you define a ClusterClass once, covering things like what machine templates to use, how the control plane should be configured, and what the worker node pools look like. Then, when you create an actual Cluster resource, you simply reference the ClusterClass and provide a minimal set of overrides like the cluster name, Kubernetes version, and replica counts. To enable it, export the CLUSTER_TOPOLOGY environment variable as true before initializing the management cluster; clusterctl uses that value to set the ClusterTopology feature gate on the controllers:

export CLUSTER_TOPOLOGY=true
clusterctl init --infrastructure docker

And when you look at the feature gates that the core controller is running with, you can confirm it:

kubectl describe -n capi-system deployment.apps/capi-controller-manager | grep feature-gates

--feature-gates=MachinePool=true,ClusterTopology=true,RuntimeSDK=false,...
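
If you would rather check a single gate from a script than read the flag by eye, a small helper can split the value for you. This is only a minimal sketch: gate_enabled is a hypothetical helper name, not part of kubectl or clusterctl, and it assumes the flag line looks like the output above.

```shell
# gate_enabled GATE LINE
# Succeeds (exit 0) when "GATE=true" appears in the --feature-gates value of LINE.
# Hypothetical helper for illustration; not part of kubectl or clusterctl.
gate_enabled() {
  # Drop everything up to and including "--feature-gates=", split the
  # comma-separated list into lines, and look for an exact "GATE=true" entry.
  printf '%s\n' "${2#*--feature-gates=}" | tr ',' '\n' | grep -qx "$1=true"
}

# Usage against the management cluster (requires kubectl access):
#   line=$(kubectl describe -n capi-system deployment.apps/capi-controller-manager | grep feature-gates)
#   gate_enabled ClusterTopology "$line" && echo "ClusterTopology is on"
```

Matching the whole entry (grep -x) avoids false positives from gates whose names merely contain the one you asked for.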

With ClusterTopology enabled, the generated cluster manifest contains both a ClusterClass definition (the blueprint) and a Cluster resource that references it. If you inspect the generated YAML, the resource types make this clear:

cat capi-quickstart.yaml | grep -E '^apiVersion:|^kind:'

apiVersion: cluster.x-k8s.io/v1beta2
kind: ClusterClass
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerClusterTemplate
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: KubeadmControlPlaneTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachinePoolTemplate
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: KubeadmConfigTemplate
apiVersion: cluster.x-k8s.io/v1beta2
kind: Cluster

The ClusterClass at the top is the blueprint. The Cluster at the bottom is the actual workload cluster that references it. The resources in between (the DockerClusterTemplate, KubeadmControlPlaneTemplate, DockerMachineTemplates, DockerMachinePoolTemplate, and KubeadmConfigTemplate) are the building blocks that the ClusterClass stitches together.

One of the most powerful benefits of this architecture is how upgrades work. Because the management cluster controls the workload cluster declaratively, upgrading the Kubernetes version of an entire workload cluster is as simple as patching a single field on the Cluster resource. The controllers handle the rest: they roll out new control plane nodes one by one, wait for each to become healthy, drain and remove old nodes, and then do the same for worker nodes. There’s no need to SSH into machines or run manual commands. You just change the desired version, and the system converges to match it.

To summarize the mental model: the Management Cluster is the operator, and the Workload Cluster is what gets operated on. The Management Cluster runs CAPI controllers that watch custom resources describing desired cluster state. When you create, modify, or delete these resources, the controllers spring into action, provisioning or tearing down real infrastructure to match your intent. This separation of concerns means you can manage dozens or even hundreds of workload clusters from a single management cluster, all using the same familiar Kubernetes API and tooling you already know.

Cluster API Hands-On

In this section, we will walk through every step of setting up a management cluster, provisioning a full workload cluster with three control plane nodes and three workers, deploying a sample application onto it, and finally upgrading that workload cluster to a newer Kubernetes version. Everything here uses the Docker infrastructure provider, which means each “machine” is actually a Docker container pretending to be a node. This makes it perfect for learning and experimentation on your local machine without needing any cloud accounts or bare-metal servers.

Setting Up the Management Cluster (KinD + Docker Provider)

Before we can use Cluster API to provision workload clusters, we need a management cluster. This is a regular Kubernetes cluster that runs the Cluster API controllers. These controllers watch for custom resources like Cluster, Machine, MachineDeployment, and so on, and then reconcile the desired state by creating actual infrastructure. In our case, the infrastructure provider is Docker, so the controllers will create Docker containers to act as Kubernetes nodes.

Let us start by creating a working directory. This keeps all our files organized in one place.

mkdir capi-docker && cd capi-docker

Before proceeding, it is a good idea to check your Docker context and make sure there are no leftover containers from previous experiments that might interfere.

docker context ls
docker ps -a

Now we create a KinD (Kubernetes in Docker) cluster. This will serve as our management cluster. Notice that we mount the Docker socket into the KinD node. This is critical because the Cluster API Docker provider needs access to the host’s Docker daemon in order to create containers that will become our workload cluster nodes. We also set up a couple of extra port mappings for convenience, one for a sample application and one for kube-ops-view, a nice visual dashboard.

kind create cluster --name myk8s --image kindest/node:v1.35.0 --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /var/run/docker.sock
    containerPath: /var/run/docker.sock
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
  - containerPort: 30001
    hostPort: 30001
EOF

Once the cluster is up, you can optionally install kube-ops-view on the management cluster. This gives you a nice visual representation of your cluster’s pods and nodes. It is not required, but it makes the experience much more enjoyable when you are watching things happen in real time.

helm repo add geek-cookbook https://geek-cookbook.github.io/charts/
helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 \
  --set service.main.type=NodePort,service.main.ports.http.nodePort=30001 \
  --set env.TZ="Asia/Seoul" --namespace kube-system

You can then open your browser and navigate to http://127.0.0.1:30001/#scale=1.5 to see the dashboard.

Next, we need the clusterctl command-line tool. This is the primary CLI for interacting with Cluster API. On macOS you can install it with Homebrew, and on Linux you can download the binary directly.

# macOS
brew install clusterctl

# Linux amd64
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.12.2/clusterctl-linux-amd64 -o clusterctl
sudo install -o root -g root -m 0755 clusterctl /usr/local/bin/clusterctl

Verify the installation by checking the version.

clusterctl version -o json | jq

Now comes the important part. We initialize the management cluster by telling clusterctl to install all the necessary Cluster API components with the Docker infrastructure provider. We also enable the ClusterTopology experimental feature, which lets us use ClusterClass resources. ClusterClass is essentially a reusable template that defines how clusters should be structured, making it much easier to stamp out multiple clusters with consistent configurations.

export CLUSTER_TOPOLOGY=true
clusterctl init --infrastructure docker

This command does a lot behind the scenes. It installs cert-manager (which Cluster API depends on for webhook certificates), and then it deploys four sets of controllers into separate namespaces. You will see output indicating that each provider is being installed. Once it finishes, let us verify that everything landed correctly.

kubectl get pod -A

You should see pods running in several new namespaces. The capd-system namespace contains the Docker infrastructure provider controller. The capi-system namespace has the core Cluster API controller. The capi-kubeadm-bootstrap-system namespace runs the bootstrap provider, which is responsible for generating the cloud-init or kubeadm configuration that nodes use when they first boot. And capi-kubeadm-control-plane-system runs the control plane provider, which manages the lifecycle of control plane nodes including scaling and upgrades.

NAMESPACE                           NAME                                                            READY   STATUS
capd-system                         capd-controller-manager-7c9d67ffdf-7npsd                        1/1     Running
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-bd5f89bbd-9c9ng       1/1     Running
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55c48d9b5-bckj5   1/1     Running
capi-system                         capi-controller-manager-6cc7b949c4-rmd7h                        1/1     Running
cert-manager                        cert-manager-598d877b78-9lkmd                                   1/1     Running
cert-manager                        cert-manager-cainjector-6b5777d564-7mfzz                        1/1     Running
cert-manager                        cert-manager-webhook-5d9fc6b4ff-slscg                           1/1     Running

You can also inspect the installed providers using the providers custom resource. This is a handy way to confirm which providers are active and what version they are running.

kubectl get providers.clusterctl.cluster.x-k8s.io -A

The output tells you exactly what is installed: the core Cluster API provider, the kubeadm bootstrap provider, the kubeadm control plane provider, and the Docker infrastructure provider, all at version v1.12.2.

NAMESPACE                           NAME                    TYPE                     PROVIDER      VERSION
capd-system                         infrastructure-docker   InfrastructureProvider   docker        v1.12.2
capi-system                         cluster-api             CoreProvider             cluster-api   v1.12.2
capi-kubeadm-bootstrap-system       bootstrap-kubeadm       BootstrapProvider        kubeadm       v1.12.2
capi-kubeadm-control-plane-system   control-plane-kubeadm   ControlPlaneProvider     kubeadm       v1.12.2

A large number of Custom Resource Definitions (CRDs) have also been installed. These define all the resource types that Cluster API uses to represent clusters, machines, machine deployments, machine pools, and more. You can see them all by running the following command.

kubectl get crd | grep x-k8s

At this point, your management cluster is fully initialized and ready to provision workload clusters. The controllers are running, the CRDs are in place, and the Docker provider is standing by to create containers on demand.
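
If you script this setup, it can help to wait for the controllers explicitly instead of eyeballing the pod list. The sketch below loops over the namespaces shown above and calls kubectl wait on their Deployments. The function name wait_for_capi and the KUBECTL override variable are assumptions made for this example, not something clusterctl provides.

```shell
# wait_for_capi [TIMEOUT]
# Block until every Deployment in the provider namespaces reports Available.
# KUBECTL can be overridden (for example to add a wrapper); it defaults to kubectl.
# Hypothetical helper for illustration.
wait_for_capi() {
  for ns in capi-system capi-kubeadm-bootstrap-system \
            capi-kubeadm-control-plane-system capd-system cert-manager; do
    ${KUBECTL:-kubectl} wait --for=condition=Available deployment --all \
      -n "$ns" --timeout="${1:-300s}" || return 1
  done
}

# Usage:
#   wait_for_capi        # up to 300s per namespace
#   wait_for_capi 10m
```

The function stops at the first namespace that does not become Available in time, so a non-zero exit tells you which step to investigate.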

Provisioning a Workload Cluster (3 CP + 3 Workers)

With the management cluster ready, we can now tell it to create a workload cluster. The first thing to do is set some environment variables that control the networking configuration of the new cluster. These values will be injected into the cluster manifest when we generate it.

export SERVICE_CIDR=["10.20.0.0/16"]
export POD_CIDR=["10.10.0.0/16"]
export SERVICE_DOMAIN="myk8s-1.local"
export POD_SECURITY_STANDARD_ENABLED="false"

The SERVICE_CIDR defines the IP range for Kubernetes Services. The POD_CIDR defines the IP range for Pods. The SERVICE_DOMAIN sets the internal DNS domain. And we disable Pod Security Standards for simplicity in this lab environment.

Now we use clusterctl to generate the cluster manifest. The development flavor is specifically designed for the Docker provider. We request Kubernetes v1.34.3 with three control plane nodes and three worker nodes.

clusterctl generate cluster capi-quickstart --flavor development \
  --kubernetes-version v1.34.3 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > capi-quickstart.yaml

This produces a YAML file containing all the Cluster API resources needed to describe our desired cluster. Let us peek at what kinds of resources are in there.

cat capi-quickstart.yaml | grep -E '^apiVersion:|^kind:'

The output reveals quite a few resource types, and understanding them helps demystify how Cluster API works.

apiVersion: cluster.x-k8s.io/v1beta2
kind: ClusterClass
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerClusterTemplate
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: KubeadmControlPlaneTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachineTemplate
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: DockerMachinePoolTemplate
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: KubeadmConfigTemplate
apiVersion: cluster.x-k8s.io/v1beta2
kind: Cluster

The ClusterClass is like a blueprint. It defines what a cluster should look like in terms of its infrastructure, control plane configuration, and worker node templates. The Cluster resource at the bottom is the actual instance that references this ClusterClass and specifies the concrete values like version and replica counts. The various template resources (DockerClusterTemplate, DockerMachineTemplate, KubeadmControlPlaneTemplate, KubeadmConfigTemplate) define how each piece of infrastructure and configuration should be created.
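
For a more compact view of the same manifest, you could count how often each kind appears. This is a small sketch using awk; kind_counts is a hypothetical helper name, and it assumes every resource starts its kind field at the beginning of a line, as in the grep above.

```shell
# kind_counts FILE
# Print "<count> <kind>" for each resource kind in a multi-document manifest,
# sorted by kind name. Hypothetical helper for illustration.
kind_counts() {
  awk '/^kind:/ { n[$2]++ } END { for (k in n) print n[k], k }' "$1" | sort -k2
}

# Usage:
#   kind_counts capi-quickstart.yaml
```

On the quickstart manifest this makes it easy to spot that DockerMachineTemplate appears twice: once for the control plane and once for the default worker class, matching the names printed by kubectl apply.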

Now let us apply this manifest to the management cluster and watch the magic happen.

kubectl apply -f capi-quickstart.yaml

You should see output confirming that all the resources were created.

clusterclass.cluster.x-k8s.io/quick-start created
dockerclustertemplate.infrastructure.cluster.x-k8s.io/quick-start-cluster created
kubeadmcontrolplanetemplate.controlplane.cluster.x-k8s.io/quick-start-control-plane created
dockermachinetemplate.infrastructure.cluster.x-k8s.io/quick-start-control-plane created
dockermachinetemplate.infrastructure.cluster.x-k8s.io/quick-start-default-worker-machinetemplate created
dockermachinepooltemplate.infrastructure.cluster.x-k8s.io/quick-start-default-worker-machinepooltemplate created
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/quick-start-default-worker-bootstraptemplate created
cluster.cluster.x-k8s.io/capi-quickstart created

The controllers immediately begin reconciling. The Docker provider starts creating containers, the bootstrap provider generates kubeadm configurations, and the control plane provider orchestrates the bring-up of each control plane node one by one. You can watch the progress in several ways. Opening a few terminal windows with monitoring commands makes this much more fun to observe.

watch -d "docker ps ; echo ; clusterctl describe cluster capi-quickstart"
watch -d kubectl get cluster -o wide
watch -d kubectl get machines

After a few minutes, if you run docker ps, you will see new containers appearing. These are the workload cluster nodes, each running the kindest/node image.

docker ps

You will notice several types of containers. The three control plane nodes are the containers that expose port 6443, and the three worker nodes (the ones with md-0 in their names) expose no ports. The listing below is trimmed and omits the PORTS column, so the names are the easiest way to tell them apart. One container running kindest/haproxy serves as a load balancer in front of the control plane API servers. And the original myk8s-control-plane container is your management cluster.

CONTAINER ID   IMAGE                                COMMAND                  NAMES
d7622e825f15   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-hvt8l-plmp7
803b5917463f   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-hvt8l-j6tpk
051bcc807657   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-md-0-64l6n-9hjzs-lcjv4
46b27b6eb154   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-md-0-64l6n-9hjzs-8rz7j
a5271457c0d6   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-md-0-64l6n-9hjzs-qnd5z
1607db8af33f   kindest/node:v1.34.3                 "/usr/local/bin/entr…"   capi-quickstart-hvt8l-lpp75
6c1f5141492c   kindest/haproxy:v20230606-42a2262b   "haproxy -W -db -f /…"   capi-quickstart-lb
2f4d7bd4d2ad   kindest/node:v1.35.0                 "/usr/local/bin/entr…"   myk8s-control-plane

Now we need to get the kubeconfig for the workload cluster so we can interact with it directly.

clusterctl get kubeconfig capi-quickstart > capi-quickstart.kubeconfig

There is an important gotcha here that is worth mentioning. If you are running KinD on your local machine, the kubeconfig file may contain an internal Docker network IP address (something like 172.18.0.3) as the API server endpoint. This address is not reachable from your host machine.

You need to replace it with the host-accessible address. Check docker ps for the capi-quickstart-lb container’s port mapping. If it shows something like 0.0.0.0:55000->6443/tcp, then edit the kubeconfig to point to 127.0.0.1:55000 instead.

# Check the LB port mapping
docker ps | grep capi-quickstart-lb

# Edit the kubeconfig
vim capi-quickstart.kubeconfig
# Change server: https://172.18.0.3:6443
# To:    server: https://127.0.0.1:55000
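
If you prefer not to edit the file by hand, the same change can be scripted. The sketch below reads the published port with docker port and rewrites the server line. lb_url and set_server are hypothetical helper names, and the sketch assumes the kubeconfig contains a single server: entry, which is what you would expect for a one-cluster kubeconfig.

```shell
# lb_url
# Read `docker port <lb-container> 6443/tcp` output on stdin and print a
# host-reachable API server URL. Only the first mapping line is used.
lb_url() {
  awk -F: 'NR == 1 { print "https://127.0.0.1:" $NF }'
}

# set_server FILE URL
# Rewrite the "server:" line of a kubeconfig in place. A temp file is used
# instead of `sed -i`, whose syntax differs between GNU and BSD sed.
set_server() {
  sed "s|^\( *server:\).*|\1 $2|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage (requires a running capi-quickstart-lb container):
#   url=$(docker port capi-quickstart-lb 6443/tcp | lb_url)
#   set_server capi-quickstart.kubeconfig "$url"
```

Keeping the port lookup separate from the rewrite lets you print and check the URL before touching the file.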

Try connecting to the workload cluster.

kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes -owide

At this point, all six nodes will show up but they will be in NotReady status. This is expected and completely normal. The nodes are not ready because there is no CNI (Container Network Interface) plugin installed yet. Without a CNI plugin, pods cannot communicate with each other, and the kubelet reports the node as not ready. Let us fix that by installing Calico.

kubectl --kubeconfig=capi-quickstart.kubeconfig apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml

Give it a minute or so, and then check the nodes again.

kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes -owide
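
Rather than re-running the command until everything settles, you could poll for nodes that are not Ready yet. This is a minimal sketch: not_ready is a hypothetical helper name, and it relies on STATUS being the second column of the default kubectl get nodes output.

```shell
# not_ready
# Read `kubectl get nodes --no-headers` output on stdin and print the names of
# nodes whose STATUS column is not exactly "Ready". A cordoned node
# ("Ready,SchedulingDisabled") is listed too. Hypothetical helper for illustration.
not_ready() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage: poll until the list is empty (requires the workload kubeconfig):
#   while [ -n "$(kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes --no-headers | not_ready)" ]; do
#     sleep 5
#   done
```

A one-line alternative is kubectl wait --for=condition=Ready node --all with a --timeout, which blocks until every node reports Ready.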

Now you should see all six nodes in Ready status, three control plane nodes and three workers, all running Kubernetes v1.34.3 on Debian Bookworm with containerd as the container runtime.

NAME                                     STATUS   ROLES           AGE     VERSION   INTERNAL-IP
capi-quickstart-md-0-lt794-jfjvx-9wrql   Ready    <none>          113s    v1.34.3   192.168.97.6
capi-quickstart-md-0-lt794-jfjvx-f4pr4   Ready    <none>          113s    v1.34.3   192.168.97.5
capi-quickstart-md-0-lt794-jfjvx-rjfz9   Ready    <none>          113s    v1.34.3   192.168.97.7
capi-quickstart-sn62k-6n9sd              Ready    control-plane   2m14s   v1.34.3   192.168.97.4
capi-quickstart-sn62k-pr4kp              Ready    control-plane   46s     v1.34.3   192.168.97.9
capi-quickstart-sn62k-xhmtw              Ready    control-plane   75s     v1.34.3   192.168.97.8

You can also use clusterctl describe to get a nice summary view of the entire cluster hierarchy.

clusterctl describe cluster capi-quickstart

The output shows the cluster, its infrastructure (DockerCluster), the control plane (KubeadmControlPlane) with its three machines, and the workers (MachineDeployment) with their three machines. Everything should show as Ready and Available.

NAME                                                           READY   STATUS
Cluster/capi-quickstart                                        True    Available
├─ClusterInfrastructure - DockerCluster/capi-quickstart-wcb2g  True    Ready
├─ControlPlane - KubeadmControlPlane/capi-quickstart-sn62k     True    Available
│ └─3 Machines...                                              True    Ready
└─Workers
  └─MachineDeployment/capi-quickstart-md-0-lt794               True    Available
    └─3 Machines...                                            True    Ready

Your workload cluster is now fully operational.

Deploying a Sample App

To make things more convenient for the rest of our work, let us set up a shell alias so we do not have to type the kubeconfig flag every time.

alias k8s1='kubectl --kubeconfig=capi-quickstart.kubeconfig'
k8s1 cluster-info

You should see the control plane endpoint and CoreDNS reported as running. Optionally, you can also install kube-ops-view on the workload cluster for visual monitoring. This time we use port-forward since we do not have a NodePort conveniently mapped.

helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 \
  --set env.TZ="Asia/Seoul" --namespace kube-system --kubeconfig=capi-quickstart.kubeconfig

k8s1 -n kube-system port-forward svc/kube-ops-view 8080:8080 &

Open http://127.0.0.1:8080/#scale=1.5 in your browser to see the workload cluster’s visual dashboard.

Now let us deploy a simple application to prove that the cluster works end to end. We will deploy a whoami web server using a Deployment with three replicas and expose it via a NodePort Service. The Deployment carries a required pod anti-affinity rule keyed on kubernetes.io/hostname, which is meant to keep any two webpod replicas off the same node.

cat << EOF | kubectl --kubeconfig=capi-quickstart.kubeconfig apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - webpod
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30003
  type: NodePort
EOF

Check that the deployment, pods, service, and endpoints are all in good shape.

k8s1 get deploy,pod,svc,ep -owide
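
To confirm the spread without reading the wide output column by column, you could count the distinct nodes that host a replica. This is only a sketch; distinct_nodes is a hypothetical helper name, and the usage line assumes the app=webpod label from the manifest above.

```shell
# distinct_nodes
# Read one node name per line on stdin and print how many different nodes
# appear. Three replicas on three different nodes should print 3.
# Hypothetical helper for illustration.
distinct_nodes() {
  sort -u | awk 'NF { n++ } END { print n + 0 }'
}

# Usage (requires the workload kubeconfig):
#   kubectl --kubeconfig=capi-quickstart.kubeconfig get pod -l app=webpod \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | distinct_nodes
```

A result lower than the replica count means at least two replicas share a node.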

You should see three pods running, spread across your worker nodes, with the service exposing NodePort 30003. To actually test the connectivity, we can use the management cluster’s KinD node to reach into one of the workload cluster nodes on that NodePort. First, pick one of the control plane container names from docker ps, then curl it.

docker ps
CT1=capi-quickstart-hvt8l-hb4hb   # example name; use one from your own docker ps output

docker exec -it myk8s-control-plane curl -s $CT1:30003

You can even run a loop to see load balancing in action across the three pods.

while true; do docker exec -it myk8s-control-plane curl -s $CT1:30003 | grep Hostname; date; sleep 1; done

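The Service picks a backend pod more or less at random for each connection, so consecutive requests can land on the same pod. Over a handful of requests, though, you should see the Hostname value change as different pods answer.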
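
If you want numbers instead of a scrolling loop, you could send a fixed batch of requests and tally the answers per pod. The sketch below assumes the whoami response contains a line of the form "Hostname: <pod-name>", as the grep above does; tally_hostnames is a hypothetical helper name.

```shell
# tally_hostnames
# Read whoami responses on stdin and print "<count> <pod>" per Hostname value,
# sorted by pod name. Carriage returns (added when docker exec runs with -t)
# are stripped first. Hypothetical helper for illustration.
tally_hostnames() {
  tr -d '\r' |
    awk '$1 == "Hostname:" { n[$2]++ } END { for (h in n) print n[h], h }' |
    sort -k2
}

# Usage: 30 requests, then a summary (CT1 as set earlier):
#   for i in $(seq 30); do
#     docker exec myk8s-control-plane curl -s "$CT1:30003"
#   done | tally_hostnames
```

With three healthy replicas you would expect three lines with roughly similar counts, though the split is not guaranteed to be even.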

Since the workload cluster has a HAProxy load balancer in front of the three control plane API servers, you can also inspect its stats page. Check the LB container’s port mappings first.

docker inspect capi-quickstart-lb | jq | grep -i hostport

If you see a HostPort like 55001 mapped to port 8404, you can open http://127.0.0.1:55001/stats in your browser to see the HAProxy statistics dashboard showing the health status of all three backend API servers.

Upgrading the Workload Cluster (v1.34 to v1.35)

One of the most powerful features of Cluster API is declarative cluster lifecycle management. Upgrading a cluster from one Kubernetes version to another is as simple as changing a single field in the Cluster resource. The controllers handle all the complexity of rolling out new nodes, draining old ones, and ensuring the cluster remains available throughout the process.

Before starting the upgrade, open several terminal windows for monitoring so you can watch the process unfold in real time.

watch -d kubectl --kubeconfig=capi-quickstart.kubeconfig get node
watch -d "docker ps ; echo ; clusterctl describe cluster capi-quickstart"
kubectl --kubeconfig=capi-quickstart.kubeconfig get node -w

If you still have kube-ops-view running, keep an eye on that browser tab as well. It is quite satisfying to watch nodes appear and disappear during the rolling upgrade.

Now, because we used ClusterClass and Topology when we created the cluster, the upgrade is done by patching the Cluster resource’s topology version field. You do not touch the KubeadmControlPlane or MachineDeployment directly; the topology controller takes care of propagating the version change to all the right places.

kubectl patch cluster capi-quickstart --type merge -p '{"spec":{"topology":{"version":"v1.35.0"}}}'

The moment you run this command, the controllers spring into action. The upgrade proceeds in a careful, ordered fashion. First, the control plane nodes are upgraded one at a time. A new control plane node running v1.35.0 is created, it joins the cluster and the etcd ring, and then one of the old v1.34.3 control plane nodes is cordoned, drained, and removed. This repeats until all three control plane nodes have been replaced. Only after the entire control plane is upgraded does the controller move on to the worker nodes, following the same rolling replacement strategy.

You can watch the nodes cycling through by repeatedly checking the node list.

k8s1 get node

During the upgrade, you will see a mix of old and new nodes. At some point, you might see output like this, where the last node is being drained and removed.

NAME                                     STATUS                     ROLES           VERSION
capi-quickstart-74kjs-ncktx              Ready                      control-plane   v1.35.0
capi-quickstart-74kjs-npnjr              Ready                      control-plane   v1.35.0
capi-quickstart-74kjs-tdd8d              Ready                      control-plane   v1.35.0
capi-quickstart-md-0-npmnz-x7b2z-892h8   Ready                      <none>          v1.35.0
capi-quickstart-md-0-npmnz-x7b2z-fszrs   Ready                      <none>          v1.35.0
capi-quickstart-md-0-npmnz-x7b2z-lksqj   Ready                      <none>          v1.35.0
capi-quickstart-md-0-npmnz-x9jx6-qx8rf   Ready,SchedulingDisabled   <none>          v1.34.3

Notice the last worker node shows SchedulingDisabled. It has been cordoned and is being drained before removal.
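
To tell from a script whether the rollout has finished, you could list the nodes that still report a version other than the target. This is a minimal sketch; nodes_behind is a hypothetical helper name, and it relies on VERSION being the fifth column of the default kubectl get nodes output (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# nodes_behind TARGET
# Read `kubectl get nodes --no-headers` output on stdin and print
# "<node> <version>" for every node whose VERSION column differs from TARGET.
# Hypothetical helper for illustration.
nodes_behind() {
  awk -v target="$1" '$5 != target { print $1, $5 }'
}

# Usage (requires the workload kubeconfig):
#   kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes --no-headers | nodes_behind v1.35.0
# An empty result means every node reports the target version.
```

You can pair this with kubectl get cluster capi-quickstart -o jsonpath='{.spec.topology.version}' on the management cluster to confirm the desired version that the patch set.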

If you check docker ps during the upgrade, you will see new containers being created with the kindest/node:v1.35.0 image while the old v1.34.3 containers are gradually removed.

docker ps

CONTAINER ID   IMAGE                                NAMES
9a79e524c198   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-lksqj
f5658e9d5385   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-fszrs
1a71ef5e651a   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-892h8
3baae4af04d4   kindest/node:v1.35.0                 capi-quickstart-74kjs-tdd8d
5e85d57bd948   kindest/node:v1.35.0                 capi-quickstart-74kjs-npnjr
e7742cd044b3   kindest/node:v1.35.0                 capi-quickstart-74kjs-ncktx
c125003ec5d7   kindest/haproxy:v20230606-42a2262b   capi-quickstart-lb
6419a7cf30ce   kindest/node:v1.35.0                 myk8s-control-plane

Every node container is now running v1.35.0. The HAProxy load balancer container remains untouched since it only proxies traffic and does not run Kubernetes itself.

You can also verify that the load balancer’s backend configuration has been updated to point to the new control plane nodes by copying the HAProxy config out of the container.

docker cp capi-quickstart-lb:/usr/local/etc/haproxy/haproxy.cfg .
cat haproxy.cfg | grep capi-quickstart

The backend server entries should now reference the new control plane containers with their new internal IP addresses.

Once the upgrade completes, all nodes will be running v1.35.0 and the cluster is fully healthy again. The entire process was triggered by a single kubectl patch command. There was no manual SSH-ing into nodes, no running kubeadm upgrade on each machine, no worrying about the order of operations. Cluster API handled everything declaratively.

This is the real power of Cluster API. Whether you are managing two clusters or two hundred, the workflow is the same: define the desired state, apply it, and let the controllers do the work. Combined with GitOps tools, you can version-control your entire cluster fleet and roll out upgrades with a pull request.

Cleanup

When you’re done experimenting with everything, it’s important to tear down all the resources properly so nothing lingers on your machine. We have two separate environments to clean up here: the Cluster API lab with its KinD-based management and workload clusters, and the RKE2 lab with its Vagrant-based virtual machines. Let’s walk through each one carefully.

Starting with the Cluster API side, the first thing to remove is the workload cluster that was provisioned through Cluster API. This is one of the most elegant parts of the entire Cluster API experience. Because the workload cluster is represented as a single Cluster custom resource in the management cluster, deleting it triggers the full reconciliation loop in reverse. The controller will drain the nodes, delete the Machine objects, tear down the Docker containers that were acting as virtual machines, and remove the load balancer. You can chain the delete command with docker ps to immediately see the effect. By default kubectl delete waits until the resource's finalizers have completed, so docker ps only runs once the teardown has finished.

kubectl delete cluster capi-quickstart && docker ps

The output should look like this. Notice how all the workload cluster containers have disappeared, leaving only the original KinD management cluster container running.

cluster.cluster.x-k8s.io "capi-quickstart" deleted
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                                                             NAMES
037b1d199fb7   kindest/node:v1.35.0   "/usr/local/bin/entr…"   57 minutes ago   Up 57 minutes   0.0.0.0:30000-30001->30000-30001/tcp, 127.0.0.1:57103->6443/tcp   myk8s-control-plane

Before this command, eight containers were running: three control plane nodes, three worker nodes, one HAProxy load balancer, and the management cluster itself. After the delete, only the management cluster’s KinD container survives; the Cluster API controllers cleaned up everything else automatically. This is the declarative model at work: you don’t have to SSH into machines, run drain commands, or stop services one by one, because the controller handles the entire teardown sequence for you.
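If you would rather check for leftovers than eyeball the docker ps table, something like the following could work. It is a minimal sketch: leftover_containers is a hypothetical helper name, and it assumes every workload cluster container name starts with the cluster name, as it does in this lab.

```shell
# leftover_containers: hypothetical helper for this lab.
# Reads container names on stdin (one per line) and prints those that start
# with the given prefix. Returns 0 if none are left, 1 otherwise.
leftover_containers() {
  local prefix="$1" found
  found=$(grep -- "^${prefix}" || true)
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
    return 1
  fi
  return 0
}

# Usage:
#   docker ps -a --format '{{.Names}}' | leftover_containers capi-quickstart \
#     && echo "workload cluster containers are gone"
```

Using docker ps -a rather than plain docker ps also catches stopped containers that were never removed.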

You should also clean up the kubeconfig file and any other artifacts that were generated during the lab. The capi-quickstart.kubeconfig file and the capi-quickstart.yaml manifest file are no longer needed.

rm -f capi-quickstart.kubeconfig capi-quickstart.yaml

Now that the workload cluster is completely gone, you can safely remove the KinD management cluster itself. This will destroy the last remaining Docker container and also clean up the kubeconfig context that KinD registered on your host machine.

kind delete cluster --name myk8s

The output confirms the deletion.

Deleting cluster "myk8s" ...
Deleted nodes: ["myk8s-control-plane"]

At this point, if you run docker ps, you should see no containers left over from this lab. Assuming nothing else was running, your Docker environment is back to a clean state. If you also installed clusterctl via Homebrew on macOS and no longer need it, you can optionally remove it as well.

brew uninstall clusterctl

You can also clean up the working directory you created at the beginning of the Cluster API lab.

cd ~
rm -rf capi-docker

That takes care of the entire Cluster API environment. Now let’s move on to the RKE2 lab.

The RKE2 lab was built on top of Vagrant with two virtual machines, k8s-node1 acting as the server (control plane) and k8s-node2 acting as the agent (worker node). Since all the RKE2 components, the systemd services, the binaries, the etcd data, the certificates, and everything else live entirely inside those VMs, the cleanest way to remove everything is to simply destroy the VMs through Vagrant. Navigate back to the directory where your Vagrantfile lives and run the destroy command with the force flag so it doesn’t prompt you for confirmation on each VM.

cd ~/k8s-rke2
vagrant destroy -f

Vagrant will stop both VMs and delete their disk images. You can confirm they’re gone by checking the status.

vagrant status

Both k8s-node1 and k8s-node2 should show as “not created,” which means they have been fully removed from your system.
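For a scripted check, Vagrant's machine-readable output can be parsed instead of reading the table by eye. The sketch below assumes the comma-separated layout "timestamp,target,type,data" and the VirtualBox state id not_created; all_vms_destroyed is a hypothetical helper name.

```shell
# all_vms_destroyed: hypothetical helper for this lab.
# Reads `vagrant status --machine-readable` output on stdin, where each line is
# assumed to be "timestamp,target,type,data", and succeeds only if at least one
# "state" record exists and every one of them reports not_created.
all_vms_destroyed() {
  awk -F, '
    $3 == "state" { seen = 1; if ($4 != "not_created") bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage (from the directory that holds the Vagrantfile):
#   vagrant status --machine-readable | all_vms_destroyed && echo "all VMs removed"
```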

If you want to clean up the Vagrantfile and init script as well since you won’t need them anymore, you can remove the entire working directory.

cd ~
rm -rf k8s-rke2

Now, there is one more thing worth mentioning for situations where you might want to remove RKE2 from a node without destroying the VM itself. Perhaps you’re working in a non-Vagrant environment, or you want to repurpose the machine for something else. RKE2 ships with an uninstall script, rke2-uninstall.sh, that gets placed on the system during installation, and the same script is used on both server and agent nodes. It is thorough: it stops the rke2 services, kills any remaining containerd processes, removes the binaries, deletes the systemd unit files, cleans up the yum repository configuration added during an RPM install, and wipes the data directories under /var/lib/rancher and /etc/rancher.

With an RPM-based install, the script lives in /usr/bin:

/usr/bin/rke2-uninstall.sh

With a tarball-based install, it lives in /usr/local/bin instead:

/usr/local/bin/rke2-uninstall.sh

After running the appropriate script, the node is returned to a pre-RKE2 state. No Kubernetes components, no containerd, no certificates, and no leftover configuration. This is handy in production scenarios where you need to decommission a node or reinstall RKE2 from scratch without rebuilding the entire OS.

You should also remember to clean up the symbolic links we created during the RKE2 setup if you went the uninstall script route instead of destroying the VM. During the lab, we created several symlinks in /usr/local/bin to expose the RKE2 bundled binaries like kubectl, crictl, containerd, runc, and ctr to the system PATH. The uninstall script may or may not remove these depending on the version, so it’s good practice to verify.

rm -f /usr/local/bin/kubectl
rm -f /usr/local/bin/crictl
rm -f /usr/local/bin/containerd
rm -f /usr/local/bin/runc
rm -f /usr/local/bin/ctr
rm -f /etc/crictl.yaml
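On a long-lived machine, a blanket rm -f could delete a kubectl or runc that some other package installed. A more cautious variant is sketched below. It assumes the lab's symlinks point into /var/lib/rancher/rke2 (adjust RKE2_DATA_DIR if yours differ); remove_rke2_link is a hypothetical helper, not something RKE2 ships.

```shell
# remove_rke2_link: hypothetical helper, not shipped with RKE2.
# Deletes the given path only if it is a symlink whose target lies under
# RKE2_DATA_DIR (assumed default: /var/lib/rancher/rke2). Anything else is
# left alone and reported.
RKE2_DATA_DIR="${RKE2_DATA_DIR:-/var/lib/rancher/rke2}"

remove_rke2_link() {
  local path="$1" target
  if [ ! -L "$path" ]; then
    echo "skip: $path is not a symlink"
    return 0
  fi
  target=$(readlink "$path")
  case "$target" in
    "$RKE2_DATA_DIR"/*) rm -f "$path"; echo "removed: $path -> $target" ;;
    *)                  echo "skip: $path -> $target" ;;
  esac
}

# Usage:
#   for bin in kubectl crictl containerd runc ctr; do
#     remove_rke2_link "/usr/local/bin/$bin"
#   done
```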

And the shell profile customizations we added for kubectl completion and the k alias:

sed -i '/source <(kubectl completion bash)/d' /etc/profile
sed -i '/alias k=kubectl/d' /etc/profile
sed -i '/complete -F __start_kubectl k/d' /etc/profile

Again, none of this manual cleanup is necessary if you’re simply destroying the Vagrant VMs, which is by far the simplest and most complete approach for a lab environment. But it’s good to know these details for real-world scenarios where the underlying machines are long-lived and managed independently of the Kubernetes installation.

The system upgrade controller that we installed during the automated upgrade section was deployed inside the RKE2 cluster itself, so it is destroyed along with the cluster when the VMs go away. But if you were running this on a persistent cluster and wanted to remove just the upgrade controller, you would delete the resources in reverse order: first the Plan custom resources, then the controller deployment and its associated RBAC and namespace, and finally the CRD.

kubectl delete plan -n system-upgrade server-plan agent-plan
kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml

This removes the system-upgrade namespace, the controller deployment, the ClusterRole and ClusterRoleBinding resources, the ServiceAccount, and the Plan CRD itself. Your cluster continues to run normally at whatever version it was last upgraded to, and no further automatic upgrades will be attempted.

At this point, both lab environments are fully cleaned up. Your host machine should have no leftover containers, no lingering VMs, and no stale configuration files. You’re ready to start fresh whenever you want to revisit these exercises or move on to new experiments.

Wrap Up & Key Takeaways

So let's wrap up everything we covered in this hands-on journey through RKE2 and Cluster API. This was a dense session, so it's worth taking a moment to reflect on what we actually did, what we learned, and why it matters.

We started with RKE2, Rancher's security-focused Kubernetes distribution. The very first thing you probably noticed is how simple the installation process is. A single curl command followed by running the install script is all it takes to get the binary on your machine.

curl -sfL https://get.rke2.io --output install.sh
chmod +x install.sh
INSTALL_RKE2_CHANNEL=v1.33 ./install.sh

From there, we wrote a configuration file and started the systemd service. That's it. No manually bootstrapping etcd, no setting up certificates by hand, no juggling multiple binaries. RKE2 handles all of that internally through its supervisor process. The control plane components like etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all come up as static pods, managed by the kubelet that RKE2 starts internally. You can confirm this after startup by listing the pods across all namespaces; the control plane components show up in kube-system.

kubectl get pod -A

You will see etcd, kube-apiserver, kube-controller-manager, kube-scheduler, and kube-proxy all running as pods, alongside the Helm-installed components like Canal (the default CNI), CoreDNS, and the metrics server. This is a key architectural point. RKE2 does not install these components using raw manifests or direct binary execution on the host. Instead, it uses a combination of static pod manifests and Helm charts, giving you a consistent and manageable deployment model.

Adding a worker node was equally straightforward. On the agent node, you install the RKE2 agent binary, point it at the server's registration endpoint on port 9345, and provide the node token that the server generated during its initial startup.

curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" INSTALL_RKE2_CHANNEL=v1.33 sh -

mkdir -p /etc/rancher/rke2/
cat << EOF > /etc/rancher/rke2/config.yaml
server: https://192.168.10.11:9345
token: K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749
EOF

systemctl enable --now rke2-agent.service

Once the agent joined, we verified it from the control plane and noticed something interesting. The control plane node had no taints applied by default. This means pods can be scheduled on the control plane node as well, which is different from what you might expect if you are coming from a kubeadm-based setup where control plane nodes typically have a NoSchedule taint. We confirmed this directly.

kubectl describe node k8s-node1 | grep -i taints
Taints:

This is a design choice in RKE2 that you should be aware of. In production, you may want to add taints to your control plane nodes manually if you want to keep workloads off them.
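As one possible way to do that, the sketch below applies a NoSchedule taint with kubectl. The taint key mirrors the one kubeadm uses and is an assumption here, not an RKE2 default, and taint_control_plane is a hypothetical helper name.

```shell
# taint_control_plane: hypothetical helper for this lab.
# Adds a NoSchedule taint to the named node. The taint key mirrors the one
# kubeadm applies; it is an assumption, not something RKE2 sets by default.
taint_control_plane() {
  local node="$1"
  kubectl taint nodes "$node" node-role.kubernetes.io/control-plane:NoSchedule --overwrite
}

# Usage:
#   taint_control_plane k8s-node1
#   kubectl describe node k8s-node1 | grep -i taints
```

NoSchedule only affects new pods, so anything already running on the node stays there. Before relying on this in production, it is worth checking that the system components you need on that node carry a matching toleration.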

Moving on to Day-2 operations, we explored certificate management. RKE2 issues client and server certificates that are valid for 365 days. When a certificate is within 120 days of expiration, RKE2 automatically renews it on the next restart. You can inspect all certificates and their expiration dates at any time using the built-in certificate check command.

rke2 certificate check --output table

This gives you a clear table showing every certificate file, its subject, usage type, expiration date, residual time, and current status. For manual rotation, the process is to stop the service, run the rotation, and then start the service again.
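The stop, rotate, start sequence could be wrapped roughly as follows. This is a sketch, assuming a server node whose systemd unit is rke2-server and a root shell on that node; rotate_rke2_certs is a hypothetical wrapper name, while the rke2 certificate subcommands are the ones used above.

```shell
# rotate_rke2_certs: hypothetical wrapper around the manual rotation sequence.
# Assumes a server node whose unit is rke2-server (pass rke2-agent for an
# agent). Must be run as root on the node itself.
rotate_rke2_certs() {
  local unit="${1:-rke2-server}"
  systemctl stop "$unit" || return 1
  rke2 certificate rotate || return 1
  systemctl start "$unit" || return 1
  # Show the new expiration dates once the service is back up.
  rke2 certificate check --output table
}

# Usage:
#   rotate_rke2_certs
```

Note that if the rotation step fails, the function returns early and the service stays stopped, so check the error before starting it again by hand.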

Building a Kubernetes Cluster in an Air-Gapped Environment with Kubespray-Offline

Sigrid Jin — Sat, 14 Feb 2026 23:29:08 GMT

A comprehensive step-by-step guide to deploying a Kubernetes cluster in an air-gapped environment using kubespray-offline, covering the entire process from setting up essential infrastructure to the final cluster installation.

Introduction

Most Kubernetes tutorials assume one thing: your nodes can reach the internet. Pull an image from Docker Hub, download a binary from GitHub, install a package from the upstream repo — it all just works. Until it doesn’t.

In regulated industries — finance, defense, healthcare, government — production clusters often sit behind an air gap. No outbound internet access, period. Every container image, every OS package, every Python dependency has to be staged internally before a single pod can run. If you’ve never dealt with this constraint, it might sound like a minor inconvenience. In practice, it changes almost everything about how you plan and execute a Kubernetes deployment.

I went through this process recently as part of the CloudNet@ community’s K8S Deploy study group (Week 6), and this post documents the full journey: standing up the internal infrastructure (DNS, NTP, package mirrors, a private container registry), using the kubespray-offline tooling to pre-download all required artifacts, and finally running Kubespray’s Ansible playbook to bring up a working cluster — all without a single packet leaving the private network.

The lab environment is modest — one admin server and two k8s nodes running Rocky Linux 10 on VirtualBox — but the workflow maps directly to real-world air-gapped deployments. By the end, you’ll have a clear picture of what it takes to go from a blank, isolated network to a functioning Kubernetes cluster with Flannel networking, a private image registry, Helm OCI support, and internal package repositories.

Architecture & Lab Setup

Network Topology Overview

The lab mimics a typical air-gapped deployment pattern. Three virtual machines sit on a private network (192.168.10.0/24), and only one of them — the admin server — has a path to the outside world.

┌─────────────────────────────────────────────────┐
│                 Private Network                  │
│               192.168.10.0/24                    │
│                                                  │
│   ┌───────────┐                                  │
│   │   admin   │──── enp0s8 ───► Internet         │
│   │  .10.10   │     (NAT GW)                    │
│   │           │                                  │
│   │  - DNS Server (BIND, :53)                   │
│   │  - NTP Server (chrony)                      │
│   │  - Container Registry (:35000)              │
│   │  - Nginx File Server (:80)                  │
│   │  - YUM/DNF Repo Mirror                      │
│   │  - PyPI Mirror                              │
│   │  - NAT Gateway                              │
│   └─────┬─────┘                                  │
│         │ enp0s9                                 │
│         │                                        │
│   ┌─────┴─────┐     ┌───────────┐               │
│   │ k8s-node1 │     │ k8s-node2 │               │
│   │  .10.11   │     │  .10.12   │               │
│   │  (master) │     │  (worker) │               │
│   └───────────┘     └───────────┘               │
│     enp0s8 DOWN       enp0s8 DOWN               │
│     (no internet)     (no internet)              │
└─────────────────────────────────────────────────┘

The admin server wears many hats. In a production air-gapped environment, these responsibilities would typically be split across dedicated infrastructure — a DNS team managing BIND or CoreDNS, a networking team handling the NAT gateway, a platform team running Harbor for container images.

For this lab, one server does it all:

DNS Server (BIND) — resolves internal hostnames and forwards external domain queries
NTP Server (chrony) — keeps cluster time synchronized across all nodes
NAT Gateway — routes internal traffic to the internet through enp0s8 during the preparation phase, then gets disabled to simulate a true air gap
Container Registry (Docker Registry) — serves all Kubernetes container images on port 35000
Nginx File Server — hosts Kubernetes binaries, RPM packages, and Python packages over HTTP on port 80
YUM/DNF Repository Mirror — a full mirror of Rocky Linux BaseOS, AppStream, and Extras repos
PyPI Mirror — offline Python package index for Kubespray’s Ansible dependencies

The two k8s nodes have their external-facing enp0s8 interfaces disabled entirely. Their only network path is enp0s9, pointing at 192.168.10.10 as the default gateway. Even DNS queries go through the admin server. These nodes are, for all practical purposes, fully isolated from the internet.

Vagrant Environment

The whole setup runs on VirtualBox, managed by Vagrant. Each VM gets Rocky Linux 10, 4 vCPUs, and 2 GB of RAM. The admin server has one important difference: its primary disk is expanded to 120 GB, because it needs to hold all the mirrored packages, container images, and binaries — easily exceeding 20 GB in total.

BOX_IMAGE = "bento/rockylinux-10.0"
BOX_VERSION = "202510.26.0"
N = 2

Vagrant.configure("2") do |config|

  # k8s nodes
  (1..N).each do |i|
    config.vm.define "k8s-node#{i}" do |subconfig|
      subconfig.vm.box = BOX_IMAGE
      subconfig.vm.box_version = BOX_VERSION
      subconfig.vm.provider "virtualbox" do |vb|
        vb.customize ["modifyvm", :id, "--groups", "/Kubespary-offline-Lab"]
        vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
        vb.name = "k8s-node#{i}"
        vb.cpus = 4
        vb.memory = 2048
        vb.linked_clone = true
      end
      subconfig.vm.host_name = "k8s-node#{i}"
      subconfig.vm.network "private_network", ip: "192.168.10.1#{i}"
      subconfig.vm.network "forwarded_port", guest: 22, host: "6000#{i}",
                           auto_correct: true, id: "ssh"
      subconfig.vm.synced_folder "./", "/vagrant", disabled: true
      subconfig.vm.provision "shell", path: "init_cfg.sh", args: [N]
    end
  end

  # Admin server
  config.vm.define "admin" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespary-offline-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "admin"
      vb.cpus = 4
      vb.memory = 2048
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "admin"
    subconfig.vm.network "private_network", ip: "192.168.10.10"
    subconfig.vm.network "forwarded_port", guest: 22, host: "60000",
                         auto_correct: true, id: "ssh"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
    subconfig.vm.disk :disk, size: "120GB", primary: true
    subconfig.vm.provision "shell", path: "admin.sh", args: [N]
  end
end

The linked_clone = true setting keeps disk usage manageable by sharing a base image across VMs rather than fully duplicating it. Each k8s node provisions itself through init_cfg.sh, while the admin server runs a separate admin.sh script with additional setup.

Provisioning Scripts

admin.sh handles the admin server’s initial configuration — timezone, firewall/SELinux, IP forwarding, SSH key distribution to all nodes, Helm installation, and disk expansion:

echo ">>>> Initial Config Start <<<<"

echo "[TASK 1] Change Timezone and Enable NTP"
timedatectl set-local-rtc 0
timedatectl set-timezone Asia/Seoul

echo "[TASK 2] Disable firewalld and selinux"
systemctl disable --now firewalld >/dev/null 2>&1
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

echo "[TASK 3] Setting Local DNS Using Hosts file"
sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts
echo "192.168.10.10 admin" >> /etc/hosts
for (( i=1; i<=$1; i++ )); do
  echo "192.168.10.1$i k8s-node$i" >> /etc/hosts
done

echo "[TASK 4] Delete default routing - enp0s9 NIC"
nmcli connection modify enp0s9 ipv4.never-default yes
nmcli connection up enp0s9 >/dev/null 2>&1

echo "[TASK 5] Config net.ipv4.ip_forward"
cat << EOF > /etc/sysctl.d/99-ipforward.conf
net.ipv4.ip_forward = 1
EOF
sysctl --system >/dev/null 2>&1

echo "[TASK 6] Install packages"
dnf install -y python3-pip git sshpass cloud-utils-growpart >/dev/null 2>&1

echo "[TASK 7] Install Helm"
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
  | DESIRED_VERSION=v3.20.0 bash >/dev/null 2>&1

echo "[TASK 8] Increase Disk Size"
growpart /dev/sda 3 >/dev/null 2>&1
xfs_growfs /dev/sda3 >/dev/null 2>&1

echo "[TASK 9] Setting SSHD"
echo "root:qwe123" | chpasswd
cat << EOF >> /etc/ssh/sshd_config
PermitRootLogin yes
PasswordAuthentication yes
EOF
systemctl restart sshd >/dev/null 2>&1

echo "[TASK 10] Setting SSH Key"
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa >/dev/null 2>&1
sshpass -p 'qwe123' ssh-copy-id -o StrictHostKeyChecking=no \
  root@192.168.10.10 >/dev/null 2>&1
for (( i=1; i<=$1; i++ )); do
  sshpass -p 'qwe123' ssh-copy-id -o StrictHostKeyChecking=no \
    root@192.168.10.1$i >/dev/null 2>&1
done
for (( i=1; i<=$1; i++ )); do
  ssh -o StrictHostKeyChecking=no root@k8s-node$i hostname >/dev/null 2>&1
done

echo "[TASK 11] Install K9s"
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
wget -P /tmp https://github.com/derailed/k9s/releases/latest/download/k9s_linux_${CLI_ARCH}.tar.gz >/dev/null 2>&1
tar -xzf /tmp/k9s_linux_${CLI_ARCH}.tar.gz -C /tmp
chown root:root /tmp/k9s
mv /tmp/k9s /usr/local/bin/
chmod +x /usr/local/bin/k9s

echo "[TASK 12] ETC"
echo "sudo su -" >> /home/vagrant/.bashrc

echo ">>>> Initial Config End <<<<"

Task 5 is particularly important: net.ipv4.ip_forward = 1 enables the admin server to function as a NAT gateway later, forwarding packets from the isolated k8s nodes to the internet. Task 8 uses growpart and xfs_growfs to grow partition 3 and its XFS filesystem so they fill the 120 GB disk specified in the Vagrantfile. Task 10 distributes SSH keys to all nodes, which is essential for Ansible (and therefore Kubespray) to work without password prompts.

init_cfg.sh runs on each k8s node and handles Kubernetes-specific prerequisites:

echo ">>>> Initial Config Start <<<<"

echo "[TASK 1] Change Timezone and Enable NTP"
timedatectl set-local-rtc 0
timedatectl set-timezone Asia/Seoul

echo "[TASK 2] Disable firewalld and selinux"
systemctl disable --now firewalld >/dev/null 2>&1
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

echo "[TASK 3] Disable and turn off SWAP & Delete swap partitions"
swapoff -a
sed -i '/swap/d' /etc/fstab
sfdisk --delete /dev/sda 2 >/dev/null 2>&1
partprobe /dev/sda >/dev/null 2>&1

echo "[TASK 4] Config kernel & module"
cat << EOF > /etc/modules-load.d/k8s.conf
overlay
br_netfilter
vxlan
EOF
modprobe overlay >/dev/null 2>&1
modprobe br_netfilter >/dev/null 2>&1

cat << EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sysctl --system >/dev/null 2>&1

echo "[TASK 5] Setting Local DNS Using Hosts file"
sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts
echo "192.168.10.10 admin" >> /etc/hosts
for (( i=1; i<=$1; i++ )); do
  echo "192.168.10.1$i k8s-node$i" >> /etc/hosts
done

echo "[TASK 6] Delete default routing - enp0s9 NIC"
nmcli connection modify enp0s9 ipv4.never-default yes
nmcli connection up enp0s9 >/dev/null 2>&1

echo "[TASK 7] Setting SSHD"
echo "root:qwe123" | chpasswd
cat << EOF >> /etc/ssh/sshd_config
PermitRootLogin yes
PasswordAuthentication yes
EOF
systemctl restart sshd >/dev/null 2>&1

echo "[TASK 8] Install packages"
dnf install -y python3-pip git >/dev/null 2>&1

echo "[TASK 9] ETC"
echo "sudo su -" >> /home/vagrant/.bashrc

echo ">>>> Initial Config End <<<<"

Task 3 addresses a default kubelet behavior: unless it is explicitly configured to tolerate swap, kubelet refuses to start when swap is enabled, so the script disables it, removes the fstab entry, and deletes the swap partition entirely. Task 4 registers three kernel modules that Kubernetes networking depends on: overlay for the container filesystem, br_netfilter so bridge traffic passes through iptables (required by kube-proxy and most CNI plugins), and vxlan for overlay network tunneling. The script loads overlay and br_netfilter immediately with modprobe, while vxlan is only listed in /etc/modules-load.d/k8s.conf, so it is loaded at the next boot or when something first requests it. The sysctl parameters in the same task ensure that bridged IPv4/IPv6 traffic gets processed by iptables and that IP forwarding is active.
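After provisioning, you may want to confirm that these prerequisites actually took effect. The helpers below are a rough sketch that reads the kernel's own tables; the function names are hypothetical, and the optional path arguments exist only so the functions can be exercised against sample files.

```shell
# Hypothetical preflight helpers for this lab. Paths default to the live
# system but can be overridden.

# swap_active: succeeds if the swaps table (default /proc/swaps) lists at
# least one device below its header line.
swap_active() {
  local table="${1:-/proc/swaps}"
  [ "$(tail -n +2 "$table" | grep -c .)" -gt 0 ]
}

# module_loaded: succeeds if the named module appears in the modules table
# (default /proc/modules). Modules built into the kernel are not listed there.
module_loaded() {
  local name="$1" table="${2:-/proc/modules}"
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$table"
}

# Usage:
#   swap_active && echo "swap is still on"
#   for m in overlay br_netfilter vxlan; do
#     module_loaded "$m" || echo "$m not loaded"
#   done
```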

Bringing It Up

mkdir k8s-offline && cd k8s-offline

curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/Vagrantfile
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/admin.sh
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/init_cfg.sh

vagrant up
vagrant status

Once all three VMs are running, SSH into each one:

ssh root@192.168.10.10   # admin  (password: qwe123)
ssh root@192.168.10.11   # k8s-node1
ssh root@192.168.10.12   # k8s-node2

At this point the VMs are up, the network is in place, and the k8s nodes still have internet access through their enp0s8 interfaces. The admin server has already distributed SSH keys to every node during provisioning, so Ansible can reach them without password prompts. The next step is to cut off that external access and start building the internal services that will replace it.

Core Infrastructure Services

This section cuts the k8s nodes off from the internet and builds the internal services they’ll depend on instead. Three pieces need to be in place before anything Kubernetes-related can happen: a network gateway with NAT, time synchronization via NTP, and DNS resolution.

Network Gateway & NAT

The goal here is simple: make it so the k8s nodes have zero direct internet access, and their only path to the outside world (when needed) runs through the admin server.

This involves two things — disabling the external interface on each node and pointing their default route at the admin server.

On each k8s node, bring down enp0s8 (the NAT-attached interface that Vagrant uses for internet access) and set enp0s9 (the private network interface) as the sole network path:

# Disable external interface immediately
nmcli connection down enp0s8
nmcli connection modify enp0s8 connection.autoconnect no

# Add default route through admin server (metric 200)
nmcli connection modify enp0s9 +ipv4.routes "0.0.0.0/0 192.168.10.10 200"
nmcli connection up enp0s9

You can verify the routing table looks correct:

ip route

On k8s-node1:

default via 192.168.10.10 dev enp0s9 proto static metric 200
192.168.10.0/24 dev enp0s9 proto kernel scope link src 192.168.10.11 metric 100

On k8s-node2:

default via 192.168.10.10 dev enp0s9 proto static metric 200
192.168.10.0/24 dev enp0s9 proto kernel scope link src 192.168.10.12 metric 100
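If you are checking several nodes, the default route can be pulled out of that output with a one-line filter. This is a minimal sketch; default_gateway is a hypothetical helper name.

```shell
# default_gateway: hypothetical helper. Reads `ip route` output on stdin and
# prints "<gateway> <device>" for the first default route it finds.
default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3, $5; exit }'
}

# Usage:
#   ip route | default_gateway    # this lab expects: 192.168.10.10 enp0s9
```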

The autoconnect no setting persists across reboots. You can confirm it by checking the connection file:

cat /etc/NetworkManager/system-connections/enp0s8.nmconnection

[connection]
id=enp0s8
uuid=7f94e839-e070-4bfe-9330-07090381d89f
type=ethernet
autoconnect=false
...

The enp0s9 connection file now carries the new route:

[ipv4]
address1=192.168.10.11/24
method=manual
never-default=true
route1=0.0.0.0/0,192.168.10.10

At this point, the nodes can’t reach anything outside the private network. A quick test confirms it:

ping -c 1 -w 1 -W 1 8.8.8.8
# 1 packets transmitted, 0 received, 100% packet loss

curl www.google.com
# curl: (6) Could not resolve host: www.google.com

Even manually setting DNS servers doesn’t help — the packets simply have nowhere to go without a NAT gateway:

cat << EOF > /etc/resolv.conf
nameserver 168.126.63.1
nameserver 8.8.8.8
EOF

curl www.google.com
# still fails — no route to the internet

On the admin server, enable NAT so internal traffic can reach the internet through enp0s8. IP forwarding was already enabled by admin.sh during provisioning, but you can verify and set it explicitly:

sysctl -w net.ipv4.ip_forward=1

cat << EOF > /etc/sysctl.d/99-ipforward.conf
net.ipv4.ip_forward = 1
EOF
sysctl --system

Now add the MASQUERADE rule:

iptables -t nat -A POSTROUTING -o enp0s8 -j MASQUERADE

Verify it:

iptables -t nat -L -n -v

Chain POSTROUTING (policy ACCEPT 1 packets, 120 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2   168 MASQUERADE  all  --  *      enp0s8  0.0.0.0/0            0.0.0.0/0

Go back to any k8s node and test — external connectivity should now work through the admin server’s NAT.

The important part comes next. Remove the NAT rule to simulate the actual air-gapped state:

iptables -t nat -D POSTROUTING -o enp0s8 -j MASQUERADE

Back on the k8s nodes, internet access is gone again. This toggle — NAT on during preparation, NAT off during installation — is how we prove the cluster deployment works entirely offline. For the rest of this guide, the NAT rule stays off unless explicitly noted otherwise.
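Since this toggle gets flipped several times, it may help to wrap it so that repeated calls are harmless. The sketch below uses iptables -C to test whether the rule already exists; nat_on, nat_off, and the NAT_IFACE variable are hypothetical names introduced for illustration.

```shell
# nat_on / nat_off: hypothetical wrappers around the MASQUERADE rule used in
# this guide. `iptables -C` checks whether the rule exists, so calling either
# function twice neither duplicates the rule nor fails on a missing one.
# Run as root on the admin server.
NAT_IFACE="${NAT_IFACE:-enp0s8}"

nat_on() {
  iptables -t nat -C POSTROUTING -o "$NAT_IFACE" -j MASQUERADE 2>/dev/null \
    || iptables -t nat -A POSTROUTING -o "$NAT_IFACE" -j MASQUERADE
}

nat_off() {
  if iptables -t nat -C POSTROUTING -o "$NAT_IFACE" -j MASQUERADE 2>/dev/null; then
    iptables -t nat -D POSTROUTING -o "$NAT_IFACE" -j MASQUERADE
  fi
}

# Usage:
#   nat_on     # preparation phase: nodes can reach the internet
#   nat_off    # installation phase: simulate the air gap
```

Without the check, running the plain -A command twice would leave two identical rules, and a single -D would then remove only one of them.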

NTP Server & Client (chrony)

Kubernetes components are sensitive to clock skew. Certificate validation, etcd leader election, log correlation — all of these break or behave unpredictably when nodes disagree about what time it is. In an air-gapped environment, the nodes can’t reach public NTP pools, so the admin server needs to act as the internal time source.

On the admin server, start by checking the current chrony status and configuration:

systemctl status chronyd.service --no-pager
grep "^[^#]" /etc/chrony.conf

The default Rocky Linux configuration looks like this:

pool 2.rocky.pool.ntp.org iburst
sourcedir /run/chrony-dhcp
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
ntsdumpdir /var/lib/chrony
logdir /var/log/chrony

A few of these settings are worth understanding. The iburst option sends a burst of packets right after startup to synchronize quickly instead of waiting for the normal polling interval.

The driftfile records how much the local hardware clock drifts from real time — chrony uses this to compensate even when the network is unavailable. The makestep 1.0 3 directive tells chrony to force an immediate time jump (rather than gradual slewing) if the offset exceeds 1 second, but only during the first 3 updates after startup.

And rtcsync periodically copies the system clock to the hardware RTC, so the time stays reasonably accurate across reboots.

You can check which NTP sources chrony is currently using:

chronyc sources -v
dig +short 2.rocky.pool.ntp.org

Now replace the configuration with one suited for our air-gapped setup:

cp /etc/chrony.conf /etc/chrony.bak

cat << EOF > /etc/chrony.conf
# External NTP servers (used during preparation phase)
server pool.ntp.org iburst
server kr.pool.ntp.org iburst

# Allow internal network clients to sync from this server
allow 192.168.10.0/24

# If external servers become unreachable, serve local time
local stratum 10

logdir /var/log/chrony
EOF

systemctl restart chronyd.service

The local stratum 10 line is critical for air-gapped operation. Without it, chrony would stop serving time to clients once it loses contact with upstream servers. With this setting, the admin server falls back to its own clock at stratum 10 — not perfectly accurate, but good enough to keep the cluster running.

Verify the configuration:

timedatectl status

               Local time: Wed 2026-02-11 22:36:16 KST
           Universal time: Wed 2026-02-11 13:36:16 UTC
                 RTC time: Wed 2026-02-11 14:09:31
                Time zone: Asia/Seoul (KST, +0900)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

chronyc sources -v

On each k8s node, point chrony at the admin server:

cp /etc/chrony.conf /etc/chrony.bak

cat << EOF > /etc/chrony.conf
server 192.168.10.10 iburst
logdir /var/log/chrony
EOF

systemctl restart chronyd.service

Check that synchronization is working:

timedatectl status

               Local time: Wed 2026-02-11 22:38:19 KST
           Universal time: Wed 2026-02-11 13:38:19 UTC
                 RTC time: Wed 2026-02-11 14:11:34
                Time zone: Asia/Seoul (KST, +0900)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

chronyc sources -v

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* admin                         0   7     0     -     +0ns[   +0ns] +/-    0ns

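The ^* prefix indicates this source is currently selected for synchronization. The sample above appears to have been captured right after the restart, which would explain why timedatectl still reports System clock synchronized: no and the Reach column is 0; if you see the same, wait a few seconds and run both commands again. Back on the admin server, you can confirm both nodes are connecting: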

chronyc clients

Hostname                      NTP   Drop Int IntL Last     Cmd   Drop Int  Last
===============================================================================
k8s-node1                       3      0   1   -     1       0      0   -     -
k8s-node2                       2      0   1   -     0       0      0   -     -

Both nodes are syncing their clocks through the admin server. Even after the NAT rule is removed and the admin server itself can’t reach public NTP pools, the local stratum 10 fallback keeps time flowing to the cluster.
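To check the selected source from a script rather than by eye, the chronyc sources output can be filtered for the ^* marker. A minimal sketch follows; selected_source is a hypothetical helper name.

```shell
# selected_source: hypothetical helper. Reads `chronyc sources` output on
# stdin and prints the name of the source marked "^*" (currently selected).
# Exits non-zero when no source is selected.
selected_source() {
  awk '$1 == "^*" { print $2; found = 1; exit } END { exit !found }'
}

# Usage (on a k8s node):
#   chronyc sources | selected_source || echo "no NTP source selected yet"
```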

DNS Server & Client (BIND)

The k8s nodes need to resolve hostnames — both internal names like admin and k8s-node1, and external names like registry.k8s.io (which appear in container image references even if the actual pull happens from the private registry). BIND on the admin server handles both: it resolves internal queries directly and forwards everything else to upstream DNS servers (when the NAT is active) or returns cached results (when offline).

On the admin server, install BIND:

dnf install -y bind bind-utils

Write the full /etc/named.conf:

cp /etc/named.conf /etc/named.bak

cat << EOF > /etc/named.conf
options {
        listen-on port 53 { any; };
        listen-on-v6 port 53 { ::1; };
        directory       "/var/named";
        dump-file       "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
        secroots-file   "/var/named/data/named.secroots";
        recursing-file  "/var/named/data/named.recursing";
        allow-query     { 127.0.0.1; 192.168.10.0/24; };
        allow-recursion { 127.0.0.1; 192.168.10.0/24; };

        forwarders {
                168.126.63.1;
                8.8.8.8;
        };

        recursion yes;

        dnssec-validation auto;

        managed-keys-directory "/var/named/dynamic";
        geoip-directory "/usr/share/GeoIP";

        pid-file "/run/named/named.pid";
        session-keyfile "/run/named/session.key";

        include "/etc/crypto-policies/back-ends/bind.config";
};

logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};

zone "." IN {
        type hint;
        file "named.ca";
};

include "/etc/named.rfc1912.zones";
include "/etc/named.root.key";
EOF

The key options here: listen-on port 53 { any; } makes BIND listen on all interfaces, so the k8s nodes can reach it. allow-query and allow-recursion restrict access to localhost and the private network. The forwarders block points to KT's public DNS (168.126.63.1) and Google's (8.8.8.8) — these handle any external domain lookups when the NAT gateway is active.

The dnssec-validation auto setting enables DNSSEC validation; if you run into resolution failures related to DNSSEC, you might need to set this to no as a workaround.

Validate and start the service:

named-checkconf /etc/named.conf   # no output means no errors
systemctl enable --now named

Set the admin server itself to use its own DNS:

echo "nameserver 192.168.10.10" > /etc/resolv.conf

Test it:

dig +short google.com @192.168.10.10
# 142.250.183.110

dig +short google.com
# 142.250.183.110

On each k8s node, there’s an extra step needed. NetworkManager likes to overwrite /etc/resolv.conf on every connection change or reboot. To prevent that:

cat << EOF > /etc/NetworkManager/conf.d/99-dns-none.conf
[main]
dns=none
EOF

systemctl restart NetworkManager

Now set the DNS to point at the admin server:

echo "nameserver 192.168.10.10" > /etc/resolv.conf

Test DNS resolution — and this is the interesting part. Even though the k8s nodes have no internet access (the NAT rule is off), DNS queries still work because BIND on the admin server forwards them through its own enp0s8 interface:

dig +short google.com @192.168.10.10
# 142.250.183.110

dig +short google.com
# 142.250.183.110

The query reaches the admin server's BIND instance, which forwards it to the upstream DNS servers through its internet-connected interface, and returns the answer to the k8s node. The k8s node itself never touches the internet directly. Once the admin server's NAT is also removed (for the fully air-gapped installation phase), BIND can still answer from its cache for domains it has previously resolved, but only until each record's TTL expires. New external domains won't resolve, but that's fine: by that point, everything the cluster needs is already available locally.
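To check several names in one go from a node, the dig test can be wrapped in a small loop. This is a sketch under assumptions: check_dns is a hypothetical helper, and the names passed to it are just examples from this lab.

```shell
# Hypothetical helper: check_dns SERVER NAME...
# Prints "UNRESOLVED: <name>" for every name that `dig +short` cannot
# resolve via SERVER, and returns non-zero if any lookup came back empty.
# The body runs in a subshell so its variables do not leak.
check_dns() (
  server=$1
  shift
  failed=0
  for name in "$@"; do
    # Drop dig's ";;" diagnostic lines (timeouts etc.) so they are not
    # mistaken for an answer.
    answer=$(dig +short "$name" "@$server" 2>/dev/null | grep -v '^;' || true)
    if [ -z "$answer" ]; then
      echo "UNRESOLVED: $name"
      failed=1
    fi
  done
  exit "$failed"
)

# Usage on a k8s node:
#   check_dns 192.168.10.10 google.com registry.k8s.io
```

Running it once with NAT active and once after NAT is removed gives a quick picture of which names are being served from cache.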

With NAT, NTP, and DNS in place, the admin server is ready to take on its next set of responsibilities: hosting the package repositories, container images, and Python packages that Kubespray will need during the offline installation.

Offline Repositories

Before touching kubespray-offline, it helps to understand the three categories of dependencies that Kubernetes needs — and how to serve each of them internally. The admin server has internet access during the preparation phase, so we use that window to mirror everything into local repositories. Once the mirror is ready and validated, the external route can be removed entirely.

kubespray-offline will automate most of this in a later step (Section 5), but knowing how each piece works makes troubleshooting far easier when something inevitably goes wrong.

Local YUM/DNF Mirror

Kubernetes nodes need OS-level packages — container runtimes, SELinux policy modules, socat, conntrack, ipvsadm, and many others. In an air-gapped network, dnf install has nowhere to go unless you provide a local repository.

On the admin server, dnf reposync pulls the entire upstream repository tree down to a local directory:

dnf reposync --repoid=baseos --download-metadata -p /data/repos/rocky/10
dnf reposync --repoid=appstream --download-metadata -p /data/repos/rocky/10
dnf reposync --repoid=extras --download-metadata -p /data/repos/rocky/10

This took about 12 minutes in the lab. The baseos repo came in at 6.2 GB, appstream at 14 GB, and extras at a modest 67 MB — roughly 20 GB total. Once synced, nginx serves the directory over HTTP at http://192.168.10.10/rpms/rocky/10/.

On each k8s node, the default Rocky Linux repo files get replaced with a single internal-rocky.repo pointing all three channels to the admin server's URL. After that, dnf install and dnf update work exactly as they would on a connected system — the nodes just don't know (or care) that the packages are coming from a machine one hop away.
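The repo file itself is not shown here, so the following is only a sketch of what internal-rocky.repo could look like. Assumptions: reposync created one subdirectory per repo ID under the -p path, nginx exposes that tree under /rpms/rocky/10/, the section names are made up, and gpgcheck=0 is acceptable for a lab. write_internal_repo is a hypothetical helper.

```shell
# Hypothetical generator for internal-rocky.repo. Assumptions: one
# subdirectory per repo ID below BASE_URL, and gpgcheck=0 as a lab
# simplification (keep GPG checks enabled in production).
write_internal_repo() (
  dest=$1
  base_url=$2
  : > "$dest"   # start from an empty file
  for repo in baseos appstream extras; do
    printf '[internal-%s]\nname=Internal Rocky %s\nbaseurl=%s/%s\nenabled=1\ngpgcheck=0\n\n' \
      "$repo" "$repo" "$base_url" "$repo" >> "$dest"
  done
)

# Usage on a k8s node, after moving the stock *.repo files out of the way:
#   write_internal_repo /etc/yum.repos.d/internal-rocky.repo http://192.168.10.10/rpms/rocky/10
#   dnf clean all && dnf repolist
```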

One thing to watch out for: reposync copies the RPMs, but you also need the repository metadata. The --download-metadata flag handles this, but if you ever re-sync and the metadata gets stale, run createrepo --update on the directory to regenerate it. Stale metadata leads to dependency resolution failures that look completely unrelated to the actual problem.

Private Container Registry

Every container image that the cluster needs — kube-apiserver, kube-proxy, coredns, flannel, pause, etcd, metrics-server, and more — must be available from an internal registry. In this lab, we run Docker Registry v3.0.0 on port 35000:

nerdctl run -d --name registry --network host \
  -e REGISTRY_HTTP_ADDR=0.0.0.0:35000 \
  -v /var/lib/registry:/var/lib/registry \
  registry:3.0.0

The --network host flag avoids NAT overhead and keeps things simple for a single-node registry. The data directory at /var/lib/registry is mounted as a volume, so registry contents survive container restarts.

The workflow for populating it is straightforward: pull an image on a machine with internet access, tag it with the 192.168.10.10:35000 prefix, then push. For the 55 container images that Kubespray requires, this process gets automated by load-push-all-images.sh in the kubespray-offline toolchain — but the principle is the same whether you push one image or five hundred.

Since this is an HTTP registry (not HTTPS), every node's containerd configuration must include it as an insecure registry. Miss this step and pulls fail with TLS-related errors (typically "http: server gave HTTP response to HTTPS client"), which can be confusing if you're not expecting them.

Private PyPI Mirror

This one catches people off guard. Kubespray is an Ansible project, and Ansible has Python dependencies: jinja2, PyYAML, netaddr, cryptography, and a handful of others. If the nodes can't reach pypi.org, pip install retries for a while and then fails with connection or SSL errors that don't point at the real cause.

Two approaches work here. The first is devpi-server, which runs a full PyPI-compatible index at http://192.168.10.10:3141. It caches packages as they're requested, which is convenient but requires the server itself to have had internet access at least once to prime the cache. The second approach uses pypi-mirror to generate a static file-based index served directly by nginx at http://192.168.10.10/pypi/. This is simpler to reason about — it's just files on disk — and pairs naturally with the nginx instance that's already serving RPMs.

Either way, global pip configuration goes into /etc/pip.conf:

[global]
index-url = http://192.168.10.10/pypi/
trusted-host = 192.168.10.10

With this in place, every pip install on the network resolves against the internal mirror. The trusted-host line is necessary because we're serving over plain HTTP — without it, pip refuses the connection by default.

Worth noting: kubespray-offline’s pypi-mirror.sh script handles the package download during the preparation phase, and setup-offline.sh writes the pip configuration automatically during deployment. Understanding the manual setup here mostly pays off when you need to add Python packages later, after the initial deployment is done and you're operating the cluster day-to-day.

Kubespray-Offline: Download & Prepare

With the core infrastructure in place — NAT gateway, DNS, NTP, and local repos all running on the admin server — it’s time to tackle the big question: how do you get everything Kubernetes needs onto a machine that can’t reach the internet?

https://devlos.tistory.com/120

The answer is kubespray-offline. This project wraps around Kubespray (the popular Ansible-based Kubernetes installer) and adds a complete offline preparation layer. You run a single master script on an internet-connected machine; it pulls down every binary, container image, RPM package, and Python dependency that Kubespray will need and packages them into a portable outputs/ directory. Copy that directory to your air-gapped admin server, run a few setup scripts, and you've got a fully self-contained deployment environment.

In my case, the entire download took about 17 minutes and produced roughly 3.3 GB of artifacts.

git clone https://github.com/kubespray-offline/kubespray-offline.git
cd kubespray-offline

How download-all.sh Works

The entry point is download-all.sh, and it's essentially a runner that calls 10 sub-scripts in sequence. Each one handles a specific category of dependencies. Here's what happens when you execute it:

cd /root/kubespray-offline
./download-all.sh

The first script to run is config.sh. This is where every version number lives — Kubespray itself, containerd, runc, the CNI plugins, nginx, the registry image, and so on. In my setup it looked like this:

# config.sh (key variables)
KUBESPRAY_VERSION=2.30.0
CONTAINERD_VERSION=2.2.1
RUNC_VERSION=1.3.4
CNI_VERSION=1.8.0
NGINX_VERSION=1.29.4
REGISTRY_VERSION=3.0.0

Getting these version numbers right matters more than you might expect. If config.sh says containerd 2.2.1 but you later tell Kubespray to use a different version in your inventory, the playbook will try to download a file that doesn't exist on your local nginx server — and in an air-gapped network, there's nowhere to fall back to.
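One way to catch such a mismatch before the playbook does is to check that the version you plan to use actually appears in the generated download list. A minimal sketch, assuming files.list holds one URL per line and that the artifact name and version both appear in the URL; artifact_listed is a hypothetical helper:

```shell
# Hypothetical helper: artifact_listed NAME VERSION FILES_LIST
# Succeeds if some line in FILES_LIST mentions both NAME and VERSION.
# Fixed-string matching (-F) keeps the dots in versions literal.
artifact_listed() {
  grep -F -- "$1" "$3" | grep -qF -- "$2"
}

# Usage before overriding a version in the inventory:
#   artifact_listed containerd 2.2.1 outputs/files/files.list \
#     || echo "containerd 2.2.1 was never downloaded"
```

This is a coarse substring check, not a full URL parser, but it is enough to flag a version that was never mirrored.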

After config.sh, the scripts run in this order:

precheck.sh validates that podman or docker is installed and checks SELinux status. Nothing fancy, but it catches the obvious "you forgot to install podman" mistake early.

prepare-pkgs.sh installs the system-level tools needed for the rest of the process: rsync, gcc, libffi-devel, createrepo, git, podman, and python3. These all come from the standard Rocky Linux repos, so this script needs internet access.

prepare-py.sh creates a Python virtual environment at ~/.venv/3.12 and installs the Ansible dependencies into it. Kubespray is fundamentally a collection of Ansible playbooks, so having the right Python environment is a prerequisite for everything that follows.

get-kubespray.sh downloads the kubespray-2.30.0.tar.gz release from GitHub and extracts it into the cache directory.

pypi-mirror.sh is where things get interesting for the offline story. It downloads every Python package that Kubespray's Ansible roles will need — pip, setuptools, ansible-core, PyYAML, jinja2, and dozens of others — into a local directory that will later be served as a static PyPI mirror.

download-kubespray-files.sh does the heaviest lifting. It runs Kubespray's own contrib/offline/generate_list.sh to produce two critical files: files.list and images.list. The files list contained 27 URLs pointing to binaries like kubectl, kubelet, kubeadm, etcd, containerd, crictl, and the CNI plugins. The images list had 55 container image references — everything from registry.k8s.io/kube-apiserver to docker.io/flannel/flannel to the CoreDNS and metrics-server images. The script then downloads every one of these.

# Generated lists (under outputs/)
outputs/files/files.list    # 27 binary URLs
outputs/images/images.list  # 55 container images

download-additional-containers.sh pulls two extra images that aren't part of Kubernetes itself but are needed for the offline infrastructure: nginx:1.29.4 (to serve files over HTTP) and registry:3.0.0 (the private container registry).

create-repo.sh builds a local RPM repository using createrepo and adds modulemd metadata for RHEL 8/9 compatibility. This repo gets bundled into the outputs directory so that target nodes can install OS packages without touching the internet.

Finally, copy-target-scripts.sh copies the deployment scripts into outputs/, making the whole directory self-contained and ready to transfer.

When download-all.sh finishes, the outputs/ directory has everything:

outputs/
├── files/          # Kubernetes binaries (kubectl, kubelet, etcd, containerd, etc.)
├── images/         # Container images as .tar.gz archives
├── rpms/           # RPM packages with repo metadata
├── pypi/           # Python packages for offline pip
├── kubespray-2.30.0/   # Kubespray source with playbooks
├── setup-container.sh
├── start-nginx.sh
├── start-registry.sh
├── load-push-all-images.sh
├── setup-offline.sh
└── setup-py.sh

At this point, you copy the entire outputs/ directory to your air-gapped admin server (via USB, SCP through a bastion, or whatever transfer method your security policy allows) and move on to deployment.

Deploying the Nginx File Server and Registry

On the air-gapped admin server, the deployment scripts in outputs/ are meant to be run in a specific order. Skipping a step or running them out of sequence will break things downstream.

The first script is setup-container.sh. It installs containerd, runc, nerdctl, and the CNI plugins from the local files — no package manager involved, just direct binary installation from the archives that were downloaded earlier. After this runs, you have a working container runtime on the admin server.

cd /root/kubespray-offline/outputs
./setup-container.sh

Next, start-nginx.sh launches an nginx container on port 80. This nginx instance serves the entire outputs/ directory over HTTP — the binaries in files/, the container image tarballs in images/, the Python packages in pypi/, and the RPM repository in rpms/. Any node on the 192.168.10.0/24 network can now fetch these artifacts from http://192.168.10.10/.

./start-nginx.sh
# nginx:1.29.4 now running on port 80
# Serves: files/, images/, pypi/, rpms/ over HTTP

Then start-registry.sh brings up the Docker Registry v3.0.0 container, listening on port 35000. At this point the registry is empty — it's just waiting for images to be pushed into it.

./start-registry.sh
# registry:3.0.0 now running on port 35000

One thing worth noting: the registry runs with --network host and binds directly to 0.0.0.0:35000, so there's no port mapping or container networking complexity to debug. If port 35000 is reachable, the registry is reachable.

Loading and Pushing Images

This is the step that takes the longest in the deployment phase, and it’s also the one most likely to reveal problems if something went wrong during the download phase.

load-push-all-images.sh iterates through every .tar.gz file in the images/ directory. For each image, it loads the archive into the local containerd image store, re-tags it with the 192.168.10.10:35000 prefix, and pushes it to the private registry.

./load-push-all-images.sh

Under the hood, each image goes through three operations:

# What happens for each image (conceptually):
nerdctl load -i images/kube-apiserver-v1.34.3.tar.gz
nerdctl tag registry.k8s.io/kube-apiserver:v1.34.3 192.168.10.10:35000/kube-apiserver:v1.34.3
nerdctl push 192.168.10.10:35000/kube-apiserver:v1.34.3 --insecure-registry

With 55 images to process, this took about 2–3 minutes in my lab. On slower hardware or with larger image sets, expect it to take longer.

Once the script finishes, you can verify everything landed correctly:

curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool

This should return a JSON list of all 55+ repositories in the registry. If any image is missing here, Kubespray will fail when it tries to pull that image during cluster installation — and the error message won’t always make it obvious that the root cause is a missing image in your local registry.
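Eyeballing 55 entries is error-prone, so it can help to diff the catalog against images.list. The sketch below assumes every entry in images.list is fully qualified (registry host first) and that the push step drops the upstream host, as in the kube-apiserver example above. missing_repos is a hypothetical helper, and the catalog file is assumed to hold one repository name per line.

```shell
# Hypothetical helper: missing_repos IMAGES_LIST CATALOG_FILE
# Prints every repository referenced in IMAGES_LIST that is absent from
# CATALOG_FILE (one repository name per line).
missing_repos() (
  images_list=$1
  catalog=$2
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue
    path=${ref#*/}        # drop the upstream registry host
    repo=${path%%[:@]*}   # drop the tag or digest
    grep -qxF -- "$repo" "$catalog" || echo "$repo"
  done < "$images_list"
)

# Usage on the admin server (n=1000 raises the catalog page size):
#   curl -s 'http://192.168.10.10:35000/v2/_catalog?n=1000' \
#     | python3 -c 'import json,sys; print("\n".join(json.load(sys.stdin)["repositories"]))' \
#     > /tmp/catalog.txt
#   missing_repos outputs/images/images.list /tmp/catalog.txt
```

Empty output means every repository from the list is present; it does not verify that the specific tags exist.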

After the images are pushed, two more setup scripts finalize the environment:

setup-offline.sh configures the admin server to use its own local repositories. It writes /etc/yum.repos.d/offline.repo pointing to the local RPM mirror, and creates ~/.config/pip/pip.conf pointing pip at the local PyPI mirror. From this point forward, even dnf install and pip install on the admin server itself go through the local copies.

setup-py.sh installs Python 3.12 from the offline RPM repository and prepares the Ansible environment. Kubespray's playbooks need a specific set of Python packages, and this script makes sure they're all available without any network calls.

./setup-offline.sh    # Configure local repos for dnf and pip
./setup-py.sh         # Install Python 3.12 + Ansible from offline packages

At this point, the admin server is fully self-sufficient. It has a file server, a container registry populated with every required image, local package repositories for both RPMs and Python packages, and a working Ansible installation. The cluster deployment can proceed entirely offline.

The whole deployment phase — from running setup-container.sh to having a populated registry and configured repos — took about 3 minutes in my lab environment. Most of that time was spent in load-push-all-images.sh. The other scripts finished in seconds.

Installing & Verifying the Cluster

At this point, the admin server is doing a lot of heavy lifting. It’s serving OS packages over HTTP, hosting a container registry on port 35000, mirroring PyPI, and running an nginx file server with all the Kubernetes binaries staged and ready. The k8s nodes can reach all of these services over the internal 192.168.10.0/24 network, and nothing else.

Now comes the part where all that preparation pays off.

Preparing the Nodes for Offline Installation

Before Kubespray touches the k8s nodes, a few things need to be in place on the admin server: a working container runtime, the file server and registry, access to the offline package repo, and a pip configuration pointing at the internal mirror. The kubespray-offline project ships a set of scripts that handle this sequentially; this is the same sequence described in the previous section, collected in one place.

On the admin server, from the outputs directory:

cd /root/kubespray-offline/outputs

# Install containerd, runc, nerdctl, and CNI plugins from local files
./setup-container.sh

# Start nginx on port 80 — serves binaries, images, PyPI, and RPMs
./start-nginx.sh

# Start the private registry on port 35000
./start-registry.sh

# Load every .tar.gz image, re-tag with the 192.168.10.10:35000 prefix, and push
./load-push-all-images.sh

# Point yum and pip at the internal servers
./setup-offline.sh

# Install Python 3.12 from the offline repo
./setup-py.sh

The load-push-all-images.sh step takes the longest. It walks through every image archive in the outputs/images/ directory, loads each one into the local container runtime, re-tags it with the 192.168.10.10:35000 prefix, and pushes it to the private registry. For 55 images, expect this to take a couple of minutes.

After setup-offline.sh runs, you can verify that the admin server is pointed at the right repos (the k8s nodes get the same configuration later from the offline-repo playbook):

# Check yum is configured for the internal mirror
cat /etc/yum.repos.d/offline.repo

# Check pip is configured for the internal PyPI
cat ~/.config/pip/pip.conf

Kubespray Configuration

Kubespray lives inside the outputs directory as a tarball that was extracted during the download phase. The inventory and group variables need a few edits before anything can run.

cd /root/kubespray-offline/outputs/kubespray-2.30.0

The Inventory File

The inventory defines which nodes play which roles. For this two-node setup, k8s-node1 serves as the control plane and etcd host, while k8s-node2 is a pure worker. k8s-node1 is also listed under kube_node, so it can run regular workloads as well.

# inventory/mycluster/inventory.ini

[all]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11
k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12

[kube_control_plane]
k8s-node1

[etcd]
k8s-node1

[kube_node]
k8s-node1
k8s-node2

[k8s_cluster:children]
kube_control_plane
kube_node

Nothing unusual here — a minimal two-node topology. In production you’d want three etcd members and separate control plane nodes, but for validating the offline workflow, this is enough.
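When the inventory grows, a quick count per group helps catch an accidental even number of etcd members. This is a small sketch that assumes the simple INI layout shown above (one host per line, groups in square brackets); count_group is a hypothetical helper.

```shell
# Hypothetical helper: count_group INVENTORY GROUP
# Counts the non-empty, non-comment lines under [GROUP] in an INI inventory.
count_group() {
  awk -v group="[$2]" '
    /^\[/ { in_group = ($0 == group); next }   # entering a new section
    in_group && NF && $0 !~ /^#/ { n++ }       # host line inside the group
    END { print n + 0 }
  ' "$1"
}

# Usage:
#   n=$(count_group inventory/mycluster/inventory.ini etcd)
#   [ $((n % 2)) -eq 1 ] || echo "warning: even number of etcd members ($n)"
```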

The offline.yml File

This is the file that makes the entire air-gap setup work. It tells Kubespray where to find every binary and container image, replacing all the default upstream URLs with internal equivalents.

# inventory/mycluster/group_vars/all/offline.yml

http_server: "http://192.168.10.10"
registry_host: "192.168.10.10:35000"

# Binaries
kubeadm_download_url: "{{ http_server }}/kubernetes/kubeadm-{{ kube_version }}-linux-{{ image_arch }}"
kubectl_download_url: "{{ http_server }}/kubernetes/kubectl-{{ kube_version }}-linux-{{ image_arch }}"
kubelet_download_url: "{{ http_server }}/kubernetes/kubelet-{{ kube_version }}-linux-{{ image_arch }}"

# This one bit me — the default has "linux-amd64" hardcoded.
# If you're on ARM64 or want portability, use the template variable.
etcd_download_url: "{{ http_server }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"

containerd_download_url: "{{ http_server }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
runc_download_url: "{{ http_server }}/runc.{{ image_arch }}"
cni_download_url: "{{ http_server }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz"

# Container images — all pulled from the private registry
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"

# Registry mirror configuration for containerd
containerd_registries_mirrors:
  - prefix: docker.io
    mirrors:
      - host: "http://192.168.10.10:35000"
  - prefix: quay.io
    mirrors:
      - host: "http://192.168.10.10:35000"
  - prefix: registry.k8s.io
    mirrors:
      - host: "http://192.168.10.10:35000"

A word on the etcd_download_url line. The default value shipped by kubespray-offline had linux-amd64 hardcoded instead of linux-{{ image_arch }}. On an x86_64 machine this works fine and you'd never notice.

But the moment you try to run this on an ARM64 host, the download fails with a 404 because the file simply doesn't exist at that path. Swapping in {{ image_arch }} makes the template portable across architectures.
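To see what the template resolves to on a given host, you can reproduce the substitution by hand. The sketch below is illustrative only: to_image_arch and etcd_url are hypothetical helpers, the mapping covers just the two architectures discussed here, and the etcd version in the usage line is a placeholder rather than the version this lab used.

```shell
# Map `uname -m` output to the architecture names used in release file
# names. Only the two architectures discussed here are handled.
to_image_arch() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Render the etcd URL the same way the etcd_download_url template would.
etcd_url() (
  http_server=$1
  etcd_version=$2
  arch=$3
  echo "${http_server}/kubernetes/etcd/etcd-v${etcd_version}-linux-${arch}.tar.gz"
)

# Usage (X.Y.Z is a placeholder version):
#   curl -sI "$(etcd_url http://192.168.10.10 X.Y.Z "$(to_image_arch "$(uname -m)")")" | head -n 1
```

A 404 on the first response line is the same failure the playbook would hit, just surfaced a lot earlier.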

Cluster Variables

Beyond the offline-specific settings, a few cluster-level variables need attention. These go in inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml or can be passed as extra vars:

# Network plugin — Flannel is lightweight and works well for lab environments
kube_network_plugin: flannel
flannel_interface: enp0s9

# Proxy mode
kube_proxy_mode: iptables

# Disable nodelocaldns — one less thing to debug in a lab
enable_nodelocaldns: false

# Enable Helm and metrics-server
helm_enabled: true
metrics_server_enabled: true

The flannel_interface setting matters more than it might look. VirtualBox VMs typically have multiple network interfaces — enp0s3 for NAT, enp0s8 for the host-only network, enp0s9 for the internal network.

Flannel needs to bind to the right one, and if you don't specify it, it might pick the NAT interface and nothing will route correctly. I spent an annoying amount of time on this the first time around.
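If you are not sure which interface name to put in flannel_interface, you can look it up from the node's address on the cluster network. A minimal sketch, assuming the one-line-per-address format of ip -o -4 addr show; iface_for_prefix is a hypothetical helper and the prefix match is a plain string comparison, not real CIDR math:

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# the interface whose IPv4 address starts with the given prefix string.
iface_for_prefix() {
  awk -v prefix="$1" '$3 == "inet" && index($4, prefix) == 1 { print $2 }'
}

# Usage on a k8s node:
#   ip -o -4 addr show | iface_for_prefix 192.168.10.
```

On the nodes in this lab, the result should match the enp0s9 value used above.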

Running the Playbook

With the configuration in place, the actual deployment happens in two playbook runs.

The first playbook sets up the offline repository configuration on all target nodes — making sure yum and pip point to the admin server:

cd /root/kubespray-offline/outputs/kubespray-2.30.0

ansible-playbook -i inventory/mycluster/inventory.ini \
  offline-repo/playbook/offline-repo.yml

This is a short run. It pushes the repo files and pip configuration to each node, and verifies that packages can be resolved from the internal mirror.

The second playbook is the main event — the full Kubernetes cluster deployment:

ansible-playbook -i inventory/mycluster/inventory.ini \
  cluster.yml \
  -e kube_version="1.34.3"

Most of the time goes to the containerd installation, etcd bootstrap, and control plane initialization. Flannel comes up quickly once the kubelet is running.

If something fails mid-run — and in an air-gap environment, something probably will on your first attempt — Ansible’s idempotency means you can fix the issue and re-run the same command. It’ll skip the tasks that already completed successfully.

A few things that tend to go wrong on the first try:

The flannel timeout. If flannel can’t find its subnet environment file, it usually means the default route is missing on the internal interface. The fix is to add one before running the playbook:

# On each k8s node
nmcli connection modify enp0s9 +ipv4.routes "0.0.0.0/0 192.168.10.10 200"
nmcli connection up enp0s9

The metric 200 keeps this from conflicting with any existing default route on another interface.

Image pull failures. If you see failed to pull image: no route to host, the image is either missing from the private registry or containerd's mirror configuration hasn't been applied yet. Quick check:

# List everything in the registry
curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool

# Verify containerd mirror config on the node
cat /etc/containerd/certs.d/docker.io/hosts.toml

Post-Install Verification

Once the playbook finishes without errors, SSH into the control plane node and run through the basics.

kubectl get nodes -o wide

You should see both nodes in Ready state:

NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP      OS-IMAGE
k8s-node1   Ready    control-plane   5m    v1.34.3   192.168.10.11    Rocky Linux 10
k8s-node2   Ready    <none>          4m    v1.34.3   192.168.10.12    Rocky Linux 10

Next, check that all system pods are running:

kubectl get pods -A

The output should show healthy pods for coredns, flannel, kube-apiserver, kube-controller-manager, kube-scheduler, kube-proxy, etcd, and metrics-server. Every single one of these images should have been pulled from 192.168.10.10:35000 — that's the whole point.

You can confirm the image source on any deployment:

kubectl get deploy -n kube-system -o wide

Every image field should show the 192.168.10.10:35000 prefix. If any image references an external registry like registry.k8s.io or docker.io directly, something in the offline.yml configuration was missed.
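Deployments only cover part of the picture (static pods and DaemonSets are not listed), so a broader check is to scan the images of all running pods. This is a sketch: external_images is a hypothetical helper that filters a list of image references on stdin, and the kubectl invocation in the usage comment is one way to produce that list.

```shell
# Hypothetical helper: reads image references on stdin (one per line) and
# prints the unique ones that do NOT start with "<registry>/".
external_images() {
  awk -v prefix="$1/" 'index($0, prefix) != 1' | sort -u
}

# Usage on the control plane node:
#   kubectl get pods -A \
#     -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
#     | external_images 192.168.10.10:35000
```

Empty output means every container image in the cluster carries the internal registry prefix.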

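A quick smoke test to make sure workloads actually schedule and run. This assumes nginx:alpine has already been pushed to the private registry under library/nginx (the push is shown in the Day-2 section below); otherwise substitute any image that is already in the registry: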

kubectl run test-nginx --image=192.168.10.10:35000/library/nginx:alpine --port=80
kubectl get pod test-nginx -w

Wait for it to reach Running status, then clean up:

kubectl delete pod test-nginx

If Helm was enabled, verify that too:

helm version
helm list -A

At this point, you have a working Kubernetes cluster that was deployed entirely from local resources. No internet traffic left the 192.168.10.0/24 network during the installation. The registry holds all 55 container images, the nginx file server has every binary, and the nodes are configured to resolve everything internally.

The planning and infrastructure setup that came before this section took the most time. The actual Kubespray run is the easy part; getting the offline supply chain right is where the real work lives.

Day-2 Operations

The cluster is up. Pods are running, kubectl get nodes shows everything Ready, and you've confirmed that all system images were pulled from your internal registry at 192.168.10.10:35000. Now what?

A Kubernetes cluster that can only run its own system components isn’t very useful. You need to deploy actual workloads — your applications, third-party tools, monitoring stacks — and all of those images still need to come from somewhere inside your network. Same goes for Helm charts if you’re using them. This section covers the three approaches I worked through for getting application images into the cluster, plus how to set up Helm in an OCI-native way that plays nicely with your existing registry infrastructure.

Deploying Applications via the Private Registry

The most straightforward approach: pull the image you need on a machine that has internet access (or use a previously downloaded tarball), tag it with your internal registry prefix, push it, and reference the full path in your Kubernetes manifests.

On the admin server, where podman is already installed:

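# Pull from Docker Hub (admin server has internet via NAT).
# The fully qualified name avoids podman's short-name registry prompt.
podman pull docker.io/library/nginx:alpine

# Tag it for the internal registry
podman tag docker.io/library/nginx:alpine 192.168.10.10:35000/library/nginx:alpine

# Push to the private registry (plain HTTP, so skip TLS verification)
podman push --tls-verify=false 192.168.10.10:35000/library/nginx:alpine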

You can verify the image landed correctly:

curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool

{
    "repositories": [
        "library/nginx",
        "flannel/flannel",
        "flannel/flannel-cni-plugin",
        "coredns/coredns",
        ...
    ]
}

To check which tags exist for a given image:

curl -s http://192.168.10.10:35000/v2/library/nginx/tags/list

{
    "name": "library/nginx",
    "tags": ["alpine"]
}

Now, when you write a Deployment manifest, you reference the full internal path:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: 192.168.10.10:35000/library/nginx:alpine
        ports:
        - containerPort: 80

This works, and it’s easy to reason about. Every manifest explicitly states where the image comes from. The downside is obvious, though — every image reference in every manifest, every Helm values file, every quick kubectl run command needs the full 192.168.10.10:35000/... prefix. If you ever migrate to a different registry or change the port, you're doing a find-and-replace across your entire deployment inventory.

Containerd Registry Mirror Configuration

A better approach for most teams: configure containerd on each node to automatically redirect image pulls to your internal registry. When a pod spec says image: nginx:alpine, containerd checks the internal registry first before (failing to) reach Docker Hub.

The configuration lives in containerd’s certs.d directory. For each upstream registry you want to mirror, you create a hosts.toml file:

/etc/containerd/certs.d/
├── docker.io/
│   └── hosts.toml
├── quay.io/
│   └── hosts.toml
├── registry.k8s.io/
│   └── hosts.toml
└── gcr.io/
    └── hosts.toml

Here’s the hosts.toml for Docker Hub:

# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://docker.io"

[host."http://192.168.10.10:35000"]
  capabilities = ["pull", "resolve"]
  skip_verify = true

For quay.io:

# /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"

[host."http://192.168.10.10:35000"]
  capabilities = ["pull", "resolve"]
  skip_verify = true

Same pattern for registry.k8s.io and gcr.io.
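Since all four files follow the same template, they can be generated in a loop. A minimal sketch, assuming the server URL for each upstream is simply https://<registry name> as in the two files above; write_mirror_hosts is a hypothetical helper:

```shell
# Hypothetical helper: write_mirror_hosts ROOT MIRROR REGISTRY...
# Writes ROOT/<registry>/hosts.toml for each upstream registry, pointing
# pulls at MIRROR with the same settings as the hand-written files.
write_mirror_hosts() (
  root=$1
  mirror=$2
  shift 2
  for reg in "$@"; do
    mkdir -p "$root/$reg"
    printf 'server = "https://%s"\n\n[host."%s"]\n  capabilities = ["pull", "resolve"]\n  skip_verify = true\n' \
      "$reg" "$mirror" > "$root/$reg/hosts.toml"
  done
)

# Usage on each k8s node:
#   write_mirror_hosts /etc/containerd/certs.d http://192.168.10.10:35000 \
#     docker.io quay.io registry.k8s.io gcr.io
```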

After placing these files, restart containerd on each node:

systemctl restart containerd

From this point on, a pod spec that says image: nginx:alpine will resolve through the mirror. Containerd sees the pull request for docker.io/library/nginx:alpine, checks hosts.toml, finds the mirror entry pointing to 192.168.10.10:35000, and pulls from there instead. Your manifests stay clean — no registry prefixes, no special paths.

One thing to be aware of: the image still needs to exist in the internal registry under the correct repository path. When you pushed library/nginx:alpine earlier, that matches the default Docker Hub library path. For non-library images (say, grafana/grafana:latest), you need to push with the matching path: 192.168.10.10:35000/grafana/grafana:latest.
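The path-matching rule can be sketched as a small function that turns an upstream reference into the name it needs in the internal registry. Assumptions: internal_ref is a hypothetical helper, the first path component counts as a registry host only if it contains a dot or colon (or is localhost), and bare Docker Hub names get the library/ prefix.

```shell
# Hypothetical helper: internal_ref IMAGE [REGISTRY]
# Prints the reference IMAGE must be pushed under so that a containerd
# mirror lookup finds it (upstream host dropped, repository path kept).
internal_ref() (
  ref=$1
  registry=${2:-192.168.10.10:35000}
  host=docker.io   # default when the reference names no registry
  path=$ref
  case "$ref" in
    */*)
      first=${ref%%/*}
      case "$first" in
        *.*|*:*|localhost) host=$first; path=${ref#*/} ;;
      esac
      ;;
  esac
  # Official Docker Hub images live under library/
  if [ "$host" = docker.io ]; then
    case "$path" in
      */*) ;;
      *) path="library/$path" ;;
    esac
  fi
  echo "$registry/$path"
)

# Usage:
#   podman tag grafana/grafana:latest "$(internal_ref grafana/grafana:latest)"
```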

You can set this up manually on each node, but that’s tedious. The better way is to let Kubespray handle it.

Automating Mirror Setup with Kubespray

In your inventory’s group_vars/all/offline.yml, add the containerd_registries_mirrors block:

# inventory/mycluster/group_vars/all/offline.yml (append)

containerd_registries_mirrors:
  - prefix: docker.io
    mirrors:
      - host: "http://192.168.10.10:35000"
        capabilities: ["pull", "resolve"]
        skip_verify: true
  - prefix: quay.io
    mirrors:
      - host: "http://192.168.10.10:35000"
        capabilities: ["pull", "resolve"]
        skip_verify: true
  - prefix: registry.k8s.io
    mirrors:
      - host: "http://192.168.10.10:35000"
        capabilities: ["pull", "resolve"]
        skip_verify: true
  - prefix: gcr.io
    mirrors:
      - host: "http://192.168.10.10:35000"
        capabilities: ["pull", "resolve"]
        skip_verify: true

Apply it across the cluster with a targeted Ansible run:

cd /root/kubespray-offline/outputs/kubespray-2.30.0

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  --tags containerd

This regenerates the hosts.toml files on every node and restarts containerd automatically. Much cleaner than SSH-ing into each machine.

To verify the mirror is working, try deploying something with a plain Docker Hub image reference:

kubectl run test-nginx --image=nginx:alpine --restart=Never

kubectl describe pod test-nginx | grep "Successfully pulled"

You should see the pull succeed even though the nodes have no internet access. The image is transparently served from 192.168.10.10:35000.

Helm Chart Management with OCI Registry

Helm has supported OCI registries as chart storage experimentally since Helm 3.0, and the feature has been GA since 3.8. This is a big win for air-gapped environments because you don’t need a separate chart repository server like ChartMuseum: your existing container registry at 192.168.10.10:35000 can store both container images and Helm charts.

The workflow for getting a chart into the registry:

# Say you have a chart directory at /root/nginx-chart/
# Package it into a .tgz archive
helm package /root/nginx-chart

# Output: Successfully packaged chart and saved it to:
#         /root/nginx-chart-1.0.0.tgz

Push the packaged chart to the OCI registry:

helm push nginx-chart-1.0.0.tgz oci://192.168.10.10:35000/helm-charts

If your registry is running plain HTTP (no TLS), Helm will refuse the push by default because it assumes HTTPS. Helm 3.13 and later accept a --plain-http flag that tells the registry client to use HTTP for that command:

helm push nginx-chart-1.0.0.tgz oci://192.168.10.10:35000/helm-charts \
  --plain-http

The same flag applies to the helm pull, helm install, and helm template commands that follow. If the registry instead serves HTTPS with a self-signed certificate, use --insecure-skip-tls-verify (or --ca-file) rather than --plain-http.

Verify the chart is stored:

curl -s http://192.168.10.10:35000/v2/helm-charts/nginx-chart/tags/list

{
    "name": "helm-charts/nginx-chart",
    "tags": ["1.0.0"]
}

Install the chart directly from the OCI registry:

helm install my-nginx oci://192.168.10.10:35000/helm-charts/nginx-chart \
  --version 1.0.0

You can also pull the chart locally first if you want to inspect it before installing:

helm pull oci://192.168.10.10:35000/helm-charts/nginx-chart --version 1.0.0

# This downloads nginx-chart-1.0.0.tgz to the current directory
tar -tzf nginx-chart-1.0.0.tgz | head -10

helm install my-nginx ./nginx-chart-1.0.0.tgz

For teams that use helm template to render manifests before applying them (a common pattern in GitOps workflows), you can template directly from OCI:

helm template my-nginx oci://192.168.10.10:35000/helm-charts/nginx-chart \
  --version 1.0.0 \
  --set replicaCount=3 \
  > rendered-nginx.yaml

kubectl apply -f rendered-nginx.yaml

Migrating Public Charts to the Internal Registry

In practice, you’ll want to pull popular charts from the internet (on a connected machine), then push them into your internal registry for air-gapped consumption. Here’s the pattern I used:

# On a machine with internet access:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Pull the chart as a .tgz file
helm pull bitnami/postgresql --version 16.4.1

# Transfer postgresql-16.4.1.tgz to the admin server (USB, scp, whatever)
# Then on the admin server:
helm push postgresql-16.4.1.tgz oci://192.168.10.10:35000/helm-charts

Now anyone in the cluster can install PostgreSQL:

helm install my-pg oci://192.168.10.10:35000/helm-charts/postgresql \
  --version 16.4.1 \
  --set auth.postgresPassword=secretpassword

One gotcha: the chart itself might reference container images from Docker Hub or other public registries. If you’ve configured the containerd mirrors from section 7.2, those pulls will redirect to your internal registry automatically — but only if you’ve already pushed those images. For a PostgreSQL Helm chart, that means you need bitnami/postgresql, bitnami/postgres-exporter, and whatever else the chart's values.yaml references. Check the default values before deploying:

helm show values oci://192.168.10.10:35000/helm-charts/postgresql \
  --version 16.4.1 | grep -i "image:" -A 3

This tells you exactly which images to stage in your registry.
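The values file shows defaults, but the rendered manifests show what will actually be pulled. As a complement, here is a small sketch that extracts image references from rendered YAML on stdin. The list_images name is my own, and the sed pattern assumes the usual single-line image: value form.

```shell
#!/bin/sh
# Sketch: print the unique image references found in rendered manifests.
# Reads YAML on stdin; list_images is an illustrative helper name.
list_images() {
  sed -n 's/^[[:space:]-]*image:[[:space:]]*\([^[:space:]][^[:space:]]*\).*$/\1/p' \
    | tr -d "\"'" \
    | sort -u
}
```

Typical usage would be to pipe helm template output for the chart into list_images and compare the result with what is already in the registry catalog.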

Why Not ChartMuseum?

You might be wondering about ChartMuseum, the traditional standalone Helm repository server. It still works fine, and the setup is quick:

podman run -d --name chartmuseum -p 8080:8080 \
  -v /data/chartmuseum/charts:/charts \
  -e STORAGE=local -e STORAGE_LOCAL_ROOTDIR=/charts \
  ghcr.io/helm/chartmuseum:v0.16.4

helm repo add internal http://192.168.10.10:8080

But the OCI approach has a clear advantage here: one fewer service to maintain. Your container registry is already running, already backed up (you are backing up /var/lib/registry, right?), and already monitored. Adding Helm charts to it is a zero-infrastructure-cost operation. ChartMuseum is one more container to keep alive, one more port to open, one more thing to troubleshoot at 2 AM when a deployment pipeline fails.

For new setups in 2025, I’d go OCI-native unless you have a specific reason not to.

Troubleshooting & Conclusion

Flannel Subnet Timeout

This one cost me a solid hour. The cluster.yml playbook ran smoothly for about fifteen minutes, then hung at a task called "Wait for flannel subnet.env." After the timeout, the play failed and left the cluster in a half-configured state.

The error looked like this in the Ansible output:

TASK [network_plugin/flannel : Wait for flannel subnet.env file presence] ******
fatal: [k8s-node1]: FAILED! => {"changed": false, "msg": "Timeout when waiting for file /run/flannel/subnet.env"}
fatal: [k8s-node2]: FAILED! => {"changed": false, "msg": "Timeout when waiting for file /run/flannel/subnet.env"}

Flannel’s logs on the node told the real story:

journalctl -u containerd | grep flannel

E0214 03:22:18.123456  Failed to find any valid interface: no compatible interfaces found

The root cause: Flannel needs a default route to determine which network interface to bind to. In this lab, the k8s nodes have three interfaces — enp0s3 (VirtualBox NAT), enp0s8 (host-only), and enp0s9 (the internal 192.168.10.0/24 network where the cluster actually lives). The default route pointed at enp0s3, which is the VirtualBox NAT adapter and has nothing to do with cluster traffic. Flannel picked up that interface, couldn't reach other nodes through it, and gave up.

Two things fixed this. First, I told Flannel explicitly which interface to use by setting flannel_interface in the inventory:

# inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml
flannel_interface: enp0s9
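If you are not sure which interface name to put there, you can look it up from the address. A minimal sketch, assuming the cluster subnet is 192.168.10.0/24 as in this lab; iface_for_prefix is a hypothetical helper that parses the one-line-per-address output of ip -o -4 addr show.

```shell
#!/bin/sh
# Sketch: find the interface that carries an address in the cluster subnet.
# Reads `ip -o -4 addr show` output on stdin.
iface_for_prefix() {
  # $1 = address prefix as a plain string, e.g. "192.168.10."
  awk -v p="$1" 'index($4, p) == 1 { print $2; exit }'
}
```

On a node this would be run as: ip -o -4 addr show | iface_for_prefix 192.168.10.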

Second, I added a default route through the internal network on each node so that Flannel’s subnet acquisition would work:

nmcli connection modify enp0s9 +ipv4.routes "0.0.0.0/0 192.168.10.10 200"
nmcli connection up enp0s9

The metric value 200 matters here — it keeps this route as a lower-priority fallback so it doesn't interfere with the existing default route on enp0s3. After making both changes, I re-ran the playbook and Flannel came up within seconds.

One more thing: if you’re re-running cluster.yml after a failed attempt, it's worth doing a reset.yml first to clean up the partial state. Otherwise you can end up with stale CNI configurations that cause even more confusing errors.

ansible-playbook -i inventory/mycluster/inventory.ini reset.yml
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml

etcd Download Fails with 404

This one is subtle and only shows up if you’re not running on x86_64. The symptom:

TASK [download : Download_file | Download item] *******************************
fatal: [k8s-node1]: FAILED! => {
    "msg": "HTTP Error 404: Not Found",
    "url": "http://192.168.10.10/kubernetes/etcd/etcd-v3.5.26-linux-amd64.tar.gz"
}

The file doesn’t exist on the nginx file server because the actual binary staged during the download phase was etcd-v3.5.26-linux-arm64.tar.gz (or whatever your architecture is). The problem is in offline.yml, where the etcd download URL is hardcoded with amd64:

# The problematic line in offline.yml:
etcd_download_url: "{{ files_repo }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"

The fix is to replace amd64 with Kubespray's image_arch variable, which resolves to the correct architecture at runtime:

# Corrected:
etcd_download_url: "{{ files_repo }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"

This pattern — hardcoded amd64 where {{ image_arch }} should be — might appear in other URLs too. It's worth doing a quick grep across your offline.yml before running the playbook:

grep -n "amd64" inventory/mycluster/group_vars/all/offline.yml

If any lines show up that aren’t comments, replace them with {{ image_arch }}. On a pure x86_64 lab like the one in this post, you won't hit this issue because amd64 happens to be correct. But if you're planning to reuse the same offline.yml in a mixed-architecture or ARM-based environment, fix it now.
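If the grep turns up more than a couple of lines, the replacement can be scripted. This is a sketch rather than a drop-in fix: fix_arch is my own helper name, it skips comment lines, and it rewrites every remaining amd64 it finds, so review the diff before replacing the file.

```shell
#!/bin/sh
# Sketch: replace hardcoded amd64 with Kubespray's image_arch variable,
# leaving comment lines untouched. Reads YAML on stdin, writes to stdout.
fix_arch() {
  sed '/^[[:space:]]*#/!s/amd64/{{ image_arch }}/g'
}
```

A cautious way to use it is fix_arch < offline.yml > offline.yml.new followed by diff offline.yml offline.yml.new, and only then moving the new file into place.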

Image Pull Failures After Cluster Setup

After the cluster was running, I tried deploying a test workload and hit this:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Warning  Failed     12s   kubelet            Failed to pull image "nginx:alpine":
           failed to resolve reference "docker.io/library/nginx:alpine":
           failed to do request: dial tcp 104.18.xx.xx:443: connect: no route to host

The kubelet on the node tried to pull nginx:alpine from Docker Hub and failed, because of course there's no internet. This means one of two things: either the image isn't in the private registry, or the containerd mirror isn't configured on that node.

Debugging this is a two-step process. First, check whether the image exists in the registry:

curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool

If library/nginx doesn't show up in the repository list, the image was never pushed. Go back to section 7.1, push it, and try again.
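When the catalog is long, scanning it by eye gets old quickly. Here is a tiny sketch that answers the yes/no question directly. catalog_has is a hypothetical helper and it does a plain string match on the JSON, which is good enough for a quick check but is not a real JSON parser.

```shell
#!/bin/sh
# Sketch: check whether a repository appears in a /v2/_catalog response.
# Reads the JSON on stdin and exits 0 if the quoted name is present.
catalog_has() {
  # $1 = repository path, e.g. library/nginx
  grep -qF "\"$1\""
}
```

For example: curl -s http://192.168.10.10:35000/v2/_catalog | catalog_has library/nginx && echo present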

If the image is there, check the mirror configuration on the failing node:

cat /etc/containerd/certs.d/docker.io/hosts.toml

The file should exist and point to http://192.168.10.10:35000. If it's missing or has the wrong content, either re-apply the Kubespray containerd tags (section 7.2.1) or create the file manually.

After fixing the configuration, restart containerd and delete the failed pod so the kubelet retries the pull:

systemctl restart containerd
kubectl delete pod <pod-name>

The new pod should come up with the image pulled from the internal registry. You can confirm by checking the events:

kubectl describe pod <pod-name> | grep "Successfully pulled"

Normal  Pulled  3s  kubelet  Successfully pulled image "nginx:alpine" in 1.2s

SELinux and Firewall Considerations

Rocky Linux 10 ships with SELinux in enforcing mode and firewalld active by default. During this lab setup, both were disabled early on to reduce variables — setenforce 0 and systemctl stop firewalld. That's fine for a lab. For anything closer to production, here's what you need to keep in mind.

Containerd and kubelet both work with SELinux in enforcing mode, but you need the right policies installed. Kubespray handles most of this automatically when selinux_state is set in the inventory. The container-selinux package (pulled from the local RPM mirror if you've synced it) provides the base policies. Where things get tricky is with custom volume mounts — if your pods mount host paths that SELinux doesn't have a context for, you'll see Permission denied errors that look like filesystem issues but are actually policy violations. The audit2allow tool is your friend here, or you can label the directories with chcon -Rt svirt_sandbox_file_t /path/to/data before mounting.

For firewalld, the key ports that need to be open on the control plane node are 6443 (API server), 2379–2380 (etcd), and 10250 (kubelet). Worker nodes need 10250 and whatever NodePort range you’ve configured (default 30000–32767). Flannel’s VXLAN traffic uses UDP port 8472. If you’re running the private registry on the admin server, port 35000 needs to be reachable from all nodes.

Rather than listing every firewall-cmd invocation, here's the practical advice: if you're in a lab, turn firewalld off and focus on getting the cluster working. Once everything is stable, turn it back on and add rules one service at a time, testing after each change. The Kubespray documentation has a full port matrix you can reference.

Conclusion

Setting up Kubernetes in an air-gapped environment is, frankly, a lot of work. What takes twenty minutes with kubeadm init on an internet-connected machine turns into a multi-day project when every binary, every image, and every package has to be pre-staged.

That said, kubespray-offline takes the worst part out of the equation. The download-all.sh script and its ten sub-scripts handle the tedious work of figuring out which files, images, and packages are needed, downloading them, and organizing the output directory structure. Without it, you'd be manually reading Kubespray's source code to build those lists yourself.

The pieces that still require hands-on attention come down to a few specific areas. Network interface selection is one: Flannel needs to know which interface to bind to, and in multi-NIC VMs that’s never automatic. Architecture strings in URL templates are another: the amd64 vs {{ image_arch }} issue is easy to miss until the playbook fails halfway through. And the post-install work of setting up containerd mirrors and populating the registry with application images is something you'll keep doing for the lifetime of the cluster.

If I were doing this again from scratch, I’d change a few things about the order of operations. I’d set up the containerd registry mirrors as part of the initial offline.yml configuration rather than adding them after the fact. I'd also script the "push all application images" step more aggressively — have a text file listing every image your workloads need, and a loop that pulls, tags, and pushes each one in batch. That way, when a new team member needs to deploy something, the image is already waiting in the registry.
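The batch loop could look roughly like this. It is a dry-run sketch: plan_push is a hypothetical helper that only prints the podman commands, and it assumes every line in the list is a fully qualified reference (registry/path:tag) and that the internal registry speaks plain HTTP.

```shell
#!/bin/sh
# Sketch: turn an image list into pull/tag/push commands (dry run).
REGISTRY="192.168.10.10:35000"

plan_push() {
  # Reads fully qualified image references, one per line, on stdin.
  while IFS= read -r img; do
    case "$img" in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    target="$REGISTRY/${img#*/}"    # drop the upstream registry host
    echo "podman pull $img"
    echo "podman tag $img $target"
    echo "podman push --tls-verify=false $target"
  done
}
```

Running plan_push < images.txt lets you review the plan first. Piping the output into sh -e executes it on a machine that can reach both the upstream registries and the internal one.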

Kubespray Skills Reference

GitHub - sigridjineth/kubespray-skills: Kubespray skills for Kubernetes cluster management

While writing this post, I distilled the entire workflow into a set of reusable Claude Code skills for the kubespray-skills project. The idea is simple: instead of re-reading this 5,000-line article every time you need to recall a specific command or configuration block, you can point Claude Code at the relevant skill file and get accurate, context-aware guidance on the spot.

Three skills came directly out of this air-gap work:

kubespray-airgap covers the end-to-end offline deployment pipeline — the download-all.sh workflow, offline.yml configuration, containerd registry mirrors, the image load-and-push sequence, and the most common failure modes. This one existed before in a minimal form, but the rewrite now includes the full kubespray-offline tooling that we walked through in sections 5 and 6.

kubespray-offline-infra is dedicated to the supporting infrastructure that an air-gapped cluster depends on: BIND for DNS, chrony for NTP, iptables NAT for the network gateway, reposync + nginx for the YUM/DNF mirror, and devpi for the PyPI mirror. Everything from sections 3 and 4 of this post lives here. If your DNS forwarding breaks or your nodes can't sync time, this is the skill to consult.

kubespray-helm-airgap handles Helm chart management in offline environments — packaging charts into .tgz archives, pushing them to an OCI registry, setting up ChartMuseum as an alternative, and the often-overlooked step of staging the container images that the charts reference. Section 7.3 of this post maps directly to this skill.

These three join seven other skills in the project — kubespray-lab-setup, kubespray-deployment, kubespray-operations, kubespray-ha-configuration, kubespray-certificates, kubespray-monitoring, and kubespray-troubleshooting — bringing the total to ten. Together they cover the complete Kubernetes cluster lifecycle with Kubespray, from spinning up a Vagrant lab to upgrading a production HA cluster.

The skills are available in the GitHub repository linked above. Each one follows the same structure: YAML frontmatter with a trigger-based description, then the actual reference content with working code examples, configuration blocks, and troubleshooting tables.

Final Thoughts

The air-gap constraint forces you to think carefully about supply chain and dependency management in a way that internet-connected clusters let you ignore. Every image has a provenance. Every binary has a version. Nothing appears magically from the internet at deploy time. That discipline is worth carrying forward even if your next cluster does have internet access.

Kubespray HA & Upgrade: A Hands-On Guide from v1.32 to v1.34

Sigrid Jin — Sat, 07 Feb 2026 23:39:38 GMT

Production Kubernetes with Kubespray: HA Configuration and Rolling Upgrades in Practice

Why HA and Upgrade Strategy Matter

Everyone loves the “Day 1” excitement of spinning up a new Kubernetes cluster. The terminal logs fly by, the nodes report Ready, and you feel like you have successfully built a modern infrastructure. But the real work begins on “Day 2.” That is when the network flakes out, a disk fills up, or a critical security vulnerability forces an immediate patch.

In a production environment, stability is the only metric that truly counts. If you run a single Control Plane node, you are living dangerously. I recall an incident early in my career where a simple OS security update required a server reboot. Because we lacked High Availability (HA), that reboot meant the API server vanished. Existing pods kept running, but we lost the ability to deploy fixes, scale up during a traffic spike, or even query the cluster status. We were effectively flying blind until the node came back online.

To sleep soundly at night, you generally need a minimum of three Control Plane nodes. This number exists for a specific reason: maintaining the etcd quorum.

Etcd is the brain of your cluster, storing the state of everything. If you lose quorum, the cluster goes into read-only mode or stops functioning entirely. With three nodes, you can lose one and still maintain a majority (two), keeping the cluster operational.

Here is a simplified view of what a robust Kubespray inventory looks like compared to a fragile one. We define multiple control plane nodes to distribute the risk.

# A fragile, single-point-of-failure setup
[kube_control_plane]
node1

# A production-ready HA setup
[kube_control_plane]
node1
node2
node3

[etcd]
node1
node2
node3

Beyond just surviving hardware failures, HA is the prerequisite for a sane upgrade strategy. Kubernetes evolves rapidly. New versions arrive every few months, bringing performance improvements and security patches. If you treat your cluster as a static monument that should never be touched, it quickly becomes a liability.

I have seen teams paralyzed by fear, sticking to End-of-Life (EOL) versions because they worry an upgrade will break everything.

A proper HA setup changes that dynamic. It allows you to perform Rolling Upgrades. You can drain one node, upgrade it, reboot it, and bring it back online — all while the other nodes handle the traffic. The users never notice a thing. This turns upgrades from a terrifying event into a routine maintenance task.

In this guide, we use Kubespray because it handles this complexity for us. It automates the distribution of components and manages the rolling upgrade logic, ensuring that we adhere to these best practices without having to manually wire every connection.

What This Guide Covers

Running Kubernetes in production is rarely about the initial kubeadm init. It’s about what happens three months later — when certificates expire, a node crashes, or a security patch forces a version upgrade.

This guide moves beyond the “Hello World” of cluster creation. We are building a High Availability (HA) cluster using Kubespray, and more importantly, we are going to break it, fix it, and upgrade it. The goal is to simulate a real-world environment where uptime matters, even when you need to swap out a control plane node.

We will start by provisioning a local lab environment using Vagrant and VirtualBox. This isn’t a single-node Minikube setup; we are spinning up a multi-node architecture running Rocky Linux 10. The infrastructure looks like this:

1 Admin/LoadBalancer Node: Runs HAProxy and executes Ansible.
3 Control Plane Nodes: Provide redundancy for the API server and etcd.
2 Worker Nodes: Run the actual workloads.

Here is the target inventory structure we will build in inventory.ini.

[kube_control_plane]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11 etcd_member_name=etcd1
k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12 etcd_member_name=etcd2
k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13 etcd_member_name=etcd3

[etcd:children]
kube_control_plane

[kube_node]
k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14
k8s-node5 ansible_host=192.168.10.15 ip=192.168.10.15

Once the cluster is up, we shift focus to Day-2 Operations. This is where the real learning happens. We will configure an external HAProxy to manage API traffic and test if the cluster survives when a control plane node goes dark.

You will also see how to handle lifecycle events using Kubespray’s playbooks. We aren’t just running the installation script once; we will use specific playbooks to scale the cluster and remove faulty nodes safely.

# Example: Scaling the cluster by adding a new worker
ansible-playbook -i inventory/mycluster/inventory.ini -v scale.yml --limit=k8s-node5

Finally, we tackle the task that makes most operators nervous: Upgrades. We will walk through a rolling upgrade from Kubernetes v1.32.9 to v1.32.10, and eventually to v1.34.3, ensuring that workloads stay running while the underlying infrastructure shifts beneath them. By the end, you should feel comfortable managing the full lifecycle of a production-grade Kubernetes cluster.

2. Kubernetes HA Fundamentals

Control Plane Components and Their Roles

Setting up a single-node cluster on a laptop is satisfying. You run a script, see “Ready,” and everything just works. But moving that setup to production is a different story. The moment you start managing real traffic, the fragility of a single control plane becomes a liability you can’t afford.

This section covers why High Availability (HA) isn’t just a “nice-to-have” feature but a baseline requirement for any serious Kubernetes environment.

Before we talk about redundancy, we need to agree on what we are actually replicating. The Control Plane isn’t a monolith; it’s a collection of specific processes that coordinate the entire cluster.

When you run a command like kubectl apply -f deployment.yaml, you aren’t talking to the cluster as a whole — you are talking to these specific components:

kube-apiserver: The front door. It handles all REST requests, validates them, and updates the state in etcd. It is the only component that talks directly to the database.
etcd: The brain. This is a consistent, distributed key-value store where all cluster data lives. If you lose etcd, you lose the cluster.
kube-scheduler: The decision maker. It watches for new Pods with no assigned node and selects the best one based on resources and constraints.
kube-controller-manager: The reconciler. It runs controller loops (like the Node Controller or ReplicaSet Controller) to ensure the current state matches the desired state.

You can see these components running as static pods on your control plane node. If you are logged into a control plane node, a quick check looks like this.

# Checking the static pod manifests
$ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

# Verifying they are running
$ crictl ps --name 'kube|etcd'
CONTAINER ID   IMAGE                    NAME                      STATE
a1b2c3d4e5f6   .../kube-apiserver       kube-apiserver            Running
b2c3d4e5f6g7   .../etcd                 etcd                      Running
c3d4e5f6g7h8   .../kube-scheduler       kube-scheduler            Running
d4e5f6g7h8i9   .../kube-controller-mgr  kube-controller-manager   Running

Why Single Control Plane Is a Risk

In a single control plane setup, all the components listed above run on one machine. This creates a classic Single Point of Failure (SPOF).

If that one machine goes down, the impact is immediate and severe:

The API goes silent: You cannot run kubectl commands. CI/CD pipelines fail. Monitoring tools that query the API stop gathering data.
Scheduling stops: If a worker node crashes while the control plane is down, the Pods on that worker are gone. No scheduler means no new Pods are created to replace them.
State is locked: You cannot scale up, change configurations, or roll back deployments.

The applications already running on healthy worker nodes might keep running for a while, but the cluster effectively becomes a “zombie” — functioning but brain-dead. You are one disk failure or one bad OS patch away from a total outage.

Recommended HA Architecture

To fix this, we don’t just add a backup; we create a cluster of control planes. The standard production architecture involves three control plane nodes.

Why three? It comes down to etcd.

Etcd uses the Raft consensus algorithm, which requires a majority (quorum) to write data.

1 node: Quorum is 1. If it fails, you stop.
2 nodes: Quorum is 2 (Majority of 2 is 2). If one fails, you only have 1 left. You lose quorum. This is actually worse than a single node because you have twice the hardware risk for zero gain in availability.
3 nodes: Quorum is 2. You can lose 1 node and still have 2 left to form a majority. The cluster keeps writing data.
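The arithmetic behind that list is just a majority calculation: quorum is floor(N/2) + 1, and the number of failures you can absorb is N minus quorum. A quick sketch, with function names of my own choosing:

```shell
#!/bin/sh
# Majority arithmetic for an etcd cluster of N members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Running it shows why even member counts buy nothing: 4 members tolerate one failure, the same as 3, and 2 members tolerate none, the same as 1.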

Here is what the topology typically looks like in an inventory.ini file for Kubespray.

[kube_control_plane]
control-node-01 ansible_host=192.168.10.11
control-node-02 ansible_host=192.168.10.12
control-node-03 ansible_host=192.168.10.13

[etcd]
control-node-01
control-node-02
control-node-03

In this setup, we usually place a Load Balancer (like HAProxy or a cloud LB) in front of the three API servers. The worker nodes and your local kubectl talk to the Load Balancer, which distributes traffic to the healthy control plane nodes.

Worker Node HA: Built-in but Dependent on Control Plane

People often confuse Worker HA with Control Plane HA. Kubernetes handles Worker HA natively. If a worker node fails, the ReplicaSet controller notices that the Pods are gone and spins up replacements on other available nodes.

But there is a catch: This relies entirely on the Control Plane.

If your Control Plane is down, the cluster has no way of knowing a worker node has failed. The logic that says “Current Replicas < Desired Replicas” lives in the Controller Manager on the master node.

So, while Kubernetes is designed to heal worker failures automatically, that self-healing capability is only as reliable as your Control Plane. Without a robust CP setup, your Worker HA guarantees are effectively null and void during a master outage.

Common Failure Scenarios Without HA

I have seen clusters break in many creative ways. Without HA, minor operational tasks turn into major downtime events.

OS Patching & Reboots: You need to apply a security patch to the kernel. On a single master, rebooting means taking the entire API offline for 5–10 minutes. With HA, you can roll through the nodes one by one without anyone noticing.
Disk Filling Up: Logs or backups fill the root partition. Etcd is extremely sensitive to disk latency and space. If the disk fills up on a single master, etcd panics and goes read-only or crashes.
Expired Certificates: If the API server certificates expire on a single node, you are locked out. In an HA setup, if one node has an issue, you can still access the others to troubleshoot and rotate certificates.

The goal of High Availability is to make day-to-day maintenance boring and predictable.

3. Day-2 Operations: The Case for Regular Upgrades

Building a Kubernetes cluster is often the easy part. Keeping it secure and stable over the next two years — that is the real challenge.

In the world of Kubernetes, “Day-2” refers to everything that happens after the initial install. Among these tasks, upgrades are often the most dreaded. However, treating a cluster as a “set it and forget it” artifact is a guaranteed recipe for future instability.

Kubernetes as a Continuously Evolving Platform

Kubernetes moves fast. The project releases a new minor version roughly every four months. Unlike a traditional Linux distribution where you might sit on a Long Term Support (LTS) release for five years, Kubernetes forces you to keep moving.

Support windows are short. Generally, only the latest three minor versions are supported (N-2 policy). For example, if the current version is v1.32, the supported releases are v1.32, v1.31, and v1.30, so a cluster still running v1.28 is already two releases outside the support window. That means no security patches for critical components like kube-apiserver or kubelet.
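The window check is simple enough to script into a health report. Here is a minimal sketch, assuming the major version stays at 1 and that "supported" means at most two minors behind the newest release; minor_of and in_support_window are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: is a running minor version inside the "latest three minors" window?
minor_of() {
  v="${1#v}"        # strip the optional leading "v"
  v="${v#*.}"       # drop the major component
  echo "${v%%.*}"   # keep only the minor component
}

in_support_window() {
  # $1 = newest released version, $2 = version you are running
  newest="$(minor_of "$1")"
  running="$(minor_of "$2")"
  [ $(( newest - running )) -le 2 ]
}

for r in v1.32 v1.30 v1.28; do
  if in_support_window v1.32 "$r"; then
    echo "$r: supported"
  else
    echo "$r: out of support"
  fi
done
```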

New versions bring more than just patches; they introduce essential performance improvements and security fixes. Stagnation holds your infrastructure back from becoming more efficient and secure.

Technical Debt from Deferred Upgrades

The longer you wait to upgrade, the harder it gets. I often compare it to skipping dental appointments; what could have been a routine cleaning eventually turns into a root canal.

If you try to jump across multiple versions at once — say, from v1.30 straight to v1.34 — you run into a minefield of breaking changes. Kubernetes frequently retires old APIs. If you haven’t updated your manifests incrementally, your deployments might simply fail during a major jump.

Fixing one deprecated API in a manifest is easy. Fixing hundreds across fifty repositories while your production upgrade is stalled? That is the technical debt you want to avoid. Regular, incremental upgrades are the only way to keep this debt manageable.

Prerequisites for Zero-Downtime Upgrades

You cannot just replace a binary and hope for the best. A production upgrade must be seamless to the end user.

To achieve this, your cluster needs to meet specific conditions before you even think about running an upgrade playbook:

High Availability (HA): As discussed in Section 2, you need multiple Control Plane nodes. Upgrading the API server involves restarting it. If you have only one, your API goes down. If you have three, you can upgrade them one by one (Rolling Update) without losing availability.
Spare Capacity: When a worker node is being upgraded, it must be drained of all running workloads. You need enough resource headroom on other nodes to accept these displaced Pods.
Pod Disruption Budgets (PDB): You should define how many replicas of an application can be down at once. This prevents the upgrade process from accidentally taking down all instances of a critical service.
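As an illustration of the last prerequisite, the sketch below writes a PodDisruptionBudget manifest to a file. The application label app=web, the name web-pdb, and the minAvailable value are assumptions for the example, not part of this lab.

```shell
# Write a PodDisruptionBudget for a hypothetical Deployment labelled app=web.
# minAvailable: 2 asks Kubernetes to keep at least two replicas running during a drain.
cat <<'EOF' > web-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF

# Apply it once the cluster is reachable:
# kubectl apply -f web-pdb.yaml
```

With a budget like this in place, a drain waits instead of evicting a Pod when doing so would leave fewer than two replicas available.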

The upgrade process conceptually involves isolating a node (cordon), moving its workload elsewhere (drain), upgrading the software, and bringing it back (uncordon). Kubespray automates this entire dance, but the underlying infrastructure must be ready to support it. We will see this in action in Section 9.
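The cordon, drain, upgrade, uncordon cycle can be sketched by hand as follows. This is not what Kubespray runs internally; it is a simplified illustration with hypothetical helper names (run, upgrade_node), and it defaults to a dry run that only prints the kubectl commands.

```shell
# DRY_RUN=1 (the default here) only prints commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Hypothetical helper showing the per-node sequence.
upgrade_node() {
  local node="$1"
  run kubectl cordon "$node"      # stop new Pods from being scheduled here
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  echo "# (upgrade kubelet and the container runtime on $node here)"
  run kubectl uncordon "$node"    # allow scheduling again
}

upgrade_node k8s-node4
```

The drain step is where Pod Disruption Budgets and spare capacity matter: if either is missing, the drain either blocks or leaves workloads with nowhere to go.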

Declarative Cluster Management with Kubespray and Git

The most stable way to manage upgrades is through Infrastructure as Code (IaC). We avoid making ad-hoc changes directly on the servers. Instead, we define the desired state of the cluster in configuration files.

Kubespray fits perfectly into this workflow. Your cluster configuration — inventory, variables, and versions — lives in text files. Ideally, you manage these files in a Git repository. When you want to upgrade, you modify the kube_version variable in a file like group_vars/k8s_cluster/k8s-cluster.yml and commit the change.

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

# 🔴 OLD STATE
# kube_version: 1.32.9

# 🟢 NEW STATE
kube_version: 1.32.10

This declarative approach provides a permanent record of when an upgrade happened and who triggered it.

Note: In our lab environment later, we will sometimes use the command line (-e kube_version=…) to override this version for speed and simplicity. However, in a real production environment, updating the inventory file via Git is the recommended best practice.
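A Git-driven version bump could look roughly like the sketch below. The helper name bump_kube_version is hypothetical, and the version string should use whatever format your Kubespray release expects; the example follows the unprefixed form used on the command line in this post.

```shell
# Hypothetical helper: rewrite the kube_version line in a Kubespray vars file.
bump_kube_version() {
  local file="$1" new_version="$2"
  # -i.bak keeps the sed call portable between GNU and BSD sed
  sed -i.bak -E "s/^kube_version:.*/kube_version: ${new_version}/" "$file"
  rm -f "${file}.bak"
}

# Example, run from the Kubespray checkout:
# bump_kube_version inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml 1.32.10
# git add inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
# git commit -m "Upgrade cluster to Kubernetes 1.32.10"
```

The commit then serves as the audit record: the diff shows the version change, and the log shows who made it and when.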


Lab Environment Setup

Before we deploy Kubernetes, we need a solid foundation. In a real data center, this would involve racking servers and cabling switches. In our lab, we will simulate this infrastructure using VirtualBox and Vagrant.

This section walks through building a 6-node cluster environment. By the end, you will have a fully functional lab that mimics a production-grade HA setup.

Prerequisites: Ensure you have VirtualBox 7.2.4+ and Vagrant 2.4.9+ installed.

Architecture Overview

We are building a standard high-availability topology. To make it realistic, we separate the “management” layer from the cluster itself.

                    ┌─────────────┐
                    │  admin-lb   │
                    │ HAProxy/NFS │
                    │ .10.10      │
                    └──────┬──────┘
                           │ :6443 (API Load Balancer)
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │k8s-node1 │ │k8s-node2 │ │k8s-node3 │
        │  CP+etcd │ │  CP+etcd │ │  CP+etcd │
        │ .10.11   │ │ .10.12   │ │ .10.13   │
        └──────────┘ └──────────┘ └──────────┘
              ┌────────────┴────────────┐
              ▼                         ▼
        ┌──────────┐             ┌──────────┐
        │k8s-node4 │             │k8s-node5 │
        │  Worker  │             │ (Standby)│
        │ .10.14   │             │ .10.15   │
        └──────────┘             └──────────┘

Admin/LB Node (1 node): This node wears multiple hats. It acts as the Ansible controller (to run Kubespray), the external Load Balancer (HAProxy) for the API server, an NFS server for shared storage, and a jump host with management tools (kubectl, k9s, helm).
Control Plane Nodes (3 nodes): These three nodes run the core Kubernetes components (kube-apiserver, etcd, controller-manager, scheduler). Three is the magic number for etcd quorum; it allows the cluster to survive the loss of one node.
Worker Nodes (2 nodes): These nodes run your actual workloads (Pods). k8s-node4 will be active initially, while k8s-node5 will be provisioned as a VM but added to the cluster later in Section 7.

VM Specifications and Network Layout

We will use Rocky Linux 10 as the base OS. Each cluster VM needs three network interfaces:

NAT Network (NIC1, typically enp0s3): For internet access (downloading packages) and Vagrant management.
Host-Only Network #1 (NIC2, typically enp0s8): The 192.168.10.0/24 management network, used for Ansible SSH and for HAProxy to reach the API servers.
Host-Only Network #2 (NIC3, typically enp0s9): The 192.168.20.0/24 cluster network. We will use this interface (enp0s9) for the CNI (Flannel) and intra-cluster traffic.

Vagrantfile Walkthrough

The Vagrantfile defines our entire infrastructure as code. It loops through a configuration to create all 6 nodes at once.

Key points to note:

We use bento/rockylinux-10 as the base box.
Each cluster node gets two private_network entries: 192.168.10.1X on the management network (NIC2) and 192.168.20.1X on the cluster network (NIC3). The inventory in Section 5 relies on both.
We set nicpromisc2 and nicpromisc3 to allow-all. This enables promiscuous mode on the host-only adapters, which this lab uses so that CNI overlay traffic (like Flannel VXLAN) reaches its destination properly.
We use linked_clone = true to save disk space by sharing the base image across VMs, rather than duplicating it for each node.

# Vagrantfile
BOX_IMAGE = "bento/rockylinux-10"
N = 5 # Number of k8s nodes (3 CP + 2 Workers)

Vagrant.configure("2") do |config|
  
  # 1. Define K8s Nodes (k8s-node1 to k8s-node5)
  (1..N).each do |i|
    config.vm.define "k8s-node#{i}" do |subconfig|
      subconfig.vm.box = BOX_IMAGE
      subconfig.vm.provider "virtualbox" do |vb|
        vb.name = "k8s-node#{i}"
        vb.cpus = 4
        vb.memory = 2048
        # Enable promiscuous mode on both host-only adapters (NIC2 and NIC3)
        vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
        vb.customize ["modifyvm", :id, "--nicpromisc3", "allow-all"]
        vb.linked_clone = true
      end
      subconfig.vm.hostname = "k8s-node#{i}"
      # NIC2 (enp0s8): management network used by Ansible and HAProxy
      subconfig.vm.network "private_network", ip: "192.168.10.1#{i}"
      # NIC3 (enp0s9): cluster network used by Kubernetes components and Flannel
      subconfig.vm.network "private_network", ip: "192.168.20.1#{i}"
      subconfig.vm.provision "shell", path: "init_cfg.sh"
    end
  end

  # 2. Define Admin/LB Node
  config.vm.define "admin-lb" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.provider "virtualbox" do |vb|
      vb.name = "admin-lb"
      vb.cpus = 2
      vb.memory = 1024
      vb.linked_clone = true
    end
    subconfig.vm.hostname = "admin-lb"
    subconfig.vm.network "private_network", ip: "192.168.10.10"
    subconfig.vm.provision "shell", path: "admin-lb.sh"
  end
end

Admin-LB Node Bootstrap Script (admin-lb.sh)

This script configures the admin node. It installs HAProxy for load balancing, sets up NFS for storage, prepares SSH keys for Ansible, and installs cluster management tools.

#!/bin/bash
# admin-lb.sh

echo "[TASK 1] OS Prep (Timezone, Firewall, SELinux, SSH)"
timedatectl set-timezone Asia/Seoul
systemctl disable --now firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Enable Root Login for consistency (optional for admin-lb, critical for nodes)
echo "root:qwe123" | chpasswd
sed -i 's/^#PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd

echo "[TASK 2] Local DNS Setup"
# Clean up cloud-init entries that might conflict with our static IPs
sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts
cat <<EOF >> /etc/hosts
192.168.10.10 k8s-api-srv.admin-lb.com admin-lb
192.168.10.11 k8s-node1
192.168.10.12 k8s-node2
192.168.10.13 k8s-node3
192.168.10.14 k8s-node4
192.168.10.15 k8s-node5
EOF

echo "[TASK 3] Install HAProxy"
dnf install -y haproxy

# HAProxy Config: API LB + Stats + Prometheus Metrics
cat <<EOF > /etc/haproxy/haproxy.cfg
global
    log 127.0.0.1 local2
    stats socket /var/lib/haproxy/stats

defaults
    mode tcp
    timeout connect 10s
    timeout client 1m
    timeout server 1m

# Kubernetes API Server LB
frontend k8s-api
    bind *:6443
    default_backend k8s-api-backend

backend k8s-api-backend
    option tcp-check
    balance roundrobin
    server k8s-node1 192.168.10.11:6443 check
    server k8s-node2 192.168.10.12:6443 check
    server k8s-node3 192.168.10.13:6443 check

# HAProxy Stats Dashboard (http://192.168.10.10:9000/haproxy_stats)
listen stats
    bind *:9000
    mode http
    stats enable
    stats uri /haproxy_stats

# Prometheus Metrics Exporter (http://192.168.10.10:8405/metrics)
frontend prometheus
    bind *:8405
    mode http
    http-request use-service prometheus-exporter if { path /metrics }
    no log
EOF
systemctl enable --now haproxy

echo "[TASK 4] Install NFS Server"
dnf install -y nfs-utils
mkdir -p /srv/nfs/share
chown nobody:nobody /srv/nfs/share
chmod 755 /srv/nfs/share
# Note: 'async' improves performance for lab environments but risks data loss in production
echo '/srv/nfs/share *(rw,async,no_root_squash,no_subtree_check)' > /etc/exports
systemctl enable --now nfs-server
exportfs -rav

echo "[TASK 5] Install Tools (kubectl, k9s, helm)"
# Install kubectl
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key
exclude=kubectl
EOF
dnf install -y -q kubectl --disableexcludes=kubernetes

# Install k9s (using curl since wget might be missing)
# Note: the release tarball is named with a capital "L" (k9s_Linux_amd64.tar.gz)
curl -fsSL -o /tmp/k9s_linux_amd64.tar.gz https://github.com/derailed/k9s/releases/latest/download/k9s_Linux_amd64.tar.gz
tar -xzf /tmp/k9s_linux_amd64.tar.gz -C /usr/local/bin/ k9s
chmod +x /usr/local/bin/k9s

# Install Helm
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | DESIRED_VERSION=v3.16.2 bash

echo "[TASK 6] SSH Key Distribution"
dnf install -y python3-pip git sshpass
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Distribute key to all k8s nodes (admin-lb itself doesn't need to be in inventory)
for i in {1..5}; do
  sshpass -p 'qwe123' ssh-copy-id -o StrictHostKeyChecking=no root@192.168.10.1$i
done

echo "[TASK 7] Clone Kubespray"
git clone -b v2.29.1 https://github.com/kubernetes-sigs/kubespray.git /root/kubespray
# Use --break-system-packages for Rocky 10 (Python 3.12+ PEP 668)
pip3 install --break-system-packages -r /root/kubespray/requirements.txt

Worker/Control Plane Node Init Script (init_cfg.sh)

This script prepares the cluster nodes. It disables swap, loads required kernel modules, sets up local DNS, and configures SSH to allow Ansible connections.

#!/bin/bash
# init_cfg.sh

echo "[TASK 1] Timezone & Swap"
timedatectl set-timezone Asia/Seoul
swapoff -a
sed -i '/swap/d' /etc/fstab

echo "[TASK 2] Disable Firewalld & SELinux"
systemctl disable --now firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

echo "[TASK 3] Kernel Modules & Sysctl"
cat <<EOF > /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
modprobe overlay
modprobe br_netfilter

cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sysctl --system

echo "[TASK 4] Local DNS"
# Remove cloud-init generated hostname entries that conflict with our static mappings
sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts

# We also add the admin node alias 'k8s-api-srv.admin-lb.com' for use in Section 6 (External LB)
cat <<EOF >> /etc/hosts
192.168.10.10 k8s-api-srv.admin-lb.com admin-lb
192.168.10.11 k8s-node1
192.168.10.12 k8s-node2
192.168.10.13 k8s-node3
192.168.10.14 k8s-node4
192.168.10.15 k8s-node5
EOF

echo "[TASK 5] SSH Config (Allow Root Login for Ansible)"
echo "root:qwe123" | chpasswd
# Use sed to safely modify existing config without duplication
sed -i 's/^#PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd

Deploying the Environment with Vagrant

With the three files (Vagrantfile, admin-lb.sh, init_cfg.sh) in the same directory, we are ready to launch.

Start the VMs

$ vagrant up

This process will take a few minutes as it downloads the OS image and provisions all 6 nodes.

Verify the Status:
Run vagrant status to confirm all machines are up.

$ vagrant status
Current machine states:

k8s-node1                 running (virtualbox)
k8s-node2                 running (virtualbox)
k8s-node3                 running (virtualbox)
k8s-node4                 running (virtualbox)
k8s-node5                 running (virtualbox)
admin-lb                  running (virtualbox)

Log in to the Admin Node:
All subsequent commands for deploying Kubernetes will be run from inside this node.

$ vagrant ssh admin-lb
[vagrant@admin-lb ~]$ sudo -i
[root@admin-lb ~]#

Test Connectivity:
First, verify network connectivity and DNS resolution.

ping -c 2 k8s-node1
ping -c 2 k8s-node4

Next, verify that HAProxy is running by checking the stats page response (200 OK).

curl -s -I http://192.168.10.10:9000/haproxy_stats | head -1
# HTTP/1.1 200 OK

If everything responds, your infrastructure is ready. We can now proceed to Section 5 to deploy the cluster using Kubespray.

Deploying Kubernetes with Kubespray

With the lab environment fully provisioned — the admin-lb node running HAProxy and NFS, three control plane VMs, and two worker VMs all booted and initialized — we are now ready to deploy a production-grade Kubernetes cluster using Kubespray.

This section walks through every step of the deployment process in detail: understanding the Kubespray project structure, configuring the inventory, tuning cluster variables, executing the installation playbook, and performing thorough post-deployment validation.

Kubespray Directory Structure and Key Files

Kubespray is an Ansible-based project that wraps kubeadm with extensive automation for deploying and managing Kubernetes clusters. During the admin-lb bootstrap phase (covered in Section 4), we cloned the Kubespray v2.29.1 repository into /root/kubespray. Before modifying any configuration, it is essential to understand the project layout.

/root/kubespray/
├── ansible.cfg                  # Ansible configuration (forks, pipelining, etc.)
├── cluster.yml                  # Main deployment playbook
├── scale.yml                    # Add-node playbook
├── remove-node.yml              # Remove-node playbook
├── upgrade-cluster.yml          # Rolling upgrade playbook
├── reset.yml                    # Full cluster teardown playbook
├── roles/                       # Ansible roles for each component
│   ├── download/                # Container image and binary downloads
│   ├── kubernetes/              # kubeadm, kubelet, static pod manifests
│   ├── kubernetes-apps/         # CoreDNS, metrics-server, ingress, etc.
│   ├── etcd/                    # etcd cluster bootstrap and management
│   ├── network_plugin/          # CNI plugins (flannel, calico, cilium, etc.)
│   └── container-engine/        # containerd, CRI-O runtime setup
├── inventory/
│   └── sample/                  # Template inventory (copy this)
│       ├── inventory.ini        # Node definitions and group assignments
│       └── group_vars/
│           ├── all/
│           │   └── all.yml      # Global variables (LB config, DNS, etc.)
│           ├── k8s_cluster/
│           │   ├── k8s-cluster.yml    # Core K8s settings (version, CNI, proxy mode)
│           │   └── addons.yml         # Optional add-ons (metrics-server, ingress, etc.)
│           └── etcd.yml         # etcd-specific variables
└── contrib/
    └── inventory_builder/       # Dynamic inventory generator script

Preparing the Inventory Directory

Rather than editing the sample inventory directly, we copy it to create our own cluster-specific configuration:

cd /root/kubespray
cp -rfp inventory/sample inventory/mycluster

This gives us a clean, isolated workspace under inventory/mycluster/ where all our customizations live. The original inventory/sample/ remains untouched, which is important when we later upgrade Kubespray versions via git checkout — our custom inventory directory won't be overwritten.

Inventory Configuration (inventory.ini)

The inventory file is the heart of a Kubespray deployment. It defines which nodes exist, what roles they play, and how Ansible groups them for targeted playbook execution.

The Complete Inventory File

# /root/kubespray/inventory/mycluster/inventory.ini

[all]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.20.11
k8s-node2 ansible_host=192.168.10.12 ip=192.168.20.12
k8s-node3 ansible_host=192.168.10.13 ip=192.168.20.13
k8s-node4 ansible_host=192.168.10.14 ip=192.168.20.14

[kube_control_plane]
k8s-node1
k8s-node2
k8s-node3

[etcd:children]
kube_control_plane

[kube_node]
k8s-node4

[k8s_cluster:children]
kube_control_plane
kube_node

Breakdown of Each Section

[all] — Node Definitions

Every node in the cluster is listed here with two critical variables:

ansible_host: The IP address Ansible uses to SSH into the node. In our lab, this is the enp0s8 interface on the 192.168.10.0/24 management network.
ip: The IP address Kubernetes components bind to for intra-cluster communication. This is the enp0s9 interface on the 192.168.20.0/24 cluster network.

This dual-network design is intentional: management traffic (Ansible SSH, admin access) stays on one network, while Kubernetes API, etcd, and pod-to-pod traffic flows on a separate, dedicated network. In production environments, this separation improves both security and performance.

[kube_control_plane] — Control Plane Nodes

Three nodes are assigned to the control plane group: k8s-node1, k8s-node2, and k8s-node3. Each will run the full set of Kubernetes control plane components as static pods:

kube-apiserver
kube-controller-manager
kube-scheduler

The order of nodes in this group matters. The first node (k8s-node1) acts as the initial control plane node during kubeadm init. Subsequent nodes join via kubeadm join. When adding a new control plane node later, it must always be appended to the end of this group — never inserted in the middle.

[etcd:children] — etcd Cluster Membership

[etcd:children]
kube_control_plane

By using children, we declare that etcd membership is inherited from the kube_control_plane group. This means all three control plane nodes also run etcd, forming a 3-member etcd cluster. This is the "stacked etcd" topology recommended for most deployments — etcd runs on the same nodes as the control plane, reducing infrastructure complexity while still maintaining a proper quorum.

With 3 etcd members, the cluster can tolerate 1 node failure. The quorum formula is ⌊n/2⌋ + 1 (n/2 rounded down, plus one), so a 3-node etcd cluster requires at least 2 members to be healthy. This is why odd numbers (3, 5, 7) are used for etcd: an even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller.
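The arithmetic is easy to check with shell integer division, which already rounds down. The function names quorum and fault_tolerance are only illustrative.

```shell
# Quorum is floor(n/2) + 1; shell integer division rounds down.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while the cluster still has quorum.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5 6 7; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

The table this prints shows 3 and 4 members both tolerating one failure, and 5 and 6 both tolerating two, which is why the extra even member buys nothing.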

[kube_node] — Worker Nodes

Currently only k8s-node4 is listed as a worker. We will add k8s-node5 later using the scale.yml playbook (covered in Section 7). Worker nodes run the kubelet and kube-proxy, but no control plane components.

[k8s_cluster:children] — Aggregate Group

This is a convenience group that encompasses all Kubernetes nodes (both control plane and workers). Kubespray uses it to apply cluster-wide configurations such as CNI plugin installation, kubelet settings, and container runtime setup.
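To double-check which hosts are listed directly under a group, a few lines of awk are enough. This is a rough sketch with a hypothetical helper name (hosts_in_group); it does not expand :children groups, for which ansible-inventory --graph is the better tool.

```shell
# Hypothetical helper: list hosts written directly under [group] in an INI inventory.
# It does not resolve ":children" groups.
hosts_in_group() {
  local file="$1" group="$2"
  awk -v g="[$group]" '
    /^#/  { next }                         # skip comment lines
    /^\[/ { in_group = ($0 == g); next }   # a section header starts or ends the group
    in_group && NF { print $1 }            # first field is the host name
  ' "$file"
}

# Example, run from /root/kubespray:
# hosts_in_group inventory/mycluster/inventory.ini kube_control_plane
```

Against the inventory above, the kube_control_plane group should list k8s-node1 through k8s-node3 in that order, and kube_node should list only k8s-node4.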

Ansible Variable Precedence in Kubespray

Before customizing any settings, it is critical to understand how Ansible resolves variable conflicts. Kubespray defines variables at multiple levels, and the precedence hierarchy determines which value wins when the same variable is set in multiple places.

The Precedence Hierarchy (Lowest to Highest)

1. Role defaults         (roles/xxx/defaults/main.yml)          ← Lowest priority
2. Inventory group_vars  (inventory/mycluster/group_vars/...)
3. Inventory host_vars   (inventory/mycluster/host_vars/...)
4. Playbook vars         (vars: section in playbook YAML)
5. Role vars             (roles/xxx/vars/main.yml)
6. CLI extra-vars        (--extra-vars / -e on command line)    ← Highest priority

What This Means in Practice

Role defaults are the baseline values that Kubespray ships with. For example, the default Kubernetes version, default CNI plugin, default kube-proxy mode, and so on are all defined in role defaults. These are intentionally low-priority — they are meant to be overridden.

Role vars are set by the Kubespray developers for values that should generally not be overridden by users. They take precedence over role defaults, inventory variables, and playbook vars, so setting the same variable in group_vars has no effect; only -e on the command line overrides them.

Inventory group_vars are where most of our customization happens. The files under inventory/mycluster/group_vars/ are the primary configuration surface for Kubespray:

group_vars/
├── all/
│   └── all.yml              # Global settings: LB config, DNS, timezone
├── k8s_cluster/
│   ├── k8s-cluster.yml      # K8s core: version, CNI, proxy mode, feature gates
│   └── addons.yml           # Add-ons: metrics-server, ingress, dashboard
└── etcd.yml                 # etcd settings: metrics, compaction, quotas

host_vars allow per-node overrides. This is rarely needed but can be useful for heterogeneous hardware or special-purpose nodes.

CLI --extra-vars (-e) have the highest precedence and override everything. This is the mechanism we use to specify the exact Kubernetes version at deployment time:

ansible-playbook cluster.yml -e kube_version="1.32.9"

Even if kube_version is set to a different value in group_vars/k8s_cluster/k8s-cluster.yml, the -e flag on the command line always wins. This is particularly useful during upgrades, where we increment the version via the CLI without modifying any files.

A common mistake is to set a variable in group_vars and wonder why it has no effect. The cause is almost always that the same variable is defined at a higher precedence level (often in role/vars, which is hard to override from group_vars). In such cases, the only reliable way to override it is via -e on the command line.
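The "highest level that defines the variable wins" rule can be modelled in a few lines of shell. This is a toy model, not how Ansible is implemented; resolve_var is a hypothetical helper, and the values are placeholders.

```shell
# Toy model: arguments are candidate values ordered from lowest to highest
# precedence; an empty string means "not set at this level".
resolve_var() {
  local value="" candidate
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then value="$candidate"; fi
  done
  echo "$value"
}

# Levels in this example: role default, group_vars, host_vars, extra-vars (-e)
resolve_var "from-role-default" "from-group-vars" "" ""
resolve_var "from-role-default" "from-group-vars" "" "from-extra-vars"
```

The first call resolves to the group_vars value and the second to the extra-vars value. To see what Ansible itself resolves on a real host, an ad-hoc call such as ansible -i inventory/mycluster/inventory.ini k8s-node1 -m debug -a "var=kube_proxy_mode" prints the value after inventory variables are merged.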

Customizing Cluster Settings (CNI, Proxy Mode, Add-ons)

With the inventory structure understood, let’s configure the key cluster parameters. We edit three files under inventory/mycluster/group_vars/.

Global Settings — all/all.yml

# /root/kubespray/inventory/mycluster/group_vars/all/all.yml

## Load Balancer Configuration
## (Default: client-side LB via nginx-proxy on each worker)
## We will modify these later in Section 6 when configuring external LB
# apiserver_loadbalancer_domain_name: "k8s-api-srv.admin-lb.com"
# loadbalancer_apiserver:
#   address: 192.168.10.10
#   port: 6443
# loadbalancer_apiserver_localhost: true   # default: enables nginx-proxy on workers

For the initial deployment, we leave the load balancer settings at their defaults. This means Kubespray will configure client-side load balancing using nginx static pods on each worker node (Case 1 architecture, detailed in Section 6).

Core Kubernetes Settings — k8s_cluster/k8s-cluster.yml

This is the most important configuration file. Here we define the CNI plugin, kube-proxy mode, service/pod CIDRs, and other core cluster parameters:

# /root/kubespray/inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

## CNI Plugin
kube_network_plugin: flannel

## Flannel Interface Selection
## CRITICAL: In multi-NIC environments, flannel must be told which interface to use
## for VXLAN overlay traffic. Without this, flannel may pick the wrong NIC
## (e.g., the NAT interface enp0s3 instead of the cluster network enp0s9).
flannel_interface: enp0s9

## Kube-proxy Mode
## Options: iptables (upstream kube-proxy default), ipvs (Kubespray's own default)
## iptables is simpler and sufficient for small-to-medium clusters
kube_proxy_mode: iptables

## DNS Configuration
## CoreDNS is the default and only supported DNS provider
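## Kubespray default clusterDNS is 10.233.0.3, the third address of the service CIDR
## (kubeadm on its own would pick the tenth address, 10.233.0.10 in this CIDR)
## This is derived from kube_service_addresses (see the skydns_server variable)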

## NodeLocal DNS Cache
## Disabled in this lab for simplicity
enable_nodelocaldns: false

## Service and Pod CIDRs (defaults shown — usually no need to change)
# kube_service_addresses: 10.233.0.0/18
# kube_pods_subnet: 10.233.64.0/18

Why Flannel? In this lab environment, we use Flannel for its simplicity. Flannel provides basic L3 overlay networking via VXLAN and is easy to understand and debug. For production environments requiring network policies, Calico or Cilium would be the recommended choices.

The flannel_interface Setting Explained: VirtualBox VMs typically have multiple network interfaces. In our lab:

enp0s3 — NAT interface (VirtualBox default, used for internet access)
enp0s8 — Host-only adapter #1 (management network, 192.168.10.0/24)
enp0s9 — Host-only adapter #2 (cluster network, 192.168.20.0/24)

Without explicitly setting flannel_interface: enp0s9, Flannel might select enp0s3 (the NAT interface) for its VXLAN tunnel endpoints. This would cause pod-to-pod communication to fail because the NAT interface does not provide direct connectivity between VMs. By specifying enp0s9, we ensure Flannel uses the dedicated cluster network.
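Interface names can differ between boxes, so it is worth confirming on a node which interface actually carries the cluster network before setting flannel_interface. The sketch below filters the output of ip -br -4 addr; the helper name iface_for_prefix is hypothetical.

```shell
# Hypothetical helper: read `ip -br -4 addr` output on stdin and print the
# first interface whose IPv4 address starts with the given prefix.
iface_for_prefix() {
  awk -v p="$1" '{
    for (i = 3; i <= NF; i++)            # fields 3.. hold the addresses
      if (index($i, p) == 1) { print $1; exit }
  }'
}

# On a cluster node:
# ip -br -4 addr | iface_for_prefix 192.168.20.
```

If this prints something other than enp0s9 on your VMs, use that name for flannel_interface instead.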

Why iptables Mode? The iptables kube-proxy mode is the battle-tested default. While IPVS mode offers better performance at scale (O(1) vs. O(n) for service routing rules), iptables mode is simpler to debug and perfectly adequate for clusters with fewer than a thousand services. For this lab, iptables is the appropriate choice.

DNS Address Note: Kubespray sets the cluster DNS service IP to 10.233.0.3 by default (the third address of the service CIDR), whereas kubeadm on its own assigns the tenth address (10.233.0.10 with this CIDR, or 10.96.0.10 with kubeadm's default 10.96.0.0/12 service range). This is a Kubespray-specific convention. The kubelet on every node is configured with 10.233.0.3 as its cluster DNS address, and the CoreDNS service is created with this ClusterIP. If you're migrating from a kubeadm-managed cluster, be aware of this difference.

Add-ons Configuration — k8s_cluster/addons.yml

# /root/kubespray/inventory/mycluster/group_vars/k8s_cluster/addons.yml

## Metrics Server
## Enables the Kubernetes Metrics Server for resource usage collection
## Required for: kubectl top, HPA (Horizontal Pod Autoscaler), VPA
metrics_server_enabled: true

## Other add-ons (disabled in this lab, shown for reference)
# ingress_nginx_enabled: false
# dashboard_enabled: false
# helm_enabled: false

The Metrics Server is the only add-on we enable at deployment time. It provides the /apis/metrics.k8s.io endpoint that powers kubectl top nodes, kubectl top pods, and the Horizontal Pod Autoscaler. We will install additional monitoring components (Prometheus, Grafana) separately using Helm in Section 8.

etcd Settings — etcd.yml

For the initial deployment, we keep etcd settings at their defaults. Later (in Section 8.3), we will enable etcd metrics:

# /root/kubespray/inventory/mycluster/group_vars/etcd.yml

## etcd metrics (will be enabled later for Prometheus monitoring)
# etcd_metrics: true
# etcd_listen_metrics_urls: "http://0.0.0.0:2381"

Running cluster.yml and Verifying Deployment

With the inventory and variables configured, we are ready to deploy the cluster.

Pre-Flight Checks

Before running the playbook, verify Ansible connectivity to all nodes:

cd /root/kubespray

# Test SSH connectivity to all nodes
ansible -i inventory/mycluster/inventory.ini all -m ping

Expected output:

k8s-node1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
k8s-node2 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
k8s-node3 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
k8s-node4 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

If any node returns UNREACHABLE, check SSH key distribution (the admin-lb.sh bootstrap script should have handled this) and verify network connectivity on the management network (192.168.10.0/24).

Executing the Deployment

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  -e kube_version="1.32.9"

Command Breakdown:

-i inventory/mycluster/inventory.ini — specifies the inventory file
cluster.yml — the main deployment playbook
-e kube_version="1.32.9" — overrides the Kubernetes version via CLI (highest precedence)

What Happens During Deployment

The cluster.yml playbook orchestrates a complex, multi-stage deployment process. At a high level, it proceeds through the following phases:

Phase 1: Prerequisite Checks and OS Configuration

Validates the target OS (Rocky Linux 10, Ubuntu, etc.)
Ensures required kernel modules are loaded (overlay, br_netfilter)
Verifies sysctl settings (net.bridge.bridge-nf-call-iptables, net.ipv4.ip_forward)
Disables swap (required by kubelet)
Configures container runtime prerequisites

Phase 2: Container Runtime Installation

Installs containerd as the CRI (Container Runtime Interface)
Configures containerd with appropriate settings for Kubernetes
Sets up the containerd systemd service

Phase 3: Download Binaries and Container Images

Downloads kubeadm, kubelet, kubectl binaries to /tmp/releases/
Pulls container images for all Kubernetes components:
registry.k8s.io/kube-apiserver:v1.32.9
registry.k8s.io/kube-controller-manager:v1.32.9
registry.k8s.io/kube-scheduler:v1.32.9
registry.k8s.io/kube-proxy:v1.32.9
registry.k8s.io/pause:3.10
registry.k8s.io/coredns/coredns:v1.12.0
registry.k8s.io/etcd:3.5.25
docker.io/flannel/flannel:v0.26.7
docker.io/flannel/flannel-cni-plugin:v1.6.2

Phase 4: etcd Cluster Bootstrap

Initializes etcd on the first control plane node (k8s-node1)
Joins k8s-node2 and k8s-node3 to the etcd cluster
Configures etcd with peer and client TLS certificates
Verifies the 3-member etcd quorum

Phase 5: Control Plane Initialization

Runs kubeadm init on k8s-node1 (the first control plane node)
Generates static pod manifests for kube-apiserver, kube-controller-manager, kube-scheduler
Creates the cluster CA certificates and kubeconfig files
Joins k8s-node2 and k8s-node3 to the control plane via kubeadm join --control-plane

Phase 6: Worker Node Join

Installs kubelet on worker nodes
Uploads control plane CA certificates to workers
Runs kubeadm join to register workers with the cluster
Configures nginx-proxy static pods on workers (for client-side load balancing)

Phase 7: CNI Plugin and Add-ons

Deploys Flannel DaemonSet across all nodes
Deploys CoreDNS as a Deployment in kube-system namespace
Deploys kube-proxy DaemonSet
Deploys Metrics Server (if metrics_server_enabled: true)
Applies node labels and taints as configured

Deployment Duration

The complete deployment takes approximately 8 minutes in this lab environment. The bulk of the time is spent in Phase 3 (downloading container images) and Phase 5 (waiting for control plane components to become healthy after each join).

Verifying Successful Deployment

Once the playbook completes without errors, copy the kubeconfig to the admin-lb node and verify:

# Copy kubeconfig from the first control plane node
mkdir -p /root/.kube
scp k8s-node1:/root/.kube/config /root/.kube/config

# Verify kubectl connectivity
kubectl get nodes -o wide

Expected output:

NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE              KERNEL-VERSION                CONTAINER-RUNTIME
k8s-node1   Ready    control-plane   8m    v1.32.9   192.168.20.11   <none>        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5
k8s-node2   Ready    control-plane   7m    v1.32.9   192.168.20.12   <none>        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5
k8s-node3   Ready    control-plane   6m    v1.32.9   192.168.20.13   <none>        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5
k8s-node4   Ready    <none>          5m    v1.32.9   192.168.20.14   <none>        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5

Key things to verify:

All 4 nodes show Ready status
Control plane nodes have the control-plane role label
All nodes report the correct Kubernetes version (v1.32.9)
INTERNAL-IP shows the cluster network addresses (192.168.20.x), not the management or NAT addresses
Container runtime is containerd://2.1.5
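The first and third checks are easy to script. The sketch below filters the output of kubectl get nodes --no-headers; both helper names are hypothetical, and a cordoned node (status Ready,SchedulingDisabled) is deliberately reported as not ready.

```shell
# Hypothetical helpers: read `kubectl get nodes --no-headers` output on stdin.

# Print nodes whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Print nodes whose VERSION column differs from the expected version.
wrong_version_nodes() {
  awk -v v="$1" '$5 != v { print $1 }'
}

# On admin-lb:
# kubectl get nodes --no-headers | not_ready_nodes
# kubectl get nodes --no-headers | wrong_version_nodes v1.32.9
```

Empty output from both commands means every node is Ready and on the expected version.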

Post-Deployment Validation (Nodes, Pods, etcd, Certificates)

A successful cluster.yml run does not guarantee a fully healthy cluster. Thorough post-deployment validation is essential.

System Pod Health Check

kubectl get pods -n kube-system -o wide

Expected output (abbreviated):

NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE
coredns-xxxxxxxxx-xxxxx             1/1     Running   0          8m    10.233.64.x     k8s-node1
coredns-xxxxxxxxx-xxxxx             1/1     Running   0          8m    10.233.64.x     k8s-node2
flannel-xxxxx                       1/1     Running   0          7m    192.168.20.11   k8s-node1
flannel-xxxxx                       1/1     Running   0          7m    192.168.20.12   k8s-node2
flannel-xxxxx                       1/1     Running   0          7m    192.168.20.13   k8s-node3
flannel-xxxxx                       1/1     Running   0          7m    192.168.20.14   k8s-node4
kube-apiserver-k8s-node1            1/1     Running   0          8m    192.168.20.11   k8s-node1
kube-apiserver-k8s-node2            1/1     Running   0          7m    192.168.20.12   k8s-node2
kube-apiserver-k8s-node3            1/1     Running   0          7m    192.168.20.13   k8s-node3
kube-controller-manager-k8s-node1   1/1     Running   0          8m    192.168.20.11   k8s-node1
kube-controller-manager-k8s-node2   1/1     Running   0          7m    192.168.20.12   k8s-node2
kube-controller-manager-k8s-node3   1/1     Running   0          7m    192.168.20.13   k8s-node3
kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.11   k8s-node1
kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.12   k8s-node2
kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.13   k8s-node3
kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.14   k8s-node4
kube-scheduler-k8s-node1            1/1     Running   0          8m    192.168.20.11   k8s-node1
kube-scheduler-k8s-node2            1/1     Running   0          7m    192.168.20.12   k8s-node2
kube-scheduler-k8s-node3            1/1     Running   0          7m    192.168.20.13   k8s-node3
metrics-server-xxxxxxxxx-xxxxx      1/1     Running   0          7m    10.233.64.x     k8s-node2
nginx-proxy-k8s-node4               1/1     Running   0          6m    192.168.20.14   k8s-node4

What to look for:

Static pods (apiserver, controller-manager, scheduler): One instance per control plane node. These are not managed by a Deployment or DaemonSet — they are directly managed by the kubelet on each node based on manifest files in /etc/kubernetes/manifests/.
Flannel DaemonSet: One pod per node (all 4 nodes). This provides the CNI overlay network.
kube-proxy DaemonSet: One pod per node (all 4 nodes). This handles service-to-pod routing.
CoreDNS Deployment: Two replicas (default) for DNS high availability.
Metrics Server: One pod. Provides the metrics.k8s.io API.
nginx-proxy: One static pod on each worker node only (k8s-node4). This is the client-side load balancer that distributes API requests from the worker to all three control plane nodes (covered in detail in Section 6).
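
Scanning the pod list by eye gets tedious on larger clusters. As a rough sketch (the helper name unhealthy_pods is made up for this example), the filter below prints only the pods that need attention, assuming the default NAME, READY, STATUS column order.

```shell
# unhealthy_pods
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin and
# prints every pod that is not Running with all of its containers ready.
# Pods in Completed state (finished Jobs) are skipped.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")            # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) {
        printf "%s\t%s\t%s\n", $1, $2, $3
        bad++
      }
    }
    END { exit (bad > 0) }
  '
}

# Usage:
#   kubectl get pods -n kube-system --no-headers | unhealthy_pods
```

No output and a zero exit status mean every system pod is Running and fully ready.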

etcd Cluster Health

SSH into any control plane node and use the etcdctl.sh wrapper script (installed by Kubespray) to inspect the etcd cluster:

# List all etcd members
ssh k8s-node1 etcdctl.sh member list -w table

Expected output:

+------------------+---------+-----------+----------------------------+----------------------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+-----------+----------------------------+----------------------------+
| 1a2b3c4d5e6f7890 | started | etcd1     | https://192.168.20.11:2380 | https://192.168.20.11:2379 |
| 2b3c4d5e6f789012 | started | etcd2     | https://192.168.20.12:2380 | https://192.168.20.12:2379 |
| 3c4d5e6f78901234 | started | etcd3     | https://192.168.20.13:2380 | https://192.168.20.13:2379 |
+------------------+---------+-----------+----------------------------+----------------------------+

# Check endpoint health and latency
ssh k8s-node1 etcdctl.sh endpoint status -w table

Expected output:

+----------------------------+------------------+---------+---------+-----------+-----------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+----------------------------+------------------+---------+---------+-----------+-----------+
| https://192.168.20.11:2379 | 1a2b3c4d5e6f7890 | 3.5.25  |  5.4 MB |     true  |         2 |
| https://192.168.20.12:2379 | 2b3c4d5e6f789012 | 3.5.25  |  5.4 MB |    false  |         2 |
| https://192.168.20.13:2379 | 3c4d5e6f78901234 | 3.5.25  |  5.4 MB |    false  |         2 |
+----------------------------+------------------+---------+---------+-----------+-----------+

Key things to verify:

All 3 members show started status
etcd version is 3.5.25 (the version bundled with Kubespray v2.29.1)
Exactly one member is the leader (IS LEADER: true)
All members share the same RAFT TERM (consistent cluster state)
Peer and client addresses use the cluster network (192.168.20.x)
Peer communication uses port 2380 (TLS-encrypted peer-to-peer)
Client communication uses port 2379 (TLS-encrypted client access)
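
The leader and Raft term checks can be automated by parsing the table output. This is a sketch under the assumption that the table contains IS LEADER and RAFT TERM columns. It locates them by header name, because real etcdctl output has more columns than the abbreviated table above. check_etcd_status is a hypothetical helper name.

```shell
# check_etcd_status
# Reads `etcdctl endpoint status -w table` output on stdin and verifies that
# exactly one member is the leader and all members agree on the Raft term.
check_etcd_status() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /ENDPOINT/ {                       # header row: find columns by name
      for (i = 1; i <= NF; i++) {
        h = trim($i)
        if (h == "IS LEADER") lc = i
        if (h == "RAFT TERM") tc = i
      }
      next
    }
    /^\|/ && lc && tc {                # data rows start with "|"
      members++
      if (trim($lc) == "true") leaders++
      terms[trim($tc)] = 1
    }
    END {
      for (t in terms) nterms++
      printf "members=%d leaders=%d distinct_terms=%d\n", members, leaders, nterms
      exit !(leaders == 1 && nterms == 1)
    }
  '
}

# Usage:
#   ssh k8s-node1 etcdctl.sh endpoint status -w table | check_etcd_status
```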

API Server Endpoint Verification

Test direct API server connectivity from the admin-lb node:

# Test each control plane node's API server individually
curl -sk https://192.168.10.11:6443/version
curl -sk https://192.168.10.12:6443/version
curl -sk https://192.168.10.13:6443/version

Expected output (from each):

{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  "buildDate": "...",
  "goVersion": "go1.23.x",
  "compiler": "gc",
  "platform": "linux/amd64"
}

If all three return the correct version, the API servers are healthy and reachable from the management network.
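
To avoid comparing three JSON blobs by hand, the check can be wrapped in a small loop. The sketch below is one possible way to do it. The function names and the EXPECTED_VERSION variable are illustrative, and -k is kept because the goal here is reachability, not certificate validation.

```shell
EXPECTED_VERSION="${EXPECTED_VERSION:-v1.32.9}"

# parse_git_version: extracts the gitVersion value from /version JSON on stdin.
parse_git_version() {
  sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p'
}

# fetch_version IP: asks one API server for its version (-k skips TLS checks).
fetch_version() {
  curl -sk --max-time 5 "https://$1:6443/version" | parse_git_version
}

# check_apiservers IP...: prints OK/FAIL per endpoint, non-zero exit on failure.
check_apiservers() {
  local ip got rc=0
  for ip in "$@"; do
    got="$(fetch_version "$ip")"
    if [ "$got" = "$EXPECTED_VERSION" ]; then
      echo "OK   $ip $got"
    else
      echo "FAIL $ip ${got:-no response}"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_apiservers 192.168.10.11 192.168.10.12 192.168.10.13
```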

Container Images on Nodes

Verify that all required container images were downloaded:

ssh k8s-node1 crictl images

Expected output:

IMAGE                                            TAG        IMAGE ID       SIZE
docker.io/flannel/flannel                        v0.26.7    xxxxxxxxxxxx   30.0MB
docker.io/flannel/flannel-cni-plugin             v1.6.2     xxxxxxxxxxxx   4.28MB
registry.k8s.io/coredns/coredns                  v1.12.0    xxxxxxxxxxxx   18.2MB
registry.k8s.io/etcd                             3.5.25     xxxxxxxxxxxx   64.8MB
registry.k8s.io/kube-apiserver                   v1.32.9    xxxxxxxxxxxx   30.5MB
registry.k8s.io/kube-controller-manager          v1.32.9    xxxxxxxxxxxx   28.7MB
registry.k8s.io/kube-proxy                       v1.32.9    xxxxxxxxxxxx   28.5MB
registry.k8s.io/kube-scheduler                   v1.32.9    xxxxxxxxxxxx   21.3MB
registry.k8s.io/metrics-server/metrics-server    v0.7.x     xxxxxxxxxxxx   17.6MB
registry.k8s.io/pause                            3.10       xxxxxxxxxxxx   320kB

Downloaded Binaries

Kubespray downloads Kubernetes binaries to /tmp/releases/ on each node:

ssh k8s-node1 tree /tmp/releases

Expected output:

/tmp/releases
├── kubeadm-v1.32.9-amd64
├── kubectl-v1.32.9-amd64
└── kubelet-v1.32.9-amd64

These binaries are cached on the node. During upgrades, new version binaries will be downloaded alongside the existing ones, and the copies installed in the node's binary directory (/usr/local/bin by default) will be replaced.

Kubernetes Certificate Inspection

Kubespray uses kubeadm to generate all cluster certificates. You can inspect them:

ssh k8s-node1 kubeadm certs check-expiration

This shows the expiration dates for all certificates. By default, kubeadm issues leaf certificates with a 1-year validity period, while the CA certificates are valid for 10 years. Kubespray automatically renews certificates during cluster upgrades, so regular upgrades also serve as a certificate rotation mechanism, which is another reason not to defer upgrades.

CoreDNS Verification

Verify that cluster DNS resolution is working:

kubectl run dnstest --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

Expected output:

Server:    10.233.0.3
Address 1: 10.233.0.3

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1

This confirms:

The kubelet is configured with 10.233.0.3 as the cluster DNS server and writes it into each pod's /etc/resolv.conf
CoreDNS is running and reachable at the service IP 10.233.0.3
The kubernetes service (the API server) correctly resolves to 10.233.0.1 (the first IP in the service CIDR)

Metrics Server Verification

# Check if the metrics API is available
kubectl top nodes

Expected output:

NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-node1   152m         7%     1124Mi          29%
k8s-node2   98m          4%     987Mi           25%
k8s-node3   87m          4%     945Mi           24%
k8s-node4   45m          2%     512Mi           13%

If kubectl top returns results, the Metrics Server is healthy and collecting resource utilization data from the kubelet's /metrics/resource endpoint on each node.

Understanding the K8S API Endpoint

Why API Endpoint Redundancy Matters

In a Kubernetes cluster, every component that interacts with the control plane ultimately talks to the kube-apiserver. This includes kubelet on every node, kube-proxy, kubectl commands from administrators, and any in-cluster workload that queries the Kubernetes API. If there is only a single kube-apiserver and it goes down, the entire cluster becomes unmanageable: new pods cannot be scheduled, services cannot be updated, and node health checks stop functioning.

When you deploy a multi-control-plane HA cluster (as we did in the previous section with three control plane nodes), you now have three kube-apiserver instances running on k8s-node1 (192.168.10.11), k8s-node2 (192.168.10.12), and k8s-node3 (192.168.10.13). But having multiple API servers alone is not enough — every client component on every node needs a strategy for how to reach these API servers and what to do when one of them fails.

This is the API endpoint problem, and Kubespray provides three distinct approaches to solving it.

Case 1: Client-Side Load Balancing with NGINX Static Pods

This is Kubespray’s default behavior when no external load balancer is configured. In this mode, Kubespray deploys an NGINX reverse proxy as a static pod on every worker node. This proxy listens on localhost:6443 and load-balances traffic across all control plane nodes (using least_conn, as the generated configuration shows).

How Worker-Side NGINX Proxy Works

When you deploy a cluster with Kubespray’s default settings, the following variable is implicitly set to true:

# inventory/mycluster/group_vars/all/all.yml
loadbalancer_apiserver_localhost: true  # default

With this setting enabled, Kubespray performs the following on every worker node during cluster deployment:

Generates an NGINX configuration file at /etc/nginx/nginx.conf that defines an upstream block pointing to all control plane nodes.
Creates a static pod manifest at /etc/kubernetes/manifests/nginx-proxy.yaml that runs an NGINX container using the host network.
Configures kubelet on the worker to use https://localhost:6443 as its API server endpoint.
Configures kube-proxy to also connect through https://localhost:6443.

The resulting architecture looks like this:

┌─────────────────────────────────────────────────────────────┐
│                        Worker Node (k8s-node4)              │
│                                                             │
│  ┌──────────┐      ┌────────────────────┐                   │
│  │ kubelet   │─────▶│  nginx-proxy       │                   │
│  │ kube-proxy│─────▶│  (static pod)      │                   │
│  └──────────┘      │  localhost:6443     │                   │
│                     └────────┬───────────┘                   │
└──────────────────────────────┼───────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                 ▼
     ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
     │  k8s-node1   │ │  k8s-node2   │ │  k8s-node3   │
     │  apiserver   │ │  apiserver   │ │  apiserver   │
     │ :6443        │ │ :6443        │ │ :6443        │
     └──────────────┘ └──────────────┘ └──────────────┘

Because the NGINX proxy runs as a static pod directly on the worker node, no external infrastructure is required. Each worker independently manages its own API server connection pool.

You can verify the nginx-proxy pod is running on each worker node:

$ kubectl get pods -A -o wide | grep nginx-proxy
kube-system   nginx-proxy-k8s-node4   1/1   Running   0   12m   192.168.10.14   k8s-node4   
kube-system   nginx-proxy-k8s-node5   1/1   Running   0   12m   192.168.10.15   k8s-node5

Note that control plane nodes do not run nginx-proxy — they connect directly to their own local kube-apiserver instance since kube-apiserver is already running on localhost:6443.

Configuration Deep Dive (nginx.conf.j2)

SSH into a worker node and inspect the generated NGINX configuration:

$ ssh k8s-node4 cat /etc/nginx/nginx.conf

error_log stderr notice;

worker_processes 1;
worker_rlimit_nofile 130048;
worker_shutdown_timeout 10s;

events {
  multi_accept on;
  use epoll;
  worker_connections 16384;
}

stream {
  upstream kube_apiserver {
    least_conn;
    server 192.168.10.11:6443;
    server 192.168.10.12:6443;
    server 192.168.10.13:6443;
  }

  server {
    listen        127.0.0.1:6443;
    proxy_pass    kube_apiserver;
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}

Key points to understand about this configuration:

stream block (not http): This is a Layer 4 (TCP) proxy, not Layer 7 (HTTP). NGINX forwards raw TCP connections without inspecting HTTP headers or TLS content. This is critical because the TLS handshake happens directly between the client (kubelet) and kube-apiserver — NGINX never terminates TLS.
least_conn load balancing: Distributes connections to the backend with the fewest active connections. This provides better distribution than round-robin when some API requests take longer than others.
listen 127.0.0.1:6443: The proxy only listens on localhost. It is not accessible from outside the node.
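proxy_connect_timeout 1s: If a control plane node does not accept the TCP connection within 1 second, NGINX gives up on it and tries the next backend.
proxy_timeout 10m: This is an idle timeout. A connection is closed only after 10 minutes with no data in either direction, so long-lived connections (such as kubectl exec or watch streams) stay open as long as traffic keeps flowing.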
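
A quick way to confirm that every control plane node made it into the upstream block is to extract the server entries from the file. The helper below is a small sketch (list_upstreams is a made-up name) that relies on the one-directive-per-line layout shown above.

```shell
# list_upstreams
# Reads an nginx.conf on stdin and prints the address of every `server`
# entry that appears inside an `upstream { ... }` block.
list_upstreams() {
  awk '
    $1 == "upstream"            { in_up = 1; next }
    in_up && $1 == "}"          { in_up = 0 }
    in_up && $1 == "server"     { sub(/;$/, "", $2); print $2 }
  '
}

# Usage:
#   ssh k8s-node4 cat /etc/nginx/nginx.conf | list_upstreams
```

The stream-level server { ... } block is ignored because it sits outside the upstream block.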

Now inspect the static pod manifest:

$ ssh k8s-node4 cat /etc/kubernetes/manifests/nginx-proxy.yaml

apiVersion: v1
kind: Pod
metadata:
  name: nginx-proxy
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: kube-nginx
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  nodeSelector:
    kubernetes.io/os: linux
  priorityClassName: system-node-critical
  containers:
    - name: nginx-proxy
      image: docker.io/library/nginx:x.y.z-alpine  # exact tag depends on the Kubespray release
      resources:
        requests:
          cpu: 25m
          memory: 32M
      # ... volume mounts for /etc/nginx/nginx.conf

The key setting here is hostNetwork: true, which means the pod uses the node's network namespace directly. This allows the NGINX process to bind to 127.0.0.1:6443 on the host, making it transparently available to kubelet and other local processes.

Verify that kubelet on the worker is indeed pointing to localhost:

$ ssh k8s-node4 cat /etc/kubernetes/kubelet.conf | grep server
    server: https://localhost:6443

Similarly, check the kube-proxy configuration:

$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep server
    server: https://localhost:6443

Failure Simulation: Control Plane Node Down

To verify that client-side load balancing works correctly during a control plane failure, we can stop one of the control plane nodes and observe that the cluster continues operating normally.

Step 1: Check current cluster state

$ kubectl get nodes
NAME        STATUS   ROLES           AGE   VERSION
k8s-node1   Ready    control-plane   35m   v1.32.9
k8s-node2   Ready    control-plane   33m   v1.32.9
k8s-node3   Ready    control-plane   31m   v1.32.9
k8s-node4   Ready    <none>          28m   v1.32.9
k8s-node5   Ready    <none>          28m   v1.32.9

Step 2: Stop kube-apiserver on k8s-node1

# On k8s-node1, move the static pod manifest to stop kube-apiserver
$ ssh k8s-node1 mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

Step 3: Verify the cluster is still functional

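# From admin-lb, kubectl has no load balancer in front of it yet. If your
# kubeconfig targets k8s-node1 (192.168.10.11), point kubectl at a surviving
# control plane node instead:
$ kubectl --server=https://192.168.10.12:6443 get nodes
NAME        STATUS     ROLES           AGE   VERSION
k8s-node1   Ready      control-plane   36m   v1.32.9
k8s-node2   Ready      control-plane   34m   v1.32.9
k8s-node3   Ready      control-plane   32m   v1.32.9
k8s-node4   Ready      <none>          29m   v1.32.9
k8s-node5   Ready      <none>          29m   v1.32.9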

The cluster remains fully operational because the NGINX proxy on each worker automatically routes traffic to the remaining two API servers (k8s-node2 and k8s-node3). The proxy_connect_timeout 1s setting ensures the failover happens almost instantly.

You can observe this in the NGINX error log on a worker:

$ ssh k8s-node4 crictl logs $(ssh k8s-node4 crictl ps --name nginx-proxy -q) 2>&1 | tail
... connect() failed (111: Connection refused) while connecting to upstream,
    upstream: "192.168.10.11:6443" ...

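NGINX could not connect to 192.168.10.11:6443, so it retried on another backend and temporarily marked the failed server as unavailable. This is passive health checking: with the default max_fails=1 and fail_timeout=10s, NGINX tries the failed backend again with live traffic after about 10 seconds.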

Step 4: Restore the API server

$ ssh k8s-node1 mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

Within a few seconds, kubelet on k8s-node1 will detect the manifest and restart kube-apiserver. NGINX on the workers will begin routing traffic to all three backends again.
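
To put a number on the failover, a probe loop can be left running on a worker while Steps 2 through 4 are carried out. The sketch below assumes that anonymous access to /readyz is allowed on the API server, and the function names are illustrative.

```shell
# probe_once: succeeds if the API server answers a readiness probe.
# Run on a worker so the request goes through nginx-proxy at localhost:6443.
probe_once() {
  curl -skf --max-time 2 -o /dev/null https://localhost:6443/readyz
}

# probe_api COUNT [INTERVAL_SECONDS]
# Probes the API COUNT times and reports how many probes failed.
probe_api() {
  local count="$1" interval="${2:-1}" i=0 failed=0
  while [ "$i" -lt "$count" ]; do
    if ! probe_once; then
      failed=$((failed + 1))
      echo "$(date +%T) probe failed"
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failed of $count probes failed"
  [ "$failed" -eq 0 ]
}

# Usage (on k8s-node4, 60 probes one second apart):
#   probe_api 60 1
```

A handful of failed probes right after the manifest is moved would be consistent with the 1-second connect timeout. A long run of failures suggests the proxy is not failing over.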

Pros of Client-Side Load Balancing:

No external infrastructure required (no separate load balancer to manage or maintain)
Fast failover (1-second connect timeout per backend)
Each worker is independently resilient — no shared single point of failure
Works in air-gapped or restricted network environments

Cons of Client-Side Load Balancing:

NGINX configuration must be updated on every worker when control plane nodes are added/removed (Kubespray handles this automatically during playbook runs)
External clients (such as administrators running kubectl from their laptops) must target a specific control plane node IP — there is no single virtual IP for external access
Certificate SAN management can become complex if you need external access through multiple entry points

Case 2: External LB (HAProxy) + Client-Side LB

In production environments, you typically want a single, stable endpoint that external clients (developers, CI/CD pipelines, monitoring systems) can use to reach the Kubernetes API. This is where an external load balancer comes in.

In Case 2, we add an external HAProxy load balancer on the admin-lb node (192.168.10.10) while keeping the client-side NGINX proxy on worker nodes. This gives you the best of both worlds: workers use the local NGINX proxy for resilient internal connectivity, and external users connect through HAProxy.

External Clients (kubectl, CI/CD, Monitoring)
                    │
                    ▼
        ┌───────────────────────┐
        │   admin-lb (HAProxy)  │
        │   192.168.10.10:6443  │
        └───────────┬───────────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ node1    │ │ node2    │ │ node3    │
  │ apiserver│ │ apiserver│ │ apiserver│
  └──────────┘ └──────────┘ └──────────┘
       ▲            ▲            ▲
       └────────────┼────────────┘
                    │
        ┌───────────────────────┐
        │  Worker nodes still   │
        │  use nginx-proxy      │
        │  (localhost:6443)     │
        └───────────────────────┘

Adding the External LB Endpoint

In our lab, HAProxy was already configured on the admin-lb node during the initial Vagrant provisioning (see Section 4). The HAProxy configuration in /etc/haproxy/haproxy.cfg includes:

frontend kubernetes-api
    bind *:6443
    mode tcp
    option tcplog
    default_backend kubernetes-api-backend

backend kubernetes-api-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-node1 192.168.10.11:6443 check fall 3 rise 2
    server k8s-node2 192.168.10.12:6443 check fall 3 rise 2
    server k8s-node3 192.168.10.13:6443 check fall 3 rise 2

HAProxy performs Layer 4 TCP proxying and health-checks each backend at a fixed interval (2 seconds by default, since no inter value is set on the server lines). If a control plane node fails three consecutive checks (fall 3), it is removed from the pool. When it passes two consecutive checks (rise 2), it is re-added.
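
As a back-of-the-envelope check, the time HAProxy needs to change a backend's state is roughly the number of required checks multiplied by the check interval. The snippet below is only that arithmetic. It assumes HAProxy's default inter of 2000 ms (the configuration above does not set one) and ignores the duration of each individual check.

```shell
# haproxy_transition_ms CHECKS INTER_MS
# Approximate time for a backend to change state: CHECKS consecutive health
# checks (fall or rise), spaced INTER_MS milliseconds apart.
haproxy_transition_ms() {
  echo $(( $1 * $2 ))
}

# Usage:
#   haproxy_transition_ms 3 2000   # fall 3 -> about 6000 ms until DOWN
#   haproxy_transition_ms 2 2000   # rise 2 -> about 4000 ms until UP again
```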

You can verify HAProxy is forwarding correctly:

$ curl -sk https://192.168.10.10:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}

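However, at this point, the kube-apiserver TLS certificate does not include 192.168.10.10 or the HAProxy domain name in its Subject Alternative Names (SANs). The request above succeeded only because -k skips certificate verification. Any client that verifies the certificate against the cluster CA (here copied from /etc/kubernetes/ssl/ca.crt on a control plane node to ./ca.crt) will reject the connection:

$ curl -v --cacert ./ca.crt https://192.168.10.10:6443/version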
* SSL: no alternative certificate subject name matches target host name '192.168.10.10'
curl: (60) SSL: no alternative certificate subject name matches target host name '192.168.10.10'

We must update the certificates to include the external LB address.

Certificate SAN Update for External LB IP/Domain

To make the kube-apiserver certificate valid when accessed through the HAProxy IP (192.168.10.10) or a domain name (k8s-api-srv.admin-lb.com), we need to add these as supplementary SANs.

Step 1: Update the Kubespray group variables

Edit inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml and add the following:

## Supplementary addresses that can be added in kubernetes ssl keys.
## That can be useful for example to setup a determinate endpoint
## for FQDN registrations and target that endpoint with HAProxy.
supplementary_addresses_in_ssl_keys:
  - 192.168.10.10
  - k8s-api-srv.admin-lb.com

This tells Kubespray to include these additional addresses in the SAN field of the kube-apiserver’s TLS certificate.

Step 2: Apply the certificate update

You do not need to re-run the entire cluster.yml playbook. Instead, run only the tasks tagged control-plane and limit the run to the control plane nodes:

$ cd /root/kubespray

$ ansible-playbook cluster.yml \
    --tags "control-plane" \
    --limit kube_control_plane \
    -e kube_version="1.32.9"

This playbook run will:

Regenerate the kube-apiserver TLS certificates with the new SANs
Restart kube-apiserver on each control plane node to pick up the new certificates
Leave everything else (etcd, worker nodes, CNI) untouched

The run typically completes in about 3–4 minutes.

Step 3: Verify the updated certificate

After the playbook completes, inspect the certificate SANs:

$ ssh k8s-node1 openssl x509 -in /etc/kubernetes/ssl/apiserver.crt -noout -text \
    | grep -A 20 "Subject Alternative Name"
            X509v3 Subject Alternative Name:
                DNS:k8s-node1, DNS:k8s-node2, DNS:k8s-node3,
                DNS:lb-apiserver.kubernetes.local,
                DNS:localhost, DNS:kubernetes, DNS:kubernetes.default,
                DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local,
                DNS:k8s-api-srv.admin-lb.com,
                IP Address:10.233.0.1, IP Address:192.168.10.11,
                IP Address:192.168.10.12, IP Address:192.168.10.13,
                IP Address:127.0.0.1, IP Address:192.168.10.10

The new entries k8s-api-srv.admin-lb.com and 192.168.10.10 are now present in the certificate SANs.
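
Rather than reading the SAN list by eye on each control plane node, the presence of the required entries can be checked with a small filter. This is a sketch that assumes the DNS: and IP Address: prefixes of the openssl text output shown above. missing_sans is a hypothetical helper name.

```shell
# missing_sans NAME_OR_IP...
# Reads `openssl x509 -noout -text` output on stdin and reports every
# requested name or IP address that is not present as a SAN entry.
missing_sans() {
  local sans want rc=0
  # One SAN per line, surrounding whitespace trimmed.
  sans="$(tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
  for want in "$@"; do
    if ! printf '%s\n' "$sans" | grep -Fxq -e "DNS:$want" -e "IP Address:$want"; then
      echo "missing SAN: $want"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   ssh k8s-node1 openssl x509 -in /etc/kubernetes/ssl/apiserver.crt -noout -text \
#     | missing_sans 192.168.10.10 k8s-api-srv.admin-lb.com
```

Matching whole entries (grep -x) keeps 192.168.10.1 from being mistaken for 192.168.10.10.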

Step 4: Test external access through HAProxy

# Using IP address
$ curl -sk https://192.168.10.10:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}

# Using domain name (ensure DNS or /etc/hosts resolves this)
$ curl -sk https://k8s-api-srv.admin-lb.com:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}

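Both endpoints respond. Because -k skips certificate verification, this only confirms reachability through HAProxy. To confirm that the new SANs are accepted, replace -k with --cacert pointing at the cluster CA certificate.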

Step 5: Update the admin kubeconfig to use the external LB

Now that HAProxy is a valid entry point, update the admin kubeconfig on the admin-lb node to use it:

# Copy the kubeconfig from a control plane node
$ scp k8s-node1:/etc/kubernetes/admin.conf /root/.kube/config

# Update the server URL to point to HAProxy
$ kubectl config set-cluster cluster.local \
    --server=https://k8s-api-srv.admin-lb.com:6443

Or manually edit ~/.kube/config:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: 
    server: https://k8s-api-srv.admin-lb.com:6443  # Changed from https://192.168.10.11:6443
  name: cluster.local

Now all kubectl commands from the admin-lb node will go through HAProxy, which distributes them across all three control plane nodes.

Failure Simulation with External LB

With the external LB in place, let’s simulate a more realistic failure scenario.

Step 1: Observe HAProxy backend status

Open the HAProxy stats page in a browser:

http://192.168.10.10:9000/haproxy_stats

You should see all three backends (k8s-node1, k8s-node2, k8s-node3) in a green/UP state.

Step 2: Shut down an entire control plane node

# From the host machine (outside the VMs)
$ vagrant halt k8s-node1

Step 3: Verify HAProxy detects the failure

Refresh the HAProxy stats page. Within a few seconds (determined by fall 3 and the health check interval), k8s-node1 will transition to a red/DOWN state. The remaining two backends stay UP.
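
The same information is available without a browser, since HAProxy can serve its stats page as CSV when ;csv is appended to the URL. The sketch below locates the relevant columns by header name. It assumes the stats endpoint from Step 1 is reachable without authentication, and backend_status is an illustrative name.

```shell
# backend_status
# Reads HAProxy stats CSV on stdin and prints "proxy/server status" for each
# real server (the FRONTEND and BACKEND summary rows are skipped).
backend_status() {
  awk -F',' '
    /^# / {                            # header row: "# pxname,svname,..."
      sub(/^# /, "")
      for (i = 1; i <= NF; i++) {
        if ($i == "pxname") p = i
        if ($i == "svname") s = i
        if ($i == "status") st = i
      }
      next
    }
    st && NF > 1 && $s != "FRONTEND" && $s != "BACKEND" { print $p "/" $s, $st }
  '
}

# Usage:
#   curl -s 'http://192.168.10.10:9000/haproxy_stats;csv' | backend_status
```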

Step 4: Verify cluster operations continue

$ kubectl get nodes
NAME        STATUS     ROLES           AGE   VERSION
k8s-node1   NotReady   control-plane   45m   v1.32.9
k8s-node2   Ready      control-plane   43m   v1.32.9
k8s-node3   Ready      control-plane   41m   v1.32.9
k8s-node4   Ready      <none>          38m   v1.32.9
k8s-node5   Ready      <none>          38m   v1.32.9

$ kubectl create deployment test-nginx --image=nginx --replicas=2
deployment.apps/test-nginx created

$ kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE
test-nginx-7c79c4bf97-abc12   1/1     Running   0          10s   10.233.90.5   k8s-node4
test-nginx-7c79c4bf97-def34   1/1     Running   0          10s   10.233.91.3   k8s-node5

The cluster is fully operational. Both HAProxy (for external access) and the NGINX proxy on workers (for internal communication) have seamlessly failed over to the remaining control plane nodes.

Step 5: Restore the node

$ vagrant up k8s-node1

After the node boots and kubelet starts, it will rejoin the cluster. HAProxy will detect the restored backend via health checks (rise 2) and begin routing traffic to it again.

# Clean up the test deployment
$ kubectl delete deployment test-nginx

Summary of Case 2:

In this configuration:

Worker nodes use the local nginx-proxy (localhost:6443) — unchanged from Case 1
External clients (kubectl, CI/CD, monitoring) use HAProxy (192.168.10.10:6443)
Certificates include both the control plane IPs and the HAProxy IP/domain as SANs
Two independent failover mechanisms protect the cluster: HAProxy for external traffic, nginx-proxy for internal traffic

This is a common production configuration because it adds a stable external endpoint without requiring changes to the internal cluster networking model.

Case 3: External LB as Single Endpoint for All Components

In some environments, teams prefer a fully centralized approach where every component — including kubelet and kube-proxy on worker nodes — connects to the API server exclusively through the external load balancer. This eliminates the per-node NGINX proxy and creates a simpler, more uniform architecture.

All Clients (kubelet, kube-proxy, kubectl, CI/CD)
                    │
                    ▼
        ┌───────────────────────┐
        │   admin-lb (HAProxy)  │
        │   192.168.10.10:6443  │
        └───────────┬───────────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
  ┌──────────┐ ┌──────────┐ ┌──────────┐
  │ node1    │ │ node2    │ │ node3    │
  │ apiserver│ │ apiserver│ │ apiserver│
  └──────────┘ └──────────┘ └──────────┘

In this mode, no nginx-proxy static pods exist on worker nodes. Every API request from every node passes through HAProxy.

Disabling Client-Side LB (loadbalancer_apiserver_localhost: false)

To switch to external-LB-only mode, update the Kubespray configuration with the following variables:

Edit inventory/mycluster/group_vars/all/all.yml:

## External LB configuration
## This domain name will be used in kubeconfigs, kubelet configs, and
## kube-proxy configs on all nodes.
apiserver_loadbalancer_domain_name: "k8s-api-srv.admin-lb.com"

## External load balancer address and port
loadbalancer_apiserver:
  address: 192.168.10.10
  port: 6443

## CRITICAL: Disable client-side (localhost) load balancing
## When set to false, nginx-proxy static pods will be removed from worker nodes
## and all components will use the external LB endpoint instead.
loadbalancer_apiserver_localhost: false

Make sure the supplementary_addresses_in_ssl_keys from Case 2 is still present in k8s-cluster.yml:

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
supplementary_addresses_in_ssl_keys:
  - 192.168.10.10
  - k8s-api-srv.admin-lb.com

Apply the changes:

Since we are changing a fundamental cluster networking parameter, this requires a broader playbook run:

$ cd /root/kubespray

$ ansible-playbook cluster.yml \
    -e kube_version="1.32.9"

This playbook run will:

Remove the nginx-proxy static pod manifest from all worker nodes
Remove the /etc/nginx/nginx.conf file from worker nodes
Regenerate kubelet configuration on all nodes to point to https://k8s-api-srv.admin-lb.com:6443
Update the kube-proxy ConfigMap server endpoint
Regenerate all kubeconfig files (admin.conf, kubelet.conf, scheduler.conf, controller-manager.conf) with the new server URL
Restart kubelet on all nodes to pick up the new configuration

⚠️ Important: This is a disruptive change. During the playbook run, worker nodes will be reconfigured to point to the external LB. If HAProxy is not running or the DNS name does not resolve, worker nodes will lose connectivity to the API server. Always verify HAProxy is healthy before applying this change.
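
A pre-flight check along these lines can catch both failure modes before the playbook touches any node. This is a sketch, not part of Kubespray: the function names are made up, getent is assumed to be available (it is on glibc-based distributions), and the probe assumes anonymous access to /readyz.

```shell
# resolve_name NAME: succeeds if NAME resolves via DNS or /etc/hosts.
resolve_name() {
  getent hosts "$1" >/dev/null
}

# probe_lb NAME: succeeds if the API server answers through the load balancer.
probe_lb() {
  curl -skf --max-time 3 -o /dev/null "https://$1:6443/readyz"
}

# preflight_external_lb NAME: runs both checks and reports each result.
preflight_external_lb() {
  local name="$1" rc=0
  if resolve_name "$name"; then
    echo "OK   $name resolves"
  else
    echo "FAIL $name does not resolve"
    rc=1
  fi
  if probe_lb "$name"; then
    echo "OK   API reachable via $name:6443"
  else
    echo "FAIL API not reachable via $name:6443"
    rc=1
  fi
  return "$rc"
}

# Usage (run on every node before the playbook):
#   preflight_external_lb k8s-api-srv.admin-lb.com
```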

Verifying kubelet and kube-proxy Endpoint Configuration

After the playbook completes, systematically verify that every component is pointing to the external LB.

1. Verify nginx-proxy pods are gone from worker nodes

$ kubectl get pods -A -o wide | grep nginx-proxy
# No output — nginx-proxy pods have been removed

Confirm the static pod manifest no longer exists on worker nodes:

$ ssh k8s-node4 ls /etc/kubernetes/manifests/
# nginx-proxy.yaml should NOT be listed

And the NGINX configuration file has been cleaned up:

$ ssh k8s-node4 cat /etc/nginx/nginx.conf
cat: /etc/nginx/nginx.conf: No such file or directory

2. Verify kubelet configuration on worker nodes

$ ssh k8s-node4 cat /etc/kubernetes/kubelet.conf | grep server
    server: https://k8s-api-srv.admin-lb.com:6443

Previously (in Case 1 and Case 2), this was https://localhost:6443. Now it points directly to the external LB domain name.

Check another worker to confirm consistency:

$ ssh k8s-node5 cat /etc/kubernetes/kubelet.conf | grep server
    server: https://k8s-api-srv.admin-lb.com:6443

3. Verify kubelet configuration on control plane nodes

Control plane nodes have a slightly different behavior. Depending on the Kubespray version and settings, control plane kubelet may point to localhost:6443 (directly to the local kube-apiserver) or to the external LB. Verify:

$ ssh k8s-node1 cat /etc/kubernetes/kubelet.conf | grep server
    server: https://k8s-api-srv.admin-lb.com:6443

In Case 3, even control plane nodes route through the external LB for consistency (though the request may loop back to the local apiserver via HAProxy).

4. Verify kube-proxy configuration

$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep server
    server: https://k8s-api-srv.admin-lb.com:6443

5. Verify admin kubeconfig

$ ssh k8s-node1 cat /etc/kubernetes/admin.conf | grep server
    server: https://k8s-api-srv.admin-lb.com:6443
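
The grep checks in steps 2 through 5 can be folded into one helper that compares the server URL against the expected endpoint. The following is a minimal sketch with illustrative function names. It only looks at the first server: entry in whatever kubeconfig is piped in.

```shell
# server_url: prints the first `server:` URL found in a kubeconfig on stdin.
server_url() {
  awk '$1 == "server:" { print $2; exit }'
}

# expect_server EXPECTED_URL LABEL
# Compares the kubeconfig on stdin against EXPECTED_URL and reports the result.
expect_server() {
  local actual
  actual="$(server_url)"
  if [ "$actual" = "$1" ]; then
    echo "OK   $2 -> $actual"
  else
    echo "FAIL $2 -> ${actual:-none} (expected $1)"
    return 1
  fi
}

# Usage:
#   LB=https://k8s-api-srv.admin-lb.com:6443
#   ssh k8s-node4 cat /etc/kubernetes/kubelet.conf | expect_server "$LB" k8s-node4/kubelet
#   ssh k8s-node1 cat /etc/kubernetes/admin.conf   | expect_server "$LB" k8s-node1/admin
```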

6. Test API access from multiple paths

# Through HAProxy IP
$ curl -sk https://192.168.10.10:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}

# Through HAProxy domain name
$ curl -sk https://k8s-api-srv.admin-lb.com:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}

# Directly to a control plane node (still works — useful for debugging)
$ curl -sk https://192.168.10.11:6443/version
{
  "major": "1",
  "minor": "32",
  "gitVersion": "v1.32.9",
  ...
}
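The three curl checks above can also be scripted. Here is a minimal Python sketch, not part of the original lab: the helper names fetch_version and versions_agree are hypothetical, and certificate verification is switched off only to mirror curl -k.

```python
import json
import ssl
import urllib.request

# Endpoints taken from the curl examples above.
ENDPOINTS = [
    "https://192.168.10.10:6443/version",             # HAProxy IP
    "https://k8s-api-srv.admin-lb.com:6443/version",  # HAProxy domain name
    "https://192.168.10.11:6443/version",             # control plane node, direct
]


def fetch_version(url, timeout=5):
    """Return the parsed /version document (hypothetical helper)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # same effect as curl -k
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return json.load(resp)


def versions_agree(responses):
    """True when every endpoint reports the same gitVersion."""
    versions = {doc.get("gitVersion") for doc in responses.values()}
    return len(versions) == 1 and None not in versions
```

Run from admin-lb, versions_agree({url: fetch_version(url) for url in ENDPOINTS}) should return True when all three paths reach the same cluster.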

7. Verify cluster health

$ kubectl get nodes -o wide
NAME        STATUS   ROLES           AGE   VERSION    INTERNAL-IP      OS-IMAGE
k8s-node1   Ready    control-plane   50m   v1.32.9    192.168.10.11    Rocky Linux 10.0
k8s-node2   Ready    control-plane   48m   v1.32.9    192.168.10.12    Rocky Linux 10.0
k8s-node3   Ready    control-plane   46m   v1.32.9    192.168.10.13    Rocky Linux 10.0
k8s-node4   Ready    <none>          43m   v1.32.9    192.168.10.14    Rocky Linux 10.0
k8s-node5   Ready    <none>          43m   v1.32.9    192.168.10.15    Rocky Linux 10.0

$ kubectl get pods -A
NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
kube-system   coredns-xxxxxxxxx-xxxxx             1/1     Running   0          50m
kube-system   coredns-xxxxxxxxx-xxxxx             1/1     Running   0          50m
kube-system   kube-apiserver-k8s-node1            1/1     Running   0          50m
kube-system   kube-apiserver-k8s-node2            1/1     Running   0          48m
kube-system   kube-apiserver-k8s-node3            1/1     Running   0          46m
kube-system   kube-controller-manager-k8s-node1   1/1     Running   0          50m
kube-system   kube-controller-manager-k8s-node2   1/1     Running   0          48m
kube-system   kube-controller-manager-k8s-node3   1/1     Running   0          46m
kube-system   kube-proxy-xxxxx                    1/1     Running   0          43m
kube-system   kube-proxy-xxxxx                    1/1     Running   0          43m
kube-system   kube-proxy-xxxxx                    1/1     Running   0          50m
kube-system   kube-proxy-xxxxx                    1/1     Running   0          48m
kube-system   kube-proxy-xxxxx                    1/1     Running   0          46m
kube-system   kube-scheduler-k8s-node1            1/1     Running   0          50m
kube-system   kube-scheduler-k8s-node2            1/1     Running   0          48m
kube-system   kube-scheduler-k8s-node3            1/1     Running   0          46m

Notice that there are no nginx-proxy-* pods in the listing. All traffic now flows through the external HAProxy.

Failure Domain Consideration in Case 3:

In this architecture, HAProxy becomes a critical single point of failure for the entire cluster. If the admin-lb node goes down:

Worker nodes lose API server connectivity (kubelet cannot report node status)
kube-proxy cannot receive Service/Endpoint updates
kubectl commands from external clients fail
New pods cannot be scheduled

In production, you would mitigate this by:

Running HAProxy in a highly available pair with keepalived (VRRP) for a floating virtual IP
Using a cloud provider’s managed load balancer (AWS NLB, GCP Internal LB, Azure LB)
Deploying multiple HAProxy instances behind DNS round-robin

Choosing the Right Configuration for Your Environment

For most production deployments, Case 2 (External LB + Client-Side LB) is the recommended approach. It provides two independent failover paths, no single point of failure, and a stable external endpoint for administrative and CI/CD access. The slight additional complexity of running nginx-proxy on workers is fully managed by Kubespray and requires no manual intervention.

Case 3 (External LB Only) is appropriate when:

You have a highly available load balancer infrastructure (cloud-managed LB, keepalived pair)
You want a simpler mental model where all traffic follows a single path
Your organization’s network team manages the LB and prefers centralized control

Case 1 (Client-Side Only) is best suited for:

Development and testing environments
Air-gapped or isolated networks with no LB infrastructure
Temporary clusters where simplicity is prioritized over external accessibility

In the remaining sections of this guide, we will continue with the Case 2 configuration (external HAProxy + client-side NGINX proxy) as it is the most representative of real-world production setups.

Node Management

In a production Kubernetes cluster, nodes are not static. Worker nodes need to be scaled out to handle increased workloads, decommissioned when no longer needed, or replaced when hardware fails. Control plane nodes occasionally require rotation — for OS patching, hardware refresh, or disaster recovery. Kubespray provides dedicated playbooks for each of these lifecycle operations, making it possible to manage nodes declaratively and repeatably.

This section walks through every major node management scenario: adding workers, gracefully removing them, force-removing unhealthy nodes that have gone offline, replacing a control plane node, and performing a full cluster reset.

Lab Context Recap

Our cluster currently has 3 control plane nodes (k8s-node1, k8s-node2, k8s-node3) and 1 worker node (k8s-node4). The admin/LB node (admin-lb) runs HAProxy and Kubespray. Kubernetes version is v1.32.9.

Adding a Worker Node (scale.yml)

When your cluster needs more compute capacity, Kubespray’s scale.yml playbook lets you add new worker nodes without disrupting existing workloads. Unlike cluster.yml, which operates on the entire cluster, scale.yml is purpose-built for incremental node addition — it only touches the new node and the minimal set of existing resources needed to integrate it.

Scale Playbook Walkthrough

Before running the playbook, let’s understand what scale.yml does under the hood. The playbook executes the following sequence of operations:

Download — Pulls the required container images and binaries to the new node (containerd images, CNI plugins, kubelet binary, kubeadm binary).
etcd (conditional) — If the new node is an etcd member (not the case for worker-only additions), it would configure and join the etcd cluster. For worker nodes, this step is skipped entirely.
Install kubelet — Installs and configures the kubelet service on the new node, including the kubelet configuration file, systemd unit, and certificate bootstrap token.
Upload control plane certificates — Copies the necessary CA certificates from an existing control plane node so the new worker can establish trust with the API server.
kubeadm join — Runs kubeadm join on the new node, which registers it with the cluster and starts the kubelet.
Apply labels and taints — Applies any node labels or taints defined in the inventory.
Configure CNI — Deploys the CNI plugin (Flannel in our case) so the node can participate in the pod network.

Important: scale.yml can only be used for adding worker nodes. You cannot use scale.yml to add new control plane nodes. For control plane additions, you must use cluster.yml — this is covered in Section 7.4.

Step-by-Step Execution and Verification

Step 1: Update the Inventory

First, add the new node to inventory.ini. We'll add k8s-node5 as a worker:

# inventory/mycluster/inventory.ini

[all]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11
k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12
k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13
k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14
k8s-node5 ansible_host=192.168.10.15 ip=192.168.10.15  # <-- NEW

[kube_control_plane]
k8s-node1
k8s-node2
k8s-node3

[etcd:children]
kube_control_plane

[kube_node]
k8s-node4
k8s-node5  # <-- NEW

[k8s_cluster:children]
kube_control_plane
kube_node
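Before running any playbook, it can help to sanity-check the edited inventory. The sketch below is a simplified stand-in for Ansible's own inventory parsing (it only understands the flat layout shown above), and the function names are hypothetical. It flags hosts that appear in a group but were never declared under [all].

```python
def parse_inventory(text):
    """Parse a flat INI-style inventory into {group: [first token of each entry]}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments such as "# <-- NEW"
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(line.split()[0])
    return groups


def undeclared_hosts(groups):
    """Hosts listed in kube_control_plane or kube_node but missing from [all]."""
    declared = set(groups.get("all", []))
    members = groups.get("kube_control_plane", []) + groups.get("kube_node", [])
    return [host for host in members if host not in declared]


INVENTORY = """\
[all]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11
k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12
k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13
k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14
k8s-node5 ansible_host=192.168.10.15 ip=192.168.10.15  # <-- NEW

[kube_control_plane]
k8s-node1
k8s-node2
k8s-node3

[kube_node]
k8s-node4
k8s-node5  # <-- NEW
"""

print(undeclared_hosts(parse_inventory(INVENTORY)))  # prints []
```

An empty list means every group member has connection details in [all]; a forgotten [all] entry for k8s-node5 would show up here before Ansible ever runs.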

Step 2: Run the Scale Playbook

Execute scale.yml with the --limit flag to target only the new node. Always specify the Kubernetes version explicitly to ensure consistency:

cd /root/kubespray

ansible-playbook scale.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  --limit=k8s-node5 \
  -e kube_version="1.32.9"

The playbook typically completes in about 3 minutes for a single worker node. You’ll see output similar to:

PLAY RECAP ***********************************************************************
k8s-node5                  : ok=198  changed=62   unreachable=0    failed=0    skipped=412  rescued=0    ignored=0
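If you drive Kubespray from automation, you may want to check the recap programmatically instead of reading it by eye. This is a rough sketch that assumes the recap layout shown above; parse_recap and failed_hosts are hypothetical helper names, not part of Ansible.

```python
import re

# One host line of a PLAY RECAP: "<host> : ok=1 changed=2 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output):
    """Return {host: {counter: int}} for each host line of a PLAY RECAP."""
    stats = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            stats[match.group("host")] = {key: int(value) for key, value in pairs}
    return stats


def failed_hosts(stats):
    """Hosts with at least one failed task or that were unreachable."""
    return sorted(
        host for host, counters in stats.items()
        if counters.get("failed", 0) or counters.get("unreachable", 0)
    )


RECAP = """\
PLAY RECAP ***********************************************************************
k8s-node5                  : ok=198  changed=62   unreachable=0    failed=0    skipped=412  rescued=0    ignored=0
"""

print(failed_hosts(parse_recap(RECAP)))  # prints []
```

In practice the exit code of ansible-playbook already signals failure; parsing the recap is mainly useful when you want per-host detail in a report.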

Step 3: Verify the New Node

Once the playbook completes, verify the node has joined the cluster:

kubectl get nodes -o wide

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP      OS-IMAGE            KERNEL-VERSION
k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11    Rocky Linux 10.0    ...
k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12    Rocky Linux 10.0    ...
k8s-node3   Ready    control-plane   1d     v1.32.9   192.168.10.13    Rocky Linux 10.0    ...
k8s-node4   Ready    <none>          1d     v1.32.9   192.168.10.14    Rocky Linux 10.0    ...
k8s-node5   Ready    <none>          30s    v1.32.9   192.168.10.15    Rocky Linux 10.0    ...

The new k8s-node5 should appear with STATUS: Ready and the correct Kubernetes version. You can also verify that the CNI plugin is running:

kubectl get pods -n kube-system -o wide | grep flannel

kube-flannel-xxxxx   1/1   Running   0   45s   192.168.10.15   k8s-node5   ...

If you’re using client-side load balancing (the default Kubespray behavior), you’ll also see an nginx-proxy static pod on the new worker:

kubectl get pods -n kube-system -o wide | grep nginx-proxy

nginx-proxy-k8s-node4   1/1   Running   0   1d    192.168.10.14   k8s-node4   ...
nginx-proxy-k8s-node5   1/1   Running   0   45s   192.168.10.15   k8s-node5   ...

This nginx-proxy is configured to load-balance API requests across all three control plane nodes, ensuring the new worker has full HA connectivity to the API server from the moment it joins.
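The Ready and version checks from Step 3 can be automated against the JSON form of the node list (kubectl get nodes -o json). A minimal sketch follows; the function name is hypothetical and the sample data is made up for illustration, but the fields it reads (status.conditions and status.nodeInfo.kubeletVersion) are part of the Node API.

```python
def nodes_needing_attention(node_list, expected_version):
    """Return {node name: reason} for nodes that are not Ready or run another kubelet version."""
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        status = node.get("status", {})
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        version = status.get("nodeInfo", {}).get("kubeletVersion")
        if not ready:
            problems[name] = "not Ready"
        elif version != expected_version:
            problems[name] = f"kubelet {version}, expected {expected_version}"
    return problems


# Illustrative sample, trimmed to the fields the function reads.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "k8s-node4"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "nodeInfo": {"kubeletVersion": "v1.32.9"},
            },
        },
        {
            "metadata": {"name": "k8s-node5"},
            "status": {
                "conditions": [{"type": "Ready", "status": "False"}],
                "nodeInfo": {"kubeletVersion": "v1.32.9"},
            },
        },
    ]
}

print(nodes_needing_attention(SAMPLE, "v1.32.9"))  # prints {'k8s-node5': 'not Ready'}
```

A freshly joined node can report not Ready for a short while until the CNI pod on it is running, so a retry loop around this check is reasonable.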

Removing a Worker Node (remove-node.yml)

When decommissioning a worker node — whether for cost optimization, hardware retirement, or cluster right-sizing — Kubespray’s remove-node.yml playbook handles the full lifecycle: draining workloads, cleaning up cluster metadata, and resetting the node.

Graceful Removal with PDB Considerations

Before diving into the removal process, it’s critical to understand how PodDisruptionBudget (PDB) objects can affect node removal.

When Kubespray drains a node, it runs the equivalent of kubectl drain --ignore-daemonsets --delete-emptydir-data. This eviction process respects PDB constraints. If a PDB specifies maxUnavailable: 0 for a set of pods, the drain operation will block indefinitely because Kubernetes refuses to evict pods that would violate the disruption budget.

Example scenario: Suppose you have a deployment with 2 replicas and a PDB that says maxUnavailable: 0:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app

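With maxUnavailable: 0, the PDB never permits a voluntary eviction, so the drain hangs as soon as even one matching replica runs on the node you’re draining, no matter where the other replica lives. (A less strict PDB blocks the drain only when evicting a pod would push the available count below the PDB requirement.) In the Kubespray playbook output, you’ll see the drain task stall with no progress.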

Mitigation strategies:

Before removing a node, check for PDBs: kubectl get pdb --all-namespaces
Temporarily adjust the PDB to allow disruption: maxUnavailable: 1
Or manually reschedule pods off the target node before starting the removal
Use kubectl drain --timeout=300s to set a timeout on the drain (Kubespray's default behavior includes a drain timeout, but verify your configuration)
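The first mitigation, checking PDBs before you start, is easy to script. The sketch below works on the parsed output of kubectl get pdb --all-namespaces -o json and reports every PDB whose status.disruptionsAllowed is currently 0, meaning an eviction of a matching pod would be refused. The function name and the sample data are illustrative assumptions.

```python
def blocking_pdbs(pdb_list):
    """Return (namespace, name) for every PDB that currently allows zero disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        allowed = item.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            meta = item["metadata"]
            blocked.append((meta["namespace"], meta["name"]))
    return blocked


# Illustrative sample, trimmed to the fields the function reads.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "my-app-pdb"},
            "status": {"disruptionsAllowed": 0},
        },
        {
            "metadata": {"namespace": "kube-system", "name": "coredns"},
            "status": {"disruptionsAllowed": 1},
        },
    ]
}

print(blocking_pdbs(SAMPLE))  # prints [('default', 'my-app-pdb')]
```

A PDB in this list only matters if one of its pods actually runs on the node being drained, so treat the result as a list of candidates to inspect, not as proof that the drain will hang.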

Remove Playbook Walkthrough

The remove-node.yml playbook executes the following steps:

Confirmation prompt — Unless skip_confirmation=true is passed, the playbook asks for manual confirmation before proceeding. This is a safety mechanism to prevent accidental node removal.
Drain the node — Cordons the node (marks it as unschedulable) and evicts all non-DaemonSet pods. Workloads are rescheduled to other available nodes.
Remove etcd member — If the node is an etcd member, it is removed from the etcd cluster. For worker-only nodes, this step is skipped.
kubeadm reset — Runs kubeadm reset on the target node, which tears down the kubelet, removes certificates, and cleans up local Kubernetes state.
Delete node metadata — Removes the Node object from the Kubernetes API so it no longer appears in kubectl get nodes.

Execution:

Let’s remove the k8s-node5 worker node we just added:

cd /root/kubespray

ansible-playbook remove-node.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e node=k8s-node5 \
  -e skip_confirmation=true \
  -e kube_version="1.32.9"

Key parameters:

Parameter                    Description
-e node=k8s-node5            Specifies which node to remove. Can be a comma-separated list for multiple nodes.
-e skip_confirmation=true    Bypasses the interactive confirmation prompt. Useful for automation.

The playbook typically completes in about 2 minutes. Output:

PLAY RECAP ***********************************************************************
k8s-node1                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
k8s-node5                  : ok=15   changed=9    unreachable=0    failed=0    skipped=18   rescued=0    ignored=0

Notice that k8s-node1 (a control plane node) also shows activity — this is because the playbook connects to a control plane node to execute kubectl drain and kubectl delete node commands.

Verification:

kubectl get nodes -o wide

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP      OS-IMAGE            KERNEL-VERSION
k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11    Rocky Linux 10.0    ...
k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12    Rocky Linux 10.0    ...
k8s-node3   Ready    control-plane   1d     v1.32.9   192.168.10.13    Rocky Linux 10.0    ...
k8s-node4   Ready    <none>          1d     v1.32.9   192.168.10.14    Rocky Linux 10.0    ...

k8s-node5 is gone. Any pods that were running on it have been rescheduled to k8s-node4 or one of the control plane nodes (if they tolerate control plane taints).

Don’t forget to update the inventory. After removing a node, clean up inventory.ini by removing the k8s-node5 entry from both the [all] and [kube_node] groups. While Kubespray won't break if a removed node remains in the inventory, keeping the inventory in sync with reality is a best practice for Day-2 operations.

Force-Removing an Unhealthy Node

Not all node removals are graceful. In production, you will encounter scenarios where a node has suffered a catastrophic failure — hardware crash, kernel panic, network partition, or a cloud instance that simply vanished. In these cases, you can’t SSH into the node to run kubeadm reset, and a normal remove-node.yml run will fail because Ansible can't reach the target.

Kubespray handles this with two special flags: reset_nodes=false and allow_ungraceful_removal=true.

Simulating a Node Failure

To demonstrate force removal, let’s simulate a hard failure of k8s-node5. First, add it back to the cluster (if you removed it in the previous section), then force-stop the VM:

# On the Vagrant host (not admin-lb)
vagrant halt k8s-node5 --force

Or if you’re simulating in a different environment, simply power off the machine:

# On the target node itself (before it goes down)
sudo systemctl poweroff --force

After the node goes offline, Kubernetes will eventually mark it as NotReady:

kubectl get nodes

NAME        STATUS     ROLES           AGE    VERSION
k8s-node1   Ready      control-plane   1d     v1.32.9
k8s-node2   Ready      control-plane   1d     v1.32.9
k8s-node3   Ready      control-plane   1d     v1.32.9
k8s-node4   Ready      <none>          1d     v1.32.9
k8s-node5   NotReady   <none>          10m    v1.32.9

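The node transitions to NotReady once the node controller has gone without kubelet heartbeats for its grace period (default: 40 seconds). Pods are not rescheduled immediately: the node is tainted as not-ready or unreachable, and pods are evicted only after their toleration for those taints expires, which is 300 seconds by default. (Older guides refer to a pod-eviction-timeout setting here; current Kubernetes releases use this taint-based eviction instead.)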

Using reset_nodes=false and allow_ungraceful_removal=true

Now attempt a normal removal — it will fail because Ansible cannot SSH into the offline node:

# This will FAIL:
ansible-playbook remove-node.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e node=k8s-node5 \
  -e skip_confirmation=true

fatal: [k8s-node5]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh..."}

To force-remove the node without SSH access, use the ungraceful removal flags:

ansible-playbook remove-node.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e node=k8s-node5 \
  -e skip_confirmation=true \
  -e reset_nodes=false \
  -e allow_ungraceful_removal=true

Key parameters:

Parameter                           Description
-e reset_nodes=false                Skips the kubeadm reset step on the target node. Since the node is unreachable, there's nothing to reset remotely.
-e allow_ungraceful_removal=true    Permits removing the node from the cluster even though it couldn't be drained or cleaned up. Kubespray will only remove the cluster-side metadata (Node object, etcd member if applicable).

What happens during ungraceful removal:

Drain is skipped — Since the node is unreachable, the drain step is bypassed. Pods that were running on the dead node will remain in Terminating state until the garbage collection timeout expires (or until the Node object is deleted).
kubeadm reset is skipped — No SSH to the node, so no local cleanup.
Node metadata is deleted — The Node object is removed from the Kubernetes API. Once deleted, all pods that were “stuck” on that node are freed for rescheduling by their respective controllers (Deployments, StatefulSets, etc.).
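To keep the graceful and ungraceful variants from drifting apart in your runbooks, you could generate the command from a single helper. This is only a sketch: remove_node_command is a hypothetical function, it uses just the flags shown in this section, and how you determine reachability (for example with an Ansible ping) is left to you.

```python
import shlex


def remove_node_command(node, reachable, inventory="inventory/mycluster/inventory.ini"):
    """Build the remove-node.yml invocation for a reachable or unreachable node."""
    cmd = [
        "ansible-playbook", "remove-node.yml",
        "--become",
        "-i", inventory,
        "-e", f"node={node}",
        "-e", "skip_confirmation=true",
    ]
    if not reachable:
        # The node cannot be reached over SSH: skip the on-node reset and
        # allow removal of the cluster-side metadata only.
        cmd += ["-e", "reset_nodes=false", "-e", "allow_ungraceful_removal=true"]
    return cmd


print(shlex.join(remove_node_command("k8s-node5", reachable=False)))
```

The list form can be passed directly to subprocess.run; shlex.join is used here only to print a copy-pasteable command line.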

Verification:

kubectl get nodes

NAME        STATUS   ROLES           AGE    VERSION
k8s-node1   Ready    control-plane   1d     v1.32.9
k8s-node2   Ready    control-plane   1d     v1.32.9
k8s-node3   Ready    control-plane   1d     v1.32.9
k8s-node4   Ready    <none>          1d     v1.32.9

The NotReady node is gone. Workloads are rescheduled.

Caution: If the “dead” node later comes back online (e.g., after a network partition heals), its kubelet will try to re-register with the API server using its old certificates. This can cause unexpected behavior. Best practice is to fully wipe and reprovision the node before allowing it to rejoin the cluster, or ensure its kubelet service is disabled/stopped.

Replacing a Control Plane Node

Replacing a control plane node is the most complex node management operation. Unlike workers, control plane nodes host critical static pods (kube-apiserver, kube-controller-manager, kube-scheduler) and may be etcd members. The replacement process involves:

Removing the old control plane node (including its etcd membership)
Updating the inventory
Adding the new node as a control plane member using cluster.yml

Kubespray handles most of the complexity, but there are important ordering constraints to be aware of.

Removing a Control Plane and Worker Node

Let’s say k8s-node3 (our third control plane node) needs to be replaced. First, we remove it from the cluster:

ansible-playbook remove-node.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e node=k8s-node3 \
  -e skip_confirmation=true \
  -e kube_version="1.32.9"

This playbook will:

Drain k8s-node3 — evict all pods (including those scheduled by tolerations)
Remove k8s-node3 from the etcd cluster — runs etcdctl member remove to safely remove it from the quorum
Run kubeadm reset on k8s-node3 — cleans up certificates, static pod manifests, and kubelet state
Delete the Node object from the API

During etcd member removal, the cluster transitions from a 3-member to a 2-member etcd cluster. It is still operational because both remaining members are healthy, but quorum for a 2-member cluster is 2, so you’ve lost fault tolerance: a 2-member etcd cluster cannot tolerate any additional failures. This is why you should minimize the time window between removing and re-adding a control plane node.
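The arithmetic behind this is worth spelling out. etcd needs a majority of its current members to agree, so quorum is floor(n/2) + 1 and the number of failures it can absorb is n minus quorum. A small sketch (function names are ours):

```python
def quorum(members):
    """Majority needed for an etcd cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many members can fail while the cluster keeps quorum."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# 1 members: quorum=1, tolerates 0 failure(s)
# 2 members: quorum=2, tolerates 0 failure(s)
# 3 members: quorum=2, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
```

Note that going from 3 members to 2 does not lower the quorum requirement; it only removes the spare member, which is exactly why the window should be kept short.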

Verify the etcd cluster state after removal:

ssh k8s-node1 etcdctl.sh member list -w table

+------------------+---------+-----------+---------------------------+---------------------------+
|        ID        | STATUS  |   NAME    |        PEER ADDRS         |       CLIENT ADDRS        |
+------------------+---------+-----------+---------------------------+---------------------------+
| 1a2b3c4d5e6f7890 | started | k8s-node1 | https://192.168.10.11:2380| https://192.168.10.11:2379|
| 2b3c4d5e6f789012 | started | k8s-node2 | https://192.168.10.12:2380| https://192.168.10.12:2379|
+------------------+---------+-----------+---------------------------+---------------------------+

Only 2 members remain. Let’s also check the Kubernetes node list:

kubectl get nodes -o wide

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP
k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11
k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12
k8s-node4   Ready    <none>          1d     v1.32.9   192.168.10.14

k8s-node3 is completely gone from both the Kubernetes cluster and the etcd cluster.

Critical Limitation: The first node listed in the [kube_control_plane] group (typically k8s-node1) cannot be removed directly with remove-node.yml. Kubespray and kubeadm treat this node as the "initial control plane", so removing it while it is still listed first can break the cluster. If you need to replace node1, first reorder the inventory so that another healthy control plane node is listed first, re-run cluster.yml against the reordered inventory, and only then remove the old node. Check the Kubespray node operations documentation for any additional manual steps your version requires.

Re-adding a Node as Control Plane (cluster.yml)

Now we’ll add k8s-node3 back, but this time as a fresh control plane member. After reprovisioning the VM (or ensuring the old state has been cleaned up via kubeadm reset), update the inventory.

Critical: New control plane nodes must be added at the END of the [kube_control_plane] group. Kubespray and kubeadm treat the first node in the group as the "initial" control plane. If you insert a new node before existing members, you risk breaking the cluster.

# inventory/mycluster/inventory.ini

[all]
k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11
k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12
k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13
k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14

[kube_control_plane]
k8s-node1
k8s-node2
k8s-node3  # <-- Re-added at the END

[etcd:children]
kube_control_plane

[kube_node]
k8s-node4

[k8s_cluster:children]
kube_control_plane
kube_node
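The "append at the end" rule is simple enough to enforce with a pre-flight check. The sketch below compares the control plane order the cluster is currently running with against the order in the edited inventory; is_append_only is a hypothetical helper, and the lists are the ones from this walkthrough.

```python
def is_append_only(existing, proposed):
    """True when `proposed` keeps `existing` as its unchanged prefix."""
    return proposed[:len(existing)] == existing


existing = ["k8s-node1", "k8s-node2"]              # control planes left after the removal
good = ["k8s-node1", "k8s-node2", "k8s-node3"]     # new node appended at the end
bad = ["k8s-node3", "k8s-node1", "k8s-node2"]      # new node inserted first

print(is_append_only(existing, good))  # prints True
print(is_append_only(existing, bad))   # prints False
```

Running a check like this in CI before cluster.yml catches an accidental reordering of [kube_control_plane] before it reaches the cluster.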

Now run cluster.yml (not scale.yml!) to add the control plane node:

ansible-playbook cluster.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e kube_version="1.32.9"

Why cluster.yml and not scale.yml? The scale.yml playbook only handles worker node additions. Control plane nodes require the full cluster.yml playbook because they need:

etcd member addition and data synchronization
Static pod manifest generation for apiserver, controller-manager, scheduler
Certificate generation and distribution
kubeadm control plane join (different from worker join)

The playbook is idempotent — it will skip tasks for k8s-node1, k8s-node2, and k8s-node4 that are already configured, and only perform meaningful work on k8s-node3. However, running cluster.yml does touch all nodes (unlike scale.yml --limit), so expect it to take longer than a simple worker addition.

etcd Member Changes and NGINX Config Updates

After cluster.yml completes, verify that the control plane is fully restored:

etcd cluster — back to 3 members:

ssh k8s-node1 etcdctl.sh member list -w table

+------------------+---------+-----------+---------------------------+---------------------------+
|        ID        | STATUS  |   NAME    |        PEER ADDRS         |       CLIENT ADDRS        |
+------------------+---------+-----------+---------------------------+---------------------------+
| 1a2b3c4d5e6f7890 | started | k8s-node1 | https://192.168.10.11:2380| https://192.168.10.11:2379|
| 2b3c4d5e6f789012 | started | k8s-node2 | https://192.168.10.12:2380| https://192.168.10.12:2379|
| 3c4d5e6f78901234 | started | k8s-node3 | https://192.168.10.13:2380| https://192.168.10.13:2379|
+------------------+---------+-----------+---------------------------+---------------------------+

The new k8s-node3 etcd member has automatically joined and synchronized data from the existing members.

Check etcd endpoint status:

ssh k8s-node1 etcdctl.sh endpoint status -w table

+---------------------------+------------------+---------+---------+-----------+-----------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+---------------------------+------------------+---------+---------+-----------+-----------+
| https://192.168.10.11:2379| 1a2b3c4d5e6f7890 | 3.5.25  |   25 MB |   false   |        4  |
| https://192.168.10.12:2379| 2b3c4d5e6f789012 | 3.5.25  |   25 MB |    true   |        4  |
| https://192.168.10.13:2379| 3c4d5e6f78901234 | 3.5.25  |   25 MB |   false   |        4  |
+---------------------------+------------------+---------+---------+-----------+-----------+

All three endpoints respond with the same DB size and Raft term, and exactly one of them (k8s-node2) is the leader.
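If you capture this table in a script, two quick sanity checks are that exactly one endpoint reports itself as leader and that all members agree on the Raft term. The sketch below parses the table layout shown above; the helper names are hypothetical, and it assumes the column headers appear exactly as printed here.

```python
def parse_table(text):
    """Turn an etcdctl '-w table' listing into a list of {column: value} rows."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def leaders(rows):
    """Endpoints that report IS LEADER = true."""
    return [row["ENDPOINT"] for row in rows if row["IS LEADER"] == "true"]


STATUS = """\
+---------------------------+------------------+---------+---------+-----------+-----------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+---------------------------+------------------+---------+---------+-----------+-----------+
| https://192.168.10.11:2379| 1a2b3c4d5e6f7890 | 3.5.25  |   25 MB |   false   |        4  |
| https://192.168.10.12:2379| 2b3c4d5e6f789012 | 3.5.25  |   25 MB |    true   |        4  |
| https://192.168.10.13:2379| 3c4d5e6f78901234 | 3.5.25  |   25 MB |   false   |        4  |
+---------------------------+------------------+---------+---------+-----------+-----------+
"""

rows = parse_table(STATUS)
print(leaders(rows))                       # prints ['https://192.168.10.12:2379']
print({row["RAFT TERM"] for row in rows})  # prints {'4'}
```

Zero leaders or more than one distinct Raft term usually means an election is in progress, so re-run the status command after a few seconds before drawing conclusions.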

Kubernetes nodes — 3 control planes restored:

kubectl get nodes -o wide

NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP
k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11
k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12
k8s-node3   Ready    control-plane   60s    v1.32.9   192.168.10.13
k8s-node4   Ready    <none>          1d     v1.32.9   192.168.10.14

NGINX proxy configuration on workers (automatic update):

If you’re using client-side load balancing (the default), Kubespray automatically updates the nginx.conf on all worker nodes to include the new control plane node. You can verify this by checking the nginx configuration on k8s-node4:

ssh k8s-node4 cat /etc/nginx/nginx.conf

stream {
    upstream kube_apiserver {
        least_conn;
        server 192.168.10.11:6443;
        server 192.168.10.12:6443;
        server 192.168.10.13:6443;  # <-- Restored
    }

    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
        proxy_timeout 10m;
        proxy_connect_timeout 1s;
    }
}

All three control plane IPs are present in the upstream block. The nginx-proxy static pod on k8s-node4 will automatically reload with this updated configuration, ensuring the worker's API requests are distributed across all three control plane nodes.
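To confirm this on every worker without reading each file by hand, you can extract the upstream servers and compare them with the control plane list from the inventory. A minimal sketch, assuming the nginx.conf layout shown above (a single kube_apiserver upstream block with no nested braces); the function name is hypothetical.

```python
import re


def upstream_servers(nginx_conf, upstream="kube_apiserver"):
    """Return the host:port entries of the named upstream block."""
    block = re.search(
        r"upstream\s+" + re.escape(upstream) + r"\s*\{(.*?)\}", nginx_conf, re.S
    )
    if block is None:
        return []
    return re.findall(r"^\s*server\s+([^\s;]+)", block.group(1), re.M)


NGINX_CONF = """\
stream {
    upstream kube_apiserver {
        least_conn;
        server 192.168.10.11:6443;
        server 192.168.10.12:6443;
        server 192.168.10.13:6443;  # <-- Restored
    }

    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
    }
}
"""

expected = {"192.168.10.11:6443", "192.168.10.12:6443", "192.168.10.13:6443"}
found = upstream_servers(NGINX_CONF)
print(found)
print("missing:", sorted(expected - set(found)))  # prints missing: []
```

Feeding it the output of ssh k8s-node4 cat /etc/nginx/nginx.conf for each worker gives a quick pass/fail per node.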

Static pods on the new control plane node:

kubectl get pods -n kube-system -o wide | grep k8s-node3

etcd-k8s-node3                      1/1   Running   0   60s   192.168.10.13   k8s-node3
kube-apiserver-k8s-node3            1/1   Running   0   60s   192.168.10.13   k8s-node3
kube-controller-manager-k8s-node3   1/1   Running   0   60s   192.168.10.13   k8s-node3
kube-scheduler-k8s-node3            1/1   Running   0   60s   192.168.10.13   k8s-node3

All four critical static pods are running on the restored control plane node.

If using HAProxy (external LB):

Remember that HAProxy on admin-lb is configured independently of Kubespray. If your HAProxy backend already lists all three control plane IPs in /etc/haproxy/haproxy.cfg, no changes are needed — HAProxy will automatically start routing traffic to the new node once its health check passes:

backend k8s-api
    option  httpchk GET /healthz
    http-check expect status 200
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    server k8s-node1 192.168.10.11:6443 check check-ssl verify none
    server k8s-node2 192.168.10.12:6443 check check-ssl verify none
    server k8s-node3 192.168.10.13:6443 check check-ssl verify none

You can verify HAProxy backend status at http://192.168.10.10:9000/haproxy_stats — all three backends should show as UP (green).

Full Cluster Reset (reset.yml)

When you need to tear down the entire Kubernetes cluster — for rebuilding from scratch, testing a fresh deployment, or decommissioning the infrastructure — Kubespray provides reset.yml. This playbook completely reverses everything that cluster.yml did: it stops all Kubernetes services, removes all binaries, cleans up configuration files, and leaves the nodes in a pre-Kubernetes state.

What reset.yml does on every node:

Drains and deletes all nodes from the Kubernetes API
Stops kubelet and removes the systemd unit
Stops and removes etcd data and binaries (on etcd members)
Runs kubeadm reset — removes certificates, static pod manifests, and kubeconfig files
Removes CNI configuration — cleans up /etc/cni/net.d/ and CNI binaries
Cleans up iptables rules — removes all Kubernetes-related iptables/ipvs rules
Removes container runtime artifacts — stops all running containers, removes containerd state
Deletes configuration directories — /etc/kubernetes/, /var/lib/kubelet/, /var/lib/etcd/, etc.

Execution:

cd /root/kubespray

ansible-playbook reset.yml \
  --become \
  -i inventory/mycluster/inventory.ini \
  -e skip_confirmation=true

Warning: This is a destructive, irreversible operation. All cluster data, including etcd state (and therefore all Kubernetes objects — deployments, services, secrets, configmaps, PVs, etc.) will be permanently deleted. Ensure you have backups of any critical data before running this playbook.

After reset.yml completes, the nodes are clean and ready for a fresh cluster.yml deployment. This makes reset.yml particularly useful in development and testing workflows where you frequently need to iterate on cluster configurations:

# Tear down
ansible-playbook reset.yml --become -i inventory/mycluster/inventory.ini -e skip_confirmation=true

# Rebuild with new settings
ansible-playbook cluster.yml --become -i inventory/mycluster/inventory.ini -e kube_version="1.32.9"

Verification after reset:

SSH into any node and confirm that Kubernetes components are gone:

ssh k8s-node1

# kubelet should not be running
systemctl status kubelet
# Unit kubelet.service could not be found.

# No Kubernetes directories
ls /etc/kubernetes/
# ls: cannot access '/etc/kubernetes/': No such file or directory

# No etcd data
ls /var/lib/etcd/
# ls: cannot access '/var/lib/etcd/': No such file or directory

# kubectl should fail (no kubeconfig)
kubectl get nodes
# The connection to the server localhost:8080 was refused

The node is now a blank slate, ready for reprovisioning.

Key takeaways

Use scale.yml for workers, cluster.yml for control planes.
Always add new control plane nodes at the end of the [kube_control_plane] group.
The first node in [kube_control_plane] cannot be removed via remove-node.yml.
For unreachable nodes, use reset_nodes=false + allow_ungraceful_removal=true.
PodDisruptionBudgets can block node drains — check PDBs before removing nodes.
Keep your inventory.ini in sync with the actual cluster state after every operation.
Minimize the time window where etcd has fewer than 3 members to maintain fault tolerance.

8. Monitoring Setup

With a fully operational HA Kubernetes cluster — three control plane nodes, two workers, and an external HAProxy load balancer — the next critical step is observability. Without monitoring, you are flying blind: you cannot detect etcd quorum degradation, API server latency spikes, or node resource exhaustion until they become outages.

In this section, we will build a production-grade monitoring stack on top of our lab cluster. The setup consists of three layers:

Persistent storage — An NFS-based dynamic provisioner so that Prometheus and Grafana retain their data across pod restarts.
kube-prometheus-stack — A Helm-based deployment that bundles Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics into a single, cohesive package.
etcd metrics — Dedicated scrape configuration so that Prometheus can collect metrics directly from the etcd cluster running on the control plane nodes.

By the end of this section, you will have Prometheus scraping the Kubernetes API server, kubelet, node-exporter, HAProxy, and etcd, with Grafana dashboards providing real-time visibility into every layer of the cluster.

NFS Subdir External Provisioner for Persistent Storage

Why We Need a StorageClass

Both Prometheus and Grafana require persistent volumes. Prometheus stores its time-series database (TSDB) on disk, and Grafana persists dashboards, data sources, and user sessions. Without a StorageClass that supports dynamic provisioning, every PersistentVolumeClaim (PVC) would require a cluster administrator to manually create a matching PersistentVolume (PV), which is impractical for a monitoring stack that may create multiple PVCs during installation.

In our lab environment, the admin-lb node (192.168.10.10) already runs an NFS server exporting /srv/nfs/share. This was configured during the initial bootstrap via the admin-lb.sh script. We will leverage this NFS export as the backing storage for all dynamic PVCs.

Installing the NFS Subdir External Provisioner

The nfs-subdir-external-provisioner is a Kubernetes controller that watches for new PVCs and automatically creates subdirectories on the NFS share as PVs. Each PVC gets its own directory named ${namespace}-${pvcName}-${pvName}, making it easy to identify which data belongs to which workload.

First, add the Helm repository and update:

helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

Then install the provisioner, pointing it at our NFS server:

helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=192.168.10.10 \
  --set nfs.path=/srv/nfs/share \
  --set storageClass.defaultClass=true

Verifying the Provisioner

After installation, confirm the StorageClass is created and set as the default:

kubectl get storageclass

Expected output:

NAME                   PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client (default)   cluster.local/nfs-provisioner-nfs-subdir-...    Delete          Immediate           true                   30s

The (default) annotation confirms that any PVC without an explicit storage class will be fulfilled by this provisioner. You can also verify the provisioner pod is running:

kubectl get pods -l app=nfs-subdir-external-provisioner

NAME                                                  READY   STATUS    RESTARTS   AGE
nfs-provisioner-nfs-subdir-external-provisioner-xxx   1/1     Running   0          45s

At this point, any component that creates a PVC — including Prometheus and Grafana — will automatically get a dynamically provisioned NFS-backed persistent volume.
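Before installing the monitoring stack, a throwaway PVC can confirm that dynamic provisioning really works. The following is a minimal sketch that only renders a manifest. The function name render_test_pvc, the claim name nfs-smoke-test, and the 1Mi size are illustrative assumptions; nfs-client is the StorageClass name shown in the output above.

```shell
#!/usr/bin/env bash
# Render a minimal PVC manifest for a provisioning smoke test.
# Arguments (hypothetical defaults): claim name, storage class, size.
render_test_pvc() {
  local name="${1:-nfs-smoke-test}"
  local class="${2:-nfs-client}"
  local size="${3:-1Mi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  storageClassName: ${class}
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage (needs cluster access, so it is left commented out):
#   render_test_pvc | kubectl apply -f -
#   kubectl get pvc nfs-smoke-test      # STATUS should become Bound
#   kubectl delete pvc nfs-smoke-test
```

If the claim stays Pending, the provisioner pod logs are usually the first place to look.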

Installing kube-prometheus-stack with Helm

The kube-prometheus-stack Helm chart is the de facto standard for deploying a complete monitoring solution on Kubernetes. A single helm install gives you:

Prometheus — Time-series database and scraping engine
Grafana — Visualization and dashboarding platform
Alertmanager — Alert routing and deduplication
node-exporter — Host-level metrics (CPU, memory, disk, network) from every node
kube-state-metrics — Kubernetes object-level metrics (deployments, pods, nodes, PVCs)
Pre-built Grafana dashboards — Dozens of dashboards for Kubernetes, node, and Prometheus self-monitoring
Pre-built PrometheusRules — Alert rules for common failure conditions

Adding the Helm Repository

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

Custom Values (Prometheus, Grafana, Node Exporter)

Rather than installing with default settings, we create a custom values file to tailor the stack to our lab environment. The key customizations are:

NodePort services for Prometheus and Grafana — so we can access them from the host machine without an Ingress controller.
Additional scrape configurations — to collect metrics from HAProxy and etcd, which are not Kubernetes-native workloads.
Grafana admin password — set explicitly for reproducibility.
Node exporter tolerations — to ensure node-exporter runs on control plane nodes as well.

Create the values file:

cat <<'EOF' > ~/kubespray/custom-values.yaml
# ============================================================
# Prometheus Configuration
# ============================================================
prometheus:
  prometheusSpec:
    # Prometheus is exposed via the NodePort service defined at the end of this block
    # Access URL: http://<node-ip>:30001
    # Discover ServiceMonitors and PodMonitors from all releases, not only this chart
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

    # Persistent storage for TSDB
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

    # Additional scrape configs for non-k8s targets
    additionalScrapeConfigs:
      # --- HAProxy Metrics ---
      # HAProxy exposes a Prometheus-compatible metrics endpoint
      # on port 8405 (configured in haproxy.cfg via the prometheus-exporter service)
      - job_name: "haproxy"
        static_configs:
          - targets:
              - "192.168.10.10:8405"
        metrics_path: "/metrics"

      # --- etcd Metrics ---
      # etcd exposes metrics on port 2381 (configured via etcd_listen_metrics_urls)
      # We scrape all 3 etcd members individually
      - job_name: "etcd"
        static_configs:
          - targets:
              - "192.168.10.11:2381"
              - "192.168.10.12:2381"
              - "192.168.10.13:2381"
        metrics_path: "/metrics"

  service:
    type: NodePort
    nodePort: 30001

# ============================================================
# Grafana Configuration
# ============================================================
grafana:
  adminPassword: "prom-operator"

  service:
    type: NodePort
    nodePort: 30002

  # Persistent storage for Grafana dashboards and settings
  persistence:
    enabled: true
    size: 5Gi

  # Sidecar configuration: automatically imports ConfigMaps
  # with label grafana_dashboard="1" as dashboards
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      labelValue: "1"
      searchNamespace: ALL

# ============================================================
# Node Exporter Configuration
# ============================================================
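# Tolerations are read by the prometheus-node-exporter subchart, so they
# belong under that key (nodeExporter only toggles the component on or off)
prometheus-node-exporter:
  # Ensure node-exporter runs on ALL nodes, including control planes
  tolerations:
    - effect: NoSchedule
      operator: Exists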

# ============================================================
# Alertmanager Configuration
# ============================================================
alertmanager:
  service:
    type: NodePort
    nodePort: 30003
EOF

Let us walk through each section in detail.

Prometheus Additional Scrape Configs

The additionalScrapeConfigs field is the mechanism for adding scrape targets that are not discovered via Kubernetes service discovery. In our setup, two external targets need explicit configuration:

HAProxy metrics (192.168.10.10:8405)

Our HAProxy configuration on the admin-lb node includes a Prometheus metrics endpoint. The relevant section in /etc/haproxy/haproxy.cfg is:

frontend stats
    bind *:8405
    http-request use-service prometheus-exporter if { path /metrics }

This exposes standard HAProxy metrics such as haproxy_frontend_current_sessions, haproxy_backend_up, haproxy_server_bytes_in_total, etc. These metrics are critical for monitoring the external load balancer that fronts our Kubernetes API servers.

etcd metrics (192.168.10.11-13:2381)

etcd does not expose a plain-HTTP metrics endpoint by default. We will enable one in the "Enabling etcd Metrics Collection" section below. Once enabled, each etcd member exposes metrics on port 2381, including etcd_server_has_leader, etcd_disk_wal_fsync_duration_seconds, etcd_network_peer_round_trip_time_seconds, and etcd_mvcc_db_total_size_in_bytes.

Service Types and NodePorts

In a lab environment without an Ingress controller or cloud load balancer, NodePort is the simplest way to expose services externally. The fixed ports make access predictable:

Service        NodePort   Access URL
Prometheus     30001      http://192.168.10.14:30001
Grafana        30002      http://192.168.10.14:30002
Alertmanager   30003      http://192.168.10.14:30003

You can use any node IP in the cluster (worker or control plane). The NodePort is accessible on every node regardless of where the pod is actually scheduled.
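When picking fixed ports, keep in mind that Kubernetes only accepts NodePorts inside the configured range, which is 30000-32767 unless the API server was started with a different --service-node-port-range. The helpers below are a small sketch for sanity-checking a port and building the access URL; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if the port lies in the default NodePort range 30000-32767.
is_valid_nodeport() {
  local port="$1"
  case "$port" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
}

# Print the access URL for a NodePort service on a given node IP.
nodeport_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}

# Usage:
#   is_valid_nodeport 30002 && nodeport_url 192.168.10.14 30002
```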

Grafana Sidecar Dashboard Loader

The sidecar.dashboards configuration tells Grafana to watch for ConfigMaps across all namespaces (searchNamespace: ALL) that carry the label grafana_dashboard: "1". When such a ConfigMap is created, the Grafana sidecar container automatically loads the JSON dashboard it contains, with no manual import required. We will use this mechanism in the "Adding Grafana Dashboards via ConfigMap" section below to add custom dashboards.

Installing the Stack

With the values file ready, install the chart:

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 80.13.3 \
  -f ~/kubespray/custom-values.yaml

We pin the chart version to 80.13.3 for reproducibility. The --create-namespace flag creates the monitoring namespace if it does not already exist.

The installation takes approximately 2–3 minutes. Monitor the progress:

kubectl -n monitoring get pods -w

Wait until all pods reach Running or Completed status:

NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0            2/2     Running   0          2m
kube-prometheus-stack-grafana-xxxxxxxxx-xxxxx                3/3     Running   0          2m
kube-prometheus-stack-kube-state-metrics-xxxxxxxxx-xxxxx     1/1     Running   0          2m
kube-prometheus-stack-operator-xxxxxxxxx-xxxxx               1/1     Running   0          2m
kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m
kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m
kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m
kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m
prometheus-kube-prometheus-stack-prometheus-0                2/2     Running   0          2m

Note that prometheus-node-exporter pods appear on every node (including control planes, thanks to our toleration). Verify services:

kubectl -n monitoring get svc

Confirm the NodePort assignments match our values file (30001 for Prometheus, 30002 for Grafana, 30003 for Alertmanager).

Accessing the Monitoring UIs

Prometheus — Open http://192.168.10.14:30001 in your browser.

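Navigate to Status → Targets to verify all scrape targets are healthy. You should see targets covering the components below (the exact names shown in the UI vary with the chart version; ServiceMonitor-based targets are typically listed as serviceMonitor/<namespace>/<name>/<index>):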

kubernetes-apiservers — the Kubernetes API server metrics endpoint
kubernetes-nodes — kubelet metrics from each node
kubernetes-nodes-cadvisor — container-level resource metrics
node-exporter — host-level metrics from each node
kube-state-metrics — Kubernetes object metrics
haproxy — HAProxy load balancer metrics (from additionalScrapeConfigs)
etcd — will appear as DOWN initially (we enable this in Section 8.3)

Grafana — Open http://192.168.10.14:30002 in your browser.

Username: admin
Password: prom-operator

The kube-prometheus-stack chart automatically provisions dozens of dashboards. Go to Dashboards → Browse to explore them. Key pre-installed dashboards include:

Kubernetes / Compute Resources / Cluster — cluster-wide CPU and memory utilization
Kubernetes / Compute Resources / Node (Pods) — per-node resource breakdown
Node Exporter / Nodes — detailed host metrics
Kubernetes / Networking / Cluster — network throughput and errors
Prometheus / Overview — Prometheus self-monitoring (scrape duration, target health, TSDB size)

Adding Grafana Dashboards via ConfigMap

While the kube-prometheus-stack ships with many useful dashboards, there are several community dashboards that provide deeper or more specialized visibility. We will add three additional dashboards:

Dashboard ID   Name                    Purpose
12693          Kubernetes Monitoring   Comprehensive cluster overview with resource usage trends
15661          Node Exporter Full      Extremely detailed host metrics (disk I/O, network sockets, CPU frequency, thermal zones)
Custom         k8s-system-api-server   Kubernetes API Server performance (request latency, error rates, inflight requests)

The Grafana sidecar we configured earlier watches for ConfigMaps with the label grafana_dashboard: "1". To add a dashboard, we simply create a ConfigMap containing the dashboard JSON and apply the label.

Downloading Community Dashboards

Grafana community dashboards can be downloaded from grafana.com/grafana/dashboards as JSON files. The download URL pattern is:

https://grafana.com/api/dashboards/{DASHBOARD_ID}/revisions/latest/download

Download the two community dashboards:

# Dashboard 12693 - Kubernetes Monitoring
curl -fsSL \
  https://grafana.com/api/dashboards/12693/revisions/latest/download \
  -o /tmp/dashboard-12693.json

# Dashboard 15661 - Node Exporter Full
curl -fsSL \
  https://grafana.com/api/dashboards/15661/revisions/latest/download \
  -o /tmp/dashboard-15661.json

Creating ConfigMaps from Dashboard JSON

For each dashboard, create a ConfigMap in the monitoring namespace with the appropriate label:

# Dashboard 12693
kubectl create configmap grafana-dashboard-12693 \
  --from-file=k8s-monitoring.json=/tmp/dashboard-12693.json \
  -n monitoring

kubectl label configmap grafana-dashboard-12693 \
  grafana_dashboard="1" \
  -n monitoring

# Dashboard 15661
kubectl create configmap grafana-dashboard-15661 \
  --from-file=node-exporter-full.json=/tmp/dashboard-15661.json \
  -n monitoring

kubectl label configmap grafana-dashboard-15661 \
  grafana_dashboard="1" \
  -n monitoring
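The download, create, and label steps above repeat for every dashboard, so they can be generated from the dashboard ID. The sketch below only prints the commands for review and does not run them. The function names dashboard_url and dashboard_commands are hypothetical; the URL pattern, ConfigMap naming, and label are the ones used above.

```shell
#!/usr/bin/env bash
# Build the download URL for a grafana.com dashboard ID.
dashboard_url() {
  printf 'https://grafana.com/api/dashboards/%s/revisions/latest/download\n' "$1"
}

# Print (not run) the commands that fetch one dashboard and publish it as a
# labelled ConfigMap. Arguments: dashboard ID, file name inside the ConfigMap.
dashboard_commands() {
  local id="$1" key="$2"
  local json="/tmp/dashboard-${id}.json"
  echo "curl -fsSL $(dashboard_url "$id") -o ${json}"
  echo "kubectl -n monitoring create configmap grafana-dashboard-${id} --from-file=${key}=${json}"
  echo "kubectl -n monitoring label configmap grafana-dashboard-${id} grafana_dashboard=1"
}

# Usage: review the output first, then pipe it to a shell to execute.
#   dashboard_commands 12693 k8s-monitoring.json
#   dashboard_commands 15661 node-exporter-full.json | bash
```

Printing instead of executing keeps the helper safe to run anywhere and makes it easy to paste the result into a runbook.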

Creating the API Server Dashboard via YAML

For the custom API Server dashboard, we define it directly as a YAML manifest. This dashboard focuses on Kubernetes API Server performance — a critical metric for HA clusters where API server latency directly impacts cluster operations.

cat <<'EOF' > ~/kubespray/grafana-dashboard-apiserver.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-apiserver
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  k8s-system-api-server.json: |-
    {
      "annotations": {
        "list": []
      },
      "description": "Kubernetes API Server Monitoring",
      "editable": true,
      "gnetId": null,
      "graphTooltip": 1,
      "links": [],
      "panels": [
        {
          "title": "API Server Request Rate",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(apiserver_request_total[5m])) by (verb)",
              "legendFormat": "{{ verb }}"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
        },
        {
          "title": "API Server Request Latency (99th percentile)",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))",
              "legendFormat": "{{ verb }}"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
        },
        {
          "title": "API Server Error Rate",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(apiserver_request_total{code=~\"5..\"}[5m])) by (resource)",
              "legendFormat": "{{ resource }}"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
        },
        {
          "title": "API Server Inflight Requests",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(apiserver_current_inflight_requests) by (request_kind)",
              "legendFormat": "{{ request_kind }}"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
        }
      ],
      "schemaVersion": 36,
      "style": "dark",
      "tags": ["kubernetes", "apiserver"],
      "templating": { "list": [] },
      "time": { "from": "now-1h", "to": "now" },
      "title": "Kubernetes / System / API Server",
      "uid": "k8s-system-api-server"
    }
EOF

Apply the manifest:

kubectl apply -f ~/kubespray/grafana-dashboard-apiserver.yaml

Verifying Dashboard Import

The Grafana sidecar detects new ConfigMaps within a few seconds. You do not need to restart Grafana or any pods. Verify by listing the ConfigMaps with the dashboard label:

kubectl -n monitoring get configmap -l grafana_dashboard="1"

Expected output:

NAME                             DATA   AGE
grafana-dashboard-12693          1      2m
grafana-dashboard-15661          1      2m
grafana-dashboard-apiserver      1      30s

Now open Grafana (http://192.168.10.14:30002) and navigate to Dashboards → Browse. You should see the three new dashboards alongside the pre-installed ones. Open each to confirm they are rendering correctly with live data from Prometheus.

The API Server dashboard is particularly valuable for HA monitoring because API server latency directly affects every cluster operation. The panels above aggregate by verb or resource; to check whether latency is consistent across the three control plane nodes, add instance to the by (...) clause of the latency query. Asymmetric latency may indicate a problem with one specific node’s etcd connection or resource exhaustion.

Enabling etcd Metrics Collection

etcd is the most critical stateful component in a Kubernetes cluster. It stores the entire cluster state — every resource definition, every secret, every lease. If etcd fails, the cluster becomes read-only at best and completely non-functional at worst. Monitoring etcd health is therefore non-negotiable for any production cluster.

However, etcd does not expose a dedicated plain-HTTP metrics endpoint by default in Kubespray deployments. We must explicitly enable it.

Configuring etcd_listen_metrics_urls

Kubespray provides dedicated variables to control etcd metrics exposure. We need to set two variables in the etcd group variables file.

Edit the etcd configuration:

vi ~/kubespray/inventory/mycluster/group_vars/etcd.yml

Add (or modify) the following variables:

# Enable etcd metrics endpoint
etcd_metrics: true

# Bind the metrics endpoint to all interfaces on port 2381
# This allows Prometheus (running on worker nodes) to scrape etcd metrics
# from the control plane nodes over the network
etcd_listen_metrics_urls: "http://0.0.0.0:2381"

Let us examine what each variable does:

etcd_metrics: true

This variable is Kubespray-specific. When set to true, Kubespray passes the --listen-metrics-urls flag to each etcd member during configuration. Without this flag, etcd serves /metrics only on its TLS-protected client port (2379), which Prometheus cannot scrape without client certificates.

etcd_listen_metrics_urls: "http://0.0.0.0:2381"

This sets the address and port on which each etcd member will listen for metrics scrape requests. The values are deliberate:

http:// (not https://) — Metrics are served over plain HTTP. This avoids the complexity of configuring Prometheus with etcd client TLS certificates. Since metrics do not contain sensitive data (only operational telemetry), HTTP is acceptable in a lab environment. In production, you might prefer https:// with mutual TLS.
0.0.0.0 — Bind to all network interfaces. This is necessary because Prometheus runs on worker nodes, which communicate with control plane nodes over the pod network or host network. Binding to 127.0.0.1 would restrict access to local-only scraping.
2381 — The conventional port for etcd metrics. The standard etcd client port is 2379, and peer port is 2380, so 2381 follows the natural sequence and avoids conflicts.

Applying the Configuration with Kubespray

Now apply the configuration change using Kubespray’s cluster.yml playbook with the etcd tag, limited to the etcd node group:

cd ~/kubespray

ansible-playbook cluster.yml \
  --tags "etcd" \
  --limit etcd \
  -e kube_version="1.32.9"

This command takes approximately 2 minutes to complete. Here is what happens during the playbook run:

Kubespray detects the configuration change — The etcd_listen_metrics_urls variable modifies the etcd systemd unit file (or static pod manifest, depending on deployment type).
etcd members are restarted one at a time — Kubespray performs a rolling restart of the etcd cluster. Each member is stopped, its configuration is updated, and it is restarted before moving to the next member. This preserves quorum throughout the process.
Automatic backups are created — Before restarting each etcd member, Kubespray takes a snapshot backup of the etcd data. These backups are stored in /var/backups/ on each control plane node:

ssh k8s-node1 ls -la /var/backups/etcd-*
-rw------- 1 root root 3145728 Jul 15 10:30 /var/backups/etcd-snapshot-20250715103000.db

This automatic backup behavior is a valuable safety net — if the etcd restart somehow corrupts data, you have an immediate point-in-time recovery option.
etcd rejoins the cluster — After restart, each member reconnects to the cluster and synchronizes its data with the other members. You can verify cluster health:

ssh k8s-node1 etcdctl.sh member list -w table

+------------------+---------+-----------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
| 8e9e05c52164694d | started | k8s-node1 | https://192.168.10.11:2380 | https://192.168.10.11:2379 |      false |
| 91bc3c398fb3c146 | started | k8s-node2 | https://192.168.10.12:2380 | https://192.168.10.12:2379 |      false |
| fd422379fda50e48 | started | k8s-node3 | https://192.168.10.13:2380 | https://192.168.10.13:2379 |      false |
+------------------+---------+-----------+----------------------------+----------------------------+------------+

All three members should show started status.

Verifying the Metrics Endpoint

After the playbook completes, confirm that each etcd member is now serving metrics:

# Test from the admin-lb node (or any node that can reach the control plane network)
curl -s http://192.168.10.11:2381/metrics | head -20

Expected output (truncated):

# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 2
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 542
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 1203
...

Repeat for all three control plane nodes to confirm consistency:

for ip in 192.168.10.11 192.168.10.12 192.168.10.13; do
  echo "=== $ip ==="
  curl -s http://$ip:2381/metrics | grep etcd_server_has_leader
done

All three should report etcd_server_has_leader 1.
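For scripted checks, the grep above can be turned into a pass/fail test. This is a minimal sketch: the function name has_leader is hypothetical, and it assumes the Prometheus text format shown in the expected output, where the sample line is the metric name followed by its value.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and succeed only if
# etcd_server_has_leader is present with the value 1.
# Comment lines (# HELP, # TYPE) are ignored because their first field is "#".
has_leader() {
  awk '$1 == "etcd_server_has_leader" { found = 1; ok = ($2 == 1) }
       END { if (found && ok) exit 0; exit 1 }'
}

# Usage (needs network access to the control plane nodes):
#   for ip in 192.168.10.11 192.168.10.12 192.168.10.13; do
#     curl -s "http://$ip:2381/metrics" | has_leader || echo "$ip: no leader"
#   done
```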

Adding etcd Scrape Config to Prometheus

We already configured the etcd scrape job as part of additionalScrapeConfigs in our custom values file when installing kube-prometheus-stack. Here is the relevant snippet again for reference:

additionalScrapeConfigs:
  - job_name: "etcd"
    static_configs:
      - targets:
          - "192.168.10.11:2381"
          - "192.168.10.12:2381"
          - "192.168.10.13:2381"
    metrics_path: "/metrics"

Since this was included in the initial Helm install, Prometheus has been attempting to scrape these endpoints from the beginning. Before enabling etcd_listen_metrics_urls, the etcd targets appeared as DOWN in Prometheus. Now that the metrics endpoint is active, Prometheus should automatically pick them up.

Verifying etcd Targets in Prometheus

Open the Prometheus UI at http://192.168.10.14:30001 and navigate to Status → Targets. Look for the etcd job. You should see three targets, all with state UP:

Endpoint                     State    Labels                  Last Scrape    Scrape Duration
http://192.168.10.11:2381    UP       instance="192.168..."   12s ago        23.4ms
http://192.168.10.12:2381    UP       instance="192.168..."   14s ago        21.1ms
http://192.168.10.13:2381    UP       instance="192.168..."   11s ago        22.8ms

If any target shows DOWN, common troubleshooting steps include:

Firewall — Verify that port 2381 is open on the control plane nodes:

ssh k8s-node1 ss -tlnp | grep 2381

Expected: LISTEN 0 4096 *:2381 *:* users:(("etcd",..))

Network connectivity — Test from a worker node (where Prometheus is running):

ssh k8s-node4 curl -s http://192.168.10.11:2381/health

Expected: {"health":"true","reason":""}

etcd configuration — Verify the etcd process has the metrics flag:

ssh k8s-node1 ps aux | grep etcd | grep listen-metrics

You should see --listen-metrics-urls=http://0.0.0.0:2381 in the process arguments.

Sample PromQL Queries

You can run these queries in the Prometheus UI (http://192.168.10.14:30001/graph) to immediately visualize etcd health:

etcd cluster leader status:

etcd_server_has_leader

WAL fsync latency (99th percentile, per member):

histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

Database size across all members:

etcd_mvcc_db_total_size_in_bytes

Peer network round-trip time (99th percentile):

histogram_quantile(0.99,
  rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])
)

Rate of leader changes (should be 0 in a healthy cluster):

rate(etcd_server_leader_changes_seen_total[15m])
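The two histogram_quantile queries above estimate a percentile from cumulative bucket counters rather than from raw observations. To make that arithmetic concrete, here is a simplified sketch of the linear interpolation inside a bucket. The function name bucket_quantile and the bucket counts in the usage example are invented for illustration, and the +Inf bucket and empty histograms are deliberately not handled.

```shell
#!/usr/bin/env bash
# Estimate a quantile from cumulative histogram buckets using linear
# interpolation inside the bucket that contains the target rank.
# stdin: one "upper_bound cumulative_count" pair per line, sorted by bound.
# $1: quantile between 0 and 1.
bucket_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2 }
    END {
      rank = q * cnt[NR]                    # target observation rank
      for (i = 1; i <= NR; i++) {
        if (cnt[i] >= rank) {
          lo    = (i > 1) ? le[i-1]  : 0    # lower bound of this bucket
          below = (i > 1) ? cnt[i-1] : 0    # observations in earlier buckets
          printf "%.4f\n", lo + (le[i] - lo) * (rank - below) / (cnt[i] - below)
          exit
        }
      }
    }'
}

# Usage with made-up counts: 50 fsyncs took <= 1ms, 100 took <= 2ms.
#   printf '0.001 50\n0.002 100\n' | bucket_quantile 0.75   # prints 0.0015
```

The result is only as precise as the bucket boundaries, which is why a p99 that sits in a wide bucket should be read as an estimate.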

Creating an etcd Grafana Dashboard

You can also import a community etcd dashboard into Grafana. Dashboard ID 3070 (etcd by etcd.io) is a popular choice. Download it and create a ConfigMap following the same pattern used for the community dashboards earlier:

# Download the etcd dashboard
curl -fsSL \
  https://grafana.com/api/dashboards/3070/revisions/latest/download \
  -o /tmp/dashboard-etcd-3070.json

# Create ConfigMap with the dashboard label
kubectl create configmap grafana-dashboard-etcd \
  --from-file=etcd.json=/tmp/dashboard-etcd-3070.json \
  -n monitoring

kubectl label configmap grafana-dashboard-etcd \
  grafana_dashboard="1" \
  -n monitoring

After a few seconds, the dashboard will appear in Grafana under Dashboards → Browse. This dashboard provides real-time visibility into:

etcd cluster membership and leader status
Raft proposal commit rates
Disk I/O performance (WAL fsync, backend commit)
gRPC request rates and latencies
Database size and compaction

Complete Target Verification

At this point, your Prometheus instance should have the following target groups all showing UP:

Job                        Targets   Status
─────────────────────────────────────────────
kubernetes-apiservers      3/3       UP
kubernetes-nodes           5/5       UP
kubernetes-nodes-cadvisor  5/5       UP
node-exporter              5/5       UP
kube-state-metrics         1/1       UP
haproxy                    1/1       UP
etcd                       3/3       UP

The HAProxy stats page is also accessible at http://192.168.10.10:9000/haproxy_stats for a quick visual check of API server backend health, independent of Prometheus.

Kubernetes Upgrade with Kubespray

Upgrading Kubernetes is one of the most critical Day-2 operations in any production environment. Unlike initial cluster deployment, upgrades must be performed carefully — a single misconfiguration can lead to API downtime, workload disruption, or even data loss in etcd. Kubespray provides well-structured playbooks that automate the upgrade process while maintaining high availability, but understanding what happens under the hood is essential for operating confidently.

Upgrades fall into three types (patch, minor, and major), each introducing progressively more complexity. A patch upgrade touches only the Kubernetes binaries. A minor upgrade may involve API deprecations and behavioral changes. A major upgrade combined with a Kubespray version bump means new Ansible roles, new default variables, updated container runtime versions, and potentially new etcd releases, all changing simultaneously.

Before proceeding with any upgrade, let’s first address a prerequisite that many operators overlook: the CNI plugin.

Pre-Upgrade: Flannel CNI Plugin Update

Before upgrading Kubernetes itself, it is important to verify that the CNI (Container Network Interface) plugin is compatible with the target Kubernetes version. In our lab, we use Flannel as the CNI. Flannel is deployed as a DaemonSet, which means it runs on every node in the cluster. Unlike node-level components (kubelet, kube-proxy), you cannot upgrade Flannel on a per-node basis — a DaemonSet update rolls out to all nodes at once.

This is a critical consideration: if you upgrade Flannel after upgrading only half of your nodes, you risk running incompatible versions of the CNI across different nodes, which can cause intermittent pod networking failures.

Best practice: Update the Flannel CNI plugin before starting the Kubernetes upgrade, while all nodes are still running the same Kubernetes version.

Checking the Current Flannel Version

# Check the current Flannel DaemonSet image
kubectl -n kube-system get daemonset kube-flannel -o jsonpath='{.spec.template.spec.containers[0].image}'

Output:

docker.io/flannel/flannel:v0.26.7

Updating Flannel via Kubespray

Kubespray manages the Flannel version through its role defaults. To update Flannel, you can either change the variable in your inventory or pass it as an extra variable. The relevant variable is flannel_image_tag in roles/network_plugin/flannel/defaults/main.yml.

However, the simplest approach is to let Kubespray handle it by running the cluster playbook with the network tag. First, check which Flannel version your Kubespray checkout ships with:

cd /root/kubespray

# Check what Flannel version Kubespray v2.29.1 ships with
grep -r "flannel_image_tag" roles/network_plugin/flannel/defaults/main.yml

Output:

flannel_image_tag: "v0.26.7"
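Comparing the running tag with the one Kubespray ships can be scripted. The sketch below assumes GNU sort (for the -V version ordering) and uses two hypothetical helper names, image_tag and version_lt; plain string comparison would get cases such as v0.9.0 versus v0.26.7 wrong.

```shell
#!/usr/bin/env bash
# Extract the tag from an image reference such as docker.io/flannel/flannel:v0.26.7.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Succeed if version $1 sorts strictly before version $2 (GNU sort -V).
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage (the kubectl call needs cluster access):
#   running=$(image_tag "$(kubectl -n kube-system get daemonset kube-flannel \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')")
#   version_lt "$running" "v0.26.7" && echo "Flannel update available"
```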

If the current version is already the latest supported by your Kubespray release, no action is needed. If you need to update it manually:

ansible-playbook cluster.yml \
  --tags "network" \
  --limit "kube_node,kube_control_plane" \
  -e kube_version="1.32.9"

Important: Because Flannel is a DaemonSet, the --limit flag does not restrict which nodes receive the update. The DaemonSet controller will roll out the new image to all nodes regardless. The --limit flag only controls which nodes Ansible connects to for running tasks, but the Kubernetes DaemonSet update propagates cluster-wide.

After the update, verify:

# Verify Flannel pods are running the new version on all nodes
kubectl -n kube-system get pods -l app=flannel -o wide

# Check that all Flannel pods are in Running state
kubectl -n kube-system get pods -l app=flannel -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.nodeName}{"\n"}{end}'
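The tab-separated output of the second command is easy to filter. As a small sketch, the hypothetical helper below prints only the pods that are not in the Running phase and fails if it finds any, which makes it usable as a gate before starting the Kubernetes upgrade.

```shell
#!/usr/bin/env bash
# Read "name<TAB>phase<TAB>node" lines on stdin (the jsonpath output above)
# and print every pod whose phase is not Running.
# Exit status: 0 when all pods are Running, 1 otherwise.
not_running() {
  awk -F '\t' '$2 != "Running" { print; bad = 1 } END { exit bad + 0 }'
}

# Usage (needs cluster access):
#   kubectl -n kube-system get pods -l app=flannel \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.nodeName}{"\n"}{end}' \
#     | not_running && echo "all Flannel pods are Running"
```

Note that empty input also counts as success, so confirm separately that the label selector actually matches pods.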

Kubespray Upgrade Strategies (Unsafe vs. Graceful)

Kubespray provides two fundamentally different approaches to cluster upgrades. Understanding the difference is critical for choosing the right strategy based on your environment’s tolerance for risk and downtime.

Strategy 1: Unsafe Upgrade (cluster.yml)

The cluster.yml playbook is the same playbook used for initial cluster deployment. When used for upgrades, it applies changes to all nodes simultaneously without draining workloads first. This is called an "unsafe" upgrade because:

Nodes are not cordoned before the upgrade
Pods are not drained — they continue running during the upgrade
If a kubelet restart kills running pods, they are simply restarted by their controllers
All nodes of the same type (control plane or worker) may be upgraded in parallel

To use cluster.yml for upgrades, you must set upgrade_cluster_setup: true:

ansible-playbook cluster.yml \
  -e kube_version="1.32.10" \
  -e upgrade_cluster_setup=true

When to use: Development environments, non-production clusters, or situations where speed matters more than zero-downtime guarantees.

Strategy 2: Graceful Upgrade (upgrade-cluster.yml)

The upgrade-cluster.yml playbook is purpose-built for production upgrades. It performs a rolling upgrade, processing one node at a time (or a configurable batch size), with proper workload migration:

Cordon the node (mark it as unschedulable)
Drain the node (gracefully evict all pods, respecting PodDisruptionBudgets)
Upgrade the node (update containerd, kubelet, kube-proxy, static pod manifests)
Uncordon the node (mark it as schedulable again)
Repeat for the next node

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10"

Controlling the Rolling Upgrade Behavior

Several variables control how the rolling upgrade proceeds:

Serial Execution

The serial parameter in the playbook controls how many nodes are upgraded simultaneously:

# Default: 20% of nodes at a time
serial: "20%"

# Conservative: one node at a time
serial: 1

# Aggressive: all nodes at once (effectively unsafe)
serial: "100%"

For production environments, serial: 1 is the safest option. It ensures that at any point during the upgrade, at most one node is unavailable.
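To get a feel for what a given serial value means on a small cluster, the arithmetic can be sketched in shell. The helper name serial_batch_size is hypothetical, and the rounding rule (percentages round down, with a floor of one host) is my understanding of Ansible's behavior rather than something Kubespray documents.

```shell
# serial_batch_size HOSTS SERIAL
# Prints how many hosts would be placed in each batch.
# SERIAL is either an integer ("1") or a percentage ("20%").
# Assumption: percentages round down, with a minimum of one host.
serial_batch_size() {
  hosts=$1
  serial=$2
  case $serial in
    *%)
      pct=${serial%\%}
      size=$(( hosts * pct / 100 ))
      ;;
    *)
      size=$serial
      ;;
  esac
  # Never fewer than one host, never more than the hosts available.
  if [ "$size" -lt 1 ]; then size=1; fi
  if [ "$size" -gt "$hosts" ]; then size=$hosts; fi
  echo "$size"
}
```

For example, serial_batch_size 2 20% and serial_batch_size 5 20% both resolve to one node per batch, while serial_batch_size 10 20% resolves to two.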

Upgrade Confirmation Prompt

If you want manual confirmation before each node upgrade:

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10" \
  -e upgrade_node_confirm=true

This pauses the playbook before each node and waits for you to press Enter. This is useful for verifying cluster health between node upgrades.

Timed Pause Between Nodes

Alternatively, you can set an automatic pause between node upgrades:

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10" \
  -e upgrade_node_pause_seconds=60

This gives the cluster 60 seconds to stabilize after each node upgrade before proceeding to the next one.
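Whether you pause manually or on a timer, it helps to have a quick check for "has the cluster settled?". The following is a minimal sketch; the function names are hypothetical, and it only looks at the STATUS column of kubectl get nodes, so it says nothing about pod health.

```shell
# nodes_not_ready: reads `kubectl get nodes --no-headers` output on stdin and
# prints the name and status of every node whose STATUS is not exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is reported as well.
nodes_not_ready() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# all_nodes_ready: succeeds only when nodes_not_ready reports nothing.
all_nodes_ready() {
  [ -z "$(nodes_not_ready)" ]
}
```

A typical use from a second terminal would be kubectl get nodes --no-headers | all_nodes_ready || echo "cluster has not settled yet".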

PodDisruptionBudget (PDB) Considerations

During the drain phase, Kubernetes respects PodDisruptionBudgets. If a workload has a PDB with maxUnavailable: 0, every eviction request for its pods is refused, so the drain cannot complete: it keeps retrying until it times out or the PDB is relaxed. This is a common source of upgrade failures.

Before starting an upgrade, audit your PDBs:

# List all PDBs and their disruption settings
kubectl get pdb --all-namespaces -o wide

# Check for PDBs that might block drain
kubectl get pdb --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: maxUnavailable={.spec.maxUnavailable}, disruptionsAllowed={.status.disruptionsAllowed}{"\n"}{end}'

If you find PDBs with disruptionsAllowed: 0, either temporarily adjust them or ensure the corresponding workloads have enough replicas to tolerate one pod being evicted.
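On clusters with many PDBs, the audit output is easier to act on when it is filtered down to the entries that would block a drain right now. This is a small sketch that assumes the exact line format produced by the jsonpath command above; blocking_pdbs is a hypothetical helper name.

```shell
# blocking_pdbs: reads the jsonpath audit output on stdin, one
# "namespace/name: maxUnavailable=..., disruptionsAllowed=N" line per PDB,
# and prints the namespace/name of each PDB that currently allows 0 disruptions.
blocking_pdbs() {
  awk '/disruptionsAllowed=0$/ { sub(/:$/, "", $1); print $1 }'
}
```

Pipe the second audit command into blocking_pdbs; an empty result means no PDB is currently at zero allowed disruptions.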

Patch Upgrade: v1.32.9 → v1.32.10

A patch upgrade is the simplest type of Kubernetes upgrade. It only includes bug fixes and security patches — no new features, no API changes, no deprecations. The Kubespray version remains the same (v2.29.1), and no supporting tools (etcd, containerd) need to change.

Despite its simplicity, a patch upgrade is the perfect opportunity to validate your upgrade procedures and tooling before attempting more complex minor or major upgrades.

Pre-Upgrade Cluster State

Before starting, document the current state:

# Current node versions
kubectl get nodes -o wide

NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP      OS-IMAGE                       KERNEL-VERSION
k8s-node1   Ready    control-plane   1d    v1.32.9   192.168.10.11    Rocky Linux 10.0 (Obsidian)    6.12.x
k8s-node2   Ready    control-plane   1d    v1.32.9   192.168.10.12    Rocky Linux 10.0 (Obsidian)    6.12.x
k8s-node3   Ready    control-plane   1d    v1.32.9   192.168.10.13    Rocky Linux 10.0 (Obsidian)    6.12.x
k8s-node4   Ready    <none>          1d    v1.32.9   192.168.10.14    Rocky Linux 10.0 (Obsidian)    6.12.x
k8s-node5   Ready    <none>          1d    v1.32.9   192.168.10.15    Rocky Linux 10.0 (Obsidian)    6.12.x

# Current etcd cluster health
ssh k8s-node1 etcdctl.sh endpoint status -w table

# Current component versions (static pods on control plane)
kubectl -n kube-system get pods -l tier=control-plane -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

kube-apiserver-k8s-node1            registry.k8s.io/kube-apiserver:v1.32.9
kube-apiserver-k8s-node2            registry.k8s.io/kube-apiserver:v1.32.9
kube-apiserver-k8s-node3            registry.k8s.io/kube-apiserver:v1.32.9
kube-controller-manager-k8s-node1   registry.k8s.io/kube-controller-manager:v1.32.9
kube-controller-manager-k8s-node2   registry.k8s.io/kube-controller-manager:v1.32.9
kube-controller-manager-k8s-node3   registry.k8s.io/kube-controller-manager:v1.32.9
kube-scheduler-k8s-node1            registry.k8s.io/kube-scheduler:v1.32.9
kube-scheduler-k8s-node2            registry.k8s.io/kube-scheduler:v1.32.9
kube-scheduler-k8s-node3            registry.k8s.io/kube-scheduler:v1.32.9

# Current kube-proxy version
kubectl -n kube-system get daemonset kube-proxy -o jsonpath='{.spec.template.spec.containers[0].image}'

registry.k8s.io/kube-proxy:v1.32.9

Control Plane Rolling Upgrade

We start the upgrade with the control plane nodes and etcd. This is always done first because worker nodes depend on the API server, and the Kubernetes version skew policy allows a kubelet to be older than the API server but never newer: a newer control plane can manage older worker nodes, but not vice versa.

cd /root/kubespray

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10" \
  --limit "kube_control_plane:etcd"

This command takes approximately 14 minutes for three control plane nodes.

What Happens During the Control Plane Upgrade

The upgrade-cluster.yml playbook runs in the phases below. Phase 1 covers every node in scope up front; the remaining phases walk through the control plane nodes one at a time:

Phase 1: Pre-upgrade Downloads

Before touching any node, Kubespray downloads all required container images and binaries to every node in the --limit scope. This ensures that the actual upgrade step (which involves restarting components) completes as quickly as possible.

TASK [download : Download containers if pull is required or told to always pull] ****

The images downloaded for a patch upgrade:

registry.k8s.io/kube-apiserver:v1.32.10
registry.k8s.io/kube-controller-manager:v1.32.10
registry.k8s.io/kube-scheduler:v1.32.10
registry.k8s.io/kube-proxy:v1.32.10

Images that do not change in a patch upgrade (and are therefore skipped):

registry.k8s.io/coredns/coredns:v1.12.0         # CoreDNS version unchanged
registry.k8s.io/pause:3.10                        # Pause container unchanged
quay.io/coreos/etcd:v3.5.25                       # etcd version unchanged

Phase 2: Rolling Upgrade per Node (k8s-node1 first)

TASK [kubernetes/control-plane : Kubeadm | Cordon node] *************************
changed: [k8s-node1]

TASK [kubernetes/control-plane : Kubeadm | Drain node] **************************
changed: [k8s-node1]

At this point, k8s-node1 is cordoned (no new pods will be scheduled) and drained (existing pods are gracefully evicted). Since this is a control plane node, only the static pods (apiserver, controller-manager, scheduler) and DaemonSet pods remain — they cannot be evicted.

TASK [container-engine/containerd : Containerd | Ensure containerd is installed] ***
ok: [k8s-node1]

For a patch upgrade, the containerd version typically does not change. Kubespray verifies the installed version matches the expected version and skips reinstallation if they match.

TASK [kubernetes/control-plane : Kubeadm | Upgrade first control plane] *********
changed: [k8s-node1]

This is the core step. Kubespray runs kubeadm upgrade apply v1.32.10 on the first control plane node. This command:

Validates the upgrade path (v1.32.9 → v1.32.10 is allowed)
Downloads any missing component images
Updates the static pod manifests in /etc/kubernetes/manifests/
Relies on the kubelet to detect the manifest changes and restart the static pods
Waits for the new API server to become healthy

For subsequent control plane nodes (k8s-node2, k8s-node3), kubeadm upgrade node is used instead of kubeadm upgrade apply:

TASK [kubernetes/control-plane : Kubeadm | Upgrade subsequent control planes] ***
changed: [k8s-node2]
...
changed: [k8s-node3]

The difference:

kubeadm upgrade apply — updates the cluster-level configuration and upgrades the first node
kubeadm upgrade node — upgrades a single node using the cluster-level configuration already applied

Phase 3: kube-proxy DaemonSet Update

After the first control plane node is upgraded, Kubespray updates the kube-proxy DaemonSet image. Since kube-proxy is a DaemonSet, this update propagates to all nodes in the cluster, including worker nodes that have not been upgraded yet:

TASK [kubernetes/control-plane : Kubeadm | Update kube-proxy DaemonSet] *********
changed: [k8s-node1]

This is safe because kube-proxy v1.32.10 is backward-compatible with kubelet v1.32.9 running on the worker nodes.

Phase 4: Uncordon

TASK [kubernetes/control-plane : Kubeadm | Uncordon node] ***********************
changed: [k8s-node1]

The node is marked schedulable again, and the playbook moves to the next control plane node.

Monitoring the Upgrade in Real-Time

While the upgrade is running, you can monitor progress from another terminal:

# Watch node versions change in real-time
watch -n 2 'kubectl get nodes -o wide'

# Watch static pod restarts
watch -n 2 'kubectl -n kube-system get pods -l tier=control-plane -o wide'

# Monitor etcd cluster health (should remain healthy throughout)
ssh k8s-node1 etcdctl.sh endpoint status -w table

# Check API server availability through the load balancer
while true; do curl -sk -o /dev/null -w "%{http_code}\n" https://192.168.10.10:6443/healthz; sleep 1; done

During the rolling upgrade, you will briefly see mixed versions:

NAME        STATUS                     ROLES           VERSION
k8s-node1   Ready                      control-plane   v1.32.10   ← upgraded
k8s-node2   Ready,SchedulingDisabled   control-plane   v1.32.9    ← being upgraded
k8s-node3   Ready                      control-plane   v1.32.9    ← waiting
k8s-node4   Ready                      <none>          v1.32.9
k8s-node5   Ready                      <none>          v1.32.9

The HAProxy load balancer on admin-lb (192.168.10.10) continues to route traffic to the two healthy API servers while one is being upgraded. The API remains available throughout.

Post-Control-Plane-Upgrade Verification

After the control plane upgrade completes:

# All control plane nodes should show v1.32.10
kubectl get nodes -o wide

NAME        STATUS   ROLES           VERSION
k8s-node1   Ready    control-plane   v1.32.10
k8s-node2   Ready    control-plane   v1.32.10
k8s-node3   Ready    control-plane   v1.32.10
k8s-node4   Ready    <none>          v1.32.9    ← workers still on old version
k8s-node5   Ready    <none>          v1.32.9

# Verify static pod images
kubectl -n kube-system get pods -l tier=control-plane -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

All control plane static pods should now show v1.32.10.

# Verify kube-proxy was updated on ALL nodes (including workers)
kubectl -n kube-system get daemonset kube-proxy -o jsonpath='{.spec.template.spec.containers[0].image}'

registry.k8s.io/kube-proxy:v1.32.10

# Verify etcd cluster health (etcd version unchanged for patch upgrade)
ssh k8s-node1 etcdctl.sh member list -w table
ssh k8s-node1 etcdctl.sh endpoint status -w table

# Verify API server responds correctly through all endpoints
curl -sk https://192.168.10.11:6443/version | jq .gitVersion
curl -sk https://192.168.10.12:6443/version | jq .gitVersion
curl -sk https://192.168.10.13:6443/version | jq .gitVersion
curl -sk https://192.168.10.10:6443/version | jq .gitVersion  # via HAProxy

All should return "v1.32.10".

Worker Node Individual Upgrade

With the control plane upgraded, we now upgrade the worker nodes. Whereas the control plane was handled in a single playbook run, here we run the playbook once per worker so that workload health can be checked between nodes. If you have workloads with replicas: 2 spread across two workers, upgrading both simultaneously would cause a full outage.

# Upgrade worker node k8s-node4
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10" \
  --limit "k8s-node4"

This takes approximately 2 minutes per worker node.

What Happens During a Worker Node Upgrade

The process is simpler than control plane upgrades:

Cordon k8s-node4
Drain k8s-node4 (all non-DaemonSet pods are evicted to other nodes)
Upgrade containerd (if version changed — not in patch upgrades)
Run kubeadm upgrade node (updates kubelet configuration)
Restart kubelet service
Uncordon k8s-node4

During the drain, you can observe pods migrating:

# Watch pods being rescheduled from node4 to node5
watch -n 1 'kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node4'

After k8s-node4 is done, upgrade k8s-node5:

# Upgrade worker node k8s-node5
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.32.10" \
  --limit "k8s-node5"
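If you have more than a couple of workers, the per-node runs can be wrapped in a loop that stops at the first failure. This is only a sketch: upgrade_workers is a hypothetical helper, the ANSIBLE_PLAYBOOK and KUBECTL override variables and the 300-second timeout are assumptions of mine, and it expects to be run from the Kubespray directory like the commands above.

```shell
# upgrade_workers VERSION NODE...
# Upgrades each listed worker in turn and stops at the first failure.
# ANSIBLE_PLAYBOOK and KUBECTL can be overridden, for example for a dry run.
upgrade_workers() {
  version=$1
  shift
  for node in "$@"; do
    echo "Upgrading $node to $version"
    "${ANSIBLE_PLAYBOOK:-ansible-playbook}" upgrade-cluster.yml \
      -e kube_version="$version" --limit "$node" || return 1
    # Block until the node reports Ready before touching the next one.
    "${KUBECTL:-kubectl}" wait --for=condition=Ready "node/$node" \
      --timeout=300s || return 1
  done
}
```

Calling upgrade_workers 1.32.10 k8s-node4 k8s-node5 would then be equivalent to the two runs shown above, with a readiness check in between.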

Post-Worker-Upgrade Verification

# All nodes should now be v1.32.10
kubectl get nodes -o wide

NAME        STATUS   ROLES           VERSION
k8s-node1   Ready    control-plane   v1.32.10
k8s-node2   Ready    control-plane   v1.32.10
k8s-node3   Ready    control-plane   v1.32.10
k8s-node4   Ready    <none>          v1.32.10
k8s-node5   Ready    <none>          v1.32.10

# Verify all system pods are healthy
kubectl -n kube-system get pods -o wide

# Verify workloads are running normally
kubectl get pods --all-namespaces -o wide | grep -v Running | grep -v Completed
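Eyeballing the VERSION column works for five nodes but not for fifty. As a rough sketch, the check can be reduced to a filter that prints only the stragglers; nodes_not_at_version is a hypothetical name, and the function assumes the default column order of kubectl get nodes (VERSION as the fifth field).

```shell
# nodes_not_at_version VERSION
# Reads `kubectl get nodes --no-headers` output on stdin and prints each node
# whose VERSION column (the fifth field) differs from the expected version.
nodes_not_at_version() {
  awk -v want="$1" '$5 != want { print $1, $5 }'
}
```

For this upgrade, kubectl get nodes --no-headers | nodes_not_at_version v1.32.10 should print nothing once every node is done.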

Updating Admin kubectl and kubeconfig

After the cluster upgrade, the admin node (admin-lb) should also have its kubectl binary updated to match the cluster version. While older kubectl versions can communicate with newer API servers (within one minor version skew), it is best practice to keep them aligned.

# On the admin-lb node (192.168.10.10)

# Check current kubectl version
kubectl version --client

Client Version: v1.32.9

# Update kubectl to match the cluster version
# First, ensure the Kubernetes yum repo points to v1.32
cat /etc/yum.repos.d/kubernetes.repo

[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key

# Install the latest kubectl in the v1.32 series
dnf install kubectl --disableexcludes=kubernetes -y

# Verify the new version
kubectl version --client

Client Version: v1.32.10

Also refresh the kubeconfig from the first control plane node, as the certificate or cluster configuration may have been updated:

# Copy the updated kubeconfig from the first control plane
scp k8s-node1:/root/.kube/config /root/.kube/config

# Verify connectivity
kubectl cluster-info
kubectl get nodes -o wide

The patch upgrade from v1.32.9 to v1.32.10 is now complete.

Minor Upgrade: v1.32.10 → v1.33.7

A minor version upgrade is more significant than a patch upgrade. Minor releases introduce new features, new APIs, deprecate old APIs, and may change default behaviors. However, the upgrade procedure with Kubespray is nearly identical — the difference lies primarily in what changes inside the cluster.

Pre-Upgrade: Review Release Notes

Before any minor upgrade, review the Kubernetes changelog for breaking changes, and confirm that your Kubespray release supports the target version:

# Check what version Kubespray v2.29.1 supports for v1.33
grep -r "kube_version" roles/kubespray-defaults/defaults/main/download.yml | head -5

Kubernetes follows a strict version skew policy: you can only upgrade one minor version at a time. Jumping from v1.32 to v1.34 directly is not supported. You must go v1.32 → v1.33 → v1.34.
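The one-minor-at-a-time rule is easy to encode as a guard in front of the playbook. Here is a minimal sketch with hypothetical helper names; it assumes both versions share major version 1 and does not check whether your Kubespray release actually supports the target.

```shell
# minor_of VERSION: prints the minor component of "1.33.7" or "v1.33.7".
minor_of() {
  echo "${1#v}" | cut -d. -f2
}

# is_single_minor_step FROM TO
# Succeeds when TO is in the same minor series as FROM or exactly one ahead.
# Downgrades and skipped minors both fail.
is_single_minor_step() {
  from=$(minor_of "$1")
  to=$(minor_of "$2")
  step=$(( to - from ))
  [ "$step" -eq 0 ] || [ "$step" -eq 1 ]
}
```

With this, is_single_minor_step 1.32.10 1.33.7 succeeds, while is_single_minor_step 1.32.10 1.34.3 fails.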

Running the Minor Upgrade

The commands are identical to the patch upgrade — only the kube_version value changes.

Step 1: Upgrade Control Plane

cd /root/kubespray

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.33.7" \
  --limit "kube_control_plane:etcd"

This takes approximately 18 minutes for three control plane nodes. The longer duration compared to the patch upgrade is because:

More container images need to be downloaded (some support components have new versions)
kubeadm upgrade apply performs more extensive preflight checks when crossing minor versions
Static pod restarts may take slightly longer as new features are initialized

The rolling upgrade process is identical:

Node 1: cordon → drain → kubeadm upgrade apply v1.33.7 → uncordon
Node 2: cordon → drain → kubeadm upgrade node → uncordon
Node 3: cordon → drain → kubeadm upgrade node → uncordon

During the upgrade, monitor the API server availability:

# Continuous health check through HAProxy
while true; do
  HTTP_CODE=$(curl -sk -o /dev/null -w "%{http_code}" https://192.168.10.10:6443/healthz)
  echo "$(date '+%H:%M:%S') - API Server: $HTTP_CODE"
  sleep 1
done

You should see uninterrupted 200 responses throughout the control plane upgrade, thanks to the HAProxy load balancer distributing requests across the three API servers.
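If you save the probe output to a file, you can turn "I think it stayed up" into a number afterwards. The sketch below assumes the line format printed by the loop above; summarize_health_log and the log file name are hypothetical.

```shell
# summarize_health_log: reads lines such as "12:00:01 - API Server: 200" on
# stdin and prints the number of probes and how many were not HTTP 200.
# A failed connection shows up as code 000 and is counted as a failure.
summarize_health_log() {
  awk '{ total++; if ($NF != "200") failed++ }
       END { printf "probes=%d failed=%d\n", total, failed }'
}
```

For example, append | tee api-health.log to the loop while the upgrade runs, then run summarize_health_log < api-health.log once it finishes.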

Step 2: Upgrade Worker Nodes

# Upgrade workers one at a time
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.33.7" \
  --limit "k8s-node4"

ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.33.7" \
  --limit "k8s-node5"

Each worker takes approximately 3 minutes.

Alternatively, you can upgrade all workers at once if you have sufficient capacity:

# Upgrade all workers (serial behavior controlled by playbook)
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.33.7" \
  --limit "kube_node"

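With the default serial of 20%, both workers are still upgraded one after the other: 20% of two nodes rounds down to zero, and Ansible never uses a batch smaller than one host.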

Step 3: Verify the Minor Upgrade

# All nodes should show v1.33.7
kubectl get nodes -o wide

NAME        STATUS   ROLES           VERSION
k8s-node1   Ready    control-plane   v1.33.7
k8s-node2   Ready    control-plane   v1.33.7
k8s-node3   Ready    control-plane   v1.33.7
k8s-node4   Ready    <none>          v1.33.7
k8s-node5   Ready    <none>          v1.33.7

# Verify API version
curl -sk https://192.168.10.10:6443/version | jq .

{
  "major": "1",
  "minor": "33",
  "gitVersion": "v1.33.7",
  "buildDate": "...",
  "goVersion": "go1.24.x",
  "compiler": "gc",
  "platform": "linux/amd64"
}

# Verify system pods
kubectl -n kube-system get pods -o wide

# Check for any deprecated API usage in your workloads
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

Step 4: Update Admin kubectl

For a minor version upgrade, the kubectl repo must be updated to the new minor version:

# On admin-lb node

# Update the Kubernetes yum repo to point to v1.33
cat > /etc/yum.repos.d/kubernetes.repo << 'EOF'
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/repodata/repomd.xml.key
EOF

# Install kubectl v1.33
dnf install kubectl --disableexcludes=kubernetes -y

# Verify
kubectl version --client

Client Version: v1.33.7

# Refresh kubeconfig
scp k8s-node1:/root/.kube/config /root/.kube/config

# Confirm connectivity
kubectl cluster-info
kubectl get nodes
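Since the repo file differs only in the minor version, it can be generated rather than retyped on every minor upgrade. This is a small sketch; kubernetes_repo is a hypothetical helper that reproduces the file contents shown above for whatever minor series you pass in.

```shell
# kubernetes_repo MINOR
# Prints a yum repo definition for the given minor series, e.g. "1.33".
# A leading "v" in the argument is accepted and stripped.
kubernetes_repo() {
  minor=${1#v}
  cat <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v${minor}/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v${minor}/rpm/repodata/repomd.xml.key
EOF
}
```

On the admin node this would be used as kubernetes_repo 1.33 > /etc/yum.repos.d/kubernetes.repo, followed by the same dnf install command as before.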

The minor upgrade from v1.32.10 to v1.33.7 is now complete.

Major Upgrade with Kubespray Version Bump: v1.33.7 → v1.34.3

This is the most complex upgrade scenario. Not only are we upgrading Kubernetes by another minor version (v1.33 → v1.34), but we are also upgrading Kubespray itself from v2.29.1 to v2.30.0. A Kubespray version bump means:

New Ansible roles and tasks
Updated default variable values
New versions of supporting components (etcd, containerd, CoreDNS)
Potentially new Python dependencies for Ansible
Changed Ansible collection requirements

This is the type of upgrade that most closely mirrors what happens in real production environments, where the infrastructure tooling evolves alongside the target platform.

Switching Kubespray Tags (v2.29.1 → v2.30.0)

First, we need to switch the Kubespray repository to the new release tag.

cd /root/kubespray

# Check current Kubespray version
git describe --tags

v2.29.1

# Fetch the latest tags
git fetch --all --tags

# Check available v2.30.x tags
git tag -l "v2.30*"

v2.30.0

# Switch to the new version
git checkout v2.30.0

Note: switching to 'v2.30.0'.
You are in 'detached HEAD' state...
HEAD is now at  Release v2.30.0

After switching, verify what Kubernetes versions this Kubespray release supports:

# Check the supported K8s version range
grep -r "kube_version_min_required\|kube_version" roles/kubespray-defaults/defaults/main/*.yml | grep -i version | head -10

Also check what component versions ship with this release:

# etcd version
grep "etcd_version" roles/download/defaults/main/main.yml

etcd_version: v3.5.26

# containerd version
grep "containerd_version" roles/download/defaults/main/main.yml

containerd_version: 2.2.1

# CoreDNS version
grep "coredns_version" roles/download/defaults/main/main.yml

This reveals that upgrading to Kubespray v2.30.0 will also upgrade:

etcd: 3.5.25 → 3.5.26
containerd: 2.1.5 → 2.2.1

These component upgrades happen automatically as part of the Kubernetes upgrade because Kubespray controls their versions through role defaults.
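Rather than comparing grep output by eye, you can diff the version variables between two tags. The following is a sketch under assumptions: changed_versions is a hypothetical helper, and it expects two files containing simple "key: value" lines such as the grep results above.

```shell
# changed_versions OLD_FILE NEW_FILE
# Both files hold "key: value" lines. Prints "key: old -> new" for every key
# that appears in both files with a different value.
changed_versions() {
  awk -F': *' '
    NR == FNR { old[$1] = $2; next }      # first file: remember old values
    ($1 in old) && old[$1] != $2 {        # second file: report differences
      print $1 ": " old[$1] " -> " $2
    }
  ' "$1" "$2"
}
```

One way to produce the two input files is git show TAG:PATH | grep "_version:" for each tag, using the same defaults file the grep commands above read (assuming that path exists at both tags).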

Python Virtual Environment for Dependency Isolation

Kubespray v2.30.0 may require different Python package versions than v2.29.1. Installing new dependencies directly on the system can break existing Ansible setups. A Python virtual environment provides isolation.

# Check the new requirements
cat /root/kubespray/requirements.txt

ansible==10.7.0
cryptography==46.0.3
jinja2==3.1.5
jmespath==1.1.0
MarkupSafe==3.0.2
netaddr==1.3.0
pbr==6.1.1
ruamel.yaml==0.18.14

Compare with what is currently installed:

pip3 list 2>/dev/null | grep -i -E "ansible|cryptography|jmespath|netaddr"
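The comparison can also be automated. This is a minimal sketch: requirements_drift is a hypothetical helper, it only understands pinned name==version lines, and it compares package names case-insensitively without any further normalization.

```shell
# requirements_drift REQUIREMENTS_FILE
# Reads `pip3 list --format=freeze` output on stdin and prints every pinned
# requirement whose installed version is different or missing.
requirements_drift() {
  installed=$(cat)
  grep -E '^[A-Za-z0-9._-]+==' "$1" | while IFS= read -r line; do
    name=${line%%==*}
    want=${line#*==}
    # Look up the installed version, ignoring case in the package name.
    have=$(printf '%s\n' "$installed" |
      awk -F'==' -v n="$name" 'tolower($1) == tolower(n) { print $2; exit }')
    [ "$have" = "$want" ] || echo "$name: installed ${have:-missing}, required $want"
  done
}
```

For example: pip3 list --format=freeze 2>/dev/null | requirements_drift /root/kubespray/requirements.txt. No output means the installed packages already match the pins.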

If versions differ, install the new requirements:

# Option 1: Direct installation (simpler, our lab approach)
pip3 install -r requirements.txt --break-system-packages

# Option 2: Virtual environment (recommended for production admin nodes)
python3 -m venv /root/kubespray-venv
source /root/kubespray-venv/bin/activate
pip3 install -r requirements.txt

If using a virtual environment, remember to activate it before running any Ansible commands:

source /root/kubespray-venv/bin/activate

Verify the key packages:

pip3 list | grep -i -E "ansible|cryptography|jmespath|netaddr"

ansible          10.7.0
cryptography     46.0.3
jmespath         1.1.0
netaddr          1.3.0

Also verify that Ansible can reach all nodes:

cd /root/kubespray
ansible -i inventory/mycluster/inventory.ini all -m ping

All nodes should return pong.

etcd Version Upgrade (3.5.25 → 3.5.26)

One of the most critical parts of a Kubespray version bump is the etcd upgrade. etcd is the single source of truth for all cluster state — a failed etcd upgrade can be catastrophic. Kubespray handles this carefully:

Before modifying any etcd member, Kubespray creates a backup snapshot
Each etcd member is upgraded one at a time
After each member upgrade, cluster health is verified before proceeding

Pre-Upgrade etcd Status

# etcd member list
ssh k8s-node1 etcdctl.sh member list -w table

+------------------+---------+-----------+---------------------------+---------------------------+
|        ID        | STATUS  |   NAME    |       PEER ADDRS          |      CLIENT ADDRS         |
+------------------+---------+-----------+---------------------------+---------------------------+
| 8e9e05c52164694d | started | etcd1     | https://192.168.10.11:2380| https://192.168.10.11:2379|
| 91bc3c398fb3c146 | started | etcd2     | https://192.168.10.12:2380| https://192.168.10.12:2379|
| fd422379fda50e48 | started | etcd3     | https://192.168.10.13:2380| https://192.168.10.13:2379|
+------------------+---------+-----------+---------------------------+---------------------------+

# etcd version before upgrade
ssh k8s-node1 etcdctl.sh endpoint status -w table

The version column should show 3.5.25.

etcd Upgrade Process (Automatic during K8s upgrade)

The etcd upgrade happens automatically when you run the upgrade playbook. Kubespray detects that the new release expects etcd v3.5.26 and upgrades each member:

TASK [etcd : Backup etcd data] **************************************************
changed: [k8s-node1]
changed: [k8s-node2]
changed: [k8s-node3]

TASK [etcd : Upgrade etcd member] ***********************************************
changed: [k8s-node1]

TASK [etcd : Wait for etcd cluster health] **************************************
ok: [k8s-node1]

TASK [etcd : Upgrade etcd member] ***********************************************
changed: [k8s-node2]

TASK [etcd : Wait for etcd cluster health] **************************************
ok: [k8s-node2]

TASK [etcd : Upgrade etcd member] ***********************************************
changed: [k8s-node3]

Each etcd member is restarted with the new binary. During the restart of a single member, the etcd cluster maintains quorum with the remaining two members (2/3 = majority), so there is no data availability interruption.
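The quorum arithmetic behind that statement is simple enough to write down. As a quick sketch (the function names are hypothetical), quorum is a strict majority of the members, and fault tolerance is whatever is left over:

```shell
# etcd_quorum MEMBERS: members that must be reachable for the cluster to
# keep accepting writes (a strict majority).
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# etcd_fault_tolerance MEMBERS: members that can be down without losing quorum.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For a three-member cluster this gives a quorum of 2 and a tolerance of 1, which is why restarting one member at a time is safe. Four members raise the quorum to 3 without raising the tolerance, so even sizes buy nothing.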

etcd Backup Verification

After the upgrade, verify that backups were created:

ssh k8s-node1 tree /var/backups/ | head -20

/var/backups/
├── etcd-20260207-before-upgrade/
│   └── member/
│       ├── snap/
│       │   └── db
│       └── wal/
│           └── ...

These backups are critical — if anything goes wrong during the upgrade, you can restore etcd from these snapshots. Kubespray creates timestamped backup directories under /var/backups/ on each etcd member.
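A backup directory that exists but holds an empty database file is not much of a safety net. The sketch below is a basic sanity check, assuming the directory layout shown in the tree output above; verify_etcd_backup is a hypothetical helper, and a non-empty file does not by itself prove the backup can be restored.

```shell
# verify_etcd_backup DIR
# Succeeds when DIR contains a non-empty member/snap/db file.
verify_etcd_backup() {
  db="$1/member/snap/db"
  if [ -s "$db" ]; then
    echo "ok: $db"
  else
    echo "missing or empty: $db" >&2
    return 1
  fi
}
```

Run it on each etcd member against the newest directory under /var/backups/ before you rely on that backup for a rollback.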

Post-Upgrade etcd Verification

# Verify etcd version is now 3.5.26
ssh k8s-node1 etcdctl.sh endpoint status -w table

# Verify cluster health
ssh k8s-node1 etcdctl.sh endpoint health -w table

+---------------------------+--------+-------+-------+
|         ENDPOINT          | HEALTH |  TOOK | ERROR |
+---------------------------+--------+-------+-------+
| https://192.168.10.11:2379| true   |  12ms |       |
| https://192.168.10.12:2379| true   |  11ms |       |
| https://192.168.10.13:2379| true   |  13ms |       |
+---------------------------+--------+-------+-------+

containerd Upgrade (2.1.5 → 2.2.1)

The container runtime upgrade is another significant change that comes with the Kubespray version bump. containerd is the CRI (Container Runtime Interface) implementation that actually runs containers on each node. Upgrading it requires restarting the containerd service, which briefly affects container operations on that node.

Kubespray handles this as part of the rolling upgrade:

The node is already cordoned and drained (no user workloads running)
containerd binary is replaced
containerd service is restarted
kubelet is restarted (reconnects to the new containerd socket)
Static pods are re-created by the kubelet

containerd Upgrade is Transparent

You do not need to run separate commands for containerd — it is upgraded automatically during the upgrade-cluster.yml execution:

TASK [container-engine/containerd : Containerd | Download containerd] ***********
changed: [k8s-node1]

TASK [container-engine/containerd : Containerd | Install containerd] ************
changed: [k8s-node1]

TASK [container-engine/containerd : Containerd | Restart containerd] ************
changed: [k8s-node1]

Running the Full Upgrade

Now, execute the complete upgrade:

cd /root/kubespray

# Step 1: Upgrade control plane and etcd (includes etcd + containerd upgrades)
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.34.3" \
  --limit "kube_control_plane:etcd"

This takes approximately 15 minutes for three control plane nodes. The additional time compared to the patch upgrade comes from the etcd restart/upgrade and the containerd binary replacement on each node.

# Step 2: Upgrade worker nodes
ansible-playbook upgrade-cluster.yml \
  -e kube_version="1.34.3" \
  --limit "kube_node"

Worker nodes take approximately 4 minutes each because containerd also needs to be upgraded on workers.

Verifying containerd Version

After the upgrade completes:

# Check containerd version on all nodes
ansible -i inventory/mycluster/inventory.ini all -m shell -a "containerd --version"

k8s-node1 | SUCCESS | containerd containerd.io 2.2.1 ...
k8s-node2 | SUCCESS | containerd containerd.io 2.2.1 ...
k8s-node3 | SUCCESS | containerd containerd.io 2.2.1 ...
k8s-node4 | SUCCESS | containerd containerd.io 2.2.1 ...
k8s-node5 | SUCCESS | containerd containerd.io 2.2.1 ...

# Verify container images on a node
ssh k8s-node1 crictl images

You should see images for Kubernetes v1.34.3 alongside the upgraded etcd v3.5.26 image.

Post-Upgrade Cluster Verification

# All nodes should show v1.34.3
kubectl get nodes -o wide

NAME        STATUS   ROLES           VERSION
k8s-node1   Ready    control-plane   v1.34.3
k8s-node2   Ready    control-plane   v1.34.3
k8s-node3   Ready    control-plane   v1.34.3
k8s-node4   Ready    <none>          v1.34.3
k8s-node5   Ready    <none>          v1.34.3

# API version check through all endpoints
for ip in 192.168.10.{11,12,13,10}; do
  echo -n "$ip: "
  curl -sk https://$ip:6443/version | jq -r .gitVersion
done

192.168.10.11: v1.34.3
192.168.10.12: v1.34.3
192.168.10.13: v1.34.3
192.168.10.10: v1.34.3

# Verify all system components
kubectl -n kube-system get pods -o wide

# Check downloaded release binaries
ssh k8s-node1 tree /tmp/releases

Helm and kubectl Version Updates

After a major upgrade, the admin tooling must be updated to match. This includes both kubectl and helm.

Updating kubectl to v1.34

# On admin-lb node

# Update the Kubernetes yum repo to v1.34
cat > /etc/yum.repos.d/kubernetes.repo << 'EOF'
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/repodata/repomd.xml.key
EOF

# Install kubectl
dnf clean metadata
dnf install kubectl --disableexcludes=kubernetes -y

# Verify
kubectl version --client

Client Version: v1.34.3

Updating Helm

Kubespray v2.30.0 may also ship with or recommend a newer Helm version. In our case, Helm is updated to v3.20.0:

# Check current Helm version
helm version --short

v3.17.3+...

# Install Helm v3.20.0
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | DESIRED_VERSION=v3.20.0 bash

Downloading https://get.helm.sh/helm-v3.20.0-linux-amd64.tar.gz
Verifying checksum...
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm

# Verify the new version
helm version --short

v3.20.0+...

Refreshing kubeconfig

As with previous upgrades, refresh the kubeconfig from the first control plane node:

scp k8s-node1:/root/.kube/config /root/.kube/config

# Final verification
kubectl cluster-info
kubectl get nodes -o wide
kubectl -n kube-system get pods

HAProxy Stats Verification

Finally, verify that the HAProxy load balancer still shows all three control plane backends as healthy:

# Open HAProxy stats page
# URL: http://192.168.10.10:9000/haproxy_stats

# Or check via CLI
curl -s http://192.168.10.10:9000/haproxy_stats\;csv | grep k8s_api | cut -d',' -f1-3,18

All three backends should show status UP.
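For a scripted check, the CSV can be filtered down to the servers that are not UP. This is a sketch only: haproxy_servers_not_up is a hypothetical helper, and it relies on the standard HAProxy CSV layout in which the proxy name, server name, and status are fields 1, 2, and 18.

```shell
# haproxy_servers_not_up PROXY
# Reads HAProxy stats CSV on stdin and prints "server status" for each server
# of PROXY whose status (field 18) does not start with "UP".
# The FRONTEND and BACKEND summary rows are skipped.
haproxy_servers_not_up() {
  awk -F',' -v px="$1" '
    $1 == px && $2 != "FRONTEND" && $2 != "BACKEND" && $18 !~ /^UP/ {
      print $2, $18
    }
  '
}
```

Feeding it the same curl output as above, with k8s_api as the proxy name, should print nothing when all three API servers are healthy.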

Prometheus/Grafana Monitoring Check

If you have the kube-prometheus-stack installed (see Section 8), verify that all monitoring targets are still healthy after the upgrade:

# Prometheus targets (check via NodePort)
# URL: http://192.168.10.14:30001/targets

# Grafana dashboards (check via NodePort)
# URL: http://192.168.10.14:30002
# Login: admin / prom-operator

Key dashboards to check:

Kubernetes / API server (dashboard 15661): Verify API server latency and error rates returned to normal
etcd (dashboard 12693): Verify etcd leader elections did not spike, DB size is stable
Node Exporter (dashboard from kube-prometheus-stack): Verify resource utilization across all nodes

Upgrade Summary

Over the course of this section, we performed three distinct upgrade paths, each building on the complexity of the previous one:

Patch Upgrade (v1.32.9 → v1.32.10): The simplest case — only Kubernetes binaries changed. No etcd, no containerd, no Kubespray changes. Demonstrated the core rolling upgrade mechanics: cordon, drain, kubeadm upgrade, uncordon.

Minor Upgrade (v1.32.10 → v1.33.7): Same procedure, but with potential API deprecations and longer upgrade times. Reinforced the importance of upgrading one minor version at a time and updating the admin kubectl repository pointer.

Major Upgrade with Kubespray Bump (v1.33.7 → v1.34.3): The full production scenario — new Kubespray tag, new Python dependencies, new etcd version, new containerd version, new Helm version, and the Kubernetes upgrade itself. This is the upgrade type that production teams encounter most frequently and the one that requires the most preparation.

Throughout all three upgrades, the cluster remained available. The HAProxy load balancer ensured API server continuity, the rolling upgrade strategy preserved workload availability, and etcd quorum was maintained during member restarts. This is the power of combining Kubespray’s automation with a properly designed HA architecture.

Conclusion

Key Takeaways

This guide walked through the complete lifecycle of a production-oriented Kubernetes cluster managed by Kubespray. Here are the most important lessons:

HA is not optional: it is the foundation. A 3-node control plane backed by a 3-member etcd cluster (quorum of 2) is the minimum viable production architecture. Without it, any single control plane node failure renders the entire cluster unmanageable. The cost of 2 additional control plane nodes is negligible compared to the cost of downtime.

The API endpoint configuration has real consequences. The three cases we explored (client-side NGINX proxy, external LB + client-side proxy, external LB only) each have distinct trade-offs. Client-side proxies give workers independence from external infrastructure but add per-node complexity. External-only LB simplifies worker configuration but makes the load balancer a critical dependency. Choose based on your infrastructure constraints and failure tolerance.

Node management is a routine operation, not an emergency. Adding workers, removing failed nodes, and replacing control plane nodes should all be practiced regularly. The distinction between scale.yml (workers only) and cluster.yml (control plane additions) is critical — using the wrong playbook leads to incomplete configurations. Always add new control plane nodes at the end of the inventory group.

etcd is the heart of the cluster. Every piece of cluster state lives in etcd. Always use an odd number of members (3, 5, 7) for proper quorum. Always verify etcd health after any cluster modification. Always ensure backups are being created before upgrades.

Upgrades should be frequent and incremental. Patch upgrades (v1.32.9 → v1.32.10) are low-risk and take minutes. Minor upgrades (v1.32 → v1.33) require more attention but follow the same pattern. Skipping versions creates compounding technical debt and makes each subsequent upgrade more painful. Kubernetes only supports upgrading one minor version at a time.

Kubespray version matters. Each Kubespray release supports a specific range of Kubernetes versions. When crossing into a new Kubernetes minor version, check whether your Kubespray version supports it. Switching Kubespray versions (v2.29.1 → v2.30.0) also requires updating Python dependencies — always run pip install -r requirements.txt after switching tags.

Monitoring is not a nice-to-have. You cannot upgrade with confidence if you cannot see what is happening. kube-prometheus-stack with Prometheus, Grafana, and etcd metrics gives you the observability needed to verify cluster health before, during, and after upgrades.
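
The one-minor-version-at-a-time rule from the upgrade takeaway is easy to enforce before a playbook ever runs. The sketch below is a hypothetical pre-flight check, not something Kubespray ships. It assumes versions look like v1.33.7 and that the major version stays at 1.

```shell
# Extract the minor number from a version such as v1.33.7
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# Succeeds (exit 0) only when the target is the same minor version
# or exactly one minor version ahead of the current one.
can_upgrade() {
  cur=$(minor_of "$1")
  tgt=$(minor_of "$2")
  step=$((tgt - cur))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

can_upgrade v1.32.10 v1.33.7 && echo "ok: single minor step"
can_upgrade v1.32.10 v1.34.3 || echo "refuse: skips v1.33"
```

Dropping a check like this in front of upgrade-cluster.yml turns a rule you have to remember into one the tooling remembers for you.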

Kubespray as a Production Lifecycle Management Framework

Kubespray is more than a deployment tool — it is a lifecycle management framework. The five core playbooks (cluster.yml, scale.yml, remove-node.yml, upgrade-cluster.yml, reset.yml) cover the full operational spectrum:

Phase      Playbook              Use Case
Deploy     cluster.yml           Initial deployment, adding control plane nodes
Scale      scale.yml             Adding worker nodes
Shrink     remove-node.yml       Removing any node (graceful or forced)
Upgrade    upgrade-cluster.yml   Rolling upgrades with drain/cordon/uncordon
Teardown   reset.yml             Full cluster reset

Because Kubespray is built on Ansible, every operation is:

Idempotent: Running a playbook twice produces the same result. Safe to retry after failures.
Declarative: The inventory and variables define the desired state; Kubespray converges the cluster to match.
Auditable: Every playbook run can be logged, and the inventory can be version-controlled in Git.
Extensible: Custom roles and tasks can be added for organization-specific requirements (certificate management, custom CNI configurations, security hardening).

The combination of Git-managed inventory, Ansible’s idempotency, and Kubespray’s rolling upgrade capabilities makes it a solid foundation for teams that need to manage Kubernetes clusters without relying on a managed cloud service.

Next Steps: Disaster Recovery, Automation, and Beyond

This guide covered the core operational workflows, but several advanced topics extend naturally from here:

Disaster Recovery (DR):

Implement automated etcd snapshot backups (e.g., CronJob running etcdctl snapshot save to an off-cluster storage backend)
Practice etcd restoration from snapshots: etcdctl snapshot restore
Document and test the full cluster recovery procedure: restore etcd → restart control plane → verify state
Consider cluster backup solutions like Velero for backing up Kubernetes resources and PVs alongside your etcd snapshots
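
As a starting point for the snapshot automation, here is a rough sketch of a backup-and-prune script. The etcdctl flags are the standard ones, but the ETCD_ENDPOINT, ETCD_CACERT, ETCD_CERT, and ETCD_KEY variables and the retention count of 7 are placeholders I made up. Point them at whatever your cluster actually uses.

```shell
# Timestamped snapshot name. UTC timestamps make the names sort chronologically.
snapshot_name() {
  printf 'etcd-%s.db\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Keep only the newest $2 snapshots in directory $1; other files are left alone.
prune_snapshots() {
  dir=$1
  keep=$2
  ls -1 "$dir" | grep '^etcd-.*\.db$' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
      rm -f "$dir/$f"
    done
}

# Take a snapshot, then prune old ones. Not called here: it needs etcdctl
# and the (hypothetical) ETCD_* variables set for your environment.
backup_etcd() {
  dir=$1
  ETCDCTL_API=3 etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_CACERT" --cert="$ETCD_CERT" --key="$ETCD_KEY" \
    snapshot save "$dir/$(snapshot_name)" &&
    prune_snapshots "$dir" 7
}
```

Pruning only runs when the snapshot succeeds, so a failing backup never eats into the copies you already have.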

CI/CD-Driven Upgrades:

Integrate Kubespray playbooks into a CI/CD pipeline (Jenkins, GitLab CI, GitHub Actions)
Use a Git branch per environment (dev, staging, production)
Automate the upgrade sequence: run in staging → run health checks → promote to production
Add pre-upgrade and post-upgrade verification steps (API health, etcd quorum, workload status) as pipeline gates
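
One of those pipeline gates can be as small as this. The sketch assumes the usual column layout of kubectl get nodes --no-headers, where the second column is STATUS. The function names are hypothetical.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# A cordoned node shows "Ready,SchedulingDisabled" and is counted too,
# which is what you want after an upgrade that should have uncordoned it.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Pipeline gate: exits non-zero if any node is not fully Ready.
#   kubectl get nodes --no-headers | gate_nodes_ready
gate_nodes_ready() {
  not_ready=$(count_not_ready)
  [ "$not_ready" -eq 0 ]
}
```

Most CI systems treat a non-zero exit as a failed step, so the gate blocks promotion without any extra glue.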

Security Hardening:

Enable PodSecurityAdmission (or OPA/Gatekeeper) for workload policy enforcement
Implement NetworkPolicies to restrict pod-to-pod communication
Configure certificate rotation and monitor certificate expiration dates
Harden etcd with TLS client authentication and restrict access to control plane nodes only

Advanced HA:

Deploy HAProxy in HA with keepalived for virtual IP failover (eliminates LB as SPOF)
Consider running 5 etcd members for higher fault tolerance in large clusters
Implement pod topology spread constraints and anti-affinity rules for critical workloads

Multi-Cluster Management:

Use Kubespray to manage multiple clusters from a single admin node
Maintain separate inventories per cluster in the same Git repository
Standardize configurations across clusters using shared group_vars with cluster-specific overrides

The techniques in this guide — HA deployment, rolling upgrades, node lifecycle management, and observability — form the operational foundation upon which all of these advanced topics are built. Master these fundamentals first, and the rest follows naturally.

Lab Cleanup: When you’re done, tear down the entire environment with vagrant destroy -f. All VMs and their data will be removed.

Kubernetes, The Kubespray Way

Sigrid Jin — Sat, 31 Jan 2026 21:59:53 GMT

You can optimize Kubernetes infrastructure provisioning and lifecycle management through Kubespray’s Ansible-based automation framework, leveraging codified community expertise to deliver robust, production-grade environments.

Chapter 1: Why Kubespray? Understanding the Automation Spectrum

If you have been following the Kubernetes learning path properly, you have probably gone through something like Kubernetes The Hard Way first. You manually provisioned VMs, generated certificates with OpenSSL, created kubeconfig files by hand, configured etcd as a systemd service, and set up each control plane component one by one. It was painful. It took hours. But you understood exactly what was happening.

Then you discovered kubeadm. Suddenly, all that certificate generation, etcd bootstrapping, and control plane deployment collapsed into a single command:

kubeadm init --config=kubeadm-config.yaml

The hours of manual work became minutes. But you still had to prepare each node yourself — disable swap, load kernel modules, install containerd, configure kubelet. You still had to SSH into each machine and run commands.

Now we arrive at Kubespray. One command, and everything happens:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml

Fifteen minutes later, you have a production-ready Kubernetes cluster. The OS is configured. Containerd is installed. Certificates are generated. etcd is running. The control plane is up. Workers have joined. CNI is deployed. CoreDNS is answering queries.

This is the automation spectrum. Each step up the ladder makes deployment easier and faster. But there is a cost that nobody talks about until it bites them.

What Kubespray Actually Is

The Kubespray GitHub repository introduces itself with a simple tagline:

Deploy a Production Ready Kubernetes Cluster

This is not marketing speak. Kubespray genuinely aims to give you a cluster that you can run workloads on immediately, configured according to best practices that the Kubernetes community has learned through years of production experience.

The official documentation describes Kubespray as:

Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks.

That phrase “domain knowledge” is the key. Kubespray is not just a script that runs commands in order. It encodes decisions about how to configure NTP synchronization, what kernel parameters to set, how to structure etcd clusters, when to use IPVS versus iptables for kube-proxy, and hundreds of other operational details.

Kubespray is maintained by the Kubernetes SIG Cluster Lifecycle group. Since version 2.3, it uses kubeadm internally for the actual cluster bootstrapping. This means Kubespray handles everything that kubeadm does not — the OS preparation, container runtime installation, CNI deployment, and post-installation configuration.

Think of it as layers:

Manual installation: You do everything
kubeadm: Automates cluster bootstrapping, you do the rest
Kubespray: Automates everything, you configure variables

The Kubespray repository is organized as a standard Ansible project.

Each directory under roles contains the logic for one piece of the cluster. The bootstrap-os role configures the operating system. The container-engine role installs containerd. The etcd role sets up the etcd cluster. And so on.

The Double-Edged Sword

Here is the uncomfortable truth about automation: the more it does for you, the less you understand about what is actually happening.

When you run cluster.yml, Kubespray executes hundreds of tasks across dozens of roles. It loads kernel modules, writes configuration files, downloads binaries, generates certificates, creates systemd units, applies Kubernetes manifests, and configures networking. All of this happens behind a stream of task names scrolling past in your terminal.

If the playbook completes successfully, you get a working cluster. You can run kubectl get nodes and see your machines. You can deploy applications. Everything works.

But what happens when something goes wrong six months later?

A pod is not scheduling. You check the events, and they mention resource pressure. You look at the node, and something is consuming all the memory. Is it the kubelet configuration? The container runtime? A kernel parameter? Where do you even look?

Network traffic between pods is failing. Packets are being dropped somewhere. Is it the CNI plugin? An iptables rule? A kernel module that is not loaded? Which configuration file controls this behavior?

Certificates have expired. The API server is refusing connections. How do you renew them? Where are they stored? What tool generated them in the first place?

If you installed everything manually, you would know the answers because you typed every command yourself. If you used kubeadm with understanding, you would know that kubeadm certs renew exists and where the certificates live.

But if you just ran ansible-playbook cluster.yml without understanding what it does, you are now staring at a broken cluster with no idea how to fix it.

This is the black box problem. The automation that saved you hours during installation now costs you days during troubleshooting.

The Upgrade Trap

Kubernetes moves fast. A new minor version releases every four months. Security patches come out regularly. Running an outdated cluster is a security risk and eventually a compatibility problem as client tools and container images stop supporting old API versions.

Kubespray provides an upgrade playbook:

ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.31.0

In theory, you run this command and your cluster upgrades. In practice, upgrades are where black box automation fails hardest.

The upgrade playbook modifies your cluster in place. It drains nodes, updates binaries, restarts services, and hopes everything comes back up. If something fails partway through, you have a partially upgraded cluster in an inconsistent state.

Without understanding what the playbook is doing, you cannot:

Diagnose why a particular step failed
Manually complete the upgrade if automation cannot
Roll back to the previous state safely
Verify that the upgrade completed correctly

The result is that people who do not understand their automation become afraid to upgrade. They skip versions. They fall behind on security patches. The tool that was supposed to help them maintain their cluster becomes the reason they cannot maintain it.
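
Verifying an upgrade does not have to be elaborate. As a small sketch, assuming the standard kubectl get nodes --no-headers layout where the fifth column is the kubelet VERSION, this hypothetical helper lists any node that did not reach the target version:

```shell
# Print the names of nodes whose kubelet VERSION (5th column) differs
# from the target version passed as the first argument.
nodes_not_at_version() {
  awk -v want="$1" '$5 != want { print $1 }'
}

# Usage on a machine with cluster access (not run here):
#   kubectl get nodes --no-headers | nodes_not_at_version v1.31.0
```

Empty output means every node reports the version you asked for. A non-empty list tells you exactly where the playbook stopped short.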

Kubespray Itself Changes

This problem compounds because Kubespray is also evolving. New versions add roles, rename variables, change default values, and restructure configuration files.

If you learned Kubespray 2.20 and then try to use 2.28, you will find that:

Some variable names have changed
New configuration options exist that you have never seen
Default behaviors are different
The role structure may have been reorganized

Your muscle memory and mental model from the old version no longer apply. You need to relearn the tool, but you never really learned it deeply in the first place — you just ran the playbook.

The Temptation of “Just Make It Work”

Every developer has done this at some point. A deadline is approaching. Something needs to be running by tomorrow. You find a tool that claims to solve your problem, you run it, and it works. You move on.

For short-lived systems, this is fine. But Kubernetes clusters are not short-lived. They run for years. They get upgraded. They have nodes added and removed. They suffer hardware failures and network partitions. They need certificates renewed and configurations tuned.

If you do not understand how your cluster was built, you do not really own it. You are at the mercy of the tool. When the tool cannot help you, you are stuck.

I have seen this happen. At a previous job, we used Ansible extensively for infrastructure automation. Nobody on the team really understood Ansible deeply. We copied playbooks from examples, modified them until they worked, and ran them. When something broke, we spent hours debugging YAML indentation and variable precedence because we had never learned how Ansible actually evaluates variables.

The playbooks worked, but we did not understand why they worked. That meant we also did not understand why they failed.

The Philosophy of This Tutorial

This brings us to why this tutorial exists and how it approaches Kubespray.

We are not going to treat Kubespray as a magic button that produces clusters. We are going to open it up, look inside, and understand what it does.

When we run cluster.yml, we will trace through the plays and roles to see what tasks execute. When we configure variables, we will understand where those variables come from and how they affect the final cluster. When we deploy a cluster, we will verify that each component is running correctly and know where to look if it is not.

This takes longer than just running the playbook and hoping for the best. But it means that six months from now, when something breaks at 3 AM, you will have a chance of fixing it.

The goal is not to memorize every variable and task in Kubespray. That would be impossible — the kubespray_defaults role alone has over 800 lines of variables, and there are dozens of other roles. The goal is to understand the structure well enough that you know where to look when you need to find something.

Think of it like learning a new city. You do not memorize every street. But you learn the main neighborhoods, the major roads, and how the transit system works. Then when you need to get somewhere specific, you can figure it out.

What Kubespray Automates

To understand Kubespray, it helps to see exactly what manual work it replaces. If you have done Kubernetes The Hard Way, you will recognize these steps:

Kubernetes The Hard Way had you:

1. Provision compute resources (VMs or physical machines)
2. Generate TLS certificates for every component with OpenSSL
3. Create kubeconfig files for kubelet, controller-manager, scheduler, and admin
4. Generate a data encryption configuration
5. Bootstrap the etcd cluster with systemd units
6. Bootstrap the control plane components as systemd services
7. Configure kubelet and kube-proxy on worker nodes
8. Set up pod networking routes manually

kubeadm automated steps 2 through 6 into kubeadm init and step 7 into kubeadm join. But you still had to:

Prepare the OS (disable swap, load kernel modules, set kernel parameters)
Install the container runtime (containerd)
Install kubeadm, kubelet, and kubectl
Configure the container runtime for Kubernetes
Install a CNI plugin
Set up high availability if you wanted it

Kubespray automates all of this. When you run cluster.yml, these roles execute in sequence across your inventory hosts. The entire process typically takes 15–30 minutes depending on network speed and the number of nodes.

The Comparison That Matters

Here is a concrete comparison of deploying a basic cluster with each approach:

With Kubernetes The Hard Way, you run approximately 200 commands across multiple machines. You SSH into each node, copy files, edit configurations, and start services. The process takes several hours if you are careful and understand what you are doing.

With kubeadm, you still prepare each machine manually, but the cluster bootstrapping is reduced to:

# On the first control plane node
kubeadm init --config=kubeadm-config.yaml

# On each worker node
kubeadm join 192.168.10.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

You save significant time, but you still need to visit each machine.

With Kubespray, you run a single command from your Ansible control node:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b

Kubespray connects to all machines simultaneously, configures them in parallel where possible, and handles the sequencing of operations automatically.

The time savings are dramatic. But remember: you are trading understanding for convenience. This tutorial exists to help you get both.

Chapter 2: Kubespray Core Concepts and Features

Let me start by reading you the official description from the Kubespray GitHub repository. The tagline says “Deploy a Production Ready Kubernetes Cluster.” This is not just marketing speak. It means Kubespray is designed to give you a cluster that you can actually run workloads on immediately, not a toy setup that needs hours of additional configuration.

The official documentation describes Kubespray as “a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks.” Pay attention to that phrase “domain knowledge.” This is what separates Kubespray from a simple shell script that runs kubeadm commands. The Kubespray maintainers have spent years figuring out what works in production environments, what kernel parameters matter, which CNI configurations cause problems, and how to handle edge cases. All of that accumulated wisdom is baked into the playbooks.

Kubespray is maintained by the Kubernetes SIG Cluster Lifecycle group, which means it is an official Kubernetes project. This is not some random GitHub repository that might disappear tomorrow.

One thing that confuses people is how Kubespray relates to kubeadm. They are not competing tools. Since version 2.3, Kubespray has used kubeadm internally for the actual cluster bootstrapping. The official documentation explains this decision:

“Kubespray has started using kubeadm internally for cluster creation since v2.3 in order to consume life cycle management domain knowledge from it and offload generic OS configuration things from it, which hopefully benefits both sides.”

So when you run Kubespray, it is not reinventing the wheel. It is using kubeadm to handle certificate generation, etcd configuration, and control plane bootstrapping. What Kubespray adds on top is everything else that kubeadm does not handle.

Think about what you still need to do manually when using kubeadm alone.

Machine provisioning
OS pre-configuration like time synchronization, SELinux settings, swap disable, kernel parameters
Installing containerd and kubelet
Installing CNI plugins
Setting up load balancers for HA configuration

Kubespray automates all of this. When you run the cluster.yml playbook, you get a complete cluster, not a half-finished one that requires more work.

Kubespray is not just for initial deployment. It handles the entire lifecycle of your cluster through different playbooks.

For creating a new cluster: ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b

For upgrading the cluster to a new Kubernetes version: ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -b

For adding new worker nodes: ansible-playbook -i inventory/mycluster/inventory.ini scale.yml -b

For removing specific nodes: ansible-playbook -i inventory/mycluster/inventory.ini remove-node.yml -b -e node=worker-3

For completely resetting the cluster back to a clean state: ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b

Each of these playbooks is designed to be idempotent, thanks to Ansible. If your deployment fails halfway through, you can run the same command again: Ansible re-checks every task, leaves state that has already converged untouched, and only changes what is still missing. This is enormously valuable when you are dealing with flaky networks or transient failures.
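
If idempotency still feels abstract, here is the idea boiled down to a few lines of plain shell. This is only an illustration of the concept, not how Ansible implements it. The ensure_line helper is made up for this example.

```shell
# Append a line to a file only if it is not already present.
# Prints "changed" or "ok", loosely mirroring how Ansible reports task results.
ensure_line() {
  file=$1
  line=$2
  if [ -f "$file" ] && grep -qxF -- "$line" "$file"; then
    echo "ok"
  else
    printf '%s\n' "$line" >> "$file"
    echo "changed"
  fi
}

f=$(mktemp)
ensure_line "$f" "net.ipv4.ip_forward = 1"   # first run modifies the file
ensure_line "$f" "net.ipv4.ip_forward = 1"   # second run finds nothing to do
rm -f "$f"
```

The operation checks the current state before acting, so repeating it is harmless. A naive echo with >> would add a duplicate line on every run.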

One of Kubespray’s strongest features is its ability to deploy clusters anywhere. The same inventory structure and playbooks work whether you are deploying to AWS, GCP, Azure, a VMware vSphere environment, or bare metal servers in your data center.

For public cloud environments, the typical pattern is to use Terraform for infrastructure provisioning and then Kubespray for Kubernetes deployment:

terraform apply
ansible-playbook cluster.yml

The Kubespray repository includes Terraform samples for various cloud providers if you want to see how this integration works.

For air-gapped environments where there is no internet access, Kubespray provides offline deployment support. This requires pre-downloading container images and binaries, then hosting them on an internal registry and file server.

The group_vars files include settings for configuring registry mirrors and custom download URLs:

# In group_vars/all/containerd.yml
containerd_registries_mirrors:
  - prefix: docker.io
    mirrors:
      - host: https://registry.internal.company.com
        capabilities: ["pull", "resolve"]

The only hard requirements are that your Ansible control node can reach the target machines via SSH and that Python is installed on those machines.

Since Ansible is agentless and uses SSH, you do not need to install any agents or daemons on the cluster nodes before running Kubespray.

Production clusters need high availability. Kubespray supports HA configurations for both the control plane and etcd.

For control plane HA, you simply list multiple nodes in your inventory:

[kube_control_plane]
controller-0
controller-1
controller-2

[etcd]
controller-0
controller-1
controller-2

[kube_node]
worker-0
worker-1
worker-2

Kubespray will configure all three control plane nodes with the API server, controller manager, and scheduler. The controller manager and scheduler use leader election, so only one instance is active at a time while the others stand by.

For etcd, Kubespray deploys a proper cluster using the Raft consensus protocol. You should always use an odd number of etcd nodes, typically three or five, to ensure quorum can be reached even if one node fails.
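
The odd-number advice falls straight out of the quorum arithmetic. Raft needs a strict majority, so quorum is floor(n/2) + 1, and the cluster tolerates n minus quorum failures. A quick sketch (the function names are mine):

```shell
# Quorum is a strict majority of members: floor(n/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# Failures tolerated while still keeping quorum: n - quorum
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

Run it and you will see that 4 members tolerate exactly as many failures as 3, and 2 members tolerate none. An even-sized cluster adds a machine that can fail without adding any fault tolerance.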

Now here is something important to understand about load balancing. Kubespray automates client-side load balancing but not external load balancing. What does this mean?

Client-side load balancing means each worker node runs a local nginx or haproxy that proxies requests to the available API servers. This is configured automatically by Kubespray. The kubelet on each worker connects to localhost, and the local proxy distributes requests across the control plane nodes.

External load balancing means having a load balancer in front of your control plane that external clients like kubectl on your laptop can connect to. Kubespray does not automatically configure this because external load balancing is completely different depending on your environment:

On AWS you would use an ELB or NLB
On GCP you would use a GCP Load Balancer
On bare metal you might use HAProxy with keepalived and a virtual IP
In some environments you might use MetalLB or kube-vip

Kubespray is a tool for configuring software on top of your operating system. It is not an infrastructure provisioning tool. Configuring virtual IPs, DNS entries, and cloud load balancers is outside its scope.

The expectation is that you handle external load balancing through Terraform or manual configuration, and Kubespray handles everything from the OS level up.

That said, Kubespray does support kube-vip as an option for control plane VIP, which can provide external load balancing without requiring separate infrastructure. You can enable it in your addons configuration:

# In group_vars/k8s_cluster/addons.yml
kube_vip_enabled: true

When you deploy a cluster with Kubespray, you get production-grade defaults without having to think about them. Remember all those manual steps from the kubeadm documentation about preparing your nodes? Kubespray handles them automatically.

Time synchronization is configured using chrony or ntp. In a distributed system like Kubernetes, clock skew between nodes causes all sorts of subtle problems with certificates, leases, and log correlation.

Kernel parameters are set correctly for container networking:

net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1

The necessary kernel modules are loaded:

overlay
br_netfilter

Swap is disabled, which is a hard requirement for kubelet to run properly.

These settings are defined in the kubernetes/preinstall role and applied automatically during cluster deployment. You do not need to SSH into each node and run sysctl commands manually.
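
Even though Kubespray applies these for you, it is worth knowing how to confirm them on a node. The sketch below reads the values straight from /proc/sys. The helper names are hypothetical, and the optional third argument (an alternate root directory) exists only so the function can be exercised outside a real node.

```shell
# Compare one sysctl key against an expected value.
# $3 is the sysctl root directory and defaults to /proc/sys.
check_sysctl() {
  key=$1
  want=$2
  root=${3:-/proc/sys}
  # net.ipv4.ip_forward lives at <root>/net/ipv4/ip_forward
  path="$root/$(echo "$key" | tr '.' '/')"
  [ -r "$path" ] || { echo "MISSING $key"; return 1; }
  got=$(cat "$path")
  if [ "$got" = "$want" ]; then
    echo "OK $key = $got"
  else
    echo "MISMATCH $key = $got (want $want)"
    return 1
  fi
}

# Check the three container networking parameters; run this on a cluster node.
check_node_sysctls() {
  rc=0
  for key in net.bridge.bridge-nf-call-iptables \
             net.bridge.bridge-nf-call-ip6tables \
             net.ipv4.ip_forward; do
    check_sysctl "$key" 1 || rc=1
  done
  return $rc
}
```

A MISSING result for the bridge keys usually means br_netfilter is not loaded, which is a useful clue in its own right.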

When you run cluster.yml, here is what gets installed and configured on your nodes.

On all nodes:

Kernel modules and sysctl parameters for container networking
Container runtime (containerd by default)
kubelet and kubectl binaries
CNI plugins

On control plane nodes:

kube-apiserver as a static pod
kube-controller-manager as a static pod
kube-scheduler as a static pod
etcd (either as a systemd service or static pod depending on configuration)

On worker nodes:

kube-proxy (as a DaemonSet)
Client-side load balancer for API server access (nginx or haproxy)

Cluster-wide:

CoreDNS for cluster DNS
CNI plugin DaemonSet (Calico, Flannel, etc.)
Optional addons like metrics-server, ingress controllers, helm

The configuration for all of this lives in your inventory directory. Once you have a working inventory, deploying the same cluster configuration repeatedly is just a matter of running the playbook against new infrastructure. This is infrastructure as code applied to Kubernetes deployment.

Chapter 3: Project Structure Overview

When you first clone the Kubespray repository and run ls, you're greeted with a wall of files and directories that can feel overwhelming. But here's the thing — Kubespray follows standard Ansible project conventions, so once you understand how Ansible organizes things, the whole structure clicks into place. Let me walk you through it piece by piece.

First, let’s get a bird’s eye view. Clone the repository and take a look:

git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
git checkout release-2.28
ls -la

You’ll see something like this at the root level:

kubespray/
├── ansible.cfg
├── cluster.yml
├── reset.yml
├── scale.yml
├── upgrade-cluster.yml
├── remove-node.yml
├── recover-control-plane.yml
├── inventory/
├── roles/
├── playbooks/
├── library/
├── docs/
└── ...

Those YAML files sitting at the root — they’re your entry points. Each one represents a major cluster operation, and you’ll be running these directly with ansible-playbook.

The most important one is cluster.yml. This is what you run to create a new cluster from scratch. When you execute ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml, Kubespray kicks off the entire deployment process — from configuring the operating system to installing containerd, setting up etcd, deploying the Kubernetes control plane, joining worker nodes, installing CNI, and deploying addons like CoreDNS. We'll dissect this playbook in detail in Chapter 7.

Then there’s reset.yml. This one tears everything down. It's the nuclear option — it removes all Kubernetes components, wipes etcd data, and leaves your nodes in a clean state. I've used this more times than I'd like to admit during testing when something went sideways and I just needed a fresh start.

scale.yml is for adding new nodes to an existing cluster. Say you deployed with three workers and now you need five. You update your inventory file with the new nodes, then run scale.yml. It's smart enough to skip all the work that's already done on existing nodes and only configure the new ones.

remove-node.yml does the opposite. When you need to decommission a node — maybe the hardware is failing or you're downsizing — this playbook handles the graceful removal. It drains the node, removes it from the cluster, and cleans up.

upgrade-cluster.yml handles version upgrades. Kubernetes releases new versions every few months, and running this playbook walks your cluster through the upgrade process node by node, respecting the proper order (etcd first, then control plane, then workers).

recover-control-plane.yml is your emergency playbook. If a control plane node dies and you need to restore it, this is what you reach for.

Now, here’s something that might confuse you at first. You’ll notice some files exist in both hyphen and underscore versions:

ls -la | grep -E "remove|scale"

remove-node.yml
remove_node.yml
scale.yml

Both remove-node.yml and remove_node.yml exist. This isn't a mistake — it's backward compatibility. At some point, Kubespray standardized on kebab-case (hyphens), but they kept the old snake_case versions around so existing scripts and documentation wouldn't break. If you're writing new automation, use the hyphen versions.

The inventory Directory

This is where you define what your cluster looks like. Kubespray ships with a sample inventory that you’re meant to copy and customize:

ls inventory/

local/
sample/

The local/ directory is for single-node testing on your local machine. The sample/ directory is the template you'll copy for real deployments. Let's look inside:

tree inventory/sample/

inventory/sample/
├── inventory.ini
└── group_vars/
    ├── all/
    │   ├── all.yml
    │   ├── aws.yml
    │   ├── azure.yml
    │   ├── containerd.yml
    │   ├── coreos.yml
    │   ├── cri-o.yml
    │   ├── docker.yml
    │   ├── etcd.yml
    │   ├── gcp.yml
    │   ├── hcloud.yml
    │   ├── huaweicloud.yml
    │   ├── oci.yml
    │   ├── offline.yml
    │   ├── openstack.yml
    │   ├── upcloud.yml
    │   └── vsphere.yml
    └── k8s_cluster/
        ├── addons.yml
        ├── k8s-cluster.yml
        ├── k8s-net-calico.yml
        ├── k8s-net-cilium.yml
        ├── k8s-net-custom-cni.yml
        ├── k8s-net-flannel.yml
        ├── k8s-net-kube-ovn.yml
        ├── k8s-net-kube-router.yml
        ├── k8s-net-macvlan.yml
        └── kube_control_plane.yml

The inventory.ini file is where you list your actual hosts and assign them to groups. A typical production inventory might look like:

[all]
node1 ansible_host=10.0.1.10 ip=10.0.1.10
node2 ansible_host=10.0.1.11 ip=10.0.1.11
node3 ansible_host=10.0.1.12 ip=10.0.1.12
node4 ansible_host=10.0.1.20 ip=10.0.1.20
node5 ansible_host=10.0.1.21 ip=10.0.1.21

[kube_control_plane]
node1
node2
node3

[etcd:children]
kube_control_plane

[kube_node]
node4
node5

[k8s_cluster:children]
kube_control_plane
kube_node

Notice the [etcd:children] syntax. This is Ansible's way of creating nested groups. Instead of listing hosts directly under [etcd], we're saying "the etcd group contains all hosts from the kube_control_plane group." This pattern is convenient when your etcd nodes and control plane nodes are the same machines, which is the common "stacked etcd" topology.
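To make the nested-group idea concrete, here is a minimal Python sketch that expands the groups from the inventory above. It is not Ansible's parser: the parse_inventory and resolve helpers are made up for illustration and only understand plain host sections and :children sections.

```python
# Minimal sketch of how "[group:children]" sections expand to hosts.
# This is NOT Ansible's parser; it only handles the subset of INI
# inventory syntax used in the example above.

INVENTORY = """\
[kube_control_plane]
node1
node2
node3

[etcd:children]
kube_control_plane

[kube_node]
node4
node5

[k8s_cluster:children]
kube_control_plane
kube_node
"""


def parse_inventory(text):
    """Return (hosts, children): direct hosts and child groups per group."""
    hosts, children = {}, {}
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            name, _, kind = line[1:-1].partition(":")
            hosts.setdefault(name, [])
            children.setdefault(name, [])
            target = children[name] if kind == "children" else hosts[name]
        elif target is not None:
            # Keep only the first token; host variables such as
            # ansible_host=... are ignored in this sketch.
            target.append(line.split()[0])
    return hosts, children


def resolve(group, hosts, children):
    """Collect the hosts of a group, following child groups recursively."""
    found = list(hosts.get(group, []))
    for child in children.get(group, []):
        for host in resolve(child, hosts, children):
            if host not in found:
                found.append(host)
    return found


hosts, children = parse_inventory(INVENTORY)
print(resolve("etcd", hosts, children))         # members of kube_control_plane
print(resolve("k8s_cluster", hosts, children))  # control plane plus workers
```

If you later move etcd onto dedicated machines, the only change is replacing the :children section with a plain [etcd] host list; everything that targets the etcd group keeps working.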

The group_vars/ directory is where the magic happens. Ansible automatically loads variables from files in this directory based on which groups a host belongs to. Files under group_vars/all/ apply to every host. Files under group_vars/k8s_cluster/ apply only to hosts in the k8s_cluster group.

This is why you see so many YAML files in there. containerd.yml has container runtime settings. etcd.yml has etcd-specific configuration. k8s-cluster.yml has core Kubernetes settings like the network plugin, service CIDR, and pod CIDR. The cloud provider files (aws.yml, gcp.yml, azure.yml) contain settings that only matter if you're deploying to those environments — they're mostly commented out by default.

The roles Directory

Here’s where the actual work gets defined. Kubespray is essentially a collection of Ansible roles, and each role handles a specific piece of the cluster setup:

ls roles/

adduser/
bootstrap_os/
bootstrap-os/
container-engine/
download/
etcd/
etcdctl_etcdutl/
kubernetes/
kubernetes-apps/
kubespray_defaults/
kubespray-defaults/
network_plugin/
recover_control_plane/
remove-node/
reset/
upgrade/
win_nodes/
...

Let me explain the key ones.

bootstrap-os handles the initial operating system configuration. It installs Python (which Ansible needs), updates packages, sets up required kernel modules like overlay and br_netfilter, configures sysctl parameters for networking, and disables swap. All that tedious pre-work you'd do manually before running kubeadm — this role automates it.

container-engine installs the container runtime. By default that's containerd, but Kubespray supports Docker (via cri-dockerd) and CRI-O as well. This role downloads the binaries, creates the configuration files, sets up systemd units, and ensures the runtime is running.

etcd deploys the etcd cluster. Depending on your settings, it either installs etcd as a systemd service directly on the host or lets kubeadm manage it as a static pod. It handles certificate generation, cluster membership, and health checking.

kubernetes is actually a directory containing multiple sub-roles:

ls roles/kubernetes/

control-plane/
kubeadm/
node/
node-label/
node-taint/
preinstall/

The preinstall sub-role does Kubernetes-specific preparation — things like creating the kubernetes user, setting up directories, and configuring kubelet. The node sub-role installs kubelet, kubectl, and kubeadm on all nodes. The control-plane sub-role runs the kubeadm init process on control plane nodes. The kubeadm sub-role handles both the initial cluster bootstrap and joining additional nodes.

network_plugin installs your chosen CNI. Look inside and you'll see sub-directories for each supported CNI:

ls roles/network_plugin/

calico/
cilium/
flannel/
kube-ovn/
kube-router/
macvlan/
multus/
weave/
...

When you set kube_network_plugin: calico in your group_vars, Kubespray runs the tasks in roles/network_plugin/calico/. Switch it to flannel, and it runs roles/network_plugin/flannel/ instead.

kubernetes-apps deploys the addons — CoreDNS for cluster DNS, metrics-server for resource metrics, helm if you enable it, ingress controllers, cert-manager, and so on.

download is a critical role that handles fetching all the binaries and container images. It's designed to be idempotent and can work in both online and offline (air-gap) scenarios.

Now, about those confusingly similar names — kubespray_defaults versus kubespray-defaults. These are not duplicates. They serve different purposes, and understanding this distinction matters.

kubespray_defaults (with underscore) contains the actual variable definitions. Look at its structure:

tree roles/kubespray_defaults/

roles/kubespray_defaults/
├── defaults/
│   └── main/
│       ├── download.yml
│       └── main.yml
└── vars/
    └── main/
        ├── checksums.yml
        └── main.yml

The defaults/main/main.yml file alone is over 800 lines. It defines default values for nearly every configurable aspect of Kubespray — Kubernetes version, network settings, paths, timeouts, feature flags. The defaults/main/download.yml adds another 1100+ lines covering download URLs and version mappings. The vars/main/checksums.yml contains SHA256 checksums for all downloadable binaries — these are in the vars/ directory (higher priority) because you really shouldn't be changing checksums unless you know exactly what you're doing.

kubespray-defaults (with hyphen), on the other hand, contains tasks:

tree roles/kubespray-defaults/

roles/kubespray-defaults/
└── tasks/
    └── main.yml

This role’s job is to load and process those variables. It’s a pattern you’ll see throughout Kubespray — separate the data (variables) from the logic (tasks).

You’ll find similar underscore/hyphen pairs elsewhere. bootstrap_os holds variables while bootstrap-os holds tasks. It's a bit confusing at first, but once you recognize the pattern, it makes navigating the codebase easier.

The playbooks Directory

Remember those root-level playbooks like cluster.yml? They’re mostly thin wrappers that import the real playbooks from this directory:

ls playbooks/

ansible_version.yml
cluster.yml
facts.yml
install_etcd.yml
reset.yml
scale.yml
upgrade_cluster.yml
...

When you run the root cluster.yml, it imports playbooks/cluster.yml which contains the actual play definitions. This separation keeps the root directory clean while allowing the playbook logic to be more complex.

Let’s peek at what playbooks/cluster.yml actually does:

head -30 playbooks/cluster.yml

---
- name: Check Ansible version
  import_playbook: ansible_version.yml

- name: Add kube-master nodes to kube_control_plane
  # Backward compatibility
  import_playbook: legacy_groups.yml

- name: Common tasks for every playbooks
  import_playbook: boilerplate.yml

- name: Gather facts
  import_playbook: facts.yml

- name: Prepare for etcd install
  hosts: k8s_cluster:etcd
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: container-engine, tags: container-engine, when: deploy_container_engine }
    - { role: download, tags: download, when: "not skip_downloads" }

You can see it’s a sequence of import_playbook statements and play definitions. The structure is methodical — check Ansible version, handle legacy group names, run common boilerplate, gather facts, then proceed through each phase of cluster setup.

Here’s the complete picture of how these pieces connect when you run a deployment:

You run:
  ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml

Ansible reads:
  1. ansible.cfg (from current directory)
  2. inventory/mycluster/inventory.ini (hosts and groups)
  3. inventory/mycluster/group_vars/all/*.yml (global variables)
  4. inventory/mycluster/group_vars/k8s_cluster/*.yml (cluster variables)

cluster.yml imports and runs:
  → playbooks/ansible_version.yml (version check)
  → playbooks/boilerplate.yml (common setup)
  → playbooks/facts.yml (gather system info)
  → plays that execute roles:
      → roles/kubespray_defaults (load variables)
      → roles/bootstrap-os (OS setup)
      → roles/container-engine (containerd)
      → roles/etcd (etcd cluster)
      → roles/kubernetes/node (kubelet)
      → roles/kubernetes/control-plane (kubeadm init)
      → roles/network_plugin/calico (or flannel, etc.)
      → roles/kubernetes-apps (addons)

Each role follows the standard Ansible role structure:

roles/some-role/
├── defaults/
│   └── main.yml      # Default variable values (lowest priority)
├── vars/
│   └── main.yml      # Role variables (higher priority)
├── tasks/
│   └── main.yml      # The actual work
├── templates/
│   └── config.j2     # Jinja2 templates
├── handlers/
│   └── main.yml      # Handlers (restart services, etc.)
└── files/
    └── something     # Static files to copy

The roles execute in dependency order. You can’t install Kubernetes before containerd is running. You can’t join worker nodes before the control plane is up. Kubespray handles this orchestration through careful ordering in the playbooks.

For a quick reference, here’s how to list what’s in each major directory:

# Count files in each major area
echo "Roles:" && ls roles/ | wc -l
echo "Role defaults files:" && find roles -path "*/defaults/*.yml" | wc -l  
echo "Role vars files:" && find roles -path "*/vars/*.yml" | wc -l
echo "Sample group_vars/all:" && ls inventory/sample/group_vars/all/ | wc -l
echo "Sample group_vars/k8s_cluster:" && ls inventory/sample/group_vars/k8s_cluster/ | wc -l

On the 2.28 release, you’ll see something like:

Roles: 40+
Role defaults files: 77
Role vars files: 50
Sample group_vars/all: 16
Sample group_vars/k8s_cluster: 10

That’s a lot of moving parts, but they’re organized logically. When you need to understand or modify something specific — say, how containerd gets configured — you know to look in roles/container-engine/containerd/. When you want to change a default timeout or version, you check roles/kubespray_defaults/defaults/. When you want to customize your deployment, you edit files in inventory/mycluster/group_vars/.

The key insight is that Kubespray isn’t magic. It’s just a well-organized collection of Ansible automation following established patterns. Once you internalize the structure, you can navigate it confidently, troubleshoot issues effectively, and customize it for your environment without fear of breaking things you don’t understand.

Chapter 4: ansible.cfg Configuration Analysis

When you run ansible-playbook from the Kubespray directory, Ansible doesn’t just use its global defaults. It picks up a project-specific configuration file that Kubespray ships with. Understanding this file is important because it controls how Ansible behaves during cluster deployment — things like SSH connection handling, fact caching, output formatting, and performance optimizations.

Ansible looks for configuration files in a specific order. First, it checks if you’ve set the ANSIBLE_CONFIG environment variable. If that’s not set, it looks for ansible.cfg in your current working directory. Failing that, it checks your home directory for .ansible.cfg, and finally falls back to the system-wide /etc/ansible/ansible.cfg.

This is exactly why every Kubespray command in the documentation tells you to run ansible-playbook from the kubespray directory. If you run it from somewhere else, Ansible won’t find Kubespray’s ansible.cfg and will use different settings, which can lead to unexpected behavior or failures.
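The lookup order is easy to model. The following Python sketch is a simplified illustration, not Ansible's actual loader (which applies extra safety checks, such as skipping an ansible.cfg that sits in a world-writable directory). The function name find_ansible_cfg and its parameters are invented for this example.

```python
import os
from pathlib import Path


def find_ansible_cfg(cwd, env=None, home=None):
    """Return the first config file found, following the order described above.

    Simplified model: real Ansible performs additional checks before
    accepting a file.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    if env.get("ANSIBLE_CONFIG"):
        candidates.append(Path(env["ANSIBLE_CONFIG"]))   # 1. environment variable
    candidates.append(Path(cwd) / "ansible.cfg")         # 2. current directory
    candidates.append(home / ".ansible.cfg")             # 3. home directory
    candidates.append(Path("/etc/ansible/ansible.cfg"))  # 4. system-wide

    for path in candidates:
        if path.is_file():
            return path
    return None


# Which file would be picked up from where you are standing right now?
print(find_ansible_cfg(Path.cwd()))
```

Running this from the Kubespray checkout and again from your home directory shows the difference immediately. On a real system, ansible --version also prints the config file it resolved.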

Let’s look at what Kubespray’s ansible.cfg actually contains:

[ssh_connection]
pipelining=True
ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null
#control_path = ~/.ssh/ansible-%%r@%%h:%%p

[defaults]
force_valid_group_names = ignore
host_key_checking=False
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400
timeout = 300
stdout_callback = default
display_skipped_hosts = no
library = ./library
callbacks_enabled = profile_tasks
roles_path = roles:$VIRTUAL_ENV/usr/local/share/kubespray/roles:$VIRTUAL_ENV/usr/local/share/ansible/roles:/usr/share/kubespray/roles
deprecation_warnings=False
inventory_ignore_extensions = ~, .orig, .bak, .ini, .cfg, .retry, .pyc, .pyo, .creds, .gpg

[inventory]
ignore_patterns = artifacts, credentials

That’s a lot to unpack. Let’s go through each section.

The SSH Connection Section

The [ssh_connection] section controls how Ansible connects to your target nodes. Kubespray deploys to multiple machines and runs hundreds of tasks, so SSH performance matters a lot here.

The first setting is pipelining=True. Without pipelining, running a module on a remote host takes several SSH operations: Ansible creates a temporary directory, copies the module file over, executes it, and then cleans up. With pipelining enabled, Ansible instead pipes the module code straight into the remote Python interpreter over the SSH session, skipping the temporary file and most of those round trips. For a playbook with 500+ tasks running across multiple nodes, this makes a real difference. One caveat: when tasks escalate with sudo, pipelining requires that requiretty is not set in the target's sudoers file.

The ssh_args line packs several SSH client options together:

ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null

ControlMaster=auto enables SSH connection multiplexing. When you first connect to a host, SSH creates a master connection. Subsequent connections to the same host reuse this master connection instead of establishing new ones. This eliminates the authentication overhead for each connection.

ControlPersist=30m keeps that master connection alive for 30 minutes after the last session ends. During a Kubespray run, Ansible connects to the same nodes over and over. Keeping the master connection alive means those reconnections are nearly instant.

ConnectionAttempts=100 tells SSH to retry up to 100 times if a connection fails. This sounds excessive, but network hiccups happen, especially in cloud environments. You don’t want your 20-minute deployment to fail because of a momentary network blip.

The last one, UserKnownHostsFile=/dev/null, is worth understanding properly. Normally, SSH maintains a file called known_hosts in your ~/.ssh directory. The first time you connect to a server, SSH shows you a fingerprint and asks if you want to trust this host. If you say yes, it saves the fingerprint to known_hosts. On subsequent connections, SSH checks that the fingerprint matches. If it doesn’t match, SSH refuses to connect because this could indicate a man-in-the-middle attack.

Setting UserKnownHostsFile=/dev/null means SSH writes the fingerprint to /dev/null — the black hole device that discards everything written to it. The fingerprint is never saved, so every connection is treated as a first-time connection. Combined with host_key_checking=False (which we’ll see in the defaults section), this means Ansible never prompts you to verify host keys and never complains about changed fingerprints.

Why would Kubespray want this? In cloud and VM environments, you often tear down and recreate machines. The same IP address might point to completely different machines over time. If SSH kept checking host keys, you’d constantly hit “Host key verification failed” errors and have to manually clean up your known_hosts file. For automated deployment, this is a practical necessity.

That said, this does reduce security. In a production environment where hosts are long-lived and you want to detect if someone is intercepting your SSH traffic, you might want to reconsider this setting. But for most Kubespray use cases — spinning up clusters, tearing them down, trying different configurations — the convenience outweighs the security tradeoff.
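If you want to audit these options in your own copy of ansible.cfg, a few lines of Python are enough to pull them apart. This is only a sketch: parse_ssh_options is a hypothetical helper, and it understands nothing beyond "-o Key=Value" pairs.

```python
import shlex

SSH_ARGS = (
    "-o ControlMaster=auto -o ControlPersist=30m "
    "-o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null"
)


def parse_ssh_options(ssh_args):
    """Turn a string of '-o Key=Value' pairs into a dict."""
    tokens = shlex.split(ssh_args)
    options = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "-o" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            options[key] = value
            i += 2
        else:
            i += 1  # ignore anything that is not a "-o Key=Value" pair
    return options


options = parse_ssh_options(SSH_ARGS)
for key, value in options.items():
    print(f"{key:20} {value}")

# Flag the one option that trades host verification for convenience.
if options.get("UserKnownHostsFile") == "/dev/null":
    print("note: host keys are never persisted")
```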

The control_path line is commented out:

#control_path = ~/.ssh/ansible-%%r@%%h:%%p

This would specify where SSH stores its control socket files for connection multiplexing. The %r, %h, and %p tokens expand to the remote user, hostname, and port; the percent signs are doubled because a literal % has to be escaped as %% inside the INI file. Kubespray leaves this commented because Ansible manages control paths automatically, and the default behavior works fine for most setups.

The Defaults Section

The [defaults] section contains most of the interesting configuration. Let’s break it down.

force_valid_group_names = ignore

This one has a story behind it. In Ansible 2.8, the developers added stricter validation for inventory group names. They decided that group names should follow Python variable naming rules — no hyphens, no dots. The reasoning was that group names sometimes get used in Jinja2 templates or as variable names, and special characters could cause problems.

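The thing is, Kubernetes uses hyphens everywhere. Namespaces like kube-system, pod names like coredns-12345, node names with hyphens. Kubespray followed the same habit: its original inventory groups were named kube-master, kube-node, and k8s-cluster, which is why the playbook shown earlier still has a backward-compatibility step that maps kube-master onto kube_control_plane. Plenty of existing inventories therefore contain hyphenated group names.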

When Ansible 2.8 came out, people started getting warnings about invalid group names. Some people’s playbooks broke entirely. There was a GitHub issue (#56930) where users complained, pointing out that group names aren’t variable names — they’re just strings that happen to be used as dictionary keys. The Kubernetes ecosystem had already standardized on hyphen-separated names, and forcing everyone to use underscores instead wasn’t practical.

Setting force_valid_group_names = ignore tells Ansible to accept any group name without complaining. Kubespray needs this to work seamlessly with Kubernetes naming conventions.

host_key_checking=False

This complements the UserKnownHostsFile=/dev/null setting from the SSH section. When SSH connects to a new host, it normally asks “Are you sure you want to continue connecting?” and waits for you to type yes. With host_key_checking disabled, Ansible automatically accepts new host keys without prompting. Combined with the /dev/null known_hosts file, this makes SSH connections completely non-interactive.

gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400

These four lines work together to optimize fact gathering. Facts are system information that Ansible collects from each host — things like the operating system, IP addresses, memory size, CPU count. Collecting facts requires running commands on each host, which takes time.

With gathering = smart, Ansible only collects facts when it doesn’t already have them cached. The fact_caching = jsonfile setting tells Ansible to cache facts as JSON files. fact_caching_connection = /tmp specifies where to store those JSON files. And fact_caching_timeout = 86400 means cached facts are valid for 86400 seconds — that’s 24 hours.

During a Kubespray run, Ansible connects to the same hosts multiple times across different plays. Without caching, it would collect facts every single time, adding minutes to the total runtime. With smart gathering and caching, facts are collected once and reused throughout the run.
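The freshness rule can be sketched in a few lines of Python. The assumptions here are mine: that the jsonfile cache keeps one file per host inside fact_caching_connection and that a file's age is judged by its modification time. The path /tmp/node1 is only an illustration.

```python
import os
import time

FACT_CACHING_TIMEOUT = 86400  # seconds, the value from ansible.cfg


def cache_is_fresh(path, timeout=FACT_CACHING_TIMEOUT, now=None):
    """Return True if the cache file exists and is younger than timeout."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # no cache file: facts have to be gathered
    return age < timeout


# Hypothetical cache file for a host called node1.
print(cache_is_fresh("/tmp/node1"))
```

This also explains a common surprise: if you change a node's hardware or IP address and rerun within 24 hours, Ansible may still see the old facts. Deleting the cache files, or running ansible-playbook with --flush-cache, forces a fresh collection.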

timeout = 300

This sets the SSH connection timeout to 300 seconds (5 minutes). If a host doesn’t respond within 5 minutes, Ansible gives up on that connection. This is pretty generous, but it accommodates slow networks or hosts that take a while to become available.

stdout_callback = default
display_skipped_hosts = no

These control how Ansible displays output. The default callback gives you the standard Ansible output format. You could change this to yaml or json if you preferred different formatting.

display_skipped_hosts = no is a nice quality-of-life setting. Kubespray has many tasks that only run on certain host types. A task that configures etcd will be skipped on worker nodes. A task that installs kubelet runs everywhere. With display_skipped_hosts enabled, you’d see endless “skipping: [worker-1]” messages cluttering your output. Disabling it keeps the output focused on what’s actually happening.

library = ./library

Ansible lets you write custom modules when the built-in ones don’t do what you need. Kubespray includes a custom module called kube.py in its library directory. This module wraps kubectl commands, letting Ansible tasks manage Kubernetes resources directly. The library = ./library setting tells Ansible to look in that directory for custom modules.

callbacks_enabled = profile_tasks

This enables the profile_tasks callback plugin. When your playbook finishes, you’ll see a summary showing how long each task took:

PLAY RECAP ****
Wednesday 28 January 2026  15:23:45 +0900 (0:00:02.456)

===============================================================================
Download images ------------------------------------ 89.45s
Download containerd -------------------------------- 45.23s
Configure kubelet ---------------------------------- 12.67s

This is incredibly useful for understanding where time goes during deployment. If your cluster deployment is taking too long, this output tells you which tasks are the bottleneck.

roles_path = roles:$VIRTUAL_ENV/usr/local/share/kubespray/roles:$VIRTUAL_ENV/usr/local/share/ansible/roles:/usr/share/kubespray/roles

This tells Ansible where to look for roles, in order of priority. The paths are separated by colons. Ansible checks the first path first, then moves on if it doesn’t find the role.

The first path, roles, is the local roles directory in the Kubespray project. This is where all of Kubespray’s roles live, and it has the highest priority.

The paths containing $VIRTUAL_ENV are for when you install Kubespray as a Python package into a virtual environment. The $VIRTUAL_ENV variable gets set automatically when you activate a Python virtual environment. If you’re just running Kubespray from a git clone (which is the most common approach), these paths won’t match anything and get ignored.

The last path, /usr/share/kubespray/roles, is for system-wide installations.
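Here is a rough Python model of that search. It is a sketch under assumptions: find_role is a made-up helper, relative entries are resolved against a base directory you pass in (standing in for the Kubespray checkout), and an unset $VIRTUAL_ENV is simply left unexpanded so that entry never matches.

```python
import os
from pathlib import Path

ROLES_PATH = (
    "roles:$VIRTUAL_ENV/usr/local/share/kubespray/roles:"
    "$VIRTUAL_ENV/usr/local/share/ansible/roles:/usr/share/kubespray/roles"
)


def find_role(name, roles_path=ROLES_PATH, base="."):
    """Return the first directory on roles_path that contains the role."""
    for entry in roles_path.split(":"):
        expanded = Path(os.path.expandvars(entry))
        if not expanded.is_absolute():
            expanded = Path(base) / expanded  # relative to the project root
        candidate = expanded / name
        if candidate.is_dir():
            return candidate  # first match wins; later entries are not checked
    return None


# From a Kubespray checkout this should resolve to roles/etcd.
print(find_role("etcd"))
```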

deprecation_warnings=False

Ansible loves to warn you about deprecated features. While these warnings are useful when you’re developing playbooks, they add noise when you’re just running Kubespray. Kubespray’s maintainers keep the playbooks updated, so you can trust that deprecated features will be fixed in future releases.

inventory_ignore_extensions = ~, .orig, .bak, .ini, .cfg, .retry, .pyc, .pyo, .creds, .gpg

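When Ansible scans an inventory directory, it ignores files with these extensions. This prevents backup files, compiled Python files, and credential files from being accidentally parsed as inventory. Note that .ini is on the list. The filter only applies when -i points at a directory, which is why the commands in this guide pass inventory/mycluster/inventory.ini explicitly as a file.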

The Inventory Section

[inventory]
ignore_patterns = artifacts, credentials

Similar to inventory_ignore_extensions, this tells Ansible to skip directories matching these patterns when scanning for inventory files. The artifacts directory typically contains deployment outputs, and credentials contains sensitive files. Neither should be treated as inventory sources.

Customizing ansible.cfg

Kubespray’s default ansible.cfg works well for most situations, but you might want to adjust it for your environment. The key is understanding that you can override settings without modifying the original file.

One approach is to use the ANSIBLE_CONFIG environment variable:

ANSIBLE_CONFIG=ansible.cfg.custom ansible-playbook cluster.yml

You could maintain different configuration files for different environments:

kubespray/
├── ansible.cfg           # default
├── ansible.cfg.dev       # development (more verbose, shorter timeouts)
└── ansible.cfg.prod      # production (stricter security settings)

Some settings you might want to change:

For a more secure environment where you do want host key checking:

[defaults]
host_key_checking = True

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=30m
# Removed UserKnownHostsFile=/dev/null

For CI/CD pipelines where you need machine-parseable output:

[defaults]
stdout_callback = json

For debugging, to see which hosts are being skipped:

[defaults]
display_skipped_hosts = yes

For large clusters where you want more parallelism:

[defaults]
# default is 5
forks = 20

The forks setting controls how many hosts Ansible manages simultaneously. The default of 5 is conservative. If you have the network bandwidth and your control machine has enough resources, increasing this can speed up deployments to large clusters significantly.
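A detail that is easy to miss when hand-editing ansible.cfg: a # only starts a comment at the beginning of a line. For inline comments, Ansible's INI format accepts only a semicolon. The sketch below uses Python's configparser with ; as the sole inline comment prefix. That configuration is my assumption, meant to mirror Ansible's INI handling rather than copy its loader, and the sample values are for illustration only.

```python
import configparser

# Assumption: ";" is the only inline comment prefix, as in Ansible's INI format.
SAMPLE = """\
[defaults]
forks = 20  # default is 5
timeout = 300  ; seconds
# full-line comments with "#" are fine
gathering = smart
"""

parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
parser.read_string(SAMPLE)

forks_raw = parser.get("defaults", "forks")
timeout_raw = parser.get("defaults", "timeout")

print(repr(forks_raw))    # the "#" text is still part of the value
print(repr(timeout_raw))  # the ";" comment has been stripped

try:
    int(forks_raw)
except ValueError:
    print("forks would not parse as an integer")
```

The safe habit is to put every comment on its own line, directly above the setting it explains.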

If you modify ansible.cfg, keep a few things in mind. First, always keep a copy of the original so you can revert if something breaks. Second, document your changes with comments explaining why you made them. Third, if you’re version-controlling your Kubespray setup, include your modified ansible.cfg so team members get the same behavior.

The ansible.cfg file might seem like a minor detail compared to the actual playbooks and roles, but these settings affect every single task Ansible runs. Understanding them helps you troubleshoot problems, optimize performance, and adapt Kubespray to your specific environment.

Chapter 5: Variable System and Precedence Strategy

If you have ever spent hours debugging why a variable in Kubespray does not behave the way you expected, you are not alone. Kubespray has hundreds of variables spread across dozens of files, and understanding where to look and what takes priority over what is essential before you start customizing anything. This chapter will walk you through Ansible’s variable precedence system and show you exactly how Kubespray leverages it to create a maintainable, layered configuration architecture.

The fundamental principle is simple. Ansible has a 22-level priority system, and Kubespray deliberately places its variables at specific levels so that you, the operator, can override them without touching the core codebase. Once you understand this design, everything clicks into place.

Ansible evaluates variables from multiple sources and applies them in a strict order. When the same variable appears in multiple places, the one with higher priority wins. Here is the complete list, numbered from lowest to highest priority:

1.  command line values (for example, -u my_user)
2.  role defaults (roles/*/defaults/main.yml)
3.  inventory file or script group vars
4.  inventory group_vars/all
5.  playbook group_vars/all
6.  inventory group_vars/*
7.  playbook group_vars/*
8.  inventory file or script host vars
9.  inventory host_vars/*
10. playbook host_vars/*
11. host facts / cached set_facts
12. play vars
13. play vars_prompt
14. play vars_files
15. role vars (roles/*/vars/main.yml)
16. block vars
17. task vars
18. include_vars
19. set_facts / registered vars
20. role parameters
21. include parameters
22. extra vars (-e) (always win)

The magic number to remember is 22. Extra vars passed via the command line with -e always win. The second thing to remember is that role defaults sit at priority 2, which is effectively the bottom of the hierarchy: level 1 covers command-line connection options such as -u, not variables you define. This is not an accident. Ansible designed it this way so that role authors could provide sensible defaults while making it trivial for users to override them from almost anywhere.

If you visualize this as a stack, it looks like this:

┌─────────────────────────┐
│   extra_vars (-e)       │  22 ← highest priority
├─────────────────────────┤
│   task vars             │  17
├─────────────────────────┤
│   role/vars/            │  15
├─────────────────────────┤
│   play vars             │  12
├─────────────────────────┤
│   host_vars/            │  9
├─────────────────────────┤
│   group_vars/*          │  6
├─────────────────────────┤
│   group_vars/all        │  4
├─────────────────────────┤
│   role/defaults/        │  2  ← lowest priority
└─────────────────────────┘

When you set kube_version: v1.30.0 in a role's defaults and then set kube_version: v1.31.0 in your inventory's group_vars, the group_vars value wins because priority 4 or 6 beats priority 2. If you then run the playbook with -e "kube_version=v1.32.0", that value wins over everything because priority 22 is king.
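The same scenario can be written as a toy model in Python. This is not how Ansible is implemented. It is a sketch in which each layer is a (priority, source, variables) tuple and the highest-priority layer that defines a name wins. The layer contents reuse the values from the paragraph above, and the resolve helper is hypothetical.

```python
# Toy model of precedence: each layer is (priority, source, variables).
# Only the layers discussed above are included.
LAYERS = [
    (2, "role defaults", {"kube_version": "v1.30.0", "kube_proxy_mode": "ipvs"}),
    (4, "inventory group_vars/all", {}),
    (6, "inventory group_vars/k8s_cluster", {"kube_version": "v1.31.0"}),
]


def resolve(name, layers):
    """Return (value, source) from the highest-priority layer defining name."""
    winner = None
    for priority, source, variables in sorted(layers, key=lambda layer: layer[0]):
        if name in variables:
            winner = (variables[name], source)  # higher layers overwrite lower ones
    if winner is None:
        raise KeyError(name)
    return winner


print(resolve("kube_version", LAYERS))
print(resolve("kube_proxy_mode", LAYERS))

# Simulate: ansible-playbook ... -e "kube_version=v1.32.0"
with_extra = LAYERS + [(22, "extra vars (-e)", {"kube_version": "v1.32.0"})]
print(resolve("kube_version", with_extra))
```

Returning the source alongside the value is the useful part for debugging: when a variable has an unexpected value, the first question is always which layer it came from.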

Mapping Precedence to Kubespray Files

Now let us connect these abstract priority levels to actual files in the Kubespray repository. This mapping is crucial for debugging and for knowing where to put your customizations.

At priority level 2, you have role defaults. In Kubespray, these live in paths like:

roles/kubespray_defaults/defaults/main/main.yml
roles/kubespray_defaults/defaults/main/download.yml
roles/etcd/defaults/main.yml
roles/kubernetes/node/defaults/main.yml

The kubespray_defaults role alone contains over 800 lines of variables in main.yml and over 1100 lines in download.yml. These are the sensible defaults that Kubespray provides out of the box.

At priority level 4, you have inventory group_vars/all. In a typical Kubespray setup, this maps to:

inventory/mycluster/group_vars/all/all.yml
inventory/mycluster/group_vars/all/etcd.yml
inventory/mycluster/group_vars/all/containerd.yml

At priority level 6, you have inventory group_vars for specific groups:

inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
inventory/mycluster/group_vars/k8s_cluster/addons.yml
inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml

At priority level 15, you have role vars. These are internal fixed values that Kubespray does not want you to accidentally override:

roles/kubespray_defaults/vars/main/checksums.yml
roles/kubespray_defaults/vars/main/main.yml

Let me show you how many files exist at each level. Run these commands from the Kubespray root directory:

find roles -type f -name "*.yml" | grep "/defaults/" | wc -l

This returns around 77 files. These are all the default variable files across every role.

find inventory/sample -path "*/group_vars/all/*.yml" | wc -l

This returns around 16 files in the sample inventory’s all directory.

find inventory/sample -path "*/group_vars/k8s_cluster/*.yml" | wc -l

This returns around 10 files for cluster-specific settings.

find roles -type f -name "*.yml" | grep "/vars/" | wc -l

This returns around 50 files containing internal fixed values.

Kubespray’s Intentional Design

Here is where it gets interesting. Kubespray deliberately places variables at specific priority levels to create a clean separation of concerns.

Role defaults at priority 2 contain the base configuration. These are values that work for most deployments. The Kubespray maintainers chose priority 2 because it is the lowest level, meaning you can override these values from almost anywhere.

Inventory group_vars at priority 4 and 6 are where you, the operator, put your customizations. When you copy the sample inventory to create your own cluster configuration, you are creating your override layer. Anything you define here beats the role defaults.

Role vars at priority 15 contain values that should not be overridden casually. The checksums.yml file is a perfect example. It contains SHA256 hashes for every binary that Kubespray downloads. If you accidentally override a checksum, the integrity verification fails and your deployment breaks. By placing checksums at priority 15, Kubespray ensures that your group_vars settings at priority 4–6 cannot accidentally clobber them.

The flow looks like this:

[role defaults] ──override──→ [inventory group_vars] ──override──→ [extra_vars]
  base values                      your settings                 temporary overrides

Let me show you a concrete example. The variable override_system_hostname controls whether Kubespray sets the hostname on your nodes. In the role defaults, you will find:

# roles/bootstrap_os/defaults/main.yml
override_system_hostname: true

This is the default behavior. If you want to disable it for your cluster, you add this to your inventory:

# inventory/mycluster/group_vars/all/all.yml
override_system_hostname: false

Your value at priority 4 overrides the default at priority 2. You did not have to modify any Kubespray source code. When you upgrade Kubespray to a newer version, your customization remains intact in your inventory directory while the role defaults get updated.

The kubespray_defaults Role

The kubespray_defaults role deserves special attention because it is the single largest collection of variables in the entire project.

Look at its structure:

roles/kubespray_defaults/
├── defaults/main/
│   ├── main.yml        # 801 lines - core defaults
│   └── download.yml    # 1139 lines - download URLs, versions
└── vars/main/
    ├── main.yml        # internal paths
    └── checksums.yml   # binary integrity checksums

The defaults/main/main.yml file contains the core configuration for your cluster. Open it and you will see variables like:

kube_version: v1.31.0
container_manager: containerd
kube_network_plugin: calico
kube_proxy_mode: ipvs
etcd_deployment_type: host

The defaults/main/download.yml file contains download URLs and version mappings for every component. It tells Kubespray where to fetch containerd, runc, crictl, etcd, and dozens of other binaries.

The vars/main/checksums.yml file is massive. It contains checksums for every supported version of every component, organized by architecture.

A small excerpt looks like:

crictl_checksums:
  amd64:
    v1.29.0: sha256:abc123...
    v1.30.0: sha256:def456...
  arm64:
    v1.29.0: sha256:789abc...
    v1.30.0: sha256:012def...

The reason checksums live in vars/ instead of defaults/ is protection. At priority 15, they are harder to accidentally override. If you need to add a new version that Kubespray does not support yet, you would need to use extra_vars at priority 22 or modify the file directly. This friction is intentional.
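To make the integrity check concrete, here is a small sketch of how a table shaped like the excerpt above could be used to verify a download. It is an illustration under stated assumptions, not Kubespray's own logic: the verify helper is hypothetical, and the digest is computed from sample bytes rather than copied from checksums.yml.

```python
import hashlib

# Hypothetical table in the same shape as the excerpt above. The digest is
# derived from sample bytes here, not taken from Kubespray.
payload = b"pretend this is the crictl tarball"
crictl_checksums = {
    "amd64": {"v1.30.0": "sha256:" + hashlib.sha256(payload).hexdigest()},
}


def verify(data, arch, version, table):
    """Return True if data matches the recorded checksum for arch/version."""
    expected = table.get(arch, {}).get(version)
    if expected is None:
        # No entry means the binary cannot be verified at all.
        raise LookupError(f"no checksum recorded for {arch}/{version}")
    algo, _, digest = expected.partition(":")
    return hashlib.new(algo, data).hexdigest() == digest


print(verify(payload, "amd64", "v1.30.0", crictl_checksums))
print(verify(b"tampered bytes", "amd64", "v1.30.0", crictl_checksums))
```

The missing-entry branch is the interesting one: a version without a recorded checksum fails before any comparison happens, which mirrors why an unsupported version cannot simply be requested.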

Play Vars and Task Vars in Kubespray

Beyond role defaults and group_vars, Kubespray also uses play vars and task vars in specific situations. These are less common but worth understanding.

Play vars appear at priority 12. In Kubespray, they are used sparingly. I searched through the playbooks and found only four variables defined this way:

# playbooks/cluster.yml (around line 20-22)
- name: Install etcd
  vars:
    etcd_cluster_setup: true
    etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"
  import_playbook: install_etcd.yml

The same pattern appears in upgrade_cluster.yml and scale.yml. Why does Kubespray define etcd_cluster_setup as a play var when there is already a default value in the role?

The answer lies in scale.yml. When you add a new node to an existing cluster, you do not want to reinstall etcd. The scale playbook sets etcd_cluster_setup: false to skip etcd installation. The cluster playbook sets it to true to perform the installation. By using play vars, Kubespray can reuse the same install_etcd.yml playbook with different behavior depending on which parent playbook invoked it.

# playbooks/scale.yml
- name: Install etcd
  vars:
    etcd_cluster_setup: false
    etcd_events_cluster_setup: false
  import_playbook: install_etcd.yml

The ansible_version.yml playbook uses play vars to define minimum and maximum supported Ansible versions:

# playbooks/ansible_version.yml
- name: Check Ansible version
  hosts: localhost
  vars:
    minimal_ansible_version: "2.14.0"
    maximal_ansible_version: "2.17.99"

Task vars at priority 17 appear more frequently but serve a narrow purpose. Most task vars in Kubespray are parameters passed to included tasks.

For example:

# roles/container-engine/containerd/tasks/main.yml
- name: Download containerd
  include_tasks: "../download/tasks/download_file.yml"
  vars:
    download: "{{ containerd_download }}"

The download variable is scoped to that specific task inclusion. It does not override any global variable because nothing else uses a variable simply named download at higher levels. This is a parameter-passing pattern, not a global override pattern.

I checked whether any task vars conflict with group_vars or defaults. They do not. The variable names used in task vars like download, iface, and etcd_peer_addresses are local parameters that exist only within the scope of that task.

Variable Categories from Documentation

The official Kubespray documentation at docs/ansible/vars.md categorizes variables into several groups. While I recommend reading the actual files rather than memorizing documentation, knowing these categories helps when you are searching for a specific setting.

Generic Ansible Variables are facts that Ansible collects automatically. The most commonly referenced one is:

ansible_default_ipv4.address

This is the IP address that Ansible detects as your node’s primary address. Kubespray uses this as a fallback when you do not specify the ip variable in your inventory. You can see how Ansible determines this by running:

ip -4 route get 8.8.8.8

On a typical VirtualBox VM, this returns the NAT interface (10.0.2.15), which is why you must explicitly set the ip variable in multi-NIC environments.
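If you want to check this on several nodes, the source address can be pulled out of that command's output with a few lines of Python. This is a rough sketch that assumes the output contains the usual src field; the sample line, including the interface name and gateway, is illustrative rather than captured from a real machine, and source_address is a hypothetical helper.

```python
import re


def source_address(route_output):
    """Extract the 'src' address from `ip -4 route get` output, or None."""
    match = re.search(r"\bsrc\s+(\d{1,3}(?:\.\d{1,3}){3})\b", route_output)
    return match.group(1) if match else None


# Illustrative output for a VM whose default route uses the NAT interface.
sample = "8.8.8.8 via 10.0.2.2 dev enp0s3 src 10.0.2.15 uid 1000\n    cache"
print(source_address(sample))
```

If the printed address is not the one you expect other nodes to use, that is the signal to set ip explicitly in your inventory.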

Common Vars are the variables you will modify most often:

kube_version: v1.31.0
kube_network_plugin: calico
kube_proxy_mode: ipvs
container_manager: containerd
etcd_version: v3.5.25

Container runtime variables depend on which runtime you choose. If you set container_manager: containerd, Kubespray uses containerd_version. If you set container_manager: docker, it uses docker_version and docker_containerd_version.

Addressing Variables control how nodes communicate:

ip: 192.168.10.10
access_ip: 192.168.10.10
loadbalancer_apiserver: 192.168.10.100

The ip variable is what the node uses for internal cluster communication. The access_ip is what other nodes use to reach this node. In most cases they are the same, but in environments with separate management and data networks, they might differ.

Calico-specific variables live in a separate category if you use Calico as your CNI:

calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird

The Customization Workflow

Now that you understand where variables live and how priority works, let me walk you through the workflow for customizing a Kubespray deployment.

First, recognize what you should and should not modify. This table is critical:

Location                              | Modify? | Reason
--------------------------------------|---------|----------------------------------
inventory/mycluster/group_vars/       | YES     | Your customization area
inventory/mycluster/host_vars/        | YES     | Host-specific overrides
-e command line option                | YES     | Temporary overrides
playbooks/*.yml                       | NO      | Kubespray code, gets overwritten
roles/*/defaults/                     | NO      | Kubespray code, gets overwritten
roles/*/vars/                         | NO      | Internal fixed values

When you clone Kubespray and run git pull or switch to a newer release, everything in playbooks/ and roles/ gets updated. But your inventory/mycluster/ directory is separate. It survives upgrades because it is your own copy of inventory/sample/, not a path that Kubespray itself ships and updates.

The three-step workflow goes like this.

Step one: Copy the sample inventory.

cp -rfp inventory/sample inventory/mycluster

The sample inventory is a template. It contains some variables with values, some commented out as hints, and many variables not present at all (meaning they use role defaults).

Step two: Find the variable you want to change by checking role defaults.

grep -r "kube_version" roles/*/defaults/ | head -5

This shows you where kube_version is defined and what its default value is:

roles/kubespray_defaults/defaults/main/main.yml:kube_version: v1.31.0

Step three: Override it in your group_vars.

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_version: v1.30.0

You only define the variables you want to change. Everything else uses the defaults. This keeps your configuration minimal and makes upgrades easier because you are not duplicating values that you did not need to modify.

Let me show you a complete example. Suppose you want to change the Kubernetes version from the default v1.31.0 to v1.30.0.

First, verify the default:

grep "kube_version" roles/kubespray_defaults/defaults/main/main.yml

Output:

kube_version: v1.31.0

Next, check if the sample inventory already has this variable:

grep "kube_version" inventory/sample/group_vars/k8s_cluster/k8s-cluster.yml

Output:

# kube_version: v1.31.0

The sample has it commented out, which means it is using the default. The comment serves as documentation showing you that this variable exists and can be changed.

Now add your override:

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_version: v1.30.0

When you run ansible-playbook cluster.yml, Ansible loads the role default (v1.31.0) at priority 2, then loads your group_vars (v1.30.0) at priority 6, and the higher priority wins. Kubernetes 1.30.0 gets installed.

If you need to temporarily test a different version without modifying files, use extra vars:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  -e "kube_version=v1.32.0"

Priority 22 beats everything. The cluster deploys with v1.32.0 regardless of what your files say.

Key Files You Will Modify

Let me give you a practical reference for the files you will interact with most often.

The inventory/mycluster/group_vars/all/all.yml file contains global settings that apply to every node, including etcd nodes. Common settings here include:

bin_dir: /usr/local/bin
loadbalancer_apiserver_port: 6443
ntp_enabled: false
ntp_servers:
  - "0.pool.ntp.org iburst"
  - "1.pool.ntp.org iburst"
unsafe_show_logs: false

The inventory/mycluster/group_vars/all/etcd.yml file controls etcd deployment:

etcd_data_dir: /var/lib/etcd
etcd_deployment_type: host

The etcd_deployment_type is particularly important. Setting it to host means etcd runs as a systemd service outside of Kubernetes. Setting it to kubeadm means etcd runs as a static pod managed by kubeadm. The host option gives you more control and independence from kubeadm's lifecycle management.

The inventory/mycluster/group_vars/all/containerd.yml file configures the container runtime. Most options are commented out in the sample, meaning you use defaults:

# containerd_storage_dir: "/var/lib/containerd"
# containerd_state_dir: "/run/containerd"
# containerd_oom_score: 0
# containerd_default_runtime: "runc"

The inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml file is the most important one. This is where you configure the Kubernetes cluster itself:

kube_config_dir: /etc/kubernetes
kube_network_plugin: calico
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_proxy_mode: ipvs
dns_mode: coredns
enable_nodelocaldns: true
container_manager: containerd
auto_renew_certificates: false

The inventory/mycluster/group_vars/k8s_cluster/addons.yml file controls which add-ons get installed:

helm_enabled: false
metrics_server_enabled: false
ingress_nginx_enabled: false
cert_manager_enabled: false
metallb_enabled: false
argocd_enabled: false

Every addon has additional configuration options below its enable flag. For example, if you enable metrics_server, you can configure its resource requests, replica count, and other settings in the same file.

The network plugin files like k8s-net-calico.yml and k8s-net-flannel.yml contain CNI-specific settings. The file that gets used depends on your kube_network_plugin value. If you set kube_network_plugin: flannel, then k8s-net-flannel.yml becomes relevant.

Understanding Scope: all vs k8s_cluster

A subtle but important detail is the difference between group_vars/all/ and group_vars/k8s_cluster/.

Variables in group_vars/all/ apply to every host in your inventory. This includes etcd nodes, control plane nodes, and worker nodes.

Variables in group_vars/k8s_cluster/ apply only to hosts that belong to the k8s_cluster group. Looking at a typical inventory:

[kube_control_plane]
k8s-ctr

[etcd]
k8s-ctr

[kube_node]
k8s-ctr

[k8s_cluster:children]
kube_control_plane
kube_node

The k8s_cluster group is defined using the :children suffix, which means it includes all hosts from kube_control_plane and kube_node. In this single-node example, k8s-ctr belongs to all groups. But in a multi-node setup where you have dedicated etcd nodes that are not part of the Kubernetes cluster, the etcd nodes would be in the etcd group but not in k8s_cluster.

This distinction matters because some settings only make sense for Kubernetes nodes. The kube_proxy_mode variable, for example, should only apply to nodes running kube-proxy. Putting it in group_vars/k8s_cluster/ ensures it does not get applied to standalone etcd nodes.

The priority difference also matters. Remember that group_vars/all is priority 4 and group_vars/k8s_cluster is priority 6. If you somehow define the same variable in both places, the k8s_cluster value wins for nodes that belong to that group.

One practical question that comes up is how to know which Kubernetes versions Kubespray supports. The answer is in the checksums file.

grep -A 20 "kubelet_checksums:" roles/kubespray_defaults/vars/main/checksums.yml

This shows you the kubelet checksums, and each version listed is a supported version. If you try to set kube_version to a value that does not have a corresponding checksum, the download task will fail because Kubespray cannot verify the binary integrity.

When a new Kubernetes version comes out, the Kubespray maintainers add its checksums to this file. Until they do, you cannot use that version unless you bypass the checksum verification (not recommended) or add the checksums yourself using extra_vars at priority 22.
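The same check can be expressed in code. The sketch below assumes the checksums have already been loaded into a dictionary shaped like the earlier excerpt; the entries and the helper names are hypothetical, and the digests are left as placeholders.

```python
# Hypothetical excerpt; the real entries live in checksums.yml.
kubelet_checksums = {
    "amd64": {"v1.31.0": "sha256:...", "v1.30.0": "sha256:..."},
    "arm64": {"v1.31.0": "sha256:..."},
}


def version_key(version):
    """Turn 'v1.30.0' into (1, 30, 0) so versions sort numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def supported_versions(table, arch):
    """List versions with a recorded checksum for arch, newest first."""
    return sorted(table.get(arch, {}), key=version_key, reverse=True)


def is_supported(table, arch, version):
    """A version counts as supported only if a checksum exists for it."""
    return version in table.get(arch, {})


print(supported_versions(kubelet_checksums, "amd64"))
print(is_supported(kubelet_checksums, "arm64", "v1.30.0"))
```

Note that support is per architecture: in this made-up table, v1.30.0 is available for amd64 but not for arm64.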

When a variable does not behave as expected, here is how to debug it.

First, check what value Ansible is actually using. Add a debug task to your playbook or run:

ansible -i inventory/mycluster/inventory.ini all -m debug -a "var=kube_version"

This shows you the resolved value of kube_version for each host.

Second, trace where the value comes from. Ansible does not have a built-in “show me which file set this variable” feature, but you can search:

grep -r "kube_version" inventory/mycluster/
grep -r "kube_version" roles/*/defaults/
grep -r "kube_version" roles/*/vars/

Compare the search results with the priority list. The highest priority source wins.

Third, check for typos. A common mistake is defining kube_versions (plural) when the variable is kube_version (singular). Ansible does not reject unknown variable names: the misspelled variable is defined, but nothing in Kubespray reads it, so your setting silently has no effect.

Fourth, verify the file is being loaded. Ansible only reads files in group_vars that have no extension or end in .yml, .yaml, or .json. If your new file ends in something else, such as .txt or .bak, Ansible ignores it.
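The typo check in the third step can be partly automated. Here is a minimal sketch that compares the names you defined against a list of known variable names and flags near misses. The KNOWN list is a hand-picked sample from this chapter, not a complete list, and suspicious_names and the 0.8 cutoff are my own choices.

```python
import difflib

# A few variable names mentioned in this chapter; a real check would
# collect them from the role defaults instead.
KNOWN = [
    "kube_version",
    "kube_network_plugin",
    "kube_proxy_mode",
    "container_manager",
    "etcd_deployment_type",
]


def suspicious_names(defined, known=KNOWN):
    """Map each unknown variable name to its closest known name, if any."""
    hints = {}
    for name in defined:
        if name in known:
            continue
        close = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
        if close:
            hints[name] = close[0]
    return hints


print(suspicious_names(["kube_versions", "kube_proxy_mode", "my_custom_var"]))
```

Names that are not close to anything known, such as my_custom_var, are left alone, since you are free to define your own variables.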

Chapter 6: Group Variables Deep Dive

In the previous chapter, I explained how Kubespray organizes its variable system and why the precedence matters. Now we get into the actual files you will be editing. When you copy inventory/sample to inventory/mycluster, you inherit a carefully structured set of configuration files that control every aspect of your cluster. Understanding what each file does and which variables live where saves you from the frustrating experience of changing something in the wrong place and wondering why nothing happened.

The group_vars Directory Structure

After copying the sample inventory, your mycluster directory looks like this:

inventory/mycluster/
├── group_vars
│   ├── all
│   │   ├── all.yml
│   │   ├── aws.yml
│   │   ├── azure.yml
│   │   ├── containerd.yml
│   │   ├── coreos.yml
│   │   ├── cri-o.yml
│   │   ├── docker.yml
│   │   ├── etcd.yml
│   │   ├── gcp.yml
│   │   ├── hcloud.yml
│   │   ├── huaweicloud.yml
│   │   ├── oci.yml
│   │   ├── offline.yml
│   │   ├── openstack.yml
│   │   ├── upcloud.yml
│   │   └── vsphere.yml
│   └── k8s_cluster
│       ├── addons.yml
│       ├── k8s-cluster.yml
│       ├── k8s-net-calico.yml
│       ├── k8s-net-cilium.yml
│       ├── k8s-net-custom-cni.yml
│       ├── k8s-net-flannel.yml
│       ├── k8s-net-kube-ovn.yml
│       ├── k8s-net-kube-router.yml
│       ├── k8s-net-macvlan.yml
│       └── kube_control_plane.yml
└── inventory.ini

Two directories exist under group_vars: all and k8s_cluster. This split is not arbitrary. It follows Ansible’s group variable system where variables defined in a directory apply only to hosts belonging to that group.

How Ansible Group Variables Work

The group_vars/all directory contains variables that apply to every single host in your inventory. When Ansible runs a play against any host, it loads all the YAML files from group_vars/all and makes those variables available.

The group_vars/k8s_cluster directory is more specific. It only applies to hosts that belong to the k8s_cluster group. If you look at a typical Kubespray inventory, you will see something like this:

[kube_control_plane]
k8s-ctr

[etcd]
k8s-ctr

[kube_node]
k8s-ctr

[k8s_cluster:children]
kube_control_plane
kube_node

The k8s_cluster group is defined using the :children suffix, which means it includes all hosts from kube_control_plane and kube_node groups. So any host that is either a control plane node or a worker node gets the variables from group_vars/k8s_cluster.

This distinction matters for etcd. If you run etcd on separate dedicated nodes that are not part of kube_control_plane or kube_node, those etcd hosts would get variables from group_vars/all but not from group_vars/k8s_cluster. The separation allows you to configure etcd-specific settings without accidentally applying Kubernetes cluster settings to standalone etcd nodes.

Variable Precedence Between Groups

Ansible has a rule: more specific groups override less specific ones. The group_vars/all directory has priority 4 in Ansible’s precedence system, while group_vars/k8s_cluster has priority 6. This means if the same variable is defined in both places, the k8s_cluster version wins for hosts that belong to that group.

In practice, you rarely define the same variable in both places. Kubespray’s sample files are organized so that truly global settings go in all and cluster-specific settings go in k8s_cluster. But knowing this precedence helps when you need to override something for just the Kubernetes nodes without affecting standalone etcd hosts.
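Group membership and group precedence can be modeled together in a few lines. The sketch below is a simplification for illustration only: the etcd-1 host is a hypothetical dedicated etcd node that I added to the inventory, the iptables value in the all group is made up to show the override, and members and resolve are not Ansible functions.

```python
# Inventory modeled as a dict. etcd-1 is a hypothetical dedicated etcd node.
INVENTORY = {
    "kube_control_plane": {"hosts": ["k8s-ctr"]},
    "etcd": {"hosts": ["k8s-ctr", "etcd-1"]},
    "kube_node": {"hosts": ["k8s-ctr"]},
    "k8s_cluster": {"children": ["kube_control_plane", "kube_node"]},
}

# The same variable defined in both places, purely to show which one wins.
GROUP_VARS = {
    "all": {"kube_proxy_mode": "iptables"},
    "k8s_cluster": {"kube_proxy_mode": "ipvs"},
}


def members(group, inventory=INVENTORY):
    """Collect hosts of a group, following :children recursively."""
    entry = inventory[group]
    hosts = set(entry.get("hosts", []))
    for child in entry.get("children", []):
        hosts |= members(child, inventory)
    return hosts


def resolve(host, name):
    """Start from group_vars/all, then let k8s_cluster override for members."""
    value = GROUP_VARS["all"].get(name)
    if host in members("k8s_cluster") and name in GROUP_VARS["k8s_cluster"]:
        value = GROUP_VARS["k8s_cluster"][name]
    return value


print(resolve("k8s-ctr", "kube_proxy_mode"))
print(resolve("etcd-1", "kube_proxy_mode"))
```

The dedicated etcd host only ever sees the value from all, because it never becomes a member of k8s_cluster.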

Conditional File Application

Not every file in group_vars gets used in every deployment. Many files are conditional based on other variable values.

For cloud provider files like aws.yml, azure.yml, and gcp.yml, the contents only matter if you set cloud_provider to that specific provider. If you are running on bare metal or a different cloud, these files are essentially ignored even though they exist.

Similarly, the container runtime files are conditional. The containerd.yml settings apply when container_manager is set to containerd. The cri-o.yml settings apply when you choose CRI-O instead. The docker.yml file is there for Docker support through cri-dockerd, though this is a legacy path: Kubernetes 1.24 removed the built-in dockershim, so Docker only works through that extra shim.

The network plugin files in k8s_cluster follow the same pattern. The k8s-net-calico.yml file only matters when kube_network_plugin is calico. Switch to flannel and suddenly k8s-net-flannel.yml becomes relevant while the Calico file is ignored.

This conditional system is elegant because you can see all available options in your inventory without them interfering with each other. You pick your choices in k8s-cluster.yml and the corresponding configuration files automatically become active.
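One way to picture this is as a lookup from your choices to the files whose settings end up mattering. Ansible still loads every file in the directory; the sketch only models which files are relevant. The tables list a subset of the sample files, the "crio" key is my assumption about the value name for CRI-O, and relevant_files is a hypothetical helper.

```python
# Subset of the sample files, keyed by the setting that makes them relevant.
NETWORK_FILES = {
    "calico": "k8s-net-calico.yml",
    "flannel": "k8s-net-flannel.yml",
    "cilium": "k8s-net-cilium.yml",
}
RUNTIME_FILES = {
    "containerd": "containerd.yml",
    "crio": "cri-o.yml",      # key name is an assumption
    "docker": "docker.yml",
}


def relevant_files(settings):
    """Return the sample files that matter for the chosen plugin and runtime."""
    return sorted([
        NETWORK_FILES[settings["kube_network_plugin"]],
        RUNTIME_FILES[settings["container_manager"]],
    ])


print(relevant_files({"kube_network_plugin": "calico",
                      "container_manager": "containerd"}))
```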

The group_vars/all Directory

This directory contains settings that every node needs regardless of its role.

all.yml — Global Settings

The all.yml file is where you configure settings that genuinely apply everywhere. Running grep to see only the non-commented lines:

grep "^[^#]" inventory/mycluster/group_vars/all/all.yml

You get output like:

---
bin_dir: /usr/local/bin
loadbalancer_apiserver_port: 6443
loadbalancer_apiserver_healthcheck_port: 8081
no_proxy_exclude_workers: false
kube_webhook_token_auth: false
kube_webhook_token_auth_url_skip_tls_verify: false
ntp_enabled: false
ntp_manage_config: false
ntp_servers:
  - "0.pool.ntp.org iburst"
  - "1.pool.ntp.org iburst"
  - "2.pool.ntp.org iburst"
  - "3.pool.ntp.org iburst"
unsafe_show_logs: false
allow_unsupported_distribution_setup: false

The bin_dir setting tells Kubespray where to install binaries like kubectl, kubelet, and kubeadm. The default /usr/local/bin works for most systems. Change this if your organization has a different standard location for locally installed binaries.

The loadbalancer_apiserver_port defaults to 6443, which is the standard Kubernetes API server port. In HA setups with an external load balancer, this is the port your load balancer listens on.

The ntp_enabled setting is false by default, which might surprise you. Time synchronization is critical for Kubernetes, especially for certificate validation and etcd consistency. Kubespray assumes your nodes already have NTP configured through your base OS provisioning. If you want Kubespray to manage NTP, set this to true and it will configure chrony or ntpd with the servers listed in ntp_servers.

The unsafe_show_logs setting controls whether sensitive information appears in Ansible output. Keep this false in production. Setting it true during debugging can help you see what values are being used, but you risk exposing secrets in your logs.

The allow_unsupported_distribution_setup is a safety valve. Kubespray has a list of tested and supported operating systems. If you try to deploy on something not on that list, it fails with an error. Setting this to true bypasses that check. Use this only if you know what you are doing and accept that things might break in unexpected ways.

etcd.yml — etcd Configuration

The etcd configuration is surprisingly minimal in the sample file:

grep "^[^#]" inventory/mycluster/group_vars/all/etcd.yml

---
etcd_data_dir: /var/lib/etcd
etcd_deployment_type: host

The etcd_data_dir is where etcd stores its database. This directory contains the entire state of your Kubernetes cluster. Losing this directory means losing your cluster. Make sure this path is on reliable storage and included in your backup strategy.

The etcd_deployment_type setting is more interesting. It has two possible values: host and kubeadm.

When set to host, Kubespray installs etcd as a systemd service running directly on the host. This is independent of Kubernetes itself. The etcd binary runs under the etcd user, managed by systemd, with its own certificates in /etc/ssl/etcd/ssl. This approach keeps etcd completely separate from Kubernetes, which some operators prefer because etcd can be managed, backed up, and recovered independently.

When set to kubeadm, etcd runs as a static pod managed by kubelet. This is what you get with a standard kubeadm init. The etcd container runs in the kube-system namespace, and kubeadm handles certificate generation and renewal. This approach is simpler but couples etcd lifecycle to Kubernetes.

Kubespray defaults to host because it provides more operational flexibility, especially for etcd upgrades and disaster recovery scenarios.

containerd.yml — Container Runtime Settings

Opening the containerd configuration file reveals mostly comments:

cat inventory/mycluster/group_vars/all/containerd.yml

---
# Please see roles/container-engine/containerd/defaults/main.yml 
# for more configuration options

# containerd_storage_dir: "/var/lib/containerd"
# containerd_state_dir: "/run/containerd"
# containerd_oom_score: 0

# containerd_default_runtime: "runc"
# containerd_snapshotter: "native"
...

The file is almost entirely commented out because the defaults work for most deployments. This is a pattern you will see throughout Kubespray: the sample files show you what options exist without forcing you to understand them all upfront.

If you need to change containerd’s storage location, perhaps because /var/lib is on a small root partition and you want container images on a larger disk, uncomment containerd_storage_dir and set your preferred path.

The comment at the top pointing to roles/container-engine/containerd/defaults/main.yml is valuable. When you need to customize something not shown in the sample file, that is where you look to find the variable name and its default value.

Cloud Provider Files

The group_vars/all directory contains configuration files for various cloud providers: aws.yml, azure.yml, gcp.yml, openstack.yml, vsphere.yml, and several others.

These files configure cloud-specific integrations like:

Cloud controller manager settings
CSI driver configurations for persistent volumes
Load balancer integration
Node metadata services

For example, if you deploy on AWS and want to use EBS volumes for persistent storage, you would configure the AWS CSI driver settings in aws.yml. If you deploy on bare metal, these files sit unused.

Most of the content in these files is commented out. The comments serve as documentation showing what options are available. When you need a specific cloud integration, you uncomment and configure the relevant settings.

docker.yml — Legacy Docker Support

The docker.yml file exists for deployments that still use Docker as the container runtime. Since Kubernetes 1.24 removed the built-in dockershim, using Docker now requires cri-dockerd as a shim layer.

docker_container_storage_setup: false
docker_dns_servers_strict: false
docker_daemon_graph: "/var/lib/docker"
docker_iptables_enabled: "false"
docker_log_opts: "--log-opt max-size=50m --log-opt max-file=5"
docker_bin_dir: "/usr/bin"
docker_rpm_keepcache: 1

Unless you have a specific reason to use Docker, stick with containerd. It is the default and the direction the Kubernetes ecosystem has moved.

offline.yml — Air-Gap Deployment

The offline.yml file contains settings for deploying Kubernetes in environments without internet access. Air-gap deployments require pre-downloading all container images and binaries, then serving them from internal repositories.

This file is where you configure your internal registry mirrors and binary download locations. In a connected environment, you can ignore this file entirely.

The group_vars/k8s_cluster Directory

This directory contains the real meat of your cluster configuration. The settings here define what kind of Kubernetes cluster you are building.

k8s-cluster.yml — The Heart of Your Cluster Configuration

This is the most important configuration file in Kubespray. It controls fundamental cluster characteristics that are difficult or impossible to change after deployment.

Running grep to see active settings:

grep "^[^#]" inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

The output is substantial:

---
kube_config_dir: /etc/kubernetes
kube_script_dir: "/kubernetes-scripts"
kube_manifest_dir: "/manifests"
kube_cert_dir: "/ssl"
kube_token_dir: "/tokens"
kube_api_anonymous_auth: true
local_release_dir: "/tmp/releases"
retry_stagger: 5
kube_owner: kube
kube_cert_group: kube-cert
kube_log_level: 2
credentials_dir: "/credentials"
kube_network_plugin: calico
kube_network_plugin_multus: false
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24
kube_service_addresses_ipv6: fd85:ee78:d8a6:8607::1000/116
kube_pods_subnet_ipv6: fd85:ee78:d8a6:8607::1:0000/112
kube_network_node_prefix_ipv6: 120
kube_apiserver_ip: ""
kube_apiserver_port: 6443
kube_proxy_mode: ipvs
kube_proxy_strict_arp: false
kube_encrypt_secret_data: false
cluster_name: cluster.local
ndots: 2
dns_mode: coredns
enable_nodelocaldns: true
enable_nodelocaldns_secondary: false
nodelocaldns_ip: 169.254.25.10
resolvconf_mode: host_resolvconf
deploy_netchecker: false
dns_domain: ""
container_manager: containerd
kata_containers_enabled: false
k8s_image_pull_policy: IfNotPresent
kubernetes_audit: false
volume_cross_zone_attachment: false
persistent_volumes_enabled: false
event_ttl_duration: "1h0m0s"
auto_renew_certificates: false
kubeadm_patches_dir: "/patches"
kubeadm_patches: []
remove_anonymous_access: false

Directory Settings

The first group of settings defines where Kubernetes stores its files:

kube_config_dir: /etc/kubernetes
kube_script_dir: "/kubernetes-scripts"
kube_manifest_dir: "/manifests"
kube_cert_dir: "/ssl"
kube_token_dir: "/tokens"

The manifest, certificate, and token values shown here are suffixes appended to kube_config_dir, so the actual manifest directory ends up at /etc/kubernetes/manifests. This is where static pod manifests for control plane components live.

You rarely need to change these unless your organization has specific filesystem layout requirements.

Network Settings

The network configuration is critical and mostly immutable after cluster creation:

kube_network_plugin: calico
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
kube_network_node_prefix: 24

The kube_network_plugin setting chooses your CNI provider. Calico is the default and provides excellent functionality including network policies. Other options include flannel for simplicity, cilium for eBPF-based networking, and several others.

The kube_service_addresses CIDR defines the IP range for Kubernetes Services. When you create a Service of type ClusterIP, it gets an IP from this range. The default 10.233.0.0/18 provides about 16,000 service IPs.

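The kube_pods_subnet CIDR is where pod IPs come from. With 10.233.64.0/18, you have roughly 16,000 pod IPs. The kube_network_node_prefix of 24 means each node gets a /24 subnet (256 IPs) from this range for its pods. A /18 holds only 64 such /24 blocks, so these defaults leave room for 64 per-node subnets unless you widen the pod subnet or assign smaller per-node subnets.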

These CIDR ranges must not overlap with your existing network infrastructure. If your corporate network uses 10.x.x.x addressing, you need to carefully plan these ranges to avoid conflicts. Changing these after deployment requires recreating the cluster.
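The arithmetic behind these ranges is easy to check with Python's ipaddress module before you commit to them. The service and pod CIDRs below are the sample defaults from this section; the corporate range is a hypothetical existing network added only to show an overlap.

```python
import ipaddress

services = ipaddress.ip_network("10.233.0.0/18")
pods = ipaddress.ip_network("10.233.64.0/18")
node_prefix = 24

print(services.num_addresses)            # 16384 service IPs
print(pods.num_addresses)                # 16384 pod IPs

# Carve the pod range into per-node subnets of the configured size.
node_subnets = list(pods.subnets(new_prefix=node_prefix))
print(len(node_subnets))                 # 64 per-node subnets
print(node_subnets[0].num_addresses)     # 256 addresses in each

print(services.overlaps(pods))           # False: the two defaults are disjoint

# Hypothetical existing network that uses 10.x.x.x addressing.
corporate = ipaddress.ip_network("10.0.0.0/8")
print(pods.overlaps(corporate))          # True: this would need replanning
```

Running the same overlap check against your real network ranges is a cheap way to catch a conflict while the values can still be changed.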

Proxy Mode

The kube_proxy_mode setting determines how Kubernetes implements service networking:

kube_proxy_mode: ipvs

IPVS (IP Virtual Server) is the default and provides better performance than the older iptables mode, especially in clusters with many services. IPVS uses hash tables for service lookup, giving O(1) performance regardless of how many services exist. The iptables mode uses linear rule chains, so performance degrades as service count grows.

For most deployments, stick with ipvs. The iptables mode still works and might be preferred in very small clusters or when debugging networking issues since iptables rules are more familiar to many operators.
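The difference between the two lookup strategies can be shown with a toy model. This is only an illustration of linear scanning versus hashing, not how kube-proxy or the kernel is implemented; the service addresses, backend names, and both lookup functions are made up.

```python
# Toy model: 200 made-up services, each mapped to a made-up backend.
services = [(f"10.233.0.{i}", 80) for i in range(1, 201)]
backends = {svc: f"backend-{n}" for n, svc in enumerate(services)}


def lookup_linear(target):
    """Walk the rules in order, as a rule chain would; count comparisons."""
    for steps, svc in enumerate(services, start=1):
        if svc == target:
            return backends[svc], steps
    return None, len(services)


def lookup_hashed(target):
    """Single hash-table probe, independent of how many services exist."""
    return backends.get(target)


last = services[-1]
print(lookup_linear(last))   # needs one comparison per preceding rule
print(lookup_hashed(last))
```

In the linear version the cost of reaching the last service grows with the number of services, while the hashed version does the same amount of work for any of them.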

DNS Configuration

dns_mode: coredns
enable_nodelocaldns: true
nodelocaldns_ip: 169.254.25.10
cluster_name: cluster.local

CoreDNS is the standard Kubernetes DNS server and the only real option these days. The dns_mode setting exists for historical reasons, dating back to when kube-dns was an alternative.

The enable_nodelocaldns setting deploys NodeLocal DNSCache, which runs a DNS caching agent on every node. Pods make DNS queries to the nodelocaldns_ip address (169.254.25.10, a link-local address), and the local agent either answers from cache or forwards to CoreDNS. This significantly reduces DNS latency and load on CoreDNS pods.

The cluster_name becomes your cluster’s DNS domain. Services are accessible at servicename.namespace.svc.cluster.local.

Container Runtime

container_manager: containerd

This selects the container runtime: containerd, CRI-O, or Docker through the legacy cri-dockerd path described earlier. Containerd is the dominant choice in the Kubernetes ecosystem now. It is what Docker uses internally, it is the default for most managed Kubernetes services, and it has excellent tooling support.

Certificate Management

auto_renew_certificates: false

Kubernetes component certificates expire after one year by default. When auto_renew_certificates is true, Kubespray configures a systemd timer to automatically renew certificates monthly using kubeadm certs renew.

The sample defaults to false, which means you need to manually renew certificates or implement your own automation. For any cluster running longer than a few months, enabling this is strongly recommended. Expired certificates cause control plane components to stop communicating, effectively bringing down your cluster.
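If you keep automatic renewal off, you need some way to notice an approaching expiry. Below is a minimal sketch of the date arithmetic such a check would use. The dates are illustrative, the 30-day threshold is an arbitrary choice of mine, and reading the real expiry date from a certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after, now):
    """Whole days remaining before the certificate expires."""
    return (not_after - now).days


def needs_renewal(not_after, now, threshold_days=30):
    """Flag certificates that expire within the threshold (hypothetical policy)."""
    return days_until_expiry(not_after, now) <= threshold_days


# Illustrative certificate issued with the one-year lifetime described above.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=365)

check_date = datetime(2025, 12, 15, tzinfo=timezone.utc)
print(days_until_expiry(not_after, check_date))
print(needs_renewal(not_after, check_date))
```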

addons.yml — Cluster Add-ons

The addons file controls which additional components get deployed:

grep "^[^#]" inventory/mycluster/group_vars/k8s_cluster/addons.yml

---
helm_enabled: false
registry_enabled: false
metrics_server_enabled: false
local_path_provisioner_enabled: false
local_volume_provisioner_enabled: false
gateway_api_enabled: false
ingress_nginx_enabled: false
ingress_publish_status_address: ""
ingress_alb_enabled: false
cert_manager_enabled: false
metallb_enabled: false
metallb_speaker_enabled: ""
metallb_namespace: "metallb-system"
argocd_enabled: false
kube_vip_enabled: false
node_feature_discovery_enabled: false

Each boolean controls whether that component gets installed. The defaults are conservative, deploying a minimal cluster. You enable what you need.

For most production clusters, you probably want:

helm_enabled: true
metrics_server_enabled: true

Helm is the de facto package manager for Kubernetes applications. Metrics Server is required for kubectl top commands and Horizontal Pod Autoscaler functionality.

For bare metal clusters without a cloud load balancer, MetalLB provides LoadBalancer service support:

metallb_enabled: true

For clusters that need ingress routing:

ingress_nginx_enabled: true

The file contains many commented-out sections with detailed configuration options for each addon. When you enable an addon, scroll through those comments to see what customization options exist.
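After editing the file, it is easy to lose track of what you actually switched on. Here is a small sketch that lists the add-ons currently set to true. The helper name enabled_addons is hypothetical, and it assumes the switches follow the name_enabled: true pattern shown in the listing above.

```shell
# List the add-on switches currently set to true in an addons.yml file.
# enabled_addons is a hypothetical helper for illustration.
enabled_addons() {
  # Match uncommented "<name>_enabled: true" lines and print only <name>.
  sed -n 's/^\([a-z0-9_]*\)_enabled:[[:space:]]*true[[:space:]]*$/\1/p' "$1"
}

# Usage:
# enabled_addons inventory/mycluster/group_vars/k8s_cluster/addons.yml
```

Commented-out lines and switches set to false are ignored, so the output is the list of components the next playbook run will try to deploy.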

kube_control_plane.yml — Control Plane Resource Reservation

This file is entirely commented out in the sample:

cat inventory/mycluster/group_vars/k8s_cluster/kube_control_plane.yml

# Reservation for control plane kubernetes components
# kube_memory_reserved: 512Mi
# kube_cpu_reserved: 200m
# kube_ephemeral_storage_reserved: 2Gi
# kube_pid_reserved: "1000"

# Reservation for control plane host system
# system_memory_reserved: 256Mi
# system_cpu_reserved: 250m
# system_ephemeral_storage_reserved: 2Gi
# system_pid_reserved: "1000"

These settings reserve resources on control plane nodes for Kubernetes components and the host operating system. Without reservations, pods could consume all available resources, starving critical system processes.

In production, especially on nodes that run both control plane components and workload pods, configuring these reservations prevents resource exhaustion scenarios. The commented values are reasonable starting points.
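To get a feel for what these reservations cost, recall that kubelet computes a node's allocatable memory as capacity minus the kube reservation, the system reservation, and the hard eviction threshold. The sketch below applies that subtraction to a node with 4096Mi of memory and the commented values above. The 100Mi eviction threshold is an assumption (kubelet's usual default for memory.available), and the function name is hypothetical.

```shell
# Estimate allocatable memory on a node, in Mi.
# allocatable_mi is a hypothetical helper for illustration.
allocatable_mi() {
  # $1 = capacity, $2 = kube reserved, $3 = system reserved, $4 = eviction threshold
  echo "$(( $1 - $2 - $3 - $4 ))"
}

allocatable_mi 4096 512 256 100   # prints 3228
```

In other words, on a 4 GB node roughly 3.2 GB would remain schedulable for pods once both reservations are in place.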

Network Plugin Configuration Files

The k8s_cluster directory contains configuration files for each supported CNI plugin. Only the file matching your kube_network_plugin choice gets used.

For Calico (k8s-net-calico.yml):

calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird

These control Calico’s encapsulation mode. IPIP mode wraps pod traffic in IP-in-IP packets, which works across most networks. VXLAN is an alternative encapsulation. The bird backend uses BGP for routing, which can peer with your physical network infrastructure for advanced deployments.

For Flannel (k8s-net-flannel.yml):

flannel_backend_type: vxlan

Flannel is simpler than Calico. It uses VXLAN overlay networking by default. The main thing you might configure is flannel_interface if you have multiple network interfaces and need Flannel to use a specific one.

For Cilium (k8s-net-cilium.yml):

Cilium has extensive configuration options for its eBPF-based networking. The defaults work for getting started, but Cilium’s advanced features like transparent encryption, Hubble observability, and service mesh capabilities all have configuration options here.

Checking Supported Kubernetes Versions

The group_vars files define what you want, but Kubespray can only install versions it knows about. The roles/kubespray_defaults/vars/main/checksums.yml file contains SHA256 checksums for all supported binaries.

To see what Kubernetes versions are available:

grep -i kube -A40 roles/kubespray_defaults/vars/main/checksums.yml

If a version does not have checksums in this file, Kubespray cannot install it. This is a security feature ensuring binary integrity. When Kubernetes releases a new version, Kubespray maintainers add the checksums in a subsequent release.

This is why you cannot simply set kube_version to any arbitrary version. You need a Kubespray release that includes that version’s checksums.
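Before pinning kube_version, you can check whether your Kubespray checkout knows about it. The following is a rough sketch, not an official tool. It assumes the checksums file is organized as component_checksums: blocks whose entries are keyed by version (with or without a leading v); that layout has changed between Kubespray releases, so verify it against your copy. The helper name has_checksum is hypothetical.

```shell
# Succeed (exit 0) when a checksums file lists the given version for a component.
# has_checksum is a hypothetical helper; the file layout is an assumption.
has_checksum() {
  # $1 = checksums file, $2 = component (e.g. kubelet), $3 = version (e.g. 1.33.7)
  awk -v comp="$2" -v ver="$3" '
    # A new "<component>_checksums:" header starts or ends the block we care about.
    /^[a-z0-9_]+_checksums:/ { in_block = ($1 == (comp "_checksums:")); next }
    in_block {
      key = $1
      sub(/^v/, "", key)   # tolerate "v1.33.7" style keys
      sub(/:$/, "", key)   # drop the trailing colon of the YAML key
      if (key == ver) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Usage:
# has_checksum roles/kubespray_defaults/vars/main/checksums.yml kubelet 1.33.7 \
#   && echo "supported by this checkout"
```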

Practical Workflow

When setting up a new cluster, the typical workflow is:

First, copy the sample inventory:

cp -rfp inventory/sample inventory/mycluster

Second, edit inventory.ini with your hosts and groups.

Third, review k8s-cluster.yml and change the settings that matter for your environment:

Choose your CNI plugin
Verify the CIDR ranges do not conflict with your network
Enable certificate auto-renewal
Adjust any other settings based on requirements
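The CIDR check in the list above is easy to get wrong by eye, so here is a minimal IPv4-only sketch that tests whether two ranges overlap. All three function names are hypothetical, and the ranges in the example calls are sample values; substitute the pod, service, and host networks from your own configuration.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1          # split the address on dots into $1..$4
  echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# Print "first last" integer addresses for a CIDR such as 10.233.0.0/18.
cidr_range() (
  base=$(ip_to_int "${1%/*}")
  bits=${1#*/}
  size=$(( 1 << (32 - bits) ))
  first=$(( base / size * size ))   # align to the network boundary
  echo "$first $(( first + size - 1 ))"
)

# Exit 0 when the two IPv4 CIDRs share at least one address.
cidrs_overlap() (
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # $1/$2 = first/last of range A, $3/$4 = first/last of range B
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

if cidrs_overlap 10.233.0.0/18 10.233.64.0/18; then echo "conflict"; else echo "ok"; fi   # prints ok
if cidrs_overlap 192.168.10.0/24 192.168.0.0/16; then echo "conflict"; else echo "ok"; fi # prints conflict
```

Run the check for every pair that matters: pod subnet against service subnet, and each of them against the networks your nodes actually live on.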

Fourth, review addons.yml and enable the add-ons you need.

Fifth, if using a specific cloud provider, configure the corresponding file in group_vars/all.

Variables you do not change keep their defaults from roles/*/defaults/main.yml. You do not need to understand every variable to deploy a working cluster. Start with the defaults, deploy, and then iterate as you learn what needs customization.

The group_vars files in your inventory are yours to modify freely. Kubespray updates will not overwrite them because they live in your custom inventory directory, not in the sample directory or the roles themselves.

Chapter 7: cluster.yml Playbook Flow Analysis

When you run ansible-playbook cluster.yml, you are kicking off a carefully orchestrated sequence of operations that transforms a bunch of Linux machines into a functioning Kubernetes cluster.

Understanding this flow is not optional if you want to operate Kubespray in production. Let me walk you through the entire cluster.yml playbook, explaining what happens at each stage and why it is structured the way it is.

The cluster.yml file sits at the root of the Kubespray repository. It is the main entry point for cluster deployment. But here is the thing that confused me when I first looked at it: the file itself is surprisingly short. Most of the actual work happens in sub-playbooks and roles that get imported.

The overall flow looks like this:


Common tasks (boilerplate.yml)
↓
Fact gathering (internal_facts.yml)
↓
etcd installation preparation (preinstall, container-engine, download)
↓
etcd installation (install_etcd.yml)
↓
Kubernetes node installation (kubernetes/node)
↓
Control plane installation (kubernetes/control-plane)
↓
kubeadm execution and CNI installation (kubernetes/kubeadm, network_plugin)
↓
Calico Route Reflector (optional)
↓
Windows node patching (optional)
↓
Kubernetes apps installation (kubernetes-apps)
↓
resolv.conf finalization

Each of these stages corresponds to one or more plays in the playbook. Let me show you the actual structure.

The Playbook Structure

Here is the cluster.yml file in its entirety. I am including the whole thing because it is worth reading through:

---
- name: Common tasks for every playbooks
  import_playbook: boilerplate.yml

- name: Gather facts
  import_playbook: internal_facts.yml

- name: Prepare for etcd install
  hosts: "{{ groups['etcd'] | default([]) | union(groups['k8s_cluster'] | default([])) }}"
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: container-engine, tags: container-engine, when: deploy_container_engine }
    - { role: download, tags: download, when: "not skip_downloads" }

- name: Install etcd
  vars:
    etcd_cluster_setup: true
    etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"
  import_playbook: install_etcd.yml

- name: Install Kubernetes nodes
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/node, tags: node }

- name: Install the control plane
  hosts: kube_control_plane
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/control-plane, tags: control-plane }

- name: Invoke kubeadm and install a CNI
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/kubeadm, tags: kubeadm }
    - { role: kubernetes/node-label, tags: node-label }
    - { role: kubernetes/node-taint, tags: node-taint, when: node_taints is defined }
    - { role: network_plugin, tags: network }

- name: Install Calico Route Reflector
  hosts: calico_rr
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: network_plugin/calico/rr, tags: ['network', 'calico_rr'] }

- name: Patch Kubernetes for Windows
  hosts: kube_control_plane[0]
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: win_nodes/kubernetes_patch, tags: k8s-windows }

- name: Install Kubernetes apps
  hosts: kube_control_plane
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes-apps, tags: apps }

- name: Apply resolv.conf changes now that cluster DNS is up
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, when: "dns_mode != 'none' and resolvconf_mode == 'host_resolvconf'", tags: resolvconf, dns_late: true }

Play 1 and 2: Boilerplate and Fact Gathering

The first two plays handle common setup tasks:

- name: Common tasks for every playbooks
  import_playbook: boilerplate.yml

- name: Gather facts
  import_playbook: internal_facts.yml

The boilerplate.yml playbook does things like validating your Ansible version, checking that your inventory is properly configured, and setting up bastion host SSH config if you are using one. It is the “sanity check before we do anything destructive” phase.

The internal_facts.yml playbook runs Ansible’s fact gathering across all nodes. This collects information about each machine: what operating system it runs, what IP addresses it has, how much memory, what network interfaces exist, and so on. This information becomes available as variables that later plays can use.

Notice that these use import_playbook rather than defining roles directly. This is a common pattern in Kubespray. The import_playbook directive pulls in another playbook file and executes it as if it were part of the current file.

Play 3: Preparing for etcd Installation

This play is where things start getting interesting:

- name: Prepare for etcd install
  hosts: "{{ groups['etcd'] | default([]) | union(groups['k8s_cluster'] | default([])) }}"
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: container-engine, tags: container-engine, when: deploy_container_engine }
    - { role: download, tags: download, when: "not skip_downloads" }

Let me explain each line.

The hosts directive uses Jinja2 templating to target both the etcd group and the k8s_cluster group. The union filter combines these two groups, and the default([]) ensures the playbook does not fail if either group is undefined. This means the play runs on all nodes that will either run etcd or be part of the Kubernetes cluster.

The gather_facts: false setting might seem strange. Why would we skip fact gathering? The answer is that we already gathered facts in the previous play (internal_facts.yml), and Kubespray caches these facts. Look at the ansible.cfg settings:

gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400

The gathering = smart setting tells Ansible to use cached facts if they exist and are still valid. Since we gathered facts just moments ago and they are cached in /tmp as JSON files, there is no need to gather them again. This saves time, especially when you have many nodes.
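If you are ever unsure whether a run will reuse cached facts, you can approximate the check yourself. This is a rough sketch under two assumptions: that the jsonfile cache judges freshness by the file's modification time, and that each host's cache file is named after the host. The helper name fact_cache_is_fresh is hypothetical.

```shell
# Succeed (exit 0) when a cache file was modified within the timeout window.
# fact_cache_is_fresh is a hypothetical helper for illustration.
fact_cache_is_fresh() {
  # $1 = cache file, $2 = timeout in seconds (86400 in the settings above)
  # find prints the path only when it was modified less than N minutes ago.
  [ -n "$(find "$1" -mmin "-$(( $2 / 60 ))" 2>/dev/null)" ]
}

# Usage (file name is an assumption):
# fact_cache_is_fresh /tmp/k8s-ctr 86400 && echo "cached facts still valid"
```

Deleting the cache file is the blunt way to force fresh facts on the next run.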

The any_errors_fatal setting deserves attention. When set to true, a failure on any single host aborts the run for every host: Ansible lets the current task finish on the remaining hosts and then stops. This is critical for cluster deployment because a half-configured cluster is worse than no cluster at all. If etcd fails to install on one node, you do not want the playbook to continue and try to bootstrap Kubernetes against a broken etcd cluster.

The environment directive sets environment variables for all tasks in this play. The proxy_disable_env variable is defined in roles/kubespray_defaults/defaults/main/main.yml:

proxy_disable_env:
  http_proxy: ""
  HTTP_PROXY: ""
  https_proxy: ""
  HTTPS_PROXY: ""
  no_proxy: ""
  NO_PROXY: ""

This clears all proxy-related environment variables. Cluster components communicate directly with each other, and you do not want that traffic going through an HTTP proxy.

Understanding Role Import Syntax

Look at how roles are imported:

roles:
  - { role: kubespray_defaults }
  - { role: kubernetes/preinstall, tags: preinstall }
  - { role: container-engine, tags: container-engine, when: deploy_container_engine }
  - { role: download, tags: download, when: "not skip_downloads" }

This is YAML's inline (flow) mapping syntax. It is equivalent to writing:

roles:
  - role: kubespray_defaults
  
  - role: kubernetes/preinstall
    tags: preinstall
  
  - role: container-engine
    tags: container-engine
    when: deploy_container_engine
  
  - role: download
    tags: download
    when: "not skip_downloads"

The curly brace syntax is more compact but can be harder to read. Kubespray uses it throughout.

Each role can have:

A name (the path under the roles/ directory)
Tags for selective execution
Conditions (when clauses) that determine if the role runs

The kubespray_defaults role always runs first and has no tags or conditions. This is intentional. It loads all the default variable values that other roles depend on. If you skip it, nothing else will work.

Why kubespray_defaults Runs First in Every Play

You will notice that every single play in cluster.yml starts with:

roles:
  - { role: kubespray_defaults }

This pattern is not accidental. The kubespray_defaults role lives at roles/kubespray_defaults/ and contains:

roles/kubespray_defaults/
├── defaults/main/
│   ├── main.yml        # 800+ lines of default values
│   └── download.yml    # 1100+ lines of download-related defaults
└── vars/main/
    ├── main.yml
    └── checksums.yml   # Binary checksums for integrity verification

This role does not configure anything on the nodes. Its job is to load variables. Every other role in Kubespray depends on these variables being available. Things like kube_version, container_manager, kube_network_plugin, and hundreds of other settings are defined here.

When Ansible runs a play, it loads role defaults at the beginning. By putting kubespray_defaults first, Kubespray ensures that all default values are loaded before any other role tries to use them.

Tag-Based Selective Execution

Tags are powerful. They let you run only specific parts of the playbook. Look at the tags assigned to each role:

- { role: kubernetes/preinstall, tags: preinstall }
- { role: container-engine, tags: container-engine, when: deploy_container_engine }
- { role: download, tags: download, when: "not skip_downloads" }

If you want to reinstall just the network plugin without rerunning the entire playbook, you can do:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags network

This runs only the roles and tasks tagged with “network”, plus anything tagged always, which Ansible runs regardless of the tags you pass. Everything else gets skipped.

Here is a quick reference of the major tags in cluster.yml:

preinstall       → kubernetes/preinstall role
container-engine → container-engine role
download         → download role
node             → kubernetes/node role
control-plane    → kubernetes/control-plane role
kubeadm          → kubernetes/kubeadm role
network          → network_plugin role
apps             → kubernetes-apps role
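A reference like the one above can be rebuilt from the playbook itself. The following is a small sketch, assuming the roles are written in the inline form shown earlier with tags directly after the role name; entries with a tag list or with other keys in between are skipped on purpose. The helper name role_tags is hypothetical.

```shell
# Print "tag -> role" for inline role entries that carry a single tag.
# role_tags is a hypothetical helper for illustration.
role_tags() {
  # $1 = playbook file
  sed -n 's/^[[:space:]]*- { role: \([^,} ]*\), tags: \([A-Za-z0-9_-][A-Za-z0-9_-]*\).*/\2 -> \1/p' "$1"
}

# Usage:
# role_tags cluster.yml
```

Ansible can also report tags on its own: adding --list-tags to an ansible-playbook command prints the tags of each play without executing anything.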

You can also combine tags:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags preinstall,container-engine

This runs only the preinstall and container-engine stages.

Play 4: Installing etcd

- name: Install etcd
  vars:
    etcd_cluster_setup: true
    etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"
  import_playbook: install_etcd.yml

This play imports the install_etcd.yml playbook and passes two variables using the vars directive.

Here is something subtle but important. The etcd_cluster_setup variable is set to true here. But wait, is this variable not already defined somewhere? Yes, it is. In roles/etcd_defaults/defaults/main.yml, you will find:

etcd_cluster_setup: true

So why set it again in the play?

The answer becomes clear when you look at scale.yml:

- name: Install etcd
  vars:
    etcd_cluster_setup: false
    etcd_events_cluster_setup: false
  import_playbook: install_etcd.yml

In scale.yml, when adding new nodes to an existing cluster, etcd_cluster_setup is set to false. The etcd cluster already exists; we do not want to reinitialize it. By setting this variable at the play level, Kubespray controls how install_etcd.yml behaves differently in cluster.yml versus scale.yml.

The etcd_events_cluster_enabled variable controls whether to set up a separate etcd cluster for Kubernetes events. This is an advanced feature for large clusters where event traffic can overwhelm the main etcd cluster. Most deployments leave this disabled.

Play 5: Installing Kubernetes Nodes

- name: Install Kubernetes nodes
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/node, tags: node }

This play targets the k8s_cluster group, which includes both control plane nodes and worker nodes. The kubernetes/node role installs kubelet, kubectl, and kubeadm on all nodes that will be part of the cluster.

Notice the pattern repeating: gather_facts: false (using cached facts), any_errors_fatal (fail fast), environment (no proxy), and kubespray_defaults first.

Play 6: Installing the Control Plane

- name: Install the control plane
  hosts: kube_control_plane
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/control-plane, tags: control-plane }

This play targets only kube_control_plane nodes. The kubernetes/control-plane role prepares these nodes to run the Kubernetes control plane components: kube-apiserver, kube-controller-manager, and kube-scheduler.

Play 7: Running kubeadm and Installing CNI

- name: Invoke kubeadm and install a CNI
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/kubeadm, tags: kubeadm }
    - { role: kubernetes/node-label, tags: node-label }
    - { role: kubernetes/node-taint, tags: node-taint, when: node_taints is defined }
    - { role: network_plugin, tags: network }

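This is where the cluster actually comes together. The kubernetes/kubeadm role runs kubeadm join on every node that is not yet part of the cluster. The kubeadm init call itself already happened on the first control plane node during the previous play, inside the kubernetes/control-plane role.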

After the cluster is bootstrapped, the node-label role applies any custom labels you have defined. The node-taint role applies taints, but only if you have defined node_taints in your variables.

Finally, the network_plugin role installs your chosen CNI plugin. If you set kube_network_plugin: calico in your variables, this role installs Calico. If you set kube_network_plugin: flannel, it installs Flannel. The role reads your configuration and acts accordingly.

Play 8: Calico Route Reflector

- name: Install Calico Route Reflector
  hosts: calico_rr
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: network_plugin/calico/rr, tags: ['network', 'calico_rr'] }

This play only runs if you have a calico_rr group defined in your inventory. Calico Route Reflector is an advanced feature for large clusters using BGP routing. Most deployments do not use this, and if the calico_rr group is empty or undefined, this play does nothing.

Notice the tags syntax here: tags: ['network', 'calico_rr']. This assigns two tags to the role, so you can target it with either --tags network or --tags calico_rr.

Play 9: Windows Node Support

- name: Patch Kubernetes for Windows
  hosts: kube_control_plane[0]
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: win_nodes/kubernetes_patch, tags: k8s-windows }

This play applies patches needed for Windows node support. It targets only the first control plane node (kube_control_plane[0]) because it just needs to apply some manifests to the cluster.

If you are not running Windows workers, this play effectively does nothing. The win_nodes/kubernetes_patch role checks whether Windows support is enabled and skips itself if not.

Play 10: Installing Kubernetes Apps

- name: Install Kubernetes apps
  hosts: kube_control_plane
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes-apps, tags: apps }

This play installs cluster addons: CoreDNS, metrics-server, Helm, Ingress controllers, and whatever else you have enabled in your addons.yml file. It runs on control plane nodes because it needs kubectl access to apply manifests.

Play 11: Finalizing DNS Configuration

- name: Apply resolv.conf changes now that cluster DNS is up
  hosts: k8s_cluster
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, when: "dns_mode != 'none' and resolvconf_mode == 'host_resolvconf'", tags: resolvconf, dns_late: true }

This final play updates /etc/resolv.conf on all cluster nodes now that CoreDNS is running. The kubernetes/preinstall role is called again, but this time with dns_late: true. This variable changes the role’s behavior, making it update DNS configuration rather than doing general preinstall tasks.

The when clause ensures this only runs if cluster DNS is enabled (dns_mode != 'none') and Kubespray is managing the host's resolv.conf (resolvconf_mode == 'host_resolvconf').
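The condition is a plain logical AND, which the following sketch mirrors in shell so you can reason about the combinations. The function name should_update_resolvconf is hypothetical; the argument values are the ones that appear in the when clause.

```shell
# Mirror of the play's when clause: both tests must hold for the role to run.
# should_update_resolvconf is a hypothetical helper for illustration.
should_update_resolvconf() {
  # $1 = dns_mode, $2 = resolvconf_mode
  [ "$1" != "none" ] && [ "$2" = "host_resolvconf" ]
}

should_update_resolvconf coredns host_resolvconf && echo "play runs"      # prints play runs
should_update_resolvconf none host_resolvconf    || echo "play is skipped" # prints play is skipped
```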

Mapping to Kubernetes The Hard Way

If you have gone through Kubernetes The Hard Way, you might wonder how Kubespray’s stages correspond to those manual steps. Here is the mapping:

Kubespray Role/Play              | The Hard Way Equivalent
---------------------------------|----------------------------------
kubernetes/preinstall            | OS configuration (sysctl, modules)
container-engine/containerd      | Containerd installation
download                         | Downloading binaries
etcd                             | Bootstrapping etcd cluster
kubernetes/node                  | Worker node setup (kubelet)
kubernetes/control-plane         | Control plane bootstrapping (kubeadm init)
kubernetes/kubeadm               | Joining remaining nodes (kubeadm join)
network_plugin                   | Pod network configuration
kubernetes-apps                  | DNS and addon installation

The certificate generation that took an entire chapter in The Hard Way? Kubespray delegates that to kubeadm, which handles it automatically during kubeadm init.

The kubeconfig file generation? Also handled by kubeadm.

The data encryption configuration? Kubespray has options for that in the variables, and kubeadm configures it.

This is the power of Kubespray. It takes all those manual steps and encodes them into Ansible roles that execute reliably and repeatably.

Conditional Variables That Control Behavior

Several variables act as switches that change what cluster.yml does:

The deploy_container_engine variable defaults to true. If you set it to false, Kubespray skips container runtime installation entirely, assuming you have already installed containerd or another runtime yourself.

- { role: container-engine, tags: container-engine, when: deploy_container_engine }

The skip_downloads variable does what it says. Set it to true if you have pre-downloaded all binaries and images (useful for air-gap deployments):

- { role: download, tags: download, when: "not skip_downloads" }

The etcd_events_cluster_enabled variable controls whether to create a separate etcd cluster for Kubernetes events:

etcd_events_cluster_setup: "{{ etcd_events_cluster_enabled }}"

The dns_mode variable determines which cluster DNS to install. The default is coredns. Setting it to none skips DNS installation:

when: "dns_mode != 'none' and resolvconf_mode == 'host_resolvconf'"

The resolvconf_mode variable controls how Kubespray manages DNS resolution on nodes. The default host_resolvconf means Kubespray will modify /etc/resolv.conf.

Running the Playbook

Now that you understand the structure, running the playbook is straightforward:

cd ~/kubespray

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v

The -i flag specifies your inventory file. The -b flag enables become (sudo). The -v flag enables verbose output.

For a full deployment, expect this to take 15–30 minutes depending on your network speed and hardware. The download stage often takes the longest as it pulls container images and binaries.

If something fails, you can usually fix the issue and rerun the same command. Ansible’s idempotency means tasks that already completed successfully will show as “ok” and run quickly, while the failed tasks retry.

To run only specific stages:

# Just reinstall the network plugin
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --tags network

# Reinstall apps only
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --tags apps

# Run everything except downloads (useful when the artifacts are already on the nodes)
ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --skip-tags download

The --skip-tags option is the inverse of --tags. It runs everything except the specified tags.

Watching the Execution

When you run cluster.yml, you will see output like this:

PLAY [Common tasks for every playbooks] ****************************************

TASK [Check that python netaddr is installed] **********************************
ok: [localhost]

PLAY [Gather facts] ************************************************************

TASK [Gather minimal facts] ****************************************************
ok: [k8s-ctr]

PLAY [Prepare for etcd install] ************************************************

TASK [kubespray_defaults : Load kubespray defaults] ****************************
ok: [k8s-ctr]

TASK [kubernetes/preinstall : Set facts] ***************************************
ok: [k8s-ctr]
...

Each PLAY corresponds to a play in the playbook. Each TASK corresponds to a task within a role. The output shows which host the task runs on and whether it changed anything.

At the end, you get a PLAY RECAP:

PLAY RECAP *********************************************************************
k8s-ctr    : ok=523  changed=147  unreachable=0  failed=0  skipped=892  rescued=0  ignored=2

This tells you:

ok: Tasks that completed successfully (this count also includes the changed ones)
changed: Tasks that made changes to the system
unreachable: Hosts that could not be contacted
failed: Tasks that failed
skipped: Tasks that were skipped due to conditions
rescued: Tasks that failed but were rescued by a rescue block
ignored: Tasks that failed but had ignore_errors: true

A successful run has failed=0 and unreachable=0.
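If you capture the playbook output in a log, that success rule is easy to automate. Here is a minimal sketch that reads Ansible output on standard input and succeeds only when every host line in the recap shows failed=0 and unreachable=0. The helper name recap_ok and the log file name in the usage comment are hypothetical.

```shell
# Succeed (exit 0) only if every PLAY RECAP host line reports failed=0 and unreachable=0.
# recap_ok is a hypothetical helper for illustration; it reads from stdin.
recap_ok() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /failed=/ {
      seen = 1
      # The trailing group stops "failed=0" from matching "failed=01" and similar.
      if ($0 !~ /unreachable=0([^0-9]|$)/ || $0 !~ /failed=0([^0-9]|$)/) bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Usage (log file name is an assumption):
# ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v | tee deploy.log
# recap_ok < deploy.log && echo "deployment succeeded"
```

The function also fails when no recap is found at all, which covers runs that crashed before reaching the end.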

The profile_tasks callback enabled in ansible.cfg also shows timing information:

Tuesday 28 January 2026  15:23:45 +0900 (0:00:02.456)

===============================================================================
download : Download_container | Download image if required ----------- 48.84s
download : Download_container | Download image if required ----------- 33.23s
kubernetes/kubeadm : Join to cluster if needed ----------------------- 15.97s
container-engine : Containerd | Unpack containerd archive ------------- 8.34s
...

This helps identify which tasks take the longest, useful for optimization or debugging slow deployments.

Chapter 8: Lab Environment Setup

Before we can deploy a Kubernetes cluster with Kubespray, we need machines to deploy it on. In this chapter, we will set up a lab environment using Vagrant and VirtualBox. By the end of this chapter, you will have a Rocky Linux virtual machine ready to become a single-node Kubernetes cluster.

Prerequisites

You will need the following software installed on your host machine:

VirtualBox version 7.2.4 or later. Earlier versions have compatibility issues with newer Linux kernels, particularly Rocky Linux 10 which ships with kernel 6.x. If you are running an older version of VirtualBox, upgrade it before proceeding.

Vagrant version 2.4.x or later. Vagrant will orchestrate the virtual machine lifecycle and handle the initial provisioning.

Your host machine should have at least 8 GB of RAM because we will allocate 4 GB to the virtual machine. You also need approximately 20 GB of free disk space for the VM disk image and downloaded artifacts.

To verify your installed versions, run these commands:

VBoxManage --version
vagrant --version

If you see version numbers that meet the requirements, you are ready to proceed.
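Comparing version strings by eye gets error-prone once numbers like 7.10 show up, so here is a small sketch that does the comparison for you. It assumes a sort that supports -V (GNU coreutils does). The helper name version_ge is hypothetical, and the commented usage assumes VBoxManage prints the version followed by a revision suffix such as r12345.

```shell
# Succeed (exit 0) when version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper for illustration; it relies on "sort -V".
version_ge() {
  # The smaller of the two versions sorts first; $1 >= $2 exactly when that is $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (output format of VBoxManage is an assumption):
# vbox=$(VBoxManage --version | sed 's/[^0-9.].*//')   # strip any revision suffix
# version_ge "$vbox" 7.2.4 || echo "VirtualBox is too old for this lab"
```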

Windows Users: Disable Hyper-V First

If you are running Windows, there is a critical step you must complete before anything else. Windows has a feature called Hyper-V, which is a Type-1 hypervisor built into the operating system. When Hyper-V is enabled, it takes exclusive control of the CPU’s hardware virtualization features (Intel VT-x or AMD-V). This prevents VirtualBox from using native virtualization, forcing it to fall back to a compatibility backend called NEM (Native Execution Manager), which runs guests through the Windows hypervisor instead.

The problem is that Rocky Linux 10 uses a modern 6.x kernel that does not play well with NEM. When VirtualBox tries to boot Rocky Linux under NEM, you will see a kernel panic within seconds of boot:

[    1.511197] ---[ end trace 0000000000000000 ]---
[    1.513571] RIP: 0010:wait_for_xmitr+0x61/0xc0
[    1.559573] Kernel panic - not syncing: Fatal exception
[    1.563919] ---[ end Kernel panic - not syncing: Fatal exception ]---

I spent an embarrassing amount of time staring at this error before discovering the root cause. The VirtualBox log file contained a telltale message: “HM: HMR3Init: Attempting fall back to NEM: VT-x is not available” which indicated that Hyper-V was claiming the virtualization hardware.

You might wonder why Hyper-V was enabled in the first place. Several Windows features depend on it: WSL2 (Windows Subsystem for Linux 2), Docker Desktop when using the WSL2 backend, and Windows Sandbox. If you have used any of these features, Hyper-V is probably active on your system.

To disable Hyper-V and its related components, open PowerShell as Administrator and run these commands:

Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart
Disable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -NoRestart

Then disable the hypervisor launch at boot:

bcdedit /set hypervisorlaunchtype off

Finally, restart your computer:

Restart-Computer

After the reboot, verify that Hyper-V is truly disabled. Run this command, which should produce no output if Hyper-V is off:

Get-WindowsOptionalFeature -Online | Where-Object {$_.FeatureName -like "*Hyper-V*" -and $_.State -eq "Enabled"}

Also verify the boot configuration:

bcdedit /enum | Select-String hypervisorlaunchtype

The output should show “hypervisorlaunchtype Off”. If it shows “Auto” or anything else, the disable did not take effect and you need to troubleshoot further.

Be aware that disabling Hyper-V means WSL2 and Docker Desktop (with WSL2 backend) will stop working. You can re-enable everything after completing this lab, which I will explain at the end of the chapter.

Creating the Vagrantfile

Create a new directory for this lab and create a file named Vagrantfile with the following content:

BOX_IMAGE = "bento/rockylinux-10.0"
BOX_VERSION = "202510.26.0"

Vagrant.configure("2") do |config|
  config.vm.define "k8s-ctr" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "k8s-ctr"
      vb.cpus = 4
      vb.memory = 4096
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "k8s-ctr"
    subconfig.vm.network "private_network", ip: "192.168.10.10"
    subconfig.vm.network "forwarded_port", guest: 22, host: "60100", auto_correct: true, id: "ssh"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
    subconfig.vm.provision "shell", path: "init_cfg.sh"
  end
end

Let me explain what each section does.

The BOX_IMAGE and BOX_VERSION variables specify the Vagrant box to use. We are using the Bento project’s Rocky Linux 10.0 image, which is a well-maintained community box. Pinning the version ensures reproducibility.

The vm.provider block configures VirtualBox-specific settings. The --groups option organizes the VM into a folder called “Kubespray-Lab” in the VirtualBox Manager GUI. The --nicpromisc2 option enables promiscuous mode on the second network adapter, which is necessary for some CNI plugins to function correctly. We allocate 4 CPU cores and 4096 MB of memory, which is sufficient for a single-node cluster. The linked_clone option saves disk space by creating a copy-on-write clone rather than a full copy of the base image.

The network configuration creates two network interfaces. VirtualBox always creates a NAT interface as the first adapter, which provides outbound internet connectivity. We add a private_network with a static IP of 192.168.10.10, which creates a Host-Only network that allows the host machine and VM to communicate directly. The forwarded_port directive maps port 60100 on the host to port 22 on the guest, providing an alternative way to SSH into the VM.

The synced_folder line disables the default Vagrant folder synchronization. We do not need it and disabling it avoids potential permission issues.

Finally, the provision directive tells Vagrant to run a shell script called init_cfg.sh after the VM boots for the first time.

The Initialization Script

Create a file named init_cfg.sh in the same directory as your Vagrantfile:

#!/usr/bin/env bash

echo ">>>> Initial Config Start <<<<"
echo "[TASK 1] Change Timezone and Enable NTP"
timedatectl set-local-rtc 0
timedatectl set-timezone Asia/Seoul
echo "[TASK 2] Disable firewalld and selinux"
systemctl disable --now firewalld >/dev/null 2>&1
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
echo "[TASK 3] Disable and turn off SWAP & Delete swap partitions"
swapoff -a
sed -i '/swap/d' /etc/fstab
sfdisk --delete /dev/sda 2 >/dev/null 2>&1
partprobe /dev/sda >/dev/null 2>&1
echo "[TASK 4] Config kernel & module"
cat << EOF > /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
modprobe overlay >/dev/null 2>&1
modprobe br_netfilter >/dev/null 2>&1
cat << EOF >/etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sysctl --system >/dev/null 2>&1
echo "[TASK 5] Setting Local DNS Using Hosts file"
sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts
cat << EOF >> /etc/hosts
192.168.10.10 k8s-ctr
EOF
echo "[TASK 6] Delete default routing - Secondary NIC"
SECONDARY_NIC=$(ip -o -4 addr show | grep "192.168.10" | awk '{print $2}')
if [ -n "$SECONDARY_NIC" ]; then
  echo "Found secondary NIC: $SECONDARY_NIC, disabling default route..."
  nmcli connection modify "$SECONDARY_NIC" ipv4.never-default yes 2>/dev/null || true
  nmcli connection up "$SECONDARY_NIC" 2>/dev/null || true
else
  echo "No secondary NIC found, skipping..."
fi
echo "sudo su -" >> /home/vagrant/.bashrc
echo ">>>> Initial Config End <<<<"

This script performs essential pre-configuration that Kubernetes requires. Let me walk through each task in detail.

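Task 1 sets the timezone and configures the system clock. The timedatectl set-local-rtc 0 command tells the system to keep the hardware clock in UTC, which is the recommended setting for servers. Change Asia/Seoul to your preferred timezone. Despite the task label, the script does not run an explicit NTP command; it assumes the box already has a time synchronization service such as chronyd enabled, which you can confirm with timedatectl status. Accurate time is important for Kubernetes because certificate validation, log correlation, and distributed coordination all depend on synchronized clocks.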

Task 2 disables the firewall and sets SELinux to permissive mode. In a production environment, you would configure proper firewall rules and SELinux policies. For a lab environment, disabling them eliminates potential networking issues that could complicate troubleshooting. The firewalld service is stopped and disabled. SELinux is set to permissive mode, which logs policy violations without enforcing them. The sed command makes this change persistent across reboots by modifying the SELinux configuration file.

Task 3 disables swap. By default, the kubelet refuses to start on a node with swap enabled, and swap support in Kubernetes requires explicit configuration. When swap is active, it can also cause unpredictable latency spikes and interfere with resource management. The swapoff -a command immediately disables all swap. The sed command removes any swap entries from /etc/fstab so swap does not re-enable after a reboot. The sfdisk command attempts to delete the swap partition entirely, though this may fail depending on the disk layout.

Task 4 loads kernel modules and sets sysctl parameters that container networking requires. The overlay module enables OverlayFS, which containerd uses for efficient container image layer management. The br_netfilter module enables netfilter to process traffic traversing network bridges, which is necessary for iptables rules to work correctly with container networks. The sysctl parameters enable IP forwarding and bridge netfilter processing for both IPv4 and IPv6.

Task 5 adds a hosts file entry so the VM can resolve its own hostname to the correct IP address. This is important because Kubernetes components need to resolve node names to IP addresses.

Task 6 is where things get interesting and where I encountered problems on my first attempt. VirtualBox VMs have two network interfaces: the NAT interface (usually eth0 or enp0s3) and the Host-Only interface (usually eth1 or enp0s8). By default, both interfaces might try to set a default route, and the NAT interface typically wins. This causes problems later when Kubespray tries to determine the node’s IP address because it might pick up the NAT IP (10.0.2.15) instead of the Host-Only IP (192.168.10.10).

The original version of this script hardcoded the interface name as enp0s9, but interface names can vary depending on the VirtualBox version and the order of network adapter configuration. I modified the script to dynamically detect the interface by looking for the one that has an IP address in the 192.168.10.0/24 range:

SECONDARY_NIC=$(ip -o -4 addr show | grep "192.168.10" | awk '{print $2}')

The ip -o -4 addr show command outputs IPv4 address information in a single-line format. We grep for our known IP range and extract the interface name with awk. Once we have the interface name, we tell NetworkManager to never use this interface as the default route with the ipv4.never-default option. This ensures that outbound traffic to the internet goes through the NAT interface while still allowing direct communication over the Host-Only network.
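
One subtlety is worth illustrating: in the grep pattern the dots match any character and there is no trailing dot, so "192.168.10" would also match an address such as 192.168.100.5. The sketch below is a stricter variant that compares a literal prefix against the address field. The function name nic_for_prefix and the sample interface names are my own assumptions for illustration; the init script itself keeps the simpler grep.

```shell
#!/usr/bin/env bash
# Print the first interface whose IPv4 address starts with the given literal prefix.
# Reads `ip -o -4 addr show` style lines on stdin:
#   field 2 is the interface name, field 4 is the address in CIDR form.
nic_for_prefix() {
  awk -v prefix="$1" '$3 == "inet" && index($4, prefix) == 1 { print $2; exit }'
}

# Sample output for illustration only; interface names vary between systems.
sample_output='1: lo    inet 127.0.0.1/8 scope host lo
2: enp0s3    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
3: enp0s8    inet 192.168.10.10/24 brd 192.168.10.255 scope global enp0s8'

printf '%s\n' "$sample_output" | nic_for_prefix "192.168.10."   # prints: enp0s8
```

On the VM you would feed it live data instead of the sample, for example: ip -o -4 addr show | nic_for_prefix "192.168.10."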

The last line adds “sudo su -” to the vagrant user’s bashrc so you automatically become root when you SSH into the VM. This is purely a convenience for lab environments.

Launching the Virtual Machine

With both files in place, launch the VM:

vagrant up k8s-ctr

Vagrant will download the Rocky Linux box if it is not already cached, create the VM, and run the initialization script. The output will show each task being executed. When you see “Initial Config End” followed by the Vagrant completion message, the VM is ready.

Verify the VM is running:

vagrant status

You should see:

Current machine states:
k8s-ctr                   running (virtualbox)

SSH into the VM to verify everything is working:

vagrant ssh k8s-ctr

You should be logged in as the vagrant user and immediately switched to root due to the bashrc modification. Verify the network configuration:

ip addr show

You should see at least two interfaces with IP addresses. One will have 10.0.2.15 (the NAT interface) and another will have 192.168.10.10 (the Host-Only interface). Verify the hostname resolves correctly:

hostname
ping -c 1 k8s-ctr

Both commands should work without errors.

Windows Troubleshooting: VERR_ALREADY_EXISTS Error

If you are on Windows and encounter an error containing VERR_ALREADY_EXISTS when running vagrant up a second time or after a failed first attempt, the problem is leftover VM directories. The --groups setting causes VirtualBox to move the VM into a folder, and if that folder already exists from a previous attempt, VirtualBox throws an error.

To fix this, destroy the VM and clean up the directories:

vagrant destroy -f k8s-ctr

Then remove the leftover directories in PowerShell:

Remove-Item -Recurse -Force "C:\Users\$env:USERNAME\VirtualBox VMs\Kubespray-Lab" -ErrorAction SilentlyContinue
Remove-Item -Recurse -Force "C:\Users\$env:USERNAME\VirtualBox VMs\k8s-ctr" -ErrorAction SilentlyContinue
Remove-Item -Recurse -Force .vagrant -ErrorAction SilentlyContinue

Now try vagrant up again. If the problem persists, you can work around it by removing the --groups line from the Vagrantfile:

# vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]

The --groups option is purely cosmetic. It organizes VMs into folders in the VirtualBox Manager UI but has no effect on VM operation.

Windows Troubleshooting: Unknown Connection Error

If the init_cfg.sh script fails with “Error: unknown connection ‘enp0s9’” or a similar message, it means the original script’s hardcoded interface name does not match your system. Make sure you are using the modified version of the script that dynamically detects the interface name using the IP address pattern:

SECONDARY_NIC=$(ip -o -4 addr show | grep "192.168.10" | awk '{print $2}')

This approach works regardless of what VirtualBox names the interfaces on your particular system.

Setting Up SSH Keys for Ansible

Kubespray uses Ansible, which connects to target machines via SSH. For password-less authentication, we need to set up SSH key-based access from the Ansible control node to the target nodes. In this single-node lab, the VM will act as both the Ansible control node and the target node, so it needs to be able to SSH to itself.

First, generate an SSH key pair if you do not already have one. From your host machine or from within the VM (depending on where you plan to run Ansible from):

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

The -N "" option creates a key without a passphrase. In production, you would use a passphrase and an SSH agent, but for a lab environment, a passphrase-less key simplifies automation.

Copy the public key to the VM. If you are running Ansible from your host machine, use the forwarded port:

ssh-copy-id -o StrictHostKeyChecking=no -p 60100 vagrant@127.0.0.1

Enter “vagrant” when prompted for the password.

If you plan to run Ansible from within the VM itself (which is what we will do in the deployment chapter), SSH into the VM and set up local SSH access:

vagrant ssh k8s-ctr

Then generate a key and copy it to localhost:

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
ssh-copy-id -o StrictHostKeyChecking=no root@192.168.10.10

Enter the root password when prompted. On this Vagrant box, the root password is typically “vagrant”.

Verify password-less SSH works:

ssh root@192.168.10.10 "hostname"

If this prints “k8s-ctr” without asking for a password, SSH key authentication is working correctly.

Re-enabling Hyper-V After the Lab

If you disabled Hyper-V on Windows and want to restore WSL2 and Docker Desktop functionality after completing this lab, run these commands in an Administrator PowerShell:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart
Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart
Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -NoRestart
bcdedit /set hypervisorlaunchtype auto

Then restart your computer:

Restart-Computer

Your WSL2 distributions and their data will still be intact. Disabling and re-enabling Hyper-V does not affect the stored data.

Verifying the Environment

Before moving on, let us verify that the VM is correctly configured for Kubernetes:

Check that swap is disabled:

swapon --show

This should produce no output. If it shows swap partitions, the disable did not work and you need to troubleshoot.

Check that the kernel modules are loaded:

lsmod | grep -E "overlay|br_netfilter"

You should see both overlay and br_netfilter in the output.

Check the sysctl parameters:

sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward

All three should show a value of 1.

Check the SELinux status:

getenforce

This should show “Permissive”.

Check that the firewall is disabled:

systemctl status firewalld

This should show “inactive (dead)”.
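
The individual checks above can also be bundled into one script that prints a PASS or FAIL line per item. This is a rough sketch under the assumption that it runs as root inside the VM; the function names are hypothetical helpers, not part of Kubespray or Vagrant.

```shell
#!/usr/bin/env bash
# Run a named check and report PASS or FAIL without aborting the script.
check() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $description"
  else
    echo "FAIL: $description"
  fi
}

# Each helper returns 0 when the node is in the state Kubernetes expects.
swap_is_off()        { [ -z "$(swapon --show)" ]; }
module_loaded()      { lsmod | grep -q "^$1 "; }
sysctl_is_one()      { [ "$(sysctl -n "$1")" = "1" ]; }
selinux_permissive() { [ "$(getenforce)" = "Permissive" ]; }
firewalld_inactive() { ! systemctl is-active --quiet firewalld; }

run_preflight() {
  check "swap is disabled"            swap_is_off
  check "overlay module loaded"       module_loaded overlay
  check "br_netfilter module loaded"  module_loaded br_netfilter
  check "bridge-nf-call-iptables = 1" sysctl_is_one net.bridge.bridge-nf-call-iptables
  check "bridge-nf-call-ip6tables = 1" sysctl_is_one net.bridge.bridge-nf-call-ip6tables
  check "ip_forward = 1"              sysctl_is_one net.ipv4.ip_forward
  check "SELinux is permissive"       selinux_permissive
  check "firewalld is not running"    firewalld_inactive
}

# On the VM, source this file and then call:
#   run_preflight
```

Because check never exits on failure, a single run shows every item that still needs attention rather than stopping at the first one.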

If all these checks pass, your lab environment is ready for Kubespray. In the next chapter, we will configure the Kubespray inventory and customize the cluster variables before deploying Kubernetes.

Chapter 9: Inventory Configuration and Variable Modification

With your lab environment up and running, you now have a Rocky Linux VM sitting at 192.168.10.10, waiting to become a Kubernetes node. But Kubespray does not know anything about your infrastructure yet. You need to tell it what machines exist, what roles they should play, and how the cluster should be configured. This is where inventory configuration comes in.

Kubespray ships with a sample inventory that serves as both a template and a learning resource. The sample contains sensible defaults and commented examples that show you what options are available. Your first step is to copy this sample and make it your own.

cp -rfp /root/kubespray/inventory/sample /root/kubespray/inventory/mycluster

The flags here matter. The -r flag copies recursively since the inventory is a directory containing subdirectories. The -f flag forces the copy: if an existing destination file cannot be opened for writing, cp removes it and tries again. The -p flag preserves the original file attributes including permissions, ownership, and timestamps. This last flag is often overlooked but can save you from mysterious permission issues later.

After running this command, you have a complete inventory structure at /root/kubespray/inventory/mycluster that mirrors the sample. Now you need to edit it to match your actual infrastructure.

The Inventory File

The heart of any Ansible-based deployment is the inventory file. This file tells Ansible which machines to manage and how to connect to them. In Kubespray, the inventory also determines which machines become control plane nodes, which become workers, and which host etcd.

Open the inventory file and replace its contents with your cluster definition:

cat << EOF > /root/kubespray/inventory/mycluster/inventory.ini
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10

[kube_control_plane]
k8s-ctr

[etcd:children]
kube_control_plane

[kube_node]
k8s-ctr
EOF

This inventory defines a single-node cluster where one machine plays all roles. Let me walk through each part because understanding this structure will save you hours of debugging later.

The first line defines a host:

k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10

Here, k8s-ctr is the hostname that Ansible will use internally. It is an alias, a convenient name you can reference throughout your playbooks and inventory. The ansible_host variable tells Ansible the actual IP address or hostname to use when establishing SSH connections. When Ansible needs to connect to k8s-ctr, it will SSH to 192.168.10.10.

The ip variable is specific to Kubespray, not a standard Ansible variable. Kubespray uses this to determine which IP address the Kubernetes components should bind to and advertise. This distinction between ansible_host and ip becomes critical in environments with multiple network interfaces.

You might wonder why both variables are set to the same value. In simple environments, they often are identical. But consider a scenario where you have a bastion host or jump server. You might SSH through one IP address but want Kubernetes to communicate over a different internal network. Or consider our VirtualBox environment where each VM has multiple network interfaces.

The VirtualBox NAT IP Trap

If you have been following this tutorial series from the beginning, you have encountered this problem before. VirtualBox creates a NAT interface as the first network adapter on each VM. This interface gets the IP address 10.0.2.15 on every single VM. It is designed for outbound internet access from the VM, not for inter-VM communication.

When Ansible gathers facts about a host, it collects information about all network interfaces. Kubespray then tries to determine which IP address to use for Kubernetes components. If you do not explicitly set the ip variable, Kubespray might pick 10.0.2.15 because it appears first in the interface list. The result is a cluster where the API server advertises itself at 10.0.2.15, and every worker node tries to contact that address, which on each worker is its own NAT interface rather than the control plane. Nothing works.

I have made this mistake more times than I care to admit. Even after documenting it thoroughly in previous tutorials, I still forget to set the ip variable when setting up new test clusters. The symptom is always the same: kubeadm init succeeds on the control plane, but kubeadm join fails on workers with connection refused errors pointing to 10.0.2.15.

By explicitly setting ip=192.168.10.10, you tell Kubespray to use the Host-Only network interface for all Kubernetes communication. This is the interface where your VMs can actually reach each other.
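
Since this mistake is easy to repeat, a small pre-flight check on the inventory file can catch it before a playbook run. The sketch below lists hosts that define ansible_host but omit ip. The function name hosts_missing_ip and the two-node demo inventory are hypothetical and exist only for illustration; the check also assumes the simple INI layout used in this chapter.

```shell
#!/usr/bin/env bash
# List inventory hosts that set ansible_host but omit Kubespray's ip variable.
hosts_missing_ip() {
  awk '
    /^[ \t]*$/     { next }   # skip blank lines
    /^[ \t]*[#;]/  { next }   # skip comments
    /^[ \t]*\[/    { next }   # skip [group] headers
    /ansible_host=/ {
      has_ip = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^ip=/) has_ip = 1
      if (!has_ip) print $1
    }
  ' "$1"
}

# Hypothetical two-node inventory written to a temporary file for the demo.
demo_inventory="$(mktemp)"
cat > "$demo_inventory" << 'EOF'
node-a ansible_host=192.168.10.11 ip=192.168.10.11
node-b ansible_host=192.168.10.12

[kube_node]
node-a
node-b
EOF

hosts_missing_ip "$demo_inventory"   # prints: node-b
rm -f "$demo_inventory"
```

An empty result means every host line carries an explicit ip value. For the lab inventory you would run it as hosts_missing_ip /root/kubespray/inventory/mycluster/inventory.ini.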

Understanding Ansible Groups

The inventory file defines several groups using the bracket notation. Each group serves a specific purpose in Kubespray.

[kube_control_plane]
k8s-ctr

This group lists all nodes that should run Kubernetes control plane components. These are the nodes that will run kube-apiserver, kube-controller-manager, and kube-scheduler. In a production cluster, you would list three or five nodes here for high availability. For this single-node lab, only k8s-ctr appears in the list.

[kube_node]
k8s-ctr

This group lists all nodes that should run workloads. In Kubernetes terminology, these are the worker nodes. They run kubelet and kube-proxy, and the scheduler can place pods on them. Notice that k8s-ctr appears here as well. This means your control plane node will also accept workloads. In production, you often want control plane nodes dedicated to cluster management, but for a learning environment, having one node do everything is perfectly fine.

The etcd group is more interesting:

[etcd:children]
kube_control_plane

This uses Ansible’s nested group feature. Instead of listing hosts directly, the :children suffix tells Ansible that the entries below are names of other groups, not hosts. This line says “the etcd group contains all hosts that are members of kube_control_plane.”

This pattern is elegant because it creates an automatic relationship. If you add a node to kube_control_plane, it automatically becomes an etcd member too. For the common case where etcd runs on control plane nodes, this reduces duplication and prevents inconsistencies.

But this is not the only way to configure etcd. In production environments with strict performance requirements, you might want dedicated etcd nodes that do nothing else. In that case, you would define the etcd group directly:

[etcd]
etcd-1 ansible_host=192.168.10.20
etcd-2 ansible_host=192.168.10.21
etcd-3 ansible_host=192.168.10.22

Kubespray supports both patterns. The sample inventory uses the children approach because it matches the common deployment model where etcd is colocated with control plane nodes.

The Implicit all Group

You might have noticed that the host definition line appears outside any group:

k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10

In Ansible, every host in the inventory automatically belongs to the special all group, whether it is defined on a standalone line like this one or only listed under a named group. This group is reserved by Ansible and always contains every host in the inventory. Kubespray uses this property extensively. Variables defined in group_vars/all/ apply to every single node, regardless of what other groups they belong to.

Verifying Your Inventory

Before running any playbooks, verify that Ansible interprets your inventory correctly. Mistakes in inventory files often produce confusing errors during playbook execution. It is much easier to catch them early.

Check the etcd group:

ansible -i /root/kubespray/inventory/mycluster/inventory.ini etcd --list-hosts

The output should show:

hosts (1):
    k8s-ctr

Check the control plane group:

ansible -i /root/kubespray/inventory/mycluster/inventory.ini kube_control_plane --list-hosts

hosts (1):
    k8s-ctr

Check the worker node group:

ansible -i /root/kubespray/inventory/mycluster/inventory.ini kube_node --list-hosts

hosts (1):
    k8s-ctr

And verify the all group contains your host:

ansible -i /root/kubespray/inventory/mycluster/inventory.ini all --list-hosts

hosts (1):
    k8s-ctr

All four groups resolve to the same single host, which is exactly what we expect for a single-node cluster. If any of these commands show unexpected results, go back and check your inventory file syntax carefully. A misplaced bracket or typo can cause hosts to end up in the wrong groups.

Modifying Cluster Configuration

With the inventory defining your infrastructure, the next step is configuring how Kubespray should build the cluster. These settings live in the group_vars directory, which contains YAML files that set Ansible variables.

Kubespray provides sensible defaults for everything, but you will almost always want to customize some settings. For this tutorial, we will make several changes to k8s-cluster.yml, the file that controls core Kubernetes settings.

The changes we need to make are listed below. The commands use relative paths, so run them from the /root/kubespray directory:

First, switch the CNI plugin from Calico to Flannel. Calico is the default and is an excellent choice for production, but Flannel is simpler and easier to understand when learning:

sed -i 's|kube_network_plugin: calico|kube_network_plugin: flannel|g' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

Second, switch kube-proxy from IPVS mode to iptables mode. IPVS offers better performance at scale, but iptables is the traditional mode and is easier to debug:

sed -i 's|kube_proxy_mode: ipvs|kube_proxy_mode: iptables|g' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

Third, disable NodeLocal DNSCache. This feature improves DNS performance by running a caching DNS server on each node, but it adds complexity that we do not need for learning:

sed -i 's|enable_nodelocaldns: true|enable_nodelocaldns: false|g' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

Fourth, enable automatic certificate renewal. The certificates that kubeadm issues for cluster components expire after one year by default, and this setting ensures they are renewed automatically:

sed -i 's|auto_renew_certificates: false|auto_renew_certificates: true|g' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

Fifth, uncomment the certificate renewal schedule. The setting exists in the file but is commented out by default:

sed -i 's|# auto_renew_certificates_systemd_calendar|auto_renew_certificates_systemd_calendar|g' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

After running all these commands, verify the changes took effect:

grep -iE 'kube_network_plugin:|kube_proxy_mode|enable_nodelocaldns:|^auto_renew_certificates' \
  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

You should see:

kube_network_plugin: flannel
kube_proxy_mode: iptables
enable_nodelocaldns: false
auto_renew_certificates: true
auto_renew_certificates_systemd_calendar: "Mon *-*-1,2,3,4,5,6,7 03:00:00"
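
One caveat with plain sed substitutions is that they do nothing, silently, when the expected text is not in the file, for example if a default changes in a later Kubespray release. The sketch below shows a stricter helper that returns an error when the key is missing. The name set_yaml_key is hypothetical, the demo file is a throwaway stand-in rather than the real k8s-cluster.yml, and the helper only handles unindented keys whose values contain no |, & or backslash characters.

```shell
#!/usr/bin/env bash
# Set "key: value" for a top-level key, uncommenting the key if needed.
# Returns non-zero when the key is absent instead of silently doing nothing.
set_yaml_key() {
  local file="$1" key="$2" value="$3"
  if ! grep -qE "^#? ?${key}:" "$file"; then
    echo "key not found in ${file}: ${key}" >&2
    return 1
  fi
  sed -i -E "s|^#? ?${key}:.*|${key}: ${value}|" "$file"
}

# Demo on a temporary file with illustrative contents.
demo_file="$(mktemp)"
printf '%s\n' 'kube_network_plugin: calico' '# auto_renew_certificates: false' > "$demo_file"

set_yaml_key "$demo_file" kube_network_plugin flannel
set_yaml_key "$demo_file" auto_renew_certificates true
set_yaml_key "$demo_file" no_such_key 1 || echo "caught missing key"

cat "$demo_file"
rm -f "$demo_file"
```

The trailing colon in the pattern keeps auto_renew_certificates from also matching the longer auto_renew_certificates_systemd_calendar key.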

The calendar expression for certificate renewal looks cryptic at first glance. It follows the systemd timer calendar format. “Mon *-*-1,2,3,4,5,6,7 03:00:00” means “every Monday that falls on the 1st through 7th of any month, at 3:00 AM.” In practice, this triggers certificate renewal once a month on the first Monday, at a time when the cluster is likely idle.

Configuring Flannel

Since we switched from Calico to Flannel, we need to configure Flannel-specific settings. Kubespray maintains separate configuration files for each CNI plugin. The Flannel settings live in k8s-net-flannel.yml.

One critical setting for VirtualBox environments is specifying which network interface Flannel should use for VXLAN traffic. Without this setting, Flannel might pick the wrong interface, just like Kubespray picking the wrong IP address when the ip variable is missing.

Check the current contents of the Flannel configuration:

cat inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml

The file contains various Flannel settings, mostly commented out because the defaults work for typical deployments. Add the interface specification:

echo "flannel_interface: enp0s8" >> inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml

The interface name enp0s8 corresponds to the Host-Only network adapter in VirtualBox. This is where your VMs can communicate with each other. If you used different network settings in your Vagrantfile, you might need to adjust this value. You can check available interfaces on your VM by running ip addr.

Verify the setting was added:

grep "^[^#]" inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml

This grep command shows only non-comment lines, filtering out the extensive documentation that Kubespray includes in its configuration files.

What These Settings Actually Do

Let me explain why we chose these particular settings, beyond just simplicity for learning.

Flannel versus Calico represents a fundamental tradeoff in Kubernetes networking. Calico uses BGP routing and supports NetworkPolicy enforcement, making it the preferred choice for production clusters that need fine-grained network security. Flannel uses simpler VXLAN encapsulation and does not support NetworkPolicy at all. For a single-node learning cluster where you are not going to test network policies, Flannel’s simplicity is an advantage. There is less to go wrong, and the networking behavior is easier to understand.

The kube-proxy mode setting affects how Kubernetes implements Service load balancing. In iptables mode, kube-proxy programs iptables rules that distribute traffic across pod endpoints. Every packet traverses the iptables rules, which works fine for moderate traffic loads. In IPVS mode, kube-proxy uses the Linux kernel’s IPVS (IP Virtual Server) module, which is designed for load balancing and handles high connection rates more efficiently. For a learning cluster, iptables mode is easier to inspect and debug. You can see the rules with iptables -L -t nat, which is helpful when troubleshooting service connectivity.

NodeLocal DNSCache improves DNS performance by running a DNS caching agent on every node. Instead of sending all DNS queries to CoreDNS pods that might be running on different nodes, queries go to the local cache first. This reduces latency and network traffic. However, it adds another moving part to the cluster, and for learning purposes, the extra complexity is not worth the performance gain.

Certificate auto-renewal is something you absolutely want in production. Kubernetes uses many certificates: the API server certificate, the kubelet client certificates, the etcd certificates, and more. These certificates typically have a one-year validity period. If they expire, your cluster stops working. The auto-renewal feature uses a systemd timer to run kubeadm certs renew automatically before expiration. The monthly schedule gives plenty of margin since certificates are valid for a full year.
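
To see why a Monday restricted to days 1 through 7 fires exactly once a month, it helps to compute the matching date for a given month. The sketch below does that with GNU date. The first_monday function is a hypothetical helper written for this illustration and is not how systemd evaluates the expression. On the VM itself, systemd-analyze calendar "Mon *-*-1,2,3,4,5,6,7 03:00:00" is the authoritative way to see the next trigger time.

```shell
#!/usr/bin/env bash
# Print the first Monday of a month given as YYYY-MM (requires GNU date).
first_monday() {
  local month="$1" day
  for day in 01 02 03 04 05 06 07; do
    # %u prints the ISO weekday number, where 1 is Monday.
    # Noon is used to stay clear of any midnight daylight-saving gaps.
    if [ "$(date -d "${month}-${day} 12:00" +%u)" = "1" ]; then
      echo "${month}-${day}"
      return 0
    fi
  done
  return 1
}

first_monday 2026-03   # prints: 2026-03-02
first_monday 2026-06   # prints: 2026-06-01
```

Any seven consecutive days contain exactly one Monday, so the loop always finds one match, which is the same reason the timer runs once per month.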

The Variable Hierarchy Reminder

Remember from earlier chapters that Kubespray uses Ansible’s variable precedence system. The files you just modified in group_vars/k8s_cluster/ have a priority of 6 in Ansible’s hierarchy. They will override the default values defined in roles/kubespray_defaults/defaults/ (priority 2) but can themselves be overridden by host_vars files (priority 9) or command-line extra-vars (priority 22).

This means if you ever need to test a different setting without permanently changing your inventory, you can use the -e flag:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  -e "kube_network_plugin=cilium"

This would deploy with Cilium instead of Flannel, overriding the inventory setting just for this run. The inventory files remain unchanged, so subsequent runs without -e would use Flannel again.

This hierarchy is intentional. Kubespray defaults represent best practices that work for most deployments. Your inventory customizations represent your organization’s requirements. Command-line overrides let you experiment without changing anything permanently.

Preparing for Deployment

At this point, your inventory is configured and validated. You have defined which machine will be your cluster node, specified the correct IP address to avoid VirtualBox networking pitfalls, and customized the cluster configuration to use Flannel with iptables-mode kube-proxy.

The configuration is complete. In the next chapter, we will execute the deployment and watch Kubespray transform a bare Rocky Linux VM into a functioning Kubernetes cluster. Before proceeding, it is worth double-checking that SSH connectivity still works:

ansible -i inventory/mycluster/inventory.ini all -m ping

If this returns success, you are ready to deploy. If it fails, troubleshoot SSH connectivity before moving on. No amount of correct inventory configuration will help if Ansible cannot reach your nodes.

Chapter 10: Kubernetes The Kubespray Way — A Comparative Tutorial

This chapter walks through deploying a Kubernetes cluster using Kubespray, deliberately structured to mirror the steps from Kubernetes The Hard Way. The goal here is not just to get a cluster running, but to understand exactly which manual steps Kubespray automates for you. By the end, you will have a working cluster and a clear mental map of what happens behind that single ansible-playbook command.

The tutorial follows the official Kubespray documentation “Setting up your first cluster,” which was itself inspired by Kubernetes The Hard Way. The difference is that instead of manually executing hundreds of commands across multiple nodes, you will define your desired state in inventory files and let Ansible do the work.

Prerequisites

In Kubernetes The Hard Way, the prerequisites chapter covered installing VirtualBox and Vagrant, verifying host system requirements, and planning the network topology. The same applies here, but Kubespray adds a few more requirements on the control node side.

For Kubespray, you need a machine that will run Ansible. This is called the Ansible Control Node, and it sits outside the cluster. From this machine, you will SSH into all the target nodes and orchestrate the installation. The target nodes themselves only need SSH access and Python installed.

Here is the requirement summary:

Ansible Control Node: Linux or Mac with Python 3 installed
Target Nodes: SSH accessible, Python installed
Network: Nodes can communicate with each other, internet access for downloading images
Privileges: Root or sudo access on all target nodes

The original documentation uses Google Cloud Platform for the infrastructure. This tutorial uses Vagrant instead, for several reasons. Vagrant is free, runs locally without network latency, and you have been using it throughout the previous tutorials. There is no reason to change now.

Provisioning Compute Resources

In Kubernetes The Hard Way, you created four VMs: a Jumpbox for running commands, a Server for the control plane, and two worker nodes. The Kubespray setup follows a similar pattern, but the Jumpbox becomes the Ansible Control Node.

Here is the node layout for this tutorial:

controller (192.168.10.10) — Ansible Control Node, equivalent to the Jumpbox
controller-0 (192.168.10.100) — Control Plane node, equivalent to the Server
worker-0 (192.168.10.101) — Worker node
worker-1 (192.168.10.102) — Worker node

Create a directory for this lab and add the following Vagrantfile:


BOX_IMAGE = "bento/rockylinux-10.0"
BOX_VERSION = "202510.26.0"

Vagrant.configure("2") do |config|
  config.vm.define "controller" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "controller"
      vb.cpus = 2
      vb.memory = 2048
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "controller"
    subconfig.vm.network "private_network", ip: "192.168.10.10"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
  end

  config.vm.define "controller-0" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "controller-0"
      vb.cpus = 2
      vb.memory = 2048
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "controller-0"
    subconfig.vm.network "private_network", ip: "192.168.10.100"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
  end

  config.vm.define "worker-0" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "worker-0"
      vb.cpus = 2
      vb.memory = 2048
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "worker-0"
    subconfig.vm.network "private_network", ip: "192.168.10.101"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
  end

  config.vm.define "worker-1" do |subconfig|
    subconfig.vm.box = BOX_IMAGE
    subconfig.vm.box_version = BOX_VERSION
    subconfig.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--groups", "/Kubespray-Lab"]
      vb.customize ["modifyvm", :id, "--nicpromisc2", "allow-all"]
      vb.name = "worker-1"
      vb.cpus = 2
      vb.memory = 2048
      vb.linked_clone = true
    end
    subconfig.vm.host_name = "worker-1"
    subconfig.vm.network "private_network", ip: "192.168.10.102"
    subconfig.vm.synced_folder "./", "/vagrant", disabled: true
  end
end

Bring up all four VMs:


vagrant up

This takes a few minutes. Once complete, verify all machines are running:


vagrant status

You should see all four VMs in the “running” state.

Configuring SSH Access

Just like in Kubernetes The Hard Way where you set up SSH from the Jumpbox to the other nodes, Ansible needs passwordless SSH access to all target nodes. Log into the controller node:


vagrant ssh controller

Generate an SSH key pair:


ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

Now copy the public key to each target node. The default password for Vagrant boxes is “vagrant”:


ssh-copy-id root@192.168.10.100
ssh-copy-id root@192.168.10.101
ssh-copy-id root@192.168.10.102

Test that passwordless SSH works:


ssh root@192.168.10.100 "hostname"
ssh root@192.168.10.101 "hostname"
ssh root@192.168.10.102 "hostname"

Each command should return the hostname without prompting for a password. If this works, you are ready to set up Kubespray.

Setting Up Kubespray

This is where things diverge dramatically from The Hard Way. In The Hard Way, the next several hours would be spent manually generating certificates, creating kubeconfig files, writing systemd unit files, and bootstrapping etcd. With Kubespray, you will spend the next few minutes configuring inventory files, and then a single command handles the rest.

Still on the controller node, create a Python virtual environment. Ansible is a Python application, and using a virtual environment keeps dependencies isolated:

python3 -m venv venv
source venv/bin/activate

Clone the Kubespray repository and check out a stable release:


git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
git checkout release-2.28

Install the required Python packages:

pip install -r requirements.txt

You will see Ansible and its dependencies being installed. The output ends with something like:


Successfully installed MarkupSafe-3.0.3 PyYAML-6.0.3 ansible-9.13.0 ansible-core-2.16.15 …

Configuring the Inventory

Kubespray uses Ansible inventory files to define which hosts belong to which groups. Copy the sample inventory to create your own.

cp -rfp inventory/sample inventory/mycluster

Now edit the inventory file to match your node layout. Open inventory/mycluster/inventory.ini and replace its contents with:

[all]
controller-0 ansible_host=192.168.10.100 ip=192.168.10.100
worker-0 ansible_host=192.168.10.101 ip=192.168.10.101
worker-1 ansible_host=192.168.10.102 ip=192.168.10.102

[kube_control_plane]
controller-0

[etcd]
controller-0

[kube_node]
worker-0
worker-1

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Let me explain what each section does.

The [all] section lists every host with its connection details. The ansible_host variable tells Ansible which IP address to SSH to. The ip variable is critical and often misunderstood. This tells Kubespray which IP address to use for Kubernetes internal communication. If you omit ip, Kubespray will try to auto-detect it, and on VirtualBox it often picks the NAT interface (10.0.2.15) instead of the host-only network. This causes the cluster to fail because nodes cannot reach each other on 10.0.2.15.

The [kube_control_plane] group contains nodes that will run the Kubernetes control plane components: kube-apiserver, kube-controller-manager, and kube-scheduler.

The [etcd] group contains nodes that will run the etcd cluster. In this setup, etcd runs on the same node as the control plane, which is called “stacked etcd.” For production, you might run etcd on dedicated nodes.

The [kube_node] group contains worker nodes where your application pods will run.

The [calico_rr] group is for Calico Route Reflectors, used in large-scale deployments. Leave it empty for now.

The [k8s_cluster:children] group is a meta-group that includes all Kubernetes cluster members. The :children suffix means this group inherits members from the listed child groups.

Notice how the sample inventory uses [etcd:children] with kube_control_plane listed under it. This is a shorthand that says “the etcd group consists of all hosts in the kube_control_plane group.” In this tutorial, we list controller-0 directly under [etcd] for clarity, but both approaches produce the same result.
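To make the :children mechanism concrete, here is a small sketch that expands a group into host names. It is a toy parser written for this inventory's simple layout, not Ansible's own inventory loader, and the function names are made up for illustration.

```python
INVENTORY = """\
[all]
controller-0 ansible_host=192.168.10.100 ip=192.168.10.100
worker-0 ansible_host=192.168.10.101 ip=192.168.10.101
worker-1 ansible_host=192.168.10.102 ip=192.168.10.102

[kube_control_plane]
controller-0

[etcd]
controller-0

[kube_node]
worker-0
worker-1

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
"""

def parse_groups(text):
    """Return {section_name: [entries]} for a simple INI-style inventory."""
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current] = []
        elif current is not None:
            groups[current].append(line.split()[0])  # keep only the host or group name
    return groups

def members(groups, name):
    """Expand a group to host names, following :children references."""
    children_key = f"{name}:children"
    if children_key not in groups:
        return list(groups.get(name, []))
    hosts = []
    for child in groups[children_key]:
        for host in members(groups, child):
            if host not in hosts:
                hosts.append(host)
    return hosts

groups = parse_groups(INVENTORY)
print(members(groups, "k8s_cluster"))
```

The empty calico_rr group contributes nothing, so k8s_cluster ends up with the control plane node and the two workers.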

Understanding the Default Configuration

Before deploying, take a moment to look at the default configuration. The most important file is inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml. You can view the active settings with:

grep "^[^#]" inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml | head -30

Some key defaults you will notice:

kube_network_plugin: calico — The CNI plugin for pod networking
kube_service_addresses: 10.233.0.0/18 — The CIDR range for Kubernetes services
kube_pods_subnet: 10.233.64.0/18 — The CIDR range for pods
container_manager: containerd — The container runtime
kube_proxy_mode: ipvs — How kube-proxy handles service routing

For this tutorial, keep the defaults. In production, you would carefully review and customize these settings before deployment.
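If you do change the address ranges later, the main thing to verify is that the service range, the pod range, and the node network stay disjoint. The following sketch checks that with Python's ipaddress module. The /24 node network is an assumption based on the Vagrant private network used in this lab.

```python
import ipaddress

service_net = ipaddress.ip_network("10.233.0.0/18")   # kube_service_addresses
pod_net = ipaddress.ip_network("10.233.64.0/18")      # kube_pods_subnet
node_net = ipaddress.ip_network("192.168.10.0/24")    # assumed lab network

networks = {"services": service_net, "pods": pod_net, "nodes": node_net}

# Collect every pair of ranges that share addresses.
names = sorted(networks)
overlaps = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if networks[a].overlaps(networks[b])
]

print(overlaps)                   # an empty list means the ranges are disjoint
print(pod_net.num_addresses)      # size of the pod range
print(ipaddress.ip_address("10.233.107.4") in pod_net)
```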

One optional but recommended setting is enabling the Metrics Server. Edit inventory/mycluster/group_vars/k8s_cluster/addons.yml and find the metrics_server_enabled line. Change it to:

metrics_server_enabled: true

The Metrics Server provides resource usage data for commands like kubectl top and is required for Horizontal Pod Autoscaler.

Deploying the Cluster

Everything up to this point has been preparation. Now comes the moment of truth. From the kubespray directory, run:

ansible-playbook -i inventory/mycluster/inventory.ini \
  -u root -b -v \
  --private-key=~/.ssh/id_rsa \
  cluster.yml

Let me break down this command:

-i inventory/mycluster/inventory.ini specifies the inventory file
-u root tells Ansible to connect as the root user
-b enables "become" mode for privilege escalation (though root does not need it)
-v enables verbose output so you can see what is happening
--private-key specifies the SSH key to use
cluster.yml is the main playbook that deploys everything

Press enter and watch the output scroll by. This takes 15 to 30 minutes depending on your hardware and internet speed. You will see hundreds of tasks execute across multiple plays.

If everything succeeds, the final output looks like:

PLAY RECAP *********************************************************************
controller-0 : ok=XXX changed=XX unreachable=0 failed=0 …
worker-0 : ok=XXX changed=XX unreachable=0 failed=0 …
worker-1 : ok=XXX changed=XX unreachable=0 failed=0 …

The key metric is failed=0 for all hosts.
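If you save the playbook output to a file, the recap check is easy to script. This is a minimal sketch that assumes the recap lines keep the host : key=value layout shown above. The sample text and its counts are invented to exercise the parser.

```python
import re

# Made-up recap used only as test input.
RECAP = """\
PLAY RECAP *********************************************************************
controller-0 : ok=10 changed=3 unreachable=0 failed=0 skipped=5
worker-0 : ok=8 changed=2 unreachable=0 failed=1 skipped=4
"""

LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<stats>(?:\w+=\d+\s*)+)$")

def failed_hosts(recap_text):
    """Return hosts whose recap line reports failed or unreachable tasks."""
    bad = []
    for line in recap_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # header line or unrelated output
        stats = dict(pair.split("=") for pair in match["stats"].split())
        if int(stats.get("failed", 0)) or int(stats.get("unreachable", 0)):
            bad.append(match["host"])
    return bad

print(failed_hosts(RECAP))
```

An empty list corresponds to failed=0 and unreachable=0 on every host.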

Troubleshooting: Inventory Path Issues

If you see warnings like “Unable to parse inventory” and all plays show “skipping: no hosts matched,” check your inventory path. The official documentation uses a directory path:

-i inventory/mycluster/

However, this does not work reliably in all environments. Specify the file directly:

-i inventory/mycluster/inventory.ini

Troubleshooting: USERNAME Variable

If you get an error about expecting an argument for -u, or if the ansible-playbook help is displayed instead of running the playbook, the $USERNAME environment variable is not set. The official documentation assumes you ran:

USERNAME=$(whoami)

Either set this variable or replace $USERNAME with root directly in the command.

Troubleshooting: VirtualBox NAT IP Problem

This is the most common issue with VirtualBox environments, and you have encountered it before in previous tutorials. If the deployment fails during the worker node join phase with an error like:

error execution phase preflight: couldn't validate the identity of the API Server: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused

The problem is that Kubespray detected the wrong IP address. VirtualBox VMs have a NAT interface as their first network adapter, which gets the IP 10.0.2.15. This IP is not routable between VMs. The fix is to explicitly set the ip variable in your inventory, which you already did in the configuration above.

If you hit this error, verify your inventory.ini has ip= set for every host:

[all]
controller-0 ansible_host=192.168.10.100 ip=192.168.10.100
worker-0 ansible_host=192.168.10.101 ip=192.168.10.101
worker-1 ansible_host=192.168.10.102 ip=192.168.10.102

Then run the playbook again. Because Kubespray's tasks are written to be idempotent, tasks whose result is already in place report ok without changing anything, so the run effectively picks up where it failed.
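The same check can be automated. The sketch below looks at host lines from the [all] section and flags any entry that omits ip= or points it at VirtualBox's default NAT range (10.0.2.0/24). The function name and the two faulty sample lines are hypothetical.

```python
import ipaddress

VBOX_NAT = ipaddress.ip_network("10.0.2.0/24")  # VirtualBox's default NAT range

def check_host_line(line):
    """Return a problem description for one [all] entry, or None if it looks fine."""
    name, *pairs = line.split()
    host_vars = dict(pair.split("=", 1) for pair in pairs)
    if "ip" not in host_vars:
        return f"{name}: ip= is missing, so Kubespray will auto-detect it"
    if ipaddress.ip_address(host_vars["ip"]) in VBOX_NAT:
        return f"{name}: ip={host_vars['ip']} is on the NAT network"
    return None

lines = [
    "controller-0 ansible_host=192.168.10.100 ip=192.168.10.100",
    "worker-0 ansible_host=192.168.10.101",               # hypothetical mistake
    "worker-1 ansible_host=192.168.10.102 ip=10.0.2.15",  # hypothetical mistake
]
problems = [p for p in map(check_host_line, lines) if p]
for problem in problems:
    print(problem)
```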

Troubleshooting: etcd Health Check Failure After IP Change

Here is a more insidious problem. Suppose you initially ran the playbook without the ip variable set, the control plane was configured with 10.0.2.15, then you fixed the inventory and ran again. This time you might see:

TASK [etcd : Configure | Wait for etcd cluster to be healthy] *********************
fatal: [controller-0]: FAILED! => {
  "cmd": "… /usr/local/bin/etcdctl endpoint --cluster health …",
  "stderr": "… dial tcp 192.168.10.100:2379: connect: connection refused …"
}

The etcd service was already installed and configured to bind to 10.0.2.15. Changing the ip variable does not automatically reconfigure already-running services. From Ansible's point of view, etcd is already installed and running, so the existing member is left bound to the old address.

The solution is to reset the cluster and start fresh:

ansible-playbook -i inventory/mycluster/inventory.ini \
  -u root -b -v \
  --private-key=~/.ssh/id_rsa \
  reset.yml

When prompted with “Are you sure you want to reset cluster state?”, type yes. This removes all Kubernetes components and etcd data, returning the nodes to a clean state.

After the reset completes, run cluster.yml again:

ansible-playbook -i inventory/mycluster/inventory.ini \
  -u root -b -v \
  --private-key=~/.ssh/id_rsa \
  cluster.yml

This time, everything installs correctly with the right IP addresses.

Note: When you run cluster.yml after reset.yml, you might see an error during the etcd version check:

fatal: [controller-0]: FAILED! => {"msg": "[Errno 2] No such file or directory: b'/usr/local/bin/etcd'"}

This is expected. The reset removed the etcd binary, so the version check fails. Kubespray handles this gracefully and proceeds to install etcd. Check the final PLAY RECAP — if failed=0, everything is fine.

Accessing the Cluster

The cluster is deployed, but you still need kubectl access from outside the cluster. Kubespray created a kubeconfig file on the control plane node at /etc/kubernetes/admin.conf. You need to copy this to your Ansible control node.

Still on the controller node, create the kubectl configuration directory:

mkdir -p ~/.kube

Copy the kubeconfig from the control plane:

scp root@192.168.10.100:/etc/kubernetes/admin.conf ~/.kube/config

There is one more step. The kubeconfig file was generated for use on the control plane node itself, so it points to localhost:

cat ~/.kube/config | grep server

You will see server: https://127.0.0.1:6443. This does not work from the controller node because the API server is not running locally. Fix it by replacing 127.0.0.1 with the actual control plane IP:

sed -i 's/127.0.0.1/192.168.10.100/g' ~/.kube/config
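The sed command replaces every occurrence of 127.0.0.1 in the file, which is fine for this kubeconfig. If you want a stricter edit that touches only server: lines, something like the following sketch works. The function name is made up, and the sample text is a shortened stand-in for a real kubeconfig.

```python
import re

SERVER_LINE = re.compile(r"^(\s*server:\s*https://)[^:/\s]+(:\d+)?", re.MULTILINE)

def point_kubeconfig_at(text, host):
    """Rewrite only the host part of server: lines, keeping scheme and port."""
    return SERVER_LINE.sub(lambda m: f"{m.group(1)}{host}{m.group(2) or ''}", text)

sample = """\
clusters:
- cluster:
    server: https://127.0.0.1:6443
  name: cluster.local
"""
updated = point_kubeconfig_at(sample, "192.168.10.100")
print(updated)
```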

Now install kubectl if your controller node does not already have it:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

Test the connection:

kubectl get nodes

You should see all three nodes:

NAME           STATUS   ROLES           AGE   VERSION
controller-0   Ready    control-plane   11m   v1.32.11
worker-0       Ready    <none>          10m   v1.32.11
worker-1       Ready    <none>          10m   v1.32.11

The cluster is ready.

Smoke Tests

Run the same verification tests from Kubernetes The Hard Way to confirm everything works.

If you enabled Metrics Server, check that resource metrics are available:

kubectl top nodes

Metrics collection takes a few minutes to start. After that, the output looks like this:

NAME           CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
controller-0   98m          7%       1939Mi          66%
worker-0       40m          2%       977Mi           99%
worker-1       43m          3%       994Mi           101%

Network — Pod to Pod Communication

Open two terminal windows, both connected to the controller node.

In the first terminal, create a pod and note its IP:

kubectl run myshell1 -it --rm --image busybox -- sh

Once inside the pod:

hostname -i

Note the IP address, something like 10.233.107.4.

In the second terminal, create another pod and ping the first one:

kubectl run myshell2 -it --rm --image busybox -- sh

Inside the second pod:

ping 10.233.107.4

You should see successful ping responses:

64 bytes from 10.233.107.4: seq=0 ttl=62 time=0.686 ms
64 bytes from 10.233.107.4: seq=1 ttl=62 time=0.620 ms

This confirms pod-to-pod networking works. Calico is doing its job.

Exit both pods with exit or Ctrl-D.

Deployments

Create a simple deployment:

kubectl create deployment nginx --image=nginx

Watch the pod come up:

kubectl get pods -l app=nginx

Wait until STATUS shows Running.

Port Forwarding

Forward a local port to the nginx pod:

POD_NAME=$(kubectl get pods -l app=nginx -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward $POD_NAME 8080:80

In another terminal:

curl --head http://127.0.0.1:8080

You should see HTTP/1.1 200 OK and nginx headers.

Logs

Retrieve container logs:

kubectl logs $POD_NAME

You will see nginx startup logs and, if you ran the curl command, the access log entry.

Exec

Execute a command inside the container:

kubectl exec -ti $POD_NAME -- nginx -v

Output: nginx version: nginx/1.x.x

Services — NodePort

Expose the deployment as a NodePort service:

kubectl expose deployment nginx --port 80 --type NodePort

Find the assigned node port:

NODE_PORT=$(kubectl get svc nginx -o jsonpath='{.spec.ports[0].nodePort}')
echo $NODE_PORT

Access the service through a worker node’s IP:

curl -I http://192.168.10.101:$NODE_PORT

You should get HTTP/1.1 200 OK.

Local DNS

Test cross-namespace DNS resolution. First, create a namespace and deploy nginx there:

kubectl create namespace dev
kubectl create deployment nginx --image=nginx -n dev
kubectl expose deployment nginx --port 80 --type ClusterIP -n dev

Now create a pod in the default namespace and access the service in the dev namespace by DNS name:

kubectl run curly -it --rm --image curlimages/curl:7.70.0 -- /bin/sh

Inside the pod:

curl --head http://nginx.dev:80

This resolves nginx.dev to nginx.dev.svc.cluster.local and returns the nginx response. CoreDNS is working correctly.
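The expansion from nginx.dev to the full name comes from the resolver's search list. The sketch below lists the names a pod would try, assuming the usual Kubernetes pod settings: a search list of namespace.svc.cluster.local, svc.cluster.local, and cluster.local, with ndots set to 5. Any extra search domains inherited from the node are left out, and the function name is hypothetical.

```python
def candidate_names(name, namespace, cluster_domain="cluster.local", ndots=5):
    """List the fully qualified names a pod's resolver would try, in order."""
    if name.endswith("."):  # already fully qualified, no search list applied
        return [name]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    # With fewer dots than ndots, the search list is tried before the bare name.
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded

names = candidate_names("nginx.dev", namespace="default")
for candidate in names:
    print(candidate)
```

For a pod in the default namespace, the first candidate does not exist, and the second one, nginx.dev.svc.cluster.local, is the service created in the dev namespace.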

Cleaning Up

When you are done experimenting, clean up the Kubernetes resources:

kubectl delete namespace dev
kubectl delete deployment nginx
kubectl delete svc nginx

To reset the cluster but keep the VMs (for example, to redeploy with different settings):

ansible-playbook -i inventory/mycluster/inventory.ini 
 -u root -b -v 
 - private-key=~/.ssh/id_rsa 
 reset.yml

To completely remove everything:

exit  # Leave the controller VM
vagrant destroy -f

Chapter 11: containerd Configuration File Explained

I kept running into the same problem. Every time I needed to configure containerd, I found myself searching “containerd config.toml” and piecing together information from five different sources. After the third time doing this in a month, I decided to sit down and actually understand the configuration file properly. This chapter is the result of that effort.

Understanding TOML Before We Start

containerd uses TOML for its configuration file. If you’ve never encountered TOML before, you might wonder why not just use JSON or YAML like everything else in the Kubernetes ecosystem. The answer comes down to readability and structure.

TOML stands for Tom’s Obvious, Minimal Language. It looks similar to INI files but has a more rigorous specification. Here’s what basic TOML looks like:

# Comments start with hash
key = "value"
number = 42
boolean = true

[section]
  nested_key = "nested_value"

[section.subsection]
  deep_key = "deep_value"

containerd chose TOML because its configuration is deeply hierarchical. Plugins have sub-configurations, which have their own sub-configurations. TOML handles this nesting elegantly while remaining readable.

JSON would work but becomes a mess of brackets. YAML works but its significant whitespace causes endless debugging sessions when someone mixes tabs and spaces.

The bracket notation in TOML defines what are called tables, which are essentially sections or namespaces. When you see [grpc], everything that follows belongs to the grpc configuration until the next table declaration. When you see [plugins."io.containerd.grpc.v1.cri"], that's a table with a quoted key because the key contains dots that would otherwise be interpreted as nested tables.

This distinction matters. Consider the difference:

# This creates nested tables: plugins.io.containerd.grpc.v1.cri
[plugins.io.containerd.grpc.v1.cri]

# This creates a single table with a key that happens to contain dots
[plugins."io.containerd.grpc.v1.cri"]

containerd uses the second form. The entire string “io.containerd.grpc.v1.cri” is the plugin identifier, not a hierarchy.

Where containerd Looks for Configuration

When containerd starts, it looks for configuration in this order. First, it checks if you passed a --config flag:

containerd --config /path/to/custom-config.toml

If you did, it uses that file and ignores everything else. If you didn’t pass --config, it looks at the default path:

/etc/containerd/config.toml

If that file doesn’t exist either, containerd runs with built-in defaults. You can see these defaults anytime:

containerd config default

This command dumps the complete default configuration to stdout. It’s incredibly useful when you’re trying to figure out what options exist. If you want to see what configuration containerd is actually running with right now, use:

containerd config dump

The difference matters. The default command shows what containerd would use if you had no config file. The dump command shows the merged result of your config file plus defaults for anything you didn’t specify.

Configuration File Versions

The config file has a version field, and this trips people up constantly. If you don’t specify a version, containerd assumes version 1, which is deprecated and removed in containerd 2.x. You should always explicitly set the version.

For containerd 1.x, use version 2:

version = 2

For containerd 2.x, use version 3:

version = 3

The version numbers refer to the configuration schema, not the containerd version. This naming is confusing but we’re stuck with it.

What changed between versions? Version 1 to 2 added the “io.containerd.” prefix to all plugin identifiers. Version 2 to 3 reorganized the CRI plugin structure and changed some default values.

containerd performs automatic migration when it starts. If you have a version 2 config and run containerd 2.x, it converts the config in memory to version 3 format. Your original file stays unchanged, but there’s a small performance cost at startup. If you want to avoid this, you can pre-migrate your config:

containerd config migrate /etc/containerd/config.toml > /etc/containerd/config.toml.new

Review the output before replacing your original. Migrated configs aren’t backward compatible, so you can’t easily roll back to an older containerd version.

Global Settings at the Top

The top of config.toml contains settings that apply to containerd as a whole, not to any specific plugin.

version = 3
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
oom_score = 0

The root directory is where containerd stores persistent data. Images, container metadata, snapshots — everything that should survive a reboot lives here. Don’t put this on a tmpfs.

The state directory holds runtime state. Sockets, PID files, information about running containers. This data is ephemeral and gets recreated on restart. That’s why it defaults to /run, which is typically a tmpfs.

The temp setting specifies where containerd creates temporary files. If empty, it uses the system default (usually /tmp).

The oom_score adjusts how the Linux OOM killer prioritizes containerd. Values range from -1000 to 1000. Lower values mean the process is less likely to be killed. A value of 0 means no adjustment from the default.

There’s also an imports field that lets you split configuration across multiple files:

version = 3
imports = ["/etc/containerd/conf.d/*.toml"]

This works like nginx’s conf.d pattern. You can drop additional config files into that directory and they’ll be merged. In version 3, the default imports path is /etc/containerd/conf.d/*.toml. Files are processed in glob order, and later files override earlier ones.

This is useful for managing configuration in production. Your base config stays in config.toml, while environment-specific settings go in conf.d. For example:

/etc/containerd/config.toml              # Base configuration
/etc/containerd/conf.d/50-registry.toml  # Private registry settings
/etc/containerd/conf.d/99-nvidia.toml    # GPU runtime configuration

The gRPC Section

containerd communicates with clients through gRPC. This is not an implementation choice — it’s a requirement. The Container Runtime Interface (CRI) that Kubernetes uses is defined as a gRPC service using Protocol Buffers. Any container runtime that wants to work with Kubernetes must speak gRPC.

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

The address is the Unix socket path where containerd listens. The kubelet connects to this socket to create and manage containers. The uid and gid control who owns the socket file. Setting both to 0 means root owns it, so only root can connect.

The message size limits are 16MB by default. You rarely need to change these unless you’re doing something unusual with very large container specs.

There’s also support for TCP with TLS if you need remote access:

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = "0.0.0.0:10000"
  tcp_tls_cert = "/etc/containerd/cert.pem"
  tcp_tls_key = "/etc/containerd/key.pem"
  tcp_tls_ca = "/etc/containerd/ca.pem"

Most people leave tcp_address empty because exposing containerd over the network is a significant security risk.

The ttrpc Section

You’ll also see a ttrpc section:

[ttrpc]
  address = ""
  uid = 0
  gid = 0

TTRPC is a lightweight RPC protocol that containerd developed for communication with shims. A shim is the process that sits between containerd and the actual container process. When containerd needs to talk to a shim, it uses TTRPC instead of gRPC because the overhead is lower.

In most deployments, you can ignore this section entirely. An empty address does not turn TTRPC off: containerd derives the socket path from the gRPC address by appending .ttrpc, which gives /run/containerd/containerd.sock.ttrpc with the default gRPC address.

Debug and Metrics

For troubleshooting, there’s a debug section:

[debug]
  address = "/run/containerd/debug.sock"
  uid = 0
  gid = 0
  level = "info"
  format = "text"

The level can be trace, debug, info, warn, error, fatal, or panic. During initial setup, setting this to debug helps immensely. In production, info or warn is appropriate.

The format can be text or json. Text is readable for humans watching logs. JSON is better when you’re feeding logs into a log aggregation system.

containerd can also expose Prometheus metrics:

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

By default, address is empty and metrics are disabled. Set an address to enable the metrics endpoint. The grpc_histogram option adds detailed gRPC latency histograms but increases cardinality significantly.

The Plugins Section

Now we get to the heart of containerd configuration. Almost everything in containerd is a plugin. The CRI implementation that Kubernetes talks to? A plugin. The snapshotter that manages image layers? A plugin. The runtime that actually runs containers? A plugin.

Each plugin has its own configuration section under [plugins]:

[plugins."io.containerd.grpc.v1.cri"]
  # CRI plugin configuration

[plugins."io.containerd.gc.v1.scheduler"]
  # Garbage collection scheduler

[plugins."io.containerd.snapshotter.v1.overlayfs"]
  # OverlayFS snapshotter

The plugin identifiers follow a pattern: io.containerd.<type>.<version>.<name>. For example, io.containerd.grpc.v1.cri is a gRPC service plugin, version 1, named cri.
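Reading an identifier as type, version, and name can be sketched in a few lines. The PluginID class and parse_plugin_id function are illustrative helpers, not part of any containerd tooling.

```python
from typing import NamedTuple

class PluginID(NamedTuple):
    type: str
    version: str
    name: str

def parse_plugin_id(identifier):
    """Split an identifier such as io.containerd.grpc.v1.cri into its three parts."""
    prefix = "io.containerd."
    if not identifier.startswith(prefix):
        raise ValueError(f"not a containerd plugin identifier: {identifier}")
    parts = identifier[len(prefix):].split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected type.version.name after the prefix: {identifier}")
    return PluginID(*parts)

for plugin in (
    "io.containerd.grpc.v1.cri",
    "io.containerd.gc.v1.scheduler",
    "io.containerd.snapshotter.v1.overlayfs",
):
    print(parse_plugin_id(plugin))
```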

Let me walk through the important plugins.

The garbage collection scheduler (io.containerd.gc.v1.scheduler) cleans up unused content:

[plugins."io.containerd.gc.v1.scheduler"]
  pause_threshold = 0.02
  deletion_threshold = 0
  mutation_threshold = 100
  schedule_delay = "0s"
  startup_delay = "100ms"

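The thresholds control when garbage collection is scheduled, and they do not all point the same way. pause_threshold caps the fraction of time containerd may spend paused for GC (0.02 means 2%), so raising it allows more frequent collection. deletion_threshold and mutation_threshold are counts of deletions and database mutations that trigger a collection, so raising them makes collection less frequent; a deletion_threshold of 0 means deletions alone never trigger it. The startup_delay prevents GC from running immediately when containerd starts, giving the system time to stabilize.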

The metadata plugin (io.containerd.metadata.v1.bolt) stores containerd’s metadata in a BoltDB database:

[plugins."io.containerd.metadata.v1.bolt"]
  content_sharing_policy = "shared"

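The content_sharing_policy determines whether content such as image layers is shared across containerd namespaces. With shared, a blob pulled in one namespace becomes available in all of them. With isolated, each namespace has to supply the content itself before it can use it. Shared is the default and appropriate for most cases.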

The overlayfs snapshotter (io.containerd.snapshotter.v1.overlayfs) manages container filesystem layers. On most Linux systems, this is the default snapshotter and usually needs no configuration.

The CRI Plugin in Detail

For Kubernetes integration, the CRI plugin is what matters most. In version 2 configs, everything is under one plugin:

[plugins."io.containerd.grpc.v1.cri"]

In version 3, it’s split into two:

[plugins."io.containerd.cri.v1.images"]
[plugins."io.containerd.cri.v1.runtime"]

I’ll focus on version 2 since that’s what most people are still using. The concepts translate directly to version 3.

The most commonly configured options:

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"
  max_concurrent_downloads = 3
  max_container_log_line_size = 16384

The sandbox_image is the pause container that Kubernetes uses to hold network namespaces for pods. If you’re in an air-gapped environment, you need to change this to point to your internal registry.

The max_concurrent_downloads limits how many image layers containerd downloads in parallel. Increase this if you have good network bandwidth and want faster image pulls. Decrease it if you’re overwhelming your registry or network.

Underneath the CRI plugin, there’s a containerd subsection for runtime configuration:

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  default_runtime_name = "runc"

The snapshotter setting determines which snapshotter plugin manages image layers. OverlayFS is the standard choice on modern Linux.

The default_runtime_name specifies which runtime to use when one isn’t explicitly requested. This points to a runtime defined in the runtimes subsection.

Configuring Container Runtimes

The runtimes subsection is where you define the OCI runtimes that containerd can use:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    BinaryName = ""
    SystemdCgroup = true

This defines a runtime named “runc”. The runtime_type specifies which shim to use. The io.containerd.runc.v2 shim is the standard choice for runc.

The options section contains runtime-specific settings. For runc, the important ones are:

BinaryName specifies the path to the runc binary. If empty, containerd searches PATH for “runc”. You can set an explicit path like “/usr/local/bin/runc” if you have multiple runc versions installed.

SystemdCgroup is critical for Kubernetes. When true, runc uses systemd to manage cgroups instead of writing to the cgroup filesystem directly. This must match your kubelet configuration. If kubelet is configured with cgroupDriver: systemd (the kubeadm default and the recommended choice on systemd-based hosts), then SystemdCgroup must be true. A mismatch typically shows up as pods that fail to start or nodes that become unstable under load, and the error messages rarely point at the cgroup driver.

You can define multiple runtimes for different use cases:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"

Here I’ve defined three runtimes. The default is runc for normal containers. There’s also an nvidia runtime for GPU workloads and a kata runtime for VM-isolated containers.

To use a non-default runtime in Kubernetes, you create a RuntimeClass:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

The handler field must match the runtime name in your containerd config. Then pods can request this runtime:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvidia/cuda:12.0-base

CNI Configuration

The CRI plugin also manages CNI (Container Network Interface) settings:

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"

The bin_dir is where CNI plugin binaries live. The conf_dir is where CNI configuration files go. These are the standard paths that most CNI plugins expect.

In a Kubernetes cluster, you typically don’t configure CNI through containerd directly. Instead, you install a CNI plugin like Calico or Flannel, which drops its own configuration files into conf_dir. But understanding these paths helps when troubleshooting network issues.

Registry Configuration

Registry configuration changed significantly between versions. In version 2, there were two approaches. The old way used inline configuration:

[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://registry-1.docker.io"]

The new way (recommended) uses a config_path:

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

With config_path, you create a directory structure:

/etc/containerd/certs.d/
├── docker.io/
│   └── hosts.toml
├── gcr.io/
│   └── hosts.toml
└── my-registry.example.com/
    ├── hosts.toml
    └── ca.crt

Each registry gets its own directory with a hosts.toml file:

# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://docker.io"

[host."https://registry-1.docker.io"]
  capabilities = ["pull", "resolve"]
[host."https://mirror.example.com"]
  capabilities = ["pull", "resolve"]
  skip_verify = false

This approach is more flexible. You can add registries without editing the main config file, and you can include CA certificates alongside the host configuration.

The capabilities field specifies which operations a host may be used for. pull means downloading image content, resolve means converting an image name or tag into a digest, and push means uploading images. Most configurations only need pull and resolve.

The skip_verify option disables TLS certificate verification. Never set this to true in production. If you’re using a private CA, put the CA certificate in the same directory and reference it:

[host."https://my-registry.example.com"]
  capabilities = ["pull", "resolve", "push"]
  ca = "/etc/containerd/certs.d/my-registry.example.com/ca.crt"
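
If you manage many registries, generating these files is less error-prone than writing them by hand. Here is a small Python sketch of that idea. The function name render_hosts_toml and its dictionary layout are my own invention, and it only covers the three fields discussed above (server, capabilities, ca).

```python
def render_hosts_toml(server: str, hosts: dict) -> str:
    """Render the body of a hosts.toml file for one registry directory."""
    lines = [f'server = "{server}"', ""]
    for url, options in hosts.items():
        lines.append(f'[host."{url}"]')
        # Default to the read-only capabilities most mirrors need.
        capabilities = options.get("capabilities", ["pull", "resolve"])
        quoted = ", ".join(f'"{c}"' for c in capabilities)
        lines.append(f"  capabilities = [{quoted}]")
        if "ca" in options:
            ca_path = options["ca"]
            lines.append(f'  ca = "{ca_path}"')
    return "\n".join(lines) + "\n"


text = render_hosts_toml(
    "https://my-registry.example.com",
    {
        "https://my-registry.example.com": {
            "capabilities": ["pull", "resolve", "push"],
            "ca": "/etc/containerd/certs.d/my-registry.example.com/ca.crt",
        }
    },
)
print(text)
```

You would write the returned text to /etc/containerd/certs.d/my-registry.example.com/hosts.toml, matching the directory layout shown earlier.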

A Complete Configuration Example

Let me put this all together into a complete version 2 configuration that you might use in production:

version = 2

root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0
[debug]
  level = "info"
[metrics]
  address = "127.0.0.1:1338"
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"
  max_concurrent_downloads = 3
  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"
    default_runtime_name = "runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
  [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"

This configuration enables metrics, uses systemd cgroups (matching a typical kubelet configuration), and uses the hosts.toml approach for registry configuration.

K3s and containerd

If you use K3s, you’ll encounter a different configuration pattern. K3s embeds containerd and generates its own configuration file at:

/var/lib/rancher/k3s/agent/etc/containerd/config.toml

This file has a comment at the top:

# File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.

K3s regenerates this file on every start. If you need to customize containerd in K3s, you create a template file:

/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

K3s uses Go templates, so you can include the base configuration and add your customizations:

{{ template "base" . }}

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    SystemdCgroup = true

The {{ template "base" . }} line includes K3s's default configuration, which already sets the version key, so the template must not declare version a second time (TOML rejects duplicate keys). Your additions come after.

For simple registry configuration changes in K3s, there’s an easier approach. Edit /etc/rancher/k3s/registries.yaml:

mirrors:
  docker.io:
    endpoint:
      - "https://mirror.example.com"
configs:
  "my-registry.example.com":
    tls:
      ca_file: "/etc/rancher/k3s/certs/ca.crt"

K3s translates this YAML into the appropriate containerd configuration automatically.

Verifying Your Configuration

After changing containerd configuration, restart the service and verify everything is working:

systemctl restart containerd
systemctl status containerd

Check that the configuration loaded correctly:

containerd config dump | grep -A5 "runtimes.runc"

Test that you can pull images:

crictl pull docker.io/library/alpine:latest

If you’re running Kubernetes, verify that kubelet can communicate with containerd:

crictl info

This command returns JSON with containerd’s current state. Look for the runtime configuration to confirm your changes took effect.

When troubleshooting, the containerd logs are essential:

journalctl -u containerd -f

Watch these logs while performing operations to understand what containerd is doing and where failures occur.

The configuration might seem overwhelming at first, but most of it is things you’ll never touch. In practice, you’ll spend 90% of your configuration time in the CRI plugin section, dealing with runtimes and registries. The rest is set once and forgotten.

Chapter 12: High Availability Configuration

When you run a single control plane node, you have a single point of failure. The API server goes down, and suddenly kubectl stops working, new pods cannot be scheduled, and your cluster is effectively dead. In production, this is unacceptable. You need high availability.

Kubespray was built with HA in mind from the start. Unlike kubeadm where you have to manually configure load balancers, set up additional control plane nodes, and worry about etcd quorum, Kubespray handles most of this automatically. But “automatically” does not mean “magically.” You still need to understand what is happening under the hood, because when something breaks at 3 AM, you will be the one fixing it.

Let me walk you through how Kubespray implements HA, starting with the control plane.

Control Plane High Availability

A Kubernetes control plane consists of three main components: the API server, the controller manager, and the scheduler. Each of these has different HA characteristics.

The API server is stateless. It reads from and writes to etcd, but it does not maintain any state itself. This means you can run multiple API server instances simultaneously, and they will all work correctly. Every API server instance is active and can handle requests. This is called an active-active configuration.

The controller manager and scheduler are different. They maintain internal state and make decisions that could conflict if multiple instances ran simultaneously. Imagine two controller managers both deciding to create a replacement pod for a failed one: you would end up with two replacement pods instead of one. To prevent this, Kubernetes uses leader election. Only one instance is active at any time, while the others wait in standby. If the leader fails, one of the standby instances takes over. This is an active-standby configuration.

When you configure multiple control plane nodes in Kubespray, here is what happens. On each control plane node, Kubespray deploys the API server, controller manager, and scheduler as static pods. The API servers all run and accept requests. The controller managers and schedulers all run, but only one of each is the leader. The others continuously try to acquire the leadership lease stored in etcd.

You can see the leader election in action by checking the lease objects:

kubectl get lease -n kube-system

You will see leases named kube-controller-manager and kube-scheduler. The holderIdentity field shows which node currently holds the leadership.
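
To make the mechanics concrete, here is a deliberately simplified Python model of lease-based leader election. It is not the client-go implementation: the Lease class, its field names, and the 15-second duration are illustrative assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    duration: float = 15.0          # seconds a lease stays valid without renewal
    holder: Optional[str] = None    # plays the role of holderIdentity
    renew_time: float = 0.0

    def try_acquire_or_renew(self, candidate: str, now: float) -> bool:
        """Return True if the candidate is the leader after this attempt."""
        expired = self.holder is None or now - self.renew_time > self.duration
        if self.holder == candidate or expired:
            self.holder = candidate
            self.renew_time = now
            return True
        return False


lease = Lease()
print(lease.try_acquire_or_renew("k8s-ctr1", now=0))    # True: lease was free
print(lease.try_acquire_or_renew("k8s-ctr2", now=2))    # False: ctr1 holds it
print(lease.try_acquire_or_renew("k8s-ctr1", now=10))   # True: ctr1 renews
# k8s-ctr1 crashes here and stops renewing.
print(lease.try_acquire_or_renew("k8s-ctr2", now=20))   # False: not expired yet
print(lease.try_acquire_or_renew("k8s-ctr2", now=26))   # True: standby takes over
```

The standby never steals a live lease. It simply keeps retrying until the holder has failed to renew for longer than the lease duration, which is why failover takes a few seconds rather than being instant.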

To configure an HA control plane with Kubespray, you simply list multiple nodes in the kube_control_plane group. Here is an example inventory for a three-node HA setup:

[all]
k8s-ctr1 ansible_host=192.168.10.11 ip=192.168.10.11
k8s-ctr2 ansible_host=192.168.10.12 ip=192.168.10.12
k8s-ctr3 ansible_host=192.168.10.13 ip=192.168.10.13
k8s-w1 ansible_host=192.168.10.21 ip=192.168.10.21
k8s-w2 ansible_host=192.168.10.22 ip=192.168.10.22

[kube_control_plane]
k8s-ctr1
k8s-ctr2
k8s-ctr3
[etcd]
k8s-ctr1
k8s-ctr2
k8s-ctr3
[kube_node]
k8s-w1
k8s-w2
[k8s_cluster:children]
kube_control_plane
kube_node

Notice that all three control plane nodes are also in the etcd group. This is a common pattern called stacked etcd, where etcd runs on the same nodes as the control plane. I will discuss the alternative, external etcd, later in this chapter.

With this inventory, Kubespray will deploy everything needed for HA. But there is one critical piece missing from this picture: how do clients know which API server to talk to?

The Load Balancing Problem

When you have three API servers running on three different nodes, you need some way to distribute traffic among them. If a client always connects to k8s-ctr1:6443 and that node goes down, the client cannot reach the cluster even though k8s-ctr2 and k8s-ctr3 are perfectly healthy.

There are two approaches to solving this: external load balancers and client-side load balancing.

External load balancers sit in front of your API servers and distribute incoming connections. In AWS, you would use an ELB or NLB. In GCP, you would use a GCP Load Balancer. On-premises, you might use HAProxy with keepalived and a virtual IP. The external load balancer provides a single endpoint that clients connect to, and it forwards requests to healthy API servers.

Kubespray does not automatically configure external load balancers. This is a deliberate design decision, not an oversight. The reason is simple: external load balancers are infrastructure-level components that vary dramatically between environments. An AWS NLB requires AWS API calls to provision. A GCP Load Balancer requires GCP API calls. An on-premises HAProxy setup requires virtual IP configuration, which involves network infrastructure that Kubespray cannot possibly know about.

Kubespray is a tool for configuring software on top of your operating system. It is not an infrastructure provisioning tool. That is what Terraform is for. The expected workflow in production is to use Terraform to provision your VMs and external load balancer, then use Kubespray to deploy Kubernetes on those VMs.

However, Kubespray does provide one option for external load balancing that works without infrastructure provisioning: kube-vip. This is a software-based solution that creates a virtual IP address using either ARP (for layer 2 networks) or BGP (for layer 3 networks). If you enable kube-vip in Kubespray, it will deploy as a static pod on your control plane nodes and manage a floating VIP that always points to a healthy API server.

To enable kube-vip, you set these variables in your group_vars:

kube_vip_enabled: true
kube_vip_arp_enabled: true
kube_vip_address: 192.168.10.100

The kube_vip_address should be an unused IP in your network that will become the virtual IP for your API server endpoint.

Client-Side Load Balancing

For internal cluster communication, Kubespray uses a different approach: client-side load balancing. Instead of routing traffic through an external load balancer, each worker node runs its own local load balancer that distributes requests across all API servers.

Here is how it works. On each worker node, Kubespray deploys nginx as a lightweight TCP load balancer. This nginx listens on localhost:6443. The kubelet on that worker node is configured to connect to https://localhost:6443 as its API server endpoint. When kubelet makes a request, nginx receives it and forwards it to one of the actual API servers.

This might seem like an unnecessary extra hop, but it has significant advantages. First, there is no single point of failure in the load balancing layer. If the nginx on one worker node crashes, only that worker node is affected. Other worker nodes continue operating normally. Second, you do not need to provision or maintain any external infrastructure. Third, the latency added by the local nginx proxy is negligible since it is all localhost communication.

The nginx configuration that Kubespray generates looks something like this:

stream {
    upstream kube_apiserver {
        least_conn;
        server 192.168.10.11:6443 max_fails=3 fail_timeout=30s;
        server 192.168.10.12:6443 max_fails=3 fail_timeout=30s;
        server 192.168.10.13:6443 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
        proxy_timeout 10m;
        proxy_connect_timeout 1s;
    }
}

The upstream block lists all your API servers. The least_conn directive tells nginx to send new connections to the server with the fewest active connections, which provides good load distribution. The max_fails and fail_timeout parameters control passive health checking: nginx does not probe the API servers, it only observes real connection attempts. If connections to an API server fail three times within 30 seconds, nginx marks it unavailable and stops sending it traffic for the next 30 seconds.

The server block listens only on 127.0.0.1, meaning only processes on the local machine can connect. This is intentional. The nginx proxy is not meant to be accessed from other machines.
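
If the interaction between least_conn, max_fails, and fail_timeout is hard to picture, the following Python sketch models it. This is a toy simulation under simplifying assumptions, not nginx's actual algorithm: the Upstream class and pick function are hypothetical names, connections are never closed, and when every server is marked down the sketch simply falls back to the full pool.

```python
from dataclasses import dataclass, field


@dataclass
class Upstream:
    address: str
    max_fails: int = 3
    fail_timeout: float = 30.0
    active: int = 0                                   # open connections
    recent_failures: list = field(default_factory=list)
    down_until: float = 0.0

    def available(self, now: float) -> bool:
        return now >= self.down_until

    def record_failure(self, now: float) -> None:
        # Only failures inside the fail_timeout window count.
        window_start = now - self.fail_timeout
        self.recent_failures = [t for t in self.recent_failures if t > window_start]
        self.recent_failures.append(now)
        if len(self.recent_failures) >= self.max_fails:
            # Take the server out of rotation for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.recent_failures.clear()


def pick(upstreams: list, now: float) -> Upstream:
    """Choose the available server with the fewest active connections."""
    candidates = [u for u in upstreams if u.available(now)] or upstreams
    chosen = min(candidates, key=lambda u: u.active)
    chosen.active += 1
    return chosen


pool = [
    Upstream("192.168.10.11:6443"),
    Upstream("192.168.10.12:6443"),
    Upstream("192.168.10.13:6443"),
]
first_round = [pick(pool, now=0.0).address for _ in range(3)]
for t in (1.0, 2.0, 3.0):
    pool[1].record_failure(t)        # three failures within 30 seconds
during_outage = pick(pool, now=5.0).address
print(first_round)
print(during_outage)                 # 192.168.10.11:6443
```

The first three connections spread evenly across the pool. After the second server accumulates three failures, new connections skip it until its 30-second penalty has elapsed.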

The kubelet configuration on worker nodes points to this local proxy:

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: 
    server: https://localhost:6443
  name: cluster.local

When you look at this kubeconfig, you might wonder how the kubelet can connect to localhost:6443 and reach the API server. The answer is nginx. Every request to localhost:6443 goes through nginx, which forwards it to a real API server.

Kubespray calls this approach “localhost load balancing” or “nginx proxy.” You can choose between nginx and haproxy as the local proxy by setting the loadbalancer_apiserver_type variable:

# Default: nginx
loadbalancer_apiserver_type: nginx

# Alternative: haproxy (set only one of these)
loadbalancer_apiserver_type: haproxy

Both work well. nginx is the default and is lighter weight. haproxy provides more sophisticated load balancing options if you need them.

One important detail: control plane nodes do not need this local proxy. They can connect directly to their local API server at localhost:6443 because the API server is running on the same machine. Kubespray only deploys the nginx proxy on nodes that are in kube_node but not in kube_control_plane.

If you have nodes that are both control plane and worker nodes (which is common in smaller clusters), Kubespray handles this correctly. The node uses its local API server directly rather than going through nginx.
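
The placement rule is just a set difference. As a tiny Python illustration (the function name and the dictionary representation of the inventory are my own, not Kubespray code):

```python
def nodes_needing_local_proxy(groups: dict) -> list:
    """Nodes in kube_node that are not also control plane nodes."""
    return sorted(set(groups["kube_node"]) - set(groups["kube_control_plane"]))


inventory = {
    "kube_control_plane": ["k8s-ctr1", "k8s-ctr2", "k8s-ctr3"],
    # k8s-ctr3 doubles as a worker in this hypothetical inventory.
    "kube_node": ["k8s-ctr3", "k8s-w1", "k8s-w2"],
}
print(nodes_needing_local_proxy(inventory))  # ['k8s-w1', 'k8s-w2']
```

k8s-ctr3 is left out even though it is in kube_node, because it already has an API server of its own to talk to.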

etcd High Availability

The API server is stateless, but etcd is not. etcd stores all cluster state: every pod definition, every service, every secret, every configmap. If you lose etcd data, you lose your cluster. This makes etcd HA critically important.

etcd uses the Raft consensus algorithm to replicate data across multiple nodes. I will not go into the full details of Raft here, but the key concept you need to understand is quorum. A quorum is the minimum number of nodes that must agree for a write to be committed. In Raft, the quorum is a majority: more than half of the total nodes.

For a 3-node etcd cluster, the quorum is 2. This means writes require at least 2 nodes to acknowledge them before they are considered committed. If one node is down, the remaining 2 nodes can still form a quorum and the cluster continues operating. But if 2 nodes are down, the single remaining node cannot form a quorum, and the cluster stops accepting writes.

For a 5-node etcd cluster, the quorum is 3. The cluster can tolerate 2 node failures.

This is why etcd clusters should always have an odd number of nodes. Consider what happens with 4 nodes. The quorum is 3 (more than half of 4). The cluster can tolerate only 1 failure, same as a 3-node cluster. But you have to maintain an extra node. You get no additional fault tolerance for the extra operational burden.

Here is the fault tolerance for different cluster sizes:

Nodes    Quorum    Tolerated Failures
1        1         0
3        2         1
5        3         2
7        4         3

Notice that going from 1 to 3 nodes gives you the ability to survive 1 failure. Going from 3 to 5 nodes lets you survive 2 failures. In practice, 3 nodes is sufficient for most production environments. You use 5 nodes when you need to perform rolling maintenance on etcd nodes while still tolerating a failure.

Never run 2 or 4 nodes. With 2 nodes, quorum is 2, so a single node failure makes the cluster unavailable. You would be better off with a single node at that point, since the operational complexity of 2 nodes buys you nothing.
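
The arithmetic behind the table fits in two lines of Python. This sketch (function names are mine) also prints the even cluster sizes so you can see that they add a machine without adding fault tolerance:

```python
def quorum(nodes: int) -> int:
    """Smallest majority: more than half of the members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members can fail while a quorum is still reachable."""
    return nodes - quorum(nodes)


for n in range(1, 8):
    print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Running it shows that 2 nodes tolerate as many failures as 1 node (none), and 4 nodes tolerate as many as 3 nodes (one).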

In the inventory example I showed earlier, all three control plane nodes were also etcd nodes:

[etcd]
k8s-ctr1
k8s-ctr2
k8s-ctr3

This is stacked etcd, where etcd runs on the same machines as the control plane. The alternative is external etcd, where etcd runs on dedicated machines separate from the control plane.

Stacked etcd is simpler to set up and requires fewer machines. For a minimum HA setup with stacked etcd, you need 3 nodes (which are both control plane and etcd). With external etcd, you would need 3 control plane nodes plus 3 etcd nodes, totaling 6 machines.

External etcd has advantages in larger environments. etcd can be resource-intensive under heavy load, and isolating it on dedicated machines ensures that control plane components do not compete with etcd for CPU and memory. External etcd also allows you to scale and maintain the etcd cluster independently from the control plane.

Kubespray supports both configurations. For stacked etcd, you put the same nodes in both kube_control_plane and etcd groups. For external etcd, you put different nodes in each group:

[kube_control_plane]
k8s-ctr1
k8s-ctr2
k8s-ctr3

[etcd]
k8s-etcd1
k8s-etcd2
k8s-etcd3

With this configuration, Kubespray will deploy the control plane on k8s-ctr1 through k8s-ctr3, and etcd on k8s-etcd1 through k8s-etcd3. The API servers on the control plane nodes will be configured to connect to the external etcd cluster.
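
To check which topology an inventory describes, and how many machines it needs, you can compare the two groups. The following Python sketch is an illustration only: the function names and the "mixed" label for partially overlapping groups are my own, not Kubespray terminology.

```python
def etcd_topology(groups: dict) -> str:
    """Classify an inventory as 'stacked', 'external', or 'mixed'."""
    control_plane = set(groups["kube_control_plane"])
    etcd = set(groups["etcd"])
    if not etcd:
        raise ValueError("the etcd group is empty")
    if etcd <= control_plane:
        return "stacked"       # every etcd member is a control plane node
    if etcd.isdisjoint(control_plane):
        return "external"      # etcd runs on dedicated machines
    return "mixed"


def machines_required(groups: dict) -> int:
    """Distinct machines used by the control plane and etcd together."""
    return len(set(groups["kube_control_plane"]) | set(groups["etcd"]))


stacked = {
    "kube_control_plane": ["k8s-ctr1", "k8s-ctr2", "k8s-ctr3"],
    "etcd": ["k8s-ctr1", "k8s-ctr2", "k8s-ctr3"],
}
external = {
    "kube_control_plane": ["k8s-ctr1", "k8s-ctr2", "k8s-ctr3"],
    "etcd": ["k8s-etcd1", "k8s-etcd2", "k8s-etcd3"],
}
print(etcd_topology(stacked), machines_required(stacked))      # stacked 3
print(etcd_topology(external), machines_required(external))    # external 6
```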

etcd Deployment Types in Kubespray

Kubespray supports two ways to run etcd, controlled by the etcd_deployment_type variable.

When etcd_deployment_type is set to “host”, etcd runs as a systemd service directly on the host. Kubespray downloads the etcd binary, generates all necessary certificates, creates the systemd unit file, and starts the service. This is the default in Kubespray and what I have been describing throughout this chapter.

The systemd unit file looks like this:

[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
Type=notify
EnvironmentFile=/etc/etcd.env
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

The environment file /etc/etcd.env contains all the etcd configuration:

ETCD_NAME=k8s-ctr1
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_PEER_URLS=https://192.168.10.11:2380
ETCD_LISTEN_CLIENT_URLS=https://192.168.10.11:2379,https://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.10.11:2380
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.10.11:2379
ETCD_INITIAL_CLUSTER=k8s-ctr1=https://192.168.10.11:2380,k8s-ctr2=https://192.168.10.12:2380,k8s-ctr3=https://192.168.10.13:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=k8s-etcd-cluster
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CLIENT_CERT_AUTH=true
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem
ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CLIENT_CERT_AUTH=true
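
The per-node values in this file follow a simple pattern: every member gets the same ETCD_INITIAL_CLUSTER string, while the name and peer URLs change. The Python sketch below reproduces that pattern for the peer-related settings only. It is not Kubespray's template; the function names and the MEMBERS dictionary are assumptions for illustration.

```python
def initial_cluster(members: dict, peer_port: int = 2380) -> str:
    """Build the name=peer-URL list shared by every member."""
    return ",".join(
        f"{name}=https://{ip}:{peer_port}" for name, ip in members.items()
    )


def peer_settings(name: str, members: dict, peer_port: int = 2380) -> dict:
    """Peer-related settings for one member of the cluster."""
    ip = members[name]
    return {
        "ETCD_NAME": name,
        "ETCD_LISTEN_PEER_URLS": f"https://{ip}:{peer_port}",
        "ETCD_INITIAL_ADVERTISE_PEER_URLS": f"https://{ip}:{peer_port}",
        "ETCD_INITIAL_CLUSTER": initial_cluster(members, peer_port),
    }


MEMBERS = {
    "k8s-ctr1": "192.168.10.11",
    "k8s-ctr2": "192.168.10.12",
    "k8s-ctr3": "192.168.10.13",
}

for key, value in peer_settings("k8s-ctr2", MEMBERS).items():
    print(f"{key}={value}")
```

If a member's name or peer URL in ETCD_INITIAL_CLUSTER does not match what that member advertises, it will not be able to join, so generating these values from one source of truth avoids a whole class of typos.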

When etcd_deployment_type is set to “kubeadm”, etcd runs as a static pod managed by kubeadm. This is how kubeadm normally deploys etcd when you run kubeadm init. The etcd manifest goes in /etc/kubernetes/manifests/etcd.yaml, and kubelet manages the pod.

The host deployment type gives you more control over etcd lifecycle. You can start, stop, and manage etcd independently of Kubernetes. The kubeadm deployment type integrates etcd more tightly with the Kubernetes control plane, which can simplify some operations but makes etcd management dependent on kubelet.

Most Kubespray users stick with the default host deployment type.

etcd Certificates

etcd uses TLS for all communication, both between etcd nodes (peer communication) and between clients and etcd (client communication). Kubespray generates a complete certificate hierarchy for etcd.

The CA certificate and key are stored at:

/etc/ssl/etcd/ssl/ca.pem
/etc/ssl/etcd/ssl/ca-key.pem

For each etcd node, Kubespray generates member certificates used for peer communication:

/etc/ssl/etcd/ssl/member-k8s-ctr1.pem
/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem

For administrative access to etcd (like running etcdctl commands), Kubespray generates admin certificates:

/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem
/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem

For the Kubernetes API server to connect to etcd, Kubespray generates node certificates:

/etc/ssl/etcd/ssl/node-k8s-ctr1.pem
/etc/ssl/etcd/ssl/node-k8s-ctr1-key.pem

When you need to run etcdctl commands, you must provide the appropriate certificates:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \
  --endpoints=https://192.168.10.11:2379 \
  endpoint health

If you forget the certificates, etcdctl will fail to connect because etcd requires mutual TLS authentication for all connections.

To check the health of your entire etcd cluster:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \
  --endpoints=https://192.168.10.11:2379,https://192.168.10.12:2379,https://192.168.10.13:2379 \
  endpoint health

This will show the health status of each endpoint:

https://192.168.10.11:2379 is healthy: successfully committed proposal: took = 2.5ms
https://192.168.10.12:2379 is healthy: successfully committed proposal: took = 2.7ms
https://192.168.10.13:2379 is healthy: successfully committed proposal: took = 2.4ms

To see which node is the current etcd leader:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \
  --endpoints=https://192.168.10.11:2379,https://192.168.10.12:2379,https://192.168.10.13:2379 \
  endpoint status --write-out=table

The output includes a column showing whether each node is the leader.
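
Typing those four flags every time gets old quickly. Here is a small Python helper that assembles the command line from the file naming scheme shown above. Treat it as a sketch: the function name etcdctl_command is hypothetical, and it only builds and prints the command rather than running it.

```python
import shlex


def etcdctl_command(node, ips, *subcommand,
                    ssl_dir="/etc/ssl/etcd/ssl", client_port=2379):
    """Return the etcdctl argument list for the given node's admin cert."""
    endpoints = ",".join(f"https://{ip}:{client_port}" for ip in ips)
    return [
        "etcdctl",
        f"--cacert={ssl_dir}/ca.pem",
        f"--cert={ssl_dir}/admin-{node}.pem",
        f"--key={ssl_dir}/admin-{node}-key.pem",
        f"--endpoints={endpoints}",
        *subcommand,
    ]


cmd = etcdctl_command(
    "k8s-ctr1",
    ["192.168.10.11", "192.168.10.12", "192.168.10.13"],
    "endpoint", "status", "--write-out=table",
)
print(shlex.join(cmd))
```

To actually execute it, you could pass cmd to subprocess.run with ETCDCTL_API=3 added to the environment, on a node where the admin certificate for that node name exists.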

Putting It All Together

Let me walk through what happens when you deploy an HA cluster with Kubespray.

First, Kubespray reads your inventory and identifies which nodes belong to which groups. It sees three nodes in kube_control_plane, three nodes in etcd (the same three), and two nodes in kube_node.

Kubespray begins by bootstrapping all nodes: installing Python, configuring the OS, loading kernel modules, setting kernel parameters, and disabling swap.

Next, Kubespray installs the container runtime (containerd) on all nodes. Every node needs it to run pods, and on control plane nodes the Kubernetes components themselves run in containers as static pods.

Then Kubespray sets up the etcd cluster. It generates the CA certificate, generates member certificates for each etcd node, creates the systemd unit files, and starts etcd on all three nodes. The etcd nodes discover each other using the ETCD_INITIAL_CLUSTER configuration and form a cluster.

With etcd running, Kubespray initializes the first control plane node using kubeadm init. The kubeadm configuration specifies the external etcd endpoints:

etcd:
  external:
    endpoints:
      - https://192.168.10.11:2379
      - https://192.168.10.12:2379
      - https://192.168.10.13:2379
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-k8s-ctr1.pem
    keyFile: /etc/ssl/etcd/ssl/node-k8s-ctr1-key.pem

This tells kubeadm not to deploy its own etcd, but instead connect to the external etcd cluster that Kubespray already set up.

After the first control plane node is ready, Kubespray joins the other two control plane nodes using kubeadm join with the --control-plane flag. These nodes receive copies of the certificates and become additional control plane members.

For the worker nodes, Kubespray first deploys the nginx load balancer. It generates the nginx configuration with all three API server addresses in the upstream block, starts nginx, and verifies it is working. Then it runs kubeadm join on the worker nodes with their kubeconfig pointing to localhost:6443.

Finally, Kubespray deploys the CNI plugin, CoreDNS, and any other addons you have enabled.

The result is a fully functional HA cluster. You can lose any one control plane node and the cluster continues operating. You can lose any one etcd node and data remains safe. Worker nodes automatically fail over to healthy API servers through their local nginx proxies.

Testing Your HA Setup

After deploying an HA cluster, you should test that failover actually works. Here is how to do it.

First, verify the cluster is healthy:

kubectl get nodes
kubectl get pods -A

All nodes should be Ready and all system pods should be Running.

Check that all API servers are responding:

kubectl get --raw='/healthz' --server=https://192.168.10.11:6443 --insecure-skip-tls-verify
kubectl get --raw='/healthz' --server=https://192.168.10.12:6443 --insecure-skip-tls-verify
kubectl get --raw='/healthz' --server=https://192.168.10.13:6443 --insecure-skip-tls-verify

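Each should return “ok”. Note that /healthz has been deprecated since Kubernetes v1.16; /livez and /readyz are the finer-grained replacements and can be queried the same way.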

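Now simulate a control plane failure. SSH into one of your control plane nodes, stop the kubelet, and then stop the control plane containers:

systemctl stop kubelet
crictl ps -q --name 'kube-(apiserver|controller-manager|scheduler)' | xargs -r crictl stop

Stopping the kubelet by itself is not enough: containerd keeps already-running containers alive, so the static pods would carry on serving. With the kubelet stopped, nothing restarts the containers once crictl stops them, so the API server, controller manager, and scheduler on that node stay down until the kubelet comes back.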

From another machine, verify the cluster is still operational:

kubectl get nodes
kubectl run test --image=nginx --restart=Never
kubectl get pods

The node you stopped will show NotReady, but you can still create pods. They will be scheduled on the remaining nodes.

Check whether leadership moved to another node:

kubectl get lease -n kube-system kube-controller-manager -o yaml
kubectl get lease -n kube-system kube-scheduler -o yaml

If the stopped node was the leader, you will see a different holderIdentity now.

Start the kubelet again:

systemctl start kubelet

The node will rejoin the cluster and become Ready again.

For etcd, the test is similar but more dangerous. Do not do this on a production cluster without proper backups. Stop etcd on one node:

systemctl stop etcd

Verify the etcd cluster is still healthy from another node:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr2.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr2-key.pem \
  --endpoints=https://192.168.10.12:2379,https://192.168.10.13:2379 \
  endpoint health

The two remaining endpoints should be healthy. Create a configmap to verify writes work:

kubectl create configmap test-ha --from-literal=key=value
kubectl get configmap test-ha -o yaml

Start etcd again:

systemctl start etcd

The node rejoins the cluster and syncs data from the other members.

These tests give you confidence that your HA setup actually works. Run them during initial deployment and periodically afterward to catch any configuration drift.

Chapter 13: Certificate Management and Auto-Renewal

If you have ever run a Kubernetes cluster in production for more than a few months, you have probably experienced that sinking feeling when you realize your cluster certificates are about to expire. Or worse, you have already experienced the chaos that ensues when they do expire and suddenly nothing works. The API server refuses connections, kubelet cannot authenticate, and your perfectly healthy cluster becomes completely unresponsive.

Kubernetes relies heavily on TLS certificates for secure communication between components. The API server needs certificates. The kubelet needs certificates. The controller-manager, scheduler, etcd — they all need certificates. And by default, kubeadm generates certificates with a one-year validity period. That sounds like plenty of time until you realize that a year passes quickly, and nobody remembered to put certificate renewal on the calendar.

Kubespray addresses this problem by shipping a built-in automatic certificate renewal mechanism that you can switch on. In this chapter, we will explore how Kubernetes certificates work, how Kubespray configures automatic renewal, and how you can manually manage certificates when needed.

Understanding Kubernetes Certificates

Before diving into renewal, let us understand what certificates exist in a Kubernetes cluster. When you run kubeadm init, it generates a whole tree of certificates under /etc/kubernetes/pki. Each certificate serves a specific purpose in the authentication and encryption chain.

The certificate authority sits at the root. Everything else chains back to it.

/etc/kubernetes/pki/
├── ca.crt                    # Kubernetes CA certificate
├── ca.key                    # Kubernetes CA private key
├── apiserver.crt             # API server serving certificate
├── apiserver.key
├── apiserver-kubelet-client.crt    # API server → kubelet client cert
├── apiserver-kubelet-client.key
├── apiserver-etcd-client.crt       # API server → etcd client cert
├── apiserver-etcd-client.key
├── front-proxy-ca.crt        # Front proxy CA (for API aggregation)
├── front-proxy-ca.key
├── front-proxy-client.crt    # Front proxy client certificate
├── front-proxy-client.key
├── etcd/
│   ├── ca.crt                # etcd CA certificate
│   ├── ca.key
│   ├── server.crt            # etcd server certificate
│   ├── server.key
│   ├── peer.crt              # etcd peer communication
│   ├── peer.key
│   ├── healthcheck-client.crt
│   └── healthcheck-client.key
├── sa.pub                    # Service account public key
└── sa.key                    # Service account private key

The CA certificates (ca.crt, etcd/ca.crt, front-proxy-ca.crt) have a default validity of 10 years. You generally do not need to worry about these expiring anytime soon. However, all the other certificates — the ones actually used for day-to-day communication — expire after just one year.

When Kubespray deploys a cluster with etcd_deployment_type set to host (which runs etcd as a systemd service rather than a static pod), etcd gets its own separate certificate infrastructure under /etc/ssl/etcd/ssl:

/etc/ssl/etcd/ssl/
├── ca.pem                    # etcd CA
├── ca-key.pem
├── member-k8s-ctr.pem        # etcd member certificate
├── member-k8s-ctr-key.pem
├── admin-k8s-ctr.pem         # etcd admin certificate (for etcdctl)
├── admin-k8s-ctr-key.pem
├── node-k8s-ctr.pem          # client cert for API server → etcd
└── node-k8s-ctr-key.pem

This separation exists because Kubespray manages etcd independently from kubeadm when using host deployment. The etcd certificates are generated by Kubespray’s etcd role, not by kubeadm.

Checking Certificate Expiration

Before we talk about renewal, you should know how to check when your certificates expire. The kubeadm tool provides a convenient command for this:

kubeadm certs check-expiration

The output looks something like this:

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Jan 28, 2027 12:34 UTC   364d            ca                      no
apiserver                  Jan 28, 2027 12:34 UTC   364d            ca                      no
apiserver-etcd-client      Jan 28, 2027 12:34 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Jan 28, 2027 12:34 UTC   364d            ca                      no
controller-manager.conf    Jan 28, 2027 12:34 UTC   364d            ca                      no
front-proxy-client         Jan 28, 2027 12:34 UTC   364d            front-proxy-ca          no
scheduler.conf             Jan 28, 2027 12:34 UTC   364d            ca                      no
CERTIFICATE AUTHORITY      EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                         Jan 25, 2036 12:34 UTC   9y              no
etcd-ca                    Jan 25, 2036 12:34 UTC   9y              no
front-proxy-ca             Jan 25, 2036 12:34 UTC   9y              no

Notice the RESIDUAL TIME column. When that number starts getting small — say, less than 30 days — you should be concerned. When it hits zero, you have a problem.
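As a rough illustration of that 30-day rule, the Python sketch below scans text in the format shown above and lists the entries whose RESIDUAL TIME falls below a threshold. The function names, the column-splitting heuristic, and the unit table are assumptions made for this example; they are not part of kubeadm, and the parser only understands single-unit values such as 364d or 9y.

```python
import re

# Approximate length of each unit in days (assumption: a year is 365 days).
UNIT_DAYS = {"y": 365.0, "d": 1.0, "h": 1 / 24, "m": 1 / 1440, "s": 1 / 86400}


def residual_days(residual):
    """Convert a value such as '364d' or '9y' into days; unknown values count as 0."""
    match = re.fullmatch(r"(\d+)([ydhms])", residual)
    if not match:
        return 0.0  # treat anything unparseable as already expired
    return int(match.group(1)) * UNIT_DAYS[match.group(2)]


def expiring_soon(output, threshold_days=30):
    """Return the names of entries whose residual time is below the threshold."""
    names = []
    for line in output.splitlines():
        # Columns are separated by runs of two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 3 or cols[0].startswith("CERTIFICATE"):
            continue  # skip blank lines and both header rows
        if residual_days(cols[2]) < threshold_days:
            names.append(cols[0])
    return names
```

Feeding it the captured output of kubeadm certs check-expiration returns an empty list while everything is healthy, which makes it easy to wire into whatever alerting you already run.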

If you want to check a specific certificate file directly, openssl works perfectly:

openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

This outputs:

notBefore=Jan 28 12:34:00 2026 GMT
notAfter=Jan 28 12:34:00 2027 GMT

For the etcd certificates managed by Kubespray:

openssl x509 -in /etc/ssl/etcd/ssl/member-k8s-ctr.pem -noout -dates
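If you collect these dates from several files, turning them into a "days remaining" number is more useful than reading timestamps by eye. The following is a minimal Python sketch that parses the notBefore/notAfter lines; the function names are made up for this example, and it assumes the English month names and GMT suffix that openssl prints by default.

```python
from datetime import datetime, timezone


def parse_openssl_dates(text):
    """Parse `openssl x509 -noout -dates` output into timezone-aware datetimes."""
    dates = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("notBefore", "notAfter"):
            parsed = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y %Z")
            dates[key] = parsed.replace(tzinfo=timezone.utc)
    return dates


def days_remaining(not_after, now):
    """Whole days left until the certificate expires (negative once expired)."""
    return (not_after - now).days
```

In practice you would pass datetime.now(timezone.utc) as the second argument; it is a parameter here so the calculation can be checked against a fixed date.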

You can also check who signed the certificate and what names it is valid for:

openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject:"
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"

The Subject Alternative Name (SAN) field is particularly important for the API server certificate. It must include all the names and IPs that clients use to connect to the API server. If you add a new control plane node or change the API server’s external DNS name, you might need to regenerate this certificate with the new SANs.
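To make the SAN check concrete, here is a small Python sketch that takes the comma-separated SAN line printed by openssl and reports which required names or IPs are absent. The helper names and the example hostnames are hypothetical, and the comparison is a plain string match, so wildcard entries are not expanded.

```python
def parse_san_entries(san_line):
    """Split an openssl SAN line into sets of DNS names and IP addresses."""
    entries = {"DNS": set(), "IP": set()}
    for item in san_line.split(","):
        # Each item looks like 'DNS:kubernetes' or 'IP Address:10.233.0.1'.
        kind, _, value = item.strip().partition(":")
        if kind == "DNS":
            entries["DNS"].add(value)
        elif kind == "IP Address":
            entries["IP"].add(value)
    return entries


def missing_sans(san_line, required_dns=(), required_ips=()):
    """Return the required names and IPs that the certificate does not list."""
    entries = parse_san_entries(san_line)
    missing = [name for name in required_dns if name not in entries["DNS"]]
    missing += [ip for ip in required_ips if ip not in entries["IP"]]
    return missing
```

An empty result means every name your clients use is already covered; anything returned is a name that would trigger a TLS verification error.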

Enabling Automatic Renewal in Kubespray

Kubespray makes certificate renewal straightforward. In your inventory’s group_vars, you enable automatic renewal with a single variable:

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
auto_renew_certificates: true

That is it. When this variable is set to true, Kubespray configures a systemd timer on each control plane node that automatically renews certificates on a monthly schedule.

But what exactly does “monthly” mean? Kubespray allows you to customize the schedule using systemd’s calendar syntax:

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
auto_renew_certificates: true
auto_renew_certificates_systemd_calendar: "Mon *-*-1,2,3,4,5,6,7 03:00:00"

This particular schedule means “every Monday that falls on the 1st through 7th of any month, at 3:00 AM.” In other words, the first Monday of every month at 3 AM. This is a sensible default — it runs monthly, during off-peak hours, on a predictable day.

You can adjust this to whatever schedule fits your operational needs. Maybe you want it to run weekly:

auto_renew_certificates_systemd_calendar: "Mon *-*-* 03:00:00"

Or perhaps you want more control and prefer to run it only on specific months:

auto_renew_certificates_systemd_calendar: "*-1,4,7,10-01 03:00:00"

The systemd calendar syntax is flexible but has a learning curve. You can test your calendar expressions with the systemd-analyze tool:

systemd-analyze calendar "Mon *-*-1,2,3,4,5,6,7 03:00:00"

This outputs when the next trigger times would be, helping you verify your schedule is correct.
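If you want to double-check the "first Monday of the month" reading away from a systemd host, the rule is easy to reproduce. This Python sketch computes the same dates; it is only an approximation of the calendar expression (it ignores the 03:00 time of day), and the function names are invented for the example.

```python
from datetime import date, timedelta


def first_monday(year, month):
    """Return the first Monday of the given month (always day 1 through 7)."""
    first = date(year, month, 1)
    days_until_monday = (0 - first.weekday()) % 7  # Monday is weekday 0
    return first + timedelta(days=days_until_monday)


def next_runs(start, count):
    """Return the next `count` first-Monday dates on or after `start`."""
    runs = []
    year, month = start.year, start.month
    while len(runs) < count:
        candidate = first_monday(year, month)
        if candidate >= start:
            runs.append(candidate)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return runs
```

Calling next_runs(date.today(), 3) should agree with the dates that systemd-analyze reports for the default schedule.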

How the Auto-Renewal System Works

When Kubespray runs with auto_renew_certificates enabled, it creates two systemd units on each control plane node: a timer and a service.

The timer unit lives at /etc/systemd/system/k8s-certs-renew.timer:

[Unit]
Description=Renew K8S control plane certificates
[Timer]
OnCalendar=Mon *-*-1,2,3,4,5,6,7 03:00:00
Persistent=true
[Install]
WantedBy=timers.target

The Persistent=true setting is important. It means that if the system was powered off when the timer should have fired, systemd will run the service immediately upon next boot. This prevents missed renewals due to maintenance windows or unexpected downtime.

The service unit at /etc/systemd/system/k8s-certs-renew.service defines what actually happens when the timer fires:

[Unit]
Description=Renew K8S control plane certificates
After=kubelet.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/k8s-certs-renew.sh

The service is of type oneshot, meaning it runs once and exits rather than staying resident. After=kubelet.service is an ordering directive: when both units are started together, the renewal waits until the kubelet has been started. It does not start the kubelet by itself.

The actual renewal logic lives in the shell script at /usr/local/bin/k8s-certs-renew.sh:

#!/bin/bash
set -e

# Renew all certificates
/usr/local/bin/kubeadm certs renew all

# Touch the static pod manifests to trigger kubelet to restart them
# kubelet watches these files and restarts pods when they change

find /etc/kubernetes/manifests/ -type f -exec touch {} \;

The script, as excerpted here, does two things. First, it calls kubeadm certs renew all to generate new certificates. Second, it touches the static pod manifest files so that the kubelet, which watches that directory, re-reads them and restarts the corresponding static pods. The goal is for the API server, controller-manager, and scheduler to pick up the new certificates without a full node reboot. Be aware that a timestamp-only change may not be enough to restart a static pod on every kubelet version; the kubeadm documentation instead describes moving a manifest out of /etc/kubernetes/manifests, waiting about 20 seconds, and moving it back. The kubelet’s own client certificate is not handled by this excerpt; it rotates through a separate mechanism, covered in the Kubelet Certificate Rotation section.

You can verify the timer is active and check when it last ran:

systemctl list-timers --all | grep k8s-certs-renew

The output shows the next scheduled run and the last time it triggered:

NEXT                        LEFT          LAST                        PASSED       UNIT                    ACTIVATES
Mon 2026-02-02 03:00:00 KST 4 days left   Mon 2026-01-05 03:00:00 KST 3 weeks ago  k8s-certs-renew.timer   k8s-certs-renew.service

To see detailed status of the timer:

systemctl status k8s-certs-renew.timer

And to check the logs from the last renewal:

journalctl -u k8s-certs-renew.service

Manual Certificate Renewal

Sometimes you need to renew certificates manually. Perhaps you are approaching expiration and do not want to wait for the next scheduled run. Or maybe you need to regenerate certificates after adding new SANs.

The kubeadm certs renew command handles this. To renew all certificates at once:

kubeadm certs renew all

The output confirms each certificate was renewed:

[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler to use renewed

If you only want to renew specific certificates, kubeadm supports that too:

kubeadm certs renew apiserver
kubeadm certs renew apiserver-kubelet-client
kubeadm certs renew apiserver-etcd-client
kubeadm certs renew front-proxy-client
kubeadm certs renew admin.conf
kubeadm certs renew controller-manager.conf
kubeadm certs renew scheduler.conf

After manual renewal, you must restart the control plane components to pick up the new certificates. The simplest approach is to restart the kubelet, which will restart all static pods:

systemctl restart kubelet

Alternatively, you can use the touch trick to trigger individual pod restarts:

touch /etc/kubernetes/manifests/kube-apiserver.yaml
touch /etc/kubernetes/manifests/kube-controller-manager.yaml
touch /etc/kubernetes/manifests/kube-scheduler.yaml

Give it a minute, then verify the components are back up:

kubectl get pods -n kube-system

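You should see the control plane pods with recent start times. If the start times have not changed, move each manifest out of /etc/kubernetes/manifests for about 20 seconds and then back, which is the restart method described in the kubeadm documentation.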

Renewing etcd Certificates

When Kubespray deploys etcd using etcd_deployment_type: host, the etcd certificates are separate from the Kubernetes PKI and are not managed by kubeadm. Kubespray generates these certificates using its own etcd role.

To renew etcd certificates, you would typically re-run the Kubespray etcd role. However, for manual renewal, you need to understand the certificate structure.

The etcd certificates under /etc/ssl/etcd/ssl serve different purposes:

The member certificate (member-hostname.pem) is used by the etcd server for both serving clients and communicating with peer etcd nodes. The admin certificate (admin-hostname.pem) is used by etcdctl for administrative operations. The node certificate (node-hostname.pem) is used by the Kubernetes API server to authenticate when connecting to etcd.

To check etcd certificate expiration:

openssl x509 -in /etc/ssl/etcd/ssl/member-k8s-ctr.pem -noout -dates
openssl x509 -in /etc/ssl/etcd/ssl/admin-k8s-ctr.pem -noout -dates
openssl x509 -in /etc/ssl/etcd/ssl/node-k8s-ctr.pem -noout -dates

If you need to regenerate etcd certificates, the safest approach is to re-run Kubespray’s etcd role:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags etcd

This regenerates the certificates and restarts etcd as needed.

For testing etcd connectivity with the admin certificate:

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \
  --endpoints=https://192.168.10.10:2379 \
  endpoint health

A healthy response looks like:

https://192.168.10.10:2379 is healthy: successfully committed proposal: took = 12.345ms

Certificate Renewal and Kubeconfig Files

One thing that catches people off guard is that kubeadm certs renew all also regenerates the kubeconfig files for admin, controller-manager, and scheduler. These kubeconfig files contain embedded client certificates.

After renewal, if you have copied the admin.conf to your local machine (as ~/.kube/config), your local copy is now outdated. The embedded certificate in your local copy is the old one, which will eventually expire.

You need to copy the renewed admin.conf again:

# On the control plane node
cat /etc/kubernetes/admin.conf

# Or copy it to your local machine
scp root@192.168.10.10:/etc/kubernetes/admin.conf ~/.kube/config

Remember to update the server address if needed. Depending on how the cluster was deployed, the admin.conf on the control plane node may point to 127.0.0.1:6443, which only works from the control plane node itself:

sed -i 's/127.0.0.1/192.168.10.10/g' ~/.kube/config

Kubelet Certificate Rotation

The kubelet has its own certificate rotation mechanism that works independently from the control plane certificates. When you look at the kubelet configuration, you will see:

# /etc/kubernetes/kubelet-config.yaml
rotateCertificates: true
serverTLSBootstrap: true

With rotateCertificates: true, the kubelet automatically requests new client certificates from the API server before the current ones expire. This happens transparently without any manual intervention.

The kubelet’s certificates are stored in /var/lib/kubelet/pki:

/var/lib/kubelet/pki/
├── kubelet.crt          # kubelet serving certificate
├── kubelet.key
├── kubelet-client-current.pem    # current client certificate (symlink)
├── kubelet-client-2026-01-28-12-34-56.pem    # actual client cert file
└── ...

The kubelet-client-current.pem is a symlink that points to the currently active client certificate. When rotation occurs, a new certificate file is created and the symlink is updated. Old certificate files are kept around for a while before being cleaned up.

You can check the kubelet’s client certificate expiration:

openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

If serverTLSBootstrap is enabled, the kubelet’s serving certificate (used when the API server connects to the kubelet for logs, exec, port-forward, etc.) is also managed automatically. However, these certificates require manual approval by default. You can check pending CSRs:

kubectl get csr

If you see pending CSRs for kubelet serving certificates, approve them:

kubectl certificate approve <csr-name>

In a Kubespray-deployed cluster, this is typically handled automatically or approved in bulk during initial cluster setup.

Troubleshooting Certificate Issues

When certificates expire or become invalid, the symptoms are often confusing. The API server might refuse connections. kubectl commands fail with TLS errors. Pods cannot be scheduled. Here are some common scenarios and how to diagnose them.

If kubectl fails with a certificate error:

kubectl get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid

First, check if the issue is your local kubeconfig or the server certificates:

# Check your local kubeconfig's embedded certificate
kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -dates
# SSH to the control plane and check server certificates
kubeadm certs check-expiration

If the API server certificate has expired, you cannot use kubectl to check things. You need to SSH directly to the control plane node and work locally.

If etcd refuses connections:

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health
Error: context deadline exceeded

Check etcd logs for certificate errors:

journalctl -u etcd | grep -iE "certificate|tls|x509"

If you see certificate expiration errors, you need to regenerate the etcd certificates and restart etcd.

If the API server cannot connect to etcd:

kubectl get pods -n kube-system

# or check API server logs
journalctl -u kubelet | grep apiserver

Look for errors like:

connection error: desc = "transport: authentication handshake failed: x509: certificate has expired"

This indicates the API server’s client certificate for etcd (apiserver-etcd-client.crt) has expired. Renew it with:

kubeadm certs renew apiserver-etcd-client

Then restart the API server:

touch /etc/kubernetes/manifests/kube-apiserver.yaml

Best Practices for Certificate Management

After going through all this, here are some practical recommendations for managing certificates in production.

First, always enable automatic renewal. There is no good reason not to have auto_renew_certificates: true in your Kubespray configuration. The overhead is negligible, and it prevents the most common certificate disaster.

Second, monitor certificate expiration. Even with auto-renewal, you should monitor expiration dates. Set up alerts that fire when any certificate has less than 30 days of validity remaining. This gives you time to investigate if auto-renewal is not working for some reason.

Third, test renewal before it matters. Do not wait until certificates are about to expire to find out if your renewal process works. Periodically run manual renewal in a test environment. Verify that services come back up correctly after renewal.

Fourth, keep your kubeconfig files updated. After any certificate renewal, remember to update the kubeconfig files on any machine that accesses the cluster remotely. This includes CI/CD systems, monitoring tools, and developer workstations.

Fifth, document your certificate architecture. Know which certificates exist, what they are for, and how they are renewed. When something goes wrong at 3 AM, you want this information readily available, not scattered across blog posts and Stack Overflow answers.

Finally, consider the CA certificate expiration. While the CA certificates are valid for 10 years by default, that day will eventually come. CA rotation is significantly more complex than regular certificate renewal because every certificate signed by the old CA must be regenerated. Plan for this well in advance — years in advance, ideally.

The peace of mind that comes from properly configured certificate management is worth every minute spent understanding how it works. When your cluster crosses the one-year mark and everything continues running smoothly because certificates renewed themselves automatically, you will appreciate having set this up correctly from the beginning.

Chapter 14: Cluster Operations — Upgrades, Scaling, and Backup

Running a Kubernetes cluster is not a one-time event. Once you have a production cluster humming along, you will inevitably need to upgrade it when new versions come out, add or remove nodes as your workload changes, and maintain backups in case something goes catastrophically wrong.

Kubespray provides playbooks for all of these operations, but as we discussed in the opening chapters, understanding what these playbooks actually do is crucial. Blindly running upgrade-cluster.yml without understanding the implications can leave you with a half-upgraded cluster in an inconsistent state.

Let me walk you through each of these operational tasks in detail.

Cluster Upgrades

Kubernetes moves fast. A new minor version drops every three to four months, and security patches come even more frequently. If you want to stay current and benefit from bug fixes, performance improvements, and new features, you need to upgrade regularly. The alternative is running an increasingly outdated cluster that drifts further from community support.

Kubespray handles upgrades through the upgrade-cluster.yml playbook. The basic invocation looks like this:

ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.34.0

That single command kicks off a carefully orchestrated sequence of operations across your entire cluster. But before you hit enter, you need to understand what is about to happen.

The upgrade proceeds in a specific order. First, Kubespray upgrades etcd if a new version is required. Then it moves to the control plane nodes, upgrading them one at a time. After the control plane is fully upgraded, it proceeds to the worker nodes, again one at a time. Finally, it upgrades the CNI plugin and any addons that need updating.

This sequenced approach is not arbitrary. Kubernetes has a version skew policy that dictates which component versions are supported together. The kube-apiserver must be the newest component. The kubelet must never be newer than the apiserver; in current releases it may be up to three minor versions older (two in older Kubernetes releases). The controller-manager and scheduler must not be newer than the apiserver either and may be at most one minor version older. Outside these limits the combination is unsupported, and components may fail to work together.
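The kubelet rule boils down to a small comparison of minor versions. The Python sketch below shows one way to express it; the function names are hypothetical, both versions are assumed to share the same major number, and the allowed gap is passed in explicitly because the exact limit depends on the Kubernetes release you run.

```python
def minor_version(version):
    """Extract the minor number from strings such as 'v1.33.7' or '1.33'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def kubelet_skew_ok(apiserver, kubelet, max_minor_skew):
    """True if the kubelet is not newer than the apiserver and within the allowed gap."""
    gap = minor_version(apiserver) - minor_version(kubelet)
    return 0 <= gap <= max_minor_skew
```

A check like this is handy mid-upgrade, when the control plane is already on the new version and workers are still being rolled one at a time.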

The upgrade-cluster.yml playbook respects these constraints automatically, but you still need to follow the one-minor-version-at-a-time rule. If you are running Kubernetes 1.30 and want to get to 1.33, you cannot jump directly. You must upgrade to 1.31 first, verify everything works, then upgrade to 1.32, verify again, and finally upgrade to 1.33. Each hop requires a separate playbook execution:

ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.31.0

Wait for completion, run your smoke tests, check that all pods are running, verify that your applications are healthy. Then proceed to the next version:

ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.32.0

And so on. This is tedious but necessary. Skipping versions is asking for trouble.
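For planning purposes, the list of intermediate hops can be generated rather than worked out by hand. This Python sketch is only an illustration with a made-up function name: it returns minor release lines, not concrete patch versions, so you still choose the patch release to pass as kube_version for each hop.

```python
def upgrade_hops(current, target):
    """List every minor release line to pass through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are handled")
    return [f"v{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

Each entry in the result corresponds to one upgrade-cluster.yml run followed by its own round of verification.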

Before any upgrade, you should verify the current state of your cluster. Check the existing versions:

kubectl get nodes -o wide

This shows you the kubelet version on each node. Check the control plane component versions:

kubectl get pods -n kube-system -o custom-columns="NAME:.metadata.name,IMAGE:.spec.containers[*].image" | grep -E "apiserver|controller|scheduler"

Make sure everything is healthy before you start:

kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed

Any pods stuck in CrashLoopBackOff or Pending state should be investigated and fixed before upgrading. Upgrades will not magically fix existing problems; they are more likely to make them worse.
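If you want this pre-flight check as a gate in a script rather than something you eyeball, the same filter can be written in a few lines. The Python sketch below parses the default table output of kubectl get pods -A; the function name is hypothetical, and it looks only at the STATUS column, so a pod that is Running but not Ready would still pass.

```python
HEALTHY_STATUSES = {"Running", "Completed"}


def unhealthy_pods(kubectl_output):
    """Return (namespace, name, status) for pods that are neither Running nor Completed."""
    problems = []
    for line in kubectl_output.splitlines():
        cols = line.split()
        if len(cols) < 4 or cols[0] == "NAMESPACE":
            continue  # skip blank lines and the header row
        namespace, name, _ready, status = cols[:4]
        if status not in HEALTHY_STATUSES:
            problems.append((namespace, name, status))
    return problems
```

An upgrade wrapper could refuse to continue while this returns anything.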

You should also take an etcd backup before upgrading. I will cover the backup procedure in detail later in this chapter, but the short version is:

ETCDCTL_API=3 etcdctl snapshot save /backup/pre-upgrade-snapshot.db \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem \
  --endpoints=https://127.0.0.1:2379

With your backup secured and your cluster verified healthy, you can proceed with the upgrade. Watch the output carefully. Ansible will show you each task as it executes, and the profile_tasks callback will tell you how long each step takes. If something fails, the playbook will stop. Do not panic. Read the error message carefully, fix the underlying issue, and re-run the playbook. Thanks to Ansible’s idempotency, tasks that already completed successfully will be skipped, and execution will resume from approximately where it left off.

After the upgrade completes, run your verification checks again:

kubectl get nodes -o wide
kubectl get pods -A
kubectl get cs

That last command checks component status, though note that it has been deprecated in newer Kubernetes versions and may not show useful information.

One more thing about upgrades: test them in a non-production environment first. Always. Set up a dev or staging cluster that mirrors your production configuration, run the upgrade there, exercise your applications, look for problems. Only after you have confidence that the upgrade is safe should you touch production. This advice seems obvious, but the number of teams that skip this step because they are in a hurry is distressingly high.

Node Scaling

Clusters are not static. Your workload grows, and you need more capacity. A node fails catastrophically, and you need to replace it. You are migrating to new hardware and need to gracefully transition. All of these scenarios involve adding or removing nodes from your cluster.

Adding nodes is straightforward with Kubespray. First, update your inventory file to include the new node:

[all]
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10
k8s-w1 ansible_host=192.168.10.21 ip=192.168.10.21
k8s-w2 ansible_host=192.168.10.22 ip=192.168.10.22
k8s-w3 ansible_host=192.168.10.23 ip=192.168.10.23  # New node
[kube_control_plane]
k8s-ctr
[etcd]
k8s-ctr
[kube_node]
k8s-w1
k8s-w2
k8s-w3  # New node added to worker group

Make sure the new node meets all the prerequisites. It needs SSH access configured with your Ansible control node’s public key. It needs Python installed. The network needs to allow communication with the existing cluster nodes on the required ports.

Then run the scale playbook with the limit option to target only the new node:

ansible-playbook -i inventory/mycluster/inventory.ini scale.yml \
  --limit=k8s-w3

The limit flag is important. Without it, the playbook would run against all nodes, which is unnecessary and time-consuming. With the limit, Ansible only processes the new node, installing the container runtime, kubelet, and kube-proxy, then joining it to the cluster.

After the playbook completes, verify that the new node has joined:

kubectl get nodes

You should see k8s-w3 in the Ready state. Pods will start scheduling on it automatically based on resource availability and any node selectors or affinities in your workloads.

Removing nodes requires more care. You cannot just yank a node out of the cluster. There might be pods running on it, including pods that are the only replicas of their deployment. If you remove the node without preparation, those workloads will be disrupted.

The correct procedure starts with draining the node. This tells Kubernetes to gracefully evict all pods from the node and mark it as unschedulable so no new pods land on it:

kubectl drain k8s-w1 --ignore-daemonsets --delete-emptydir-data

The ignore-daemonsets flag is necessary because DaemonSet pods run on every node by design and cannot be evicted in the normal sense. The delete-emptydir-data flag allows draining even if pods are using emptyDir volumes, acknowledging that data in those volumes will be lost.

Watch the drain operation:

kubectl get pods -A -o wide | grep k8s-w1

You should see pods terminating and being recreated on other nodes. Wait until no pods remain on the node except for DaemonSet pods.

Once the node is drained, you can remove it from the cluster using Kubespray:

ansible-playbook -i inventory/mycluster/inventory.ini remove-node.yml \
  -e node=k8s-w1

This playbook cleans up the node, removing Kubernetes components and resetting it to a pre-Kubernetes state. It also removes the node object from the Kubernetes API.

After removal, update your inventory file to remove the node entry. Keeping stale entries in your inventory will cause confusion and errors in future playbook runs.

If you are removing a control plane node or an etcd node, the process is more complex. You need to ensure you maintain quorum in etcd (never go below three nodes in a production etcd cluster), and you need to handle the control plane components carefully. Kubespray’s remove-node.yml handles most of this, but you should understand the implications. Removing a control plane node from a three-node HA setup leaves you with two nodes, which is actually worse than three for fault tolerance because you have lost redundancy without reducing the quorum requirement. Plan your control plane sizing carefully.
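The quorum arithmetic behind that statement is short enough to write down. In this Python sketch (function names chosen for the example), quorum is a strict majority of the members, and fault tolerance is how many members can fail while a majority still remains.

```python
def quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many members can be lost while quorum is still reachable."""
    return members - quorum(members)
```

With three members the quorum is two and one failure is tolerated; with two members the quorum is still two, so no failure is tolerated. That is why even member counts add cost without adding resilience.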

Backup and Recovery

etcd is the brain of your Kubernetes cluster. Every object you create, every configuration change you make, every secret you store ends up in etcd. If you lose etcd data without a backup, you lose your cluster. Full stop. You will have to rebuild everything from scratch.

This is not a theoretical concern. Disk failures happen. Corruption happens. Operator errors happen. Someone might accidentally delete critical data. Without backups, you are one bad day away from a very long night of rebuilding.

Creating an etcd backup is straightforward with etcdctl. The exact command depends on your etcd deployment, but for a Kubespray-deployed etcd using the host deployment type, it looks like this:

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \
  --endpoints=https://192.168.10.10:2379

Let me break down these options. The ETCDCTL_API=3 environment variable ensures you are using the v3 API, which is required for modern etcd. The snapshot save command creates a point-in-time snapshot of the entire etcd database. The certificate options authenticate you to etcd, which requires TLS client authentication. The endpoints option specifies which etcd member to connect to.

After creating a snapshot, verify its integrity:

ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db

This shows you the hash, revision, total keys, and total size of the snapshot. If the command succeeds, your snapshot is valid.

You should automate backups. Running manual backups occasionally is not sufficient for production. Here is a simple script that takes daily backups and retains them for seven days:

#!/bin/bash
# Daily etcd snapshot with seven-day retention
BACKUP_DIR="/backup/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
SNAPSHOT="$BACKUP_DIR/etcd-snapshot-$DATE.db"
mkdir -p "$BACKUP_DIR"
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \
  --endpoints=https://192.168.10.10:2379
if [ $? -eq 0 ]; then
  echo "Backup successful: $SNAPSHOT"
  # Delete snapshots last modified more than 7 days ago
  find "$BACKUP_DIR" -name "etcd-snapshot-*.db" -mtime +7 -delete
  echo "Old backups cleaned up"
else
  echo "Backup failed!"
  exit 1
fi

Save this as /usr/local/bin/etcd-backup.sh, make it executable with chmod +x, and set up a cron job or systemd timer to run it daily.
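Once snapshots are copied to remote storage, file modification times are often not preserved, so a retention rule based on the timestamp in the filename can be more dependable than one based on mtime. The Python sketch below shows that variant; it is an alternative to the find-based cleanup rather than what the script itself does, the function names are hypothetical, and it assumes the etcd-snapshot-YYYYmmdd-HHMMSS.db naming used above.

```python
from datetime import datetime, timedelta

PREFIX = "etcd-snapshot-"
SUFFIX = ".db"


def snapshot_time(filename):
    """Read the timestamp embedded in a snapshot filename."""
    stamp = filename[len(PREFIX):-len(SUFFIX)]
    return datetime.strptime(stamp, "%Y%m%d-%H%M%S")


def expired_snapshots(filenames, now, keep_days=7):
    """Return snapshot filenames whose embedded timestamp is older than the cutoff."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
            continue  # ignore files that are not snapshots
        if snapshot_time(name) < cutoff:
            expired.append(name)
    return sorted(expired)
```

The function only returns names; actually deleting them is left to the caller, which makes a dry run trivial.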

For a systemd timer approach, create two files. First, the service unit:

[Unit]
Description=etcd backup service
After=etcd.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcd-backup.sh
User=root

Save this as /etc/systemd/system/etcd-backup.service. Then create the timer unit:

[Unit]
Description=Daily etcd backup
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target

Save this as /etc/systemd/system/etcd-backup.timer. Enable and start the timer:

systemctl daemon-reload
systemctl enable etcd-backup.timer
systemctl start etcd-backup.timer

Verify the timer is active:

systemctl list-timers | grep etcd

Store your backups off-node. A backup sitting on the same disk as the data it protects is not really a backup. Copy snapshots to remote storage, whether that is an NFS share, an S3 bucket, or another server entirely. The specific mechanism depends on your infrastructure, but the principle is universal: backups must be stored separately from the primary data.

Now let us talk about recovery, which you hopefully will never need but absolutely must understand. Restoring from an etcd snapshot is a destructive operation. You are replacing the current etcd state with the state from the snapshot. Any changes made after the snapshot was taken will be lost.

The recovery procedure varies depending on whether you have a single etcd node or a multi-node cluster. For a single node, the process is:

First, stop etcd:

systemctl stop etcd

Back up the current data directory, just in case:

mv /var/lib/etcd /var/lib/etcd.broken

Restore the snapshot to a new data directory:

ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd

Set the correct ownership on the restored data:

chown -R etcd:etcd /var/lib/etcd

Start etcd:

systemctl start etcd

Verify etcd is healthy:

ETCDCTL_API=3 etcdctl endpoint health \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \
  --endpoints=https://192.168.10.10:2379

Then check that Kubernetes can communicate with etcd:

kubectl get nodes

If the nodes appear and the cluster responds, your recovery was successful.

For multi-node etcd clusters, the procedure is more involved because you need to restore each member and reconfigure them to form a new cluster with the same data. The etcdctl snapshot restore command has additional flags for this:

ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd \
  --name=k8s-ctr1 \
  --initial-cluster=k8s-ctr1=https://192.168.10.11:2380,k8s-ctr2=https://192.168.10.12:2380,k8s-ctr3=https://192.168.10.13:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://192.168.10.11:2380

Run this command on each etcd node, adjusting --name and --initial-advertise-peer-urls for that member. As in the single-node case, stop etcd and move the old data directory aside on every node first. Use the same --initial-cluster-token on all restored members, and make it different from the original cluster's token so the restored members cannot accidentally talk to any old members that are still running.
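Typing that command three times with slightly different values is an easy place to slip. Here is a minimal sketch that prints the restore command for each member from a single list. The member names and IPs are the example values from this chapter, and the token etcd-cluster-restored is a made-up placeholder; substitute your own.

```shell
#!/bin/sh
# MEMBERS holds "name=ip" pairs (example values from this chapter).
MEMBERS="k8s-ctr1=192.168.10.11 k8s-ctr2=192.168.10.12 k8s-ctr3=192.168.10.13"
SNAPSHOT="/backup/etcd-snapshot.db"
TOKEN="etcd-cluster-restored"   # hypothetical new token, identical on every member

# Build the --initial-cluster value that all members share.
initial_cluster() {
  out=""
  for m in $MEMBERS; do
    out="${out:+$out,}${m%%=*}=https://${m#*=}:2380"
  done
  printf '%s\n' "$out"
}

# Print the restore command for one member name; fail on an unknown name.
restore_cmd() {
  for m in $MEMBERS; do
    if [ "${m%%=*}" = "$1" ]; then
      printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --data-dir=/var/lib/etcd --name=%s --initial-cluster=%s --initial-cluster-token=%s --initial-advertise-peer-urls=https://%s:2380\n' \
        "$SNAPSHOT" "$1" "$(initial_cluster)" "$TOKEN" "${m#*=}"
      return 0
    fi
  done
  return 1
}

# Print one command per member; nothing is executed.
for m in $MEMBERS; do
  restore_cmd "${m%%=*}"
done
```

The script only prints commands. Review the output, then run each line on its matching node.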

After restoring all members, start etcd on each node and verify the cluster formed correctly:

ETCDCTL_API=3 etcdctl member list \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \
  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \
  --endpoints=https://192.168.10.11:2379

You should see all three members listed with their correct peer URLs.

Multi-node recovery is complex enough that I strongly recommend practicing it in a test environment before you ever need to do it in production. Set up a throwaway cluster, take a backup, intentionally break something, and practice the recovery. Do this until the procedure is second nature. When disaster strikes production at 3 AM, you do not want to be reading documentation for the first time.

One final note on backups: etcd snapshots capture the Kubernetes state, but they do not capture everything. Persistent volume data is not in etcd. Application data in databases running on your cluster is not in etcd. Those need their own backup strategies. etcd backup is necessary but not sufficient for complete disaster recovery. Plan accordingly.

Chapter 15: Air-Gap (Offline) Deployment

Not every Kubernetes cluster lives in an environment with internet access. Banks, government agencies, defense contractors, healthcare organizations, and many manufacturing facilities run their infrastructure in isolated networks. These air-gapped environments present a unique challenge: how do you deploy a Kubernetes cluster when you cannot reach Docker Hub, GitHub releases, or the official Kubernetes registry?

Kubespray handles this scenario, but it requires preparation. You need to stage all the binaries, container images, and packages inside your isolated network before running the deployment. This chapter walks through the entire process.

The fundamental problem with offline deployment

When Kubespray runs in a normal environment, it downloads dozens of components from the internet. The containerd binary comes from GitHub releases. The runc binary comes from the OpenContainers project. CNI plugins, etcd, kubelet, kubectl, kubeadm — all of these get pulled during the deployment process. On top of that, Kubernetes itself needs container images: the pause image, CoreDNS, kube-proxy, and whatever CNI plugin you chose.

In an air-gapped network, none of these downloads work. The deployment fails immediately when Ansible tries to fetch the first binary. You need to solve two separate problems: getting the binaries into the network, and getting the container images into the network.

What you need inside the air-gap

Before starting, you need infrastructure inside the isolated network:

A private container registry is mandatory. Harbor and Nexus are popular choices, but any OCI-compliant registry works. This registry will hold all the container images that Kubernetes needs. You need to pre-populate it with images before deployment.

An HTTP file server is necessary for binaries. Kubespray downloads binaries like containerd, runc, etcd, and the Kubernetes components from URLs. You need to host these files on an internal server and tell Kubespray where to find them.

An Ansible control node must exist inside the air-gapped network. You cannot run Ansible from outside and expect it to reach into the isolated environment. The machine running ansible-playbook needs network access to all the target nodes.

Figuring out what to download

Kubespray maintains a list of everything it needs in the variable files. The file roles/kubespray_defaults/defaults/main/download.yml contains over a thousand lines of download URLs and version mappings. You need to examine this file to understand what to stage.

For a specific version of Kubespray, you can generate the complete list of required files. The download role has a feature that helps with this:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  --tags download \
  -e download_run_once=true \
  -e download_localhost=true

This command runs only the download tasks and stores everything locally on the Ansible control node. In a connected environment, you would run this on a machine with internet access, then transfer the downloaded files into the air-gapped network.

The downloaded files end up in the directory specified by local_release_dir, which defaults to /tmp/releases. You will find binaries for containerd, runc, crictl, nerdctl, CNI plugins, etcd, and the Kubernetes components.

Staging the binaries

Once you have the binaries, you need to host them on an HTTP server inside your air-gapped network. The structure matters because Kubespray constructs download URLs based on patterns.

Create a directory structure on your internal web server:

/var/www/files/
├── kubernetes/
│   └── v1.32.0/
│       ├── kubeadm
│       ├── kubectl
│       └── kubelet
├── containerd/
│   └── v2.0.0/
│       └── containerd-2.0.0-linux-amd64.tar.gz
├── runc/
│   └── v1.2.0/
│       └── runc.amd64
├── cni-plugins/
│   └── v1.6.0/
│       └── cni-plugins-linux-amd64-v1.6.0.tgz
└── etcd/
    └── v3.5.15/
        └── etcd-v3.5.15-linux-amd64.tar.gz

Then configure Kubespray to use your internal server. In inventory/mycluster/group_vars/all/offline.yml, you override the download URLs:

files_repo: "http://files.internal.example.com"

kubeadm_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubeadm"
kubectl_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubectl"
kubelet_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubelet"

containerd_download_url: "{{ files_repo }}/containerd/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"

runc_download_url: "{{ files_repo }}/runc/v{{ runc_version }}/runc.{{ image_arch }}"

cni_download_url: "{{ files_repo }}/cni-plugins/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz"

etcd_download_url: "{{ files_repo }}/etcd/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
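Before pointing Kubespray at the file server, it helps to confirm that every file those URL patterns will request is actually there. The sketch below is a hypothetical helper, not part of Kubespray. The shell variable names are my own, and the versions are the example values from the directory tree above.

```shell
#!/bin/sh
# Example versions from the directory tree above; adjust to your inventory.
KUBE_VERSION="v1.32.0"; CONTAINERD_VERSION="2.0.0"; RUNC_VERSION="1.2.0"
CNI_VERSION="1.6.0"; ETCD_VERSION="3.5.15"; ARCH="amd64"

# Print the relative paths the overridden download URLs resolve to.
expected_files() {
  cat <<EOF
kubernetes/${KUBE_VERSION}/kubeadm
kubernetes/${KUBE_VERSION}/kubectl
kubernetes/${KUBE_VERSION}/kubelet
containerd/v${CONTAINERD_VERSION}/containerd-${CONTAINERD_VERSION}-linux-${ARCH}.tar.gz
runc/v${RUNC_VERSION}/runc.${ARCH}
cni-plugins/v${CNI_VERSION}/cni-plugins-linux-${ARCH}-v${CNI_VERSION}.tgz
etcd/v${ETCD_VERSION}/etcd-v${ETCD_VERSION}-linux-${ARCH}.tar.gz
EOF
}

# check_staged ROOT: list files missing under ROOT; succeed only if none are missing.
check_staged() {
  missing=0
  for f in $(expected_files); do
    if [ ! -f "$1/$f" ]; then
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage on the file server: check_staged /var/www/files
```

Running check_staged against the web root before the deployment turns a failure thirty minutes into the playbook into a failure in the first second.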

Handling container images

Container images are the trickier part. Kubernetes pulls images constantly during normal operation, not just during initial deployment. The pause image gets pulled for every pod. CoreDNS images, CNI plugin images, metrics-server images — they all need to come from somewhere.

The solution is a private registry that mirrors the public registries. You need to populate this registry with every image Kubespray deploys, then configure containerd to use your registry instead of the public ones.

First, identify all required images. Listing the download tasks shows which image-related steps would run for your configuration:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  --tags download \
  -e download_container=true \
  -e skip_downloads=false \
  --list-tasks 2>&1 | grep -i image

For a more complete list, check the download.yml defaults file. The images typically include:

registry.k8s.io/pause:3.10
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/kube-proxy:v1.32.0
registry.k8s.io/metrics-server/metrics-server:v0.7.2
quay.io/coreos/flannel:v0.26.1
docker.io/flannel/flannel-cni-plugin:v1.6.0

The exact list depends on your configuration choices. If you use Calico instead of Flannel, you need Calico images. If you enable the Nginx ingress controller, you need those images too.

Populating the private registry

On a machine with internet access, pull all the required images, retag them for your private registry, and push them. This is tedious but straightforward:

# Pull from public registry
docker pull registry.k8s.io/pause:3.10

# Tag for your private registry
docker tag registry.k8s.io/pause:3.10 registry.internal.example.com/pause:3.10

# Push to your private registry
docker push registry.internal.example.com/pause:3.10

For dozens of images, you want a script. Here is a simple approach:

#!/bin/bash

PRIVATE_REGISTRY="registry.internal.example.com"

IMAGES=(
  "registry.k8s.io/pause:3.10"
  "registry.k8s.io/coredns/coredns:v1.11.3"
  "registry.k8s.io/kube-proxy:v1.32.0"
  "quay.io/coreos/flannel:v0.26.1"
  "docker.io/flannel/flannel-cni-plugin:v1.6.0"
)

for IMAGE in "${IMAGES[@]}"; do
  # Strip only the registry host; keep the repository path (e.g. coredns/coredns)
  # so the mirrored name matches what containerd will request
  NAME="${IMAGE#*/}"

  docker pull "$IMAGE"
  docker tag "$IMAGE" "${PRIVATE_REGISTRY}/${NAME}"
  docker push "${PRIVATE_REGISTRY}/${NAME}"
done

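This leaves you with a populated registry. If the connected machine cannot reach the private registry directly, export the pulled images to a tarball with docker save, carry the tarball into the air-gapped network, then docker load the images and push them from a machine on the inside.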

Configuring containerd to use your registry

Containerd needs to know where to find images. The modern approach uses the registry configuration directory at /etc/containerd/certs.d/. Kubespray can set this up automatically if you configure the right variables.

In inventory/mycluster/group_vars/all/containerd.yml, configure registry mirrors:

containerd_registries_mirrors:
  - prefix: registry.k8s.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
  - prefix: docker.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
  - prefix: quay.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false

This configuration tells containerd to redirect all image pulls to your internal registry. When Kubernetes tries to pull registry.k8s.io/pause:3.10, containerd intercepts that request and fetches from registry.internal.example.com/pause:3.10 instead.

If your internal registry uses a self-signed certificate, you have two options. You can set skip_verify: true, which is insecure but works. The better approach is to distribute your CA certificate to all nodes and reference it:

containerd_registries_mirrors:
  - prefix: registry.k8s.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        ca_file: /etc/containerd/certs.d/ca.crt

You need to ensure the CA certificate file exists on all nodes before containerd starts. Kubespray can copy this file using the containerd role if you configure it properly, or you can handle it in your pre-deployment automation.

The resulting directory structure on each node looks like this:

/etc/containerd/certs.d/
├── registry.k8s.io/
│   └── hosts.toml
├── docker.io/
│   └── hosts.toml
└── quay.io/
    └── hosts.toml

Each hosts.toml file contains the mirror configuration:

server = "https://registry.k8s.io"

[host."https://registry.internal.example.com"]
  capabilities = ["pull", "resolve"]
  ca = "/etc/containerd/certs.d/ca.crt"
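Kubespray renders these files for you, but if you ever need to reproduce the layout by hand (for example on a test node), a small generator is enough. This is a sketch under that assumption; the function names are hypothetical, and the output mirrors the hosts.toml shown above rather than every option containerd supports.

```shell
#!/bin/sh
# render_hosts_toml UPSTREAM MIRROR CA_PATH: print a hosts.toml for one upstream registry.
render_hosts_toml() {
  cat <<EOF
server = "https://$1"

[host."$2"]
  capabilities = ["pull", "resolve"]
  ca = "$3"
EOF
}

# write_mirror_tree ROOT MIRROR CA_PATH UPSTREAM...: create ROOT/<upstream>/hosts.toml
# for every upstream registry listed.
write_mirror_tree() {
  root=$1; mirror=$2; ca=$3
  shift 3
  for upstream in "$@"; do
    mkdir -p "$root/$upstream"
    render_hosts_toml "$upstream" "$mirror" "$ca" > "$root/$upstream/hosts.toml"
  done
}

# Usage (writes into a scratch directory, not /etc):
# write_mirror_tree ./certs.d https://registry.internal.example.com \
#   /etc/containerd/certs.d/ca.crt registry.k8s.io docker.io quay.io
```

Writing into a scratch directory first lets you diff the result against what Kubespray generated on a real node.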

Overriding the sandbox image

One image deserves special attention: the pause image. Kubernetes uses this image for every pod sandbox. The default location is registry.k8s.io/pause:3.10, but in an air-gapped environment, this pull fails.

You must override the sandbox image location in your Kubespray configuration. In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml:

kube_image_repo: "registry.internal.example.com"

Or more specifically for just the pause image:

pod_infra_image_repo: "registry.internal.example.com/pause"
pod_infra_image_tag: "3.10"

Kubespray constructs the full pause image path from these variables. Getting this wrong causes every pod creation to fail, so verify it carefully.
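One way to verify it is to read the value back from the rendered containerd configuration on a node. The sketch below assumes the CRI plugin section of /etc/containerd/config.toml exposes a sandbox_image key; the helper names are hypothetical.

```shell
#!/bin/sh
# sandbox_image CONFIG: print the sandbox_image value found in a containerd config.toml.
sandbox_image() {
  sed -n 's/^[[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*"\(.*\)".*$/\1/p' "$1" | head -n 1
}

# sandbox_is_internal CONFIG REGISTRY: succeed when the sandbox image lives under REGISTRY.
sandbox_is_internal() {
  case "$(sandbox_image "$1")" in
    "$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a node:
# sandbox_is_internal /etc/containerd/config.toml registry.internal.example.com \
#   || echo "pause image still points outside the air-gap"
```

If the check fails, fix the inventory variables and re-run the container-engine role rather than editing config.toml by hand, since the next run would overwrite a manual change.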

Running the offline deployment

With binaries staged on your internal HTTP server and images populated in your private registry, you can run the deployment. The command looks almost identical to a normal deployment:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  -e download_run_once=false \
  -e download_localhost=false

The download_run_once=false flag tells Kubespray to download binaries on each node rather than downloading once and distributing. In an offline setup with a local file server, this is fine because the downloads hit your internal server.

If you pre-staged all binaries directly on the nodes (perhaps using a separate Ansible playbook or image), you can skip downloads entirely:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \
  -e skip_downloads=true

This assumes the binaries already exist in the expected locations on each node.

Handling package manager dependencies

Kubespray installs some components through the operating system package manager. On RHEL-based systems, this means yum or dnf. On Debian-based systems, apt. In an air-gapped environment, these package repositories are unreachable.

You have a few options. The cleanest is to set up an internal mirror of the required repositories. Tools like reposync for RHEL or apt-mirror for Debian can create local copies of public repositories. Then configure your nodes to point at the internal mirror.

On Rocky Linux, for example, you would modify /etc/yum.repos.d/ files to point at your internal server:

[baseos]
name=Rocky Linux $releasever - BaseOS
baseurl=http://repo.internal.example.com/rocky/$releasever/BaseOS/$basearch/os/
gpgcheck=1
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-Rocky-9

Alternatively, you can pre-install all required packages on a golden image and use that image for your Kubernetes nodes. Kubespray tries to install packages like conntrack, socat, and other utilities. If they are already present, the installation tasks succeed without reaching external repositories.

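A third option is to limit what Kubespray changes on nodes you have already prepared. For example, the following variable stops Kubespray from rewriting the hostname; note that it does not, by itself, skip package installation:

override_system_hostname: false

Relying on hand-prepared nodes is risky because you might miss required dependencies. The safer approach is proper repository mirroring.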

Testing the offline deployment

Before deploying to production, test your offline setup thoroughly. A common mistake is missing one image or one binary. The deployment proceeds for thirty minutes, then fails when it cannot pull a specific component.

Run the deployment in check mode first:

ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --check

This does not catch all issues because some tasks behave differently in check mode, but it catches obvious configuration problems.

Then run a real deployment against a test environment. Watch the logs carefully. Any “connection refused” or “timeout” errors indicate something reaching out to the internet. Any “image pull failed” errors indicate missing images in your registry.

After deployment, verify that no external connections occurred. On each node, check the containerd logs:

journalctl -u containerd | grep -i "registry.k8s.io\|docker.io\|quay.io"

If you see successful pulls to external registries, your mirror configuration is wrong. The pulls should go to your internal registry.

Maintaining an air-gapped cluster

Initial deployment is only half the battle. You need to maintain the cluster over time, which means upgrades, new workloads, and security patches.

For upgrades, repeat the staging process. Download the new versions of binaries and images on a connected machine, transfer them into the air-gap, update your internal servers, then run the Kubespray upgrade playbook.

For new workloads, every container image your applications need must exist in your private registry. This requires a process: developers request images, someone pulls them on a connected machine, transfers them in, and pushes to the internal registry. Without this workflow, deployments fail.

Security patches for the underlying operating system require repository updates. Your internal mirror needs regular synchronization with upstream repositories, done through a controlled process that transfers packages into the air-gapped network.

None of this is automatic. Air-gapped environments trade convenience for isolation. The security benefit comes with an operational cost that you must plan for.

Common pitfalls

The most frequent mistake is forgetting one image. Kubespray deploys successfully, but later a DaemonSet fails to start because its image is missing from the private registry. Always verify your image list against the actual deployment configuration.

Certificate issues cause endless frustration. If your private registry uses TLS with a certificate signed by an internal CA, that CA must be trusted by containerd on every node. Missing CA distribution causes “x509: certificate signed by unknown authority” errors.

DNS resolution inside the air-gap requires attention. Your nodes need to resolve the internal registry hostname. If DNS is not available, use IP addresses or populate /etc/hosts.

Binary architecture mismatches occur when you download amd64 binaries but deploy to arm64 nodes, or vice versa. Check the architecture of your target nodes and download the correct binaries.

Version mismatches between staged files and Kubespray configuration cause subtle failures. If you stage containerd 2.0.0 but Kubespray expects 2.0.1, the checksum verification fails. Keep your staged files synchronized with the version variables in your inventory.
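The architecture pitfall in particular is cheap to catch up front. Release file names use amd64 and arm64, while uname -m on the node reports x86_64 and aarch64. Here is a small sketch that maps one to the other and checks a staged file name; both helper names are hypothetical.

```shell
#!/bin/sh
# to_image_arch MACHINE: map `uname -m` output to the arch suffix used in release file names.
to_image_arch() {
  case "$1" in
    x86_64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported: $1" >&2; return 1 ;;
  esac
}

# staged_arch_matches FILE MACHINE: succeed when FILE's name carries the node's arch.
staged_arch_matches() {
  arch=$(to_image_arch "$2") || return 1
  case "$1" in
    *"$arch"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a target node:
# staged_arch_matches containerd-2.0.0-linux-amd64.tar.gz "$(uname -m)" || echo "wrong arch"
```

This only inspects file names, so it catches the common case of staging the wrong tarball but not a mislabeled binary.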

A working offline.yml example

Here is a more complete example of the offline configuration file that ties everything together:

# inventory/mycluster/group_vars/all/offline.yml

# Internal file server for binaries
files_repo: "http://files.internal.example.com"

# Kubernetes binaries
kubeadm_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubeadm"
kubectl_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubectl"
kubelet_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubelet"
# Container runtime binaries
containerd_download_url: "{{ files_repo }}/containerd/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
runc_download_url: "{{ files_repo }}/runc/v{{ runc_version }}/runc.{{ image_arch }}"
nerdctl_download_url: "{{ files_repo }}/nerdctl/v{{ nerdctl_version }}/nerdctl-{{ nerdctl_version }}-linux-{{ image_arch }}.tar.gz"
crictl_download_url: "{{ files_repo }}/crictl/v{{ crictl_version }}/crictl-v{{ crictl_version }}-linux-{{ image_arch }}.tar.gz"
# CNI plugins
cni_download_url: "{{ files_repo }}/cni-plugins/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz"
# etcd
etcd_download_url: "{{ files_repo }}/etcd/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
# Helm (if enabled)
helm_download_url: "{{ files_repo }}/helm/v{{ helm_version }}/helm-v{{ helm_version }}-linux-{{ image_arch }}.tar.gz"
# Internal container registry
kube_image_repo: "registry.internal.example.com"
gcr_image_repo: "registry.internal.example.com"
docker_image_repo: "registry.internal.example.com"
quay_image_repo: "registry.internal.example.com"

And the corresponding containerd configuration:

# inventory/mycluster/group_vars/all/containerd.yml

containerd_registries_mirrors:
  - prefix: registry.k8s.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt
        
  - prefix: docker.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt
        
  - prefix: quay.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt
        
  - prefix: ghcr.io
    mirrors:
      - host: https://registry.internal.example.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt

This configuration assumes you have distributed your internal CA certificate to /etc/pki/ca-trust/source/anchors/ on all nodes and run update-ca-trust to install it.

The effort required for air-gapped deployment is substantial, but it works. Organizations running Kubernetes in isolated networks do this every day. The key is methodical preparation: identify every external dependency, stage it internally, and verify the configuration before deployment.

Chapter 16: Kubespray Execution Flow — The Complete Picture

When you run ansible-playbook cluster.yml, you’re not just executing a single script. You’re kicking off a carefully orchestrated sequence of 15 PLAYs containing roughly 559 TASKs that transform bare Linux machines into a production-ready Kubernetes cluster. Understanding this flow isn’t just academic curiosity — it’s essential knowledge for troubleshooting failures, optimizing deployment times, and knowing exactly what changed on your systems.

Let me walk you through the entire journey, from the moment you hit enter to when your cluster becomes ready.

The Starting Point

Everything begins with cluster.yml in the Kubespray root directory. This file doesn’t contain much actual logic — it’s primarily an orchestrator that imports other playbooks and assigns roles to host groups. If you open it up, you’ll see something like this:

- name: Check Ansible version
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Check minimal Ansible version
      assert:
        that:
          - ansible_version.full is version(minimal_ansible_version, '>=')
        msg: "Ansible version must be >= {{ minimal_ansible_version }}"

- name: Add kube-master nodes to kube_control_plane
  hosts: kube-master
  gather_facts: false
  tasks:
    - name: Add nodes to kube_control_plane group
      group_by:
        key: kube_control_plane

- name: Gather facts
  import_playbook: facts.yml

- name: Prepare for etcd install
  hosts: "{{ etcd_hosts | default('etcd') }}"
  gather_facts: false
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray_defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: container-engine, tags: container-engine, when: deploy_container_engine }
    - { role: download, tags: download, when: "not skip_downloads" }

Notice how every play that runs roles starts with the kubespray_defaults role. This isn’t optional decoration: it loads hundreds of default variables that subsequent roles depend on. Without it, you’d get undefined variable errors everywhere.

Play 1: Ansible Version Verification

The very first thing Kubespray does is check whether your Ansible version is compatible. This happens on localhost, not on any remote nodes. The check is simple but critical:

- name: Check minimal Ansible version
  assert:
    that:
      - ansible_version.full is version('2.14.0', '>=')
      - ansible_version.full is version('2.18.0', '<')
    msg: "Ansible version must be >= 2.14.0 and < 2.18.0"

If you’re running Ansible 2.12 or Ansible 2.19, the playbook stops right here. I’ve seen people waste hours troubleshooting weird failures only to discover they had an incompatible Ansible version. This check saves you from that.
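If you want the same answer before launching the playbook (say, in a CI job or a wrapper script), the range check is easy to reproduce in shell. This is a rough sketch, not Kubespray code: the function names are hypothetical, the bounds are copied from the assert above, and it relies on sort -V being available.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Uses `sort -V` (version sort) from GNU coreutils or BusyBox.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# in_supported_range VERSION MIN MAX: succeed when MIN <= VERSION < MAX.
in_supported_range() {
  version_ge "$1" "$2" && ! version_ge "$1" "$3"
}

# Try a few versions against the bounds from the assert.
for v in 2.12.0 2.14.0 2.17.5 2.18.0 2.19.0; do
  if in_supported_range "$v" 2.14.0 2.18.0; then
    echo "$v: ok"
  else
    echo "$v: rejected"
  fi
done
```

The upper bound is exclusive, matching the '<' comparison in the assert, so 2.18.0 itself is rejected.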

Play 2: Inventory Validation

Before touching any remote machine, Kubespray validates your inventory structure. It checks things like:

Do you have at least one node in kube_control_plane?
Is your etcd node count an odd number (1, 3, or 5)?
Are there any duplicate hostnames?

The etcd check is particularly important:

- name: Check that etcd hosts are odd
  assert:
    that:
      - groups['etcd'] | length is odd
    msg: "etcd hosts must be an odd number (1, 3, or 5)"
  when: etcd_cluster_setup | default(true)

This isn’t Kubespray being pedantic. etcd uses Raft consensus, which requires a majority (quorum) to operate. With 3 nodes, you can lose 1 and still function. With 4 nodes, you can still only lose 1 — so that extra node bought you nothing but more complexity. Stick to odd numbers.
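The arithmetic behind that is short enough to check yourself. Quorum is a strict majority, floor(n/2) + 1, and the number of failures you can survive is whatever is left over. A quick sketch (the function names are mine):

```shell
#!/bin/sh
# quorum N: smallest majority of N members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while a majority remains.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum from 2 to 3 while the tolerated failures stay at 1, which is exactly why the even-sized cluster gains nothing.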

Play 3: Bootstrap OS

Now we start touching remote machines. The bootstrap-os role prepares the operating system for Kubernetes installation. This is where all those “prerequisites” you’d normally configure manually get handled automatically.

First, it ensures Python is available (Ansible needs it):

- name: Install python3
  raw: dnf install -y python3 python3-dnf
  when: ansible_os_family == "RedHat"
  changed_when: false

Then it loads kernel modules required for container networking:

- name: Load kernel modules
  modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - overlay
    - br_netfilter

The overlay module enables OverlayFS, which containerd uses for efficient container image layer management. The br_netfilter module allows iptables to see bridged traffic — without this, your pod networking won’t work properly.

Next come the sysctl parameters:

- name: Set sysctl parameters
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  loop:
    - { name: 'net.bridge.bridge-nf-call-iptables', value: '1' }
    - { name: 'net.bridge.bridge-nf-call-ip6tables', value: '1' }
    - { name: 'net.ipv4.ip_forward', value: '1' }

That ip_forward setting is crucial. Without it, your node can’t route packets between pods. I’ve debugged many “pods can’t reach each other” issues that came down to this single parameter being 0.
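When you suspect this on a running node, you can read the values straight from /proc/sys without Ansible. Here is a minimal sketch with hypothetical helper names. The root directory is a parameter only so the check can be tried against a scratch directory.

```shell
#!/bin/sh
# sysctl_file ROOT NAME: translate a dotted sysctl name into its file path under ROOT.
sysctl_file() {
  printf '%s/%s\n' "$1" "$(printf '%s' "$2" | tr '.' '/')"
}

# check_sysctl ROOT NAME EXPECTED: succeed when the value matches, else explain why not.
check_sysctl() {
  f=$(sysctl_file "$1" "$2")
  [ -r "$f" ] || { echo "$2: not present (module not loaded?)"; return 1; }
  actual=$(cat "$f")
  [ "$actual" = "$3" ] || { echo "$2: expected $3, got $actual"; return 1; }
}

# Usage on a node:
# check_sysctl /proc/sys net.ipv4.ip_forward 1
# check_sysctl /proc/sys net.bridge.bridge-nf-call-iptables 1
```

A "not present" result for the bridge keys usually means br_netfilter is not loaded, since those entries only appear once the module is in the kernel.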

Swap gets disabled too:

- name: Disable swap
  command: swapoff -a
  when: ansible_swaptotal_mb > 0

- name: Remove swap from fstab
  lineinfile:
    path: /etc/fstab
    regexp: '^([^#].*\s+swap\s+.*)$'
    line: '# \1'
    backrefs: yes

Kubernetes historically required swap to be off because the scheduler’s resource calculations assumed no swap. While newer versions support swap (with feature gates), Kubespray still disables it by default for compatibility.
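To see what the fstab edit does before it runs on a real machine, you can apply the same pattern to a copy of the file. The sketch below is a shell equivalent of the lineinfile regexp, assuming a sed that supports -E; it filters stdin and never touches /etc/fstab itself.

```shell
#!/bin/sh
# comment_swap: read fstab text on stdin and comment out active swap entries.
# Lines that already start with '#' are left untouched.
comment_swap() {
  sed -E 's/^([^#].*[[:space:]]+swap[[:space:]]+.*)$/# \1/'
}

# Usage (preview only):
# comment_swap < /etc/fstab | diff /etc/fstab -
```

Piping the result into diff shows exactly which lines would change.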

Play 4: Gather Facts

Ansible’s fact gathering collects detailed information about each node — IP addresses, CPU count, memory, disk space, OS version, network interfaces. Kubespray uses this data extensively:

- name: Gather facts
  hosts: all
  gather_facts: true
  tasks:
    - name: Gather minimal facts
      setup:
        gather_subset:
          - '!all'
          - '!min'
          - 'network'
          - 'hardware'
          - 'virtual'

The gathered facts get cached to /tmp (as configured in ansible.cfg):

gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400

This caching means subsequent plays don’t need to re-gather facts, significantly speeding up the deployment. The 86400-second timeout (24 hours) means you can re-run the playbook multiple times during initial setup without the overhead of fresh fact gathering each time.

Play 5: Container Engine Installation

This is where things get interesting. The container-engine role installs containerd, runc, and CNI plugins. Let me break down the sequence.

First, runc gets downloaded and installed:

- name: Download runc binary
  get_url:
    url: "https://github.com/opencontainers/runc/releases/download/v{{ runc_version }}/runc.{{ image_arch }}"
    dest: "{{ local_release_dir }}/runc-{{ runc_version }}"
    mode: '0755'
    checksum: "sha256:{{ runc_checksums[image_arch] }}"

- name: Copy runc to bin directory
  copy:
    src: "{{ local_release_dir }}/runc-{{ runc_version }}"
    dest: "{{ bin_dir }}/runc"
    mode: '0755'
    remote_src: yes

Notice the checksum verification. Kubespray maintains checksums for every binary it downloads in roles/kubespray_defaults/vars/main/checksums.yml. This prevents both download corruption and supply chain attacks.
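The same check is worth running by hand on anything you stage yourself. A minimal sketch, assuming sha256sum from coreutils is installed; the function name is hypothetical and the expected hash would come from the checksums file or the upstream release page.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX: succeed only when FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# Usage:
# verify_sha256 /tmp/releases/runc-1.2.0 "<hash from checksums.yml>" || echo "checksum mismatch"
```

A mismatch means either a corrupted transfer or a file that does not match the version Kubespray expects, and both are reasons to stop before deploying.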

CNI plugins come next:

- name: Download CNI plugins
  get_url:
    url: "https://github.com/containernetworking/plugins/releases/download/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz"
    dest: "{{ local_release_dir }}/cni-plugins-{{ cni_version }}.tgz"
    checksum: "sha256:{{ cni_binary_checksums[image_arch] }}"

- name: Extract CNI plugins
  unarchive:
    src: "{{ local_release_dir }}/cni-plugins-{{ cni_version }}.tgz"
    dest: /opt/cni/bin
    remote_src: yes

The CNI plugins land in /opt/cni/bin, which is the standard location that kubelet and containerd expect. You’ll find binaries like bridge, host-local, loopback, portmap there after installation.

Now containerd itself:

- name: Download containerd
  get_url:
    url: "https://github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
    dest: "{{ local_release_dir }}/containerd-{{ containerd_version }}.tar.gz"
    checksum: "sha256:{{ containerd_checksums[image_arch] }}"

- name: Extract containerd
  unarchive:
    src: "{{ local_release_dir }}/containerd-{{ containerd_version }}.tar.gz"
    dest: /usr/local
    remote_src: yes

The containerd tarball extracts to /usr/local, putting binaries in /usr/local/bin. This includes containerd itself, containerd-shim-runc-v2, and ctr (the containerd CLI).

The configuration file gets generated from a template:

- name: Generate containerd config
  template:
    src: config.toml.j2
    dest: /etc/containerd/config.toml
    mode: '0644'
  notify: restart containerd

That config.toml is critical. Here’s what a typical generated configuration looks like:

version = 2

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.10"
  
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "runc"
    
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
        
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"

That SystemdCgroup = true setting is vital. It tells containerd to use systemd’s cgroup driver instead of cgroupfs. This must match what kubelet uses, and since modern systemd-based distributions (basically everything now) work better with the systemd driver, this is the correct choice.

Finally, the systemd service gets installed and started:

- name: Install containerd systemd service
  template:
    src: containerd.service.j2
    dest: /etc/systemd/system/containerd.service
  notify:
    - reload systemd
    - restart containerd

- name: Enable and start containerd
  systemd:
    name: containerd
    enabled: yes
    state: started
    daemon_reload: yes

Play 6: Download Binaries and Images

The download role is one of the longer-running parts of deployment. It fetches all the binaries and container images needed for the cluster:

- name: Download kubeadm
  get_url:
    url: "https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubeadm"
    dest: "{{ local_release_dir }}/kubeadm-{{ kube_version }}"
    checksum: "sha256:{{ kubeadm_checksums[image_arch] }}"

- name: Download kubelet
  get_url:
    url: "https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubelet"
    dest: "{{ local_release_dir }}/kubelet-{{ kube_version }}"
    checksum: "sha256:{{ kubelet_checksums[image_arch] }}"

- name: Download kubectl
  get_url:
    url: "https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubectl"
    dest: "{{ local_release_dir }}/kubectl-{{ kube_version }}"
    checksum: "sha256:{{ kubectl_checksums[image_arch] }}"

Container images get pulled to each node:

- name: Pull required images
  command: "{{ bin_dir }}/crictl pull {{ item }}"
  loop:
    - "registry.k8s.io/pause:3.9"
    - "registry.k8s.io/kube-apiserver:v{{ kube_version }}"
    - "registry.k8s.io/kube-controller-manager:v{{ kube_version }}"
    - "registry.k8s.io/kube-scheduler:v{{ kube_version }}"
    - "registry.k8s.io/kube-proxy:v{{ kube_version }}"
    - "registry.k8s.io/coredns/coredns:v{{ coredns_version }}"
  when: inventory_hostname in groups['kube_control_plane']

Kubespray optimizes this by downloading once and distributing, controlled by the download_run_once variable. In large clusters, this prevents hammering external registries with duplicate requests from every node.

Play 7: etcd Installation

If you’re using the default etcd_deployment_type: host, etcd gets installed as a systemd service rather than a Kubernetes static pod. This gives you independent etcd management — you can restart etcd without affecting the kubelet, and etcd logs go to journald rather than container logs.

The etcd role starts by creating the etcd user and directories:

- name: Create etcd user
  user:
    name: etcd
    shell: /sbin/nologin
    system: yes
    create_home: no

- name: Create etcd directories
  file:
    path: "{{ item }}"
    state: directory
    owner: etcd
    group: etcd
    mode: '0700'
  loop:
    - /var/lib/etcd
    - /etc/ssl/etcd/ssl

Certificate generation is the complex part. etcd uses mutual TLS for all communications, meaning both client and server authenticate each other with certificates. Kubespray generates several certificate types:

# CA certificate - signs all other etcd certificates
/etc/ssl/etcd/ssl/ca.pem
/etc/ssl/etcd/ssl/ca-key.pem

# Member certificate - used by etcd servers for peer communication
/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem
/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem

# Admin certificate - used by etcdctl for management
/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem
/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem

# Node certificate - used by kube-apiserver to connect to etcd
/etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem
/etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem

The certificate generation uses openssl commands:

- name: Generate etcd member certificate
  command: >
    openssl req -new -key {{ etcd_cert_dir }}/member-{{ inventory_hostname }}-key.pem
    -out {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.csr
    -subj "/CN=etcd-member-{{ inventory_hostname }}"
    -config {{ etcd_cert_dir }}/openssl.conf

- name: Sign etcd member certificate
  command: >
    openssl x509 -req -in {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.csr
    -CA {{ etcd_cert_dir }}/ca.pem
    -CAkey {{ etcd_cert_dir }}/ca-key.pem
    -CAcreateserial
    -out {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.pem
    -days 3650
    -extensions v3_req
    -extfile {{ etcd_cert_dir }}/openssl.conf

The etcd environment file contains all the configuration:

# /etc/etcd.env
ETCD_NAME={{ inventory_hostname }}
ETCD_DATA_DIR=/var/lib/etcd

ETCD_LISTEN_PEER_URLS=https://{{ etcd_address }}:2380
ETCD_LISTEN_CLIENT_URLS=https://{{ etcd_address }}:2379,https://127.0.0.1:2379
ETCD_ADVERTISE_CLIENT_URLS=https://{{ etcd_address }}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://{{ etcd_address }}:2380

ETCD_INITIAL_CLUSTER={{ etcd_initial_cluster }}
ETCD_INITIAL_CLUSTER_STATE={{ etcd_cluster_state }}
ETCD_INITIAL_CLUSTER_TOKEN=k8s-etcd-cluster

# TLS configuration
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CLIENT_CERT_AUTH=true

ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem
ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CLIENT_CERT_AUTH=true
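The ETCD_INITIAL_CLUSTER value follows etcd's name=peerURL format, with members separated by commas. Here is a small sketch of how such a string could be assembled from an inventory. The host names and addresses are placeholders, and the exact value Kubespray renders into etcd_initial_cluster is an assumption.

```python
def initial_cluster(members: dict, peer_port: int = 2380) -> str:
    """Build an etcd initial-cluster string from a {member_name: ip} mapping.

    Each member name should match the ETCD_NAME configured on that node.
    """
    return ",".join(
        f"{name}=https://{ip}:{peer_port}" for name, ip in members.items()
    )


# Placeholder inventory, for illustration only.
EXAMPLE_MEMBERS = {"k8s-node1": "10.0.0.1", "k8s-node2": "10.0.0.2"}
```

Every member must be started with the same string, otherwise the nodes will not agree on the initial cluster membership.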

The systemd service file:

# /etc/systemd/system/etcd.service
[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
User=etcd
Type=notify
EnvironmentFile=/etc/etcd.env
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

After starting etcd, Kubespray verifies the cluster is healthy:

- name: Wait for etcd cluster to be healthy
  command: >
    {{ bin_dir }}/etcdctl endpoint health
    --endpoints=https://{{ etcd_address }}:2379
    --cacert={{ etcd_cert_dir }}/ca.pem
    --cert={{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem
    --key={{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem
  register: etcd_health
  until: etcd_health.rc == 0
  retries: 10
  delay: 5
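The until/retries/delay combination is a plain poll-until-success loop. A rough Python analogue is sketched below. It mirrors the pattern only, not Ansible's exact attempt accounting, and the check callable stands in for the etcdctl call.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it returns True and report how many attempts were used.

    Raises TimeoutError if every attempt fails. The sleep function is
    injectable so the loop can be tested without real waiting.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)  # no need to wait after the final failure
    raise TimeoutError(f"check did not pass after {attempts} attempts")
```

In practice check would wrap something like a subprocess call to etcdctl endpoint health and return True when the exit code is 0.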

Play 8: Kubernetes Node Components

The kubernetes/node role installs components needed on ALL nodes — both control plane and workers. This includes kubelet, kubectl, and kubeadm.

Binary installation:

- name: Install kubelet
  copy:
    src: "{{ local_release_dir }}/kubelet-{{ kube_version }}"
    dest: "{{ bin_dir }}/kubelet"
    mode: '0755'
    remote_src: yes

- name: Install kubectl
  copy:
    src: "{{ local_release_dir }}/kubectl-{{ kube_version }}"
    dest: "{{ bin_dir }}/kubectl"
    mode: '0755'
    remote_src: yes

- name: Install kubeadm
  copy:
    src: "{{ local_release_dir }}/kubeadm-{{ kube_version }}"
    dest: "{{ bin_dir }}/kubeadm"
    mode: '0755'
    remote_src: yes

The kubelet configuration gets generated:

# /etc/kubernetes/kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
clusterDNS:
  - {{ kube_dns_service_ip }}
clusterDomain: {{ dns_domain }}
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
maxPods: {{ kubelet_max_pods }}
podCIDR: {{ kube_pods_subnet }}
resolvConf: /etc/resolv.conf
rotateCertificates: true
serverTLSBootstrap: true

The kubelet systemd service:

# /etc/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
ExecStart={{ bin_dir }}/kubelet \
  --config=/etc/kubernetes/kubelet-config.yaml \
  --kubeconfig=/etc/kubernetes/kubelet.conf \
  --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
  --hostname-override={{ kube_override_hostname | default(inventory_hostname) }}
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

At this point, kubelet is installed but not yet started on most nodes. It needs the cluster to be initialized first.

Play 9: Control Plane Initialization

This is where the cluster actually comes to life. The kubernetes/control-plane role runs kubeadm init on the first control plane node.

Kubespray generates a kubeadm configuration file:

# /etc/kubernetes/kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: {{ ip | default(ansible_default_ipv4.address) }}
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  name: {{ inventory_hostname }}
  kubeletExtraArgs:
    node-ip: {{ ip | default(ansible_default_ipv4.address) }}
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  certSANs:
    - kubernetes
    - kubernetes.default
    - kubernetes.default.svc
    - kubernetes.default.svc.{{ dns_domain }}
    - localhost
    - 127.0.0.1
    - {{ ip | default(ansible_default_ipv4.address) }}
  extraArgs:
    etcd-cafile: /etc/ssl/etcd/ssl/ca.pem
    etcd-certfile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem
    etcd-keyfile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem
    etcd-servers: {{ etcd_endpoints }}
certificatesDir: /etc/kubernetes/pki
clusterName: {{ cluster_name }}
controlPlaneEndpoint: {{ kube_apiserver_endpoint }}
etcd:
  external:
    endpoints:
      {% for host in groups['etcd'] %}
      - https://{{ hostvars[host]['ip'] }}:2379
      {% endfor %}
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem
    keyFile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem
imageRepository: {{ kube_image_repo }}
kubernetesVersion: v{{ kube_version }}
networking:
  dnsDomain: {{ dns_domain }}
  podSubnet: {{ kube_pods_subnet }}
  serviceSubnet: {{ kube_service_addresses }}

Notice the etcd.external section. Because Kubespray deployed etcd separately (with etcd_deployment_type: host), kubeadm doesn’t manage etcd — it just connects to the existing external cluster.

The actual initialization:

- name: Initialize first control plane node
  command: >
    {{ bin_dir }}/kubeadm init
    --config=/etc/kubernetes/kubeadm-config.yaml
    --upload-certs
    --skip-phases=addon/coredns
  register: kubeadm_init
  when: inventory_hostname == groups['kube_control_plane'][0]

The --skip-phases=addon/coredns flag is interesting. Kubespray installs CoreDNS separately with its own configuration rather than using kubeadm’s default CoreDNS deployment. This gives more control over DNS configuration.

After initialization, kubeadm creates static pod manifests in /etc/kubernetes/manifests/:

/etc/kubernetes/manifests/
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
└── kube-scheduler.yaml

The kubelet watches this directory and automatically starts these pods. Within seconds, your API server should be responding.

For additional control plane nodes (in HA setups), Kubespray runs kubeadm join with the --control-plane flag:

- name: Join additional control plane nodes
  command: >
    {{ bin_dir }}/kubeadm join {{ kubeadm_join_endpoint }}
    --token {{ kubeadm_token }}
    --discovery-token-ca-cert-hash sha256:{{ kubeadm_ca_hash }}
    --control-plane
    --certificate-key {{ kubeadm_certificate_key }}
  when: 
    - inventory_hostname in groups['kube_control_plane']
    - inventory_hostname != groups['kube_control_plane'][0]

Play 10: Worker Node Join

Worker nodes join the cluster using kubeadm join without the --control-plane flag:

- name: Join worker nodes to cluster
  command: >
    {{ bin_dir }}/kubeadm join {{ kubeadm_join_endpoint }}
    --token {{ kubeadm_token }}
    --discovery-token-ca-cert-hash sha256:{{ kubeadm_ca_hash }}
  when: inventory_hostname in groups['kube_node']
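The when conditions on the init and join tasks amount to a three-way decision per host. The sketch below spells that decision out. The function and parameter names are hypothetical, and only flags already shown in the tasks above are used.

```python
from typing import List, Optional


def join_command(
    host: str,
    control_plane: List[str],
    workers: List[str],
    endpoint: str,
    token: str,
    ca_hash: str,
    certificate_key: str,
) -> Optional[List[str]]:
    """Return the kubeadm join arguments for a host, or None if it should not join."""
    if control_plane and host == control_plane[0]:
        return None  # the first control plane node runs kubeadm init instead
    base = [
        "kubeadm", "join", endpoint,
        "--token", token,
        "--discovery-token-ca-cert-hash", f"sha256:{ca_hash}",
    ]
    if host in control_plane:
        # Membership in the control plane group wins if a host is in both groups.
        return base + ["--control-plane", "--certificate-key", certificate_key]
    if host in workers:
        return base
    return None  # host is not part of this cluster
```

Returning an argument list rather than a shell string keeps the token and hash out of shell quoting problems if the result is later passed to subprocess.run.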

After joining, kubelet on each worker starts and registers the node with the API server. You can verify with:

kubectl get nodes
NAME       STATUS     ROLES           AGE   VERSION
k8s-ctr    NotReady   control-plane   2m    v1.32.0
k8s-w1     NotReady   <none>          1m    v1.32.0
k8s-w2     NotReady   <none>          1m    v1.32.0

The nodes show NotReady because there’s no CNI plugin yet. Pods can’t get IP addresses, so the node isn’t considered ready for workloads.
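If you want to script this check, the table output can be scanned for nodes whose status is not Ready. The following is a small sketch that works on captured text. The sample output is illustrative, and parsing the human-readable table is fragile compared with kubectl get nodes -o json.

```python
from typing import List

SAMPLE_OUTPUT = """\
NAME       STATUS     ROLES           AGE   VERSION
k8s-ctr    NotReady   control-plane   2m    v1.32.0
k8s-w1     NotReady   <none>          1m    v1.32.0
k8s-w2     Ready      <none>          1m    v1.32.0
"""


def not_ready_nodes(kubectl_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column does not start with Ready."""
    names = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        # STATUS may carry extra conditions, e.g. "Ready,SchedulingDisabled".
        if fields[1].split(",")[0] != "Ready":
            names.append(fields[0])
    return names
```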

Play 11: CNI Plugin Installation

The network_plugin role installs your chosen CNI. Let’s look at Flannel as an example since that’s what the lab environment uses.

Kubespray generates a Flannel configuration:

- name: Create Flannel configmap
  template:
    src: flannel-config.yaml.j2
    dest: /etc/kubernetes/flannel/flannel-config.yaml

The template produces something like:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "{{ kube_pods_subnet }}",
      "Backend": {
        "Type": "{{ flannel_backend_type }}"
      }
    }
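The net-conf.json part is the piece that varies between clusters: it carries the pod subnet and the backend type. A sketch of rendering and validating it outside of Jinja might look like this. The default backend of vxlan and the example subnet in the usage note are assumptions for illustration.

```python
import ipaddress
import json


def render_net_conf(pods_subnet: str, backend_type: str = "vxlan") -> str:
    """Render Flannel's net-conf.json, rejecting an invalid pod CIDR early."""
    # ip_network raises ValueError for malformed CIDRs or set host bits.
    network = ipaddress.ip_network(pods_subnet)
    document = {"Network": str(network), "Backend": {"Type": backend_type}}
    return json.dumps(document, indent=2)
```

For example, render_net_conf("10.233.64.0/18") produces a document with that network and a vxlan backend, while a typo such as "10.233.64.1/18" fails immediately instead of surfacing later as a crashing Flannel pod.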

The DaemonSet gets applied:

- name: Apply Flannel manifests
  command: "{{ bin_dir }}/kubectl apply -f /etc/kubernetes/flannel/"
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

Once Flannel pods are running on each node, they configure the CNI and nodes become Ready:

kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
k8s-ctr    Ready    control-plane   5m    v1.32.0
k8s-w1     Ready    <none>          4m    v1.32.0
k8s-w2     Ready    <none>          4m    v1.32.0

If you’re using Calico instead, the process is similar but with more components — Calico deploys a DaemonSet for the node agent plus a Deployment for the Typha component (in larger clusters) and potentially a BGP route reflector.

Play 12: Kubernetes Apps (Addons)

The kubernetes-apps role installs cluster addons. CoreDNS is the critical one:

- name: Apply CoreDNS manifests
  command: "{{ bin_dir }}/kubectl apply -f /etc/kubernetes/coredns/"
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

The CoreDNS configuration includes the Corefile:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes {{ dns_domain }} in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
          max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

If metrics_server_enabled is true, Metrics Server gets deployed:

- name: Apply Metrics Server manifests
  command: "{{ bin_dir }}/kubectl apply -f /etc/kubernetes/metrics-server/"
  when: metrics_server_enabled

Similarly for other addons like Helm, Ingress controllers, or cert-manager.

Play 13: Node Labels and Taints

After everything is running, Kubespray applies any configured node labels and taints:

- name: Apply node labels
  command: >
    {{ bin_dir }}/kubectl label node {{ inventory_hostname }}
    {{ item.key }}={{ item.value }}
    --overwrite
  loop: "{{ node_labels | dict2items }}"
  when: node_labels is defined

- name: Apply node taints
  command: >
    {{ bin_dir }}/kubectl taint node {{ inventory_hostname }}
    {{ item }}
    --overwrite
  loop: "{{ node_taints }}"
  when: node_taints is defined

Control plane nodes get the standard taint by default:

node-role.kubernetes.io/control-plane:NoSchedule

This prevents regular workloads from running on control plane nodes. In a single-node cluster, Kubespray removes this taint so pods can actually schedule somewhere.
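The label and taint tasks above each expand into one kubectl call per item. As a rough illustration of that expansion (the helper names are hypothetical, and the label in the usage note is only an example):

```python
from typing import Dict, List


def label_commands(node: str, labels: Dict[str, str]) -> List[List[str]]:
    """One kubectl label call per key/value pair, like looping over dict2items."""
    return [
        ["kubectl", "label", "node", node, f"{key}={value}", "--overwrite"]
        for key, value in labels.items()
    ]


def taint_commands(node: str, taints: List[str]) -> List[List[str]]:
    """One kubectl taint call per taint string, e.g. key:Effect or key=value:Effect."""
    return [
        ["kubectl", "taint", "node", node, taint, "--overwrite"]
        for taint in taints
    ]
```

Calling taint_commands("k8s-ctr", ["node-role.kubernetes.io/control-plane:NoSchedule"]) yields a single command that applies the standard control plane taint.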

Play 14: DNS Configuration Finalization

The final play cleans up DNS configuration on all nodes:

- name: Update resolv.conf
  template:
    src: resolv.conf.j2
    dest: /etc/resolv.conf
  when: 
    - dns_mode != 'none'
    - resolvconf_mode == 'host_resolvconf'

This ensures nodes can resolve cluster DNS names if needed.

The Complete Timeline

From start to finish, here’s approximately what happens during a typical single-node deployment:

0:00  - Start playbook
0:01  - Ansible version check, inventory validation
0:02  - Bootstrap OS begins (kernel modules, sysctl)
0:03  - Fact gathering
0:04  - Container engine installation starts
0:06  - Runc, containerd installed and running
0:07  - Binary downloads begin (kubeadm, kubelet, kubectl)
0:10  - Image pulls begin
0:15  - etcd certificate generation
0:16  - etcd installation and startup
0:17  - etcd health check passes
0:18  - Kubernetes node components installed
0:19  - kubeadm init runs
0:20  - Control plane static pods starting
0:21  - API server responding
0:22  - CNI plugin deployment
0:23  - Nodes become Ready
0:24  - CoreDNS deployment
0:25  - Addon deployment
0:26  - Final DNS configuration
0:27  - Playbook complete

A single-node cluster typically takes 25–30 minutes. A three-node HA cluster might take 35–45 minutes. The bulk of the time is usually spent downloading binaries and container images.

Watching the Execution

When you run the playbook, you’ll see output like this:

PLAY [Check Ansible version] ***************************************************

TASK [Check minimal Ansible version] *******************************************
ok: [localhost]

PLAY [Add kube-master nodes to kube_control_plane] *****************************

TASK [Add nodes to kube_control_plane group] ***********************************
ok: [k8s-ctr]

PLAY [Gather facts] ************************************************************

TASK [Gathering Facts] *********************************************************
ok: [k8s-ctr]

PLAY [Prepare for etcd install] ************************************************

TASK [kubespray_defaults : Load defaults] **************************************
ok: [k8s-ctr]

TASK [kubernetes/preinstall : Load kernel modules] *****************************
changed: [k8s-ctr] => (item=overlay)
changed: [k8s-ctr] => (item=br_netfilter)

...

PLAY RECAP *********************************************************************
k8s-ctr    : ok=487  changed=112  unreachable=0  failed=0  skipped=847
localhost  : ok=4    changed=0    unreachable=0  failed=0  skipped=0

That PLAY RECAP tells you what happened on each host. On k8s-ctr, 487 tasks completed successfully, 112 of those made changes, 847 were skipped (because they didn’t apply to this configuration), and, critically, zero failed.

The skipped count is high because Kubespray supports many configurations — Calico, Cilium, Docker, multiple cloud providers. Tasks for features you’re not using get skipped.
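When the playbook runs from CI, it can be handy to turn the recap into numbers and fail the job on anything unexpected. Here is a minimal sketch that parses recap lines in the format shown above. The function names are illustrative.

```python
import re
from typing import Dict, List, Tuple


def parse_recap_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Split 'host : ok=1 changed=0 ...' into the host name and a counter dict."""
    host, _, stats = line.partition(":")
    counts = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", stats)}
    return host.strip(), counts


def run_succeeded(recap_lines: List[str]) -> bool:
    """A run is healthy only if no host reports failed or unreachable tasks."""
    for line in recap_lines:
        _, counts = parse_recap_line(line)
        if counts.get("failed", 0) or counts.get("unreachable", 0):
            return False
    return True
```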

If something fails, you’ll see:

TASK [etcd : Wait for etcd cluster to be healthy] ******************************
FAILED - RETRYING: Wait for etcd cluster to be healthy (4 retries left).
FAILED - RETRYING: Wait for etcd cluster to be healthy (3 retries left).
FAILED - RETRYING: Wait for etcd cluster to be healthy (2 retries left).
FAILED - RETRYING: Wait for etcd cluster to be healthy (1 retries left).
fatal: [k8s-ctr]: FAILED! => {"attempts": 4, "changed": false, "cmd": ["etcdctl", "endpoint", "health"...

The task name tells you exactly where it failed. In this case, etcd didn’t become healthy. You’d then check etcd logs:

journalctl -u etcd -f

One of the best things about Kubespray is that you can fix the issue and re-run the playbook. Ansible still walks through every task, but idempotent tasks that are already in the desired state make no changes (they’ll show “ok” instead of “changed”), so the run effectively picks up where it left off.

Understanding the flow — what happens in what order and why — makes troubleshooting dramatically easier. When you know that CNI comes after kubeadm init, you won’t waste time debugging pod networking on a cluster that hasn’t finished initializing yet.

From Cluster Operations to Service Lifecycle Management

The Kubernetes infrastructure ran on a conventional setup. An Ansible server sat at the center, establishing passwordless SSH authentication to every node in the clusters. A typical four-node cluster looked like this:

Ansible Server
      │
      ├── Passwordless Auth ──► k8s-node1 (Master + Node + etcd)
      ├── Passwordless Auth ──► k8s-node2 (Master + Node + etcd)
      ├── Passwordless Auth ──► k8s-node3 (Node)
      └── Passwordless Auth ──► k8s-node4 (Node)

The first two nodes pulled double duty as both masters and workers while also running etcd. The remaining nodes served purely as workers. Underneath all of this sat the Infrastructure Layer, with the Container Layer running on top of it.

Kubespray handled provisioning and configuration management across these machines.

The Problems with Kubespray

Managing cluster nodes through Kubespray introduced several pain points that compounded over time. Here is a snippet of what a typical Kubespray playbook looks like:

roles:
  - { role: kubespray-defaults }
  - { role: bastion-ssh-config, tags: ["localhost", "bastion"] }

roles:
  - { role: kubespray-defaults }
  - { role: bootstrap-os, tags: bootstrap-os }

roles:
  - { role: kubespray-defaults }
  - role: etcd

roles:
  - { role: kubespray-defaults, when: "not skip_downloads and download_run_once and not download_localhost" }
  - { role: kubernetes/preinstall, tags: preinstall, when: "not skip_downloads and download_run_once and not download_localhost" }
  - { role: download, tags: download, when: "not skip_downloads and download_run_once and not download_localhost" }

roles:
  - { role: kubespray-defaults }
  - { role: kubernetes/preinstall, tags: preinstall }
  - { role: container-engine, tags: "container-engine", when: deploy_container_engine }
  - { role: download, tags: download, when: "not skip_downloads" }
  - role: etcd

roles:
  - { role: kubespray-defaults }
  - { role: kubernetes/node, tags: node }

roles:
  - { role: kubespray-defaults }

roles:
  - { role: kubespray-defaults }
  - { role: kubernetes/kubeadm, tags: kubeadm }
  - { role: kubernetes/node-label, tags: node-label }
  - { role: network_plugin, tags: network }

roles:
  - { role: kubespray-defaults }
  - { role: kubernetes/preinstall, when: "dns_mode != 'none' and resolvconf_mode == 'host_resolvconf'", tags: resolvconf, dns_late: true }

Every role has conditions, tags, and dependencies. When something breaks, tracing through this maze takes time.

The Node Exclusion Workflow

Whenever a node needed to be removed from the cluster for maintenance, the process turned into a coordination dance between two teams:

┌─────────────────────────────────────────────────────────────────────┐
│                         PRE-WORK (Infra Team)                       │
├─────────────────────────────────────────────────────────────────────┤
│  Node Exclusion Request                                             │
│         │                                                           │
│         ▼                                                           │
│  Select Exclusion Group                                             │
│         │                                                           │
│         ▼                                                           │
│  Perform Upgrade                                                    │
│         │                                                           │
│         ▼                                                           │
│  Node Exclusion                                                     │
│         │                                                           │
│         ▼                                                           │
│  Hand Over Node Information ─────────────────────────────────────┐  │
└─────────────────────────────────────────────────────────────────────┘
                                                                   │
┌─────────────────────────────────────────────────────────────────────┐
│                       MAIN WORK (DevOps Team)                    │◄─┘
├─────────────────────────────────────────────────────────────────────┤
│  Analysis & Remediation                                             │
│         │                                                           │
│         ▼                                                           │
│  Node Join (takes ~1 hour)                                          │
│         │                                                           │
│         ▼                                                           │
│  System Check ──────► Normal ──────► Done                           │
│         │                                                           │
│         ▼                                                           │
│      Abnormal                                                       │
│         │                                                           │
│         └──────► Loop back to Analysis                              │
└─────────────────────────────────────────────────────────────────────┘

The Infra team handled the pre-work: accepting the exclusion request, selecting the right group, performing the upgrade, actually excluding the node, then handing off node information. The DevOps team then analyzed any issues, attempted to join the node back (which alone took about an hour), ran system checks, and if anything came back abnormal, looped back through analysis and remediation.

The Kubernetes Version Upgrade Problem

Kubernetes enforces a strict policy: upgrades can only happen one minor version at a time. Running v1.17 and needing to reach v1.20 means stepping through each intermediate version:

v1.17 ──► v1.18 ──► v1.19 ──► v1.20
(current)                    (target)

Attempting to skip versions fails:

v1.17 ────────────X────────────► v1.20
       (not allowed)

Sequential upgrades are mandatory:

v1.17 ──✓──► v1.18 ──✓──► v1.19 ──✓──► v1.20
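The one-minor-version-at-a-time rule can be expressed as a tiny helper that lists every intermediate stop. This is a sketch for planning purposes only; the function name is made up, and it looks only at the major and minor parts of the version.

```python
from typing import List


def upgrade_path(current: str, target: str) -> List[str]:
    """List each minor version to step through, ending at the target."""
    cur_major, cur_minor = (int(part) for part in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are not handled by this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current version")
    return [f"v{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

The length of the returned list is the number of separate upgrade runs required.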

Time Costs with Kubespray

Using Kubespray, each operation had predictable but painful time costs:

Operation            Time Required
Node Join            30 minutes per node group
OS/Kernel Upgrade    30 minutes per node group
Kubernetes Upgrade   2 hours per version

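Simple operations took at least 30 minutes. Multi-version upgrades were far worse: at 2 hours per minor version, the three-step jump from v1.17 to v1.20 alone takes about 6 hours, and longer jumps could consume up to 12. The question became: how can this be reduced?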

Evaluating Amazon EKS

Managed Kubernetes offerings like Amazon EKS have clear appeal:

┌─────────────────────────────────────────────────────────────────┐
│                         AWS Cloud                               │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    EKS VPC                              │    │
│  │  ┌──────────────────────────────────────────────────┐   │    │
│  │  │         EKS Managed Control Plane                │   │    │
│  │  │                                                  │   │    │
│  │  │    ┌─────┐    ┌─────┐    ┌─────┐                │   │    │
│  │  │    │etcd │    │ api │    │     │                │   │    │
│  │  │    └─────┘    └─────┘    └─────┘                │   │    │
│  │  │      AZ-1       AZ-2       AZ-3                 │   │    │
│  │  └──────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                              │                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Customer VPC                          │    │
│  │  ┌────────────────────┐  ┌────────────────────┐         │    │
│  │  │ Managed Node Group │  │ Managed Node Group │         │    │
│  │  │    EKS Add-ons     │  │    EKS Add-ons     │         │    │
│  │  └────────────────────┘  └────────────────────┘         │    │
│  │           EKS Managed Data Plane                        │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

AWS manages the control plane (etcd, API server) in their own VPC. Workloads run in managed node groups within the customer VPC. EKS add-ons handle integration, and multi-AZ distribution comes without managing the complexity directly.

The Requirements

Before choosing a solution, three requirements guided the decision:

Automated node joining without DevOps engineer involvement — Nodes should join clusters without someone running playbooks manually every time.
Simple architecture that is easy to debug — When things go wrong, tracing problems should be quick without wading through layers of abstraction.
GitOps-capable configuration — All cluster configuration should live in Git, enabling version control, code review, and automated deployment.

Building a Custom Bootstrap System

Rather than adopting a fully managed solution, a custom bootstrap system addressed these specific requirements:

┌────────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster                         │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐                           │
│   │  node   │  │  node   │  │  node   │   (existing workers)      │
│   └─────────┘  └─────────┘  └─────────┘                           │
│                                                                    │
│   ┌────────────────────────────────────────────────────────────┐  │
│   │                 public namespace                           │  │
│   │   ┌─────┐  ┌─────────┐  ┌───────┐                          │  │
│   │   │ api │  │ cronjob │  │ sched │                          │  │
│   │   └─────┘  └─────────┘  └───────┘                          │  │
│   └────────────────────────────────────────────────────────────┘  │
│                                                                    │
│   ┌────────────────────────────────────────────────────────────┐  │
│   │            master (control plane)                          │  │
│   │                                                            │  │
│   │   create-k8s-material                                      │  │
│   │          │                                                 │  │
│   │          ▼                                                 │  │
│   │   kubernetes-bootstrap.datacenter1.example.com             │  │
│   │          │                                                 │  │
│   │          ▼                                                 │  │
│   │   kubernetes-material (ConfigMap)                          │  │
│   │                                                            │  │
│   └────────────────────────────────────────────────────────────┘  │
│                                                                    │
│   ┌────────────────────────────────────────────────────────────┐  │
│   │         kubernetes-bootstrap namespace                     │  │
│   │                                                            │  │
│   │   /kubernetes    /store                                    │  │
│   │                                                            │  │
│   └────────────────────────────────────────────────────────────┘  │
│                                                                    │
│   ┌─────────────────────────────────────────────────────────────┐ │
│   │        New Node Group               │  ◄── pulls bootstrap  │ │
│   │   ┌──────┐  ┌──────┐  ┌──────┐      │      materials and    │ │
│   │   │ node │  │ node │  │ node │      │      joins auto       │ │
│   │   └──────┘  └──────┘  └──────┘      │                       │ │
│   └─────────────────────────────────────────────────────────────┘ │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

A CronJob periodically generates the Kubernetes materials needed for nodes to join: certificates, tokens, and configuration files. These materials get stored in a ConfigMap within the kubernetes-bootstrap namespace. The bootstrap endpoint is accessible at a predictable URL like kubernetes-bootstrap.datacenter1.example.com.

When a new node group needs to join, it pulls the necessary bootstrap materials from this endpoint and completes the join process automatically. No DevOps engineer needs to intervene.

Node Naming Convention

A naming convention encodes cluster membership directly in the hostname:

kube-{ClusterA}-{DataCenter}-worker01  =  {DataCenter}.{ClusterA}.example.com

For example, a node named kube-serviceA-dc1-worker01 automatically maps to dc1.serviceA.example.com. The bootstrap process determines which cluster a node should join based purely on its hostname. No manual configuration required.
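The hostname-to-cluster mapping can be sketched in a few lines. This version assumes that cluster and data center names never contain hyphens and that example.com is the base domain. The pattern and function name are illustrative, not the actual bootstrap code.

```python
import re

# kube-{cluster}-{datacenter}-{anything else, e.g. worker01 or gpu-worker01}
HOSTNAME_PATTERN = re.compile(r"^kube-(?P<cluster>[^-]+)-(?P<datacenter>[^-]+)-.+$")


def cluster_domain(hostname: str, base_domain: str = "example.com") -> str:
    """Derive the cluster's domain from a node hostname."""
    match = HOSTNAME_PATTERN.match(hostname)
    if match is None:
        raise ValueError(f"hostname does not follow the naming convention: {hostname}")
    return f"{match['datacenter']}.{match['cluster']}.{base_domain}"
```

Rejecting hostnames that do not match, rather than guessing, keeps a misnamed machine from silently joining the wrong cluster.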

File System Structure

The bootstrap materials are organized by Kubernetes version:

- v1.21.6
  - etc
    - NetworkManager/conf.d
    - calico/certs
    - cni/net.d
    - containerd
    - kubernetes
    - nginx
    - systemd/system

- v1.25.6
  - etc
    - NetworkManager/conf.d
    - cni/net.d
    - containerd
    - kubernetes
    - nginx
    - systemd/system

Each version directory contains the complete set of configuration files needed for that Kubernetes release. Note that v1.21.6 includes calico/certs while v1.25.6 does not — different versions may have different requirements. When a node bootstraps, it pulls the appropriate version’s files based on its configured target.

Cluster Configuration

A central configuration file defines cluster-wide settings and per-node overrides:

common:
  cluster: 1.25.6
  runcVersion: 1.x.x
  criVersion: 1.x.x
  cniPluginVersion: 1.x.x
  cudaVersion: "12.2.1_535.86.10_linux"
  clusterDNS: "8.8.8.8"

master:
  k8s-master-dc2:
    address: x.x.x.x

worker:
  kube-serviceA-dc1-gpu-worker01:
    gpu: true
  previous-version-node:
    version:
      kubernetes: 1.21.6

The common section sets defaults for the entire cluster: target Kubernetes version, versions for runc, CRI, CNI plugins, and CUDA drivers (for GPU nodes), plus the cluster DNS server.

The master section lists master nodes with their addresses.

The worker section allows per-node overrides. A GPU worker can be flagged with gpu: true to ensure CUDA drivers get installed. A node that needs to stay on an older Kubernetes version can specify that explicitly—see previous-version-node which overrides the cluster default of 1.25.6 with 1.21.6.
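One way to picture how the defaults and overrides combine is a small merge function. The sketch below is an assumption about the resolution logic, not the original implementation: the config is shown as a Python dict (as it might look after loading the YAML), and effective_settings and materials_dir are hypothetical names.

```python
# A subset of the configuration above, as it might look after loading the YAML.
CONFIG = {
    "common": {"cluster": "1.25.6", "clusterDNS": "8.8.8.8"},
    "worker": {
        "kube-serviceA-dc1-gpu-worker01": {"gpu": True},
        "previous-version-node": {"version": {"kubernetes": "1.21.6"}},
    },
}


def effective_settings(config: dict, node: str) -> dict:
    """Merge the cluster-wide defaults with one worker's overrides."""
    common = config["common"]
    override = config.get("worker", {}).get(node) or {}
    # A per-node version wins over the cluster default.
    version = override.get("version", {}).get("kubernetes", common["cluster"])
    return {
        "kubernetes": version,
        "gpu": bool(override.get("gpu", False)),
        "clusterDNS": common["clusterDNS"],
        # Bootstrap materials are organized per version, e.g. "v1.25.6".
        "materials_dir": f"v{version}",
    }


print(effective_settings(CONFIG, "previous-version-node"))
```

A node with no entry in the worker section simply inherits every default.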

Automated Notifications

The system sends Slack notifications whenever a node join succeeds or fails. A successful join looks like:

Node A                                                    10:10 AM
─────────────────────────────────────────────────────────────────────
[Cluster A] Successfully joined the cluster as a service node.
kubelet: v1.25.6

When something goes wrong, the failure notification includes the actual error output:

Node B                                                    10:10 AM
─────────────────────────────────────────────────────────────────────
[Cluster A] Failed to join the cluster.

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR CRI]: container runtime is not running: output: 
  time="2024-04-02T00:14:27+09:00" level=fatal msg="unable to determine 
  runtime API version: rpc error: code = Unavailable desc = connection 
  error: desc = \"transport: Error while dialing dial unix 
  /var/run/containerd/containerd.sock: connect: no such file or directory\""
, error: exit status 1

[preflight] If you know what you are doing, you can make a check 
non-fatal with --ignore-preflight-errors=...

To see the stack trace of this error execute with --v=5 or higher

Engineers can diagnose the problem directly from the notification without SSH-ing into the machine. In this example, containerd is not running — the socket file does not exist.

Results from the New Bootstrap System

Operation            Before               After                Improvement
Node Join            30 min/node group    0 min/node group     100%
OS/Kernel Upgrade    30 min/node group    2 min/node group     94%
Kubernetes Upgrade   2 hr/version         30 min/version       75%

Node joining went from 30 minutes to zero because it happens automatically without human intervention. OS and kernel upgrades dropped from 30 minutes to about 2 minutes — nodes upgrade, reboot, and rejoin on their own. Kubernetes version upgrades went from 2 hours to 30 minutes per version.

Every time a new service came online, a predictable set of tasks followed:

Development and production clusters each needed their own deployment pipelines
Different communication interfaces (HTTP, gRPC, Kafka) required different deployment configurations
Services written in different languages needed different build settings

The combinatorial explosion meant a lot of pipeline creation work for each new service.

A pipeline-as-code approach puts deployment configuration in each service repository. A service at git@git.example.com:example/ includes a pipeline definition file:

name: serviceA
repository:
  name: serviceA
  branch: develop
pipeline:
  type: service
  platform: linux/arm64
build:
  environments:
    JAVA_HOME: /opt/jdk-21
  batch:
    enabled: false
  docker:
    dockerfile: Dockerfile
resources:
  alpha:
    instance: 1
    cpu: '0.05'
    memory: 1024MB

This configuration captures everything needed to build and deploy the service:

repository: which repo and branch to track
pipeline: service type and target platform architecture
build.environments: build-time environment variables like JAVA_HOME
build.batch.enabled: whether batch processing is enabled
build.docker.dockerfile: which Dockerfile to use
resources: resource allocations for different environments (alpha shown here)

The UI reads these files directly and generates appropriate pipelines without manual configuration. Developers define their service requirements; the system handles the rest.

The deployment story had grown complicated. Service developers would request tag updates for batch jobs. DevOps engineers would manually update tags. A service might have multiple versions deployed simultaneously:

Service A
    │
    ├── Version 1
    ├── Version 2
    └── Version 3

Each version required its own tag management, and coordination between developers and DevOps created delays.

Two main deployment strategies were supported:

Blue-Green Deployment: Zero-downtime releases with instant rollback capability. The new version (green) runs alongside the old version (blue). Once validated, traffic switches entirely to green. If something goes wrong, switching back to blue is immediate.

Canary Deployment: Gradual rollouts that test new versions with a subset of traffic before full deployment. Maybe 5% of traffic goes to the new version initially. If metrics look good, the percentage increases until the new version handles all traffic.

One problem that emerged was lag between canary and blue-green environments. A canary deployment might run successfully, but the corresponding blue-green deployment would lag behind by days or even weeks. This created inconsistency in what code was actually serving production traffic.

An automated system detects and alerts on these situations:

vivasystem                                                10:10 AM
─────────────────────────────────────────────────────────────────────
Lag detected between canary and blue-green deployments.

Service A - 12 day lag  @kim-A
Deployment owners: @kim-B, @kim-C, @kim-D
[Blue-green page]  [Trigger blue-green]

Service B - 12 day lag  @kim-B
Deployment owners: @kim-E, @kim-F, @kim-G
[Blue-green page]  [Trigger blue-green]

Service C - 12 day lag  @kim-C
Deployment owners: @kim-H, @kim-I
[Blue-green page]  [Trigger blue-green]

Service D - 12 day lag  @kim-D
Deployment owners: @kim-J, @kim-K, @kim-L
[Blue-green page]  [Trigger blue-green]

Service E - 12 day lag  @kim-E
Deployment owners: @kim-M, @kim-N, @kim-O
[Blue-green page]  [Trigger blue-green]

Service F - 12 day lag  @kim-F
Deployment owners: @kim-P
[Blue-green page]  [Trigger blue-green]

Service G - 12 day lag  @kim-G
Deployment owners: @kim-Q, @kim-R, @kim-S
[Blue-green page]  [Trigger blue-green]

Service H - 12 day lag  @kim-H
Deployment owners: @kim-T, @kim-U
[Blue-green page]  [Trigger blue-green]

Each entry shows the service name, how many days it has lagged, who originally deployed the canary, and the deployment owners responsible for the service. Direct links let teams jump to the blue-green page or trigger the deployment immediately.
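The lag check itself can be quite small. The sketch below assumes lag is measured as the number of days between the latest canary deployment and the latest blue-green deployment, and that anything at or above a threshold is reported; the function names, the 7-day threshold, and the sample dates are hypothetical.

```python
from datetime import date


def deployment_lag_days(canary_deployed: date, blue_green_deployed: date) -> int:
    """Days the blue-green environment trails the canary (0 if it has caught up)."""
    return max((canary_deployed - blue_green_deployed).days, 0)


def lagging_services(deployments: dict, threshold_days: int = 7) -> dict:
    """Return {service: lag_days} for services at or above the threshold."""
    lags = {
        name: deployment_lag_days(d["canary"], d["blue_green"])
        for name, d in deployments.items()
    }
    return {name: lag for name, lag in lags.items() if lag >= threshold_days}


deployments = {
    "Service A": {"canary": date(2024, 4, 14), "blue_green": date(2024, 4, 2)},
    "Service B": {"canary": date(2024, 4, 14), "blue_green": date(2024, 4, 13)},
}
for name, lag in lagging_services(deployments).items():
    print(f"{name} - {lag} day lag")
```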

Over time, services accumulate. Some get deprecated, others get replaced, and some just stop receiving traffic. Without active cleanup, these zombie services continue consuming cluster resources.

A systematic approach distinguishes services that are candidates for deletion from those that are still active.

Services communicate in different ways, so multiple detection methods are needed:

┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│   ┌──────────────────┐                                            │
│   │  Consumer Group  │──┐                                         │
│   └──────────────────┘  │                                         │
│                         │                                         │
│   ┌──────────────────┐  │     ┌─────────┐     ┌─────────────┐    │
│   │  Consumer Group  │──┼────►│  batch  │────►│ API Server  │    │
│   │  (not deleted)   │  │     └─────────┘     └─────────────┘    │
│   └──────────────────┘  │                            │            │
│                         │                            ▼            │
│   ┌──────────────────┐  │                     ┌───────────┐       │
│   │  Consumer Group  │──┘                     │   Kafka   │       │
│   └──────────────────┘                        └───────────┘       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Consumer groups pull messages from Kafka, trigger batch processing, and make API calls. Any of these communication paths could indicate an active service.

For HTTP-based services, the Istio service mesh handles service-to-service communication:

┌────────────────────────────────────────────────────────────────────┐
│                        Istio Mesh                                  │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Ingress        ┌─────────────────────────────────┐    Egress     │
│  traffic ──────►│         Data Plane              │───► traffic   │
│                 │                                 │                │
│                 │  ┌───────────┐  ┌───────────┐  │                │
│                 │  │ Service A │  │ Service B │  │                │
│                 │  │  ┌─────┐  │  │  ┌─────┐  │  │                │
│                 │  │  │Proxy│  │  │  │Proxy│  │  │                │
│                 │  │  └─────┘  │  │  └─────┘  │  │                │
│                 │  └───────────┘  └───────────┘  │                │
│                 │         │              │       │                │
│                 │         └──────┬───────┘       │                │
│                 │                │               │                │
│                 │         Mesh traffic           │                │
│                 └─────────────────────────────────┘                │
│                                  │                                 │
│  ┌───────────────────────────────────────────────────────────────┐│
│  │                      Control Plane                            ││
│  │  ┌─────────────────────────────────────────────────────────┐  ││
│  │  │                       istiod                            │  ││
│  │  │                                                         │  ││
│  │  │   Discovery    Configuration    Certificates           │  ││
│  │  └─────────────────────────────────────────────────────────┘  ││
│  └───────────────────────────────────────────────────────────────┘│
└────────────────────────────────────────────────────────────────────┘

Each service has an Envoy sidecar proxy that handles mesh traffic. The control plane (istiod) manages discovery, configuration, and certificates. Istio collects metrics on all traffic flowing through the mesh.

Aggregating the istio_requests_total metric over a 7-day window shows which services are still receiving traffic:

istio_requests_total aggregated for 7 days
    │
    ├── Metric present ──► Service is receiving traffic
    │
    └── No metric ──► Service has no HTTP communication history

Any service showing zero requests during that period gets flagged as potentially unused.
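As a rough sketch of that check: the PromQL string below is an assumption about how the 7-day aggregation could be written (destination_service_name is a standard Istio label, but the query the original system used is not shown), and services_without_http_traffic is a hypothetical helper that works on the query result.

```python
# Hypothetical query for the 7-day aggregation; not taken from the original system.
REQUESTS_7D_QUERY = (
    "sum by (destination_service_name) (increase(istio_requests_total[7d]))"
)


def services_without_http_traffic(deployed: list, totals_7d: dict) -> list:
    """Flag services with no metric at all, or a total of zero, over the window.

    totals_7d maps service name -> request count, as a query like
    REQUESTS_7D_QUERY would return once run against Prometheus.
    """
    return sorted(name for name in deployed if totals_7d.get(name, 0) <= 0)


totals = {"checkout": 182340.0, "legacy-report": 0.0}
print(services_without_http_traffic(["checkout", "legacy-report", "old-gateway"], totals))
```

Comparing against the list of deployed services matters because a service that never received a request has no series at all, so it would not show up in the query result.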

Services communicating via Kafka rather than HTTP need a different detection method. The Kafka consumer metrics exposed through Spring Boot Actuator work here:

kafka_consumer_fetch_manager_records_consumed_rate

This metric shows the rate at which a consumer processes messages. A sustained rate of zero indicates an idle consumer that is not actually doing any work.

Batch jobs run on schedules rather than responding to traffic, so traffic-based metrics do not help. Instead, the kube_cronjob_spec_suspend metric from kube-state-metrics shows whether each CronJob is suspended:

kube_cronjob_spec_suspend
    │
    ├── Value = 1 ──► CronJob is suspended (not running)
    │
    └── Value = 0 ──► CronJob is active

Services whose batch jobs have been suspended for extended periods are candidates for cleanup.
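To decide what counts as an extended period, the samples of that metric can be reduced to the length of the current suspended streak. This is a minimal sketch that assumes the samples have already been fetched as time-ordered (timestamp, value) pairs; suspended_streak_seconds is a hypothetical name.

```python
def suspended_streak_seconds(samples: list) -> float:
    """Length of the trailing run of suspend == 1 samples, in seconds.

    samples is a time-ordered list of (unix_timestamp, value) pairs for one
    CronJob's kube_cronjob_spec_suspend series.
    """
    streak_start = None
    for timestamp, value in samples:
        if value == 1:
            if streak_start is None:
                streak_start = timestamp
        else:
            streak_start = None  # the job was active again; reset the streak
    if streak_start is None:
        return 0.0
    return float(samples[-1][0] - streak_start)


DAY = 86_400
samples = [(0, 0), (1 * DAY, 1), (2 * DAY, 0), (3 * DAY, 1), (10 * DAY, 1)]
print(suspended_streak_seconds(samples) / DAY)
```

Only the trailing streak counts, so a job that was briefly suspended and then resumed is not flagged.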

The system generates weekly notifications identifying potentially unused services. For Kafka-based services:

vivasystem                                                10:10 AM
─────────────────────────────────────────────────────────────────────
Unused Service Verification Request

Based on KAFKA requests, the following services have had no traffic 
for one week. Please verify whether these services are still in use.

[@kim-A, @kim-B, @kim-C, @kim-D] Service A
[@kim-E, @kim-F, @kim-G, @kim-H] Service B
[@kim-I, @kim-J, @kim-K, @kim-L] Service C
[@kim-M, @kim-N, @kim-O] Service D
[@kim-P, @kim-Q] Service E
[@kim-R, @kim-S, @kim-T] Service F
[@kim-U, @kim-V, @kim-W, @kim-X] Service G
[@kim-Y, @kim-Z] Service H

For HTTP-based services:

vivasystem                                                10:10 AM
─────────────────────────────────────────────────────────────────────
Unused HTTP Service Verification Request

Based on HTTP requests, the following services have had no traffic 
for one week. Please verify whether these services are still in use.

[@kim-A, @kim-B, @kim-C, @kim-D] Service I
[@kim-E, @kim-F, @kim-G, @kim-H] Service J
[@kim-I, @kim-J, @kim-K, @kim-L] Service K
[@kim-M, @kim-N, @kim-O] Service L
[@kim-P, @kim-Q] Service M
[@kim-R, @kim-S, @kim-T] Service N
[ ] Service O
[ ] Service P

Each notification tags the service owners so they can confirm whether the service should be decommissioned. Services without owner information (empty brackets) require additional investigation.

Eighty services removed. Over a thousand CPU cores and a terabyte of memory returned to the cluster. Fewer services also means fewer potential alert sources, simpler dependency graphs, and easier capacity planning.

Summary

The improvements span two levels.

At the cluster level, replacing Kubespray with a custom bootstrap system eliminated hours of manual node management work. Node join time dropped by 100%, OS upgrades by 94%, and Kubernetes upgrades by 75%.

At the service level, pipeline-as-code reduced the friction of onboarding new services. Automated deployment lag detection kept environments consistent. Systematic unused service detection recovered resources that would otherwise sit idle — 80 services removed, 1,280 cores and 1,280 GB of memory reclaimed.

The bootstrap system runs as a CronJob. The notification system queries existing metrics stores. The pipeline definitions are YAML files in Git repositories. Identifying specific pain points, measuring them, and building targeted solutions made the difference.

Building a Kubernetes Cluster with kubeadm: From Theory to Practice

Sigrid Jin — Sat, 24 Jan 2026 19:58:57 GMT

1. Introduction

If you’ve ever tried setting up a Kubernetes cluster manually, following something like Kelsey Hightower’s “Kubernetes The Hard Way”, you know the pain. Generating certificates with OpenSSL, writing systemd unit files, configuring etcd, setting up each control plane component one by one… It’s educational, sure, but it’s also tedious and error-prone. Nobody wants to do that every time they need a new cluster.

That’s where kubeadm comes in.

kubeadm is the official cluster bootstrapping tool provided by the Kubernetes project. It’s maintained by the Cluster Lifecycle SIG (Special Interest Group), which means it’s not some third-party tool — it’s part of the Kubernetes ecosystem itself.

The official documentation describes it like this:

“Kubeadm is a tool built to provide kubeadm init and kubeadm join as best-practice ‘fast paths’ for creating Kubernetes clusters.”

In plain English: kubeadm lets you spin up a working Kubernetes cluster with just two commands. kubeadm init on your control plane node, kubeadm join on your worker nodes, and you're done. Well, mostly done—but we'll get to that.

Here’s the thing about kubeadm that trips people up at first: it’s intentionally limited in scope.

The official docs put it this way:

“kubeadm performs the actions necessary to get a minimum viable cluster up and running.”

Notice the phrase “minimum viable cluster.” kubeadm isn’t trying to be an all-in-one solution. It’s not Ansible. It’s not Terraform. It doesn’t provision your VMs, install Docker, or set up your network plugin. It does exactly one thing: bootstrap a Kubernetes cluster. Nothing more, nothing less.

This design choice is deliberate. By keeping kubeadm focused on bootstrapping, the Kubernetes team made it flexible enough to work anywhere — on-premises data centers, cloud VMs, bare metal servers, your laptop, whatever. The idea is that higher-level tools (like kubespray, Cluster API, or even minikube) can build on top of kubeadm to provide more complete solutions.

So what exactly happens when you run kubeadm init? Quite a lot, actually:

Certificate Generation

kubeadm creates all the TLS certificates your cluster needs. This includes:

The cluster CA (Certificate Authority)
API server certificates
etcd certificates (server, peer, and client certs)
Front proxy certificates for API aggregation
Service account signing keys

If you’ve done this manually before, you know this alone saves hours of work.

kubeconfig Files

It generates kubeconfig files for various components:

admin.conf for cluster administrators
kubelet.conf for the kubelet
controller-manager.conf for the controller manager
scheduler.conf for the scheduler

Each of these contains the credentials and cluster information needed for that component to talk to the API server.

Static Pod Manifests

Here’s something interesting: kubeadm doesn’t run the control plane components as regular system services. Instead, it creates Static Pod manifests in /etc/kubernetes/manifests/. The kubelet watches this directory and automatically runs whatever Pod definitions it finds there.

This means etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all run as Pods managed by the kubelet — even though the kubelet itself runs as a systemd service. It’s a clever approach that makes the control plane self-healing to some degree.
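A quick way to see this on a control plane node is to check which manifests are present. The sketch below assumes a default kubeadm layout with a stacked (local) etcd; missing_manifests is a hypothetical helper, while the file names are the ones kubeadm writes.

```python
from pathlib import Path

# File names kubeadm uses for the control-plane Static Pods.
EXPECTED_MANIFESTS = (
    "etcd.yaml",
    "kube-apiserver.yaml",
    "kube-controller-manager.yaml",
    "kube-scheduler.yaml",
)


def missing_manifests(manifest_dir: str = "/etc/kubernetes/manifests") -> list:
    """Return the expected Static Pod manifests that are absent from the directory."""
    directory = Path(manifest_dir)
    return [name for name in EXPECTED_MANIFESTS if not (directory / name).is_file()]


# On a healthy kubeadm control plane node this prints an empty list.
print(missing_manifests())
```

With an external etcd cluster there is no etcd.yaml, so that entry would be reported as missing even though nothing is wrong.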

Bootstrap Token Setup

kubeadm sets up the infrastructure for worker nodes to join the cluster securely. This includes creating bootstrap tokens and configuring the RBAC rules needed for the TLS bootstrap process. We’ll dive deeper into this when we cover kubeadm join.

Core Addons

Finally, kubeadm installs two essential addons:

CoreDNS: The cluster DNS service
kube-proxy: The network proxy that runs on every node

These are deployed as regular Kubernetes workloads (Deployment and DaemonSet respectively), not as Static Pods.

What kubeadm leaves out is equally important to understand. kubeadm explicitly does NOT handle:

Machine Provisioning

kubeadm won’t create VMs or provision bare metal servers for you. You need to have your machines ready before running kubeadm. Whether you’re using Vagrant, Terraform, cloud console, or physically racking servers — that’s on you.

Container Runtime Installation

Before kubeadm can do anything, you need a CRI-compatible container runtime already installed and running. containerd, CRI-O, whatever you prefer — kubeadm expects it to be there. It won’t install it for you.

kubelet Installation

Same story with kubelet. You need to install the kubelet binary and set up its systemd service yourself. kubeadm will configure the kubelet, but it won’t install it.

CNI Plugin Installation

This one catches a lot of people off guard. After kubeadm init completes, your node will show as "NotReady" and CoreDNS pods will be stuck in "Pending" state. Why? Because kubeadm doesn't install a CNI (Container Network Interface) plugin.

You need to install Flannel, Calico, Cilium, or another CNI plugin yourself. Without it, pods can’t get IP addresses and the cluster network doesn’t work.

Additional Addons

Things like the Kubernetes Dashboard, monitoring solutions (Prometheus, Grafana), ingress controllers, storage provisioners — none of these are kubeadm’s concern. Install them yourself after the cluster is up.

The documentation explicitly states:

“Instead, we expect higher-level and more tailored tooling to be built on top of kubeadm.”

And that’s exactly what happened. Tools like minikube, kind, kubespray, and Cluster API all use kubeadm under the hood while adding their own automation layers on top.

If you’ve gone through Kubernetes The Hard Way, it’s helpful to understand how kubeadm maps to that manual process:

Manual Step (Hard Way)                                 kubeadm Equivalent
Generate CA and TLS certificates with OpenSSL          kubeadm init phase certs
Create kubeconfig files manually                       kubeadm init phase kubeconfig
Set up etcd as a systemd service                       kubeadm init phase etcd (Static Pod)
Configure and run control plane as systemd services    kubeadm init phase control-plane (Static Pods)
Configure kubelet                                      kubeadm init phase kubelet-start
Distribute certificates to worker nodes manually       TLS Bootstrap with bootstrap tokens

There are some notable differences beyond just automation:

Static Pods vs systemd Services

The Hard Way runs control plane components directly as systemd services. kubeadm uses Static Pods instead. This means the kubelet manages these components, which has implications for how you troubleshoot and manage them.

etcd Communication

The Hard Way (for educational simplicity) often configures etcd with plain HTTP. kubeadm always uses HTTPS with mutual TLS. Every etcd connection requires proper certificates — even health checks.

Worker Node Authentication

The Hard Way has you manually copying certificates to worker nodes. kubeadm uses a much more elegant approach called TLS Bootstrap, where worker nodes use a temporary token to authenticate and then request their own certificates through a CSR (Certificate Signing Request) process.
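The temporary token has a fixed shape: a 6-character token ID and a 16-character secret, separated by a dot, using only lowercase letters and digits. The sketch below validates and generates strings in that format; the helper names are hypothetical, and on a real cluster you would let kubeadm create the token rather than generating one yourself.

```python
import re
import secrets
import string

# Bootstrap tokens look like "<token-id>.<token-secret>": 6 and 16 characters,
# lowercase letters and digits only.
TOKEN_PATTERN = re.compile(r"[a-z0-9]{6}\.[a-z0-9]{16}")
ALPHABET = string.ascii_lowercase + string.digits


def is_valid_bootstrap_token(token: str) -> bool:
    """Check that a string has the bootstrap token format."""
    return TOKEN_PATTERN.fullmatch(token) is not None


def generate_bootstrap_token() -> str:
    """Generate a random string in the same format (illustrative only)."""
    token_id = "".join(secrets.choice(ALPHABET) for _ in range(6))
    token_secret = "".join(secrets.choice(ALPHABET) for _ in range(16))
    return f"{token_id}.{token_secret}"


print(is_valid_bootstrap_token("abcdef.0123456789abcdef"))
print(is_valid_bootstrap_token("ABCDEF.0123456789abcdef"))
```

The token ID is the public part that shows up in object names, while the secret half is what actually authenticates the joining node.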

Control Plane kubelet

Here’s a subtle but important difference: in The Hard Way, the control plane node doesn’t necessarily need kubelet since you’re running everything as systemd services. With kubeadm, kubelet is required on the control plane because it’s responsible for running the control plane components as Static Pods.

kubeadm’s command structure is pretty straightforward. The commands fall into three categories:

Lifecycle Commands

These manage the cluster lifecycle:

kubeadm init — Initialize a control plane node
kubeadm join — Join a node to the cluster (worker or additional control plane)
kubeadm upgrade — Upgrade the cluster to a new Kubernetes version
kubeadm reset — Tear down what init or join created

Security Commands

These handle certificates and authentication:

kubeadm certs — Manage cluster certificates (renew, check expiration, etc.)
kubeadm kubeconfig — Manage kubeconfig files
kubeadm token — Manage bootstrap tokens

Utility Commands

These are helpers:

kubeadm config — View or migrate cluster configuration
kubeadm version — Print kubeadm version

The relationship between these commands looks something like this:

init ──→ join ──→ upgrade
                      │
                      ↓
                    reset

You start with init, expand with join, maintain with upgrade, and clean up with reset if needed.

Understanding kubeadm’s scope and design philosophy saves you from frustration later. When your node shows “NotReady” after init, you won’t panic — you’ll know it’s because you haven’t installed a CNI plugin yet. When you need to set up monitoring, you won’t waste time looking for a kubeadm flag that doesn’t exist.

kubeadm is a foundation, not a complete house. It gives you a solid base to build on, but the finishing work is up to you, or the higher-level tools you choose to use.

2. Environment & Prerequisites

Before we can run kubeadm init, there’s quite a bit of groundwork to lay. Kubernetes has specific requirements about the underlying system, and skipping any of these steps will cause problems later. Everything in this section needs to be done on all nodes, both the control plane and workers.

For this setup, I’m using three Rocky Linux 10 virtual machines provisioned with Vagrant. Rocky Linux is a RHEL derivative, so if you’re on CentOS, AlmaLinux, or RHEL itself, the commands will be nearly identical.

Here’s what the node setup looks like:

k8s-ctr    192.168.10.100    Control Plane
k8s-w1     192.168.10.101    Worker Node
k8s-w2     192.168.10.102    Worker Node

The target versions are containerd 2.1.5 and Kubernetes 1.32.11. I chose containerd 2.x specifically because it supports Kubernetes versions from 1.32 through 1.35, which gives us room for future upgrades without changing the container runtime.

First, let’s verify that our system meets the minimum requirements. kubeadm needs at least 2 GB of RAM per machine and at least 2 CPUs on control plane nodes.

lscpu
# CPU(s): 4
# Architecture: aarch64

free -h
# Mem: 2.8Gi

Four CPUs and 2.8GB of memory. That’s more than enough. Note that the architecture here is aarch64 because I’m running on Apple Silicon. If you’re on x86_64, the commands are the same but container images will be pulled for a different architecture automatically.

For disk space, check that you have enough room for container images and logs:

df -hT /
# /dev/sda3   xfs   60G   2.5G   58G   5%   /

58GB free on the root partition is plenty.
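If you have more than a handful of nodes, the same CPU and memory check is easy to script. This is a minimal sketch: the helper names are hypothetical, the thresholds are parameters that default to the 2 CPU / 2 GB figures mentioned above, and the sample text stands in for the contents of /proc/meminfo.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo content and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemTotal not found")


def meets_minimum(cpus: int, mem_gib: float,
                  min_cpus: int = 2, min_mem_gib: float = 2.0) -> bool:
    """Compare a node's resources against the configured minimums."""
    return cpus >= min_cpus and mem_gib >= min_mem_gib


# Sample input; on a real node, read /proc/meminfo and use os.cpu_count().
sample_meminfo = "MemTotal:        2936012 kB\nMemFree:          512000 kB\n"
mem = mem_total_gib(sample_meminfo)
print(round(mem, 1), meets_minimum(cpus=4, mem_gib=mem))
```

Keep in mind that a VM created with exactly 2 GB usually reports slightly less than 2 GiB as MemTotal, so you may want a lower min_mem_gib for lab machines.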

Network interfaces are where things get a bit tricky, especially in Vagrant environments. Most VMs have multiple network interfaces, and you need to know which one to use for cluster communication.

ip -br -c -4 addr
# lo        UNKNOWN   127.0.0.1/8
# enp0s8    UP        10.0.2.15/24
# enp0s9    UP        192.168.10.100/24

In my setup, enp0s8 is the NAT interface that Vagrant uses for internet access, and enp0s9 is the host-only network for communication between VMs. The cluster should use enp0s9 (192.168.10.x addresses), not the NAT interface. This becomes important later when we configure kubeadm init and join.

Kubernetes uses Linux cgroups to manage container resources like CPU and memory limits. There are two versions of cgroups, and Kubernetes 1.25+ prefers cgroup v2.

stat -fc %T /sys/fs/cgroup
# cgroup2fs

If you see cgroup2fs, you're on cgroup v2. If it says tmpfs, you're on v1. Rocky Linux 10 uses v2 by default, so we're good.
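The same check can be done by looking at what is mounted on /sys/fs/cgroup. Here is a small sketch that works on the contents of /proc/mounts; cgroup_version is a hypothetical helper and the sample lines are abbreviated examples rather than output from my VMs.

```python
def cgroup_version(mounts_text: str) -> int:
    """Return 2 or 1 based on the filesystem mounted at /sys/fs/cgroup.

    mounts_text is the content of /proc/mounts. A cgroup2 filesystem means the
    unified v2 hierarchy; a tmpfs there holds the per-controller v1 mounts.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "/sys/fs/cgroup":
            if fields[2] == "cgroup2":
                return 2
            if fields[2] == "tmpfs":
                return 1
    raise ValueError("no cgroup mount found at /sys/fs/cgroup")


v2_mounts = "proc /proc proc rw 0 0\ncgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec 0 0\n"
v1_mounts = "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec 0 0\n"
print(cgroup_version(v2_mounts), cgroup_version(v1_mounts))
```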

The next question is which cgroup driver to use. There are two options: cgroupfs and systemd. Since Rocky Linux uses systemd as its init system, we should use the systemd cgroup driver. This avoids conflicts where both systemd and kubelet try to manage cgroups independently. We’ll configure this later when setting up containerd.

All nodes in a Kubernetes cluster must have synchronized time. Certificate validation, log timestamps, and various distributed system operations depend on it.

timedatectl status
# Time zone: UTC
# NTP service: active
# RTC in local TZ: yes

If you see a warning about RTC being in local timezone, fix it by setting RTC to UTC:

timedatectl set-local-rtc 0

Set your timezone (I’m using Seoul):

timedatectl set-timezone Asia/Seoul

Make sure NTP is enabled:

timedatectl set-ntp true

You can verify that time synchronization is working by checking chrony:

chronyc sources -v
# ^* 175.210.18.47   2   9   377   152   +490us[+478us] +/-   11ms

The ^* marker indicates the currently selected NTP source. A Stratum 2 server with sub-millisecond offset is perfectly fine for our purposes.

SELinux provides mandatory access control on Linux systems. While it’s great for security, it can interfere with Kubernetes operations because containers need to access host filesystems for things like pod networking and volumes.

The recommended approach is to set SELinux to permissive mode. This logs policy violations without blocking them, so you can still see what would have been denied.

getenforce
# Enforcing

setenforce 0
getenforce
# Permissive

That changes the runtime setting, but it won’t survive a reboot. To make it permanent:

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

For a lab environment, the simplest approach is to disable the firewall entirely:

systemctl disable --now firewalld

If you’re setting this up in production and need the firewall enabled, you’ll need to open these ports:

On the control plane: 6443 (API server), 2379–2380 (etcd), 10250 (kubelet), 10259 (scheduler), 10257 (controller manager).

On worker nodes: 10250 (kubelet), 30000–32767 (NodePort services).
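Once the cluster is running, you can confirm from another node that the firewall actually lets these ports through. The sketch below is a plain TCP reachability test using the standard library; the helper names are hypothetical, and a port only counts as open if something is listening on it, so run this after kubeadm init.

```python
import socket

# Control plane ports from the list above.
CONTROL_PLANE_PORTS = {
    6443: "API server",
    2379: "etcd client",
    2380: "etcd peer",
    10250: "kubelet",
    10257: "controller manager",
    10259: "scheduler",
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host: str, ports=CONTROL_PLANE_PORTS) -> list:
    """Return the ports on host that could not be reached."""
    return [port for port in ports if not port_is_open(host, port)]
```

For example, closed_ports("192.168.10.100") run from a worker lists any control plane port that is blocked or not listening on an address the worker can reach. Several components bind to localhost by default, so expect some of these to show up as closed from a remote node even with the firewall open.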

Disabling swap is effectively a hard requirement. With its default settings, the kubelet will refuse to start if swap is enabled.

The reasoning is straightforward. The scheduler makes decisions based on how much memory each node has available. If swap is enabled, a node might appear to have free memory when it’s actually swapping, which leads to poor scheduling decisions. There’s also a performance argument: when a container starts swapping, it becomes extremely slow, and you’d rather have the OOM killer terminate it so Kubernetes can reschedule it elsewhere.

Check current swap status:

free -h | grep -i swap
# Swap:   3.8Gi   0B   3.8Gi

Disable it:

swapoff -a

To make this permanent across reboots, remove the swap entry from /etc/fstab:

sed -i '/swap/d' /etc/fstab

Verify it’s gone:

free -h | grep -i swap
# Swap:   0B   0B   0B
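If you want to verify both the runtime state and the persistent configuration in one go, the two relevant files are easy to parse. This is a sketch with hypothetical helper names; the sample strings stand in for /proc/swaps and /etc/fstab.

```python
def active_swap_devices(proc_swaps_text: str) -> list:
    """Return device names listed in /proc/swaps (the first line is a header)."""
    lines = proc_swaps_text.strip().splitlines()[1:]
    return [line.split()[0] for line in lines if line.strip()]


def fstab_swap_entries(fstab_text: str) -> list:
    """Return non-comment /etc/fstab lines whose filesystem type (3rd field) is swap."""
    entries = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and not fields[0].startswith("#") and fields[2] == "swap":
            entries.append(line)
    return entries


sample_swaps = "Filename Type Size Used Priority\n/dev/dm-1 partition 4000764 0 -2\n"
sample_fstab = (
    "/dev/mapper/rl-root / xfs defaults 0 0\n"
    "/dev/mapper/rl-swap none swap defaults 0 0\n"
)
print(active_swap_devices(sample_swaps))
print(fstab_swap_entries(sample_fstab))
```

Matching on the filesystem type field is a bit stricter than the sed pattern above, which deletes any line that merely contains the word swap.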

Kubernetes networking requires two kernel modules: overlay for container filesystem layering, and br_netfilter for bridge network traffic to pass through iptables.

Load them:

modprobe overlay
modprobe br_netfilter

Verify they’re loaded:

lsmod | grep -E 'overlay|br_netfilter'
# br_netfilter   32768   0
# bridge         327680  1 br_netfilter
# overlay        200704  0

To ensure they load automatically on boot, create a configuration file:

cat <<EOF | tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

With the modules loaded, we need to configure some kernel parameters for Kubernetes networking to work properly.

cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

The bridge-nf-call settings make sure that traffic crossing a bridge (which is how container networking works) goes through iptables rules. This is essential for Service networking and network policies to function. The ip_forward setting allows the node to act as a router, forwarding packets between pods on different nodes.

Apply the settings:

sysctl --system

You should see your k8s.conf file being applied in the output. Verify the values took effect:

sysctl net.bridge.bridge-nf-call-iptables
# net.bridge.bridge-nf-call-iptables = 1

sysctl net.ipv4.ip_forward
# net.ipv4.ip_forward = 1
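Checking three values by hand is fine, but across several nodes it helps to compare the sysctl output against the required set programmatically. The helpers below are hypothetical names for a minimal sketch; the sample text imitates sysctl output with one value deliberately wrong and one missing.

```python
REQUIRED = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.bridge.bridge-nf-call-ip6tables": "1",
    "net.ipv4.ip_forward": "1",
}


def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines as printed by sysctl or stored in sysctl.d files."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_or_wrong(actual: dict, required: dict = REQUIRED) -> dict:
    """Return {key: actual_value_or_None} for every requirement that is not met."""
    return {k: actual.get(k) for k, v in required.items() if actual.get(k) != v}


sample_output = "net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 0\n"
print(missing_or_wrong(parse_sysctl(sample_output)))
```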

Each node needs to be able to resolve the hostnames of other nodes. The simplest way is to add entries to /etc/hosts.

First, there’s a quirk with Vagrant that we need to handle. Vagrant often adds an entry like 127.0.1.1 hostname to /etc/hosts, which causes the hostname to resolve to localhost instead of the actual IP. This will break kubelet registration.

cat /etc/hosts
# 127.0.0.1   localhost localhost.localdomain
# 127.0.1.1   k8s-ctr k8s-ctr    <- problematic line

Remove it:

sed -i '/^127\.0\.\(1\|2\)\.1/d' /etc/hosts

Now add entries for all cluster nodes:

cat <<EOF >> /etc/hosts
192.168.10.100 k8s-ctr
192.168.10.101 k8s-w1
192.168.10.102 k8s-w2
EOF

Test connectivity:

ping -c 1 k8s-ctr
ping -c 1 k8s-w1
ping -c 1 k8s-w2

All three should respond from their respective 192.168.10.x addresses.
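The loopback quirk is easy to miss when you provision many VMs, so here is a small sketch that flags it. loopback_aliases is a hypothetical helper that takes the contents of /etc/hosts and a hostname, and returns any loopback addresses the hostname is mapped to.

```python
import ipaddress


def loopback_aliases(hosts_text: str, hostname: str) -> list:
    """Return loopback addresses that /etc/hosts maps to the given hostname."""
    hits = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # not an IP address; ignore malformed lines
        if is_loopback and hostname in names:
            hits.append(address)
    return hits


sample_hosts = (
    "127.0.0.1   localhost localhost.localdomain\n"
    "127.0.1.1   k8s-ctr k8s-ctr\n"
    "192.168.10.100 k8s-ctr\n"
)
print(loopback_aliases(sample_hosts, "k8s-ctr"))
```

An empty result means the hostname resolves only to a real interface address, which is what the kubelet needs.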

At this point, the system is ready for installing the container runtime and Kubernetes components. All the kernel-level prerequisites are in place: cgroups are configured, necessary modules are loaded, networking parameters are set, and the nodes can communicate with each other. In the next section, we’ll install containerd and the kubeadm toolchain.

3. Component Installation

With the system prerequisites configured, we can now install the core components needed to run Kubernetes. This section covers two main parts: the container runtime (containerd) and the Kubernetes tools (kubeadm, kubelet, kubectl).

All commands in this section should be run on every node in your cluster, both control plane and workers.

Kubernetes needs a container runtime that implements the Container Runtime Interface (CRI). We’re using containerd, which is a CNCF graduated project and the industry standard container runtime. It focuses on simplicity, robustness, and portability.

Before we dive into installation, let’s talk about version compatibility. This is something that can bite you later if you don’t pay attention. Since we’re installing Kubernetes 1.32 and planning to upgrade to 1.33 and 1.34 later, we need a containerd version that supports all of them.

Looking at the compatibility matrix, containerd 2.1.5 works with Kubernetes 1.32 through 1.35. That gives us plenty of room for future upgrades.

One thing to watch out for: containerd 2.x uses a different configuration file format (version 3) compared to containerd 1.x (version 2). If you’re coming from an older setup or following outdated tutorials, this can cause confusion. We’re starting fresh with 2.x, so we don’t have to worry about migration.

Adding the Docker Repository

containerd packages are distributed through Docker’s repository. Let’s add it first.

dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

Verify the repository was added:

dnf repolist

You should see docker-ce-stable in the list. Now refresh the metadata cache:

dnf makecache

Let’s check what versions are available:

dnf list --showduplicates containerd.io

You’ll see both 1.7.x and 2.x versions listed. We want 2.1.5.

Installing containerd

dnf install -y containerd.io-2.1.5-1.el10

After installation, verify the components:

containerd --version
runc --version
ctr --version

You should see containerd v2.1.5 and runc v1.3.3 (runc is the low-level container runtime that containerd uses internally).

Now here’s something important. The default configuration file that ships with the package has a problem for our use case:

cat /etc/containerd/config.toml | grep disabled_plugins

You’ll see disabled_plugins = ["cri"]. This means the CRI plugin is disabled by default, which makes containerd useless for Kubernetes. We need to generate a proper configuration.

Configuring containerd

Generate the default configuration:

containerd config default | tee /etc/containerd/config.toml

This creates a complete configuration file with all the defaults. Now check the version:

head /etc/containerd/config.toml

You should see version = 3 at the top, confirming we're using the containerd 2.x configuration format.

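The critical setting we need to change is the cgroup driver. By default, SystemdCgroup is set to false, but we need it to be true. Why? Because our system uses systemd as the init system, and systemd already manages the cgroup hierarchy. If containerd manages cgroups on its own, the node ends up with two cgroup managers that have different views of the same resources. The kubelet and the container runtime must also use the same driver, and kubeadm configures the kubelet for systemd by default. When SystemdCgroup is true, containerd delegates cgroup management to systemd, keeping everything consistent.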

cat /etc/containerd/config.toml | grep -i systemdcgroup

You’ll see SystemdCgroup = false. Let's fix that:

sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml

Verify the change:

cat /etc/containerd/config.toml | grep -i systemdcgroup

Now it should show SystemdCgroup = true.
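The three things we just checked by hand (config version 3, CRI not disabled, SystemdCgroup = true) can be bundled into one check. This is a rough sketch, assuming the file looks like the output of containerd config default. check_containerd_config is a hypothetical helper name, and the grep patterns are deliberately simple rather than a real TOML parser.

```shell
# check_containerd_config FILE
# Hypothetical helper: prints one line per problem, returns non-zero on any.
check_containerd_config() {
  cfg=$1
  rc=0
  # containerd 2.x writes "version = 3" at the top of the file.
  if ! grep -Eq '^[[:space:]]*version[[:space:]]*=[[:space:]]*3' "$cfg"; then
    echo "config version is not 3"
    rc=1
  fi
  # The packaged default disables the CRI plugin.
  if grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$cfg"; then
    echo "CRI plugin is disabled"
    rc=1
  fi
  # The runc options must hand cgroup management to systemd.
  if ! grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$cfg"; then
    echo "SystemdCgroup is not true"
    rc=1
  fi
  return $rc
}
```

Run it as check_containerd_config /etc/containerd/config.toml before starting the service. Silence means all three conditions hold.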

Starting containerd

systemctl daemon-reload
systemctl enable --now containerd
systemctl status containerd --no-pager

The service should be active and running. Let’s check the logs to make sure our SystemdCgroup setting is being applied:

journalctl -u containerd.service --no-pager | grep -i systemdcgroup

You should see something like "SystemdCgroup":true in the output.

You might also notice an error about CNI in the logs:

failed to load cni during init... error="cni config load failed: no network config found in /etc/cni/net.d"

Don’t worry about this. It’s expected because we haven’t installed a CNI plugin yet. That comes after kubeadm init.

Verifying the Installation

Let’s make sure everything is working. Check the socket:

ls -l /run/containerd/containerd.sock

This Unix socket is how kubelet, ctr, and crictl communicate with containerd.

Check if the socket is listening:

ss -xl | grep containerd

You should see two entries: one for the main socket and one for the ttrpc socket.

Now let’s verify the plugins are loaded correctly:

ctr plugins ls

Look for these key plugins with “ok” status:

io.containerd.cri.v1 (images and runtime) - this is the CRI plugin that kubelet uses
io.containerd.snapshotter.v1 overlayfs - manages container filesystem layers
io.containerd.runtime.v2 task - handles actual container execution via runc

Finally, let’s check the cgroup hierarchy to confirm containerd is running under systemd’s management:

systemd-cgls --no-pager | grep -A2 containerd.service

You should see containerd running under system.slice/containerd.service, which confirms systemd is managing its cgroup.

Now we install the three essential Kubernetes binaries:

kubeadm: the cluster bootstrapping tool we’ll use to initialize the control plane and join worker nodes
kubelet: the node agent that runs on every node and manages pods
kubectl: the command-line tool for interacting with the cluster

Adding the Kubernetes Repository

Kubernetes packages are distributed through their own repository. We need to add it and configure version pinning to prevent accidental upgrades.

cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF

Notice the exclude line at the end. This prevents dnf from automatically upgrading these packages when you run dnf update. Kubernetes version upgrades should be deliberate and controlled, not accidental.

Refresh the cache:

dnf makecache

Checking Available Versions

Because of the exclude setting, a normal dnf list won't show the packages:

dnf list kubelet

This finds no matching packages. To see what’s available, temporarily bypass the exclusion:

dnf list --showduplicates kubelet --disableexcludes=kubernetes

You’ll see all the 1.32.x versions available. The same works for kubeadm and kubectl.

Installing the Packages

dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes

This also installs cri-tools (which provides crictl) and kubernetes-cni (the basic CNI binaries) as dependencies.

Enable the kubelet service:

systemctl enable --now kubelet

Now, here’s something that confuses a lot of people. If you check the kubelet status right now:

systemctl status kubelet --no-pager

You’ll see it’s in a crash loop, constantly restarting. Check the logs:

journalctl -u kubelet --no-pager | tail -20

You’ll see errors like:

failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: open /var/lib/kubelet/config.yaml: no such file or directory

This is completely normal. The kubelet needs configuration files that don’t exist yet. These files are created by kubeadm init on the control plane or kubeadm join on worker nodes. Until then, kubelet will keep trying and failing. Don't try to fix this; it resolves itself after we run kubeadm.

Verifying the Installation

Check the versions:

kubeadm version -o yaml
kubectl version --client=true
kubelet --version

All three should show v1.32.11 (or whatever the latest patch version is in the 1.32 series).

Configuring crictl

crictl is a CLI tool for CRI-compatible container runtimes. It’s useful for debugging and inspecting containers at a lower level than kubectl.

If you run crictl right now:

crictl version

You’ll see warnings about missing configuration. Let’s fix that:

cat << EOF > /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
EOF

Now crictl knows where to find containerd. Run it again:

crictl version

No more warnings. You can also check the runtime info:

crictl info | head -30

Look for these important values:

runtimeReady: true - containerd is working
networkReady: false - CNI not installed yet (expected)

CNI Binaries

The kubernetes-cni package installed basic CNI plugins in /opt/cni/bin/:

ls /opt/cni/bin/

You’ll see plugins like bridge, loopback, portmap, and others. These are the building blocks that CNI plugins like Flannel or Calico use under the hood.

The configuration directory is empty for now:

ls /etc/cni/net.d/

This directory will be populated when we install a CNI plugin after kubeadm init.

Understanding the kubelet Service Configuration

Let’s look at how the kubelet service is set up:

cat /usr/lib/systemd/system/kubelet.service

This is a basic service definition. The interesting part is the drop-in file:

cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf

This file shows how kubeadm integrates with kubelet. It references several files that kubeadm will create:

/etc/kubernetes/bootstrap-kubelet.conf - used during TLS bootstrap
/etc/kubernetes/kubelet.conf - the main kubeconfig for kubelet
/var/lib/kubelet/config.yaml - kubelet configuration
/var/lib/kubelet/kubeadm-flags.env - additional flags from kubeadm

None of these exist yet:

tree /etc/kubernetes
tree /var/lib/kubelet

Both directories are essentially empty. This explains why kubelet is crash-looping. After we run kubeadm init or join, these directories will be populated with all the necessary configuration.
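To make the “which files are still missing” question explicit, here is a small sketch that walks the four paths from the drop-in file and prints the ones that do not exist yet. The function name missing_kubeadm_files and the optional ROOT prefix argument are my own additions for illustration; the prefix just makes it possible to try the function against a scratch directory.

```shell
# missing_kubeadm_files [ROOT]
# Hypothetical helper: prints each kubeadm-managed kubelet file that is absent.
# ROOT is an optional path prefix (empty means the real filesystem).
missing_kubeadm_files() {
  root=${1:-}
  for f in \
    /etc/kubernetes/bootstrap-kubelet.conf \
    /etc/kubernetes/kubelet.conf \
    /var/lib/kubelet/config.yaml \
    /var/lib/kubelet/kubeadm-flags.env
  do
    [ -e "$root$f" ] || echo "$f"
  done
}
```

Before kubeadm init or join, calling missing_kubeadm_files with no arguments should list all four paths. Afterwards the list should shrink. As far as I know, bootstrap-kubelet.conf is only needed during TLS bootstrap, so seeing it in the list on a node that is already running is not necessarily a problem.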

At this point, we have everything installed and ready. The containerd runtime is running and waiting for containers. The kubelet is installed and trying to start (but failing because it has no configuration yet). In the next section, we’ll run kubeadm init to bring up the control plane, which will generate all the missing configuration and finally let kubelet do its job.

4. Control Plane Initialization (kubeadm init)

Now we get to the main event. With all the prerequisites in place, it’s time to actually initialize the control plane. This is where kubeadm does its magic, automating all those tedious steps we would otherwise have to do manually.

You have two ways to configure kubeadm init: command line flags or a YAML configuration file. The command line approach is quick and dirty, good for testing:

kubeadm init \
  --apiserver-advertise-address=192.168.10.100 \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/16 \
  --kubernetes-version=1.32.11

But I prefer the configuration file approach. It’s easier to version control, reproduce, and understand what you’re actually deploying. Here’s what I use:

cat << EOF > kubeadm-init.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
bootstrapTokens:
- token: "123456.1234567890123456"
  ttl: "0s"
  usages:
  - signing
  - authentication
nodeRegistration:
  kubeletExtraArgs:
    - name: node-ip
      value: "192.168.10.100"
  criSocket: "unix:///run/containerd/containerd.sock"
localAPIEndpoint:
  advertiseAddress: "192.168.10.100"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: "1.32.11"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/16"
EOF

Let me explain the important bits here.

The advertiseAddress is the IP address the API Server will advertise to other components. In my Vagrant environment, there are multiple network interfaces. If I don't specify this, kubeadm might pick the wrong one (like the NAT interface at 10.0.2.15), and then nothing works because other nodes can't reach the API Server.

The node-ip in kubeletExtraArgs serves a similar purpose but for the kubelet. Again, this is critical in multi-NIC environments. Without it, the kubelet might register itself with the wrong IP address, and you'll spend hours debugging why nodes can't communicate.

The podSubnet is set to 10.244.0.0/16 because I'm planning to use Flannel as my CNI plugin, and that's Flannel's default. If you're using Calico, you'd typically use 192.168.0.0/16 instead. Make sure this doesn't overlap with your actual node network.
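The “doesn’t overlap” rule is easy to get wrong by eye, so here is a small sketch that checks two IPv4 CIDRs for overlap using plain shell arithmetic. The helper names (ip_to_int, cidr_bounds, cidrs_overlap) are made up for this example, and it assumes well-formed IPv4 input and a shell with 64-bit arithmetic.

```shell
# ip_to_int A.B.C.D -> prints the address as a single integer
ip_to_int() {
  a=${1%%.*}; rest=${1#*.}
  b=${rest%%.*}; rest=${rest#*.}
  c=${rest%%.*}; d=${rest#*.}
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# cidr_bounds A.B.C.D/N -> prints "first last" as integers
cidr_bounds() {
  addr=${1%/*}
  bits=${1#*/}
  ip=$(ip_to_int "$addr")
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  first=$(( ip & mask ))
  last=$(( first | (mask ^ 0xFFFFFFFF) ))
  echo "$first $last"
}

# cidrs_overlap CIDR1 CIDR2 -> exit status 0 if the two ranges intersect
cidrs_overlap() {
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  # Two ranges intersect when each one starts before the other ends.
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

In this lab the nodes live in 192.168.10.0/24, so cidrs_overlap 10.244.0.0/16 192.168.10.0/24 fails (no overlap, which is what we want), while cidrs_overlap 192.168.0.0/16 192.168.10.0/24 succeeds. In other words, Calico’s usual default would collide with this particular node network and would need to be changed.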

The bootstrap token is something I hardcoded for this lab environment. Normally kubeadm generates a random token, but having a fixed token makes it easier to script the worker node join later. In production, you’d let kubeadm generate a random one and set a proper TTL (like 24 hours) instead of “0s” which means it never expires.
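If you do hardcode a token, it has to match the format kubeadm expects: six lowercase alphanumeric characters, a dot, then sixteen more. A quick sanity check could look like the sketch below. The function name is_valid_bootstrap_token is hypothetical. For a random token in the right format, kubeadm token generate does the job.

```shell
# is_valid_bootstrap_token TOKEN
# Hypothetical helper: exit status 0 if TOKEN matches [a-z0-9]{6}.[a-z0-9]{16}
is_valid_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}
```

The lab token 123456.1234567890123456 passes this check. An uppercase or shortened token would not, and kubeadm would reject it as well.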

Before we run the init, let me highlight a few options that often confuse people.

The --control-plane-endpoint option isn't in my config because this is a single control plane setup. But if you ever plan to expand to a high availability setup with multiple control planes, you must set this from the beginning. It should point to your load balancer address. You can't easily add this later because it gets baked into certificates and kubeconfig files. So plan ahead.

The --apiserver-cert-extra-sans option lets you add additional names or IPs to the API Server certificate's Subject Alternative Name field. You need this if you plan to access the API Server through a load balancer, custom domain name, or external IP that isn't automatically detected. For example:

kubeadm init --apiserver-cert-extra-sans=kubernetes.example.com,10.0.0.100

Pre-pulling the control plane images is optional, but I recommend it, especially if your network is slow or you want faster initialization:

kubeadm config images list

This shows you what images kubeadm needs:

registry.k8s.io/kube-apiserver:v1.32.11
registry.k8s.io/kube-controller-manager:v1.32.11
registry.k8s.io/kube-scheduler:v1.32.11
registry.k8s.io/kube-proxy:v1.32.11
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/pause:3.10
registry.k8s.io/etcd:3.5.24-0

Then pull them:

kubeadm config images pull

The benefit is that when you actually run init, it doesn’t have to wait for image downloads. This is especially useful during upgrades where you want to minimize control plane downtime.

You can do a dry run first to see what would happen without actually doing anything:

kubeadm init --config="kubeadm-init.yaml" --dry-run

This creates temporary files in /etc/kubernetes/tmp/kubeadm-init-dryrun* so you can inspect the certificates and manifests that would be generated. Useful for debugging or understanding what's going on.

When you’re ready for real:

kubeadm init --config="kubeadm-init.yaml"

Now let’s walk through what happens. The output tells the whole story.

kubeadm init runs through about 14 phases. Understanding these helps tremendously when troubleshooting.

Preflight checks come first:

[init] Using Kubernetes version: v1.32.11
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster

This validates that your system meets requirements. It checks ports, container runtime, kernel parameters, and so on. If you didn’t pre-pull images, it downloads them here.

Certificate generation is next:

[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [k8s-ctr kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.10.100]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] Generating "etcd/peer" certificate and key
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key

This is all the stuff you’d do manually with OpenSSL in Kubernetes The Hard Way. kubeadm creates three separate CA hierarchies: one for the main cluster components, one for etcd, and one for the front proxy (used by API aggregation). The apiserver certificate output is worth noting because it shows all the SANs that clients can use to connect.

Kubeconfig files get created:

[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "super-admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file

Each component needs its own kubeconfig to authenticate with the API Server. The admin.conf is what you’ll copy to ~/.kube/config for kubectl access.

Static Pod manifests for etcd and control plane components:

[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"

This is a key design choice in kubeadm. Instead of running etcd and control plane components as systemd services (like in Kubernetes The Hard Way), kubeadm deploys them as Static Pods. The kubelet watches the /etc/kubernetes/manifests directory and automatically runs any pod manifests it finds there. This means the kubelet manages the control plane components, which is elegant but also means you need a running kubelet on control plane nodes.

Kubelet starts and waits for control plane:

[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz
[kubelet-check] The kubelet is healthy after 1.002214971s
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is healthy after 3.003031359s

Remember how the kubelet was crash-looping before init? That’s because /var/lib/kubelet/config.yaml didn’t exist. Now kubeadm creates it, and the kubelet can finally start properly. It then waits for the API Server (running as a Static Pod) to become healthy.

Configuration upload to the cluster:

[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system

Once the API Server is up, kubeadm stores its configuration as ConfigMaps. This is important because when you later run kubeadm join on worker nodes, they fetch these ConfigMaps to get consistent settings.

Control plane marking:

[mark-control-plane] Marking the node k8s-ctr as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node k8s-ctr as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]

The label makes the node show up as “control-plane” in kubectl get nodes. The taint prevents regular workload pods from being scheduled on control plane nodes. You want your control plane dedicated to running cluster infrastructure, not your applications.

Bootstrap token setup:

[bootstrap-token] Using token: 123456.1234567890123456
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace

This creates the infrastructure for worker nodes to join the cluster. The cluster-info ConfigMap in kube-public is special because it’s readable without authentication. This allows new nodes (which don’t have certificates yet) to discover the cluster’s CA certificate and API Server address.

Addon installation:

[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

kubeadm installs two essential addons. CoreDNS provides cluster DNS (so pods can resolve service names), and kube-proxy handles service networking (the iptables or IPVS rules that make Services work).

Finally, the completion message:

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.10.100:6443 --token 123456.1234567890123456 \
        --discovery-token-ca-cert-hash sha256:bd763182471f1ed47780644230f234a89061a29041a922a74c849a48342c797d

Save that join command. You’ll need it for the worker nodes.
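If you lose the output, the hash can be recomputed from the CA certificate: it is the SHA-256 digest of the CA’s public key. The sketch below wraps the openssl pipeline from the kubeadm documentation in a function. The name ca_cert_hash is my own, and the pipeline assumes an RSA CA key, which is what kubeadm generates by default.

```shell
# ca_cert_hash CA_CERT
# Hypothetical helper: prints the hex SHA-256 of the certificate's public key,
# i.e. the part after "sha256:" in --discovery-token-ca-cert-hash.
ca_cert_hash() {
  openssl x509 -pubkey -noout -in "$1" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //'
}
```

On the control plane node, echo "sha256:$(ca_cert_hash /etc/kubernetes/pki/ca.crt)" should print the same value as the join command above. Alternatively, kubeadm token create --print-join-command prints a complete, fresh join command.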

Post-init Setup

First thing: set up kubectl access. Since I’m running as root:

mkdir -p /root/.kube
cp -i /etc/kubernetes/admin.conf /root/.kube/config
chown $(id -u):$(id -g) /root/.kube/config

Verify it works:

kubectl cluster-info

Kubernetes control plane is running at https://192.168.10.100:6443
CoreDNS is running at https://192.168.10.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Initial State Verification

Let’s see what we’ve got:

kubectl get nodes -o wide

NAME      STATUS     ROLES           AGE   VERSION    INTERNAL-IP      OS-IMAGE                        CONTAINER-RUNTIME
k8s-ctr   NotReady   control-plane   2m    v1.32.11   192.168.10.100   Rocky Linux 10.0 (Red Quartz)   containerd://2.1.5

The node shows NotReady. This is expected because we haven’t installed a CNI plugin yet. Without CNI, the node can’t set up pod networking, so Kubernetes considers it not ready.

Check the pods:

kubectl get pod -n kube-system -o wide

NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE
coredns-668d6bf9bc-n8jxf          0/1     Pending   0          2m               
coredns-668d6bf9bc-z6h69          0/1     Pending   0          2m               
etcd-k8s-ctr                      1/1     Running   0          2m    192.168.10.100   k8s-ctr
kube-apiserver-k8s-ctr            1/1     Running   0          2m    192.168.10.100   k8s-ctr
kube-controller-manager-k8s-ctr   1/1     Running   0          2m    192.168.10.100   k8s-ctr
kube-proxy-5p6jx                  1/1     Running   0          2m    192.168.10.100   k8s-ctr
kube-scheduler-k8s-ctr            1/1     Running   0          2m    192.168.10.100   k8s-ctr

The Static Pods (etcd, apiserver, controller-manager, scheduler) are running. kube-proxy is running too since it uses hostNetwork. But CoreDNS pods are Pending because they need pod network IPs, which requires CNI.

You can also verify with crictl directly:

crictl ps

CONTAINER      IMAGE          CREATED         STATE     NAME                      POD
dc8f81e24dff7  dcdb790dc2bfe  3 minutes ago   Running   kube-proxy                kube-proxy-5p6jx
28856e606823f  58951ea1a0b5d  3 minutes ago   Running   kube-apiserver            kube-apiserver-k8s-ctr
e7593756117ad  1211402d28f58  3 minutes ago   Running   etcd                      etcd-k8s-ctr
61a09c44673c6  cfa17ff3d6634  3 minutes ago   Running   kube-scheduler            kube-scheduler-k8s-ctr
65d20308c4200  82766e5f2d560  3 minutes ago   Running   kube-controller-manager   kube-controller-manager-k8s-ctr

All the control plane containers are running. The control plane is up, but we need CNI before we can call it fully functional.

What Got Created

After init, here’s what the /etc/kubernetes directory looks like:

/etc/kubernetes/
├── admin.conf
├── controller-manager.conf
├── kubelet.conf
├── scheduler.conf
├── super-admin.conf
├── manifests/
│   ├── etcd.yaml
│   ├── kube-apiserver.yaml
│   ├── kube-controller-manager.yaml
│   └── kube-scheduler.yaml
└── pki/
    ├── ca.crt
    ├── ca.key
    ├── apiserver.crt
    ├── apiserver.key
    ├── apiserver-kubelet-client.crt
    ├── apiserver-kubelet-client.key
    ├── front-proxy-ca.crt
    ├── front-proxy-ca.key
    ├── front-proxy-client.crt
    ├── front-proxy-client.key
    ├── sa.key
    ├── sa.pub
    └── etcd/
        ├── ca.crt
        ├── ca.key
        ├── server.crt
        ├── server.key
        ├── peer.crt
        ├── peer.key
        ├── healthcheck-client.crt
        └── healthcheck-client.key

The kubeconfig files are at the top level. The manifests directory contains the Static Pod definitions that kubelet is running. The pki directory has all the certificates and keys.

Compare this to what you’d create manually in Kubernetes The Hard Way. kubeadm did all of that in about 10 seconds.

5. CNI Plugin Installation

At this point, we have a control plane that is technically running, but if you check the node status, you will see something that might concern you at first.

kubectl get nodes
NAME      STATUS     ROLES           AGE   VERSION
k8s-ctr   NotReady   control-plane   2m    v1.32.11

The node shows NotReady. And if you look at the pods in kube-system namespace, you will notice CoreDNS pods are stuck in Pending state.

kubectl get pod -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
coredns-668d6bf9bc-n8jxf          0/1     Pending   0          2m
coredns-668d6bf9bc-z6h69          0/1     Pending   0          2m
etcd-k8s-ctr                      1/1     Running   0          2m
kube-apiserver-k8s-ctr            1/1     Running   0          2m
kube-controller-manager-k8s-ctr   1/1     Running   0          2m
kube-proxy-5p6jx                  1/1     Running   0          2m
kube-scheduler-k8s-ctr            1/1     Running   0          2m

This is completely expected and not a bug. The reason is simple: we have not installed a CNI plugin yet.

Why CNI is Required

Kubernetes itself does not implement pod networking. Instead, it delegates this responsibility to CNI (Container Network Interface) plugins. The CNI plugin is responsible for assigning IP addresses to pods and setting up the network so that pods can communicate with each other across nodes.

Without a CNI plugin, the container runtime cannot configure networking for pods. The kubelet reports this to the API server through the node’s Ready condition: as long as the runtime says its network is not ready, Ready stays False and the node is marked NotReady. (There is also a NetworkUnavailable condition, but on a fresh kubeadm node it is absent until a network plugin such as Flannel sets it.)

You can verify this by describing the node.

kubectl describe node k8s-ctr | grep -A6 Conditions
Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory
  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID
  Ready                False   KubeletNotReady              container runtime network not ready

The message is clear: container runtime network not ready.

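Now, why is CoreDNS stuck in Pending? CoreDNS is deployed as a regular Deployment, which means it needs to be scheduled on a node and run as a normal pod. Unlike the control plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) that run as Static Pods with hostNetwork: true, CoreDNS needs a pod IP address from the pod network. A NotReady node also carries the node.kubernetes.io/not-ready:NoSchedule taint, which CoreDNS does not tolerate, so the scheduler has nowhere to place it. Even if it could be placed, there would be no CNI to allocate its pod IP.

The control plane components and kube-proxy can run because they use the host network directly. They bind to the node’s IP address rather than requiring a pod IP. Static Pods are also started by the kubelet itself without going through the scheduler, and the kube-proxy DaemonSet tolerates the not-ready taint.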

Choosing a CNI Plugin

There are many CNI plugins available: Calico, Flannel, Cilium, Weave, and others. Each has different features and trade-offs. For this setup, we will use Flannel because it is simple and lightweight. It creates a VXLAN overlay network that allows pods on different nodes to communicate with each other.

Remember when we ran kubeadm init with podSubnet (the config-file equivalent of --pod-network-cidr) set to 10.244.0.0/16? This value is the default for Flannel. If you are using Calico, the default is 192.168.0.0/16. The pod network CIDR must match what the CNI plugin expects, or you need to modify the CNI configuration accordingly.

Installing Flannel

Flannel installation is straightforward. We just need to apply the manifest from the official repository.

kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

This will create several resources: a namespace called kube-flannel, a ServiceAccount, ConfigMap with the Flannel configuration, and most importantly, a DaemonSet that runs the Flannel agent on every node.

namespace/kube-flannel created
serviceaccount/flannel created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

After a few seconds, you can check the Flannel pods.

kubectl get pod -n kube-flannel
NAME                    READY   STATUS    RESTARTS   AGE
kube-flannel-ds-xxxxx   1/1     Running   0          30s

Now let’s check if the node status has changed.

kubectl get nodes
NAME      STATUS   ROLES           AGE   VERSION
k8s-ctr   Ready    control-plane   5m    v1.32.11

The node is now Ready. Let’s also verify that CoreDNS pods are running.

kubectl get pod -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
coredns-668d6bf9bc-n8jxf          1/1     Running   0          5m
coredns-668d6bf9bc-z6h69          1/1     Running   0          5m
etcd-k8s-ctr                      1/1     Running   0          5m
kube-apiserver-k8s-ctr            1/1     Running   0          5m
kube-controller-manager-k8s-ctr   1/1     Running   0          5m
kube-proxy-5p6jx                  1/1     Running   0          5m
kube-scheduler-k8s-ctr            1/1     Running   0          5m

Both CoreDNS pods are now Running. They have been assigned pod IPs from the 10.244.0.0/24 range (the subnet allocated to this node).

kubectl get pod -n kube-system -o wide | grep coredns
coredns-668d6bf9bc-n8jxf   1/1   Running   0   5m   10.244.0.2   k8s-ctr      
coredns-668d6bf9bc-z6h69   1/1   Running   0   5m   10.244.0.3   k8s-ctr
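The relationship between the /16 pod network and the /24 handed to this node is simple arithmetic, sketched below. I am assuming the default per-node mask of /24 for IPv4 here, and the function names are made up for the example.

```shell
# node_subnet_count POD_PREFIX NODE_PREFIX
# How many per-node subnets fit into the cluster-wide pod network.
node_subnet_count() {
  echo $(( 1 << ($2 - $1) ))
}

# addresses_per_subnet NODE_PREFIX
# Host addresses in one node subnet, excluding network and broadcast.
addresses_per_subnet() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

node_subnet_count 16 24     # prints 256
addresses_per_subnet 24     # prints 254
```

So a 10.244.0.0/16 pod network leaves room for 256 nodes with up to 254 addresses each, one of which goes to the node’s own bridge. That is why the first CoreDNS pod lands on 10.244.0.2 rather than 10.244.0.1.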

You can also check the NetworkUnavailable condition, which Flannel sets once it is running on the node.

kubectl describe node k8s-ctr | grep NetworkUnavailable
  NetworkUnavailable   False   FlannelIsUp   Flannel is running on this node

The condition is now False, meaning the network is available.

Flannel creates a virtual network interface called flannel.1 on each node. This interface is used for the VXLAN overlay network.

ip addr show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 3a:2b:1c:4d:5e:6f brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
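The mtu 1450 in that output is not arbitrary. VXLAN wraps each pod packet in an extra set of headers, so the overlay interface has to advertise a smaller MTU than the physical NIC. Here is a quick worked calculation, assuming an IPv4 underlay and a 1500-byte MTU on the node interface (both are assumptions on my part, and vxlan_inner_mtu is just an illustrative name):

```shell
# vxlan_inner_mtu UNDERLAY_MTU
# Subtracts the VXLAN encapsulation overhead for an IPv4 underlay.
vxlan_inner_mtu() {
  outer_ip=20    # outer IPv4 header
  outer_udp=8    # outer UDP header
  vxlan_hdr=8    # VXLAN header
  inner_eth=14   # encapsulated Ethernet header
  echo $(( $1 - outer_ip - outer_udp - vxlan_hdr - inner_eth ))
}

vxlan_inner_mtu 1500   # prints 1450
```

If your underlay uses a different MTU (jumbo frames, or a tunnel that already eats some bytes), the overlay MTU shifts by the same 50 bytes.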

Flannel also adds routes to the kernel routing table. Right now, with only one node, you will not see routes to other nodes yet. But once you add worker nodes, Flannel will automatically add routes for their pod subnets.

ip route | grep 10.244
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1

The cni0 interface is a bridge that Flannel creates to connect pods on the same node. Pods on this node get IP addresses from 10.244.0.0/24 and are connected to this bridge.

When Flannel starts, it writes CNI configuration files to /etc/cni/net.d/. Before Flannel installation, this directory was empty.

ls -la /etc/cni/net.d/
total 8
drwxr-xr-x. 2 root root   54 Jan 23 11:00 .
drwxr-xr-x. 3 root root   19 Jan 23 01:19 ..
-rw-r--r--. 1 root root  292 Jan 23 11:00 10-flannel.conflist

The 10-flannel.conflist file tells the container runtime how to set up networking for new pods. When a pod is created, containerd reads this configuration and calls the appropriate CNI binaries in /opt/cni/bin/ (the standard plugins installed earlier, plus the flannel binary that the Flannel DaemonSet copies there) to configure the pod’s network namespace.

cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

Summary

The control plane is now fully functional. The node is Ready, CoreDNS is running, and we have a working pod network. At this point, you could deploy workloads to the cluster, but they would all run on the control plane node. In a production environment, you typically do not want to run application workloads on the control plane. That is why the control plane node has a taint (node-role.kubernetes.io/control-plane:NoSchedule) that prevents regular pods from being scheduled there.

In the next section, we will add worker nodes to the cluster using kubeadm join. Once the worker nodes join, Flannel will automatically configure routes so that pods on different nodes can communicate with each other through the VXLAN overlay network.

6. Worker Node Join (kubeadm join)

With the control plane up and running, we now need to add worker nodes to the cluster. This is where kubeadm join comes in.

At first glance, joining a node might seem straightforward: just run a command and the node becomes part of the cluster. But there's actually a lot happening under the hood, and understanding this process will help you troubleshoot issues when things go wrong.

The Trust Problem

When a new node wants to join the cluster, we face an interesting chicken-and-egg problem. Think about it from both perspectives.

From the new node’s perspective: “How do I know this API server is legitimate? What if someone is running a fake control plane trying to steal my credentials?”

From the control plane’s perspective: “How do I know this node is allowed to join? What if someone unauthorized is trying to sneak into my cluster?”

This is why kubeadm implements what’s called bidirectional trust. Both sides need to verify each other before the node can join.

Discovery: The Node Validates the Control Plane

The first half of bidirectional trust is called Discovery. This is where the new node confirms that the API server it’s connecting to is actually the legitimate control plane for the cluster.

When you run kubeadm join, the command typically looks something like this:

kubeadm join 192.168.10.100:6443 --token 123456.1234567890123456 \
    --discovery-token-ca-cert-hash sha256:bd763182471f1ed47780644230f234a89061a29041a922a74c849a48342c797d

That --discovery-token-ca-cert-hash flag is doing the heavy lifting here. The new node connects to the API server and retrieves the cluster's CA certificate. Then it computes a SHA-256 hash of that certificate's public key and compares it against the hash you provided. If they match, the node knows it's talking to the real control plane.

You might wonder where this hash comes from. It’s computed from the control plane’s CA public key. You can calculate it yourself on the control plane:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
    openssl rsa -pubin -outform der 2>/dev/null | \
    openssl dgst -sha256 -hex | sed 's/^.* //'

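This approach is sometimes called CA pinning. Even if someone manages to intercept your network traffic, they can’t impersonate the control plane: any CA certificate they present would hash to a different value, and without the real CA’s private key they can’t issue a server certificate that the pinned CA would vouch for.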
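To make the comparison step concrete, here is a minimal Python sketch of the pin check. It assumes you already have the DER-encoded public key bytes (what the openssl pipeline above feeds into the digest). The function names spki_pin and pin_matches are made up for this illustration and are not part of kubeadm.

```python
import hashlib
import hmac


def spki_pin(spki_der: bytes) -> str:
    """Return a kubeadm-style pin ("sha256:<hex>") for DER-encoded public key bytes."""
    return "sha256:" + hashlib.sha256(spki_der).hexdigest()


def pin_matches(spki_der: bytes, expected_pin: str) -> bool:
    """Compare the computed pin with the pin supplied by the operator."""
    return hmac.compare_digest(spki_pin(spki_der), expected_pin.lower())


# Placeholder bytes standing in for a real DER-encoded CA public key.
fake_spki = b"not a real public key"
pin = spki_pin(fake_spki)

print(pin_matches(fake_spki, pin))        # True: same key, same pin
print(pin_matches(b"attacker key", pin))  # False: a different key hashes differently
```

The point is that the node never has to trust the network: it only trusts a hash that you handed to it out of band.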

There’s also an unsafe option called --discovery-token-unsafe-skip-ca-verification that skips this check entirely. As the name suggests, you really shouldn't use this in production. It opens you up to man-in-the-middle attacks where an attacker could set up a fake API server and trick your nodes into joining the wrong cluster.

TLS Bootstrap: The Control Plane Validates the Node

The second half of bidirectional trust is TLS Bootstrap. Now we need to solve the opposite problem: how does the control plane know this new node is authorized to join?

Here’s where it gets interesting. The new node needs a kubelet certificate to communicate securely with the API server. But to request that certificate, it needs to authenticate to the API server first. And to authenticate, it would normally need… a certificate. See the chicken-and-egg problem?

The solution is the bootstrap token. This is a temporary credential that allows a new node to authenticate just long enough to request its real certificate.

When you ran kubeadm init, it created a bootstrap token and stored it as a Secret in the kube-system namespace. The token format is [token-id].[token-secret], like 123456.1234567890123456. The new node presents this token to the API server, which validates it against the stored Secret.
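The token format is easy to check before you paste it into a config file. The sketch below assumes the documented bootstrap token pattern of six lowercase alphanumerics, a dot, then sixteen more. The helper name split_bootstrap_token is hypothetical.

```python
import re
from typing import Tuple

# Bootstrap token pattern: <6 chars>.<16 chars>, lowercase letters and digits only.
TOKEN_RE = re.compile(r"[a-z0-9]{6}\.[a-z0-9]{16}")


def split_bootstrap_token(token: str) -> Tuple[str, str]:
    """Validate a bootstrap token and return (token_id, token_secret)."""
    if not TOKEN_RE.fullmatch(token):
        raise ValueError("token must look like abcdef.0123456789abcdef")
    token_id, token_secret = token.split(".")
    return token_id, token_secret


token_id, token_secret = split_bootstrap_token("123456.1234567890123456")

# The matching Secret in kube-system is named after the token ID.
print(f"bootstrap-token-{token_id}")
```

Only the token ID appears in the Secret's name. The secret half stays inside the Secret's data.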

Once authenticated with the bootstrap token, the new node generates a key pair locally and submits a Certificate Signing Request (CSR) to the API server. The kube-controller-manager has been configured to automatically approve CSRs from nodes authenticated via bootstrap tokens. After approval, the node receives its signed certificate and can now communicate with the API server using proper mTLS.

The flow looks like this:

New node authenticates with bootstrap token
Node generates local key pair
Node submits CSR to API server
Controller manager auto-approves the CSR
Node receives signed certificate
Node switches to using the real certificate for all future communication

The cluster-info ConfigMap

You might be wondering how the new node even knows where to find the CA certificate for validation. The answer is the cluster-info ConfigMap in the kube-public namespace.

During kubeadm init, a ConfigMap called cluster-info is created that contains the API server address and CA certificate. What's special about this ConfigMap is that it's readable by unauthenticated users. This is intentional. A new node that hasn't joined yet obviously can't authenticate, but it still needs to retrieve this bootstrap information.

You can verify this yourself from a worker node before joining:

curl -s -k https://192.168.10.100:6443/api/v1/namespaces/kube-public/configmaps/cluster-info | jq '.data | keys'

This should return:

[
  "jws-kubeconfig-123456",
  "kubeconfig"
]

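The kubeconfig field contains the CA certificate and API server address. The jws-kubeconfig-123456 field is a JSON Web Signature over that kubeconfig, keyed by the bootstrap token whose ID is 123456. A node that holds the token can use it to check that the ConfigMap hasn't been tampered with.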
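Here is a rough Python sketch of how a detached signature like this can be produced and checked. It assumes HMAC-SHA256 (HS256) keyed with the secret half of the bootstrap token and a "header..signature" layout with the payload left out. That is my understanding of the scheme, not code taken from kubeadm, and the function names are illustrative.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(header_b64: str, payload: str, token_secret: str) -> str:
    # The MAC covers "<header>.<payload>", even though the payload is not transmitted.
    signing_input = f"{header_b64}.{b64url(payload.encode())}".encode()
    mac = hmac.new(token_secret.encode(), signing_input, hashlib.sha256)
    return b64url(mac.digest())


def sign_detached(payload: str, token_id: str, token_secret: str) -> str:
    """Return a detached JWS: header, empty payload slot, signature."""
    header = json.dumps({"alg": "HS256", "kid": token_id}, separators=(",", ":"))
    header_b64 = b64url(header.encode())
    return f"{header_b64}..{_signature(header_b64, payload, token_secret)}"


def verify_detached(payload: str, detached: str, token_secret: str) -> bool:
    """Recompute the signature over the payload and compare in constant time."""
    parts = detached.split(".")
    if len(parts) != 3 or parts[1] != "":
        return False
    header_b64, _, signature = parts
    expected = _signature(header_b64, payload, token_secret)
    return hmac.compare_digest(expected.encode(), signature.encode())


kubeconfig = "apiVersion: v1\nkind: Config\n"  # stand-in for the real kubeconfig field
jws = sign_detached(kubeconfig, "123456", "1234567890123456")

print(verify_detached(kubeconfig, jws, "1234567890123456"))               # True
print(verify_detached(kubeconfig + "tampered", jws, "1234567890123456"))  # False
```

Because the key is the token itself, only someone who already holds the token can verify (or forge) the signature, which is why tokens should be short-lived.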

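Try accessing most other API resources (Secrets or Pods, for example) without authentication and you’ll get a 403 Forbidden. Apart from a few non-resource endpoints such as /healthz and /version, the cluster-info ConfigMap is the notable exception.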

Preparing the Worker Node

Before running kubeadm join, the worker node needs the same prerequisites as the control plane: containerd installed, kubelet installed, swap disabled, kernel modules loaded, and so on. I'm assuming you've already done this based on the earlier sections.

Let’s save some baseline information so we can compare before and after:

crictl images | tee -a crictl_images-1.txt
crictl ps -a | tee -a crictl_ps-1.txt
tree /etc/kubernetes | tee -a etc_kubernetes-1.txt
tree /var/lib/kubelet | tee -a var_lib_kubelet-1.txt

At this point, both directories should be essentially empty.

Join Configuration

Just like kubeadm init, you can run kubeadm join with command-line flags or with a configuration file. The configuration file approach is better for reproducibility and version control.

First, figure out which IP address this worker node should use for cluster communication. In my Vagrant environment, there are multiple network interfaces, so I need to be explicit:

NODEIP=$(ip -4 addr show enp0s9 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
echo $NODEIP

Now create the JoinConfiguration file:

cat << EOF > kubeadm-join.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: "123456.1234567890123456"
    apiServerEndpoint: "192.168.10.100:6443"
    unsafeSkipCAVerification: true
nodeRegistration:
  criSocket: "unix:///run/containerd/containerd.sock"
  kubeletExtraArgs:
    - name: node-ip
      value: "$NODEIP"
EOF

A few things to note about this configuration.

The token must match what was created during kubeadm init. In my setup, I hardcoded it to 123456.1234567890123456 for simplicity. In production, you'd typically use kubeadm token list on the control plane to find valid tokens, or create a new one with kubeadm token create.

I’m using unsafeSkipCAVerification: true here for lab convenience. In production, you should use caCertHashes instead and provide the actual CA hash.

The node-ip setting is critical in multi-interface environments like Vagrant. Without it, kubelet might pick the wrong IP address (like the NAT interface at 10.0.2.15) and nodes won't be able to communicate properly.
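For the production-leaning variant mentioned above, the only change is swapping unsafeSkipCAVerification for caCertHashes. As a small sketch, this Python helper renders such a file. The function name render_join_config is hypothetical, and the hash and IP values are just the sample values used elsewhere in this post.

```python
def render_join_config(token: str, endpoint: str, ca_cert_hash: str, node_ip: str) -> str:
    """Render a v1beta4 JoinConfiguration that pins the CA instead of skipping verification."""
    return f"""apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: "{token}"
    apiServerEndpoint: "{endpoint}"
    caCertHashes:
      - "{ca_cert_hash}"
nodeRegistration:
  criSocket: "unix:///run/containerd/containerd.sock"
  kubeletExtraArgs:
    - name: node-ip
      value: "{node_ip}"
"""


config = render_join_config(
    token="123456.1234567890123456",
    endpoint="192.168.10.100:6443",
    ca_cert_hash="sha256:bd763182471f1ed47780644230f234a89061a29041a922a74c849a48342c797d",
    node_ip="192.168.10.101",  # sample worker IP; use the value of $NODEIP on your node
)
print(config)
```

You would write the output to kubeadm-join.yaml and use it exactly like the lab version.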

Running the Join

You can do a dry run first to see what would happen without actually making changes:

kubeadm join --config="kubeadm-join.yaml" --dry-run

This shows you the API calls that would be made and the files that would be created. It’s a good sanity check.

When you’re ready, run the actual join:

kubeadm join --config="kubeadm-join.yaml"

The output walks through the phases:

[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...

First, preflight checks validate that the system meets requirements and can reach the API server.

[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.164948ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

Then kubelet starts up and performs the TLS Bootstrap we discussed earlier. It authenticates with the bootstrap token, submits a CSR, and receives its certificate.

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

That’s it. The node is now part of the cluster.

What Changed on the Worker Node

Let’s look at what files were created. The /etc/kubernetes directory now has content:

tree /etc/kubernetes

/etc/kubernetes
├── kubelet.conf
├── manifests
└── pki
    └── ca.crt

Compare this to the control plane, which has admin.conf, controller-manager.conf, scheduler.conf, and a full pki directory with many certificates and keys. The worker node only needs kubelet.conf for the kubelet to talk to the API server, and ca.crt to validate the API server's certificate. The manifests directory is empty because worker nodes don't run Static Pods.

The kubelet configuration lives in /var/lib/kubelet:

tree /var/lib/kubelet

You’ll see config.yaml (kubelet configuration), kubeadm-flags.env (extra flags including node-ip), and a pki directory containing the kubelet's client certificate that was issued through TLS Bootstrap.

Looking at kubelet.conf, you can see it references the certificate files:

cat /etc/kubernetes/kubelet.conf

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTi...
    server: https://192.168.10.100:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

The kubelet-client-current.pem file is a symlink that gets updated when certificates are rotated. This supports automatic certificate renewal.

Verifying from the Control Plane

Back on the control plane, check the node status:

kubectl get nodes

NAME      STATUS   ROLES           AGE     VERSION
k8s-ctr   Ready    control-plane   30h     v1.32.11
k8s-w1    Ready    <none>          7m29s   v1.32.11
k8s-w2    Ready    <none>          119s    v1.32.11

Both worker nodes show Ready status. This is because Flannel (the CNI plugin we installed earlier) runs as a DaemonSet, so a Flannel pod is automatically placed on each new node and the node's network becomes ready shortly after it joins.

Check what pods are running on the worker nodes:

kubectl get pod -A -owide | grep k8s-w1

kube-flannel   kube-flannel-ds-8vmb6   1/1   Running   0   10m   192.168.10.101   k8s-w1
kube-system    kube-proxy-dkczx        1/1   Running   0   10m   192.168.10.101   k8s-w1

The kube-proxy and kube-flannel pods were automatically scheduled because they’re DaemonSets. Every node gets one.

Each node has been assigned its own Pod CIDR:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

k8s-ctr	10.244.0.0/24
k8s-w1	10.244.1.0/24
k8s-w2	10.244.2.0/24

The kube-controller-manager allocated these from the cluster’s pod network CIDR (10.244.0.0/16), giving each node a /24 subnet.

You can also verify cross-node networking is working. On the control plane, check the routing table:

ip -c route | grep flannel

10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

Flannel has set up routes so traffic destined for other nodes’ pod networks goes through the VXLAN overlay. You can test connectivity by pinging the flannel.1 address of k8s-w1, which is the gateway shown in the route above:

ping -c 1 10.244.1.0

If that works, your cluster networking is healthy.

Troubleshooting Join Issues

The most common problem I’ve seen is kubeadm join hanging at the preflight stage. It just sits there doing nothing. This is almost always a network connectivity issue.

First, check if you can reach the API server from the worker node:

curl -k https://192.168.10.100:6443/healthz

If this times out or fails to connect, you have a network problem. Check the basics: can you ping the control plane? Is the firewall blocking port 6443?

In my lab, I once spent an embarrassing amount of time debugging a join failure. Ping worked fine, but the API server was unreachable. Turned out firewalld had been re-enabled on the control plane after a VM snapshot restore. The fix was simple:

# On control plane
systemctl disable --now firewalld

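Or if you want to keep the firewall, open the required ports (6443 for the API server and 10250 for the kubelet; Flannel's VXLAN backend additionally uses 8472/udp between nodes):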

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --reload

Another issue is when the node shows up but stays in NotReady status. This usually means the CNI plugin isn’t working. Check if the flannel or calico pods are running on that node. If they’re crashing, look at their logs.

Summary

Adding worker nodes to a kubeadm cluster involves a trust negotiation between the new node and the control plane. The node validates the control plane through CA certificate pinning (Discovery), and the control plane validates the node through bootstrap tokens and automatic CSR approval (TLS Bootstrap).

Once joined, the worker node has a minimal footprint compared to the control plane: just a kubelet.conf, a CA certificate, and its own kubelet client certificate. DaemonSets like kube-proxy and your CNI plugin automatically deploy to the new node, and it becomes ready to schedule workloads.

7. Cluster Verification

At this point, we have a control plane running and two worker nodes joined to the cluster. Before we call it done, let’s walk through a proper verification to make sure everything is working as expected. This is not just about running kubectl get nodes and seeing "Ready" status. We want to understand what actually happened under the hood.

The first thing most people do after setting up a cluster is check the nodes. Fair enough.

kubectl get nodes -o wide

NAME      STATUS   ROLES           AGE     VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                        KERNEL-VERSION                  CONTAINER-RUNTIME
k8s-ctr   Ready    control-plane   30h     v1.32.11   192.168.10.100   <none>        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5
k8s-w1    Ready    <none>          7m29s   v1.32.11   192.168.10.101   <none>        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5
k8s-w2    Ready    <none>          119s    v1.32.11   192.168.10.102   <none>        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5

All three nodes are showing Ready status. Notice how the control plane node has the control-plane role while worker nodes show <none>. This is just a label thing. The worker nodes are perfectly functional; they just don't have a role label assigned by default.

The INTERNAL-IP column is important here. Remember when we set node-ip in our kubeadm configuration? This is where it shows up. If you see 10.0.2.15 here instead of your expected IP, something went wrong with the network configuration. In Vagrant environments with multiple network interfaces, this is a common gotcha.

The kubelet on each node reports its capacity to the API server. The scheduler uses this information when deciding where to place pods.

kubectl get nodes -o json | jq ".items[] | {name:.metadata.name} + .status.capacity"

{
  "name": "k8s-ctr",
  "cpu": "4",
  "ephemeral-storage": "60970Mi",
  "memory": "2893976Ki",
  "pods": "110"
}
{
  "name": "k8s-w1",
  "cpu": "4",
  "ephemeral-storage": "60970Mi",
  "memory": "2893976Ki",
  "pods": "110"
}
{
  "name": "k8s-w2",
  "cpu": "4",
  "ephemeral-storage": "60970Mi",
  "memory": "2893976Ki",
  "pods": "110"
}

Each node can run up to 110 pods by default. This is a kubelet configuration that can be changed if needed. The memory and CPU values reflect what we provisioned for our VMs.

This is where things get interesting. When we ran kubeadm init, we specified podSubnet: 10.244.0.0/16 in the configuration. The kube-controller-manager takes this big /16 block and carves out smaller /24 subnets for each node.

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

k8s-ctr	10.244.0.0/24
k8s-w1	10.244.1.0/24
k8s-w2	10.244.2.0/24

So the control plane got 10.244.0.0/24, the first worker got 10.244.1.0/24, and the second worker got 10.244.2.0/24. Any pod scheduled on k8s-w1 will get an IP address from the 10.244.1.0/24 range. This is how Kubernetes ensures that every pod gets a unique IP address across the entire cluster.
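The carving itself is plain subnet arithmetic. This Python sketch reproduces the allocation above with the standard ipaddress module. It assumes nodes receive /24 blocks in the order they registered, which matches this lab but is not something the controller guarantees. The function name carve_node_cidrs is made up.

```python
import ipaddress
from typing import Dict, List


def carve_node_cidrs(cluster_cidr: str, node_names: List[str], node_prefix: int = 24) -> Dict[str, str]:
    """Hand out consecutive per-node subnets from the cluster pod CIDR."""
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return {name: str(subnet) for name, subnet in zip(node_names, subnets)}


allocation = carve_node_cidrs("10.244.0.0/16", ["k8s-ctr", "k8s-w1", "k8s-w2"])
for name, cidr in allocation.items():
    print(f"{name}\t{cidr}")

# A /16 split into /24 blocks leaves room for 256 nodes.
total = len(list(ipaddress.ip_network("10.244.0.0/16").subnets(new_prefix=24)))
print(total)
```

With the default limit of 110 pods per node, a /24 (254 usable addresses) leaves comfortable headroom on each node.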

You can also see this information in the node’s annotations, which Flannel populates:

kubectl describe node k8s-w1 | grep -A5 Annotations

Flannel writes its VXLAN configuration here, including the public IP and MAC address it uses for the overlay network.

Now here is where the CNI plugin earns its keep. For pods on different nodes to talk to each other, there needs to be a route. Let’s check the routing table on the control plane node:

ip -c route | grep flannel

10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

Flannel has automatically added routes for the other nodes’ pod CIDRs. Traffic destined for 10.244.1.0/24 (k8s-w1’s pods) goes through the flannel.1 interface, which is a VXLAN tunnel endpoint. The same applies for k8s-w2’s pod network.

Let’s verify that we can actually reach another node over the overlay. 10.244.1.0 is the address Flannel assigns to the flannel.1 interface on k8s-w1:

ping -c 1 10.244.1.0

PING 10.244.1.0 (10.244.1.0) 56(84) bytes of data.
64 bytes from 10.244.1.0: icmp_seq=1 ttl=64 time=0.765 ms

The ping works because the VXLAN overlay is functioning correctly. This is the foundation that allows pods to communicate across nodes.

Let’s see what pods are running across the cluster:

kubectl get pod -A -o wide

NAMESPACE      NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE      
kube-flannel   kube-flannel-ds-8vmb6             1/1     Running   0          10m   192.168.10.101   k8s-w1
kube-flannel   kube-flannel-ds-wtdsc             1/1     Running   0          4m    192.168.10.102   k8s-w2
kube-flannel   kube-flannel-ds-xxxxx             1/1     Running   0          30h   192.168.10.100   k8s-ctr
kube-system    coredns-668d6bf9bc-n8jxf          1/1     Running   0          30h   10.244.0.2       k8s-ctr
kube-system    coredns-668d6bf9bc-z6h69          1/1     Running   0          30h   10.244.0.3       k8s-ctr
kube-system    etcd-k8s-ctr                      1/1     Running   0          30h   192.168.10.100   k8s-ctr
kube-system    kube-apiserver-k8s-ctr            1/1     Running   0          30h   192.168.10.100   k8s-ctr
kube-system    kube-controller-manager-k8s-ctr   1/1     Running   0          30h   192.168.10.100   k8s-ctr
kube-system    kube-proxy-5p6jx                  1/1     Running   0          30h   192.168.10.100   k8s-ctr
kube-system    kube-proxy-dkczx                  1/1     Running   0          10m   192.168.10.101   k8s-w1
kube-system    kube-proxy-frb9n                  1/1     Running   0          4m    192.168.10.102   k8s-w2
kube-system    kube-scheduler-k8s-ctr            1/1     Running   0          30h   192.168.10.100   k8s-ctr

A few things to notice here. The control plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) are only running on k8s-ctr and their IP addresses are the node’s IP, not pod network IPs. This is because they run with hostNetwork: true as Static Pods.

The kube-proxy and kube-flannel pods are running on every node. That’s the DaemonSet doing its job. Every node needs kube-proxy for service networking and kube-flannel for pod networking.

CoreDNS pods have IP addresses from the 10.244.0.0/24 range, which means they are running on the control plane node. These are the only “normal” pods using the pod network on the control plane. They got scheduled there because we only had one node when kubeadm init completed and installed the CoreDNS deployment.

You might wonder why regular workloads don’t get scheduled on the control plane. The answer is taints:

kubectl describe node k8s-ctr | grep Taints

Taints:             node-role.kubernetes.io/control-plane:NoSchedule

The control plane has a taint that prevents normal pods from being scheduled there. The system pods like CoreDNS tolerate this taint, which is why they can run on the control plane. Worker nodes have no taints:

kubectl describe node k8s-w1 | grep Taints

Taints:             <none>

This is why when you deploy a regular application, it will land on worker nodes and not on the control plane.
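If the taint and toleration rules feel abstract, this simplified Python sketch models the matching logic. It only covers the key, operator, value, and effect fields, and the CoreDNS tolerations listed are what I believe kubeadm ships, so treat them as illustrative. This is not the scheduler's actual code.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def schedulable(pod_tolerations: List[Dict[str, str]], node_taints: List[Dict[str, str]]) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in blocking)


control_plane_taints = [{"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}]
coredns_tolerations = [
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
    {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"},
]

print(schedulable(coredns_tolerations, control_plane_taints))  # True: CoreDNS tolerates the taint
print(schedulable([], control_plane_taints))                   # False: a plain nginx pod does not
print(schedulable([], []))                                     # True: workers have no taints
```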

kubectl get svc -A

NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  30h
kube-system   kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   30h

Two services exist by default. The kubernetes service in the default namespace is how pods talk to the API server without knowing its actual IP. The kube-dns service points to the CoreDNS pods and provides cluster DNS resolution.

Notice that the service IPs come from the 10.96.0.0/12 range, which is the service CIDR we configured. The first usable IP (10.96.0.1) always goes to the kubernetes API service, and kubeadm assigns the tenth address (10.96.0.10) to the DNS service. If you exec into a pod that uses the default DNS policy and check /etc/resolv.conf, you will see 10.96.0.10 as the nameserver.
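Both addresses fall out of simple offsets into the service CIDR. The sketch below assumes the kubernetes service takes offset 1 and the DNS service takes offset 10, which matches the output above. The function name well_known_service_ips is hypothetical.

```python
import ipaddress
from typing import Dict


def well_known_service_ips(service_cidr: str) -> Dict[str, str]:
    """Derive the API and DNS service IPs from the service CIDR by offset."""
    network = ipaddress.ip_network(service_cidr)
    return {
        "kubernetes": str(network[1]),  # first usable address in the range
        "kube-dns": str(network[10]),   # tenth address in the range
    }


ips = well_known_service_ips("10.96.0.0/12")
print(ips["kubernetes"])  # 10.96.0.1
print(ips["kube-dns"])    # 10.96.0.10
```

If you pick a different service CIDR at init time, both addresses move with it, so avoid hardcoding 10.96.0.10 in your own manifests.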

After everything is set up, here is what the key directories look like on each node type.

On the control plane:

tree /etc/kubernetes

/etc/kubernetes
├── admin.conf
├── controller-manager.conf
├── kubelet.conf
├── scheduler.conf
├── super-admin.conf
├── manifests
│   ├── etcd.yaml
│   ├── kube-apiserver.yaml
│   ├── kube-controller-manager.yaml
│   └── kube-scheduler.yaml
└── pki
    ├── ca.crt
    ├── ca.key
    ├── apiserver.crt
    ├── apiserver.key
    ├── ... (many more certificates)
    └── etcd
        ├── ca.crt
        ├── ca.key
        └── ... (etcd certificates)

On a worker node:

tree /etc/kubernetes

/etc/kubernetes
├── kubelet.conf
├── manifests
└── pki
    └── ca.crt

The difference is significant. Worker nodes only need the kubelet.conf to talk to the API server and the CA certificate to verify the API server’s identity. They don’t have the private keys, admin credentials, or Static Pod manifests. This is by design. A compromised worker node should not have the keys to the kingdom.

Finally, let’s do a quick sanity check that everything is actually working:

kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide

NAME       READY   STATUS    RESTARTS   AGE   IP           NODE     
test-pod   1/1     Running   0          30s   10.244.1.5   k8s-w1

The pod got scheduled on a worker node (not the control plane, because of the taint) and received an IP from that node’s pod CIDR. If you can see this, your cluster is working properly.

Clean up:

kubectl delete pod test-pod

At this point, we have verified that our cluster is functioning correctly. The nodes are healthy, pod networking is working, services are reachable, and workloads can be scheduled. The cluster is ready for actual use.

8. Troubleshooting Tips

When things go wrong during cluster setup, the error messages are not always helpful. Here are the most common issues you will encounter and how to diagnose them.

kubeadm join hangs at preflight checks

This is probably the most frustrating issue because there is no error message. You run the join command and it just sits there doing nothing.

kubeadm join --config="kubeadm-join.yaml"
[preflight] Running pre-flight checks

And then nothing happens. The cursor just blinks.

The problem is almost always network connectivity. The worker node cannot reach the API server on port 6443. Let me walk you through how to diagnose this.

First, test the API server connection directly from the worker node.

curl -k https://192.168.10.100:6443/healthz

If everything is working, you should see “ok” as the response. If you see “Could not connect to server” or the command hangs, you have a network problem.

Next, check if basic network connectivity exists. Can you even ping the control plane?

ping -c 3 192.168.10.100

If ping works but the curl command fails, the issue is likely a firewall blocking port 6443. This is exactly what happened in my lab environment. Ping was successful but the API server was unreachable.

# This worked fine
ping -c 3 192.168.10.100
# 64 bytes from 192.168.10.100: icmp_seq=1 ttl=64 time=0.882 ms

# But this failed
curl -k https://192.168.10.100:6443/healthz
# curl: (7) Failed to connect to 192.168.10.100 port 6443
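The difference between the two checks is that ping uses ICMP while the API server needs a TCP connection on 6443. If you want to script the TCP check (for example, to loop over several ports), a small Python sketch like this works, assuming Python 3 is available on the node. The helper name tcp_reachable is made up.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage from a worker node (addresses and ports from this lab):
#   tcp_reachable("192.168.10.100", 6443)   # API server
#   tcp_reachable("192.168.10.100", 10250)  # kubelet
```

A False result for 6443 while ping succeeds points at a firewall or at the API server not listening on that interface.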

On the control plane node, check if the firewall is active.

# For Rocky/CentOS
systemctl is-active firewalld

# For Ubuntu
ufw status

If firewalld shows “active”, that is your culprit. You can either disable it completely for lab environments or open the specific ports needed.

# Option 1: Disable firewall entirely (lab only)
systemctl disable --now firewalld

# Option 2: Open only required ports (better for production)
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --reload

One more thing to check is whether the API server is binding to the correct interface. On the control plane, run this command.

ss -tlnp | grep 6443

You want to see it listening on all interfaces, shown as *:6443 or 0.0.0.0:6443. If it shows 127.0.0.1:6443, the API server is only accepting local connections and external nodes cannot reach it.

In Vagrant environments with multiple network interfaces, there is another gotcha. If you did not specify node-ip in your kubeadm configuration, kubelet might register with the wrong IP address (typically 10.0.2.15 from the NAT interface instead of your host-only network IP). This causes all sorts of communication problems between nodes.

Node stays in NotReady status

After kubeadm init completes successfully, you check the node status and see this.

kubectl get nodes
# NAME      STATUS     ROLES           AGE   VERSION
# k8s-ctr   NotReady   control-plane   5m    v1.32.11

NotReady. This is not an error. This is expected behavior when you have not installed a CNI plugin yet.

Kubernetes needs a network plugin to assign IP addresses to pods and enable communication between them. Without CNI, the node reports that its network is not ready. You can confirm this by checking the node conditions.

kubectl describe node k8s-ctr | grep -A5 Conditions

Look for the NetworkUnavailable condition. It will show True, meaning the network is not configured.

The fix is simple. Install a CNI plugin like Flannel or Calico. Once the CNI daemonset is running on the node, the NetworkUnavailable condition changes to False and the node becomes Ready.

CoreDNS pods stuck in Pending

Right after running kubeadm init, you check the pods in kube-system namespace.

kubectl get pod -n kube-system
# NAME                              READY   STATUS    RESTARTS   AGE
# coredns-668d6bf9bc-n8jxf          0/1     Pending   0          3m
# coredns-668d6bf9bc-z6h69          0/1     Pending   0          3m
# etcd-k8s-ctr                      1/1     Running   0          3m
# kube-apiserver-k8s-ctr            1/1     Running   0          3m
# kube-controller-manager-k8s-ctr   1/1     Running   0          3m
# kube-proxy-5p6jx                  1/1     Running   0          3m
# kube-scheduler-k8s-ctr            1/1     Running   0          3m

CoreDNS is Pending while everything else is Running. Why?

The control plane components (etcd, apiserver, controller-manager, scheduler) and kube-proxy all run with hostNetwork: true. They use the node's network directly, so they do not need pod IP addresses.

CoreDNS is different. It is a regular deployment that needs a pod network IP. Without a CNI plugin the node stays NotReady and carries the node.kubernetes.io/not-ready:NoSchedule taint, which CoreDNS does not tolerate, so the scheduler leaves the pods Pending.

This is the same root cause as the NotReady node issue. Install your CNI plugin and CoreDNS will start running within seconds.

kubelet keeps restarting before init or join

You installed kubeadm and kubelet, enabled the kubelet service, and now the logs are full of errors.

systemctl status kubelet
# Active: activating (auto-restart)

journalctl -u kubelet --no-pager
# failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, 
# error: open /var/lib/kubelet/config.yaml: no such file or directory

The kubelet is crashlooping. It starts, fails, waits 10 seconds, and tries again. Over and over.

This is completely normal. Do not try to fix it.

The kubelet needs configuration files that do not exist yet. Specifically, it needs /var/lib/kubelet/config.yaml and /etc/kubernetes/kubelet.conf. These files are created by kubeadm init on the control plane or kubeadm join on worker nodes.

The systemd service is configured with Restart=always, so kubelet will keep trying until the configuration exists. Once you run kubeadm init or kubeadm join, the files get created and kubelet starts properly.

If you look at the kubelet service drop-in file, you can see why this happens.

cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf

The service references environment files and configuration paths that kubeadm generates. Before kubeadm runs, those paths point to nothing.

9. Conclusion

We now have a working three-node Kubernetes cluster. One control plane node running etcd, the API server, controller manager, and scheduler. Two worker nodes ready to run application workloads. Flannel providing pod networking across all nodes.

The control plane components run as static pods, managed directly by kubelet through manifest files in /etc/kubernetes/manifests/. This is different from how you might run them as systemd services, but it has a nice property. The kubelet monitors these manifests and automatically restarts the containers if they crash.

Each node got its own pod CIDR allocation from the cluster’s 10.244.0.0/16 range. The control plane has 10.244.0.0/24, the first worker has 10.244.1.0/24, and the second worker has 10.244.2.0/24. Flannel set up VXLAN tunnels between nodes so pods can communicate across the cluster.

All the TLS certificates were generated automatically: the cluster CA, API server certificates, etcd certificates, and front-proxy certificates for API aggregation, along with the service account signing key pair. kubeadm even set up certificate rotation for kubelet so the credentials get renewed before they expire.

If you have done Kubernetes The Hard Way, you know how much work goes into setting up a cluster manually. Let me highlight what kubeadm automated for us.

For certificates, we would have written OpenSSL configuration files, generated a CA, then generated and signed certificates for every component. That is about 10 different certificate pairs with specific SANs and key usages. kubeadm did this in the certs phase with a single command.

For kubeconfig files, we would have used kubectl to create five separate files with embedded certificates and cluster endpoints. kubeadm generated all of them in the kubeconfig phase.

For etcd, we would have downloaded the binary, written a systemd service file, configured the data directory, and set up peer communication. kubeadm created a static pod manifest instead, which is actually easier to manage.

The same story for the API server, controller manager, and scheduler. Instead of systemd services, we get static pods with all the correct flags and volume mounts.

The worker node join process is where kubeadm really shines. In a manual setup, you would need to copy certificates to each worker, write kubeconfig files, configure kubelet, and start services. With kubeadm, the bootstrap token mechanism handles all of this. A new node can join with just a token and a CA hash. It fetches the cluster CA from a public ConfigMap, submits a certificate signing request, gets a signed certificate back, and configures itself. All automated.
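To make the "token and a CA hash" part concrete, here is a sketch of the pieces that go into a join command. The function names are made up for this example. The openssl pipeline follows the one in the kubeadm documentation and assumes an RSA CA key. In practice, kubeadm token create --print-join-command produces the whole line for you.

```shell
# Bootstrap tokens look like "abcdef.0123456789abcdef":
# a 6-character ID, a dot, and a 16-character secret, all in [a-z0-9].
is_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# Compute the CA public key hash that a joining node pins.
# Requires openssl; assumes the CA uses an RSA key.
ca_cert_hash() {
  local ca="${1:-/etc/kubernetes/pki/ca.crt}"
  openssl x509 -pubkey -in "$ca" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //'
}

# Assemble the join command from its three inputs (prints it, does not run it).
join_command() {
  local endpoint="$1" token="$2" hash="$3"
  is_bootstrap_token "$token" || { echo "malformed token" >&2; return 1; }
  echo "kubeadm join ${endpoint} --token ${token} --discovery-token-ca-cert-hash sha256:${hash}"
}
```

The hash is what lets the new node trust the CA it downloads: if the public key it fetches does not match the pinned hash, the join is refused.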

Here is a quick comparison.

| Task | Manual setup | kubeadm |
|------|--------------|---------|
| Certificates | Write ca.conf, generate CA, generate 10+ certificate pairs with correct SANs | kubeadm init phase certs all |
| etcd | Download binary, write systemd unit, configure flags, manage data directory | Static pod manifest in /etc/kubernetes/manifests/etcd.yaml |
| Worker join | Copy certificates, write kubeconfig, configure kubelet, start services | kubeadm join with a token |

One notable difference is how etcd communication is secured. In Kubernetes The Hard Way, etcd often runs with HTTP for simplicity during learning. kubeadm always configures etcd with mutual TLS. Every connection to etcd requires a client certificate, including health checks.

# Even health checks need certificates with kubeadm's etcd
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

The cluster is functional but not production ready. Here are the natural next steps.

Monitoring is essential. You cannot operate what you cannot observe. kube-prometheus-stack gives you Prometheus for metrics collection, Grafana for dashboards, and Alertmanager for notifications. You should also monitor certificate expiration. kubeadm certificates are valid for one year by default, and you do not want to discover they expired when your cluster stops working.
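For the certificate part, a cron-friendly check can be as small as the sketch below. The function names are hypothetical, and cert_days_left assumes openssl and GNU date are installed. Note that the division truncates, so a certificate with less than a day left reports 0.

```shell
# Whole days between "now" and an expiry time, both as Unix epoch seconds.
days_until() {
  local expiry="$1" now="$2"
  echo $(( (expiry - now) / 86400 ))
}

# Days left on a certificate file, read from its notAfter field.
cert_days_left() {
  local cert="$1" not_after
  not_after="$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)"
  days_until "$(date -d "$not_after" +%s)" "$(date +%s)"
}

# Example (on a control plane node):
#   cert_days_left /etc/kubernetes/pki/apiserver.crt
```

On a control plane node, kubeadm certs check-expiration prints the same information for every kubeadm-managed certificate at once, so the sketch is mainly useful when you want a single number to alert on.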

Cluster upgrades are something kubeadm handles well. The kubeadm upgrade command can move your cluster from one Kubernetes version to the next. You upgrade the control plane first, then the worker nodes one by one. The process is documented but requires careful planning, especially around workload disruption.

High availability is the big gap in our current setup. We have a single control plane node, which means a single point of failure. For production, you want at least three control plane nodes with a load balancer in front. kubeadm supports this through the --control-plane-endpoint flag and the ability to join additional control plane nodes. If you plan to go HA later, you should set the control plane endpoint from the beginning, because changing it afterward is painful.
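As a sketch of what "set the control plane endpoint from the beginning" looks like, the helper below prints (it does not run) an init command with the endpoint fixed. The function name and the hostname k8s-api.example.internal are placeholders; substitute a DNS name or VIP that points at your load balancer.

```shell
# Print a kubeadm init command that pins the control plane endpoint.
# --upload-certs stores the control plane certificates in the cluster
# so that additional control plane nodes can fetch them when joining.
ha_init_command() {
  local endpoint="$1" pod_cidr="${2:-10.244.0.0/16}"
  echo "kubeadm init --control-plane-endpoint ${endpoint} --upload-certs --pod-network-cidr ${pod_cidr}"
}

ha_init_command "k8s-api.example.internal:6443"
```

Even with a single control plane node, pointing the endpoint at a name you control means you can put a load balancer behind that name later without regenerating certificates.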

Storage is another consideration. We did not configure any persistent storage. For stateful applications, you will need a storage solution like local-path-provisioner for simple cases, or something like Longhorn or Rook-Ceph for distributed storage.

Appendix: Claude Code Skills for kubeadm

GitHub - sigridjineth/claude-kubeadm-skills: A collection of Claude Code skills for setting up and troubleshooting Kubernetes clusters using kubeadm.

This appendix explains a collection of skills designed for Claude Code that provide systematic guidance for Kubernetes cluster setup and troubleshooting. If you are not familiar with Claude Code skills, think of them as reference documents that Claude can consult when helping you with specific tasks.

Skills are markdown files that live in your ~/.claude/skills/ directory. When you ask Claude Code to help with a task, it can discover and read relevant skills to provide more accurate and contextual assistance. Instead of relying solely on general knowledge, Claude gets access to curated, tested procedures for specific scenarios.

For kubeadm operations, this matters because cluster setup involves many steps where the order matters and small mistakes cause cascading failures. A skill document captures the exact sequence of commands, the verification steps, and the common failure modes all in one place.

The collection contains four skills that map to the cluster lifecycle.

The first skill is kubeadm-prerequisites. This covers everything you need to do before running kubeadm. Disabling swap, loading kernel modules, configuring containerd with systemd cgroup driver, installing kubeadm and kubelet. When you tell Claude "prepare this node for Kubernetes" or "I am getting CRI connection errors", this skill provides the reference.

The second skill is kubeadm-init. This covers control plane initialization. The 14 phases of kubeadm init, the critical configuration options like apiserver-advertise-address and pod-network-cidr, certificate generation, and post-init verification. When you need to understand why a particular certificate was created or what the control-plane-endpoint option does for HA, this skill has the answers.

The third skill is kubeadm-join. This covers adding nodes to an existing cluster. The bidirectional trust model, how TLS bootstrap works, token management, and the differences between joining a worker node versus an additional control plane node. When your join command hangs or you need to regenerate an expired token, this skill explains the process.

The fourth skill is kubeadm-troubleshooting. This is a systematic debugging guide organized by symptom. Node stuck in NotReady, CoreDNS pending, kubelet crashlooping, certificate errors. Each problem has a diagnostic sequence and solution.

Each skill follows a consistent format that makes it easy for both humans and Claude to find information quickly.

The frontmatter at the top contains the skill name and description. This is what triggers skill discovery. When you mention “kubeadm init” or “node won’t join the cluster”, Claude matches these phrases against skill descriptions.

---
name: kubeadm-init
description: Control plane initialization with kubeadm init, covering the 14 phases, certificate setup, and configuration options
---
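As a rough illustration of how this frontmatter can be read mechanically, here is a minimal shell sketch. The function name skill_field is made up for this example and says nothing about how Claude Code itself parses skills. It only handles simple single-line key: value pairs.

```shell
# Print the value of one frontmatter key from a skill file.
# Only the block between the first two "---" lines is searched.
skill_field() {
  local file="$1" key="$2"
  awk -v key="$key" '
    /^---[ \t]*$/ { fences++; next }
    fences == 1 && index($0, key ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")
      print
      exit
    }
    fences >= 2 { exit }
  ' "$file"
}

# Example: skill_field ~/.claude/skills/kubeadm-init.md description
```

The example path is hypothetical; adjust it to wherever your skill files actually live.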

The overview section states the core principle in one or two sentences. No fluff, just the essential concept.

The “when to use” section lists specific scenarios and symptoms. This is critical for skill discovery. Instead of vague descriptions like “helps with kubeadm”, it lists concrete situations like “kubeadm init fails with port 6443 already in use” or “need to add extra SANs to apiserver certificate”.

Quick reference tables provide scannable information for common operations. You do not need to read paragraphs of text to find the command for listing tokens.

| Task | Command |
|------|---------|
| List tokens | kubeadm token list |
| Create new token | kubeadm token create --print-join-command |
| Delete token | kubeadm token delete <token-id> |

The common issues section maps symptoms to causes to fixes. This is the most valuable part for troubleshooting.

Symptom: kubeadm join hangs at preflight
Cause: Firewall blocking port 6443
Diagnosis: curl -k https://<control-plane-ip>:6443/healthz
Fix: systemctl disable --now firewalld

Finally, a verification checklist tells you how to confirm the operation succeeded. After running kubeadm init, what should you check? Node status, pod status in kube-system, certificate files in /etc/kubernetes/pki.

When setting up a new cluster, you would work through the skills in order.

Start with kubeadm-prerequisites on all nodes. This ensures every machine has the correct kernel parameters, container runtime configuration, and kubeadm installation. The skill includes verification commands so you can confirm each step before moving on.

# Verify containerd is configured correctly
grep SystemdCgroup /etc/containerd/config.toml
# Should show: SystemdCgroup = true

# Verify kernel modules are loaded
lsmod | grep br_netfilter
# Should show br_netfilter in output
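
If you want these two checks as pass/fail tests rather than output to eyeball, a sketch like the following works. The function names are hypothetical. The grep is anchored so that a commented-out SystemdCgroup line does not count, and /proc/modules is the file that lsmod reads.

```shell
# Succeed if the containerd config enables the systemd cgroup driver.
uses_systemd_cgroup() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$config"
}

# Succeed if a kernel module is currently loaded.
module_loaded() {
  local module="$1" modules_file="${2:-/proc/modules}"
  grep -q "^${module} " "$modules_file"
}

# Example:
#   uses_systemd_cgroup && module_loaded br_netfilter && echo "node looks ready"
```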

Next, use kubeadm-init on your control plane node. The skill walks through configuration file creation, explains which options matter for your environment, and shows what to expect at each phase. After init completes, it tells you exactly what to verify.

# Post-init verification
kubectl get nodes
# Should show control plane node (NotReady is expected before CNI)

kubectl get pods -n kube-system
# Should show etcd, apiserver, controller-manager, scheduler running
# CoreDNS will be Pending until CNI is installed

Install your CNI plugin. The kubeadm-init skill includes the common CNI installation commands for Flannel and Calico.

Finally, use kubeadm-join on your worker nodes. The skill explains the join command options, what the bootstrap token does, and how to verify the node joined successfully.
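One way to script the "did the node join successfully" check is to filter the node list for anything that is not Ready. This is a minimal sketch with a made-up function name. It treats any status other than exactly Ready as a problem, so a cordoned node (Ready,SchedulingDisabled) is also listed.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the
# names of nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready. Right after a join, expect the new node to show up here until the CNI pods on it are running.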

The troubleshooting skill is organized by symptom because that is how you encounter problems. You do not know the cause yet. You just know that the node is NotReady or that kubeadm join is hanging.

Each symptom section follows a diagnostic sequence. Check this first, then this, then this. The sequence is ordered by likelihood based on real-world experience.

For example, the “kubeadm join hangs” section starts with API server connectivity because that is the most common cause. It provides the exact curl command to test, explains what each response means, and gives the fix for each scenario.

# Test 1: Can you reach the API server?
curl -k https://192.168.10.100:6443/healthz
# "ok" = API server reachable, problem is elsewhere
# Connection refused = API server not running or wrong address
# Timeout = Firewall or routing issue

# Test 2: Is it a firewall?
# On control plane:
ss -tlnp | grep 6443
# Should show LISTEN on *:6443

systemctl is-active firewalld
# If "active", firewall may be blocking
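
The three outcomes of Test 1 can also be told apart by curl's exit status rather than by reading the error text. The sketch below is an illustration with a hypothetical function name. It relies on curl exiting with 7 when it cannot connect and 28 when an operation times out, and it assumes you pass --max-time so that a silent firewall drop produces a timeout promptly.

```shell
# Map a curl exit status (and response body) to the cases listed above.
classify_healthz() {
  local status="$1" body="$2"
  case "$status" in
    0)  [ "$body" = "ok" ] && echo "reachable" || echo "unexpected-response" ;;
    7)  echo "refused-or-wrong-address" ;;
    28) echo "timeout-check-firewall" ;;
    *)  echo "other-curl-error-${status}" ;;
  esac
}

# Example:
#   body="$(curl -ks --max-time 5 https://192.168.10.100:6443/healthz)"
#   classify_healthz "$?" "$body"
```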

The skill also includes less common issues that are harder to diagnose. Certificate SAN mismatches, token expiration, cgroup driver conflicts between kubelet and containerd. These problems produce confusing error messages, and having the symptom-to-cause mapping saves significant debugging time.

Skills are only useful if they reflect reality. The tested versions section documents which Kubernetes, containerd, and OS versions the procedures were validated against.

| Component | Version |
|-----------|---------|
| Kubernetes | 1.32.x |
| containerd | 2.1.x |
| OS | Rocky Linux 10 / RHEL 9 |

When Kubernetes or containerd releases a new version that changes behavior, the skills need updates. The most common changes are configuration file format changes (like containerd’s config.toml version 2 versus version 3) and new kubeadm phases or options.
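Because the config.toml format version decides which section names apply, it is worth checking before you follow any containerd procedure. A minimal sketch, with a made-up function name, that reads the top-level version line:

```shell
# Print the top-level "version = N" value from a containerd config file.
# Prints nothing if the file does not declare a version.
containerd_config_version() {
  local config="${1:-/etc/containerd/config.toml}"
  sed -n 's/^version[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$config" | head -n 1
}
```

If the number it prints does not match the version a procedure was written for, treat the section names in that procedure as suspect and compare them against the output of containerd config default.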

If you encounter a scenario that the skill does not cover, that is feedback for improvement. Document the exact error message, the diagnostic steps you followed, and the solution you found. This becomes a new entry in the common issues section.

These skills follow principles from the superpowers skill methodology.

First, skills are optimized for discovery. The description and “when to use” sections contain the exact phrases someone would use when they have a problem. “Node shows NotReady” rather than “node status troubleshooting”. Claude searches skill content, so including the actual error messages and symptoms improves matching.

Second, skills are reference documents, not tutorials. They assume you have context and need specific information quickly. Tables for scanning, code blocks for copying, symptom-cause-fix mappings for troubleshooting. No lengthy explanations of concepts unless necessary for the fix.

Third, verification is mandatory. Every procedure ends with commands to confirm success. This catches problems early and builds confidence that the procedure worked.

Fourth, skills are tested with realistic scenarios before deployment. You do not write a skill from documentation alone. You set up a cluster, break it in various ways, and document what actually works to fix it.

Claude Code discovers skills automatically based on your request, but you can also reference them explicitly.

Use the kubeadm-prerequisites skill to check if this node is ready for Kubernetes.
Check the kubeadm-troubleshooting skill for why this node is NotReady.

This is useful when Claude’s automatic discovery does not find the right skill, or when you want to be explicit about which reference to use.

The skills reference several command-line tools that you should be familiar with.

kubeadm is the cluster bootstrapping tool. It runs on each node during setup.

kubectl is the cluster interaction tool. It runs from any machine with network access to the API server and a valid kubeconfig.

crictl is the container runtime debugging tool. It talks directly to containerd (or other CRI runtimes) without going through Kubernetes. Useful for checking if containers are actually running when kubectl cannot reach the API server.

systemctl and journalctl are the systemd service management and logging tools. kubelet and containerd run as systemd services, so these are essential for checking service status and reading logs.

# Check kubelet service status
systemctl status kubelet

# Read kubelet logs
journalctl -u kubelet -f

# Check containerd service
systemctl status containerd

# Read containerd logs
journalctl -u containerd --no-pager

These skills are derived from the tutorial content in this post. The step-by-step cluster setup, the certificate explanations, the troubleshooting scenarios. All of it was tested on Rocky Linux 10 with Kubernetes 1.32 and containerd 2.1.

The advantage of converting tutorial content into skills is reusability. A tutorial is read once during learning. A skill is consulted repeatedly during actual work. The format is different because the use case is different.

ralph is a form of porn

Sigrid Jin — Thu, 22 Jan 2026 00:20:22 GMT

the ralph technique is a form of engineering porn because it presents a seductive, almost hyper-real vision of the future where software builds itself while you sleep

the ralph technique, popularized by geoffrey huntley, represents a fundamental shift from manual software development toward autonomous ai orchestration. named after the simpsons character who is often misunderstood, the concept suggests that even a seemingly clumsy or non-linear ai can achieve professional results if it is placed inside a continuous, automated loop.

technically, ralph is often just a simple bash loop that feeds a set of instructions into a coding agent and allows it to run indefinitely. it operates on the philosophy of eventual consistency, where the ai writes code, runs tests, fails, and tries again until the software settles into a functional state. huntley describes this as treating software like clay on a pottery wheel; the engineer’s job is not to shape the clay by hand, but to tune the speed and direction of the wheel by refining the prompts that guide the ai.
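to make the shape of that loop concrete, here is a bounded sketch. it is not huntley's script: the function name ralph_loop is made up, both commands are placeholders you would replace with your own agent cli and test runner, and the iteration cap is an addition so the example terminates.

```shell
# Run an agent command repeatedly until a check command passes or the
# iteration budget runs out. Both commands are passed as strings.
ralph_loop() {
  local max_iters="$1" agent_cmd="$2" check_cmd="$3" i=1
  while [ "$i" -le "$max_iters" ]; do
    sh -c "$agent_cmd" || true      # a failed agent run is just another attempt
    if sh -c "$check_cmd"; then
      echo "converged after ${i} iteration(s)"
      return 0
    fi
    i=$((i + 1))
  done
  echo "gave up after ${max_iters} iteration(s)"
  return 1
}

# Example (placeholder commands):
#   ralph_loop 20 "my-agent < PROMPT.md" "make test"
```

the check command is the only definition of "done" the loop has, so it converges on whatever that command rewards.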

while the speed and autonomy of this method are impressive, it faces significant criticism regarding its lack of human oversight. many argue that unless a goal is strictly deterministic, meaning there is a clear, mathematical way to verify success like a passing test suite, an unsupervised loop can quickly spiral out of control. without a human in the loop to provide sanity checks, an ai might find “hacky” ways to pass tests while producing code that is unmaintainable, insecure, or structurally flawed. critics point out that an ai lacks the long-term intuition to recognize when a specific architectural path is a dead end. it might spend hours refining a solution that a human engineer would have discarded in seconds for being fundamentally wrong.

for ralph to be truly effective in a professional environment, most believe there must be human intervention at critical decision points. while the loop can handle the repetitive labor of writing and fixing code, a human is still needed to act as the strategic judge. this human-in-the-loop approach ensures that the ai doesn’t just produce something that “works” on a surface level, but something that actually aligns with the broader goals of the project. in this new era, the role of the software engineer is evolving from a builder of code into an orchestrator of systems who knows exactly when to let the loop run and when to step in to provide the moral and technical compass necessary for high-quality engineering.

GitHub - anthropics/original_performance_takehome: Anthropic's original performance take-home, now open for you to try!

Ansible Basics Lab: A Hands-On Guide for Infrastructure Automation

Sigrid Jin — Sat, 17 Jan 2026 12:26:18 GMT

Before Reading This Lab: Use Ansible Claude Skills

GitHub - sigridjineth/hello-ansible-skills: Ansible Playbook Tutorial with Basic Claude Skills

this repository contains custom skills for claude code that help you work with ansible more effectively. when you place skill files in the .claude/skills/ directory of your project, claude code reads them and uses that knowledge to provide better assistance when you're working on ansible-related tasks.

the idea is simple: instead of explaining ansible best practices every time you ask for help, you encode that knowledge into skill files once, and claude code applies it automatically. this means more consistent outputs, fewer mistakes, and playbooks that follow established patterns from the start.

what skills are included

the repository includes four main skill files, each covering a different aspect of ansible development.

the first skill is ansible-playbook.md, which covers playbook authoring best practices. it explains how to structure your project directories, including where to put inventory files, group variables, host variables, and roles. it describes naming conventions for tasks and variables, how to use tags effectively, and when to use handlers versus regular tasks. it also covers security topics like using ansible-vault for sensitive data, and testing strategies including check mode, diff mode, and molecule for role testing.

the second skill is ansible-modules.md, which serves as a quick reference for commonly used ansible modules. instead of looking up documentation every time, claude code can reference this skill to use the right module for each task. it covers file operations like creating directories, copying files, and modifying configuration files. it covers package management across different distributions, service management with systemd, user and group management, downloading files from the web, making api calls, and working with archives.

the third skill is shell-to-ansible.md, which helps convert legacy shell scripts into proper ansible playbooks. this is probably the most practically useful skill for teams migrating from manual scripts to infrastructure as code. it provides a mapping table showing which shell command corresponds to which ansible module. for example, mkdir -p becomes the file module with state: directory, apt-get install becomes the apt module, and systemctl restart becomes the service module. the skill includes a complete worked example showing a 70-line deployment script being converted into a structured playbook with proper error handling, handlers, and templates.

the fourth skill is ansible-interactive.md, which defines a conversational workflow for building ansible projects step by step. this is useful when you're starting from scratch and want claude code to guide you through the process. it breaks down the development into phases: first analyzing your environment and gathering information about your servers, then creating the initial project structure and inventory, testing connectivity, writing a simple playbook, gradually adding features based on your requirements, refactoring into roles when the code gets complex enough, and finally documenting everything properly.
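
to give a feel for the mapping table in the shell-to-ansible skill, here is a tiny sketch covering only the three pairs mentioned above. the function name suggest_module is hypothetical and this is not code from the repository; the real skill is a markdown reference, not a script.

```shell
# Map the leading words of a shell command to the ansible module that
# usually replaces it. Anything else is reported as unmapped.
suggest_module() {
  case "$1" in
    "mkdir -p"*)          echo "ansible.builtin.file (state: directory)" ;;
    "apt-get install"*)   echo "ansible.builtin.apt" ;;
    "systemctl restart"*) echo "ansible.builtin.service (state: restarted)" ;;
    *)                    echo "no mapping" ;;
  esac
}

suggest_module "mkdir -p /opt/app"
```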

how to use it

first clone the repository to your local machine. then open the directory with claude code. once you’re in the project, you can start asking for ansible-related help in natural language.

for example, you might say “create a playbook that installs nginx and configures it as a reverse proxy”. claude code will read the skills and generate a playbook that follows the directory structure conventions, uses appropriate modules instead of shell commands, includes proper error handling, and sets up handlers for service restarts.

another example: you could paste a shell script and ask “convert this to ansible”. claude code will use the shell-to-ansible skill to map each command to the appropriate module, extract hardcoded values into variables, add idempotency where the original script lacked it, and structure the result as a proper playbook or role.

you can also take the interactive approach by saying something like “help me set up ansible for my servers step by step”. claude code will walk you through the phases defined in the interactive skill, asking about your server inventory, testing connections, and building up the automation gradually.

why this matters

the combination of claude code and ansible is powerful because it addresses the main pain points of infrastructure automation. writing ansible playbooks requires knowing which modules exist, what parameters they accept, and how to structure everything properly. this knowledge takes time to acquire and is easy to forget. by encoding it into skills, you get consistent, high-quality outputs without having to remember everything yourself.

the shell-to-ansible conversion skill is particularly valuable for teams with legacy automation. most organizations have accumulated shell scripts over the years that work but are fragile, not idempotent, and hard to maintain. converting them to ansible manually is tedious and error-prone. having claude code do the heavy lifting while following a consistent conversion methodology speeds up the migration significantly.

the interactive workflow skill recognizes that infrastructure automation is often an iterative process. you don’t always know exactly what you need upfront. being able to develop incrementally, testing each step before moving on, reduces the risk of building something that doesn’t work in your specific environment.

Introduction

This post comes from Week 2 of the K8S Deploy study hosted by the CloudNet@ community. The focus this week is Ansible Basics, and I’ve put together everything we covered into this comprehensive guide.

Last week, we went through Kubernetes The Hard Way, manually building a Kubernetes cluster from scratch. It was painful, tedious, and honestly? Eye-opening. You really start to appreciate automation tools when you’ve had to do everything by hand.

This week, we flip the script. Instead of doing things manually, we’re diving into Ansible, an open-source automation tool that lets you manage multiple servers over SSH without installing agents on them. You write playbooks (basically scripts in YAML format), and Ansible handles the rest. It’s declarative, it’s idempotent, and once you get the hang of it, you’ll wonder how you ever lived without it.

By the end of this guide, you’ll understand:

What Ansible is and how it works
How to write playbooks with variables, loops, and conditionals
How to handle errors gracefully
How to structure reusable code with roles
How to leverage Ansible Galaxy for community-contributed roles

Let’s get started.

Lab Environment Setup

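Before we touch any Ansible commands, we need an environment to play with. For this tutorial, we’re spinning up four virtual machines using Vagrant:

| Node | Role | OS | IP |
|------|------|----|----|
| server | Control node | Ubuntu 24.04 | 10.10.1.10 |
| tnode1 | Managed node | Ubuntu | 10.10.1.11 |
| tnode2 | Managed node | Ubuntu | 10.10.1.12 |
| tnode3 | Managed node | Rocky Linux | 10.10.1.13 |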

The server node is our Ansible control node. This is where we’ll install Ansible and run all our commands from. The three tnode machines are managed nodes: the servers Ansible will configure.

Notice that tnode3 runs Rocky Linux instead of Ubuntu. This is intentional. In the real world, you often deal with heterogeneous environments, and Ansible needs to handle that gracefully. We’ll see how to write playbooks that work across different operating systems.

Spinning Up the Environment

Create a directory and download the necessary files:

mkdir ansible
cd ansible
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/Vagrantfile
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/init_cfg.sh
curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/init_cfg2.sh
vagrant up

Here’s what each file does:

Vagrantfile: The Vagrant configuration that defines all four VMs — their resources, network settings, and which initialization script to run.

init_cfg.sh: Initialization script for Ubuntu nodes (server, tnode1, tnode2). It sets the timezone, disables AppArmor and the ufw firewall, installs essential packages, configures passwords, enables SSH password authentication and root login, and populates /etc/hosts with all node entries.

init_cfg2.sh: Same thing, but for Rocky Linux (tnode3). It uses yum instead of apt, and disables firewalld and SELinux instead of ufw and AppArmor.

After vagrant up completes, you should see all four VMs running in VirtualBox.

Verifying the Environment

SSH into the server node and run some basic checks:

# Check current user
whoami
# root

# Check user ID and groups
id
# uid=0(root) gid=0(root) groups=0(root)

# Check kernel version
uname -r
# 6.8.0-86-generic

# Check system info
hostnamectl
# Static hostname: server
# Operating System: Ubuntu 24.04.3 LTS
# Kernel: Linux 6.8.0-86-generic
# Architecture: arm64

# Check memory
free -h
# total: 1.3Gi, used: 226Mi, free: 515Mi, available: 1.1Gi

# Check disk
lsblk
df -hT /
# Filesystem: /dev/mapper/ubuntu--vg-ubuntu--lv
# Type: ext4, Size: 30G, Used: 5.3G, Avail: 24G

# Check network interfaces
ip -c addr
# eth0: 10.0.2.15/24 (NAT network)
# eth1: 10.10.1.10/24 (private network - this is what we'll use)

# Verify /etc/hosts has all nodes
cat /etc/hosts
# 127.0.0.1 localhost
# 10.10.1.10 server
# 10.10.1.11 tnode1
# 10.10.1.12 tnode2
# 10.10.1.13 tnode3

# Test connectivity to all nodes
for i in {1..3}; do ping -c 1 tnode$i; done
# All should respond with 0% packet loss

If everything checks out, you’re ready to install Ansible.

Ansible Overview

Before we start typing commands, let’s understand what we’re working with.

Ansible follows a simple architecture:

┌─────────────────────────────────────┐
│         Control Node (server)       │
│  ┌───────────┐  ┌────────────────┐  │
│  │ Inventory │  │   Playbooks    │  │
│  │ (hosts)   │  │ (YAML scripts) │  │
│  └───────────┘  └────────────────┘  │
│         │              │            │
│         └──────┬───────┘            │
│                ▼                    │
│          ┌──────────┐               │
│          │ Modules  │               │
│          └──────────┘               │
└────────────────┬────────────────────┘
                 │ SSH
    ┌────────────┼────────────┐
    ▼            ▼            ▼
┌────────┐  ┌────────┐  ┌────────┐
│ tnode1 │  │ tnode2 │  │ tnode3 │
└────────┘  └────────┘  └────────┘
    Managed Nodes (no agent needed)

The control node is where Ansible lives. It has three main components:

Inventory: A list of servers you want to manage
Playbooks: YAML files that describe what you want to do
Modules: The actual code that performs tasks (install packages, copy files, restart services, etc.)

The managed nodes are the servers you’re configuring. Here’s the beautiful part: they don’t need any special software installed. As long as they have SSH access and Python available, Ansible can manage them.

A few things make Ansible stand out:

Agentless: No daemon running on managed nodes. No ports to open (besides SSH). No agent to update. This alone eliminates a whole class of operational headaches.

Idempotent: Run the same playbook ten times, and you get the same result. Ansible checks the current state before making changes. If a package is already installed, it won’t try to install it again. If a file already has the right content, it won’t rewrite it. This makes Ansible safe to run repeatedly.

YAML-based: Playbooks are written in YAML, which is human-readable. No need to learn a new programming language. If you can read a config file, you can read a playbook.

Massive module library: There are modules for just about everything — managing packages, files, services, users, cloud resources, databases, network devices, you name it.
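
To make the idempotency point concrete, here is what "check the current state before making changes" looks like when written by hand in shell. This is a sketch with a made-up function name, not how Ansible is implemented. Ansible's own lineinfile module covers this case far more thoroughly.

```shell
# Append a line to a file only if it is not already there, and report
# whether anything changed, mirroring Ansible's "changed" / "ok" results.
ensure_line() {
  local file="$1" line="$2"
  if [ -f "$file" ] && grep -Fxq -- "$line" "$file"; then
    echo "ok"
  else
    printf '%s\n' "$line" >> "$file"
    echo "changed"
  fi
}
```

Run it twice with the same arguments and the second call reports ok without touching the file. A plain echo >> would have appended a duplicate line every time.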

There are two flavors of Ansible:

Community Edition: Free, open-source, what we’re using in this tutorial. You install it on any Linux machine and you’re good to go.

Red Hat Ansible Automation Platform: The enterprise version. Adds features like a web UI, role-based access control, centralized logging, and official support. Requires a subscription.

For learning purposes and small-to-medium deployments, the community edition is more than enough.

Key Ansible Concepts

Let’s define the terminology you’ll encounter throughout this guide.

An inventory is simply a list of hosts that Ansible manages. At its most basic, it’s a text file with hostnames or IP addresses, one per line:

web1.example.com
web2.example.com
db1.example.com
192.168.1.50

But inventories can get sophisticated. You can organize hosts into groups, define variables per host or per group, and even nest groups within groups. We’ll explore all of this shortly.

A playbook is a YAML file containing one or more plays. Each play maps a set of hosts to a set of tasks. When you run a playbook, Ansible executes the tasks on the specified hosts in order.

Think of a playbook as a recipe. It describes the desired state of your infrastructure (“these packages should be installed, these files should exist, these services should be running”), and Ansible figures out how to get there.

A play is a single unit within a playbook. It specifies:

Which hosts to target
What tasks to run on those hosts
Any variables, handlers, or roles to use

A playbook can contain multiple plays. For example, you might have one play that configures web servers and another play that configures database servers, all in the same file.

A task is a single action: install a package, copy a file, restart a service, create a user. Each task calls a module with specific arguments.

- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

The name field is a human-readable description. The rest specifies which module to use (ansible.builtin.apt) and what arguments to pass.

Modules are the workhorses of Ansible. They’re units of code that perform specific actions. When you write a task, you’re really just telling Ansible which module to run and with what parameters.

Some commonly used modules:

ansible.builtin.apt / ansible.builtin.yum: Package management
ansible.builtin.copy: Copy files to remote hosts
ansible.builtin.template: Copy files with variable substitution
ansible.builtin.service: Manage services (start, stop, restart)
ansible.builtin.user: Manage user accounts
ansible.builtin.file: Manage files and directories
ansible.builtin.shell: Run shell commands

Handlers are special tasks that only run when notified. They’re typically used for actions that should only happen if something changed — like restarting a service after its config file was modified.

tasks:
  - name: Copy nginx config
    ansible.builtin.copy:
      src: nginx.conf
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted

If the config file changes, the handler runs. If the file was already identical, the handler is skipped. This prevents unnecessary service restarts.

Roles are a way to organize playbooks into reusable components. Instead of having one giant playbook, you break it into roles like “webserver”, “database”, “monitoring”, each with its own tasks, handlers, variables, and files.

Roles follow a standard directory structure, making them easy to share and reuse. You can publish roles to Ansible Galaxy for others to use, or download roles that others have published.

Plugins extend Ansible’s core functionality. There are plugins for:

Connection: How Ansible connects to hosts (SSH, WinRM, Docker, etc.)
Lookup: Retrieve data from external sources
Filter: Transform data in templates
Callback: Customize output and logging

Collections are the distribution format for Ansible content. A collection can bundle modules, plugins, roles, and playbooks in a single package. This allows vendors and developers to distribute their Ansible content independently from Ansible’s release cycle.

Installing Ansible

Alright, enough theory. Let’s install Ansible on our control node.

SSH into the server VM and run:

# Confirm we're root in the right directory
whoami
# root

pwd
# /root

# Check Python version (Ansible requires Python)
python3 --version
# Python 3.12.3

# Install prerequisite for adding PPAs
apt install software-properties-common -y

# Add the official Ansible PPA
add-apt-repository --yes --update ppa:ansible/ansible

# Install Ansible
apt install ansible -y

# Verify installation
ansible --version
# ansible [core 2.19.5]
#   config file = /etc/ansible/ansible.cfg
#   python version = 3.12.3

The output shows key information:

Ansible core version (2.19.5)
Default config file location (/etc/ansible/ansible.cfg)
Python version Ansible is using

Let’s also create a working directory for our exercises:

mkdir my-ansible
cd my-ansible

This is where we’ll store our inventory, playbooks, and roles.

SSH Authentication Setup

Ansible connects to managed nodes via SSH. While you can use password authentication, SSH key-based authentication is the way to go for automation. Let’s set it up.

First, a quick primer on what’s in the ~/.ssh directory:

tree ~/.ssh
# /root/.ssh
# ├── authorized_keys  # Public keys allowed to connect TO this machine
# ├── id_rsa           # Your private key (never share this!)
# ├── id_rsa.pub       # Your public key (safe to distribute)
# └── known_hosts      # Fingerprints of servers you've connected to

The id_rsa (private key) and id_rsa.pub (public key) pair work together. You put your public key on servers you want to access, and you use your private key to prove your identity. To generate a new pair on the control node:

ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Generating public/private rsa key pair.
# Your identification has been saved in /root/.ssh/id_rsa
# Your public key has been saved in /root/.ssh/id_rsa.pub

The -N "" flag sets an empty passphrase. For automated systems, this is common (though in high-security environments, you might use ssh-agent instead).

Now we copy our public key to all three managed nodes:

for i in {1..3}; do 
  sshpass -p 'qwe123' ssh-copy-id -o StrictHostKeyChecking=no root@tnode$i
done
# tnode1: Number of key(s) added: 1
# tnode2: Number of key(s) added: 1
# tnode3: Number of key(s) added: 1

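The sshpass utility supplies the password non-interactively (useful for bootstrapping, not recommended for production). The StrictHostKeyChecking=no option accepts unknown host keys automatically, which is convenient in a lab but skips the check that protects against connecting to an impostor host.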

Let’s confirm we can now SSH without passwords:

for i in {1..3}; do 
  echo ">> tnode$i <<"
  ssh tnode$i hostname
  echo
done
# >> tnode1 <<
# tnode1
#
# >> tnode2 <<
# tnode2
#
# >> tnode3 <<
# tnode3

All three nodes respond without prompting for a password.

Ansible needs Python on managed nodes to execute modules:

for i in {1..3}; do 
  echo ">> tnode$i <<"
  ssh tnode$i python3 -V
  echo
done
# >> tnode1 <<
# Python 3.12.3
#
# >> tnode2 <<
# Python 3.12.3
#
# >> tnode3 <<
# Python 3.9.21

All nodes have Python installed. Note that tnode3 (Rocky Linux) has a slightly older version — that’s fine, Ansible handles it.

Inventory Basics

The inventory tells Ansible which hosts to manage. Let’s explore different ways to structure it.

The most basic inventory is just a list of hosts:

cat <<EOT > inventory
10.10.1.11
10.10.1.12
10.10.1.13
EOT

Or using hostnames (since we have /etc/hosts configured):

cat <<EOT > inventory
tnode1
tnode2
tnode3
EOT

Verify the inventory parses correctly:

ansible-inventory -i ./inventory --list | jq

In real environments, you’ll want to organize hosts by role. Groups are defined with square brackets:

[webservers]
web1.example.com
web2.example.com

[dbservers]
db1.example.com
db2.example.com

A host can belong to multiple groups:

[webservers]
web1.example.com
web2.example.com

[dbservers]
db01.example.com
db02.example.com

[production]
web1.example.com
web2.example.com
db01.example.com
db02.example.com

[development]
192.168.0.42

You can create groups of groups using the :children suffix:

[webservers]
web1.example.com
web2.example.com

[dbservers]
db01.example.com
db02.example.com

[datacenter:children]
webservers
dbservers

Now datacenter contains all hosts from both webservers and dbservers.

Real-World Example: Kubespray

Here’s what a production inventory might look like (from Kubespray):

[all]
master01 ansible_host=192.168.10.10 ip=192.168.10.10 ansible_user=root
worker01 ansible_host=192.168.10.11 ip=192.168.10.11 ansible_user=root
worker02 ansible_host=192.168.10.12 ip=192.168.10.12 ansible_user=root

[kube_control_plane]
master01

[etcd]
master01

[kube_node]
worker01
worker02

[k8s_cluster:children]
kube_control_plane
kube_node

This inventory defines individual hosts with variables, groups them by function (control plane, etcd, worker nodes), and creates a parent group for the entire cluster.

For large inventories, you can use ranges instead of listing every host:

[webservers]
web[1:20].example.com      # web1 through web20

[databases]
db[01:05].example.com      # db01 through db05 (zero-padded)

[alphabet]
[a:f].example.com          # a.example.com through f.example.com

Our Lab Inventory

Let’s create the inventory we’ll use for the rest of this tutorial:

cat <<EOT > inventory
[web]
tnode1 ansible_python_interpreter=/usr/bin/python3
tnode2 ansible_python_interpreter=/usr/bin/python3

[db]
tnode3 ansible_python_interpreter=/usr/bin/python3

[all:children]
web
db
EOT

A few things to note:

We explicitly set ansible_python_interpreter to avoid warnings about Python auto-detection
tnode1 and tnode2 are in the web group
tnode3 is in the db group
The all group (which is implicit anyway) contains both groups

Verify the inventory:

ansible-inventory -i ./inventory --list | jq
# Shows all hosts organized by group

ansible-inventory -i ./inventory --graph
# @all:
#   |--@ungrouped:
#   |--@web:
#   |  |--tnode1
#   |  |--tnode2
#   |--@db:
#   |  |--tnode3
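
The same lab inventory can also be expressed in Ansible's YAML inventory format. The following is a minimal sketch, not part of the original lab: the file name inventory.yml is an assumption chosen so that it does not replace the INI file used in the rest of this tutorial.

```shell
# Write a YAML equivalent of the lab inventory to a separate file
cat > inventory.yml <<'EOF'
---
all:
  vars:
    # Same interpreter setting as the INI version, applied to every host
    ansible_python_interpreter: /usr/bin/python3
  children:
    web:
      hosts:
        tnode1:
        tnode2:
    db:
      hosts:
        tnode3:
EOF
```

Running ansible-inventory -i ./inventory.yml --graph should show the same web and db groups as the INI file.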

Ansible Configuration (ansible.cfg)

Every time we run an Ansible command, we’ve been specifying -i ./inventory. That gets old fast. Let's create a configuration file to set defaults.

cat <<EOT > ansible.cfg
[defaults]
inventory = ./inventory
remote_user = root
ask_pass = false

[privilege_escalation]
become = true
become_method = sudo
become_user = root
become_ask_pass = false
EOT

[defaults] Basic operational settings:

inventory: Path to the inventory file. Now we can skip the -i flag.
remote_user: The user Ansible logs in as on remote hosts.
ask_pass: Whether to prompt for SSH password. We use key-based auth, so false.

[privilege_escalation] How Ansible gains elevated privileges:

become: Enable privilege escalation (like sudo).
become_method: How to escalate (sudo, su, etc.).
become_user: Which user to become (usually root).
become_ask_pass: Whether to prompt for the sudo password.

Ansible looks for configuration in this order (first match wins):

ANSIBLE_CONFIG environment variable (if set)
./ansible.cfg (current directory)
~/.ansible.cfg (home directory)
/etc/ansible/ansible.cfg (system-wide default)

This is why we created ansible.cfg in our working directory — it takes precedence over the system-wide config.
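
To make the lookup order concrete, here is a small shell sketch that walks the same four locations and prints the first one that exists. It is an illustration only: find_ansible_cfg is a hypothetical helper written for this post, not something Ansible ships, and it merely mimics the documented order.

```shell
# Illustrative helper (hypothetical): mimics the documented lookup order,
# it is not Ansible's own code.
find_ansible_cfg() {
  # 1. The environment variable wins if it is set and the file exists
  if [ -n "${ANSIBLE_CONFIG:-}" ] && [ -f "$ANSIBLE_CONFIG" ]; then
    echo "$ANSIBLE_CONFIG"
    return 0
  fi
  # 2-4. Otherwise fall through the remaining locations in order
  for candidate in ./ansible.cfg "${HOME:-}/.ansible.cfg" /etc/ansible/ansible.cfg; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "no config file found" >&2
  return 1
}

# Print the config file that would be selected from the current directory
find_ansible_cfg || true
```

To see what Ansible itself picked, check the "config file" line in the output of ansible --version.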

Now verify that the configuration is being picked up:

ansible-inventory --list | jq
# Should work without -i flag now

Ad-Hoc Commands

Ad-hoc commands let you run quick, one-off tasks without writing a playbook. They’re great for testing, troubleshooting, or simple operations.

The general form is:

ansible <host-pattern> -m <module> -a "<module arguments>"

The ping module checks if Ansible can connect to hosts and execute Python:

ansible -m ping web
# tnode1 | SUCCESS => {
#     "changed": false,
#     "ping": "pong"
# }
# tnode2 | SUCCESS => {
#     "changed": false,
#     "ping": "pong"
# }

ansible -m ping db
# tnode3 | SUCCESS => {
#     "changed": false,
#     "ping": "pong"
# }

ansible -m ping all
# All three nodes respond

Note: This is not an ICMP ping. It’s an Ansible module that connects via SSH and runs a small Python script to verify everything works.

The shell module executes arbitrary commands:

ansible -m shell -a "uptime" db
# tnode3 | CHANGED | rc=0 >>
#  00:04:22 up  1:54,  1 user,  load average: 0.00, 0.00, 0.00

ansible -m shell -a "free -h" web
# tnode1 | CHANGED | rc=0 >>
#                total        used        free      shared  buff/cache   available
# Mem:           1.3Gi       230Mi       909Mi       4.8Mi       266Mi       1.1Gi
# Swap:          3.7Gi          0B       3.7Gi
# tnode2 | CHANGED | rc=0 >>
# ...

ansible -m shell -a "tail -n 3 /etc/passwd" all
# Shows last 3 lines of /etc/passwd on all hosts

By default, Ansible uses the remote_user from ansible.cfg. You can override with -u:

# Try connecting as vagrant without password
ansible -m ping web -u vagrant
# Permission denied (publickey,password).

# Now with password authentication
ansible -m ping web -u vagrant --ask-pass
# SSH password: [enter qwe123]
# tnode1 | SUCCESS => ...
# tnode2 | SUCCESS => ...

Your First Playbook

Ad-hoc commands are useful, but playbooks are where Ansible really shines. Let’s write our first one.

A playbook is a YAML file containing a list of plays:

---
- hosts: all           # Which hosts to target
  tasks:               # List of tasks to run
    - name: Task 1     # Human-readable description
      module_name:     # Which module to use
        arg1: value1   # Module arguments
        arg2: value2

cat > first-playbook.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Print message
      debug:
        msg: Hello CloudNet@ Ansible Study
EOF

The debug module simply prints a message. It's useful for troubleshooting and learning.

Before running, let’s verify the syntax:

ansible-playbook --syntax-check first-playbook.yml
# playbook: first-playbook.yml

No errors. Now let’s see what happens with broken YAML:

cat > first-playbook-with-error.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Print message
      debug:
      msg: Hello CloudNet@ Ansible Study
EOF

ansible-playbook --syntax-check first-playbook-with-error.yml
# ERROR! conflicting action statements: debug, msg

The error points at the problem: msg must be indented one level deeper so that it becomes an argument of debug. Now run the original, valid playbook:

ansible-playbook first-playbook.yml

Output:

PLAY [all] ******************************************************************

TASK [Gathering Facts] ******************************************************
ok: [tnode1]
ok: [tnode2]
ok: [tnode3]

TASK [Print message] ********************************************************
ok: [tnode1] => {
    "msg": "Hello CloudNet@ Ansible Study"
}
ok: [tnode2] => {
    "msg": "Hello CloudNet@ Ansible Study"
}
ok: [tnode3] => {
    "msg": "Hello CloudNet@ Ansible Study"
}

PLAY RECAP ******************************************************************
tnode1    : ok=2    changed=0    unreachable=0    failed=0    skipped=0
tnode2    : ok=2    changed=0    unreachable=0    failed=0    skipped=0
tnode3    : ok=2    changed=0    unreachable=0    failed=0    skipped=0

Let’s break down what happened:

Gathering Facts: Ansible automatically collects information about each host
Print message: Our task runs on each host
Play Recap: Summary showing success/failure counts

Let’s write something more useful — a playbook that restarts the SSH service:

cat > restart-sshd.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Restart sshd service
      ansible.builtin.service:
        name: ssh
        state: restarted
EOF

But wait! There’s a problem. On Debian/Ubuntu, the service is called ssh. On RHEL/Rocky, it's called sshd. Let's run it and see what happens:

ansible-playbook restart-sshd.yml
# ...
# fatal: [tnode3]: FAILED! => {"changed": false, "msg": "Could not find the requested service ssh: host"}
# tnode1 and tnode2 succeed, tnode3 fails

Handling OS Differences with Conditionals

We need to use different service names based on the OS. Ansible collects OS information as “facts” that we can use in conditionals:

cat > restart-sshd.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Restart SSH on Debian
      ansible.builtin.service:
        name: ssh
        state: restarted
      when: ansible_facts['os_family'] == 'Debian'

    - name: Restart SSH on RedHat
      ansible.builtin.service:
        name: sshd
        state: restarted
      when: ansible_facts['os_family'] == 'RedHat'
EOF

ansible-playbook restart-sshd.yml

Output:

TASK [Restart SSH on Debian] ************************************************
skipping: [tnode3]
changed: [tnode1]
changed: [tnode2]

TASK [Restart SSH on RedHat] ************************************************
skipping: [tnode1]
skipping: [tnode2]
changed: [tnode3]

Now each task runs only on the appropriate OS. This is a common pattern in Ansible, using facts and conditionals to handle heterogeneous environments.
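
An alternative to one task per OS family is a single task that looks up the service name in a dictionary. This is a sketch under stated assumptions: the file name restart-sshd-map.yml and the variable ssh_service_by_family are hypothetical names introduced here for illustration.

```shell
cat > restart-sshd-map.yml <<'EOF'
---
- hosts: all
  vars:
    # Hypothetical lookup table: OS family -> SSH service name
    ssh_service_by_family:
      Debian: ssh
      RedHat: sshd
  tasks:
    - name: Restart SSH using the service name for this OS family
      ansible.builtin.service:
        name: "{{ ssh_service_by_family[ansible_facts['os_family']] }}"
        state: restarted
EOF
```

Run it with ansible-playbook restart-sshd-map.yml. The trade-off is that a host whose os_family is missing from the map fails with an undefined-variable error, so the map needs an entry for every family in the inventory.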

Variables

Variables make playbooks flexible and reusable. Instead of hardcoding values, you define them once and reference them throughout your playbook.

Ansible has many places where you can define variables. Here’s the precedence order (highest to lowest):

Extra vars (-e on command line) — Always wins
Play vars (vars: in playbook)
Host vars (defined per-host in inventory)
Group vars (defined per-group in inventory)

Understanding this precedence is crucial. If you define the same variable in multiple places, the higher-precedence one wins.

Group variables apply to all hosts in a group. Define them in the inventory:

cat >> inventory <<'EOF'

[all:vars]
user=ansible
EOF

Now every host has access to the user variable.

Let’s create a playbook that uses it:

cat > create-user.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
EOF

Variable references use double curly braces: {{ variable_name }}. Note the quotes around "{{ user }}" — this is required when a value starts with a variable reference.

Run it:

ansible-playbook create-user.yml
# TASK [Create User ansible] **************************************************
# changed: [tnode1]
# changed: [tnode2]
# changed: [tnode3]

The task name shows “Create User ansible” — the variable was substituted.

Run it again:

ansible-playbook create-user.yml
# TASK [Create User ansible] **************************************************
# ok: [tnode1]
# ok: [tnode2]
# ok: [tnode3]

Every host now reports ok instead of changed (changed=0 in the recap). The user already exists, so Ansible does not create it again. This is idempotency in action.

Verify the user was created:

for i in {1..3}; do ssh tnode$i tail -n 1 /etc/passwd; done
# ansible:x:1001:1001::/home/ansible:/bin/sh
# ansible:x:1001:1001::/home/ansible:/bin/sh
# ansible:x:1001:1001::/home/ansible:/bin/bash

Host Variables

Host variables apply to a single host and override group variables:

cat > inventory <<'EOF'
[web]
tnode1 ansible_python_interpreter=/usr/bin/python3
tnode2 ansible_python_interpreter=/usr/bin/python3

[db]
tnode3 ansible_python_interpreter=/usr/bin/python3 user=ansible1

[all:children]
web
db

[all:vars]
user=ansible
EOF

Now tnode3 has user=ansible1 while others have user=ansible.

cat > create-user1.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
EOF

ansible-playbook create-user1.yml
# TASK [Create User ansible1] *************************************************
# changed: [tnode3]

The host variable ansible1 overrode the group variable ansible.

Play Variables

Play variables are defined within the playbook and override both host and group variables:

cat > create-user2.yml <<'EOF'
---
- hosts: all
  vars:
    user: ansible2
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
EOF

ansible-playbook create-user2.yml
# TASK [Create User ansible2] *************************************************
# changed: [tnode1]
# changed: [tnode2]
# changed: [tnode3]

Even though tnode3 has user=ansible1 in inventory, the play variable ansible2 wins.

External Variable Files

For better organization, you can store variables in separate files:

mkdir -p vars
cat > vars/users.yml <<'EOF'
user: ansible3
EOF

cat > create-user3.yml <<'EOF'
---
- hosts: all
  vars_files:
    - vars/users.yml
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
EOF

ansible-playbook create-user3.yml
# TASK [Create User ansible3] *************************************************
# changed on all hosts

Extra Variables (Command Line)

Extra variables have the highest precedence. Pass them with -e:

ansible-playbook -e user=ansible4 create-user3.yml
# TASK [Create User ansible4] *************************************************
# changed on all hosts

Even though the playbook uses vars_files with user: ansible3, the -e user=ansible4 overrides it.
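
To watch precedence at work without creating more users, a playbook that only prints the variable is enough. This is a minimal sketch: the file name show-user.yml and the values from_play_vars and from_cli are made up for illustration.

```shell
cat > show-user.yml <<'EOF'
---
- hosts: db
  gather_facts: no
  vars:
    user: from_play_vars     # beats the inventory values for tnode3
  tasks:
    - name: Show which value of user won
      ansible.builtin.debug:
        var: user
EOF
```

With the inventory from this section, ansible-playbook show-user.yml should print from_play_vars, and ansible-playbook -e user=from_cli show-user.yml should print from_cli. Deleting the vars: block should fall back to the host variable (ansible1).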

Registered Variables

You can capture task output in a variable using register:

cat > create-user4.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
      register: result
    - name: Display result
      ansible.builtin.debug:
        var: result
EOF

ansible-playbook -e user=ansible5 create-user4.yml

Output:

TASK [Display result] *******************************************************
ok: [tnode3] => {
    "result": {
        "changed": true,
        "comment": "",
        "create_home": true,
        "group": 1006,
        "home": "/home/ansible5",
        "name": "ansible5",
        "shell": "/bin/sh",
        "state": "present",
        "uid": 1006
    }
}

The registered variable contains detailed information about what the task did. This is useful for conditional logic, debugging, or passing information between tasks.
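
As one example of conditional logic on a registered result, the sketch below reports only when the user was actually created. The file name create-user-report.yml is an assumption; result.name and result.uid are fields visible in the output above, and "is changed" is a built-in Ansible test.

```shell
cat > create-user-report.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Create User {{ user }}
      ansible.builtin.user:
        name: "{{ user }}"
        state: present
      register: result

    - name: Report only when the user was newly created
      ansible.builtin.debug:
        msg: "Created {{ result.name }} with uid {{ result.uid }}"
      when: result is changed
EOF
```

On a second run with the same user, the report task should be skipped because the user task returns ok rather than changed.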

Facts

Facts are variables that Ansible automatically discovers about managed hosts. They include hardware info, network configuration, OS details, and more.

cat > facts.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Print all facts
      ansible.builtin.debug:
        var: ansible_facts
EOF

ansible-playbook facts.yml

This dumps a huge amount of information — hostname, IP addresses, memory, CPU, disk, OS version, Python version, and much more.

You don’t usually need all facts. Here’s how to access specific ones:

cat > facts1.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Print specific facts
      ansible.builtin.debug:
        msg: >
          The default IPv4 address of {{ ansible_facts.hostname }}
          is {{ ansible_facts.default_ipv4.address }}
EOF

ansible-playbook facts1.yml
# "msg": "The default IPv4 address of tnode3 is 10.10.1.13"

Commonly used facts and what they contain:

ansible_facts.hostname: Short hostname
ansible_facts.fqdn: Fully qualified domain name
ansible_facts.default_ipv4.address: Primary IPv4 address
ansible_facts.os_family: OS family (Debian, RedHat, etc.)
ansible_facts.distribution: Specific distribution (Ubuntu, Rocky, etc.)
ansible_facts.distribution_version: Distribution version
ansible_facts.kernel: Kernel version
ansible_facts.memtotal_mb: Total memory in MB
ansible_facts.processor_cores: Number of CPU cores

Disabling Fact Gathering

Fact gathering adds time to playbook runs. If you don’t need facts, disable them:

cat > facts3.yml <<'EOF'
---
- hosts: db
  gather_facts: no
  tasks:
    - name: Print message
      ansible.builtin.debug:
        msg: Hello Ansible World
EOF

ansible-playbook facts3.yml
# Note: no "Gathering Facts" task runs

But if you try to use facts after disabling gathering:

cat > facts3-error.yml <<'EOF'
---
- hosts: db
  gather_facts: no
  tasks:
    - name: Print facts
      ansible.builtin.debug:
        msg: "Hostname is {{ ansible_facts.hostname }}"
EOF

ansible-playbook facts3-error.yml
# ERROR! 'dict object' has no attribute 'hostname'

You can disable automatic gathering but collect facts later when needed:

cat > facts4.yml <<'EOF'
---
- hosts: db
  gather_facts: no
  tasks:
    - name: Manually gather facts
      ansible.builtin.setup:
    - name: Print facts
      ansible.builtin.debug:
        msg: "Hostname is {{ ansible_facts.hostname }}"
EOF


ansible-playbook facts4.yml
# Works now because we explicitly gathered facts
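
Manual gathering also lets you collect only part of the fact set. The sketch below uses the setup module's gather_subset parameter to request the network subset; the file name facts-subset.yml is an assumption, and a minimal base set of facts is still collected alongside the requested subset.

```shell
cat > facts-subset.yml <<'EOF'
---
- hosts: db
  gather_facts: no
  tasks:
    - name: Gather only the network subset
      ansible.builtin.setup:
        gather_subset:
          - '!all'      # drop the default "everything" selection
          - network     # then add back network facts
    - name: Print the primary IPv4 address
      ansible.builtin.debug:
        msg: "{{ ansible_facts.default_ipv4.address }}"
EOF
```

Run it with ansible-playbook facts-subset.yml. Facts outside the requested subset, such as mounts, should not be available in this play.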

You can create your own facts on managed hosts. Create .fact files in /etc/ansible/facts.d/:

# On the control node (for local testing)
mkdir -p /etc/ansible/facts.d

cat > /etc/ansible/facts.d/my-custom.fact <<'EOF'
[packages]
web_package = httpd
db_package = mariadb-server

[users]
user1 = ansible
user2 = devlos
EOF

Access custom facts via ansible_local:

cat > facts5.yml <<'EOF'
---
- hosts: localhost
  tasks:
    - name: Print custom facts
      ansible.builtin.debug:
        var: ansible_local
EOF

ansible-playbook facts5.yml
# "ansible_local": {
#     "my-custom": {
#         "packages": {
#             "db_package": "mariadb-server",
#             "web_package": "httpd"
#         },
#         "users": {
#             "user1": "ansible",
#             "user2": "devlos"
#         }
#     }
# }

Custom facts are useful for site-specific configuration that you want Ansible to discover automatically.
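
Fact files can also be written as JSON instead of INI. The sketch below keeps everything in the working directory rather than /etc/ansible/facts.d by pointing the setup module's fact_path parameter at a local folder. The directory facts.d, the file app.fact, and its keys are all hypothetical values chosen for illustration.

```shell
# A JSON-formatted custom fact (the file must not be executable)
mkdir -p facts.d
cat > facts.d/app.fact <<'EOF'
{
  "release": "1.0.0",
  "owner": "platform-team"
}
EOF

cat > facts-json.yml <<'EOF'
---
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Gather facts, reading custom facts from a project-local directory
      ansible.builtin.setup:
        fact_path: "{{ playbook_dir }}/facts.d"
    - name: Print the custom fact (keyed by the file name without .fact)
      ansible.builtin.debug:
        var: ansible_local.app
EOF
```

Running ansible-playbook facts-json.yml should print the release and owner keys under ansible_local.app.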

Loops

Loops let you repeat a task multiple times with different values. This eliminates the need to write the same task over and over.

cat > check-services.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Check services are running
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
      loop:
        - rsyslog
        - cron
EOF

ansible-playbook check-services.yml
# TASK [Check services are running] *******************************************
# ok: [tnode1] => (item=rsyslog)
# ok: [tnode1] => (item=cron)
# ok: [tnode2] => (item=rsyslog)
# ...

The {{ item }} variable holds the current value from the loop list.

Instead of hardcoding the list, use a variable:

cat > check-services2.yml <<'EOF'
---
- hosts: all
  vars:
    services:
      - rsyslog
      - cron
  tasks:
    - name: Check services are running
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
      loop: "{{ services }}"
EOF

Sometimes you need multiple values per iteration. Use a list of dictionaries:

cat > make-file.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Create files
      ansible.builtin.file:
        path: "{{ item['log-path'] }}"
        mode: "{{ item['log-mode'] }}"
        state: touch
      loop:
        - log-path: /var/log/test1.log
          log-mode: '0644'
        - log-path: /var/log/test2.log
          log-mode: '0600'
EOF

ansible-playbook make-file.yml

Verify:

ansible -m shell -a "ls -l /var/log/test*.log" all
# -rw-r--r-- 1 root root 0 ... /var/log/test1.log
# -rw------- 1 root root 0 ... /var/log/test2.log

Each file was created with its specified permissions.
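
When looping over dictionaries, the per-item output prints the whole dictionary, which gets noisy. A possible refinement, sketched here with the assumed file name make-file-label.yml, is loop_control with a label so that only the path appears in the output.

```shell
cat > make-file-label.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Create files
      ansible.builtin.file:
        path: "{{ item['log-path'] }}"
        mode: "{{ item['log-mode'] }}"
        state: touch
      loop:
        - log-path: /var/log/test1.log
          log-mode: '0644'
        - log-path: /var/log/test2.log
          log-mode: '0600'
      loop_control:
        # Show only the path in the per-item output line
        label: "{{ item['log-path'] }}"
EOF
```

The tasks behave the same as in make-file.yml; only the (item=...) part of the output changes.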

You might encounter older playbooks using with_* syntax:

# Old syntax (still works, but loop is preferred)
- name: Print items
  debug:
    msg: "{{ item }}"
  with_items:
    - one
    - two
    - three

# Modern syntax
- name: Print items
  debug:
    msg: "{{ item }}"
  loop:
    - one
    - two
    - three

The loop keyword is the modern approach. with_items still works but is considered legacy.

Capture output from all loop iterations:

cat > loop_register.yml <<'EOF'
---
- hosts: localhost
  tasks:
    - name: Loop echo test
      ansible.builtin.shell: "echo 'I can speak {{ item }}'"
      loop:
        - Korean
        - English
      register: result
    - name: Show all results
      ansible.builtin.debug:
        var: result
    - name: Show just stdout from each iteration
      ansible.builtin.debug:
        msg: "Stdout: {{ item.stdout }}"
      loop: "{{ result.results }}"
EOF

ansible-playbook loop_register.yml

When you register inside a loop, the variable contains a results array with output from each iteration.

Conditionals

The when keyword lets you run tasks only when certain conditions are met.

cat > when_task.yml <<'EOF'
---
- hosts: localhost
  vars:
    run_my_task: true
  tasks:
    - name: Echo message
      ansible.builtin.shell: "echo test"
      when: run_my_task
      register: result
    - name: Show result
      ansible.builtin.debug:
        var: result
EOF

ansible-playbook when_task.yml
# Task runs because run_my_task is true

Now set run_my_task to false:

cat > when_task_false.yml <<'EOF'
---
- hosts: localhost
  vars:
    run_my_task: false
  tasks:
    - name: Echo message
      ansible.builtin.shell: "echo test"
      when: run_my_task
      register: result
    - name: Show result
      ansible.builtin.debug:
        var: result
EOF


ansible-playbook when_task_false.yml
# TASK [Echo message] *********************************************************
# skipping: [localhost]

The task is skipped.

Comparison Operators

Common operators and tests for when expressions:

==: Equal
!=: Not equal
>, >=, <, <=: Numeric comparisons
in: Value is in a list
is defined: Variable exists
is not defined: Variable doesn't exist
not: Negation
and, or: Combine conditions

cat > check-os.yml <<'EOF'
---
- hosts: all
  vars:
    supported_distros:
      - Ubuntu
      - CentOS
  tasks:
    - name: Print supported os
      ansible.builtin.debug:
        msg: "This {{ ansible_facts['distribution'] }} needs to use apt"
      when: ansible_facts['distribution'] in supported_distros
EOF

ansible-playbook check-os.yml
# tnode1, tnode2: prints message (Ubuntu)
# tnode3: skipped (Rocky not in list)
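
The is defined and is not defined tests are handy for optional inputs. Here is a minimal sketch, assuming a hypothetical variable deploy_env and file name when-defined.yml, neither of which is part of the lab so far.

```shell
cat > when-defined.yml <<'EOF'
---
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Runs only when deploy_env was provided
      ansible.builtin.debug:
        msg: "Deploying to {{ deploy_env }}"
      when: deploy_env is defined

    - name: Runs only when deploy_env is missing
      ansible.builtin.debug:
        msg: "No deploy_env given, nothing to deploy"
      when: deploy_env is not defined
EOF
```

ansible-playbook when-defined.yml should run only the second task, while ansible-playbook -e deploy_env=staging when-defined.yml should run only the first.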

Multiple Conditions

Combine conditions with and / or, or use list syntax, where every item must be true:

# Using 'and'
when: ansible_facts['distribution'] == "Ubuntu" and ansible_facts['distribution_version'] == "24.04"

# List syntax (implicit AND)
when:
  - ansible_facts['distribution'] == "Ubuntu"
  - ansible_facts['distribution_version'] == "24.04"

# Combining AND and OR
when: >
  ( ansible_facts['distribution'] == "Rocky" and
    ansible_facts['distribution_version'] == "9.6" )
  or
  ( ansible_facts['distribution'] == "Ubuntu" and
    ansible_facts['distribution_version'] == "24.04" )

Loop + Condition

You can combine loops and conditionals:

cat > check-mount.yml <<'EOF'
---
- hosts: db
  tasks:
    - name: Print Root Directory Size
      ansible.builtin.debug:
        msg: "Directory {{ item.mount }} size is {{ item.size_available }}"
      loop: "{{ ansible_facts['mounts'] }}"
      when: item['mount'] == "/" and item['size_available'] > 300000000
EOF

ansible-playbook check-mount.yml
# Only prints for root mount if it has >300MB available

Register + Condition

cat > register-when.yml <<'EOF'
---
- hosts: all
  tasks:
    - name: Get rsyslog service status
      ansible.builtin.command: systemctl is-active rsyslog
      register: result

    - name: Print rsyslog status
      ansible.builtin.debug:
        msg: "Rsyslog status is {{ result.stdout }}"
      when: result.stdout == "active"
EOF

ansible-playbook register-when.yml
# Only prints if rsyslog is active

Handlers and Error Handling

Handlers are tasks that only run when notified. They’re perfect for actions that should happen only if something changed:

cat > handler-sample.yml <<'EOF'
---
- hosts: tnode2
  tasks:
    - name: Restart rsyslog
      ansible.builtin.service:
        name: rsyslog
        state: restarted
      notify:
        - print msg
  handlers:
    - name: print msg
      ansible.builtin.debug:
        msg: "rsyslog is restarted"
EOF

ansible-playbook handler-sample.yml
# TASK [Restart rsyslog] ******************************************************
# changed: [tnode2]
#
# RUNNING HANDLER [print msg] *************************************************
# ok: [tnode2] => {
#     "msg": "rsyslog is restarted"
# }

The handler ran because the service task returned changed. If you run it again, the service restarts (always changed for state: restarted), so the handler runs again.

In practice, handlers are commonly used to restart services only when their configuration files change.

By default, if a task fails, Ansible stops executing on that host. Use ignore_errors to continue:

cat > ignore-example.yml <<'EOF'
---
- hosts: tnode1
  tasks:
    - name: Install nonexistent package
      ansible.builtin.apt:
        name: apache3      # This doesn't exist
        state: latest
      ignore_errors: yes

    - name: Print msg
      ansible.builtin.debug:
        msg: "This still runs despite the error above"
EOF

ansible-playbook ignore-example.yml
# First task fails but second task still runs
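
Ignoring an error does not have to mean losing it. The sketch below registers the result and reports the failure in a follow-up task using the built-in is failed test. The file name ignore-report.yml and the variable install_result are assumptions made for this example.

```shell
cat > ignore-report.yml <<'EOF'
---
- hosts: tnode1
  tasks:
    - name: Install nonexistent package
      ansible.builtin.apt:
        name: apache3      # This doesn't exist
        state: latest
      register: install_result
      ignore_errors: yes

    - name: Report the failure instead of ignoring it silently
      ansible.builtin.debug:
        msg: "Install failed: {{ install_result.msg | default('no message') }}"
      when: install_result is failed
EOF
```

Run it with ansible-playbook ignore-report.yml. The play continues past the failed install, and the second task should print the error message returned by the apt module.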

Normally, if a task fails, handlers that were already notified on that host do not run. Set force_handlers: yes to run them anyway:

cat > force-handler.yml <<'EOF'
---
- hosts: tnode2
  force_handlers: yes
  tasks:
    - name: Restart rsyslog
      ansible.builtin.service:
        name: rsyslog
        state: restarted
      notify:
        - print msg
    - name: Install nonexistent package
      ansible.builtin.apt:
        name: apache3
        state: latest
  handlers:
    - name: print msg
      ansible.builtin.debug:
        msg: "rsyslog is restarted"
EOF

ansible-playbook force-handler.yml
# Second task fails, but handler still runs because of force_handlers

Use failed_when to define custom failure criteria:

cat > failed-when.yml <<'EOF'
---
- hosts: tnode1
  tasks:
    - name: Run user add script
      ansible.builtin.shell: /root/adduser-script.sh
      register: command_result
      failed_when: "'Please input user id and password' in command_result.stdout"
    - name: Print msg
      ansible.builtin.debug:
        msg: "{{ command_result.stdout }}"
EOF

Even though the shell command returns exit code 0 (success), we mark it as failed if it prints the usage message.
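
The script itself is not shown in this post. To make the behavior concrete, here is a hypothetical stand-in written to the current directory: when called without arguments it prints the usage message yet still exits with code 0, which is exactly the situation failed_when is meant to catch.

```shell
# Hypothetical stand-in for /root/adduser-script.sh (not the original script)
cat > adduser-script.sh <<'EOF'
#!/bin/sh
if [ "$#" -lt 2 ]; then
  echo "Please input user id and password"
  exit 0   # exits "successfully" even though nothing was done
fi
echo "would create user $1"
EOF
chmod +x adduser-script.sh

# Call it without arguments and show the exit code
sh ./adduser-script.sh
echo "exit code: $?"
# Please input user id and password
# exit code: 0
```

Because the exit code is 0, Ansible would mark the task as successful by default; the failed_when condition on stdout is what turns it into a failure.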

Block / Rescue / Always

For try-catch-finally style error handling:

cat > block-example.yml <<'EOF'
---
- hosts: tnode2
  vars:
    logdir: /var/log/daily_log
    logfile: todays.log
  tasks:
    - name: Configure Log Env
      block:
        - name: Find Directory
          ansible.builtin.find:
            paths: "{{ logdir }}"
          register: result
          failed_when: "'Not all paths' in result.msg"
      rescue:
        - name: Make Directory when Not found
          ansible.builtin.file:
            path: "{{ logdir }}"
            state: directory
            mode: '0755'
      always:
        - name: Create File
          ansible.builtin.file:
            path: "{{ logdir }}/{{ logfile }}"
            state: touch
            mode: '0644'
EOF

ansible-playbook block-example.yml

First run (directory doesn’t exist):

block task fails (directory not found)
rescue task runs (creates directory)
always task runs (creates file)

Second run (directory exists):

block task succeeds
rescue is skipped
always task runs

Roles

As playbooks grow, they become hard to manage. Roles let you organize your automation into reusable components.

A role follows a standard structure:

my-role/
├── defaults/       # Default variables (can be overridden)
│   └── main.yml
├── files/          # Static files to copy
├── handlers/       # Handler definitions
│   └── main.yml
├── meta/           # Role metadata (dependencies, etc.)
│   └── main.yml
├── tasks/          # Main task list
│   └── main.yml
├── templates/      # Jinja2 templates
├── tests/          # Test playbooks
│   ├── inventory
│   └── test.yml
└── vars/           # Variables (high precedence, hard to override)
    └── main.yml

Use ansible-galaxy to scaffold a new role:

ansible-galaxy role init my-role
tree my-role/
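
If ansible-galaxy is not at hand, the layout can also be created by hand. The sketch below only reproduces the directory tree listed earlier; ROLE_ROOT is a hypothetical variable that defaults to a temp directory so nothing existing is overwritten, and the generated files are empty placeholders rather than what ansible-galaxy would write.

```shell
#!/bin/sh
# Hand-rolled version of the role layout. ROLE_ROOT is a hypothetical
# override; by default everything lands in a temp directory.
ROLE_ROOT=${ROLE_ROOT:-$(mktemp -d)}
role="$ROLE_ROOT/my-role"

# Directories that carry a main.yml entry point
for d in defaults handlers meta tasks vars; do
  mkdir -p "$role/$d"
  printf '%s\n' '---' > "$role/$d/main.yml"
done

# Directories that hold plain files
mkdir -p "$role/files" "$role/templates" "$role/tests"
printf '%s\n' 'localhost' > "$role/tests/inventory"
printf '%s\n' '---' > "$role/tests/test.yml"

find "$role" | sort
```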

Let’s create a role that:

Checks if the OS is supported
Installs Apache
Deploys a custom index.html
Restarts Apache when the file changes

tasks/main.yml:

cat > my-role/tasks/main.yml <<'EOF'
---
- name: Install service {{ service_title }}
  ansible.builtin.apt:
    name: "{{ item }}"
    state: latest
  loop: "{{ httpd_packages }}"
  when: ansible_facts.distribution in supported_distros

- name: Copy html file
  ansible.builtin.copy:
    src: "{{ src_file_path }}"
    dest: "{{ dest_file_path }}"
  notify: 
    - restart service
EOF

files/index.html:

echo "Hello! Ansible" > my-role/files/index.html

handlers/main.yml:

cat > my-role/handlers/main.yml <<'EOF'
---
- name: restart service
  ansible.builtin.service:
    name: "{{ service_name }}"
    state: restarted
EOF

defaults/main.yml (overridable variables):

cat > my-role/defaults/main.yml <<'EOF'
---
service_title: "Apache Web Server"
EOF

vars/main.yml (internal variables):

cat > my-role/vars/main.yml <<'EOF'
---
service_name: apache2
src_file_path: index.html
dest_file_path: /var/www/html/index.html
httpd_packages:
  - apache2
  - apache2-doc
supported_distros:
  - Ubuntu
EOF

Using the Role

Create a playbook that calls the role:

cat > role-example.yml <<'EOF'
---
- hosts: tnode1
  tasks:
    - name: Print start play
      ansible.builtin.debug:
        msg: "Let's start role play"
    - name: Install Service by role
      ansible.builtin.import_role:
        name: my-role
EOF

$ ansible-playbook role-example.yml

Output:

TASK [Print start play] *******************************************************
ok: [tnode1] => {
    "msg": "Let's start role play"
}

TASK [my-role : Install service Apache Web Server] ****************************
changed: [tnode1] => (item=apache2)
changed: [tnode1] => (item=apache2-doc)

TASK [my-role : Copy html file] ***********************************************
changed: [tnode1]

RUNNING HANDLER [my-role : restart service] ***********************************
changed: [tnode1]

Verify:

curl tnode1
# Hello! Ansible

Overriding Role Variables

Override the default service_title:

cat > role-example.yml <<'EOF'
---
- hosts: tnode1
  tasks:
    - name: Install Service by role
      ansible.builtin.import_role:
        name: my-role
      vars:
        service_title: Httpd
EOF

$ ansible-playbook role-example.yml
# TASK [my-role : Install service Httpd] **************************************

Using the roles: Section

An alternative to import_role:

cat > role-example2.yml <<'EOF'
---
- hosts: tnode1
  roles:
    - role: my-role
      service_title: "Custom Apache"
  tasks:
    - name: Print finish
      ansible.builtin.debug:
        msg: "Finished role play"
EOF

Special Task Sections

When using roles:, you can add tasks that run before or after:

cat > special_role.yml <<'EOF'
---
- hosts: tnode1
  pre_tasks:
    - name: Print Start
      ansible.builtin.debug:
        msg: "Starting role play"
  roles:
    - my-role
  tasks:
    - name: Curl test
      ansible.builtin.uri:
        url: http://tnode1
        return_content: true
      register: curl_result
      notify: Print result
      changed_when: true
  post_tasks:
    - name: Print Finish
      ansible.builtin.debug:
        msg: "Finished role play"
  handlers:
    - name: Print result
      ansible.builtin.debug:
        msg: "{{ curl_result.content }}"
EOF

Execution order:

pre_tasks
roles
tasks → triggers handler
post_tasks

Special Tags

always: Task always runs (unless explicitly skipped)

- name: Always run this
  debug:
    msg: "I always run"
  tags:
    - always

never: Task never runs (unless explicitly requested)

- name: Debug task
  debug:
    msg: "Only runs with --tags debug or --tags never"
  tags:
    - never
    - debug

Tags on Blocks, Plays, and Roles

# Tag a block
- block:
    - name: Task 1
      ...
    - name: Task 2
      ...
  tags: myblock

# Tag a play
- hosts: all
  tags: deploy
  tasks:
    ...

# Tag a role
roles:
  - role: my-role
    tags: [ web, setup ]

Ansible Galaxy

Ansible Galaxy is a community hub for sharing roles and collections. Why write everything from scratch when someone has already done the work?

On the web: https://galaxy.ansible.com

Via command line:

ansible-galaxy role search postgresql --platforms Ubuntu
# Found 282 roles matching your search

ansible-galaxy role info geerlingguy.postgresql
# Role: geerlingguy.postgresql
#     description: PostgreSQL server for Linux.
#     download_count: 3758702
#     ...

# Install to default location
ansible-galaxy role install geerlingguy.postgresql

# Install to specific directory
ansible-galaxy role install -p roles geerlingguy.postgresql

# List installed roles
ansible-galaxy role list

cat > role-galaxy.yml <<'EOF'
---
- hosts: tnode1
  become: yes
  roles:
    - geerlingguy.postgresql
EOF

ansible-playbook role-galaxy.yml

ansible-galaxy role remove geerlingguy.postgresql

Collections are the modern way to distribute Ansible content. They can include modules, plugins, roles, and playbooks — all packaged together.

Before collections, all modules were bundled with Ansible core. This created problems:

Module updates tied to Ansible releases
Namespace collisions
Everything installed whether you need it or not

Collections solve this by allowing vendors and developers to distribute content independently.

# List installed collections
ansible-galaxy collection list

# Install a collection
ansible-galaxy collection install openstack.cloud:2.1.0

# Install from a tarball (for offline)
ansible-galaxy collection download -p ./collection openstack.cloud
ansible-galaxy collection install ./collection/openstack-cloud-2.2.0.tar.gz

# Remove a collection (manual deletion)
rm -rf ~/.ansible/collections/ansible_collections/openstack

When referencing modules from collections, use the fully qualified name:

# Full name
- name: Create server
  openstack.cloud.server:
    name: myserver
    ...

# Or import the collection
collections:
  - openstack.cloud

tasks:
  - name: Create server
    server:
      name: myserver

Conclusion

We’ve covered a lot of ground. Let’s recap what you’ve learned:

Ansible basics: Agentless architecture, SSH-based, idempotent
Inventories: Organizing hosts into groups
Playbooks: YAML files describing desired state
Variables: Multiple scopes with clear precedence rules
Facts: Auto-discovered host information
Loops: Repeating tasks efficiently
Conditionals: Running tasks based on conditions
Handlers: Triggered tasks for “changed” events
Error handling: ignore_errors, failed_when, block/rescue/always
Roles: Reusable, modular automation components
Tags: Selective task execution
Galaxy: Sharing and reusing community content
Collections: Modern content distribution

The manual approach to server management (SSH’ing in, running commands, copying files) simply doesn’t scale. With Ansible, you describe what you want, and it figures out how to get there. Your infrastructure becomes code that you can version, review, and reproduce.

Next week, we’ll put all this knowledge to use with Kubespray, using Ansible to deploy a production-ready Kubernetes cluster. The foundation you’ve built here will make that journey much smoother.

Thanks for reading! See you in the next post.

Building a Kubernetes Cluster From Scratch: Part 3 & Part 4

Sigrid Jin — Sat, 10 Jan 2026 18:00:37 GMT

See the repository if you want to follow my tutorial! https://github.com/sigridjineth/k8s-hard-way

In Part 2, we built the control plane: etcd for storage, the API server for cluster access, the controller manager for reconciliation loops, and the scheduler for pod placement. We verified that everything was running and that we could interact with the cluster using kubectl.

But right now, if you try to create a pod, nothing will happen. The scheduler will look for a node to place it on and find none. The pod will sit in Pending state forever. We need worker nodes.

In this part, we will configure both worker nodes with everything they need to run containers. This includes containerd as the container runtime, kubelet as the node agent, kube-proxy for service networking, and CNI plugins for pod networking. We will also set up routing so that pods on different nodes can talk to each other.

This is where Kubernetes starts to feel real. By the end of this part, you will be able to deploy actual workloads.

Before we dive into configuration, let me explain what each component does and how they interact.

The kubelet is the primary node agent. It watches the API server for pods that are assigned to its node and ensures those pods are running. When a new pod is scheduled to the node, kubelet tells the container runtime to start the containers. It also reports node status back to the API server, including resource capacity, running pods, and health conditions.

The container runtime actually runs containers. We use containerd, a widely adopted runtime that underpins Docker Desktop and managed Kubernetes services such as EKS, GKE, and AKS. Containerd handles pulling images, creating containers, managing their lifecycle, and cleaning up when they exit. It implements the Container Runtime Interface (CRI) that kubelet uses to communicate with it.

Under containerd, there is runc. This is a low-level runtime that does the actual work of creating Linux containers using namespaces and cgroups. Containerd calls runc to create each container, and runc sets up the isolated environment where the container process runs.

kube-proxy handles service networking. When you create a Service in Kubernetes, kube-proxy sets up iptables rules (or IPVS rules, depending on configuration) that route traffic destined for the service’s ClusterIP to one of the backing pods. It watches the API server for service and endpoint changes and updates the routing rules accordingly.

The CNI plugins handle pod networking. When kubelet asks the container runtime to create a pod, the runtime calls a CNI plugin to set up the network namespace, assign an IP address, and configure routing. We will use the bridge plugin, which creates a Linux bridge on each node and connects all pods to it. Each pod gets an IP from the node’s pod CIDR range.

Let us start with node-0. From the jumpbox, SSH into it:

ssh root@node-0

First, let us verify that swap is disabled. Kubernetes requires this:

swapon --show

This should produce no output. If you see swap listed, disable it:

swapoff -a
sed -i '/swap/s/^/#/' /etc/fstab
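
One caveat about the sed command: it matches any line containing the word swap and prepends another # every time it runs. A slightly more careful variant is sketched below. It works on a sample copy with made-up entries so it is safe to try anywhere; disable_swap_entries is an illustrative name, and on a real node you would point it at /etc/fstab after taking a backup.

```shell
#!/bin/sh
# Work on a sample file; the entries below are made up for illustration.
fstab=$(mktemp)
cat > "$fstab" <<'EOF'
UUID=aaaa-bbbb  /      ext4  errors=remount-ro  0 1
UUID=cccc-dddd  none   swap  sw                 0 0
EOF

# Comment out only lines that are not already commented and whose third
# field (the filesystem type) is "swap".
disable_swap_entries() {
  sed -E '/^[[:space:]]*#/!s/^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/#\1/' "$1" > "$1.new" \
    && mv "$1.new" "$1"
}

disable_swap_entries "$fstab"
disable_swap_entries "$fstab"   # second run changes nothing
cat "$fstab"
```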

Install the required system packages:

apt-get update
apt-get install -y socat conntrack ipset

Let me explain why we need these.

socat is used by kubectl port-forward to create the tunnels between your machine and pods.
conntrack is the connection tracking tool. Kube-proxy uses it to track network connections for proper load balancing and NAT.
ipset is used by kube-proxy in IPVS mode to efficiently manage large sets of IP addresses. Even if we use iptables mode, having it installed does not hurt.

Now load the kernel modules that Kubernetes networking requires:

cat > /etc/modules-load.d/kubernetes.conf << EOF
overlay
br_netfilter
EOF

modprobe overlay
modprobe br_netfilter

The overlay module is used by containerd’s overlayfs storage driver, which efficiently layers container filesystem images.

The br_netfilter module is critical for Kubernetes networking. By default, traffic that passes through a Linux bridge stays at layer 2 and does not go through iptables. But kube-proxy’s service routing relies on iptables. The br_netfilter module makes bridged traffic visible to iptables so that service routing works correctly for pod-to-pod communication.

Verify the modules are loaded:

lsmod | grep br_netfilter
lsmod | grep overlay

Both should show output.

Now configure the required sysctl parameters:

cat > /etc/sysctl.d/kubernetes.conf << EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

sysctl --system

The bridge-nf-call-iptables settings ensure that bridged traffic goes through iptables. The ip_forward setting enables IP forwarding, which is necessary for routing traffic between pods on different nodes.
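
To confirm the values actually took effect, you can read them back from /proc/sys. The sketch below is one way to do that; check_sysctl and the PROC_SYS override are hypothetical names introduced here, not part of any tool. The bridge keys only exist once br_netfilter is loaded, so a MISSING line usually points at the kernel module rather than the sysctl file.

```shell
#!/bin/sh
# PROC_SYS is a hypothetical override so the check can be pointed at a
# test tree; by default it reads the live kernel values.
PROC_SYS=${PROC_SYS:-/proc/sys}

check_sysctl() {
  # $1 = dotted sysctl key, $2 = expected value
  path="$PROC_SYS/$(echo "$1" | tr . /)"
  if [ ! -r "$path" ]; then
    echo "MISSING $1"
    return 1
  fi
  actual=$(cat "$path")
  if [ "$actual" = "$2" ]; then
    echo "OK $1 = $actual"
  else
    echo "BAD $1 = $actual (want $2)"
    return 1
  fi
}

for key in net.bridge.bridge-nf-call-iptables \
           net.bridge.bridge-nf-call-ip6tables \
           net.ipv4.ip_forward; do
  check_sysctl "$key" 1 || true
done
```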

Create the directories we will need:

mkdir -p \
  /etc/cni/net.d \
  /opt/cni/bin \
  /var/lib/kubelet \
  /var/lib/kube-proxy \
  /var/lib/kubernetes \
  /var/run/kubernetes \
  /etc/containerd

Exit back to jumpbox to copy the binaries:

exit

From jumpbox, copy all the worker node binaries to node-0:

scp ~/downloads/kubelet ~/downloads/kube-proxy ~/downloads/kubectl \
  ~/downloads/runc ~/downloads/crictl \
  ~/downloads/containerd ~/downloads/containerd-shim-runc-v2 ~/downloads/ctr \
  root@node-0:~/

scp -r ~/downloads/cni-plugins root@node-0:~/

SSH back to node-0:

ssh root@node-0

Install the binaries to their proper locations:

mv kubelet kube-proxy kubectl /usr/local/bin/
mv runc crictl /usr/local/bin/
mv containerd containerd-shim-runc-v2 ctr /bin/
mv cni-plugins/* /opt/cni/bin/

chmod +x /usr/local/bin/kubelet /usr/local/bin/kube-proxy /usr/local/bin/kubectl
chmod +x /usr/local/bin/runc /usr/local/bin/crictl
chmod +x /bin/containerd /bin/containerd-shim-runc-v2 /bin/ctr
chmod +x /opt/cni/bin/*

Verify the binaries work:

containerd --version
kubelet --version
kube-proxy --version

All three should print version information.

The CNI configuration tells the container runtime how to set up networking for each pod. We will use the bridge plugin, which creates a Linux bridge and connects each pod to it.

Each node needs its own pod CIDR. Node-0 uses 10.200.0.0/24, which provides 254 usable addresses; the bridge takes the first one as the gateway, leaving 253 for pods. Create the bridge configuration:

cat > /etc/cni/net.d/10-bridge.conf << EOF
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.200.0.0/24"}]
    ],
    "routes": [
      {"dst": "0.0.0.0/0"}
    ]
  }
}
EOF

Let me explain this configuration:

The type is “bridge”, which means we use the bridge CNI plugin.
The bridge field specifies the name of the Linux bridge to create. All pods on this node will be connected to cni0.
The isGateway setting means the bridge will have an IP address and act as the default gateway for pods. Pods will send traffic to this bridge, which will then route it appropriately.
The ipMasq setting enables IP masquerading (NAT) for traffic leaving the node. When a pod sends traffic to an external destination, the source IP will be rewritten to the node’s IP.
The ipam section configures IP address management. We use host-local, which allocates IPs from a local pool. The subnet is 10.200.0.0/24 for node-0.
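
As a quick sanity check on the size of that pool, here is the arithmetic in shell. usable_hosts is an illustrative helper, not part of any CNI tooling. A /24 has 254 usable host addresses, and because isGateway is set the bridge takes one of them for itself, which is why the first pod on a node ends up at .2.

```shell
#!/bin/sh
usable_hosts() {
  # $1 = IPv4 prefix length (1 to 30); subtract network and broadcast
  echo $(( (1 << (32 - $1)) - 2 ))
}

echo "addresses in a /24: $(usable_hosts 24)"
echo "left for pods once the bridge takes the gateway address: $(( $(usable_hosts 24) - 1 ))"
```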

Now create the loopback configuration:

cat > /etc/cni/net.d/99-loopback.conf << EOF
{
  "cniVersion": "1.0.0",
  "name": "lo",
  "type": "loopback"
}
EOF

This ensures that every pod has a working loopback interface.

Create the containerd configuration file:

cat > /etc/containerd/config.toml << EOF
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"
    default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
EOF

The key settings here are:

The snapshotter is overlayfs, which is the most efficient storage driver for container layers on Linux.
The SystemdCgroup setting is important. It tells containerd to use systemd as the cgroup driver. This must match what kubelet uses, and since systemd is the init system on our Debian nodes, we should use systemd for cgroups too. Mismatched cgroup drivers cause all sorts of mysterious issues.
The cni section tells containerd where to find CNI plugins and configuration.

Create the systemd unit file for containerd.

cat > /etc/systemd/system/containerd.service << EOF
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/bin/containerd
Restart=always
RestartSec=5
Delegate=yes
KillMode=process
OOMScoreAdjust=-999
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity

[Install]
WantedBy=multi-user.target
EOF

The Delegate=yes setting is important. It allows containerd to manage cgroups for containers it creates. Without this, systemd might interfere with container resource limits.

Move the certificates and kubeconfig into place. First, let us check what we have in the home directory:

ls -la ~/*.pem ~/node-0.kubeconfig

You should see ca.pem, node-0.pem, node-0-key.pem, and node-0.kubeconfig.

Move them to the appropriate locations:

mv ~/ca.pem /var/lib/kubernetes/
mv ~/node-0.pem ~/node-0-key.pem /var/lib/kubelet/
mv ~/node-0.kubeconfig /var/lib/kubelet/kubeconfig

Create the kubelet configuration file:

cat > /var/lib/kubelet/kubelet-config.yaml << EOF
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: "/var/lib/kubernetes/ca.pem"
authorization:
  mode: Webhook
cgroupDriver: systemd
clusterDomain: "cluster.local"
clusterDNS:
  - "10.32.0.10"
podCIDR: "10.200.0.0/24"
resolvConf: "/run/systemd/resolve/resolv.conf"
runtimeRequestTimeout: "15m"
tlsCertFile: "/var/lib/kubelet/node-0.pem"
tlsPrivateKeyFile: "/var/lib/kubelet/node-0-key.pem"
registerNode: true
EOF

Let me explain the important settings.

The authentication section configures how kubelet authenticates incoming requests. We disable anonymous access and enable webhook authentication, which means kubelet will check with the API server to verify tokens. We also configure x509 authentication using our CA.
The authorization mode is Webhook, which means kubelet will ask the API server whether a given request is authorized. This is more secure than AlwaysAllow and integrates with Kubernetes RBAC.
The cgroupDriver must be systemd to match containerd’s configuration.
The clusterDNS setting points to 10.32.0.10, which is where CoreDNS will run. We have not set up DNS yet, but kubelet needs to know the address to configure pods correctly.
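The podCIDR tells kubelet which IP range to use for pods on this node. Kubelet only honors this field in standalone mode; in our cluster the pod IPs actually come from the CNI configuration, so keep the two values in sync to avoid confusion.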
The resolvConf setting points to the systemd-resolved configuration. This is used as a base for pod DNS configuration.
The tlsCertFile and tlsPrivateKeyFile are the node’s certificate and key, used when kubelet serves its HTTPS API.
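
Since a cgroup driver mismatch is easy to introduce and hard to diagnose, a small check can help. The sketch below is grep-based rather than a real TOML or YAML parser, and cgroup_drivers_match is a hypothetical helper name. The demo runs against throwaway sample files so it can be tried off-node.

```shell
#!/bin/sh
# Succeeds only when containerd has SystemdCgroup = true and kubelet has
# cgroupDriver: systemd.
cgroup_drivers_match() {
  # $1 = containerd config.toml, $2 = kubelet config YAML
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1" &&
    grep -Eq '^cgroupDriver:[[:space:]]*"?systemd"?[[:space:]]*$' "$2"
}

# Demo against throwaway sample files
tmp=$(mktemp -d)
printf '    SystemdCgroup = true\n' > "$tmp/config.toml"
printf 'cgroupDriver: systemd\n' > "$tmp/kubelet-config.yaml"

if cgroup_drivers_match "$tmp/config.toml" "$tmp/kubelet-config.yaml"; then
  echo "cgroup drivers match (systemd)"
else
  echo "cgroup driver mismatch"
fi
```

On a worker node you would call it with /etc/containerd/config.toml and /var/lib/kubelet/kubelet-config.yaml instead of the sample files.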

Create the kubelet systemd unit file.

cat > /etc/systemd/system/kubelet.service << EOF
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet \\
  --config=/var/lib/kubelet/kubelet-config.yaml \\
  --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock \\
  --kubeconfig=/var/lib/kubelet/kubeconfig \\
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

The container-runtime-endpoint tells kubelet how to communicate with containerd via its Unix socket.
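
One detail in the unit file is easy to miss: the doubled backslashes. Because the heredoc delimiter is unquoted, the shell processes escapes inside it, so each \\ is written to the file as a single \, which systemd then treats as a line continuation. A small sketch that writes to a temp file instead of /etc/systemd/system shows the effect:

```shell
#!/bin/sh
# Write a two-line fragment with an unquoted heredoc, as the tutorial does.
unit=$(mktemp)
cat > "$unit" << EOF
ExecStart=/usr/local/bin/kubelet \\
  --v=2
EOF

# The file now contains a single trailing backslash on the first line.
cat "$unit"
```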

Move the kube-proxy kubeconfig:

mv ~/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig

Create the kube-proxy configuration file:

cat > /var/lib/kube-proxy/kube-proxy-config.yaml << EOF
kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig"
mode: "iptables"
clusterCIDR: "10.200.0.0/16"
EOF

The mode is set to iptables, which is the traditional and most compatible mode. Kube-proxy will create iptables rules for service routing.

The clusterCIDR is the overall pod network range. This is used for determining which traffic is internal to the cluster.
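
To make "internal to the cluster" concrete, the sketch below tests whether an address falls inside 10.200.0.0/16 using plain shell arithmetic. ip_to_int and in_cidr are illustrative helpers written for this post; kube-proxy does its own matching and does not use anything like this script.

```shell
#!/bin/sh
ip_to_int() (
  # Convert a dotted-quad IPv4 address to an integer.
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

in_cidr() (
  # $1 = IPv4 address, $2 = CIDR such as 10.200.0.0/16
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
)

for ip in 10.200.1.2 10.32.0.10 192.168.10.101; do
  if in_cidr "$ip" 10.200.0.0/16; then
    echo "$ip is inside the pod range"
  else
    echo "$ip is outside the pod range"
  fi
done
```

A pod address on either node matches, while the cluster DNS address and a node address do not.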

Create the systemd unit file:

cat > /etc/systemd/system/kube-proxy.service << EOF
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-proxy \\
  --config=/var/lib/kube-proxy/kube-proxy-config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

Now let us start all the services:

systemctl daemon-reload
systemctl enable containerd kubelet kube-proxy
systemctl start containerd kubelet kube-proxy

Wait a few seconds, then check their status:

systemctl status containerd --no-pager

It should show active (running).

systemctl status kubelet --no-pager

This should also show active (running). If it shows failed or you see errors, check the logs:

journalctl -u kubelet --no-pager -n 50

Common issues at this stage include certificate problems (wrong paths or permissions), kubeconfig issues, or containerd not running.

systemctl status kube-proxy --no-pager

This should be running as well.

Now we need to repeat the process for node-1. I will go through it more quickly since the steps are the same, just with different node-specific values.

Exit from node-0:

exit

From jumpbox, copy the binaries to node-1:

scp ~/downloads/kubelet ~/downloads/kube-proxy ~/downloads/kubectl \
  ~/downloads/runc ~/downloads/crictl \
  ~/downloads/containerd ~/downloads/containerd-shim-runc-v2 ~/downloads/ctr \
  root@node-1:~/

scp -r ~/downloads/cni-plugins root@node-1:~/

SSH into node-1:

ssh root@node-1

Install system dependencies:

apt-get update
apt-get install -y socat conntrack ipset

Load kernel modules:

cat > /etc/modules-load.d/kubernetes.conf << EOF
overlay
br_netfilter
EOF

modprobe overlay
modprobe br_netfilter

Configure sysctl:

cat > /etc/sysctl.d/kubernetes.conf << EOF
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

sysctl --system

Create directories:

mkdir -p \
  /etc/cni/net.d \
  /opt/cni/bin \
  /var/lib/kubelet \
  /var/lib/kube-proxy \
  /var/lib/kubernetes \
  /var/run/kubernetes \
  /etc/containerd

Install binaries:

mv kubelet kube-proxy kubectl /usr/local/bin/
mv runc crictl /usr/local/bin/
mv containerd containerd-shim-runc-v2 ctr /bin/
mv cni-plugins/* /opt/cni/bin/

chmod +x /usr/local/bin/kubelet /usr/local/bin/kube-proxy /usr/local/bin/kubectl
chmod +x /usr/local/bin/runc /usr/local/bin/crictl
chmod +x /bin/containerd /bin/containerd-shim-runc-v2 /bin/ctr
chmod +x /opt/cni/bin/*

Create CNI configuration. Note the different subnet for node-1:

cat > /etc/cni/net.d/10-bridge.conf << EOF
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "10.200.1.0/24"}]
    ],
    "routes": [
      {"dst": "0.0.0.0/0"}
    ]
  }
}
EOF

cat > /etc/cni/net.d/99-loopback.conf << EOF
{
  "cniVersion": "1.0.0",
  "name": "lo",
  "type": "loopback"
}
EOF
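
The bridge configuration on node-0 and node-1 differs only in the subnet, so if you add more workers it may be easier to generate the file than to paste it. A minimal sketch, assuming the same settings as above; render_bridge_conf is a hypothetical helper name:

```shell
#!/bin/sh
render_bridge_conf() {
  # $1 = pod subnet for this node, e.g. 10.200.0.0/24
  cat << EOF
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "$1"}]
    ],
    "routes": [
      {"dst": "0.0.0.0/0"}
    ]
  }
}
EOF
}

# Print the node-1 variant to stdout
render_bridge_conf 10.200.1.0/24
```

On a node you would redirect the output to /etc/cni/net.d/10-bridge.conf.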

Create containerd configuration (same as node-0):

cat > /etc/containerd/config.toml << EOF
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"
    default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
EOF

cat > /etc/systemd/system/containerd.service << EOF
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/bin/containerd
Restart=always
RestartSec=5
Delegate=yes
KillMode=process
OOMScoreAdjust=-999
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity

[Install]
WantedBy=multi-user.target
EOF

Move certificates and kubeconfig:

mv ~/ca.pem /var/lib/kubernetes/
mv ~/node-1.pem ~/node-1-key.pem /var/lib/kubelet/
mv ~/node-1.kubeconfig /var/lib/kubelet/kubeconfig

Create kubelet configuration. Note the different certificate filenames and podCIDR:

cat > /var/lib/kubelet/kubelet-config.yaml << EOF
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: "/var/lib/kubernetes/ca.pem"
authorization:
  mode: Webhook
cgroupDriver: systemd
clusterDomain: "cluster.local"
clusterDNS:
  - "10.32.0.10"
podCIDR: "10.200.1.0/24"
resolvConf: "/run/systemd/resolve/resolv.conf"
runtimeRequestTimeout: "15m"
tlsCertFile: "/var/lib/kubelet/node-1.pem"
tlsPrivateKeyFile: "/var/lib/kubelet/node-1-key.pem"
registerNode: true
EOF

cat > /etc/systemd/system/kubelet.service << EOF
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet \\
  --config=/var/lib/kubelet/kubelet-config.yaml \\
  --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock \\
  --kubeconfig=/var/lib/kubelet/kubeconfig \\
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

Move kube-proxy kubeconfig and create configuration:

mv ~/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig

cat > /var/lib/kube-proxy/kube-proxy-config.yaml << EOF
kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig"
mode: "iptables"
clusterCIDR: "10.200.0.0/16"
EOF

cat > /etc/systemd/system/kube-proxy.service << EOF
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-proxy \\
  --config=/var/lib/kube-proxy/kube-proxy-config.yaml
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

Start the services:

systemctl daemon-reload
systemctl enable containerd kubelet kube-proxy
systemctl start containerd kubelet kube-proxy

Verify they are running:

systemctl status containerd kubelet kube-proxy --no-pager

All three should show active (running).

Exit from node-1 and go back to jumpbox:

exit

Now let us check if the nodes have registered with the cluster:

kubectl get nodes

You should see both nodes:

NAME     STATUS     ROLES    AGE   VERSION
node-0   NotReady   <none>   1m    v1.32.0
node-1   NotReady   <none>   30s   v1.32.0

The nodes may show NotReady for a short while after registering. This is not caused by the missing pod network routes: kubelet sets the Ready condition from its own health and from what the container runtime reports, so routing between nodes does not enter into it. The usual cause at this stage is that the runtime has not found a valid CNI configuration yet.

If the status does not change, check the node conditions:

kubectl describe node node-0 | grep -A8 Conditions

Look for the Ready condition. If it says “container runtime network not ready”, that is a CNI issue. If it says something about kubelet, check the kubelet logs on that node.

The nodes may well show Ready already, because the CNI plugin is configured correctly for local pods. Let us proceed with setting up routes anyway, as they are necessary for pods on different nodes to communicate.

Here is the problem we need to solve. Node-0 has pods in the 10.200.0.0/24 range. Node-1 has pods in the 10.200.1.0/24 range. When a pod on node-0 wants to talk to a pod on node-1, the packet needs to get from node-0 to node-1. But node-0 does not know that 10.200.1.0/24 is reachable via node-1’s IP address.

We need to add static routes on each machine so they know how to reach each other’s pod networks.

On the server (control plane), we need routes to both worker nodes’ pod networks in case we want to access pods from there. SSH to server:

ssh root@server

Add the routes:

ip route add 10.200.0.0/24 via 192.168.10.101
ip route add 10.200.1.0/24 via 192.168.10.102

Verify:

ip route

You should see the new routes listed.

Exit and go to node-0:

exit
ssh root@node-0

Add the route to node-1’s pod network:

ip route add 10.200.1.0/24 via 192.168.10.102

Verify:

ip route

Exit and go to node-1:

exit
ssh root@node-1

Add the route to node-0’s pod network:

ip route add 10.200.0.0/24 via 192.168.10.101

Verify:

ip route

Exit back to jumpbox:

exit

These routes are not persistent. If you reboot a machine, the routes will be gone. For a production setup, you would add them to the network configuration files or use a CNI plugin like Calico or Flannel that handles routing automatically. For our lab, we can just re-add them if needed.
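
Since the routes have to be re-added after a reboot, it can help to generate them from one table. The sketch below only prints the commands; it does not touch the routing table. The node names, IPs, and pod CIDRs are the ones used in this tutorial, routes_for is an illustrative helper, and I use ip route replace instead of add so that re-running the output does not fail when a route already exists.

```shell
#!/bin/sh
# node name, node IP, pod CIDR (values from this tutorial)
NODES="node-0 192.168.10.101 10.200.0.0/24
node-1 192.168.10.102 10.200.1.0/24"

routes_for() {
  # $1 = machine the commands are meant for (server, node-0, node-1).
  # A node never needs a route to its own pod CIDR, so skip that entry.
  echo "$NODES" | while read -r name ip cidr; do
    [ "$name" = "$1" ] && continue
    echo "ip route replace $cidr via $ip"
  done
}

routes_for server
```

On node-0, for example, you could review the output of routes_for node-0 and then pipe it to sh as root.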

Now let us check the nodes again:

kubectl get nodes

You should see:

NAME     STATUS   ROLES    AGE   VERSION
node-0   Ready    <none>   5m    v1.32.0
node-1   Ready    <none>   4m    v1.32.0

Both nodes should now show Ready. If they still show NotReady, check the kubelet logs on each node for errors.

Let us get more details about the nodes:

kubectl get nodes -o wide

This shows additional information including the internal IP, OS, kernel version, and container runtime.

kubectl describe node node-0

This shows detailed information about node-0, including its capacity, allocatable resources, conditions, and events. Look at the Conditions section to verify that all conditions are healthy.

Let us create a simple pod to verify everything works. Create a test deployment:

kubectl create deployment nginx --image=nginx --replicas=2

Watch the pods come up:

kubectl get pods -o wide -w

Press Ctrl+C after you see both pods running. You should see something like:

NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE
nginx-77b4fdf86c-abc12   1/1     Running   0          30s   10.200.0.2   node-0
nginx-77b4fdf86c-def34   1/1     Running   0          30s   10.200.1.2   node-1

The scheduler has placed one pod on each node, and each pod has an IP from its node’s pod CIDR.

Let us verify the pods can communicate with each other. First, get the pod IPs:

kubectl get pods -o wide

Note the IP addresses. Let us say the pod on node-0 has IP 10.200.0.2 and the pod on node-1 has IP 10.200.1.2.

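Exec into the pod on node-0 so we can reach the pod on node-1 from it. The field selector makes sure we pick the pod that is actually running on node-0:

kubectl exec -it $(kubectl get pods --field-selector spec.nodeName=node-0 -o jsonpath='{.items[0].metadata.name}') -- /bin/bash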

Inside the pod, try to reach the other pod:

apt-get update && apt-get install -y curl
curl -s 10.200.1.2

You should see the nginx welcome page HTML. If this works, pod-to-pod networking across nodes is functional.

Exit the pod:

exit

Let me walk through what happens when you create a pod, so you can see how all the pieces fit together.

When you run kubectl create deployment, kubectl sends a request to the API server to create a Deployment object.
The API server authenticates your request using the admin certificate, authorizes it via RBAC (the system:masters group has full access), and stores the Deployment in etcd.
The Deployment controller, running in kube-controller-manager, notices the new Deployment and creates a ReplicaSet.
The ReplicaSet controller notices the ReplicaSet and creates Pod objects (two of them, since we specified replicas=2).
The scheduler notices the new Pods that have no node assigned. It evaluates each node’s capacity and constraints, selects a node for each pod, and updates the Pod objects with the node assignment.
The kubelet on each selected node notices that it has a new pod to run. It calls containerd via CRI to pull the nginx image and create a container.
Containerd pulls the image layers, creates an overlayfs mount for the container filesystem, and calls runc to create the actual container process with the appropriate namespaces and cgroups.
Before the container starts, the pod's networking has to be set up. Kubelet requests this through CRI, and containerd is the component that actually invokes the CNI plugin. The bridge plugin creates a veth pair, puts one end in the container’s network namespace, connects the other end to the cni0 bridge, and assigns an IP address from the node’s pod CIDR.
The container starts and nginx begins listening on port 80.
Kubelet reports the pod status back to the API server, and you can see it with kubectl get pods.

This whole process happens in seconds, but there is a lot of machinery involved. Every piece we configured plays a role.
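
You can observe most of this chain from the jumpbox. Here is a small sketch under a few assumptions: trace_deployment is a hypothetical helper name, the KUBECTL variable is only there so the commands can be previewed without a cluster, and the app label is the one kubectl create deployment adds by default.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands
# without touching a cluster.
KUBECTL=${KUBECTL:-kubectl}

# trace_deployment is a hypothetical helper. It lists the objects that the
# controllers created for one app label (Deployment -> ReplicaSet -> Pods),
# then prints the recorded events in creation order.
trace_deployment() {
  app=$1
  "$KUBECTL" get deployments,replicasets,pods -l "app=$app" -o wide
  "$KUBECTL" get events --sort-by=.metadata.creationTimestamp
}

# Example (commented out; needs the live cluster and an existing deployment):
#   trace_deployment nginx
```

In the events you would expect to see entries from the deployment controller, the ReplicaSet controller, the scheduler, and the kubelet, roughly in the order described above.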

Let us remove the test deployment:

kubectl delete deployment nginx

Verify the pods are gone:

kubectl get pods

You should see “No resources found”.

We have accomplished a lot in this part. Let us recap:

We installed containerd as the container runtime on both worker nodes, configuring it with the proper cgroup driver and CNI settings.
We installed and configured kubelet on each node, setting up authentication, authorization, and integration with containerd.
We installed and configured kube-proxy on each node to handle service routing via iptables.
We set up CNI networking using the bridge plugin, giving each node its own pod CIDR range.
We configured static routes so that pods on different nodes can communicate with each other.
We verified that nodes registered with the cluster and show Ready status.
We tested pod deployment and cross-node pod communication.

The cluster is now fully functional. You can deploy workloads, and they will be scheduled across the two worker nodes. Pods can communicate with each other regardless of which node they are on.

There are still some things missing that you would want in a production cluster. We do not have DNS, so pods cannot resolve service names. We do not have a storage solution for persistent volumes. We do not have an ingress controller for external access. But the core cluster is working.

In Part 4, we will run a comprehensive smoke test to verify all aspects of the cluster, including secrets encryption, service exposure via NodePort, kubectl exec and logs, and more. We will also discuss what we learned and what you might want to explore next.

Before moving on, make sure you can run these commands successfully from the jumpbox:

kubectl get nodes
kubectl get pods -A
kubectl get namespaces

The nodes should show Ready. There should be no pods in any namespace (we deleted the test deployment). The namespaces should include default, kube-system, kube-public, and kube-node-lease.
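
The namespace check is easy to automate. This is a sketch only: missing_namespaces is a hypothetical helper that reads the output of kubectl get namespaces -o name and reports which of the four expected namespaces are absent.

```shell
# missing_namespaces is a hypothetical helper. It reads
# `kubectl get namespaces -o name` output on stdin (lines such as
# "namespace/default") and prints each expected namespace that is absent.
# It prints nothing when all four are present.
missing_namespaces() {
  present=$(cat)
  for ns in default kube-system kube-public kube-node-lease; do
    printf '%s\n' "$present" | grep -qxF "namespace/$ns" || echo "$ns"
  done
}

# Example (commented out; needs the live cluster):
#   kubectl get namespaces -o name | missing_namespaces
```

Empty output means the cluster has every namespace listed above.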

The Fourth Part: Smoke Test and Conclusion

We have come a long way. In Part 1, we set up our lab environment with four virtual machines. In Part 2, we created certificates, configured etcd, and brought up the control plane. In Part 3, we installed containerd, kubelet, and kube-proxy on the worker nodes and configured pod networking.

Now it is time to verify that everything actually works. A smoke test is a quick set of checks that exercises the major functionality of a system. If the smoke test passes, you can be reasonably confident that the system is working correctly. If it fails, you know something is wrong and can start debugging.

We will test data encryption; pod deployment; the kubectl port-forward, logs, and exec commands; service exposure via NodePort; and cross-node pod communication. By the end, you will have confidence that your hand-built cluster is fully functional.
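
If you want to rerun these checks later, a tiny runner keeps the results readable. The sketch below is an illustration and not something this series ships: run_check is a hypothetical helper that runs any command quietly and prints PASS or FAIL with a label.

```shell
# run_check is a hypothetical helper: it runs a command with its output
# discarded, prints PASS or FAIL with the given label, and returns non-zero
# when the command failed.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    return 1
  fi
}

# Example wiring (commented out because it needs the live cluster):
#   run_check "API server reachable" kubectl get nodes
#   run_check "test secret exists"   kubectl get secret smoke-test-secret
```

Each manual step in this part could be wrapped the same way once you have confirmed it by hand.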

Make sure you are on the jumpbox and can communicate with the cluster:

vagrant ssh jumpbox

Verify kubectl is working:

kubectl get nodes

You should see both worker nodes in Ready state:

NAME     STATUS   ROLES    AGE   VERSION
node-0   Ready    <none>   30m   v1.32.0
node-1   Ready    <none>   29m   v1.32.0

If the nodes show NotReady or you see connection errors, go back to Part 3 and verify that all services are running on each node.
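
To turn that eyeball check into something scriptable, you could filter the node list. A minimal sketch, assuming the default column layout of kubectl get nodes; not_ready_nodes is a hypothetical helper name.

```shell
# not_ready_nodes is a hypothetical helper. It reads
# `kubectl get nodes --no-headers` output on stdin and prints the name of every
# node whose STATUS column is not exactly "Ready". A cordoned node
# ("Ready,SchedulingDisabled") is therefore reported too.
# The exit status is non-zero if at least one node was printed.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1; bad = 1 } END { exit bad }'
}

# Example (commented out; needs the live cluster):
#   kubectl get nodes --no-headers | not_ready_nodes || echo "fix these nodes first"
```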

In Part 2, we configured the API server to encrypt secrets before storing them in etcd. Let us verify this is actually working.

Create a test secret:

kubectl create secret generic smoke-test-secret \
  --from-literal=username=admin \
  --from-literal=password=supersecretpassword

Verify the secret was created:

kubectl get secret smoke-test-secret

You should see:

NAME                TYPE     DATA   AGE
smoke-test-secret   Opaque   2      5s

Now let us retrieve the secret through kubectl and verify we can read it:

kubectl get secret smoke-test-secret -o jsonpath='{.data.password}' | base64 -d
echo

This should print “supersecretpassword”. The base64 decode is necessary because the API returns secret values as base64-encoded strings.
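
Keep in mind that base64 is a reversible encoding, not encryption: anyone who sees the encoded value can decode it. The short sketch below shows the round trip locally. The encode_secret and decode_secret names are hypothetical helpers, and base64 -d assumes the same coreutils-style tool used above.

```shell
# Round-trip a value through base64 to show that it is a reversible encoding.
# encode_secret / decode_secret are hypothetical helper names.
encode_secret() { printf '%s' "$1" | base64; }
decode_secret() { printf '%s' "$1" | base64 -d; }

encoded=$(encode_secret 'supersecretpassword')
decoded=$(decode_secret "$encoded")

echo "encoded: $encoded"
echo "decoded: $decoded"
```

No key is involved at any point, which is why base64 alone offers no protection for data at rest.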

But here is the important part: let us check how the secret is actually stored in etcd. SSH into the server:

ssh root@server

Query etcd directly for the secret:

etcdctl get /registry/secrets/default/smoke-test-secret | hexdump -C | head -30

Look at the output carefully. Near the beginning, you should see something like:

00000030  79 0a 6b 38 73 3a 65 6e  63 3a 61 65 73 63 62 63  |y.k8s:enc:aescbc|
00000040  3a 76 31 3a 6b 65 79 31  3a                       |:v1:key1:|

The “k8s:enc:aescbc:v1:key1” prefix indicates that the data is encrypted using AES-CBC with the key named “key1” that we configured in the encryption config. The actual secret values (username and password) are not visible in plaintext.

If you see “k8s:enc:aescbc” in the output, encryption is working correctly. If you see the plaintext values “admin” or “supersecretpassword” anywhere in the hexdump, something is wrong with the encryption configuration.
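
Reading a hexdump by eye is error-prone, so here is a sketch of the same check as a script you could run on the server. check_secret_encrypted is a hypothetical helper, and it assumes a grep that supports -a for binary input (GNU and BSD grep both do).

```shell
# check_secret_encrypted is a hypothetical helper. It reads a raw etcd value on
# stdin and succeeds only if the aescbc provider prefix is present AND none of
# the plaintext strings passed as arguments appear in the data.
check_secret_encrypted() {
  tmp=$(mktemp) || return 2
  cat > "$tmp"                      # buffer stdin; the value is binary
  status=0
  grep -aq 'k8s:enc:aescbc:v1:' "$tmp" || status=1
  for plain in "$@"; do
    if grep -aqF -- "$plain" "$tmp"; then
      status=1                      # a plaintext value leaked into etcd
    fi
  done
  rm -f "$tmp"
  return "$status"
}

# Example on the server (commented out; needs etcdctl and the test secret):
#   etcdctl get /registry/secrets/default/smoke-test-secret \
#     | check_secret_encrypted admin supersecretpassword \
#     && echo "secret is encrypted at rest"
```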

Exit back to jumpbox:

exit

This encryption protects your secrets if someone gains access to etcd directly, through a backup, or by stealing the disk. They would need the encryption key to decrypt the data. Of course, anyone with kubectl access and appropriate RBAC permissions can still read secrets through the API, but that is expected and controlled through authorization policies.

Let us test that the scheduler correctly places pods across nodes and that containerd can pull images and run containers.

Create a deployment with multiple replicas:

kubectl create deployment nginx --image=nginx:latest --replicas=3

Watch the pods come up:

kubectl get pods -o wide -w

You should see the pods transition from Pending to ContainerCreating to Running. The scheduler should distribute them across both nodes. Press Ctrl+C once all pods show Running.

Let us verify the distribution:

kubectl get pods -o wide

You should see something like:

NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE
nginx-77b4fdf86c-2xkpm   1/1     Running   0          45s   10.200.0.2   node-0
nginx-77b4fdf86c-8fwqz   1/1     Running   0          45s   10.200.1.2   node-1
nginx-77b4fdf86c-kx9rm   1/1     Running   0          45s   10.200.0.3   node-0

The exact distribution depends on the scheduler’s algorithm, but you should see pods on both nodes. Notice that each pod has an IP from its node’s pod CIDR: pods on node-0 get IPs from 10.200.0.0/24, and pods on node-1 get IPs from 10.200.1.0/24.
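
Both observations can be checked mechanically. The sketch below uses two hypothetical helpers, pods_per_node and check_pod_cidrs, and assumes the layout used in this lab, where node-N owns 10.200.N.0/24. Both read one pod per line as "NAME IP NODE".

```shell
# Input for both helpers (stdin): "NAME IP NODE" per line, for example from
#   kubectl get pods -l app=nginx --no-headers \
#     -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName

# pods_per_node (hypothetical): print "<node> <pod count>" for each node.
pods_per_node() {
  awk '{ c[$3]++ } END { for (n in c) print n, c[n] }' | sort
}

# check_pod_cidrs (hypothetical): assumes node-N owns 10.200.N.0/24, as in this
# lab. Prints a MISMATCH line for any pod whose IP is outside its node's range
# and returns non-zero if there was at least one mismatch.
check_pod_cidrs() {
  awk '
    {
      node_index = $3
      sub(/^node-/, "", node_index)
      prefix = "10.200." node_index "."
      if (index($2, prefix) != 1) {
        print "MISMATCH: " $1 " " $2 " on " $3
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

A mismatch would suggest that the CNI configuration on a node uses the wrong subnet.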

If pods are stuck in Pending, check if there are any events:

kubectl describe pod <pod-name>

Look at the Events section at the bottom. Common issues include insufficient resources, image pull failures, or scheduler errors.

If pods are stuck in ContainerCreating, the issue is usually with containerd or CNI. SSH into the affected node and check the kubelet logs:

journalctl -u kubelet --no-pager -n 100

The kubectl port-forward command creates a tunnel from your local machine to a pod. This is useful for debugging and accessing services that are not exposed externally.

First, get the name of one of the nginx pods:

POD_NAME=$(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}')
echo "$POD_NAME"

Now create a port forward. We will forward local port 8080 to the pod’s port 80:

kubectl port-forward "$POD_NAME" 8080:80 &

The ampersand runs the command in the background so we can continue using the terminal. You should see:

Forwarding from 127.0.0.1:8080 -> 80
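
Because the port-forward runs in the background, a script that continues immediately may hit the port before the tunnel is ready. One way to handle that is a small retry loop. This is a sketch, and retry is a hypothetical helper rather than anything kubectl provides.

```shell
# retry is a hypothetical helper: run a command up to ATTEMPTS times, sleeping
# DELAY seconds between attempts, and stop at the first success.
# Usage: retry ATTEMPTS DELAY command [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    n=$((n + 1))
    # Only sleep if another attempt is coming.
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (commented out; needs the port-forward from above to be running):
#   retry 10 1 curl -sf -o /dev/null http://127.0.0.1:8080 && echo "tunnel is up"
```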

Now test the connection:

curl -s http://127.0.0.1:8080 | head -5

You should see the beginning of the nginx welcome page:

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>