<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Sigrid Jin on Medium]]></title>
        <description><![CDATA[Stories by Sigrid Jin on Medium]]></description>
        <link>https://medium.com/@sigridjin?source=rss-9e186deee13f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*maH-p1zU2r53U5DqlUHQfw@2x.jpeg</url>
            <title>Stories by Sigrid Jin on Medium</title>
            <link>https://medium.com/@sigridjin?source=rss-9e186deee13f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 23 Jun 2026 16:44:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@sigridjin/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Hands-On Guide to RKE2 and Cluster API for On-Premise Kubernetes]]></title>
            <link>https://sigridjin.medium.com/a-hands-on-guide-to-rke2-and-cluster-api-for-on-premise-kubernetes-4b1c2c96be2d?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/4b1c2c96be2d</guid>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sun, 22 Feb 2026 20:01:04 GMT</pubDate>
            <atom:updated>2026-02-22T20:01:04.989Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6b9pbPcGOcxS8hdERe0RWA.jpeg" /></figure><p>If you’ve ever managed Kubernetes clusters on bare-metal or on-premise infrastructure, you know the pain: installing, upgrading, securing, and scaling clusters is <strong>a lot of manual work</strong>. Two tools from the Kubernetes ecosystem tackle these challenges head-on: <strong>RKE2</strong> and <strong>Cluster API (CAPI)</strong>.</p><p>In this post, we’ll get our hands dirty with both. But first, let’s understand <em>why</em> they exist and <em>what problems</em> they solve. <strong>RKE2</strong> (also known as <em>RKE Government</em>) is a Kubernetes distribution developed by <strong>Rancher (now part of SUSE)</strong>.</p><p>Here’s how simple the installation looks. A single script installs the rke2-server (or rke2-agent) systemd service and the binary, and the required container images are pulled when the service first starts. Compare that to the dozens of steps needed for a traditional kubeadm setup, and <strong>the difference is night and day</strong>.</p><pre># Download and run the installer, that&#39;s it!<br>curl -sfL https://get.rke2.io --output install.sh<br>chmod +x install.sh<br>INSTALL_RKE2_CHANNEL=v1.33 ./install.sh<br><br># Verify<br>rke2 --version<br># rke2 version v1.33.7+rke2r3 (...)</pre><h3>RKE2 Architecture at a Glance</h3><p>RKE2 follows a familiar <strong>server + agent</strong> model.</p><blockquote>📖 For the official architecture diagram, see <a href="https://docs.rke2.io/architecture">docs.rke2.io/architecture</a>.</blockquote><p><strong>Boot sequence:</strong></p><ol><li>The agent (kubelet) starts up</li><li>Static pod manifests are written to disk</li><li>Control-plane pods (etcd, apiserver, etc.) 
are launched by kubelet</li></ol><pre>┌──────────────────────────────────────────────┐<br>│              RKE2 Server Node                │<br>│           (= Control Plane Node)             │<br>│                                              │<br>│  ┌─────────────────────────────────────────┐ │<br>│  │         RKE2 Supervisor Process         │ │<br>│  │  ┌──────────┐  ┌────────────────────┐   │ │<br>│  │  │ kubelet  │  │    containerd      │   │ │<br>│  │  └────┬─────┘  └────────────────────┘   │ │<br>│  │       │                                 │ │<br>│  │       ▼  watches pod-manifests/         │ │<br>│  │  ┌──────────────────────────────────┐   │ │<br>│  │  │  Static Pods:                    │   │ │<br>│  │  │  • etcd                          │   │ │<br>│  │  │  • kube-apiserver                │   │ │<br>│  │  │  • kube-controller-manager       │   │ │<br>│  │  │  • kube-scheduler                │   │ │<br>│  │  └──────────────────────────────────┘   │ │<br>│  └─────────────────────────────────────────┘ │<br>└──────────────────────────────────────────────┘<br><br>┌──────────────────────────────────────────────┐<br>│              RKE2 Agent Node                 │<br>│            (= Worker Node)                   │<br>│                                              │<br>│  ┌─────────────────────────────────────────┐ │<br>│  │         RKE2 Agent Process              │ │<br>│  │  ┌──────────┐  ┌────────────────────┐   │ │<br>│  │  │ kubelet  │  │    containerd      │   │ │<br>│  │  └──────────┘  └────────────────────┘   │ │<br>│  └─────────────────────────────────────────┘ │<br>└──────────────────────────────────────────────┘</pre><h3>What is Cluster API?</h3><p>Now imagine you don’t just have <em>one</em> cluster to manage — you have <strong>tens or hundreds</strong>. Installing each one manually (even with RKE2) becomes a bottleneck. 
This is where <strong>Cluster API (CAPI)</strong> comes in.</p><p><strong>Cluster API</strong> is a Kubernetes sub-project that lets you <strong>manage Kubernetes clusters themselves as Kubernetes resources</strong>. In other words: <em>you use Kubernetes to create, configure, and upgrade Kubernetes clusters.</em></p><p>It introduces custom resources such as the following.</p><pre># These are just regular Kubernetes objects!<br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: Cluster          # Represents an entire K8s cluster<br><br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: Machine          # Represents a single node (VM or bare-metal)<br><br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: MachineDeployment  # Like a Deployment, but for Machines</pre><p>The key concept is the separation between two types of clusters.</p><ul><li>Management Cluster: The cluster where Cluster API is deployed and its controllers run. Think of it as the “control tower.”</li><li>Workload Cluster: A cluster that was provisioned <em>through</em> Cluster API. 
This is where your applications actually run.</li></ul><pre>┌─────────────────────────────────┐      ┌─────────────────────────────────┐<br>│      Management Cluster         │      │       Workload Cluster(s)       │<br>│                                 │      │                                 │<br>│  • Runs Cluster API controllers │─────▶│  • Provisioned BY Cluster API   │<br>│  • Stores cluster definitions   │      │  • Runs your actual workloads   │<br>│  • The &quot;brain&quot; of operations    │      │  • Can be upgraded declaratively│<br>│                                 │      │                                 │<br>└─────────────────────────────────┘      └─────────────────────────────────┘</pre><p>So, here is the bottom line.</p><ul><li><strong>RKE2</strong> solves the problem of <em>installing and running a single, secure Kubernetes cluster</em> with minimal effort.</li><li><strong>Cluster API</strong> solves the problem of <em>managing the lifecycle of many clusters at scale</em> — creation, upgrades, scaling — all declaratively.</li></ul><p>In this hands-on guide, we’ll walk through:</p><ul><li><strong>RKE2</strong>: Install a server + agent cluster from scratch, deploy a sample app, manage certificates, and perform both manual and automated upgrades.</li><li><strong>Cluster API</strong>: Set up a management cluster, provision a full workload cluster (3 control-plane + 3 worker nodes), deploy apps, upgrade the cluster version with a single kubectl patch, and tear it all down.</li></ul><h3>RKE2 Overview — Architecture, Security-First Design, and How It Works</h3><p>RKE2 is a Kubernetes distribution developed by Rancher (now part of SUSE). If you have ever worked with K3s, you will immediately notice a familiar philosophy here — keep things simple, ship a single binary, and get out of the operator’s way.</p><p>But RKE2 takes a slightly different path. 
While K3s was designed to be ultra-lightweight and perfect for edge and IoT scenarios, RKE2 was built with security and compliance as the top priority from day one. The name itself sometimes appears as “RKE Government” in older documentation, which gives you a hint about its original target audience: environments where passing security audits is not optional.</p><p>So what exactly makes RKE2 special? There are three pillars that define it.</p><p><strong>The first pillar is security by default.</strong> Out of the box, RKE2 is designed to pass most of the CIS (Center for Internet Security) Kubernetes Benchmark controls with minimal operator intervention. If you have ever gone through a CIS hardening exercise on a vanilla kubeadm cluster, you know how painful that can be: dozens of flags to set, file permissions to tighten, audit policies to write. RKE2 handles most of this automatically. It ships with sane defaults for etcd encryption, Pod Security Standards, audit logging, and network policies. You still have the freedom to customize everything, but the starting point is already hardened.</p><p><strong>The second pillar is operational simplicity.</strong> Just like K3s, RKE2 is distributed as a single binary. You do not need to separately install kubelet, kube-proxy, etcd, or any other control plane component by hand. A single install script pulls everything you need, registers a systemd service, and you are ready to go. Upgrades follow the same pattern: run the installer again with a newer channel, restart the service, and the cluster rolls forward. This dramatically reduces the surface area for human error during both initial deployment and ongoing maintenance.</p><p><strong>The third pillar is a container-first architecture.</strong> RKE2 minimizes its dependency on the host operating system. Instead of running control plane components as bare processes directly on the host (the way a from-scratch, binary-based installation does), RKE2 runs everything on top of containerd, using the same static-pod approach that kubeadm takes. 
The kubelet, which is managed by the RKE2 supervisor process, watches a static pod manifest directory and launches etcd, kube-apiserver, kube-controller-manager, and kube-scheduler as static pods. This means the host OS needs to provide little more than a Linux kernel: RKE2 bundles its own containerd, and the control plane lives inside containers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1016/0*6160DVf47UZCtGf4.png" /><figcaption><a href="https://docs.rke2.io/architecture">https://docs.rke2.io/architecture</a> &amp; <a href="https://blog03038.tistory.com/14">https://blog03038.tistory.com/14</a></figcaption></figure><p>Let us talk about the architecture in more detail. If you look at the official architecture diagram at <a href="https://docs.rke2.io/architecture">https://docs.rke2.io/architecture</a>, you will see that an RKE2 cluster is composed of two types of nodes: RKE2 Server Nodes and RKE2 Agent Nodes.</p><p>A Server Node is what you would normally call a control plane node. It runs etcd, the API server, the controller manager, and the scheduler, in addition to a kubelet and kube-proxy. An Agent Node is simply a worker node. It runs a kubelet and kube-proxy, and its sole job is to run your application workloads.</p><p>The interesting part is what happens inside each node. There is a single RKE2 process (called the RKE2 Supervisor) that orchestrates everything. On a server node, this supervisor process manages containerd and the kubelet internally. It writes static pod manifests into a well-known directory, and the kubelet picks them up and starts the corresponding pods. So the boot sequence looks like this: the RKE2 supervisor starts, it brings up containerd and the internal kubelet, the kubelet detects the static pod manifests that the supervisor has written, and then the control plane pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) come to life one by one. 
On an agent node the process is simpler: the supervisor starts, brings up kubelet and kube-proxy, and the node registers itself with the API server running on the server nodes.</p><p>This design has a subtle but important consequence. Because the control plane components run as static pods managed by the kubelet, you can inspect them with standard Kubernetes tooling. You can run kubectl get pods -n kube-system and see etcd-k8s-node1, kube-apiserver-k8s-node1, and so on, just like you would on a kubeadm cluster. But unlike kubeadm, you did not have to manually bootstrap any of this, because the RKE2 supervisor took care of it.</p><p>Now let us see what this looks like in practice. When you install RKE2 on a fresh server, the install script is remarkably straightforward. You download it, make it executable, and run it with a channel flag to pin your desired Kubernetes minor version.</p><pre>curl -sfL https://get.rke2.io --output install.sh<br>chmod +x install.sh<br>INSTALL_RKE2_CHANNEL=v1.33 ./install.sh</pre><p>After the script finishes, you will see output confirming that several RPM packages have been installed (on RHEL/Rocky-based systems). For example, on a Rocky Linux 9 machine you might see something like the following.</p><pre>Installed:<br>  rke2-common-1.33.7~rke2r3-0.el9.aarch64<br>  rke2-selinux-0.22-1.el9.noarch<br>  rke2-server-1.33.7~rke2r3-0.el9.aarch64</pre><p>You can verify the installed version immediately:</p><pre>rke2 --version<br><br>rke2 version v1.33.7+rke2r3 (7e4fd1a82edf497cab91c220144619bbad659cf4)<br>go version go1.24.11 X:boringcrypto</pre><p>Notice the “boringcrypto” tag in the Go version. This is not accidental. RKE2 is compiled with BoringCrypto, Google’s FIPS 140-2 validated cryptographic module. This means that all TLS operations within the RKE2 binary use FIPS-compliant cryptography, which is another nod to the security-first philosophy.</p><p>The install script also sets up YUM/DNF repositories so that future updates can be pulled in cleanly. 
You can confirm this by checking the repo list:</p><pre>dnf repolist<br><br>rancher-rke2-1.33-stable       Rancher RKE2 1.33 (v1.33)<br>rancher-rke2-common-stable     Rancher RKE2 Common (v1.33)</pre><p>And if you peek inside the repo configuration file, you will see that it points to Rancher’s official RPM mirror:</p><pre># /etc/yum.repos.d/rancher-rke2.repo<br><br>[rancher-rke2-common-stable]<br>name=Rancher RKE2 Common (v1.33)<br>baseurl=https://rpm.rancher.io/rke2/stable/common/centos/9/noarch<br>enabled=1<br>gpgcheck=1<br>repo_gpgcheck=1<br>gpgkey=https://rpm.rancher.io/public.key<br><br>[rancher-rke2-1.33-stable]<br>name=Rancher RKE2 1.33 (v1.33)<br>baseurl=https://rpm.rancher.io/rke2/stable/1.33/centos/9/aarch64<br>enabled=1<br>gpgcheck=1<br>repo_gpgcheck=1<br>gpgkey=https://rpm.rancher.io/public.key</pre><p>Both GPG signature checking and repository-level GPG checking are enabled by default. This ensures that the packages you pull have not been tampered with in transit — yet another security detail that RKE2 gets right without you having to think about it.</p><p>At this point, RKE2 is installed but not yet running. The binary supports two primary subcommands, which directly correspond to the two node roles we discussed earlier:</p><pre>rke2 --help<br><br>server    Run management server<br>   agent     Run node agent</pre><p>The “server” subcommand starts the full control plane (API server, etcd, scheduler, controller manager) along with a kubelet and kube-proxy. The “agent” subcommand starts only a kubelet and kube-proxy, which is all a worker node needs.</p><p>Before starting the server, you typically write a configuration file at /etc/rancher/rke2/config.yaml. This is where you customize the behavior of RKE2 — things like which CNI plugin to use, which IP addresses to bind to, and which built-in add-ons to disable. 
Here is an example from the hands-on lab:</p><pre># /etc/rancher/rke2/config.yaml<br>write-kubeconfig-mode: &quot;0644&quot;<br>debug: true<br>cni: canal<br>bind-address: 192.168.10.11<br>advertise-address: 192.168.10.11<br>node-ip: 192.168.10.11<br>disable-cloud-controller: true<br>disable:<br>  - servicelb<br>  - rke2-coredns-autoscaler<br>  - rke2-ingress-nginx<br>  - rke2-snapshot-controller<br>  - rke2-snapshot-controller-crd<br>  - rke2-snapshot-validation-webhook</pre><p>A few things worth calling out here. The “write-kubeconfig-mode” setting controls the file permissions of the generated kubeconfig file at /etc/rancher/rke2/rke2.yaml. Setting it to 0644 makes it world-readable, which is convenient in a lab but something you would lock down in production.</p><p>The “cni” field lets you choose between canal (the default, which combines Calico for network policy and Flannel for overlay networking), cilium, calico, or none. The “disable” list is particularly useful — RKE2 ships with several built-in Helm charts (ingress-nginx, metrics-server, CoreDNS autoscaler, etc.), and you can selectively disable any of them if you plan to bring your own.</p><p>RKE2 also supports per-chart customization through a mechanism called HelmChartConfig. If you want to override the default values of a built-in Helm chart, you simply drop a HelmChartConfig manifest into /var/lib/rancher/rke2/server/manifests/ before starting the server. For example, to tell the Canal CNI to use a specific network interface:</p><pre># /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml<br>apiVersion: helm.cattle.io/v1<br>kind: HelmChartConfig<br>metadata:<br>  name: rke2-canal<br>  namespace: kube-system<br>spec:<br>  valuesContent: |-<br>    flannel:<br>      iface: &quot;enp0s9&quot;</pre><p>This is a very elegant pattern. 
Instead of requiring you to manually run helm install with custom values after the cluster is up, <strong>RKE2 lets you declare your desired configuration before the cluster even boots.</strong> The RKE2 Helm controller will pick up these manifests and apply them automatically during the initial startup sequence.</p><p>Once everything is configured, starting the server is a single systemctl command:</p><pre>systemctl enable --now rke2-server.service</pre><p>This typically takes about two minutes. During that time, the RKE2 supervisor starts containerd, generates TLS certificates for all control plane components, writes static pod manifests, and waits for the kubelet to bring up etcd and the API server. You can watch the progress in real time with:</p><pre>journalctl -u rke2-server -f</pre><p>After the server is up, you will find all the necessary binaries tucked inside /var/lib/rancher/rke2/bin/:</p><pre>tree /var/lib/rancher/rke2/bin/<br>├── containerd<br>├── containerd-shim-runc-v2<br>├── crictl<br>├── ctr<br>├── kubectl<br>├── kubelet<br>└── runc</pre><p>These are not installed into your system PATH by default. RKE2 keeps them isolated to avoid conflicts with any existing binaries on the host. 
If you want to use kubectl and other tools directly, the recommended approach is to create symbolic links:</p><pre>ln -s /var/lib/rancher/rke2/bin/kubectl /usr/local/bin/kubectl<br>ln -s /var/lib/rancher/rke2/bin/crictl /usr/local/bin/crictl<br>ln -s /var/lib/rancher/rke2/bin/containerd /usr/local/bin/containerd<br>ln -s /var/lib/rancher/rke2/bin/runc /usr/local/bin/runc<br>ln -s /var/lib/rancher/rke2/bin/ctr /usr/local/bin/ctr<br>ln -s /var/lib/rancher/rke2/agent/etc/crictl.yaml /etc/crictl.yaml</pre><p>Then you copy the generated kubeconfig to your home directory:</p><pre>mkdir ~/.kube<br>cp /etc/rancher/rke2/rke2.yaml ~/.kube/config</pre><p>And you are ready to interact with your cluster:</p><pre>kubectl cluster-info</pre><pre>Kubernetes control plane is running at https://192.168.10.11:6443</pre><pre>kubectl get node -owide</pre><pre>NAME        STATUS   ROLES                AGE   VERSION          INTERNAL-IP     OS-IMAGE                      CONTAINER-RUNTIME<br>k8s-node1   Ready    control-plane,etcd   15m   v1.33.7+rke2r3   192.168.10.11   Rocky Linux 9.6 (Blue Onyx)   containerd://2.1.5-k3s1</pre><p>If you check the running pods, you will see the control plane components we discussed, all running as pods in the kube-system namespace, exactly as the architecture diagram predicts:</p><pre>kubectl get pod -A</pre><pre>NAMESPACE     NAME                                         READY   STATUS      AGE<br>kube-system   etcd-k8s-node1                               1/1     Running     2m<br>kube-system   kube-apiserver-k8s-node1                     1/1     Running     2m<br>kube-system   kube-controller-manager-k8s-node1            1/1     Running     2m<br>kube-system   kube-scheduler-k8s-node1                     1/1     Running     2m<br>kube-system   kube-proxy-k8s-node1                         1/1     Running     2m<br>kube-system   rke2-canal-dkw2n                             2/2     Running     2m<br>kube-system   rke2-coredns-rke2-coredns-784bcb7f4d-tpt2d  
 1/1     Running     2m<br>kube-system   rke2-metrics-server-7b59bd8854-m5w2c         1/1     Running     1m</pre><p>You will also notice that RKE2 deployed several add-ons as Helm releases — Canal for networking, CoreDNS for DNS resolution, metrics-server for resource metrics, and a set of runtime classes:</p><pre>helm list -A</pre><pre>NAME                NAMESPACE    STATUS    CHART                              APP VERSION<br>rke2-canal          kube-system  deployed  rke2-canal-v3.31.3-build2026011900 v3.31.3<br>rke2-coredns        kube-system  deployed  rke2-coredns-1.45.008              1.13.1<br>rke2-metrics-server kube-system  deployed  rke2-metrics-server-3.13.006       0.8.0<br>rke2-runtimeclasses kube-system  deployed  rke2-runtimeclasses-0.1.000        0.1.0</pre><p>This Helm-based add-on management is one of the things that makes RKE2 so pleasant to operate. Every built-in component is a standard Helm chart, which means you can inspect its values, override them with HelmChartConfig, or disable them entirely and replace them with your own preferred solution. There is no magic, no proprietary packaging — just Helm charts running on your cluster.</p><p>To summarize the architecture: RKE2 gives you a production-grade, CIS-hardened Kubernetes cluster through a single binary and a single configuration file. The RKE2 supervisor process manages containerd and kubelet internally, launches control plane components as static pods, and deploys networking and DNS through Helm charts. The result is a cluster that is easy to install, easy to understand, and secure by default — without sacrificing any of the flexibility that Kubernetes operators expect.</p><h3>RKE2 Hands-On</h3><p>Before we get into the actual installation, let’s talk about what we’re building. Our lab environment consists of two nodes: k8s-node1 (which will serve as the control plane) and k8s-node2 (which will be our worker node). 
Both are running Rocky Linux 9.6, and we’re provisioning them with Vagrant. If you want to follow along, you can grab the Vagrantfile and init script from the repository and spin up the environment like this:</p><pre>mkdir k8s-rke2<br>cd k8s-rke2<br><br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-rke2/Vagrantfile<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-rke2/init_cfg.sh<br><br>vagrant up<br>vagrant status</pre><p>Once both VMs are up and running, we’re ready to begin.</p><h4>Installing the Server (Control Plane)</h4><p>Let’s SSH into our first node, which will become the RKE2 server, also known as the control plane node.</p><pre>vagrant ssh k8s-node1</pre><p>RKE2 provides a convenient installation script that sets everything up as a systemd service. We’re going to download this script, make it executable, and run it while specifying the v1.33 release channel. This tells the installer to pull the latest stable version from the 1.33 line.</p><pre>curl -sfL https://get.rke2.io --output install.sh<br>chmod +x install.sh<br>INSTALL_RKE2_CHANNEL=v1.33 ./install.sh</pre><p>When the installation completes, you’ll see output confirming that the rke2-common, rke2-selinux, and rke2-server packages have been installed. At this point, nothing is running yet. The installer simply placed the binary and the systemd unit files on your system.</p><p>You can verify the installed version by running the following command. You should see something like v1.33.7+rke2r3 along with the Go version used to compile it.</p><pre>rke2 --version</pre><p>The installer also added a couple of Rancher RPM repositories to your system. You can confirm this by checking the repo list and the repo configuration file. This is how future upgrades will be pulled.</p><pre>dnf repolist<br>cat /etc/yum.repos.d/rancher-rke2.repo</pre><p>Now comes the important part: configuring RKE2 before we start the service. 
RKE2 reads its configuration from /etc/rancher/rke2/config.yaml.</p><p>We’re going to create this file and set a few key options. We’ll make the kubeconfig file world-readable for convenience, enable debug logging to get more visibility, choose Canal as our CNI plugin, bind the API server and advertise address to our node’s static IP, disable the cloud controller since we’re on-prem, and also disable a handful of built-in add-ons that we either don’t need or want to manage ourselves (the built-in service load balancer, CoreDNS autoscaler, ingress-nginx, and the snapshot-related components).</p><pre>cat &lt;&lt; EOF &gt; /etc/rancher/rke2/config.yaml<br>write-kubeconfig-mode: &quot;0644&quot;<br><br>debug: true<br><br>cni: canal<br><br>bind-address: 192.168.10.11<br>advertise-address: 192.168.10.11<br>node-ip: 192.168.10.11<br><br>disable-cloud-controller: true<br><br>disable:<br>  - servicelb<br>  - rke2-coredns-autoscaler<br>  - rke2-ingress-nginx<br>  - rke2-snapshot-controller<br>  - rke2-snapshot-controller-crd<br>  - rke2-snapshot-validation-webhook<br>EOF</pre><p>Since we’re using Canal as the CNI, and our Vagrant VMs have multiple network interfaces, we need to tell Flannel (which is the networking backend inside Canal) which interface to use. 
RKE2 manages its built-in add-ons through Helm charts, and we can customize them by placing HelmChartConfig manifests in the server manifests directory.</p><p>Let’s create that directory first and then drop in our Canal configuration.</p><pre>mkdir -p /var/lib/rancher/rke2/server/manifests/<br><br>cat &lt;&lt; EOF &gt; /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml<br>apiVersion: helm.cattle.io/v1<br>kind: HelmChartConfig<br>metadata:<br>  name: rke2-canal<br>  namespace: kube-system<br>spec:<br>  valuesContent: |-<br>    flannel:<br>      iface: &quot;enp0s9&quot;<br>EOF</pre><p>We’ll also create a config for CoreDNS to explicitly disable its autoscaler, since we already disabled the autoscaler add-on in the main config and we want to make sure both sides agree.</p><pre>cat &lt;&lt; EOF &gt; /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml<br>apiVersion: helm.cattle.io/v1<br>kind: HelmChartConfig<br>metadata:<br>  name: rke2-coredns<br>  namespace: kube-system<br>spec:<br>  valuesContent: |-<br>    autoscaler:<br>      enabled: false<br>EOF</pre><p>Before starting the service, it’s a great idea to open a second terminal and set up some monitoring so you can watch the boot process in real time. In that second terminal, run the following two commands. The first watches the process tree so you can see RKE2’s child processes spawn, and the second tails the journal for the rke2-server unit.</p><pre>watch -d pstree -a<br>journalctl -u rke2-server -f</pre><p>Now, back in your main terminal, let’s enable and start the RKE2 server service. This single command both enables the service to start on boot and starts it immediately. The initial startup takes roughly two minutes as it downloads container images and bootstraps the control plane. 
After the core components are up, you’ll need to wait another minute or two for CoreDNS pods to become healthy.</p><pre>systemctl enable --now rke2-server.service<br>systemctl status rke2-server --no-pager</pre><p>If you flip over to your monitoring terminal, you’ll see the process tree come alive. The rke2 process starts containerd and kubelet, and then kubelet picks up the static pod manifests and launches the etcd, kube-apiserver, kube-controller-manager, and kube-scheduler pods.</p><p>This is one of the elegant things about RKE2: the entire control plane boots up through the same static-pod mechanism that kubeadm uses, but it’s all orchestrated by a single RKE2 binary.</p><p>Once the service is running, we need to set up our kubeconfig so we can use kubectl. RKE2 writes its kubeconfig to /etc/rancher/rke2/rke2.yaml. Let’s copy it to the standard location.</p><pre>mkdir ~/.kube<br>cp /etc/rancher/rke2/rke2.yaml ~/.kube/config</pre><p>RKE2 installs its own set of binaries under /var/lib/rancher/rke2/bin/, including kubectl, crictl, containerd, ctr, and runc. Rather than modifying the PATH, a clean approach is to create symbolic links in /usr/local/bin/ so that these tools are available system-wide.</p><pre>ln -s /var/lib/rancher/rke2/bin/containerd /usr/local/bin/containerd<br>ln -s /var/lib/rancher/rke2/bin/kubectl /usr/local/bin/kubectl<br>ln -s /var/lib/rancher/rke2/bin/crictl /usr/local/bin/crictl<br>ln -s /var/lib/rancher/rke2/bin/runc /usr/local/bin/runc<br>ln -s /var/lib/rancher/rke2/bin/ctr /usr/local/bin/ctr<br>ln -s /var/lib/rancher/rke2/agent/etc/crictl.yaml /etc/crictl.yaml</pre><p>Let’s also set up shell completion and an alias for kubectl. 
This might seem minor, but trust me, it saves a lot of typing over the course of a long lab session.</p><pre>source &lt;(kubectl completion bash)<br>alias k=kubectl<br>complete -F __start_kubectl k<br>echo &#39;source &lt;(kubectl completion bash)&#39; &gt;&gt; /etc/profile<br>echo &#39;alias k=kubectl&#39; &gt;&gt; /etc/profile<br>echo &#39;complete -F __start_kubectl k&#39; &gt;&gt; /etc/profile</pre><p>Now let’s verify everything is working. First, check that the cluster is reachable.</p><pre>kubectl cluster-info -v=6</pre><p>You should see “Kubernetes control plane is running at https://192.168.10.11:6443”. Then check the node status and all the pods in the kube-system namespace.</p><pre>kubectl get node -owide</pre><p>You’ll see a single node, k8s-node1, in the Ready state with the roles control-plane and etcd, running the version v1.33.7+rke2r3 on Rocky Linux 9.6 with containerd as the container runtime.</p><p>If you check the Helm releases, you’ll see that RKE2 automatically deployed Canal, CoreDNS, the metrics server, and runtime classes as Helm charts. This is another nice feature of RKE2: it uses Helm under the hood to manage its add-ons, which gives you a familiar and powerful way to customize them.</p><pre>helm list -A</pre><p>And finally, let’s look at all the pods running in the cluster.</p><pre>kubectl get pod -A</pre><p>You should see etcd, kube-apiserver, kube-controller-manager, kube-scheduler, and kube-proxy running as static pods on k8s-node1, along with the Canal DaemonSet pod, the CoreDNS deployment pod, and the metrics server. There will also be a few completed Helm install jobs.</p><p>Everything should be in a Running or Completed state. At this point, your single-node RKE2 control plane is fully operational.</p><h4>Joining an Agent (Worker Node)</h4><p>Now that our server node is up, let’s add a worker. The first thing we need is the join token. RKE2 generates this token during the server bootstrap and stores it on disk. 
On k8s-node1, retrieve it like this:</p><pre>cat /var/lib/rancher/rke2/server/node-token</pre><p>You’ll get a long string that looks something like K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749. Copy this value because we’ll need it on the worker node.</p><p>You can also verify that the supervisor port is listening. This is port 9345, and it’s the dedicated management and bootstrap API that new nodes use to join the cluster.</p><pre>ss -tnlp | grep 9345</pre><p>Before switching to the worker node, let’s set up a monitoring loop on k8s-node1 so we can watch the new node appear in real time.</p><pre>watch -d &#39;kubectl get node; echo; kubectl get pod -n kube-system&#39;</pre><p>Now SSH into k8s-node2 and let’s get the agent installed. The process is very similar to the server installation, but we pass INSTALL_RKE2_TYPE=&quot;agent&quot; to tell the installer we want the agent role, not the server role.</p><pre>curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=&quot;agent&quot; INSTALL_RKE2_CHANNEL=v1.33 sh -</pre><p>Next, we create the agent configuration file. This is much simpler than the server config. It only needs two things: the URL of the server’s supervisor API (on port 9345) and the join token we copied earlier.</p><pre>TOKEN=K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749<br><br>mkdir -p /etc/rancher/rke2/<br>cat &lt;&lt; EOF &gt; /etc/rancher/rke2/config.yaml<br>server: https://192.168.10.11:9345<br>token: $TOKEN<br>EOF</pre><p>Then, just like the server, we enable and start the agent service. You can also tail the journal to watch the join process.</p><pre>systemctl enable --now rke2-agent.service<br>journalctl -u rke2-agent -f</pre><p>Flip back to your monitoring terminal on k8s-node1 and within a few seconds you’ll see k8s-node2 appear as a new node. 
Let’s verify the cluster now has two nodes.</p><pre>kubectl get node -owide</pre><p>You should see both k8s-node1 (with roles control-plane, etcd, and master) and k8s-node2 (with no specific role label, meaning it’s a pure worker). Both should be in the Ready state. If you check the pods in kube-system filtered by k8s-node2, you’ll see that only kube-proxy and the Canal DaemonSet pod were automatically deployed there, which is exactly what you’d expect for a worker node.</p><pre>kubectl get pod -n kube-system -owide | grep k8s-node2</pre><p>And that’s it. Two commands on the server (install + start), two commands on the agent (install + start), and a tiny config file on each side. The cluster is fully operational with a control plane and a worker node.</p><h4>Deploying a Sample App</h4><p>With our two-node cluster ready, let’s deploy a simple application to prove that everything is working end to end, including pod scheduling, networking, and service exposure.</p><p>We’re going to deploy a simple HTTP echo server (traefik/whoami) as a Deployment with two replicas, fronted by a NodePort Service on port 30000. 
We also add a pod anti-affinity rule, keyed on the pods’ own app=webpod label, to ensure the two replicas land on different nodes, which lets us verify that cross-node networking works.</p><pre>cat &lt;&lt; EOF | kubectl apply -f -<br>apiVersion: apps/v1<br>kind: Deployment<br>metadata:<br>  name: webpod<br>spec:<br>  replicas: 2<br>  selector:<br>    matchLabels:<br>      app: webpod<br>  template:<br>    metadata:<br>      labels:<br>        app: webpod<br>    spec:<br>      affinity:<br>        podAntiAffinity:<br>          requiredDuringSchedulingIgnoredDuringExecution:<br>          - labelSelector:<br>              matchExpressions:<br>              - key: app<br>                operator: In<br>                values:<br>                - webpod<br>            topologyKey: &quot;kubernetes.io/hostname&quot;<br>      containers:<br>      - name: webpod<br>        image: traefik/whoami<br>        ports:<br>        - containerPort: 80<br>---<br>apiVersion: v1<br>kind: Service<br>metadata:<br>  name: webpod<br>  labels:<br>    app: webpod<br>spec:<br>  selector:<br>    app: webpod<br>  ports:<br>  - protocol: TCP<br>    port: 80<br>    targetPort: 80<br>    nodePort: 30000<br>  type: NodePort<br>EOF</pre><p>Let’s check on the deployed resources to make sure everything is healthy.</p><pre>kubectl get deploy,pod,svc,ep -owide</pre><p>You should see the Deployment with 2/2 replicas available, two pods in the Running state (one on k8s-node1 and one on k8s-node2), the Service of type NodePort with port 30000 mapped, and the Endpoints object listing both pod IPs.</p><p>Now let’s hit the service from the host machine to verify connectivity. We’ll use a loop that calls the NodePort on k8s-node2 and prints the hostname returned by each response.</p><pre>while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done</pre><p>If everything is configured correctly, you’ll see the responses alternate between the two pod hostnames. 
This confirms that the NodePort service is load-balancing traffic across both pods, and that the Canal overlay network is successfully routing packets between nodes.</p><p>One interesting thing to notice here is that pods got scheduled on both nodes, including the control plane node. If you check the taints on k8s-node1, you’ll find that there are none.</p><pre>kubectl describe node k8s-node1 | grep -i taints</pre><p>This is a deliberate RKE2 design choice. Unlike kubeadm, which applies a NoSchedule taint to control plane nodes by default, RKE2 leaves them untainted so that workloads can run on all nodes out of the box. This is especially useful for smaller clusters or edge deployments where you want to maximize resource utilization. Of course, in a production environment with dedicated control plane nodes, you’d want to add taints yourself to keep workloads off the control plane.</p><p>At this point, we have a fully functional two-node RKE2 cluster running a sample application with verified cross-node networking. The installation was remarkably straightforward: a single binary, a simple config file, one systemd command on each node, and we were up and running.</p><h3>RKE2 Day-2 Operations</h3><p>Once your RKE2 cluster is up and running with a server node and an agent node joined, the real work begins.</p><p>Day-2 operations are about keeping your cluster healthy, secure, and up to date over time. In this section, we will walk through three essential tasks that every RKE2 operator needs to know: managing and rotating TLS certificates, performing a manual version upgrade from v1.33 to v1.34, and setting up fully automated upgrades using the Rancher System Upgrade Controller to take the cluster all the way to v1.35. 
We will go step by step, show every command, and explain what is happening behind the scenes so you can confidently apply these procedures to your own clusters.</p><h4>Certificate Management and Rotation</h4><p>Every component in your RKE2 cluster communicates over TLS. The API server, etcd, the kubelet, the scheduler, the controller manager — they all rely on certificates to authenticate and encrypt traffic between each other. RKE2 issues these certificates automatically during installation, and each client or server certificate is valid for 365 days from the date it was issued. The Certificate Authority (CA) certificates that sign them have a much longer lifetime of 10 years.</p><p>RKE2 has a built-in mechanism for certificate renewal. Every time the rke2-server or rke2-agent service starts, it checks the expiration dates of all certificates. If any certificate is within 120 days of expiring, RKE2 will automatically renew it during startup. This renewal process reuses the existing private keys, so the certificate identity stays the same — only the validity period gets extended. 
Additionally, when a certificate’s expiration date falls within 120 days, Kubernetes will emit a warning event of type CertificateExpirationWarning so you can be aware of upcoming expirations even if you have not manually checked.</p><p>Let us start by inspecting the certificates on both the server and agent nodes to understand what is installed and when everything expires.</p><p>On the server node (k8s-node1), run the following command to display all certificates in a table format:</p><pre>rke2 certificate check --output table</pre><p>You will see output like this, showing every certificate file, its subject, usage type, expiration date, residual time, and current status:</p><pre>FILENAME                           SUBJECT                             USAGES                  EXPIRES                  RESIDUAL TIME   STATUS<br>--------                           -------                             ------                  -------                  -------------   ------<br>client-kube-apiserver.crt          system:apiserver                    ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-kube-apiserver.crt          rke2-client-ca@1770647369           CertSign                Feb 07, 2036 14:29 UTC   10 years        OK<br>serving-kube-apiserver.crt         kube-apiserver                      ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>serving-kube-apiserver.crt         rke2-server-ca@1770647369           CertSign                Feb 07, 2036 14:29 UTC   10 years        OK<br>client-rke2-cloud-controller.crt   rke2-cloud-controller-manager       ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-scheduler.crt               system:kube-scheduler               ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>kube-scheduler.crt                 kube-scheduler                      ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-kube-proxy.crt   
           system:kube-proxy                   ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-kubelet.crt                 system:node:k8s-node1               ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>serving-kubelet.crt                k8s-node1                           ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-rke2-controller.crt         system:rke2-controller              ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-admin.crt                   system:admin                        ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-auth-proxy.crt              system:auth-proxy                   ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client-controller.crt              system:kube-controller-manager      ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>kube-controller-manager.crt        kube-controller-manager             ServerAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>client.crt                         etcd-client                         ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK<br>server-client.crt                  etcd-server                         ServerAuth,ClientAuth   Feb 09, 2027 14:29 UTC   1 year          OK<br>peer-server-client.crt             etcd-peer                           ServerAuth,ClientAuth   Feb 09, 2027 14:29 UTC   1 year          OK<br>client-supervisor.crt              system:rke2-supervisor              ClientAuth              Feb 09, 2027 14:29 UTC   1 year          OK</pre><p>Notice how the server node has a large number of certificates. 
You have client authentication certificates for every major component (API server, scheduler, controller manager, kubelet, kube-proxy, etcd, and the RKE2 supervisor), as well as server authentication certificates for the API server, scheduler, controller manager, kubelet, and etcd.</p><p>Each leaf certificate expires in 1 year, while the CA certificates that signed them expire in 10 years.</p><p>On the agent node (k8s-node2), the certificate inventory is much smaller because the agent only needs certificates for its own kubelet, kube-proxy, and the RKE2 controller:</p><pre>rke2 certificate check --output table<br><br>FILENAME                     SUBJECT                     USAGES       EXPIRES                  RESIDUAL TIME   STATUS<br>--------                     -------                     ------       -------                  -------------   ------<br>client-kubelet.crt           system:node:k8s-node2       ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK<br>serving-kubelet.crt          k8s-node2                   ServerAuth   Feb 14, 2027 14:21 UTC   1 year          OK<br>client-rke2-controller.crt   system:rke2-controller      ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK<br>client-kube-proxy.crt        system:kube-proxy           ClientAuth   Feb 14, 2027 14:21 UTC   1 year          OK</pre><p>Now, there are situations where you might want to rotate certificates manually rather than waiting for the automatic renewal window. Perhaps you suspect a certificate has been compromised, or you are performing a planned security rotation as part of your organization’s compliance requirements. RKE2 provides a straightforward command for this.</p><p>The process requires three steps: stop the RKE2 server, rotate the certificates, and then start the server again. 
Here is the full procedure on the server node (k8s-node1):</p><p>First, stop the RKE2 server service:</p><pre>systemctl stop rke2-server</pre><p>Next, run the certificate rotation command:</p><pre>rke2 certificate rotate</pre><p>You will see output confirming that every certificate is being rotated:</p><pre>INFO[0000] Server detected, rotating agent and server certificates<br>INFO[0000] Rotating dynamic listener certificate<br>INFO[0000] Rotating certificates for admin<br>INFO[0000] Rotating certificates for controller-manager<br>INFO[0000] Rotating certificates for kube-proxy<br>INFO[0000] Rotating certificates for kubelet<br>INFO[0000] Rotating certificates for rke2-controller<br>INFO[0000] Rotating certificates for api-server<br>INFO[0000] Rotating certificates for auth-proxy<br>INFO[0000] Rotating certificates for cloud-controller<br>INFO[0000] Rotating certificates for etcd<br>INFO[0000] Rotating certificates for scheduler<br>INFO[0000] Rotating certificates for supervisor<br>INFO[0000] Successfully backed up certificates to /var/lib/rancher/rke2/server/tls-1770651290, please restart rke2 server or agent to rotate certificates</pre><p>Notice that RKE2 automatically backs up the old certificates to a timestamped directory under /var/lib/rancher/rke2/server/. This is a nice safety net — if anything goes wrong, you can restore the previous certificates.</p><p>You can verify the new certificate expiration dates before restarting:</p><pre>rke2 certificate check --output table</pre><p>Now start the server back up:</p><pre>systemctl start rke2-server</pre><p>After the server is running, verify everything looks good once more:</p><pre>rke2 certificate check --output table</pre><p>There is one important thing to remember after certificate rotation: the kubeconfig file at /etc/rancher/rke2/rke2.yaml gets regenerated with the new certificate data.</p><p>If you previously copied this file to ~/.kube/config, you need to update your copy. 
You can check if they differ and then copy the updated file:</p><pre>diff /etc/rancher/rke2/rke2.yaml ~/.kube/config<br>yes | cp /etc/rancher/rke2/rke2.yaml ~/.kube/config ; echo<br>kubectl cluster-info</pre><p>A great thing about RKE2’s certificate rotation is that you do not need to do anything special on the worker nodes. When the rke2-agent service on a worker node detects that the connection to the server has been interrupted (because the server’s certificates changed), it will automatically reconnect and receive new certificates from the server. This behavior is very similar to how kubeadm-based clusters handle certificate renewal on worker nodes — the worker simply re-bootstraps its credentials upon reconnection.</p><h4>Manual Version Upgrade (v1.33 to v1.34)</h4><p>Upgrading RKE2 manually is refreshingly simple compared to many other Kubernetes distributions. The process involves running the same installation script you used initially, but pointing it to a newer version channel. RKE2 will upgrade the binaries and restart the control plane components, often with minimal disruption to running workloads.</p><p>Before starting, let us set up some monitoring so we can observe what happens during the upgrade. 
Open several terminal windows on the server node and run these commands:</p><p>In one terminal, continuously curl your sample application to verify it stays accessible throughout the upgrade:</p><pre>while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done</pre><p>In another terminal, watch the kube-system pods to see which ones get recreated:</p><pre>watch -d &quot;kubectl get pod -n kube-system -owide --sort-by=.metadata.creationTimestamp | tac&quot;</pre><p>In a third terminal, watch the node status:</p><pre>watch -d &quot;kubectl get node&quot;</pre><p>And optionally, if you want to monitor etcd cluster health during the upgrade:</p><pre>watch -d etcdctl \<br>  --endpoints=https://127.0.0.1:2379 \<br>  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \<br>  --cert=/var/lib/rancher/rke2/server/tls/etcd/client.crt \<br>  --key=/var/lib/rancher/rke2/server/tls/etcd/client.key \<br>  member list --write-out=table</pre><p>Now confirm your current version:</p><pre>rke2 --version</pre><pre>rke2 version v1.33.7+rke2r3 (7e4fd1a82edf497cab91c220144619bbad659cf4)<br>go version go1.24.11 X:boringcrypto</pre><p>You can also check what versions are available across different release channels by querying the RKE2 update API:</p><pre>curl -s https://update.rke2.io/v1-release/channels | jq .data</pre><p>This returns a JSON array with all available channels including stable, latest, and version-specific channels like v1.34 and v1.35. Each entry shows the latest version available in that channel.</p><p>Now let us upgrade the server node. On k8s-node1, run the RKE2 installer with the v1.34 channel specified:</p><pre>curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.34 sh -</pre><p>The installer will download and install the new RPM packages. 
You will see the transaction output showing the old v1.33 packages being replaced by v1.34:</p><pre>Running transaction<br>  Preparing        :                                                        1/1<br>  Upgrading        : rke2-common-1.34.3~rke2r3-0.el9.aarch64               1/4<br>  Upgrading        : rke2-server-1.34.3~rke2r3-0.el9.aarch64               2/4<br>  Running scriptlet: rke2-server-1.34.3~rke2r3-0.el9.aarch64               2/4<br>  Running scriptlet: rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4<br>  Cleanup          : rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4<br>  Running scriptlet: rke2-server-1.33.7~rke2r3-0.el9.aarch64               3/4<br>  Running scriptlet: rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4<br>  Cleanup          : rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4<br>  Running scriptlet: rke2-common-1.33.7~rke2r3-0.el9.aarch64               4/4<br>  Verifying        : rke2-common-1.34.3~rke2r3-0.el9.aarch64               1/4<br>  Verifying        : rke2-common-1.33.7~rke2r3-0.el9.aarch64               2/4<br>  Verifying        : rke2-server-1.34.3~rke2r3-0.el9.aarch64               3/4<br>  Verifying        : rke2-server-1.33.7~rke2r3-0.el9.aarch64               4/4</pre><p>What is remarkable here is that just running the installer script is enough to trigger the upgrade of the control plane components. During the scriptlet phase, the RKE2 process detects the new binaries and recreates the static pod manifests, which causes the kubelet to restart the control plane pods.</p><p>If you are watching your monitoring terminals, you will see the etcd, kube-apiserver, and kube-proxy pods get recreated first, followed by the scheduler and controller manager. 
And throughout this entire process, if you are watching your application curl loop, you should see that application traffic continues to flow without interruption.</p><p>Verify the new version:</p><pre>rke2 --version</pre><pre>rke2 version v1.34.3+rke2r3 (7598946e0086a9131564ccbb3c142b3fa54516ad)<br>go version go1.24.11 X:boringcrypto</pre><p>You can also confirm the package repository was updated to point to the v1.34 channel:</p><pre>dnf repolist</pre><pre>rancher-rke2-1.34-stable       Rancher RKE2 1.34 (v1.34)<br>rancher-rke2-common-stable     Rancher RKE2 Common (v1.34)</pre><p>To see which container images each kube-system pod is running after the upgrade:</p><pre>kubectl get pods -n kube-system \<br>  -o custom-columns=\<br>POD:.metadata.name,\<br>CONTAINERS:.spec.containers[*].name,\<br>IMAGES:.spec.containers[*].image</pre><p>Now restart the RKE2 server service to ensure everything is fully reconciled:</p><pre>systemctl restart rke2-server</pre><p>Check the node versions to see the server is upgraded but the worker is still on the old version:</p><pre>kubectl get node -owide</pre><pre>NAME        STATUS   ROLES                       AGE     VERSION          INTERNAL-IP     ...<br>k8s-node1   Ready    control-plane,etcd,master   18m     v1.34.3+rke2r3   192.168.10.11   ...<br>k8s-node2   Ready    &lt;none&gt;                      7m23s   v1.33.7+rke2r3   192.168.10.12   ...</pre><p>This is the expected intermediate state. The Kubernetes version skew policy allows the kubelet to be up to three minor versions older than the API server (but never newer), so having the server on v1.34 and the worker on v1.33 is perfectly fine temporarily.</p><p>Now let us upgrade the worker node. 
Switch to k8s-node2 and run the installer, this time specifying the agent type:</p><pre>rke2 --version<br>curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh -</pre><p>Verify the new version on the worker:</p><pre>rke2 --version</pre><pre>rke2 version v1.34.3+rke2r3 (7598946e0086a9131564ccbb3c142b3fa54516ad)<br>go version go1.24.11 X:boringcrypto</pre><p>Confirm the repository was updated:</p><pre>dnf repolist</pre><pre>rancher-rke2-1.34-stable       Rancher RKE2 1.34 (v1.34)<br>rancher-rke2-common-stable     Rancher RKE2 Common (v1.34)</pre><p>Restart the agent service:</p><pre>systemctl restart rke2-agent</pre><p>Now go back to the server node (k8s-node1) and verify that both nodes are running the same version:</p><pre>kubectl get node -owide</pre><pre>NAME        STATUS   ROLES                       AGE   VERSION          INTERNAL-IP     ...<br>k8s-node1   Ready    control-plane,etcd,master   22m   v1.34.3+rke2r3   192.168.10.11   ...<br>k8s-node2   Ready    &lt;none&gt;                      11m   v1.34.3+rke2r3   192.168.10.12   ...</pre><p>If you check the kube-system pods sorted by creation time, you will notice that only the kube-proxy pod on the worker node was recreated during the worker upgrade. The rest of the worker node pods (like canal) continued running unchanged:</p><pre>kubectl get pod -n kube-system --sort-by=.metadata.creationTimestamp | tac</pre><p>The entire manual upgrade process is remarkably smooth. The key takeaway is the order of operations: always upgrade the server (control plane) first, then upgrade the agent (worker) nodes one by one. 
This ensures the control plane is always at the same or newer version than the workers, which is a fundamental Kubernetes requirement.</p><h4>Automated Upgrade with System Upgrade Controller</h4><p>Manual upgrades work well for small clusters, but when you are managing many nodes or want to ensure upgrades happen reliably without human intervention, the Rancher System Upgrade Controller is the way to go.</p><p>This controller watches for custom Plan resources that declaratively specify which nodes should be upgraded, to which version, and in what order. It then creates Jobs that run on each target node to perform the actual upgrade.</p><p>Before we begin, let us set up our monitoring terminals again. On the server node (k8s-node1):</p><p>Watch the upgrade plans and their associated jobs and pods:</p><pre>watch -d &#39;kubectl -n system-upgrade get plans -o wide; echo ; kubectl -n system-upgrade get jobs,pods&#39;</pre><p>Watch the node versions:</p><pre>watch -d &quot;kubectl get node&quot;</pre><p>Watch the kube-system pods:</p><pre>watch -d &quot;kubectl get pod -n kube-system -owide --sort-by=.metadata.creationTimestamp | tac&quot;</pre><p>And keep your application health check running:</p><pre>while true; do curl -s http://192.168.10.12:30000 | grep Hostname; date; sleep 1; done</pre><p>The first step is to install the System Upgrade Controller itself. 
It consists of a CRD (the Plan resource definition) and a controller deployment that watches for Plan resources:</p><pre>kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml \<br>  -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml</pre><p>This creates several resources: a system-upgrade namespace, a ServiceAccount, RBAC roles and bindings (including a special “drainer” ClusterRole that allows the controller to cordon and drain nodes), a ConfigMap with default environment variables, and the controller Deployment itself.</p><p>Verify the installation:</p><pre>kubectl get deploy,pod,cm -n system-upgrade</pre><pre>NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE<br>deployment.apps/system-upgrade-controller   1/1     1            1           12m<br><br>NAME                                             READY   STATUS    RESTARTS   AGE<br>pod/system-upgrade-controller-6f9f9b8cf4-46n82   1/1     Running   0          12m<br><br>NAME                               DATA   AGE<br>configmap/default-controller-env   10     12m</pre><p>Also confirm the CRD was registered:</p><pre>kubectl get crd | grep upgrade</pre><pre>plans.upgrade.cattle.io   2026-02-09T18:08:08Z</pre><p>You can tail the controller logs to watch what happens in real time:</p><pre>kubectl logs -n system-upgrade -l app.kubernetes.io/name=system-upgrade-controller -f</pre><p>Now comes the interesting part: defining the upgrade plans. We will create two Plan resources: one for server nodes (control plane) and one for agent nodes (workers). The key design here is that the agent plan has a prepare step that references the server plan. 
This ensures that all server nodes are upgraded before any agent node begins its upgrade, which respects the Kubernetes version skew policy.</p><pre>cat &lt;&lt; EOF | kubectl apply -f -<br>apiVersion: upgrade.cattle.io/v1<br>kind: Plan<br>metadata:<br>  name: server-plan<br>  namespace: system-upgrade<br>spec:<br>  concurrency: 1<br>  cordon: true<br>  nodeSelector:<br>    matchExpressions:<br>    - key: node-role.kubernetes.io/control-plane<br>      operator: In<br>      values:<br>      - &quot;true&quot;<br>  serviceAccountName: system-upgrade<br>  upgrade:<br>    image: rancher/rke2-upgrade<br>  channel: https://update.rke2.io/v1-release/channels/latest<br>---<br>apiVersion: upgrade.cattle.io/v1<br>kind: Plan<br>metadata:<br>  name: agent-plan<br>  namespace: system-upgrade<br>spec:<br>  concurrency: 1<br>  cordon: true<br>  nodeSelector:<br>    matchExpressions:<br>    - key: node-role.kubernetes.io/control-plane<br>      operator: DoesNotExist<br>  prepare:<br>    args:<br>    - prepare<br>    - server-plan<br>    image: rancher/rke2-upgrade<br>  serviceAccountName: system-upgrade<br>  upgrade:<br>    image: rancher/rke2-upgrade<br>  channel: https://update.rke2.io/v1-release/channels/latest<br>EOF</pre><p>Let us break down what each field does. The concurrency field is set to 1, meaning only one node will be upgraded at a time. This is the safest approach — you can increase it if you have a large cluster and want faster rollouts, but for most on-premise environments, upgrading one node at a time is the prudent choice. The cordon field is set to true, which means each node will be cordoned (marked as unschedulable) before the upgrade begins, preventing new pods from being scheduled on it during the process.</p><p>The nodeSelector is what differentiates the two plans. The server plan targets nodes that have the node-role.kubernetes.io/control-plane label set to “true”, while the agent plan targets nodes where that label does not exist at all. 
This cleanly separates control plane nodes from worker nodes.</p><p>The channel field points to the RKE2 update API, specifically the “latest” channel. The controller will query this URL to determine what the latest available version is. You could alternatively use a version field with a specific version string like “v1.35.0+rke2r3” if you want to pin to an exact version rather than following the latest channel.</p><p>The prepare section in the agent plan is crucial for ordering. It tells the controller to run a preparation step using the rancher/rke2-upgrade image with the argument “prepare server-plan”. This step will wait until the server-plan has completed successfully on all matching nodes before proceeding with the agent upgrade. This is how the controller enforces the correct upgrade order.</p><p>As soon as you apply these Plan resources, the controller springs into action. It resolves the latest version from the channel URL, creates Jobs for each target node, and begins the upgrade process. 
You can watch the progress by checking the plan status:</p><pre>kubectl -n system-upgrade get plans -o wide</pre><pre>NAME          IMAGE                  CHANNEL                                             VERSION   COMPLETE   MESSAGE   APPLYING<br>agent-plan    rancher/rke2-upgrade   https://update.rke2.io/v1-release/channels/latest             True<br>server-plan   rancher/rke2-upgrade   https://update.rke2.io/v1-release/channels/latest             True</pre><p>Check the jobs that were created:</p><pre>kubectl -n system-upgrade get jobs</pre><pre>NAME                                                              STATUS     COMPLETIONS   DURATION   AGE<br>apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-7dc30   Complete   1/1           2m5s       4m16s<br>apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-28ad1   Complete   1/1           55s        4m16s</pre><p>And the pods that performed the actual upgrades:</p><pre>kubectl get pod -n system-upgrade -owide</pre><pre>NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE        ...<br>apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-54frb   0/1     Unknown     0          4m30s   192.168.10.12   k8s-node2   ...<br>apply-agent-plan-on-k8s-node2-with-db1bffd09b601fca4c7c06-jrsnq   0/1     Completed   0          2m37s   192.168.10.12   k8s-node2   ...<br>apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-xg4jl   0/1     Unknown     0          4m30s   192.168.10.11   k8s-node1   ...<br>apply-server-plan-on-k8s-node1-with-db1bffd09b601fca4c7c0-msrt2   0/1     Completed   0          3m43s   192.168.10.11   k8s-node1   ...<br>system-upgrade-controller-6f9f9b8cf4-zksgq                        1/1     Running     0          8m20s   10.42.0.8       k8s-node1   ...</pre><p>You might notice some pods with an “Unknown” status. 
These are the initial pods that were running on the node when it was being restarted during the upgrade — they lost connection and could not report their final status. The “Completed” pods are the ones that actually finished the upgrade successfully on a second attempt.</p><p>One very important thing to understand about these upgrade pods is the level of access they require. Because they need to modify the host operating system (installing new RKE2 binaries, restarting systemd services), they run with significant privileges. You can inspect this:</p><pre>kubectl describe pod -n system-upgrade | grep ^Volumes: -A4</pre><pre>Volumes:<br>  host-root:<br>    Type:          HostPath (bare host directory volume)<br>    Path:          /<br>    HostPathType:  Directory</pre><p>The upgrade pods mount the entire host root filesystem at /host with read and write permissions. They also use the host IPC, NET, and PID namespaces, and have the CAP_SYS_BOOT capability. This level of access is necessary for the upgrade to work, but it is something you should be aware of from a security perspective, especially in production environments. Make sure you review and understand these permissions before deploying the System Upgrade Controller in a sensitive environment.</p><p>You can also check the controller logs to see the full sequence of events:</p><pre>kubectl logs -n system-upgrade -l app.kubernetes.io/name=system-upgrade-controller</pre><pre>object=&quot;system-upgrade/server-plan&quot; ... type=&quot;Normal&quot; reason=&quot;Resolved&quot; message=&quot;Resolved latest version from Spec.Channel: v1.35.0-rke2r3&quot;<br>object=&quot;system-upgrade/server-plan&quot; ... type=&quot;Normal&quot; reason=&quot;SyncJob&quot; message=&quot;Jobs synced for version v1.35.0-rke2r3 on Nodes k8s-node1. Hash: db1bffd09b601fca4c7c067d987c4d368f9237f8219289438d8678e8&quot;<br>object=&quot;system-upgrade/server-plan&quot; ... 
type=&quot;Normal&quot; reason=&quot;JobComplete&quot; message=&quot;Job completed on Node k8s-node1&quot;<br>object=&quot;system-upgrade/server-plan&quot; ... type=&quot;Normal&quot; reason=&quot;Complete&quot; message=&quot;Jobs complete for version v1.35.0-rke2r3. Hash: db1bffd09b601fca4c7c067d987c4d368f9237f8219289438d8678e8&quot;</pre><p>This excerpt shows the sequence for server-plan: the controller resolved the latest version as v1.35.0-rke2r3, synced an upgrade job to k8s-node1, and marked the plan complete once that job finished. The agent-plan goes through the same steps after server-plan completes. The hyphen in v1.35.0-rke2r3 is the image-tag form of v1.35.0+rke2r3, since container image tags cannot contain a plus sign.</p><p>Finally, verify that both nodes are now running the latest version:</p><pre>kubectl get node -owide</pre><pre>NAME        STATUS   ROLES                       AGE    VERSION          INTERNAL-IP     ...<br>k8s-node1   Ready    control-plane,etcd,master   108m   v1.35.0+rke2r3   192.168.10.11   ...<br>k8s-node2   Ready    &lt;none&gt;                      97m    v1.35.0+rke2r3   192.168.10.12   ...</pre><p>Both nodes are now on v1.35.0. The automated upgrade completed successfully, upgrading the server first and then the agent, exactly as specified in the Plan resources.</p><p>The beauty of this approach is that it is entirely declarative. You define your desired state (which version, which nodes, in what order), and the controller makes it happen. If you add new nodes to the cluster later, the controller will automatically apply the upgrade plan to them as well, as long as they match the nodeSelector. This makes the System Upgrade Controller an excellent choice for clusters that need to stay current with minimal operational overhead.</p><h3><strong>Cluster API Overview</strong>: Management vs. 
Workload clusters</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hkUDzI_JNscM2YUlxK4wvQ.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*paYwggS01hHWRy6PZN4WiQ.jpeg" /></figure><p>Before we continue the hands-on part, let’s take a step back and understand what Cluster API actually is and why it exists. If you’ve ever managed more than a handful of Kubernetes clusters, you know the pain. Every cluster needs to be provisioned, configured, upgraded, and eventually decommissioned. Doing all of this manually or with a patchwork of scripts quickly becomes a nightmare as the number of clusters grows. Cluster API, often abbreviated as CAPI, was born out of this exact problem.</p><p>At its core, Cluster API is a Kubernetes-native tool that lets you create, configure, and manage Kubernetes clusters using Kubernetes itself. Yes, you read that right. You use Kubernetes custom resources to describe what your cluster should look like, and a set of controllers running inside a Kubernetes cluster takes care of making that a reality. Think of it as applying the declarative, reconciliation-driven model that Kubernetes uses for Pods and Deployments, but now applied to entire clusters, nodes, and control planes.</p><p>This brings us to the two most important concepts you need to understand before going any further: the Management Cluster and the Workload Cluster.</p><p>The Management Cluster is the Kubernetes cluster where Cluster API components are installed and running. It hosts all the CAPI controllers, the custom resource definitions (CRDs), and the provider-specific controllers. This is the cluster that watches your declarative cluster definitions and does the heavy lifting of provisioning infrastructure, bootstrapping nodes, and orchestrating upgrades. You can think of it as the “brain” that controls everything. 
In our hands-on lab, we use a simple KinD (Kubernetes in Docker) cluster as the management cluster, but in a real production environment, this would typically be a dedicated, highly available cluster that you treat with extra care.</p><p>The Workload Cluster, on the other hand, is the cluster that gets created and managed by the Management Cluster. This is where your actual applications run. You don’t install Cluster API components on workload clusters. Instead, the management cluster provisions them from scratch, including spinning up the underlying infrastructure (VMs, containers, or bare metal), bootstrapping each node with kubeadm, forming the control plane, and joining worker nodes. From the workload cluster’s perspective, it’s just a normal Kubernetes cluster. It has no idea that it was created and is being managed by another cluster.</p><p>Now let’s talk about how this is structured internally. Cluster API is designed around a pluggable provider model, and there are four types of providers that work together to make everything happen.</p><p>The first is the Core Provider, which lives in the capi-system namespace. This is the main Cluster API controller that manages the fundamental CRDs like Cluster, Machine, MachineSet, and MachineDeployment. It orchestrates the overall reconciliation loop. When you check the installed providers in our lab, you&#39;ll see it like this:</p><pre>kubectl get providers.clusterctl.cluster.x-k8s.io -A<br><br>NAMESPACE                           NAME                    TYPE                     PROVIDER      VERSION<br>capi-system                         cluster-api             CoreProvider             cluster-api   v1.12.2</pre><p>The second is the Bootstrap Provider. In our case, this is the Kubeadm Bootstrap Provider, running in the capi-kubeadm-bootstrap-system namespace. Its job is to generate the cloud-init or user-data configuration that each node needs to bootstrap itself into a functioning Kubernetes node. 
It creates the kubeadm init and kubeadm join configurations so that control plane nodes and worker nodes know how to set themselves up. You can verify it&#39;s running:</p><pre>kubectl get providers -n capi-kubeadm-bootstrap-system bootstrap-kubeadm -o yaml<br><br>providerName: kubeadm<br>type: BootstrapProvider<br>version: v1.12.2</pre><p>The third is the Control Plane Provider. This one, running in capi-kubeadm-control-plane-system, is specifically responsible for managing the lifecycle of control plane nodes. It handles the KubeadmControlPlane resource, which means it takes care of scaling the control plane up or down, rolling out upgrades to control plane nodes one at a time, and making sure etcd membership is correctly managed throughout. This is critical for safe, zero-downtime upgrades:</p><pre>kubectl get providers -n capi-kubeadm-control-plane-system control-plane-kubeadm -o yaml<br><br>providerName: kubeadm<br>type: ControlPlaneProvider<br>version: v1.12.2</pre><p>The fourth and final piece is the Infrastructure Provider. This is where things get environment-specific. The infrastructure provider is responsible for actually creating the underlying resources, whether that’s VMs on vSphere, instances on AWS, or in our lab’s case, Docker containers that simulate nodes. Our lab uses the Docker provider (CAPD), which runs in the capd-system namespace. It&#39;s important to note that the Docker provider is designed strictly for development and testing purposes, not for production. 
But it&#39;s perfect for learning because it lets you see the full Cluster API workflow without needing access to a cloud or hypervisor:</p><pre>kubectl get providers -n capd-system infrastructure-docker -o yaml<br><br>providerName: docker<br>type: InfrastructureProvider<br>version: v1.12.2</pre><p>When you put all four providers together and look at the pods running across the management cluster, you get a clear picture of the architecture:</p><pre>kubectl get pod -A<br><br>NAMESPACE                           NAME                                                            READY   STATUS<br>capd-system                         capd-controller-manager-7c9d67ffdf-7npsd                        1/1     Running<br>capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-bd5f89bbd-9c9ng       1/1     Running<br>capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55c48d9b5-bckj5   1/1     Running<br>capi-system                         capi-controller-manager-6cc7b949c4-rmd7h                        1/1     Running<br>cert-manager                        cert-manager-598d877b78-9lkmd                                   1/1     Running<br>cert-manager                        cert-manager-cainjector-6b5777d564-7mfzz                        1/1     Running<br>cert-manager                        cert-manager-webhook-5d9fc6b4ff-slscg                           1/1     Running</pre><p>You’ll also notice cert-manager is installed. This is a dependency that Cluster API installs automatically because the controllers use TLS certificates for webhook communication.</p><p>Another concept worth understanding is ClusterClass, which was introduced as an experimental feature and is what we use in our lab. ClusterClass lets you define a reusable “template” or “blueprint” for clusters. 
Instead of specifying every detail for each individual cluster, you define a ClusterClass once, covering things like what machine templates to use, how the control plane should be configured, and what the worker node pools look like. Then, when you create an actual Cluster resource, you simply reference the ClusterClass and provide a minimal set of overrides like the cluster name, Kubernetes version, and replica counts. This is enabled by setting the CLUSTER_TOPOLOGY environment variable to true, which turns on the ClusterTopology feature gate, before initializing the management cluster:</p><pre>export CLUSTER_TOPOLOGY=true<br>clusterctl init --infrastructure docker</pre><p>And when you look at the feature gates that the core controller is running with, you can confirm it:</p><pre>kubectl describe -n capi-system deployment.apps/capi-controller-manager | grep feature-gates<br><br>--feature-gates=MachinePool=true,ClusterTopology=true,RuntimeSDK=false,...</pre><p>With ClusterTopology enabled, the generated cluster manifest contains both a ClusterClass definition (the blueprint) and a Cluster resource that references it. If you inspect the generated YAML, the resource types make this clear:</p><pre>cat capi-quickstart.yaml | grep -E &#39;^apiVersion:|^kind:&#39;<br><br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: ClusterClass<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerClusterTemplate<br>apiVersion: controlplane.cluster.x-k8s.io/v1beta2<br>kind: KubeadmControlPlaneTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachineTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachineTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachinePoolTemplate<br>apiVersion: bootstrap.cluster.x-k8s.io/v1beta2<br>kind: KubeadmConfigTemplate<br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: Cluster</pre><p>The ClusterClass at the top is the blueprint. The Cluster at the bottom is the actual workload cluster that references it. 
The resources in between (the DockerClusterTemplate, KubeadmControlPlaneTemplate, DockerMachineTemplates, DockerMachinePoolTemplate, and KubeadmConfigTemplate) are the building blocks that the ClusterClass stitches together.</p><p>One of the most powerful benefits of this architecture is how upgrades work. Because the management cluster controls the workload cluster declaratively, upgrading the Kubernetes version of an entire workload cluster is as simple as patching a single field on the Cluster resource. The controllers handle the rest: they roll out new control plane nodes one by one, wait for each to become healthy, drain and remove old nodes, and then do the same for worker nodes. There’s no need to SSH into machines or run manual commands. You just change the desired version, and the system converges to match it.</p><p>To summarize the mental model: the Management Cluster is the operator, and the Workload Cluster is what gets operated on. The Management Cluster runs CAPI controllers that watch custom resources describing desired cluster state. When you create, modify, or delete these resources, the controllers spring into action, provisioning or tearing down real infrastructure to match your intent. This separation of concerns means you can manage dozens or even hundreds of workload clusters from a single management cluster, all using the same Kubernetes API and tooling you already know.</p><h3>Cluster API Hands-On</h3><p>In this section, we will walk through every step of setting up a management cluster, provisioning a full workload cluster with three control plane nodes and three workers, deploying a sample application onto it, and finally upgrading that workload cluster to a newer Kubernetes version. Everything here uses the Docker infrastructure provider, which means each “machine” is actually a Docker container pretending to be a node. 
This makes it perfect for learning and experimentation on your local machine without needing any cloud accounts or bare-metal servers.</p><h4>Setting Up the Management Cluster (KinD + Docker Provider)</h4><p>Before we can use Cluster API to provision workload clusters, we need a management cluster. This is a regular Kubernetes cluster that runs the Cluster API controllers. These controllers watch for custom resources like Cluster, Machine, MachineDeployment, and so on, and then reconcile the desired state by creating actual infrastructure. In our case, the infrastructure provider is Docker, so the controllers will create Docker containers to act as Kubernetes nodes.</p><p>Let us start by creating a working directory. This keeps all our files organized in one place.</p><pre>mkdir capi-docker &amp;&amp; cd capi-docker</pre><p>Before proceeding, it is a good idea to check your Docker context and make sure there are no leftover containers from previous experiments that might interfere.</p><pre>docker context ls<br>docker ps -a</pre><p>Now we create a KinD (Kubernetes in Docker) cluster. This will serve as our management cluster. Notice that we mount the Docker socket into the KinD node. This is critical because the Cluster API Docker provider needs access to the host’s Docker daemon in order to create containers that will become our workload cluster nodes. 
We also set up a couple of extra port mappings for convenience, one for a sample application and one for kube-ops-view, a nice visual dashboard.</p><pre>kind create cluster --name myk8s --image kindest/node:v1.35.0 --config - &lt;&lt;EOF<br><br>kind: Cluster<br>apiVersion: kind.x-k8s.io/v1alpha4<br>nodes:<br>- role: control-plane<br>  extraMounts:<br>  - hostPath: /var/run/docker.sock <br>    containerPath: /var/run/docker.sock<br>  extraPortMappings:<br>  - containerPort: 30000<br>    hostPort: 30000<br>  - containerPort: 30001<br>    hostPort: 30001<br>EOF</pre><p>Once the cluster is up, you can optionally install kube-ops-view on the management cluster. This gives you a nice visual representation of your cluster’s pods and nodes. It is not required, but it makes the experience much more enjoyable when you are watching things happen in real time.</p><pre>helm repo add geek-cookbook https://geek-cookbook.github.io/charts/<br>helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 \<br>  --set service.main.type=NodePort,service.main.ports.http.nodePort=30001 \<br>  --set env.TZ=&quot;Asia/Seoul&quot; --namespace kube-system</pre><p>You can then open your browser and navigate to <a href="http://127.0.0.1:30001/#scale=1.5">http://127.0.0.1:30001/#scale=1.5</a> to see the dashboard.</p><p>Next, we need the clusterctl command-line tool. This is the primary CLI for interacting with Cluster API. On macOS you can install it with Homebrew, and on Linux you can download the binary directly.</p><pre># macOS<br>brew install clusterctl  <br><br># Linux amd64<br>curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.12.2/clusterctl-linux-amd64 -o clusterctl<br>sudo install -o root -g root -m 0755 clusterctl /usr/local/bin/clusterctl</pre><p>Verify the installation by checking the version.</p><pre>clusterctl version -o json | jq</pre><p>Now comes the important part. 
We initialize the management cluster by telling clusterctl to install all the necessary Cluster API components with the Docker infrastructure provider. We also enable the ClusterTopology experimental feature, which lets us use ClusterClass resources. ClusterClass is essentially a reusable template that defines how clusters should be structured, making it much easier to stamp out multiple clusters with consistent configurations.</p><pre>export CLUSTER_TOPOLOGY=true<br>clusterctl init --infrastructure docker</pre><p>This command does a lot behind the scenes. It installs cert-manager (which Cluster API depends on for webhook certificates), and then it deploys four sets of controllers into separate namespaces. You will see output indicating that each provider is being installed. Once it finishes, let us verify that everything landed correctly.</p><pre>kubectl get pod -A</pre><p>You should see pods running in several new namespaces. The capd-system namespace contains the Docker infrastructure provider controller. The capi-system namespace has the core Cluster API controller. The capi-kubeadm-bootstrap-system namespace runs the bootstrap provider, which is responsible for generating the cloud-init or kubeadm configuration that nodes use when they first boot. 
And capi-kubeadm-control-plane-system runs the control plane provider, which manages the lifecycle of control plane nodes including scaling and upgrades.</p><pre>NAMESPACE                           NAME                                                            READY   STATUS<br>capd-system                         capd-controller-manager-7c9d67ffdf-7npsd                        1/1     Running<br>capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-bd5f89bbd-9c9ng       1/1     Running<br>capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55c48d9b5-bckj5   1/1     Running<br>capi-system                         capi-controller-manager-6cc7b949c4-rmd7h                        1/1     Running<br>cert-manager                        cert-manager-598d877b78-9lkmd                                   1/1     Running<br>cert-manager                        cert-manager-cainjector-6b5777d564-7mfzz                        1/1     Running<br>cert-manager                        cert-manager-webhook-5d9fc6b4ff-slscg                           1/1     Running</pre><p>You can also inspect the installed providers using the providers custom resource. 
This is a handy way to confirm which providers are active and what version they are running.</p><pre>kubectl get providers.clusterctl.cluster.x-k8s.io -A</pre><p>The output tells you exactly what is installed: the core Cluster API provider, the kubeadm bootstrap provider, the kubeadm control plane provider, and the Docker infrastructure provider, all at version v1.12.2.</p><pre>NAMESPACE                           NAME                    TYPE                     PROVIDER      VERSION<br>capd-system                         infrastructure-docker   InfrastructureProvider   docker        v1.12.2<br>capi-system                         cluster-api             CoreProvider             cluster-api   v1.12.2<br>capi-kubeadm-bootstrap-system       bootstrap-kubeadm       BootstrapProvider        kubeadm       v1.12.2<br>capi-kubeadm-control-plane-system   control-plane-kubeadm   ControlPlaneProvider     kubeadm       v1.12.2</pre><p>A large number of Custom Resource Definitions (CRDs) have also been installed. These define all the resource types that Cluster API uses to represent clusters, machines, machine deployments, machine pools, and more. You can see them all by running the following command.</p><pre>kubectl get crd | grep x-k8s</pre><p>At this point, your management cluster is fully initialized and ready to provision workload clusters. The controllers are running, the CRDs are in place, and the Docker provider is standing by to create containers on demand.</p><h4>Provisioning a Workload Cluster (3 CP + 3 Workers)</h4><p>With the management cluster ready, we can now tell it to create a workload cluster. The first thing to do is set some environment variables that control the networking configuration of the new cluster. 
These values will be injected into the cluster manifest when we generate it.</p><pre>export SERVICE_CIDR=[&quot;10.20.0.0/16&quot;]<br>export POD_CIDR=[&quot;10.10.0.0/16&quot;]<br>export SERVICE_DOMAIN=&quot;myk8s-1.local&quot;<br>export POD_SECURITY_STANDARD_ENABLED=&quot;false&quot;</pre><p>The SERVICE_CIDR defines the IP range for Kubernetes Services. The POD_CIDR defines the IP range for Pods. The SERVICE_DOMAIN sets the internal DNS domain. And we disable Pod Security Standards for simplicity in this lab environment.</p><p>Now we use clusterctl to generate the cluster manifest. The development flavor is specifically designed for the Docker provider. We request Kubernetes v1.34.3 with three control plane nodes and three worker nodes.</p><pre>clusterctl generate cluster capi-quickstart --flavor development \<br>  --kubernetes-version v1.34.3 \<br>  --control-plane-machine-count=3 \<br>  --worker-machine-count=3 \<br>  &gt; capi-quickstart.yaml</pre><p>This produces a YAML file containing all the Cluster API resources needed to describe our desired cluster. 
Let us peek at what kinds of resources are in there.</p><pre>cat capi-quickstart.yaml | grep -E &#39;^apiVersion:|^kind:&#39;</pre><p>The output reveals quite a few resource types, and understanding them helps demystify how Cluster API works.</p><pre>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: ClusterClass<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerClusterTemplate<br>apiVersion: controlplane.cluster.x-k8s.io/v1beta2<br>kind: KubeadmControlPlaneTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachineTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachineTemplate<br>apiVersion: infrastructure.cluster.x-k8s.io/v1beta2<br>kind: DockerMachinePoolTemplate<br>apiVersion: bootstrap.cluster.x-k8s.io/v1beta2<br>kind: KubeadmConfigTemplate<br>apiVersion: cluster.x-k8s.io/v1beta2<br>kind: Cluster</pre><p>The ClusterClass is like a blueprint. It defines what a cluster should look like in terms of its infrastructure, control plane configuration, and worker node templates. The Cluster resource at the bottom is the actual instance that references this ClusterClass and specifies the concrete values like version and replica counts. 
The various template resources (DockerClusterTemplate, DockerMachineTemplate, KubeadmControlPlaneTemplate, KubeadmConfigTemplate) define how each piece of infrastructure and configuration should be created.</p><p>Now let us apply this manifest to the management cluster and watch the magic happen.</p><pre>kubectl apply -f capi-quickstart.yaml</pre><p>You should see output confirming that all the resources were created.</p><pre>clusterclass.cluster.x-k8s.io/quick-start created<br>dockerclustertemplate.infrastructure.cluster.x-k8s.io/quick-start-cluster created<br>kubeadmcontrolplanetemplate.controlplane.cluster.x-k8s.io/quick-start-control-plane created<br>dockermachinetemplate.infrastructure.cluster.x-k8s.io/quick-start-control-plane created<br>dockermachinetemplate.infrastructure.cluster.x-k8s.io/quick-start-default-worker-machinetemplate created<br>dockermachinepooltemplate.infrastructure.cluster.x-k8s.io/quick-start-default-worker-machinepooltemplate created<br>kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/quick-start-default-worker-bootstraptemplate created<br>cluster.cluster.x-k8s.io/capi-quickstart created</pre><p>The controllers immediately begin reconciling. The Docker provider starts creating containers, the bootstrap provider generates kubeadm configurations, and the control plane provider orchestrates the bring-up of each control plane node one by one. You can watch the progress in several ways. Opening a few terminal windows with monitoring commands makes this much more fun to observe.</p><pre>watch -d &quot;docker ps ; echo ; clusterctl describe cluster capi-quickstart&quot;<br>watch -d kubectl get cluster -o wide<br>watch -d kubectl get machines</pre><p>After a few minutes, if you run docker ps, you will see new containers appearing. These are the workload cluster nodes, each running the kindest/node image.</p><pre>docker ps</pre><p>You will notice several types of containers. 
There are three containers with port 6443 exposed, which are the control plane nodes. Three containers without exposed ports are the worker nodes. One container running kindest/haproxy serves as a load balancer in front of the control plane API servers. And the original myk8s-control-plane container is your management cluster.</p><pre>CONTAINER ID   IMAGE                                COMMAND                  NAMES<br>d7622e825f15   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-hvt8l-plmp7<br>803b5917463f   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-hvt8l-j6tpk<br>051bcc807657   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-md-0-64l6n-9hjzs-lcjv4<br>46b27b6eb154   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-md-0-64l6n-9hjzs-8rz7j<br>a5271457c0d6   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-md-0-64l6n-9hjzs-qnd5z<br>1607db8af33f   kindest/node:v1.34.3                 &quot;/usr/local/bin/entr…&quot;   capi-quickstart-hvt8l-lpp75<br>6c1f5141492c   kindest/haproxy:v20230606-42a2262b   &quot;haproxy -W -db -f /…&quot;   capi-quickstart-lb<br>2f4d7bd4d2ad   kindest/node:v1.35.0                 &quot;/usr/local/bin/entr…&quot;   myk8s-control-plane</pre><p>Now we need to get the kubeconfig for the workload cluster so we can interact with it directly.</p><pre>clusterctl get kubeconfig capi-quickstart &gt; capi-quickstart.kubeconfig</pre><p>There is an important gotcha here that is worth mentioning. If you are running KinD on your local machine, the kubeconfig file may contain an internal Docker network IP address (something like 172.18.0.3) as the API server endpoint. This address is not reachable from your host machine.</p><p>You need to replace it with the host-accessible address. Check docker ps for the capi-quickstart-lb container’s port mapping. 
If it shows something like 0.0.0.0:55000-&gt;6443/tcp, then edit the kubeconfig to point to 127.0.0.1:55000 instead.</p><pre># Check the LB port mapping<br>docker ps | grep capi-quickstart-lb<br><br># Edit the kubeconfig<br>vim capi-quickstart.kubeconfig<br># Change server: https://172.18.0.3:6443<br># To:    server: https://127.0.0.1:55000</pre><p>Try connecting to the workload cluster.</p><pre>kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes -owide</pre><p>At this point, all six nodes will show up but they will be in NotReady status. This is expected and completely normal. The nodes are not ready because there is no CNI (Container Network Interface) plugin installed yet. Without a CNI plugin, pods cannot communicate with each other, and the kubelet reports the node as not ready. Let us fix that by installing Calico.</p><pre>kubectl --kubeconfig=capi-quickstart.kubeconfig apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/calico.yaml</pre><p>Give it a minute or so, and then check the nodes again.</p><pre>kubectl --kubeconfig=capi-quickstart.kubeconfig get nodes -owide</pre><p>Now you should see all six nodes in Ready status, three control plane nodes and three workers, all running Kubernetes v1.34.3 on Debian Bookworm with containerd as the container runtime.</p><pre>NAME                                     STATUS   ROLES           AGE     VERSION   INTERNAL-IP<br>capi-quickstart-md-0-lt794-jfjvx-9wrql   Ready    &lt;none&gt;          113s    v1.34.3   192.168.97.6<br>capi-quickstart-md-0-lt794-jfjvx-f4pr4   Ready    &lt;none&gt;          113s    v1.34.3   192.168.97.5<br>capi-quickstart-md-0-lt794-jfjvx-rjfz9   Ready    &lt;none&gt;          113s    v1.34.3   192.168.97.7<br>capi-quickstart-sn62k-6n9sd              Ready    control-plane   2m14s   v1.34.3   192.168.97.4<br>capi-quickstart-sn62k-pr4kp              Ready    control-plane   46s     v1.34.3   192.168.97.9<br>capi-quickstart-sn62k-xhmtw              Ready    
control-plane   75s     v1.34.3   192.168.97.8</pre><p>You can also use clusterctl describe to get a nice summary view of the entire cluster hierarchy.</p><pre>clusterctl describe cluster capi-quickstart</pre><p>The output shows the cluster, its infrastructure (DockerCluster), the control plane (KubeadmControlPlane) with its three machines, and the workers (MachineDeployment) with their three machines. Everything should show as Ready and Available.</p><pre>NAME                                                           READY   STATUS<br>Cluster/capi-quickstart                                        True    Available<br>├─ClusterInfrastructure - DockerCluster/capi-quickstart-wcb2g  True    Ready<br>├─ControlPlane - KubeadmControlPlane/capi-quickstart-sn62k     True    Available<br>│ └─3 Machines...                                              True    Ready<br>└─Workers<br>  └─MachineDeployment/capi-quickstart-md-0-lt794               True    Available<br>    └─3 Machines...                                            True    Ready</pre><p>Your workload cluster is now fully operational.</p><h4>Deploying a Sample App</h4><p>To make things more convenient for the rest of our work, let us set up a shell alias so we do not have to type the kubeconfig flag every time.</p><pre>alias k8s1=&#39;kubectl --kubeconfig=capi-quickstart.kubeconfig&#39;<br>k8s1 cluster-info</pre><p>You should see the control plane endpoint and CoreDNS reported as running. Optionally, you can also install kube-ops-view on the workload cluster for visual monitoring. 
This time we use port-forward since we do not have a NodePort conveniently mapped.</p><pre>helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 \<br>  --set env.TZ=&quot;Asia/Seoul&quot; --namespace kube-system --kubeconfig=capi-quickstart.kubeconfig<br><br>k8s1 -n kube-system port-forward svc/kube-ops-view 8080:8080 &amp;</pre><p>Open <a href="http://127.0.0.1:8080/#scale=1.5">http://127.0.0.1:8080/#scale=1.5</a> in your browser to see the workload cluster’s visual dashboard.</p><p>Now let us deploy a simple application to prove that the cluster works end to end. We will deploy a whoami web server using a Deployment with three replicas and expose it via a NodePort Service. The pod anti-affinity rule matches the pods’ own app: webpod label, so the scheduler must place each replica on a different node.</p><pre>cat &lt;&lt; EOF | kubectl --kubeconfig=capi-quickstart.kubeconfig apply -f -<br>apiVersion: apps/v1<br>kind: Deployment<br>metadata:<br>  name: webpod<br>spec:<br>  replicas: 3<br>  selector:<br>    matchLabels:<br>      app: webpod<br>  template:<br>    metadata:<br>      labels:<br>        app: webpod<br>    spec:<br>      affinity:<br>        podAntiAffinity:<br>          requiredDuringSchedulingIgnoredDuringExecution:<br>          - labelSelector:<br>              matchExpressions:<br>              - key: app<br>                operator: In<br>                values:<br>                - webpod<br>            topologyKey: &quot;kubernetes.io/hostname&quot;<br>      containers:<br>      - name: webpod<br>        image: traefik/whoami<br>        ports:<br>        - containerPort: 80<br>---<br>apiVersion: v1<br>kind: Service<br>metadata:<br>  name: webpod<br>  labels:<br>    app: webpod<br>spec:<br>  selector:<br>    app: webpod<br>  ports:<br>  - protocol: TCP<br>    port: 80<br>    targetPort: 80<br>    nodePort: 30003<br>  type: NodePort<br>EOF</pre><p>Check that the deployment, pods, service, and endpoints are all in good shape.</p><pre>k8s1 get deploy,pod,svc,ep 
-owide</pre><p>You should see three pods running, spread across your worker nodes, with the service exposing NodePort 30003. To actually test the connectivity, we can use the management cluster’s KinD node to reach into one of the workload cluster nodes on that NodePort. First, pick one of the control plane container names from your own docker ps output (the random suffix will differ from the example below), then curl it.</p><pre>docker ps<br>CT1=capi-quickstart-hvt8l-hb4hb<br><br>docker exec -it myk8s-control-plane curl -s $CT1:30003</pre><p>You can even run a loop to see load balancing in action across the three pods.</p><pre>while true; do docker exec -it myk8s-control-plane curl -s $CT1:30003 | grep Hostname; date; sleep 1; done</pre><p>The Hostname value should change from request to request as the service spreads traffic across the pods, although consecutive requests can occasionally land on the same pod.</p><p>Since the workload cluster has a HAProxy load balancer in front of the three control plane API servers, you can also inspect its stats page. Check the LB container’s port mappings first.</p><pre>docker inspect capi-quickstart-lb | jq | grep -i hostport</pre><p>If you see a HostPort like 55001 mapped to port 8404, you can open <a href="http://127.0.0.1:55001/stats">http://127.0.0.1:55001/stats</a> in your browser to see the HAProxy statistics dashboard showing the health status of all three backend API servers.</p><h4>Upgrading the Workload Cluster (v1.34 to v1.35)</h4><p>One of the most powerful features of Cluster API is declarative cluster lifecycle management. Upgrading a cluster from one Kubernetes version to another is as simple as changing a single field in the Cluster resource. 
The controllers handle all the complexity of rolling out new nodes, draining old ones, and ensuring the cluster remains available throughout the process.</p><p>Before starting the upgrade, open several terminal windows for monitoring so you can watch the process unfold in real time.</p><pre>watch -d kubectl --kubeconfig=capi-quickstart.kubeconfig get node<br>watch -d &quot;docker ps ; echo ; clusterctl describe cluster capi-quickstart&quot;<br>kubectl --kubeconfig=capi-quickstart.kubeconfig get node -w</pre><p>If you still have kube-ops-view running, keep an eye on that browser tab as well. It is quite satisfying to watch nodes appear and disappear during the rolling upgrade.</p><p>Now, because we used ClusterClass and Topology when we created the cluster, the upgrade is done by patching the Cluster resource’s topology version field. You do not touch the KubeadmControlPlane or MachineDeployment directly; the topology controller takes care of propagating the version change to all the right places.</p><pre>kubectl patch cluster capi-quickstart --type merge -p &#39;{&quot;spec&quot;:{&quot;topology&quot;:{&quot;version&quot;:&quot;v1.35.0&quot;}}}&#39;</pre><p>The moment you run this command, the controllers spring into action. The upgrade proceeds in a careful, ordered fashion. First, the control plane nodes are upgraded one at a time. A new control plane node running v1.35.0 is created, it joins the cluster and the etcd ring, and then one of the old v1.34.3 control plane nodes is cordoned, drained, and removed. This repeats until all three control plane nodes have been replaced. Only after the entire control plane is upgraded does the controller move on to the worker nodes, following the same rolling replacement strategy.</p><p>You can watch the nodes cycling through by repeatedly checking the node list.</p><pre>k8s1 get node</pre><p>During the upgrade, you will see a mix of old and new nodes. 
At some point, you might see output like this, where the last node is being drained and removed.</p><pre>NAME                                      STATUS                     ROLES           VERSION<br>capi-quickstart-74kjs-ncktx               Ready                      control-plane   v1.35.0<br>capi-quickstart-74kjs-npnjr               Ready                      control-plane   v1.35.0<br>capi-quickstart-74kjs-tdd8d               Ready                      control-plane   v1.35.0<br>capi-quickstart-md-0-npmnz-x7b2z-892h8   Ready                      &lt;none&gt;          v1.35.0<br>capi-quickstart-md-0-npmnz-x7b2z-fszrs   Ready                      &lt;none&gt;          v1.35.0<br>capi-quickstart-md-0-npmnz-x7b2z-lksqj   Ready                      &lt;none&gt;          v1.35.0<br>capi-quickstart-md-0-npmnz-x9jx6-qx8rf   Ready,SchedulingDisabled   &lt;none&gt;          v1.34.3</pre><p>Notice the last worker node shows SchedulingDisabled. It has been cordoned and is being drained before removal.</p><p>If you check docker ps during the upgrade, you will see new containers being created with the kindest/node:v1.35.0 image while the old v1.34.3 containers are gradually removed.</p><pre>docker ps</pre><pre>CONTAINER ID   IMAGE                                NAMES<br>9a79e524c198   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-lksqj<br>f5658e9d5385   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-fszrs<br>1a71ef5e651a   kindest/node:v1.35.0                 capi-quickstart-md-0-npmnz-x7b2z-892h8<br>3baae4af04d4   kindest/node:v1.35.0                 capi-quickstart-74kjs-tdd8d<br>5e85d57bd948   kindest/node:v1.35.0                 capi-quickstart-74kjs-npnjr<br>e7742cd044b3   kindest/node:v1.35.0                 capi-quickstart-74kjs-ncktx<br>c125003ec5d7   kindest/haproxy:v20230606-42a2262b   capi-quickstart-lb<br>6419a7cf30ce   kindest/node:v1.35.0                 myk8s-control-plane</pre><p>Every container is now running 
v1.35.0. The HAProxy load balancer container remains untouched since it only proxies traffic and does not run Kubernetes itself.</p><p>You can also verify that the load balancer’s backend configuration has been updated to point to the new control plane nodes by copying the HAProxy config out of the container.</p><pre>docker cp capi-quickstart-lb:/usr/local/etc/haproxy/haproxy.cfg .<br>cat haproxy.cfg | grep capi-quickstart</pre><p>The backend server entries should now reference the new control plane containers with their new internal IP addresses.</p><p>Once the upgrade completes, all nodes will be running v1.35.0 and the cluster is fully healthy again. The entire process was triggered by a single kubectl patch command. There was no manual SSH-ing into nodes, no running kubeadm upgrade on each machine, no worrying about the order of operations. Cluster API handled everything declaratively.</p><p>This is the real power of Cluster API. Whether you are managing two clusters or two hundred, the workflow is the same: define the desired state, apply it, and let the controllers do the work. Combined with GitOps tools, you can version-control your entire cluster fleet and roll out upgrades with a pull request.</p><h3>Cleanup</h3><p>When you’re done experimenting with everything, it’s important to tear down all the resources properly so nothing lingers on your machine. We have two separate environments to clean up here: the Cluster API lab with its KinD-based management and workload clusters, and the RKE2 lab with its Vagrant-based virtual machines. Let’s walk through each one carefully.</p><p>Starting with the Cluster API side, the first thing to remove is the workload cluster that was provisioned through Cluster API. This is one of the most elegant parts of the entire Cluster API experience. Because the workload cluster is represented as a single Cluster custom resource in the management cluster, deleting it triggers the full reconciliation loop in reverse. 
The controller will drain the nodes, delete the Machine objects, tear down the Docker containers that were acting as virtual machines, and remove the load balancer. You can chain the delete command with docker ps to immediately see the effect.</p><pre>kubectl delete cluster capi-quickstart &amp;&amp; docker ps</pre><p>The output should look like this. Notice how all the workload cluster containers have disappeared, leaving only the original KinD management cluster container running.</p><pre>cluster.cluster.x-k8s.io &quot;capi-quickstart&quot; deleted<br>CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                                                             NAMES<br>037b1d199fb7   kindest/node:v1.35.0   &quot;/usr/local/bin/entr…&quot;   57 minutes ago   Up 57 minutes   0.0.0.0:30000-30001-&gt;30000-30001/tcp, 127.0.0.1:57103-&gt;6443/tcp   myk8s-control-plane</pre><p>Before this command, you had quite a few containers running. Three containers for control plane nodes, three for worker nodes, one HAProxy load balancer container, and the management cluster container itself. That’s eight containers in total. After the delete, only the management cluster’s KinD container survives. Everything else was cleaned up automatically by the Cluster API controllers. This is one of the most powerful aspects of the declarative model. You don’t have to manually SSH into machines, run drain commands, or stop services one by one. The controller handles the entire teardown sequence for you.</p><p>You should also clean up the kubeconfig file and any other artifacts that were generated during the lab. The capi-quickstart.kubeconfig file and the capi-quickstart.yaml manifest file are no longer needed.</p><pre>rm -f capi-quickstart.kubeconfig capi-quickstart.yaml</pre><p>Now that the workload cluster is completely gone, you can safely remove the KinD management cluster itself. 
This will destroy the last remaining Docker container and also clean up the kubeconfig context that KinD registered on your host machine.</p><pre>kind delete cluster --name myk8s</pre><p>The output confirms the deletion.</p><pre>Deleting cluster &quot;myk8s&quot; ...<br>Deleted nodes: [&quot;myk8s-control-plane&quot;]</pre><p>At this point, if you run docker ps, you should see no containers at all. Your Docker environment is back to a completely clean state. If you also installed clusterctl via Homebrew on macOS and no longer need it, you can optionally remove it as well.</p><pre>brew uninstall clusterctl</pre><p>You can also clean up the working directory you created at the beginning of the Cluster API lab.</p><pre>cd ~<br>rm -rf capi-docker</pre><p>That takes care of the entire Cluster API environment. Now let’s move on to the RKE2 lab.</p><p>The RKE2 lab was built on top of Vagrant with two virtual machines, k8s-node1 acting as the server (control plane) and k8s-node2 acting as the agent (worker node). Since all the RKE2 components, the systemd services, the binaries, the etcd data, the certificates, and everything else live entirely inside those VMs, the cleanest way to remove everything is to simply destroy the VMs through Vagrant. Navigate back to the directory where your Vagrantfile lives and run the destroy command with the force flag so it doesn’t prompt you for confirmation on each VM.</p><pre>cd ~/k8s-rke2<br>vagrant destroy -f</pre><p>Vagrant will stop both VMs and delete their disk images. 
You can confirm they’re gone by checking the status.</p><pre>vagrant status</pre><p>Both k8s-node1 and k8s-node2 should show as “not created,” which means they have been fully removed from your system.</p><p>If you want to clean up the Vagrantfile and init script as well since you won’t need them anymore, you can remove the entire working directory.</p><pre>cd ~<br>rm -rf k8s-rke2</pre><p>Now, there is one more thing worth mentioning for situations where you might want to remove RKE2 from a node without destroying the VM itself. Perhaps you’re working in a non-Vagrant environment, or you want to repurpose the machine for something else. RKE2 ships with an uninstall script that gets placed on the system during installation. The RKE2 documentation describes a single script, rke2-uninstall.sh, for both server and agent nodes; the separate agent script shown below follows the K3s naming convention and may not exist on your system. The uninstall script is thorough. It stops the rke2 services, kills any remaining containerd processes, removes the binaries from the data directory, deletes the systemd unit files, cleans up the yum repository configurations that were added during install, and wipes the data directories under /var/lib/rancher and /etc/rancher.</p><p>For a server node, the command would be:</p><pre>/usr/bin/rke2-uninstall.sh</pre><p>And for an agent node, if the agent-specific script is present:</p><pre>/usr/bin/rke2-agent-uninstall.sh</pre><p>If it is not present, run rke2-uninstall.sh on the agent node as well.</p><p>After running the appropriate script, the node is returned to a pre-RKE2 state. No Kubernetes components, no containerd, no certificates, and no leftover configuration. This is handy in production scenarios where you need to decommission a node or reinstall RKE2 from scratch without rebuilding the entire OS.</p><p>You should also remember to clean up the symbolic links we created during the RKE2 setup if you went the uninstall script route instead of destroying the VM. During the lab, we created several symlinks in /usr/local/bin to expose the RKE2 bundled binaries like kubectl, crictl, containerd, runc, and ctr to the system PATH. 
The uninstall script may or may not remove these depending on the version, so it’s good practice to verify.</p><pre>rm -f /usr/local/bin/kubectl<br>rm -f /usr/local/bin/crictl<br>rm -f /usr/local/bin/containerd<br>rm -f /usr/local/bin/runc<br>rm -f /usr/local/bin/ctr<br>rm -f /etc/crictl.yaml</pre><p>And the shell profile customizations we added for kubectl completion and the k alias:</p><pre>sed -i &#39;/source &lt;(kubectl completion bash)/d&#39; /etc/profile<br>sed -i &#39;/alias k=kubectl/d&#39; /etc/profile<br>sed -i &#39;/complete -F __start_kubectl k/d&#39; /etc/profile</pre><p>Again, none of this manual cleanup is necessary if you’re simply destroying the Vagrant VMs, which is by far the simplest and most complete approach for a lab environment. But it’s good to know these details for real-world scenarios where the underlying machines are long-lived and managed independently of the Kubernetes installation.</p><p>For the system upgrade controller that we installed during the automated upgrade section, that was deployed inside the RKE2 cluster itself, so it gets destroyed along with the cluster when the VMs go away. But if you were running this on a persistent cluster and wanted to remove just the upgrade controller, you would delete the resources in reverse order: first the Plan custom resources, then the controller deployment and its associated RBAC and namespace.</p><pre>kubectl delete plan -n system-upgrade server-plan agent-plan<br>kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml<br>kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml</pre><p>This removes the system-upgrade namespace, the controller deployment, the ClusterRole and ClusterRoleBinding resources, the ServiceAccount, and the Plan CRD itself. 
Your cluster continues to run normally at whatever version it was last upgraded to, and no further automatic upgrades will be attempted.</p><p>At this point, both lab environments are fully cleaned up. Your host machine should have no leftover containers, no lingering VMs, and no stale configuration files. You’re ready to start fresh whenever you want to revisit these exercises or move on to new experiments.</p><h3>Wrap Up &amp; Key Takeaways</h3><p>So let&#39;s wrap up everything we covered in this hands-on journey through RKE2 and Cluster API. This was a dense session, so it&#39;s worth taking a moment to reflect on what we actually did, what we learned, and why it matters.</p><p>We started with RKE2, Rancher&#39;s security-focused Kubernetes distribution. The very first thing you probably noticed is how simple the installation process is. A single curl command followed by running the install script is all it takes to get the binary on your machine.</p><pre>curl -sfL https://get.rke2.io --output install.sh<br>chmod +x install.sh<br>INSTALL_RKE2_CHANNEL=v1.33 ./install.sh</pre><p>From there, we wrote a configuration file and started the systemd service. That&#39;s it. No manually bootstrapping etcd, no setting up certificates by hand, no juggling multiple binaries. RKE2 handles all of that internally through its supervisor process. The control plane components like etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all come up as static pods, managed by the kubelet that RKE2 starts internally. You can confirm this after startup by checking the running pods in the kube-system namespace.</p><pre>kubectl get pod -A</pre><p>You will see etcd, kube-apiserver, kube-controller-manager, kube-scheduler, and kube-proxy all running as pods, alongside the Helm-installed components like Canal (the default CNI), CoreDNS, and the metrics server. This is a key architectural point. 
RKE2 does not install these components using raw manifests or direct binary execution on the host. Instead, it uses a combination of static pod manifests and Helm charts, giving you a consistent and manageable deployment model.</p><p>Adding a worker node was equally straightforward. On the agent node, you install the RKE2 agent binary, point it at the server&#39;s registration endpoint on port 9345, and provide the node token that the server generated during its initial startup.</p><pre>curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=&quot;agent&quot; INSTALL_RKE2_CHANNEL=v1.33 sh -<br><br>mkdir -p /etc/rancher/rke2/<br>cat &lt;&lt; EOF &gt; /etc/rancher/rke2/config.yaml<br>server: https://192.168.10.11:9345<br>token: K10cfbf1f601080e27248e795b54de68ea18961910d639be08257095a7109e0dbf0::server:5add6b365458d11cc8a0164c005fc749<br>EOF<br><br>systemctl enable --now rke2-agent.service</pre><p>Once the agent joined, we verified it from the control plane and noticed something interesting. The control plane node had no taints applied by default. This means pods can be scheduled on the control plane node as well, which is different from what you might expect if you are coming from a kubeadm-based setup where control plane nodes typically have a NoSchedule taint. We confirmed this directly.</p><pre>kubectl describe node k8s-node1 | grep -i taints<br>Taints:             &lt;none&gt;</pre><p>This is a design choice in RKE2 that you should be aware of. In production, you may want to add taints to your control plane nodes manually if you want to keep workloads off them.</p><p>Moving on to Day-2 operations, we explored certificate management. RKE2 issues client and server certificates that are valid for 365 days. When a certificate is within 120 days of expiration, RKE2 automatically renews it on the next restart. 
You can inspect all certificates and their expiration dates at any time using the built-in certificate check command.</p><pre>rke2 certificate check --output table</pre><p>This gives you a clear table showing every certificate file, its subject, usage type, expiration date, residual time, and current status. For manual rotation, the process is to stop the service, rotate the certificates, and then start the service again.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Kubernetes Cluster in an Air-Gapped Environment with Kubespray-Offline]]></title>
            <link>https://sigridjin.medium.com/building-a-kubernetes-cluster-in-an-air-gapped-environment-with-kubespray-offline-b43249319d73?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/b43249319d73</guid>
            <category><![CDATA[claude]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[docker]]></category>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 14 Feb 2026 23:29:08 GMT</pubDate>
            <atom:updated>2026-02-14T23:29:08.680Z</atom:updated>
            <content:encoded><![CDATA[<h4>A comprehensive step-by-step guide to deploying a Kubernetes cluster in an air-gapped environment using kubespray-offline, covering the entire process from setting up essential infrastructure to the final cluster installation.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5YFoHWgpW-GP87WnOuADmA.jpeg" /></figure><h3>Introduction</h3><p>Most Kubernetes tutorials assume one thing: your nodes can reach the internet. Pull an image from Docker Hub, download a binary from GitHub, install a package from the upstream repo, and it all just works. Until it doesn’t.</p><p>In regulated industries such as finance, defense, healthcare, and government, production clusters often sit behind an air gap. No outbound internet access, period. Every container image, every OS package, every Python dependency has to be staged internally before a single pod can run. If you’ve never dealt with this constraint, it might sound like a minor inconvenience. In practice, it changes almost everything about how you plan and execute a Kubernetes deployment.</p><p>I went through this process recently as part of the CloudNet@ community’s K8S Deploy study group (Week 6), and this post documents the full journey: standing up the internal infrastructure (DNS, NTP, package mirrors, a private container registry), using the kubespray-offline tooling to pre-download all required artifacts, and finally running Kubespray’s Ansible playbook to bring up a working cluster, all without a single packet leaving the private network.</p><p>The lab environment is modest (one admin server and two k8s nodes running Rocky Linux 10 on VirtualBox), but the workflow maps directly to real-world air-gapped deployments. 
By the end, you’ll have a clear picture of what it takes to go from a blank, isolated network to a functioning Kubernetes cluster with Flannel networking, a private image registry, Helm OCI support, and internal package repositories.</p><h3>Architecture &amp; Lab Setup</h3><h4>Network Topology Overview</h4><p>The lab mimics a typical air-gapped deployment pattern. Three virtual machines sit on a private network (192.168.10.0/24), and only one of them — the admin server — has a path to the outside world.</p><pre>┌─────────────────────────────────────────────────┐<br>│                 Private Network                  │<br>│               192.168.10.0/24                    │<br>│                                                  │<br>│   ┌───────────┐                                  │<br>│   │   admin    │──── enp0s8 ───► Internet        │<br>│   │  .10.10   │     (NAT GW)                    │<br>│   │           │                                  │<br>│   │  - DNS Server (BIND, :53)                   │<br>│   │  - NTP Server (chrony)                      │<br>│   │  - Container Registry (:35000)              │<br>│   │  - Nginx File Server (:80)                  │<br>│   │  - YUM/DNF Repo Mirror                      │<br>│   │  - PyPI Mirror                              │<br>│   │  - NAT Gateway                              │<br>│   └─────┬─────┘                                  │<br>│         │ enp0s9                                 │<br>│         │                                        │<br>│   ┌─────┴─────┐     ┌───────────┐               │<br>│   │ k8s-node1 │     │ k8s-node2 │               │<br>│   │  .10.11   │     │  .10.12   │               │<br>│   │  (master) │     │  (worker) │               │<br>│   └───────────┘     └───────────┘               │<br>│     enp0s8 DOWN       enp0s8 DOWN               │<br>│     (no internet)     (no internet)              │<br>└─────────────────────────────────────────────────┘</pre><p>The admin server wears many hats. 
In a production air-gapped environment, these responsibilities would typically be split across dedicated infrastructure — a DNS team managing BIND or CoreDNS, a networking team handling the NAT gateway, a platform team running Harbor for container images.</p><p>For this lab, one server does it all:</p><ul><li><strong>DNS Server (BIND)</strong> — resolves internal hostnames and forwards external domain queries</li><li><strong>NTP Server (chrony)</strong> — keeps cluster time synchronized across all nodes</li><li><strong>NAT Gateway</strong> — routes internal traffic to the internet through enp0s8 during the preparation phase, then gets disabled to simulate a true air gap</li><li><strong>Container Registry (Docker Registry)</strong> — serves all Kubernetes container images on port 35000</li><li><strong>Nginx File Server</strong> — hosts Kubernetes binaries, RPM packages, and Python packages over HTTP on port 80</li><li><strong>YUM/DNF Repository Mirror</strong> — a full mirror of Rocky Linux BaseOS, AppStream, and Extras repos</li><li><strong>PyPI Mirror</strong> — offline Python package index for Kubespray’s Ansible dependencies</li></ul><p>The two k8s nodes have their external-facing enp0s8 interfaces disabled entirely. Their only network path is enp0s9, pointing at 192.168.10.10 as the default gateway. Even DNS queries go through the admin server. These nodes are, for all practical purposes, fully isolated from the internet.</p><h4>Vagrant Environment</h4><p>The whole setup runs on VirtualBox, managed by Vagrant. Each VM gets Rocky Linux 10, 4 vCPUs, and 2 GB of RAM. 
The admin server has one important difference: its primary disk is expanded to 120 GB, because it needs to hold all the mirrored packages, container images, and binaries — easily exceeding 20 GB in total.</p><pre>BOX_IMAGE = &quot;bento/rockylinux-10.0&quot;<br>BOX_VERSION = &quot;202510.26.0&quot;<br>N = 2</pre><pre>Vagrant.configure(&quot;2&quot;) do |config|</pre><pre>  # k8s nodes<br>  (1..N).each do |i|<br>    config.vm.define &quot;k8s-node#{i}&quot; do |subconfig|<br>      subconfig.vm.box = BOX_IMAGE<br>      subconfig.vm.box_version = BOX_VERSION<br>      subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>        vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespary-offline-Lab&quot;]<br>        vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>        vb.name = &quot;k8s-node#{i}&quot;<br>        vb.cpus = 4<br>        vb.memory = 2048<br>        vb.linked_clone = true<br>      end<br>      subconfig.vm.host_name = &quot;k8s-node#{i}&quot;<br>      subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.1#{i}&quot;<br>      subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: &quot;6000#{i}&quot;,<br>                           auto_correct: true, id: &quot;ssh&quot;<br>      subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>      subconfig.vm.provision &quot;shell&quot;, path: &quot;init_cfg.sh&quot;, args: [N]<br>    end<br>  end</pre><pre>  # Admin server<br>  config.vm.define &quot;admin&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespary-offline-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;admin&quot;<br>      vb.cpus = 4<br>      
vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;admin&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.10&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: &quot;60000&quot;,<br>                         auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.disk :disk, size: &quot;120GB&quot;, primary: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;admin.sh&quot;, args: [N]<br>  end<br>end</pre><p>The linked_clone = true setting keeps disk usage manageable by sharing a base image across VMs rather than fully duplicating it. Each k8s node provisions itself through init_cfg.sh, while the admin server runs a separate admin.sh script with additional setup.</p><h3>Provisioning Scripts</h3><p><strong>admin.sh</strong> handles the admin server’s initial configuration — timezone, firewall/SELinux, IP forwarding, SSH key distribution to all nodes, Helm installation, and disk expansion:</p><pre>echo &quot;&gt;&gt;&gt;&gt; Initial Config Start &lt;&lt;&lt;&lt;&quot;</pre><pre>echo &quot;[TASK 1] Change Timezone and Enable NTP&quot;<br>timedatectl set-local-rtc 0<br>timedatectl set-timezone Asia/Seoul</pre><pre>echo &quot;[TASK 2] Disable firewalld and selinux&quot;<br>systemctl disable --now firewalld &gt;/dev/null 2&gt;&amp;1<br>setenforce 0<br>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config</pre><pre>echo &quot;[TASK 3] Setting Local DNS Using Hosts file&quot;<br>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts<br>echo &quot;192.168.10.10 admin&quot; &gt;&gt; /etc/hosts<br>for (( i=1; i&lt;=$1; i++ )); do<br>  echo &quot;192.168.10.1$i k8s-node$i&quot; &gt;&gt; /etc/hosts<br>done</pre><pre>echo &quot;[TASK 4] Delete default routing - enp0s9 NIC&quot;<br>nmcli connection modify enp0s9 ipv4.never-default yes<br>nmcli 
connection up enp0s9 &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 5] Config net.ipv4.ip_forward&quot;<br>cat &lt;&lt; EOF &gt; /etc/sysctl.d/99-ipforward.conf<br>net.ipv4.ip_forward = 1<br>EOF<br>sysctl --system &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 6] Install packages&quot;<br>dnf install -y python3-pip git sshpass cloud-utils-growpart &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 7] Install Helm&quot;<br>curl -fsSL <a href="https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3">https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3</a> \<br>  | DESIRED_VERSION=v3.20.0 bash &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 8] Increase Disk Size&quot;<br>growpart /dev/sda 3 &gt;/dev/null 2&gt;&amp;1<br>xfs_growfs /dev/sda3 &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 9] Setting SSHD&quot;<br>echo &quot;root:qwe123&quot; | chpasswd<br>cat &lt;&lt; EOF &gt;&gt; /etc/ssh/sshd_config<br>PermitRootLogin yes<br>PasswordAuthentication yes<br>EOF<br>systemctl restart sshd &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 10] Setting SSH Key&quot;<br>ssh-keygen -t rsa -N &quot;&quot; -f /root/.ssh/id_rsa &gt;/dev/null 2&gt;&amp;1<br>sshpass -p &#39;qwe123&#39; ssh-copy-id -o StrictHostKeyChecking=no \<br>  root@192.168.10.10 &gt;/dev/null 2&gt;&amp;1<br>for (( i=1; i&lt;=$1; i++ )); do<br>  sshpass -p &#39;qwe123&#39; ssh-copy-id -o StrictHostKeyChecking=no \<br>    root@192.168.10.1$i &gt;/dev/null 2&gt;&amp;1<br>done<br>for (( i=1; i&lt;=$1; i++ )); do<br>  ssh -o StrictHostKeyChecking=no root@k8s-node$i hostname &gt;/dev/null 2&gt;&amp;1<br>done</pre><pre>echo &quot;[TASK 11] Install K9s&quot;<br>CLI_ARCH=amd64<br>if [ &quot;$(uname -m)&quot; = &quot;aarch64&quot; ]; then CLI_ARCH=arm64; fi<br>wget -P /tmp <a href="https://github.com/derailed/k9s/releases/latest/download/k9s_linux_${CLI_ARCH}.tar.gz">https://github.com/derailed/k9s/releases/latest/download/k9s_linux_${CLI_ARCH}.tar.gz</a> 
&gt;/dev/null 2&gt;&amp;1<br>tar -xzf /tmp/k9s_linux_${CLI_ARCH}.tar.gz -C /tmp<br>chown root:root /tmp/k9s<br>mv /tmp/k9s /usr/local/bin/<br>chmod +x /usr/local/bin/k9s</pre><pre>echo &quot;[TASK 12] ETC&quot;<br>echo &quot;sudo su -&quot; &gt;&gt; /home/vagrant/.bashrc</pre><pre>echo &quot;&gt;&gt;&gt;&gt; Initial Config End &lt;&lt;&lt;&lt;&quot;</pre><p>Task 5 is particularly important — net.ipv4.ip_forward = 1 enables the admin server to function as a NAT gateway later, forwarding packets from the isolated k8s nodes to the internet. Task 8 uses growpart and xfs_growfs to expand the disk from the default size to the 120 GB specified in the Vagrantfile. Task 10 distributes SSH keys to all nodes, which is essential for Ansible (and therefore Kubespray) to work without password prompts.</p><p><strong>init_cfg.sh</strong> runs on each k8s node and handles Kubernetes-specific prerequisites:</p><pre>echo &quot;&gt;&gt;&gt;&gt; Initial Config Start &lt;&lt;&lt;&lt;&quot;</pre><pre>echo &quot;[TASK 1] Change Timezone and Enable NTP&quot;<br>timedatectl set-local-rtc 0<br>timedatectl set-timezone Asia/Seoul</pre><pre>echo &quot;[TASK 2] Disable firewalld and selinux&quot;<br>systemctl disable --now firewalld &gt;/dev/null 2&gt;&amp;1<br>setenforce 0<br>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config</pre><pre>echo &quot;[TASK 3] Disable and turn off SWAP &amp; Delete swap partitions&quot;<br>swapoff -a<br>sed -i &#39;/swap/d&#39; /etc/fstab<br>sfdisk --delete /dev/sda 2 &gt;/dev/null 2&gt;&amp;1<br>partprobe /dev/sda &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 4] Config kernel &amp; module&quot;<br>cat &lt;&lt; EOF &gt; /etc/modules-load.d/k8s.conf<br>overlay<br>br_netfilter<br>vxlan<br>EOF<br>modprobe overlay &gt;/dev/null 2&gt;&amp;1<br>modprobe br_netfilter &gt;/dev/null 2&gt;&amp;1</pre><pre>cat &lt;&lt; EOF &gt; /etc/sysctl.d/k8s.conf<br>net.bridge.bridge-nf-call-iptables  = 1<br>net.bridge.bridge-nf-call-ip6tables = 
1<br>net.ipv4.ip_forward                 = 1<br>EOF<br>sysctl --system &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 5] Setting Local DNS Using Hosts file&quot;<br>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts<br>echo &quot;192.168.10.10 admin&quot; &gt;&gt; /etc/hosts<br>for (( i=1; i&lt;=$1; i++ )); do<br>  echo &quot;192.168.10.1$i k8s-node$i&quot; &gt;&gt; /etc/hosts<br>done</pre><pre>echo &quot;[TASK 6] Delete default routing - enp0s9 NIC&quot;<br>nmcli connection modify enp0s9 ipv4.never-default yes<br>nmcli connection up enp0s9 &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 7] Setting SSHD&quot;<br>echo &quot;root:qwe123&quot; | chpasswd<br>cat &lt;&lt; EOF &gt;&gt; /etc/ssh/sshd_config<br>PermitRootLogin yes<br>PasswordAuthentication yes<br>EOF<br>systemctl restart sshd &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 8] Install packages&quot;<br>dnf install -y python3-pip git &gt;/dev/null 2&gt;&amp;1</pre><pre>echo &quot;[TASK 9] ETC&quot;<br>echo &quot;sudo su -&quot; &gt;&gt; /home/vagrant/.bashrc</pre><pre>echo &quot;&gt;&gt;&gt;&gt; Initial Config End &lt;&lt;&lt;&lt;&quot;</pre><p>Task 3 addresses a default Kubernetes requirement: unless it is explicitly configured to tolerate swap, the kubelet refuses to start when swap is enabled, so the script disables it, removes the fstab entry, and deletes the swap partition entirely. Task 4 registers three kernel modules that Kubernetes networking depends on for loading at boot, and loads the first two immediately with modprobe: overlay for the container filesystem, br_netfilter so bridge traffic passes through iptables (required by kube-proxy and most CNI plugins), and vxlan for overlay network tunneling. 
The sysctl parameters in the same task ensure that bridged IPv4 and IPv6 traffic is processed by iptables and ip6tables and that IP forwarding is active.</p><h4>Bringing It Up</h4><pre>mkdir k8s-offline &amp;&amp; cd k8s-offline<br><br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/Vagrantfile<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/admin.sh<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/k8s-kubespary-offline/init_cfg.sh<br><br>vagrant up<br>vagrant status</pre><p>Once all three VMs are running, SSH into each one:</p><pre>ssh root@192.168.10.10   # admin  (password: qwe123)<br>ssh root@192.168.10.11   # k8s-node1<br>ssh root@192.168.10.12   # k8s-node2</pre><p>At this point the VMs are up, the network is in place, and the k8s nodes still have internet access through their enp0s8 interfaces. The admin server has already distributed SSH keys to every node during provisioning, so Ansible can reach them without password prompts.</p><h3>Core Infrastructure Services</h3><p>With the VMs running, the next step is to cut the k8s nodes off from the internet and build the internal services they’ll depend on instead. 
Three pieces need to be in place before anything Kubernetes-related can happen: a network gateway with NAT, time synchronization via NTP, and DNS resolution.</p><h4>Network Gateway &amp; NAT</h4><p>The goal here is simple: make it so the k8s nodes have zero direct internet access, and their only path to the outside world (when needed) runs through the admin server.</p><p>This involves two things — disabling the external interface on each node and pointing their default route at the admin server.</p><p><strong>On each k8s node</strong>, bring down enp0s8 (the NAT-attached interface that Vagrant uses for internet access) and set enp0s9 (the private network interface) as the sole network path:</p><pre># Disable external interface immediately<br>nmcli connection down enp0s8<br>nmcli connection modify enp0s8 connection.autoconnect no<br><br># Add default route through admin server (priority 200)<br>nmcli connection modify enp0s9 +ipv4.routes &quot;0.0.0.0/0 192.168.10.10 200&quot;<br>nmcli connection up enp0s9</pre><p>You can verify the routing table looks correct:</p><pre>ip route</pre><p>On k8s-node1:</p><pre>default via 192.168.10.10 dev enp0s9 proto static metric 200<br>192.168.10.0/24 dev enp0s9 proto kernel scope link src 192.168.10.11 metric 100</pre><p>On k8s-node2:</p><pre>default via 192.168.10.10 dev enp0s9 proto static metric 200<br>192.168.10.0/24 dev enp0s9 proto kernel scope link src 192.168.10.12 metric 100</pre><p>The autoconnect no setting persists across reboots. 
You can confirm it by checking the connection file:</p><pre>cat /etc/NetworkManager/system-connections/enp0s8.nmconnection</pre><pre>[connection]<br>id=enp0s8<br>uuid=7f94e839-e070-4bfe-9330-07090381d89f<br>type=ethernet<br>autoconnect=false<br>...</pre><p>The enp0s9 connection file now carries the new route:</p><pre>[ipv4]<br>address1=192.168.10.11/24<br>method=manual<br>never-default=true<br>route1=0.0.0.0/0,192.168.10.10</pre><p>At this point, the nodes can’t reach anything outside the private network. A quick test confirms it:</p><pre>ping -c 1 -w 1 -W 1 8.8.8.8<br># 1 packets transmitted, 0 received, 100% packet loss<br><br>curl www.google.com<br># curl: (6) Could not resolve host: www.google.com</pre><p>Even manually setting DNS servers doesn’t help — the packets simply have nowhere to go without a NAT gateway:</p><pre>cat &lt;&lt; EOF &gt; /etc/resolv.conf<br>nameserver 168.126.63.1<br>nameserver 8.8.8.8<br>EOF<br><br>curl www.google.com<br># still fails — no route to the internet</pre><p><strong>On the admin server</strong>, enable NAT so internal traffic can reach the internet through enp0s8. IP forwarding was already enabled by admin.sh during provisioning, but you can verify and set it explicitly:</p><pre>sysctl -w net.ipv4.ip_forward=1<br><br>cat &lt;&lt;EOF &gt; /etc/sysctl.d/99-ipforward.conf<br>net.ipv4.ip_forward = 1<br>EOF<br>sysctl --system</pre><p>Now add the MASQUERADE rule:</p><pre>iptables -t nat -A POSTROUTING -o enp0s8 -j MASQUERADE</pre><p>Verify it:</p><pre>iptables -t nat -L -n -v</pre><pre>Chain POSTROUTING (policy ACCEPT 1 packets, 120 bytes)<br> pkts bytes target     prot opt in     out     source               destination<br>    2   168 MASQUERADE  all  --  *      enp0s8  0.0.0.0/0            0.0.0.0/0</pre><p>Go back to any k8s node and test — external connectivity should now work through the admin server’s NAT.</p><p>The important part comes next. 
Remove the NAT rule to simulate the actual air-gapped state:</p><pre>iptables -t nat -D POSTROUTING -o enp0s8 -j MASQUERADE</pre><p>Back on the k8s nodes, internet access is gone again. This toggle — NAT on during preparation, NAT off during installation — is how we prove the cluster deployment works entirely offline. For the rest of this guide, the NAT rule stays off unless explicitly noted otherwise.</p><h4>NTP Server &amp; Client (chrony)</h4><p>Kubernetes components are sensitive to clock skew. Certificate validation, etcd leader election, log correlation — all of these break or behave unpredictably when nodes disagree about what time it is. In an air-gapped environment, the nodes can’t reach public NTP pools, so the admin server needs to act as the internal time source.</p><p><strong>On the admin server</strong>, start by checking the current chrony status and configuration:</p><pre>systemctl status chronyd.service --no-pager<br>grep &quot;^[^#]&quot; /etc/chrony.conf</pre><p>The default Rocky Linux configuration looks like this:</p><pre>pool 2.rocky.pool.ntp.org iburst<br>sourcedir /run/chrony-dhcp<br>driftfile /var/lib/chrony/drift<br>makestep 1.0 3<br>rtcsync<br>ntsdumpdir /var/lib/chrony<br>logdir /var/log/chrony</pre><p>A few of these settings are worth understanding. The iburst option sends a burst of packets right after startup to synchronize quickly instead of waiting for the normal polling interval.</p><p>The driftfile records how much the local hardware clock drifts from real time — chrony uses this to compensate even when the network is unavailable. 
The makestep 1.0 3 directive tells chrony to force an immediate time jump (rather than gradual slewing) if the offset exceeds 1 second, but only during the first 3 updates after startup.</p><p>And rtcsync periodically copies the system clock to the hardware RTC, so the time stays reasonably accurate across reboots.</p><p>You can check which NTP sources chrony is currently using:</p><pre>chronyc sources -v<br>dig +short 2.rocky.pool.ntp.org</pre><p>Now replace the configuration with one suited for our air-gapped setup:</p><pre>cp /etc/chrony.conf /etc/chrony.bak</pre><pre>cat &lt;&lt; EOF &gt; /etc/chrony.conf<br># External NTP servers (used during preparation phase)<br>server pool.ntp.org iburst<br>server kr.pool.ntp.org iburst</pre><pre># Allow internal network clients to sync from this server<br>allow 192.168.10.0/24</pre><pre># If external servers become unreachable, serve local time<br>local stratum 10</pre><pre>logdir /var/log/chrony<br>EOF</pre><pre>systemctl restart chronyd.service</pre><p>The local stratum 10 line is critical for air-gapped operation. Without it, chrony would stop serving time to clients once it loses contact with upstream servers. 
With this setting, the admin server falls back to its own clock at stratum 10 — not perfectly accurate, but good enough to keep the cluster running.</p><p>Verify the configuration:</p><pre>timedatectl status</pre><pre>Local time: Wed 2026-02-11 22:36:16 KST<br>           Universal time: Wed 2026-02-11 13:36:16 UTC<br>                 RTC time: Wed 2026-02-11 14:09:31<br>                Time zone: Asia/Seoul (KST, +0900)<br>System clock synchronized: yes<br>              NTP service: active<br>          RTC in local TZ: no</pre><pre>chronyc sources -v</pre><p><strong>On each k8s node</strong>, point chrony at the admin server:</p><pre>cp /etc/chrony.conf /etc/chrony.bak<br><br>cat &lt;&lt; EOF &gt; /etc/chrony.conf<br>server 192.168.10.10 iburst<br>logdir /var/log/chrony<br>EOF<br><br>systemctl restart chronyd.service</pre><p>Check that synchronization is working:</p><pre>timedatectl status</pre><pre>Local time: Wed 2026-02-11 22:38:19 KST<br>           Universal time: Wed 2026-02-11 13:38:19 UTC<br>                 RTC time: Wed 2026-02-11 14:11:34<br>                Time zone: Asia/Seoul (KST, +0900)<br>System clock synchronized: no<br>              NTP service: active<br>          RTC in local TZ: no</pre><pre>chronyc sources -v</pre><pre>MS Name/IP address         Stratum Poll Reach LastRx Last sample<br>===============================================================================<br>^* admin                         0   7     0     -     +0ns[   +0ns] +/-    0ns</pre><p>The ^* prefix indicates this source is currently selected and synchronized. 
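</p><p>Right after a restart, timedatectl can still report System clock synchronized: no, as in the output above, until chrony has accepted enough samples from the admin server. For a fuller picture than the one-line source listing, chronyc tracking is worth a look. A minimal sketch, run on a k8s node:</p><pre># Reference ID should point at the admin server, and<br># Leap status should read Normal once the clock is synchronized<br>chronyc tracking<br><br># Optional: block until synchronized, giving up after 10 checks<br>chronyc waitsync 10</pre><p>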
Back <strong>on the admin server</strong>, you can confirm both nodes are connecting:</p><pre>chronyc clients</pre><pre>Hostname                      NTP   Drop Int IntL Last     Cmd   Drop Int  Last<br>===============================================================================<br>k8s-node1                       3      0   1   -     1       0      0   -     -<br>k8s-node2                       2      0   1   -     0       0      0   -     -</pre><p>Both nodes are syncing their clocks through the admin server. Even after the NAT rule is removed and the admin server itself can’t reach public NTP pools, the local stratum 10 fallback keeps time flowing to the cluster.</p><h4>DNS Server &amp; Client (BIND)</h4><p>The k8s nodes need to resolve hostnames — both internal names like admin and k8s-node1, and external names like registry.k8s.io (which appear in container image references even if the actual pull happens from the private registry). BIND on the admin server handles both: it resolves internal queries directly and forwards everything else to upstream DNS servers (when the NAT is active) or returns cached results (when offline).</p><p><strong>On the admin server</strong>, install BIND:</p><pre>dnf install -y bind bind-utils</pre><p>Write the full /etc/named.conf:</p><pre>cp /etc/named.conf /etc/named.bak</pre><pre>cat &lt;&lt;EOF &gt; /etc/named.conf<br>options {<br>        listen-on port 53 { any; };<br>        listen-on-v6 port 53 { ::1; };<br>        directory       &quot;/var/named&quot;;<br>        dump-file       &quot;/var/named/data/cache_dump.db&quot;;<br>        statistics-file &quot;/var/named/data/named_stats.txt&quot;;<br>        memstatistics-file &quot;/var/named/data/named_mem_stats.txt&quot;;<br>        secroots-file   &quot;/var/named/data/named.secroots&quot;;<br>        recursing-file  &quot;/var/named/data/named.recursing&quot;;<br>        allow-query     { 127.0.0.1; 192.168.10.0/24; };<br>        allow-recursion { 127.0.0.1; 
192.168.10.0/24; };</pre><pre>        forwarders {<br>                168.126.63.1;<br>                8.8.8.8;<br>        };</pre><pre>        recursion yes;</pre><pre>        dnssec-validation auto;</pre><pre>        managed-keys-directory &quot;/var/named/dynamic&quot;;<br>        geoip-directory &quot;/usr/share/GeoIP&quot;;</pre><pre>        pid-file &quot;/run/named/named.pid&quot;;<br>        session-keyfile &quot;/run/named/session.key&quot;;</pre><pre>        include &quot;/etc/crypto-policies/back-ends/bind.config&quot;;<br>};</pre><pre>logging {<br>        channel default_debug {<br>                file &quot;data/named.run&quot;;<br>                severity dynamic;<br>        };<br>};</pre><pre>zone &quot;.&quot; IN {<br>        type hint;<br>        file &quot;named.ca&quot;;<br>};</pre><pre>include &quot;/etc/named.rfc1912.zones&quot;;<br>include &quot;/etc/named.root.key&quot;;<br>EOF</pre><p>The key options here: listen-on port 53 { any; } makes BIND listen on all interfaces, so the k8s nodes can reach it. allow-query and allow-recursion restrict access to localhost and the private network. The forwarders block points to KT&#39;s public DNS (168.126.63.1) and Google&#39;s (8.8.8.8) — these handle any external domain lookups when the NAT gateway is active.</p><p>The dnssec-validation auto setting enables DNSSEC validation; if you run into resolution failures related to DNSSEC, you might need to set this to no as a workaround.</p><p>Validate and start the service:</p><pre>named-checkconf /etc/named.conf   # no output means no errors<br>systemctl enable --now named</pre><p>Set the admin server itself to use its own DNS:</p><pre>echo &quot;nameserver 192.168.10.10&quot; &gt; /etc/resolv.conf</pre><p>Test it:</p><pre>dig +short google.com @192.168.10.10<br># 142.250.183.110</pre><pre>dig +short google.com<br># 142.250.183.110</pre><p><strong>On each k8s node</strong>, there’s an extra step needed. 
NetworkManager likes to overwrite /etc/resolv.conf on every connection change or reboot. To prevent that:</p><pre>cat &lt;&lt; EOF &gt; /etc/NetworkManager/conf.d/99-dns-none.conf<br>[main]<br>dns=none<br>EOF</pre><pre>systemctl restart NetworkManager</pre><p>Now set the DNS to point at the admin server:</p><pre>echo &quot;nameserver 192.168.10.10&quot; &gt; /etc/resolv.conf</pre><p>Test DNS resolution — and this is the interesting part. Even though the k8s nodes have no internet access (the NAT rule is off), DNS queries still work because BIND on the admin server forwards them through its own enp0s8 interface:</p><pre>dig +short google.com @192.168.10.10<br># 142.250.183.110</pre><pre>dig +short google.com<br># 142.250.183.110</pre><p>The query reaches the admin server’s BIND instance, which forwards it to the upstream DNS servers through its internet-connected interface, and returns the answer to the k8s node. The k8s node itself never touches the internet directly. Once the admin server’s own external access is also removed (for the fully air-gapped installation phase), BIND can only answer from its cache, and only until each cached record’s TTL expires. New external domains won’t resolve, but that’s fine — by that point, everything the cluster needs is already available locally.</p><p>With NAT, NTP, and DNS in place, the admin server is ready to take on its next set of responsibilities: hosting the package repositories, container images, and Python packages that Kubespray will need during the offline installation.</p><h3>Offline Repositories</h3><p>Before touching kubespray-offline, it helps to understand the three categories of dependencies that Kubernetes needs — and how to serve each of them internally. The admin server has internet access during the preparation phase, so we use that window to mirror everything into local repositories. 
Once the mirror is ready and validated, the external route can be removed entirely.</p><p>kubespray-offline will automate most of this in a later step (Section 5), but knowing how each piece works makes troubleshooting far easier when something inevitably goes wrong.</p><h4>Local YUM/DNF Mirror</h4><p>Kubernetes nodes need OS-level packages — container runtimes, SELinux policy modules, socat, conntrack, ipvsadm, and many others. In an air-gapped network, dnf install has nowhere to go unless you provide a local repository.</p><p>On the admin server, dnf reposync pulls the entire upstream repository tree down to a local directory:</p><pre>dnf reposync --repoid=baseos --download-metadata -p /data/repos/rocky/10<br>dnf reposync --repoid=appstream --download-metadata -p /data/repos/rocky/10<br>dnf reposync --repoid=extras --download-metadata -p /data/repos/rocky/10</pre><p>This took about 12 minutes in the lab. The baseos repo came in at 6.2 GB, appstream at 14 GB, and extras at a modest 67 MB — roughly 20 GB total. Once synced, nginx serves the directory over HTTP at <a href="http://192.168.10.10/rpms/rocky/10/">http://192.168.10.10/rpms/rocky/10/</a>.</p><p>On each k8s node, the default Rocky Linux repo files get replaced with a single internal-rocky.repo pointing all three channels to the admin server&#39;s URL. After that, dnf install and dnf update work exactly as they would on a connected system — the nodes just don&#39;t know (or care) that the packages are coming from a machine one hop away.</p><p>One thing to watch out for: reposync copies the RPMs, but you also need the repository metadata. The --download-metadata flag handles this, but if you ever re-sync and the metadata gets stale, run createrepo --update on the directory to regenerate it. 
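</p><p>As a minimal sketch of that refresh, assuming reposync created one subdirectory per repo ID under /data/repos/rocky/10 (adjust the paths if your layout differs):</p><pre># On the admin server: regenerate metadata for each mirrored repo<br>for repo in baseos appstream extras; do<br>  createrepo --update &quot;/data/repos/rocky/10/${repo}&quot;<br>done<br><br># On each k8s node: drop cached metadata and fetch the fresh copy<br>dnf clean metadata<br>dnf makecache</pre><p>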
Stale metadata leads to dependency resolution failures that look completely unrelated to the actual problem.</p><h4>Private Container Registry</h4><p>Every container image that the cluster needs — kube-apiserver, kube-proxy, coredns, flannel, pause, etcd, metrics-server, and more — must be available from an internal registry. In this lab, we run Docker Registry v3.0.0 on port 35000:</p><pre>nerdctl run -d --name registry --network host \<br>  -e REGISTRY_HTTP_ADDR=0.0.0.0:35000 \<br>  -v /var/lib/registry:/var/lib/registry \<br>  registry:3.0.0</pre><p>The --network host flag avoids NAT overhead and keeps things simple for a single-node registry. The data directory at /var/lib/registry is mounted as a volume, so registry contents survive container restarts.</p><p>The workflow for populating it is straightforward: pull an image on a machine with internet access, tag it with the 192.168.10.10:35000 prefix, then push. For the 55 container images that Kubespray requires, this process gets automated by load-push-all-images.sh in the kubespray-offline toolchain — but the principle is the same whether you push one image or five hundred.</p><p>Since this is an HTTP registry (not HTTPS), every node’s containerd configuration must include it as an insecure registry. Miss this step and image pulls fail with errors along the lines of &quot;http: server gave HTTP response to HTTPS client&quot;, which can be confusing if you’re not expecting them.</p><h4>Private PyPI Mirror</h4><p>This one catches people off guard. Kubespray is an Ansible project, and Ansible has Python dependencies — jinja2, PyYAML, netaddr, cryptography, and a handful of others. If the nodes can&#39;t reach pypi.org, pip install stalls on retries and then fails with connection or SSL errors that are easy to misread.</p><p>Two approaches work here. The first is devpi-server, which runs a full PyPI-compatible index at http://192.168.10.10:3141. It caches packages as they&#39;re requested, which is convenient but requires the server itself to have had internet access at least once to prime the cache. 
The second approach uses pypi-mirror to generate a static file-based index served directly by nginx at http://192.168.10.10/pypi/. This is simpler to reason about — it&#39;s just files on disk — and pairs naturally with the nginx instance that&#39;s already serving RPMs.</p><p>Either way, global pip configuration goes into /etc/pip.conf:</p><pre>[global]<br>index-url = http://192.168.10.10/pypi/<br>trusted-host = 192.168.10.10</pre><p>With this in place, every pip install on the network resolves against the internal mirror. The trusted-host line is necessary because we&#39;re serving over plain HTTP — without it, pip refuses the connection by default.</p><p>Worth noting: kubespray-offline’s pypi-mirror.sh script handles the package download during the preparation phase, and setup-offline.sh writes the pip configuration automatically during deployment. Understanding the manual setup here mostly pays off when you need to add Python packages later, after the initial deployment is done and you&#39;re operating the cluster day-to-day.</p><h3>Kubespray-Offline: Download &amp; Prepare</h3><p>With the core infrastructure in place — NAT gateway, DNS, NTP, and local repos all running on the admin server — it’s time to tackle the big question: how do you get <em>everything</em> Kubernetes needs onto a machine that can’t reach the internet?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A3Ma57XOlf-3qaAx" /><figcaption><a href="https://devlos.tistory.com/120">https://devlos.tistory.com/120</a></figcaption></figure><p>The answer is kubespray-offline. This project wraps around Kubespray (the popular Ansible-based Kubernetes installer) and adds a complete offline preparation layer. You run a single master script on an internet-connected machine, it pulls down every binary, container image, RPM package, and Python dependency that Kubespray will need, and packages them into a portable outputs/ directory. 
Copy that directory to your air-gapped admin server, run a few setup scripts, and you&#39;ve got a fully self-contained deployment environment.</p><p>In my case, the entire download took about 17 minutes and produced roughly 3.3 GB of artifacts.</p><pre>git clone https://github.com/kubespray-offline/kubespray-offline.git<br>cd kubespray-offline</pre><h4>How download-all.sh Works</h4><p>The entry point is download-all.sh, and it&#39;s essentially a runner that calls 10 sub-scripts in sequence. Each one handles a specific category of dependencies. Here&#39;s what happens when you execute it:</p><pre>cd /root/kubespray-offline<br>./download-all.sh</pre><p>The first script to run is config.sh. This is where every version number lives — Kubespray itself, containerd, runc, the CNI plugins, nginx, the registry image, and so on. In my setup it looked like this:</p><pre># config.sh (key variables)<br>KUBESPRAY_VERSION=2.30.0<br>CONTAINERD_VERSION=2.2.1<br>RUNC_VERSION=1.3.4<br>CNI_VERSION=1.8.0<br>NGINX_VERSION=1.29.4<br>REGISTRY_VERSION=3.0.0</pre><p>Getting these version numbers right matters more than you might expect. If config.sh says containerd 2.2.1 but you later tell Kubespray to use a different version in your inventory, the playbook will try to download a file that doesn&#39;t exist on your local nginx server — and in an air-gapped network, there&#39;s nowhere to fall back to.</p><p>After config.sh, the scripts run in this order:</p><p>precheck.sh validates that podman or docker is installed and checks SELinux status. Nothing fancy, but it catches the obvious &quot;you forgot to install podman&quot; mistake early.</p><p>prepare-pkgs.sh installs the system-level tools needed for the rest of the process: rsync, gcc, libffi-devel, createrepo, git, podman, and python3. 
These all come from the standard Rocky Linux repos, so this script needs internet access.</p><p>prepare-py.sh creates a Python virtual environment at ~/.venv/3.12 and installs the Ansible dependencies into it. Kubespray is fundamentally a collection of Ansible playbooks, so having the right Python environment is a prerequisite for everything that follows.</p><p>get-kubespray.sh downloads the kubespray-2.30.0.tar.gz release from GitHub and extracts it into the cache directory.</p><p>pypi-mirror.sh is where things get interesting for the offline story. It downloads every Python package that Kubespray&#39;s Ansible roles will need — pip, setuptools, ansible-core, PyYAML, jinja2, and dozens of others — into a local directory that will later be served as a static PyPI mirror.</p><p>download-kubespray-files.sh does the heaviest lifting. It runs Kubespray&#39;s own contrib/offline/generate_list.sh to produce two critical files: files.list and images.list. The files list contained 27 URLs pointing to binaries like kubectl, kubelet, kubeadm, etcd, containerd, crictl, and the CNI plugins. The images list had 55 container image references — everything from registry.k8s.io/kube-apiserver to docker.io/flannel/flannel to the CoreDNS and metrics-server images. The script then downloads every one of these.</p><pre># Generated lists (under outputs/)<br>outputs/files/files.list    # 27 binary URLs<br>outputs/images/images.list  # 55 container images</pre><p>download-additional-containers.sh pulls two extra images that aren&#39;t part of Kubernetes itself but are needed for the offline infrastructure: nginx:1.29.4 (to serve files over HTTP) and registry:3.0.0 (the private container registry).</p><p>create-repo.sh builds a local RPM repository using createrepo and adds modulemd metadata for RHEL 8/9 compatibility. 
This repo gets bundled into the outputs directory so that target nodes can install OS packages without touching the internet.</p><p>Finally, copy-target-scripts.sh copies the deployment scripts into outputs/, making the whole directory self-contained and ready to transfer.</p><p>When download-all.sh finishes, the outputs/ directory has everything:</p><pre>outputs/<br>├── files/          # Kubernetes binaries (kubectl, kubelet, etcd, containerd, etc.)<br>├── images/         # Container images as .tar.gz archives<br>├── rpms/           # RPM packages with repo metadata<br>├── pypi/           # Python packages for offline pip<br>├── kubespray-2.30.0/   # Kubespray source with playbooks<br>├── setup-container.sh<br>├── start-nginx.sh<br>├── start-registry.sh<br>├── load-push-all-images.sh<br>├── setup-offline.sh<br>└── setup-py.sh</pre><p>At this point, you copy the entire outputs/ directory to your air-gapped admin server (via USB, SCP through a bastion, or whatever transfer method your security policy allows) and move on to deployment.</p><h4>Deploying the Nginx File Server and Registry</h4><p>On the air-gapped admin server, the deployment scripts in outputs/ are meant to be run in a specific order. Skipping a step or running them out of sequence will break things downstream.</p><p>The first script is setup-container.sh. It installs containerd, runc, nerdctl, and the CNI plugins from the local files — no package manager involved, just direct binary installation from the archives that were downloaded earlier. After this runs, you have a working container runtime on the admin server.</p><pre>cd /root/kubespray-offline/outputs<br>./setup-container.sh</pre><p>Next, start-nginx.sh launches an nginx container on port 80. This nginx instance serves the entire outputs/ directory over HTTP — the binaries in files/, the container image tarballs in images/, the Python packages in pypi/, and the RPM repository in rpms/. 
Any node on the 192.168.10.0/24 network can now fetch these artifacts from <a href="http://192.168.10.10/">http://192.168.10.10/</a>.</p><pre>./start-nginx.sh<br># nginx:1.29.4 now running on port 80<br># Serves: files/, images/, pypi/, rpms/ over HTTP</pre><p>Then start-registry.sh brings up the Docker Registry v3.0.0 container, listening on port 35000. At this point the registry is empty — it&#39;s just waiting for images to be pushed into it.</p><pre>./start-registry.sh<br># registry:3.0.0 now running on port 35000</pre><p>One thing worth noting: the registry runs with --network host and binds directly to 0.0.0.0:35000, so there&#39;s no port mapping or container networking complexity to debug. If port 35000 is reachable, the registry is reachable.</p><h4>Loading and Pushing Images</h4><p>This is the step that takes the longest in the deployment phase, and it’s also the one most likely to reveal problems if something went wrong during the download phase.</p><p>load-push-all-images.sh iterates through every .tar.gz file in the images/ directory. For each image, it loads the archive into the local containerd image store, re-tags it with the 192.168.10.10:35000 prefix, and pushes it to the private registry.</p><pre>./load-push-all-images.sh</pre><p>Under the hood, each image goes through three operations:</p><pre># What happens for each image (conceptually):<br>nerdctl load -i images/kube-apiserver-v1.34.3.tar.gz<br>nerdctl tag registry.k8s.io/kube-apiserver:v1.34.3 192.168.10.10:35000/kube-apiserver:v1.34.3<br>nerdctl push 192.168.10.10:35000/kube-apiserver:v1.34.3 --insecure-registry</pre><p>With 55 images to process, this took about 2–3 minutes in my lab. On slower hardware or with larger image sets, expect it to take longer.</p><p>Once the script finishes, you can verify everything landed correctly:</p><pre>curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool</pre><p>This should return a JSON list of all 55+ repositories in the registry. 
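</p><p>Eyeballing 55 entries is error-prone, so the comparison can be scripted. The following is a rough sketch, not part of kubespray-offline: it assumes it is run from outputs/, that images.list holds one image reference per line, and that the retag rule simply drops the upstream registry host (as in the kube-apiserver example above). If the toolchain maps some names differently, the script will report false positives.</p><pre>#!/usr/bin/env bash<br># Hypothetical helper: list images from images.list whose repository<br># is absent from the private registry&#39;s catalog.<br>REGISTRY=&quot;192.168.10.10:35000&quot;<br>LIST=&quot;images/images.list&quot;   # relative to outputs/<br><br># Ask for up to 1000 entries so the catalog is not cut short by pagination<br>catalog=$(curl -s &quot;http://${REGISTRY}/v2/_catalog?n=1000&quot; \<br>  | python3 -c &#39;import json,sys; print(&quot;\n&quot;.join(json.load(sys.stdin)[&quot;repositories&quot;]))&#39;)<br><br>while read -r image || [ -n &quot;$image&quot; ]; do<br>  [ -z &quot;$image&quot; ] &amp;&amp; continue<br>  repo=&quot;${image#*/}&quot;    # drop the upstream registry host (assumed retag rule)<br>  repo=&quot;${repo%%@*}&quot;    # drop a digest, if present<br>  repo=&quot;${repo%%:*}&quot;    # drop the tag<br>  grep -qxF &quot;$repo&quot; &lt;&lt;&lt; &quot;$catalog&quot; || echo &quot;MISSING: ${repo} (from ${image})&quot;<br>done &lt; &quot;$LIST&quot;</pre><p>No output means every repository named in images.list has a counterpart in the catalog. The check is at repository level only; it does not confirm that the specific tag was pushed.</p><p>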
If any image is missing here, Kubespray will fail when it tries to pull that image during cluster installation — and the error message won’t always make it obvious that the root cause is a missing image in your local registry.</p><p>After the images are pushed, two more setup scripts finalize the environment:</p><p>setup-offline.sh configures the admin server to use its own local repositories. It writes /etc/yum.repos.d/offline.repo pointing to the local RPM mirror, and creates ~/.config/pip/pip.conf pointing pip at the local PyPI mirror. From this point forward, even dnf install and pip install on the admin server itself go through the local copies.</p><p>setup-py.sh installs Python 3.12 from the offline RPM repository and prepares the Ansible environment. Kubespray&#39;s playbooks need a specific set of Python packages, and this script makes sure they&#39;re all available without any network calls.</p><pre>./setup-offline.sh    # Configure local repos for dnf and pip<br>./setup-py.sh         # Install Python 3.12 + Ansible from offline packages</pre><p>At this point, the admin server is fully self-sufficient. It has a file server, a container registry populated with every required image, local package repositories for both RPMs and Python packages, and a working Ansible installation. The cluster deployment can proceed entirely offline.</p><p>The whole deployment phase — from running setup-container.sh to having a populated registry and configured repos — took about 3 minutes in my lab environment. Most of that time was spent in load-push-all-images.sh. The other scripts finished in seconds.</p><h3>Installing &amp; Verifying the Cluster</h3><p>At this point, the admin server is doing a lot of heavy lifting. It’s serving OS packages over HTTP, hosting a container registry on port 35000, mirroring PyPI, and running an nginx file server with all the Kubernetes binaries staged and ready. 
The k8s nodes can reach all of these services over the internal 192.168.10.0/24 network, and nothing else.</p><p>Now comes the part where all that preparation pays off.</p><h4>Preparing the Nodes for Offline Installation</h4><p>Before Kubespray touches the k8s nodes, each one needs a few things in place: a working container runtime, access to the offline package repo, and a pip configuration pointing at the internal mirror. The kubespray-offline project ships a set of scripts that handle this sequentially.</p><p>On the admin server, from the outputs directory:</p><pre>cd /root/kubespray-offline/outputs<br><br># Install containerd, runc, nerdctl, and CNI plugins from local files<br>./setup-container.sh<br><br># Start nginx on port 80 — serves binaries, images, PyPI, and RPMs<br>./start-nginx.sh<br><br># Start the private registry on port 35000<br>./start-registry.sh<br><br># Load every .tar.gz image, re-tag with the 192.168.10.10:35000 prefix, and push<br>./load-push-all-images.sh<br><br># Point yum and pip at the internal servers<br>./setup-offline.sh<br><br># Install Python 3.12 from the offline repo<br>./setup-py.sh</pre><p>The load-push-all-images.sh step takes the longest. It walks through every image archive in the outputs/images/ directory, loads each one into the local container runtime, re-tags it with the 192.168.10.10:35000 prefix, and pushes it to the private registry. For 55 images, expect this to take a couple of minutes.</p><p>After setup-offline.sh runs, you can verify that the admin server is pointed at the right repos (the k8s nodes receive the equivalent configuration later, from the offline-repo playbook):</p><pre># Check yum is configured for the internal mirror<br>cat /etc/yum.repos.d/offline.repo<br><br># Check pip is configured for the internal PyPI<br>cat ~/.config/pip/pip.conf</pre><h4>Kubespray Configuration</h4><p>Kubespray lives inside the outputs directory as a tarball that was extracted during the download phase. 
The inventory and group variables need a few edits before anything can run.</p><pre>cd /root/kubespray-offline/outputs/kubespray-2.30.0</pre><h4>The Inventory File</h4><p>The inventory defines which nodes play which roles. For this two-node setup, k8s-node1 serves as the control plane and etcd host, while k8s-node2 is a pure worker.</p><pre># inventory/mycluster/inventory.ini<br><br>[all]<br>k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11<br>k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12<br><br>[kube_control_plane]<br>k8s-node1<br><br>[etcd]<br>k8s-node1<br><br>[kube_node]<br>k8s-node1<br>k8s-node2<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>Nothing unusual here — a minimal two-node topology. In production you’d want three etcd members and separate control plane nodes, but for validating the offline workflow, this is enough.</p><h4>The offline.yml File</h4><p>This is the file that makes the entire air-gap setup work. It tells Kubespray where to find every binary and container image, replacing all the default upstream URLs with internal equivalents.</p><pre># inventory/mycluster/group_vars/all/offline.yml<br><br>http_server: &quot;http://192.168.10.10&quot;<br>registry_host: &quot;192.168.10.10:35000&quot;</pre><pre># Binaries<br>kubeadm_download_url: &quot;{{ http_server }}/kubernetes/kubeadm-{{ kube_version }}-linux-{{ image_arch }}&quot;<br>kubectl_download_url: &quot;{{ http_server }}/kubernetes/kubectl-{{ kube_version }}-linux-{{ image_arch }}&quot;<br>kubelet_download_url: &quot;{{ http_server }}/kubernetes/kubelet-{{ kube_version }}-linux-{{ image_arch }}&quot;</pre><pre># This one bit me — the default has &quot;linux-amd64&quot; hardcoded.<br># If you&#39;re on ARM64 or want portability, use the template variable.<br>etcd_download_url: &quot;{{ http_server }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz&quot;</pre><pre>containerd_download_url: &quot;{{ http_server }}/containerd-{{ 
containerd_version }}-linux-{{ image_arch }}.tar.gz&quot;<br>runc_download_url: &quot;{{ http_server }}/runc.{{ image_arch }}&quot;<br>cni_download_url: &quot;{{ http_server }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz&quot;</pre><pre># Container images — all pulled from the private registry<br>kube_image_repo: &quot;{{ registry_host }}&quot;<br>gcr_image_repo: &quot;{{ registry_host }}&quot;<br>docker_image_repo: &quot;{{ registry_host }}&quot;<br>quay_image_repo: &quot;{{ registry_host }}&quot;</pre><pre># Registry mirror configuration for containerd<br>containerd_registries_mirrors:<br>  - prefix: docker.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>  - prefix: quay.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>  - prefix: registry.k8s.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;</pre><p>A word on the etcd_download_url line. The default value shipped by kubespray-offline had linux-amd64 hardcoded instead of linux-{{ image_arch }}. On an x86_64 machine this works fine and you&#39;d never notice.</p><p>But the moment you try to run this on an ARM64 host, the download fails with a 404 because the file simply doesn&#39;t exist at that path. Swapping in {{ image_arch }} makes the template portable across architectures.</p><h4>Cluster Variables</h4><p>Beyond the offline-specific settings, a few cluster-level variables need attention. 
These go in inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml or can be passed as extra vars:</p><pre># Network plugin — Flannel is lightweight and works well for lab environments<br>kube_network_plugin: flannel<br>flannel_interface: enp0s9</pre><pre># Proxy mode<br>kube_proxy_mode: iptables</pre><pre># Disable nodelocaldns — one less thing to debug in a lab<br>enable_nodelocaldns: false</pre><pre># Enable Helm and metrics-server<br>helm_enabled: true<br>metrics_server_enabled: true</pre><p>The flannel_interface setting matters more than it might look. VirtualBox VMs typically have multiple network interfaces — enp0s3 for NAT, enp0s8 for the host-only network, enp0s9 for the internal network.</p><p>Flannel needs to bind to the right one, and if you don&#39;t specify it, it might pick the NAT interface and nothing will route correctly. I spent an annoying amount of time on this the first time around.</p><h4>Running the Playbook</h4><p>With the configuration in place, the actual deployment happens in two playbook runs.</p><p>The first playbook sets up the offline repository configuration on all target nodes — making sure yum and pip point to the admin server:</p><pre>cd /root/kubespray-offline/outputs/kubespray-2.30.0</pre><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  offline-repo/playbook/offline-repo.yml</pre><p>This is a short run. It pushes the repo files and pip configuration to each node, and verifies that packages can be resolved from the internal mirror.</p><p>The second playbook is the main event — the full Kubernetes cluster deployment:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  cluster.yml \<br>  -e kube_version=&quot;1.34.3&quot;</pre><p>Most of the time goes to the containerd installation, etcd bootstrap, and control plane initialization. 
Flannel comes up quickly once the kubelet is running.</p><p>If something fails mid-run — and in an air-gap environment, something probably will on your first attempt — Ansible’s idempotency means you can fix the issue and re-run the same command. It’ll skip the tasks that already completed successfully.</p><p>A few things that tend to go wrong on the first try:</p><p>The flannel timeout. If flannel can’t find its subnet environment file, it usually means the default route is missing on the internal interface. The fix is to add one before running the playbook:</p><pre># On each k8s node<br>nmcli connection modify enp0s9 +ipv4.routes &quot;0.0.0.0/0 192.168.10.10 200&quot;<br>nmcli connection up enp0s9</pre><p>The metric 200 keeps this from conflicting with any existing default route on another interface.</p><p>Image pull failures. If you see failed to pull image: no route to host, the image is either missing from the private registry or containerd&#39;s mirror configuration hasn&#39;t been applied yet. Quick check:</p><pre># List everything in the registry<br>curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool</pre><pre># Verify containerd mirror config on the node<br>cat /etc/containerd/certs.d/docker.io/hosts.toml</pre><h4>Post-Install Verification</h4><p>Once the playbook finishes without errors, SSH into the control plane node and run through the basics.</p><pre>kubectl get nodes -o wide</pre><p>You should see both nodes in Ready state:</p><pre>NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP      OS-IMAGE<br>k8s-node1   Ready    control-plane   5m    v1.34.3   192.168.10.11    Rocky Linux 10<br>k8s-node2   Ready    &lt;none&gt;          4m    v1.34.3   192.168.10.12    Rocky Linux 10</pre><p>Next, check that all system pods are running:</p><pre>kubectl get pods -A</pre><p>The output should show healthy pods for coredns, flannel, kube-apiserver, kube-controller-manager, kube-scheduler, kube-proxy, etcd, and metrics-server. 
Every single one of these images should have been pulled from 192.168.10.10:35000 — that&#39;s the whole point.</p><p>You can confirm the image source on any deployment:</p><pre>kubectl get deploy -n kube-system -o wide</pre><p>Every image field should show the 192.168.10.10:35000 prefix. If any image references an external registry like registry.k8s.io or docker.io directly, something in the offline.yml configuration was missed.</p><p>A quick smoke test to make sure workloads actually schedule and run:</p><pre>kubectl run test-nginx --image=192.168.10.10:35000/library/nginx:alpine --port=80<br>kubectl get pod test-nginx -w</pre><p>Wait for it to reach Running status, then clean up:</p><pre>kubectl delete pod test-nginx</pre><p>If Helm was enabled, verify that too:</p><pre>helm version<br>helm list -A</pre><p>At this point, you have a working Kubernetes cluster that was deployed entirely from local resources. No internet traffic left the 192.168.10.0/24 network during the installation. The registry holds all 55 container images, the nginx file server has every binary, and the nodes are configured to resolve everything internally.</p><p>The planning and infrastructure setup that came before this section took the most time. The actual Kubespray run is the easy part; getting the offline supply chain right is where the real work lives.</p><h3>Day-2 Operations</h3><p>The cluster is up. Pods are running, kubectl get nodes shows everything Ready, and you&#39;ve confirmed that all system images were pulled from your internal registry at 192.168.10.10:35000. Now what?</p><p>A Kubernetes cluster that can only run its own system components isn’t very useful. You need to deploy actual workloads — your applications, third-party tools, monitoring stacks — and all of those images still need to come from somewhere inside your network. Same goes for Helm charts if you’re using them. 
This section covers the three approaches I worked through for getting application images into the cluster, plus how to set up Helm in an OCI-native way that plays nicely with your existing registry infrastructure.</p><h4>Deploying Applications via the Private Registry</h4><p>The most straightforward approach: pull the image you need on a machine that has internet access (or use a previously downloaded tarball), tag it with your internal registry prefix, push it, and reference the full path in your Kubernetes manifests.</p><p>On the admin server, where podman is already installed:</p><pre># Pull from Docker Hub (admin server has internet via NAT)<br>podman pull nginx:alpine</pre><pre># Tag it for the internal registry<br>podman tag nginx:alpine 192.168.10.10:35000/library/nginx:alpine</pre><pre># Push to the private registry<br>podman push 192.168.10.10:35000/library/nginx:alpine --tls-verify=false</pre><p>You can verify the image landed correctly:</p><pre>curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool</pre><pre>{<br>    &quot;repositories&quot;: [<br>        &quot;library/nginx&quot;,<br>        &quot;flannel/flannel&quot;,<br>        &quot;flannel/flannel-cni-plugin&quot;,<br>        &quot;coredns/coredns&quot;,<br>        ...<br>    ]<br>}</pre><p>To check which tags exist for a given image:</p><pre>curl -s http://192.168.10.10:35000/v2/library/nginx/tags/list</pre><pre>{<br>    &quot;name&quot;: &quot;library/nginx&quot;,<br>    &quot;tags&quot;: [&quot;alpine&quot;]<br>}</pre><p>Now, when you write a Deployment manifest, you reference the full internal path:</p><pre>apiVersion: apps/v1<br>kind: Deployment<br>metadata:<br>  name: my-nginx<br>spec:<br>  replicas: 2<br>  selector:<br>    matchLabels:<br>      app: my-nginx<br>  template:<br>    metadata:<br>      labels:<br>        app: my-nginx<br>    spec:<br>      containers:<br>      - name: nginx<br>        image: 192.168.10.10:35000/library/nginx:alpine<br>        ports:<br>        - 
containerPort: 80</pre><p>This works, and it’s easy to reason about. Every manifest explicitly states where the image comes from. The downside is obvious, though: every image reference in every manifest, every Helm values file, every quick kubectl run command needs the full 192.168.10.10:35000/... prefix. If you ever migrate to a different registry or change the port, you&#39;re doing a find-and-replace across your entire deployment inventory.</p><h4>Containerd Registry Mirror Configuration</h4><p>A better approach for most teams: configure containerd on each node to automatically redirect image pulls to your internal registry. When a pod spec says image: nginx:alpine, containerd tries the internal registry first instead of attempting (and failing) to reach Docker Hub.</p><p>The configuration lives in containerd’s certs.d directory. For each upstream registry you want to mirror, you create a hosts.toml file:</p><pre>/etc/containerd/certs.d/<br>├── docker.io/<br>│   └── hosts.toml<br>├── quay.io/<br>│   └── hosts.toml<br>├── registry.k8s.io/<br>│   └── hosts.toml<br>└── gcr.io/<br>    └── hosts.toml</pre><p>Here’s the hosts.toml for Docker Hub:</p><pre># /etc/containerd/certs.d/docker.io/hosts.toml<br>server = &quot;https://docker.io&quot;<br><br>[host.&quot;http://192.168.10.10:35000&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;]<br>  skip_verify = true</pre><p>For quay.io:</p><pre># /etc/containerd/certs.d/quay.io/hosts.toml<br>server = &quot;https://quay.io&quot;<br><br>[host.&quot;http://192.168.10.10:35000&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;]<br>  skip_verify = true</pre><p>Same pattern for registry.k8s.io and gcr.io.</p><p>After placing these files, restart containerd on each node:</p><pre>systemctl restart containerd</pre><p>From this point on, a pod spec that says image: nginx:alpine will resolve through the mirror. 
Containerd sees the pull request for docker.io/library/nginx:alpine, checks hosts.toml, finds the mirror entry pointing to 192.168.10.10:35000, and pulls from there instead. Your manifests stay clean — no registry prefixes, no special paths.</p><p>One thing to be aware of: the image still needs to exist in the internal registry under the correct repository path. When you pushed library/nginx:alpine earlier, that matches the default Docker Hub library path. For non-library images (say, grafana/grafana:latest), you need to push with the matching path: 192.168.10.10:35000/grafana/grafana:latest.</p><p>You can set this up manually on each node, but that’s tedious. The better way is to let Kubespray handle it.</p><h4>Automating Mirror Setup with Kubespray</h4><p>In your inventory’s group_vars/all/offline.yml, add the containerd_registries_mirrors block:</p><pre># inventory/mycluster/group_vars/all/offline.yml (append)</pre><pre>containerd_registries_mirrors:<br>  - prefix: docker.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: true<br>  - prefix: quay.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: true<br>  - prefix: registry.k8s.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: true<br>  - prefix: gcr.io<br>    mirrors:<br>      - host: &quot;http://192.168.10.10:35000&quot;<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: true</pre><p>Apply it across the cluster with a targeted Ansible run:</p><pre>cd /root/kubespray-offline/outputs/kubespray-2.30.0</pre><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  --tags containerd</pre><p>This regenerates the hosts.toml 
files on every node and restarts containerd automatically. Much cleaner than SSH-ing into each machine.</p><p>To verify the mirror is working, try deploying something with a plain Docker Hub image reference:</p><pre>kubectl run test-nginx --image=nginx:alpine --restart=Never</pre><pre>kubectl describe pod test-nginx | grep &quot;Successfully pulled&quot;</pre><p>You should see the pull succeed even though the nodes have no internet access. The image is transparently served from 192.168.10.10:35000.</p><h4>Helm Chart Management with OCI Registry</h4><p>Helm has supported OCI registries as chart storage experimentally since Helm 3.0, and the feature has been generally available since 3.8. This is a big win for air-gapped environments because you don’t need a separate chart repository server like ChartMuseum: your existing container registry at 192.168.10.10:35000 can store both container images and Helm charts.</p><p>The workflow for getting a chart into the registry:</p><pre># Say you have a chart directory at /root/nginx-chart/<br># Package it into a .tgz archive<br>helm package /root/nginx-chart</pre><pre># Output: Successfully packaged chart and saved it to:<br>#         /root/nginx-chart-1.0.0.tgz</pre><p>Push the packaged chart to the OCI registry:</p><pre>helm push nginx-chart-1.0.0.tgz oci://192.168.10.10:35000/helm-charts</pre><p>If your registry is running plain HTTP (no TLS), Helm will refuse the push by default. You need to tell it the registry is insecure. 
Create or edit ~/.config/helm/registries.json:</p><pre>{<br>  &quot;insecureRegistries&quot;: [&quot;192.168.10.10:35000&quot;]<br>}</pre><p>Or you can pass the --plain-http flag per command (Helm 3.13+):</p><pre>helm push nginx-chart-1.0.0.tgz oci://192.168.10.10:35000/helm-charts \<br>  --plain-http</pre><p>Verify the chart is stored:</p><pre>curl -s http://192.168.10.10:35000/v2/helm-charts/nginx-chart/tags/list</pre><pre>{<br>    &quot;name&quot;: &quot;helm-charts/nginx-chart&quot;,<br>    &quot;tags&quot;: [&quot;1.0.0&quot;]<br>}</pre><p>Install the chart directly from the OCI registry:</p><pre>helm install my-nginx oci://192.168.10.10:35000/helm-charts/nginx-chart \<br>  --version 1.0.0</pre><p>You can also pull the chart locally first if you want to inspect it before installing:</p><pre>helm pull oci://192.168.10.10:35000/helm-charts/nginx-chart --version 1.0.0</pre><pre># This downloads nginx-chart-1.0.0.tgz to the current directory<br>tar -tzf nginx-chart-1.0.0.tgz | head -10</pre><pre>helm install my-nginx ./nginx-chart-1.0.0.tgz</pre><p>For teams that use helm template to render manifests before applying them (a common pattern in GitOps workflows), you can template directly from OCI:</p><pre>helm template my-nginx oci://192.168.10.10:35000/helm-charts/nginx-chart \<br>  --version 1.0.0 \<br>  --set replicaCount=3 \<br>  &gt; rendered-nginx.yaml</pre><pre>kubectl apply -f rendered-nginx.yaml</pre><h4>Migrating Public Charts to the Internal Registry</h4><p>In practice, you’ll want to pull popular charts from the internet (on a connected machine), then push them into your internal registry for air-gapped consumption. 
Here’s the pattern I used:</p><pre># On a machine with internet access:<br>helm repo add bitnami https://charts.bitnami.com/bitnami<br>helm repo update</pre><pre># Pull the chart as a .tgz file<br>helm pull bitnami/postgresql --version 16.4.1</pre><pre># Transfer postgresql-16.4.1.tgz to the admin server (USB, scp, whatever)<br># Then on the admin server:<br>helm push postgresql-16.4.1.tgz oci://192.168.10.10:35000/helm-charts</pre><p>Now anyone in the cluster can install PostgreSQL:</p><pre>helm install my-pg oci://192.168.10.10:35000/helm-charts/postgresql \<br>  --version 16.4.1 \<br>  --set auth.postgresPassword=secretpassword</pre><p>One gotcha: the chart itself might reference container images from Docker Hub or other public registries. If you’ve configured the containerd mirrors as described under “Containerd Registry Mirror Configuration” above, those pulls will redirect to your internal registry automatically, but only if you’ve already pushed those images. For a PostgreSQL Helm chart, that means you need bitnami/postgresql, bitnami/postgres-exporter, and whatever else the chart&#39;s values.yaml references. Check the default values before deploying:</p><pre>helm show values oci://192.168.10.10:35000/helm-charts/postgresql \<br>  --version 16.4.1 | grep -i &quot;image:&quot; -A 3</pre><p>This tells you exactly which images to stage in your registry.</p><h4>Why Not ChartMuseum?</h4><p>You might be wondering about ChartMuseum, the traditional standalone Helm repository server. It still works fine, and the setup is quick:</p><pre>podman run -d --name chartmuseum -p 8080:8080 \<br>  -v /data/chartmuseum/charts:/charts \<br>  -e STORAGE=local -e STORAGE_LOCAL_ROOTDIR=/charts \<br>  ghcr.io/helm/chartmuseum:v0.16.4</pre><pre>helm repo add internal <a href="http://192.168.10.10:8080">http://192.168.10.10:8080</a></pre><p>But the OCI approach has a clear advantage here: one fewer service to maintain. 
Your container registry is already running, already backed up (you are backing up /var/lib/registry, right?), and already monitored. Adding Helm charts to it is a zero-infrastructure-cost operation. ChartMuseum is one more container to keep alive, one more port to open, one more thing to troubleshoot at 2 AM when a deployment pipeline fails.</p><p>For new setups in 2025, I’d go OCI-native unless you have a specific reason not to.</p><h3>Troubleshooting &amp; Conclusion</h3><h4>Flannel Subnet Timeout</h4><p>This one cost me a solid hour. The cluster.yml playbook ran smoothly for about fifteen minutes, then hung at a task called &quot;Wait for flannel subnet.env.&quot; After the timeout, the play failed and left the cluster in a half-configured state.</p><p>The error looked like this in the Ansible output:</p><pre>TASK [network_plugin/flannel : Wait for flannel subnet.env file presence] ******<br>fatal: [k8s-node1]: FAILED! =&gt; {&quot;changed&quot;: false, &quot;msg&quot;: &quot;Timeout when waiting for file /run/flannel/subnet.env&quot;}<br>fatal: [k8s-node2]: FAILED! =&gt; {&quot;changed&quot;: false, &quot;msg&quot;: &quot;Timeout when waiting for file /run/flannel/subnet.env&quot;}</pre><p>Flannel’s logs on the node told the real story:</p><pre>journalctl -u containerd | grep flannel</pre><pre>E0214 03:22:18.123456  Failed to find any valid interface: no compatible interfaces found</pre><p>The root cause: Flannel needs a default route to determine which network interface to bind to. In this lab, the k8s nodes have three interfaces — enp0s3 (VirtualBox NAT), enp0s8 (host-only), and enp0s9 (the internal 192.168.10.0/24 network where the cluster actually lives). The default route pointed at enp0s3, which is the VirtualBox NAT adapter and has nothing to do with cluster traffic. Flannel picked up that interface, couldn&#39;t reach other nodes through it, and gave up.</p><p>Two things fixed this. 
First, I told Flannel explicitly which interface to use by setting flannel_interface in the inventory:</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml<br>flannel_interface: enp0s9</pre><p>Second, I added a default route through the internal network on each node so that Flannel’s subnet acquisition would work:</p><pre>nmcli connection modify enp0s9 +ipv4.routes &quot;0.0.0.0/0 192.168.10.10 200&quot;<br>nmcli connection up enp0s9</pre><p>The metric value 200 matters here — it keeps this route as a lower-priority fallback so it doesn&#39;t interfere with the existing default route on enp0s3. After making both changes, I re-ran the playbook and Flannel came up within seconds.</p><p>One more thing: if you’re re-running cluster.yml after a failed attempt, it&#39;s worth doing a reset.yml first to clean up the partial state. Otherwise you can end up with stale CNI configurations that cause even more confusing errors.</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini reset.yml<br>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml</pre><h4>etcd Download Fails with 404</h4><p>This one is subtle and only shows up if you’re not running on x86_64. The symptom:</p><pre>TASK [download : Download_file | Download item] *******************************<br>fatal: [k8s-node1]: FAILED! =&gt; {<br>    &quot;msg&quot;: &quot;HTTP Error 404: Not Found&quot;,<br>    &quot;url&quot;: &quot;http://192.168.10.10/kubernetes/etcd/etcd-v3.5.26-linux-amd64.tar.gz&quot;<br>}</pre><p>The file doesn’t exist on the nginx file server because the actual binary staged during the download phase was etcd-v3.5.26-linux-arm64.tar.gz (or whatever your architecture is). 
The problem is in offline.yml, where the etcd download URL is hardcoded with amd64:</p><pre># The problematic line in offline.yml:<br>etcd_download_url: &quot;{{ files_repo }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-amd64.tar.gz&quot;</pre><p>The fix is to replace amd64 with Kubespray&#39;s image_arch variable, which resolves to the correct architecture at runtime:</p><pre># Corrected:<br>etcd_download_url: &quot;{{ files_repo }}/kubernetes/etcd/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz&quot;</pre><p>This pattern — hardcoded amd64 where {{ image_arch }} should be — might appear in other URLs too. It&#39;s worth doing a quick grep across your offline.yml before running the playbook:</p><pre>grep -n &quot;amd64&quot; inventory/mycluster/group_vars/all/offline.yml</pre><p>If any lines show up that aren’t comments, replace them with {{ image_arch }}. On a pure x86_64 lab like the one in this post, you won&#39;t hit this issue because amd64 happens to be correct. But if you&#39;re planning to reuse the same offline.yml in a mixed-architecture or ARM-based environment, fix it now.</p><h4>Image Pull Failures After Cluster Setup</h4><p>After the cluster was running, I tried deploying a test workload and hit this:</p><pre>Events:<br>  Type     Reason     Age   From               Message<br>  ----     ------     ----  ----               -------<br>  Warning  Failed     12s   kubelet            Failed to pull image &quot;nginx:alpine&quot;:<br>           failed to resolve reference &quot;docker.io/library/nginx:alpine&quot;:<br>           failed to do request: dial tcp 104.18.xx.xx:443: connect: no route to host</pre><p>The kubelet on the node tried to pull nginx:alpine from Docker Hub and failed, because of course there&#39;s no internet. This means one of two things: either the image isn&#39;t in the private registry, or the containerd mirror isn&#39;t configured on that node.</p><p>Debugging this is a two-step process. 
First, check whether the image exists in the registry:</p><pre>curl -s http://192.168.10.10:35000/v2/_catalog | python3 -m json.tool</pre><p>If library/nginx doesn&#39;t show up in the repository list, the image was never pushed. Go back to “Deploying Applications via the Private Registry,” push it, and try again.</p><p>If the image is there, check the mirror configuration on the failing node:</p><pre>cat /etc/containerd/certs.d/docker.io/hosts.toml</pre><p>The file should exist and point to http://192.168.10.10:35000. If it&#39;s missing or has the wrong content, either re-apply the Kubespray containerd tags (see “Automating Mirror Setup with Kubespray”) or create the file manually.</p><p>After fixing the configuration, restart containerd and delete the failed pod so the kubelet retries the pull:</p><pre>systemctl restart containerd<br>kubectl delete pod &lt;pod-name&gt;</pre><p>The new pod should come up with the image pulled from the internal registry. You can confirm by checking the events:</p><pre>kubectl describe pod &lt;pod-name&gt; | grep &quot;Successfully pulled&quot;</pre><pre>Normal  Pulled  3s  kubelet  Successfully pulled image &quot;nginx:alpine&quot; in 1.2s</pre><h4>SELinux and Firewall Considerations</h4><p>Rocky Linux 10 ships with SELinux in enforcing mode and firewalld active by default. During this lab setup, both were disabled early on to reduce variables: setenforce 0 and systemctl stop firewalld. That&#39;s fine for a lab. For anything closer to production, here&#39;s what you need to keep in mind.</p><p>Containerd and kubelet both work with SELinux in enforcing mode, but you need the right policies installed. Kubespray handles most of this automatically when selinux_state is set in the inventory. The container-selinux package (pulled from the local RPM mirror if you&#39;ve synced it) provides the base policies. 
Where things get tricky is with custom volume mounts — if your pods mount host paths that SELinux doesn&#39;t have a context for, you&#39;ll see Permission denied errors that look like filesystem issues but are actually policy violations. The audit2allow tool is your friend here, or you can label the directories with chcon -Rt svirt_sandbox_file_t /path/to/data before mounting.</p><p>For firewalld, the key ports that need to be open on the control plane node are 6443 (API server), 2379–2380 (etcd), and 10250 (kubelet). Worker nodes need 10250 and whatever NodePort range you’ve configured (default 30000–32767). Flannel’s VXLAN traffic uses UDP port 8472. If you’re running the private registry on the admin server, port 35000 needs to be reachable from all nodes.</p><p>Rather than listing every firewall-cmd invocation, here&#39;s the practical advice: if you&#39;re in a lab, turn firewalld off and focus on getting the cluster working. Once everything is stable, turn it back on and add rules one service at a time, testing after each change. The Kubespray documentation has a full port matrix you can reference.</p><h3>Conclusion</h3><p>Setting up Kubernetes in an air-gapped environment is, frankly, a lot of work. What takes twenty minutes with kubeadm init on an internet-connected machine turns into a multi-day project when every binary, every image, and every package has to be pre-staged.</p><p>That said, kubespray-offline takes the worst part out of the equation. The download-all.sh script and its ten sub-scripts handle the tedious work of figuring out which files, images, and packages are needed, downloading them, and organizing the output directory structure. Without it, you&#39;d be manually reading Kubespray&#39;s source code to build those lists yourself.</p><p>The pieces that still require hands-on attention come down to a few specific areas. 
Network interface selection is one: Flannel needs to know which interface to bind to, and in multi-NIC VMs that’s never automatic. Architecture strings in URL templates are another: the amd64 vs {{ image_arch }} issue is easy to miss until the playbook fails halfway through. And the post-install work of setting up containerd mirrors and populating the registry with application images is something you&#39;ll keep doing for the lifetime of the cluster.</p><p>If I were doing this again from scratch, I’d change a few things about the order of operations. I’d set up the containerd registry mirrors as part of the initial offline.yml configuration rather than adding them after the fact. I&#39;d also script the &quot;push all application images&quot; step more aggressively: keep a text file listing every image your workloads need, and a loop that pulls, tags, and pushes each one in batch. That way, when a new team member needs to deploy something, the image is already waiting in the registry.</p><h4>Kubespray Skills Reference</h4><p><a href="https://github.com/sigridjineth/kubespray-skills">GitHub - sigridjineth/kubespray-skills: Kubespray skills for Kubernetes cluster management</a></p><p>While writing this post, I distilled the entire workflow into a set of reusable Claude Code skills for the kubespray-skills project. The idea is simple: instead of re-reading this 5,000-line article every time you need to recall a specific command or configuration block, you can point Claude Code at the relevant skill file and get accurate, context-aware guidance on the spot.</p><p>Three skills came directly out of this air-gap work:</p><p>kubespray-airgap covers the end-to-end offline deployment pipeline: the download-all.sh workflow, offline.yml configuration, containerd registry mirrors, the image load-and-push sequence, and the most common failure modes. 
This one existed before in a minimal form, but the rewrite now includes the full kubespray-offline tooling that we walked through in sections 5 and 6.</p><p>kubespray-offline-infra is dedicated to the supporting infrastructure that an air-gapped cluster depends on: BIND for DNS, chrony for NTP, iptables NAT for the network gateway, reposync + nginx for the YUM/DNF mirror, and devpi for the PyPI mirror. Everything from sections 3 and 4 of this post lives here. If your DNS forwarding breaks or your nodes can&#39;t sync time, this is the skill to consult.</p><p>kubespray-helm-airgap handles Helm chart management in offline environments: packaging charts into .tgz archives, pushing them to an OCI registry, setting up ChartMuseum as an alternative, and the often-overlooked step of staging the container images that the charts reference. The “Helm Chart Management with OCI Registry” section of this post maps directly to this skill.</p><p>These three join seven other skills in the project (kubespray-lab-setup, kubespray-deployment, kubespray-operations, kubespray-ha-configuration, kubespray-certificates, kubespray-monitoring, and kubespray-troubleshooting), bringing the total to ten. Together they cover the complete Kubernetes cluster lifecycle with Kubespray, from spinning up a Vagrant lab to upgrading a production HA cluster.</p><p>The skills are available in the GitHub repository linked above. Each one follows the same structure: YAML frontmatter with a trigger-based description, then the actual reference content with working code examples, configuration blocks, and troubleshooting tables.</p><h4>Final Thoughts</h4><p>The air-gap constraint forces you to think carefully about supply chain and dependency management in a way that internet-connected clusters let you ignore. Every image has a provenance. Every binary has a version. Nothing appears magically from the internet at deploy time. 
That discipline is worth carrying forward even if your next cluster does have internet access.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubespray HA & Upgrade: A Hands-On Guide from v1.32 to v1.34]]></title>
            <link>https://sigridjin.medium.com/kubespray-ha-upgrade-a-hands-on-guide-from-v1-32-to-v1-34-43076ef54676?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/43076ef54676</guid>
            <category><![CDATA[kubespray]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 07 Feb 2026 23:39:38 GMT</pubDate>
            <atom:updated>2026-02-07T23:39:38.853Z</atom:updated>
            <content:encoded><![CDATA[<h4>Production Kubernetes with Kubespray: HA Configuration and Rolling Upgrades in Practice</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_f9pacPdXq7h3yZ2ERBv_Q.jpeg" /></figure><h3>Why HA and Upgrade Strategy Matter</h3><p>Everyone loves the “Day 1” excitement of spinning up a new Kubernetes cluster. The terminal logs fly by, the nodes report Ready, and you feel like you have successfully built a modern infrastructure. But the real work begins on “Day 2.” That is when the network flakes out, a disk fills up, or a critical security vulnerability forces an immediate patch.</p><p>In a production environment, stability is the only metric that truly counts. If you run a single Control Plane node, you are living dangerously. I recall an incident early in my career where a simple OS security update required a server reboot. Because we lacked High Availability (HA), that reboot meant the API server vanished. Existing pods kept running, but we lost the ability to deploy fixes, scale up during a traffic spike, or even query the cluster status. We were effectively flying blind until the node came back online.</p><p>To sleep soundly at night, you generally need a minimum of three Control Plane nodes. This number exists for a specific reason: maintaining the <strong>etcd quorum</strong>.</p><p>Etcd is the brain of your cluster, storing the state of everything. If you lose quorum, the cluster goes into read-only mode or stops functioning entirely. With three nodes, you can lose one and still maintain a majority (two), keeping the cluster operational.</p><p>Here is a simplified view of what a robust Kubespray inventory looks like compared to a fragile one. 
We define multiple control plane nodes to distribute the risk.</p><pre># A fragile, single-point-of-failure setup<br>[kube_control_plane]<br>node1<br><br># A production-ready HA setup<br>[kube_control_plane]<br>node1<br>node2<br>node3<br><br>[etcd]<br>node1<br>node2<br>node3</pre><p>Beyond just surviving hardware failures, HA is the prerequisite for a sane upgrade strategy. Kubernetes evolves rapidly. New versions arrive every few months, bringing performance improvements and security patches. If you treat your cluster as a static monument that should never be touched, it quickly becomes a liability.</p><p>I have seen teams paralyzed by fear, sticking to End-of-Life (EOL) versions because they worry an upgrade will break everything.</p><p>A proper HA setup changes that dynamic. It allows you to perform <strong>Rolling Upgrades</strong>. You can drain one node, upgrade it, reboot it, and bring it back online — all while the other nodes handle the traffic. The users never notice a thing. This turns upgrades from a terrifying event into a routine maintenance task.</p><p>In this guide, we use Kubespray because it handles this complexity for us. It automates the distribution of components and manages the rolling upgrade logic, ensuring that we adhere to these best practices without having to manually wire every connection.</p><h3>What This Guide Covers</h3><p>Running Kubernetes in production is rarely about the initial kubeadm init. It’s about what happens three months later — when certificates expire, a node crashes, or a security patch forces a version upgrade.</p><p>This guide moves beyond the “Hello World” of cluster creation. We are building a High Availability (HA) cluster using <strong>Kubespray</strong>, and more importantly, we are going to break it, fix it, and upgrade it. 
The goal is to simulate a real-world environment where uptime matters, even when you need to swap out a control plane node.</p><p>We will start by provisioning a local lab environment using <strong>Vagrant</strong> and <strong>VirtualBox</strong>. This isn’t a single-node Minikube setup; we are spinning up a multi-node architecture running <strong>Rocky Linux 10</strong>. The infrastructure looks like this:</p><ul><li><strong>1 Admin/LoadBalancer Node:</strong> Runs HAProxy and executes Ansible.</li><li><strong>3 Control Plane Nodes:</strong> Ensures the API server and etcd have redundancy.</li><li><strong>2 Worker Nodes:</strong> Runs the actual workloads.</li></ul><p>Here is the target inventory structure we will build in inventory.ini.</p><pre>[kube_control_plane]<br>k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11 etcd_member_name=etcd1<br>k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12 etcd_member_name=etcd2<br>k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13 etcd_member_name=etcd3<br><br>[etcd:children]<br>kube_control_plane<br><br>[kube_node]<br>k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14<br>k8s-node5 ansible_host=192.168.10.15 ip=192.168.10.15</pre><p>Once the cluster is up, we shift focus to <strong>Day-2 Operations</strong>. This is where the real learning happens. We will configure an external HAProxy to manage API traffic and test if the cluster survives when a control plane node goes dark.</p><p>You will also see how to handle lifecycle events using Kubespray’s playbooks. We aren’t just running the installation script once; we will use specific playbooks to scale the cluster and remove faulty nodes safely.</p><pre># Example: Scaling the cluster by adding a new worker<br>ansible-playbook -i inventory/mycluster/inventory.ini -v scale.yml --limit=k8s-node5</pre><p>Finally, we tackle the task that makes most operators nervous: <strong>Upgrades</strong>. 
We will walk through a rolling upgrade from Kubernetes v1.32.9 to v1.32.10, and eventually to v1.34.3, ensuring that workloads stay running while the underlying infrastructure shifts beneath them. By the end, you should feel comfortable managing the full lifecycle of a production-grade Kubernetes cluster.</p><h3>Control Plane Components and Their Roles</h3><p>Setting up a single-node cluster on a laptop is satisfying. You run a script, see “Ready,” and everything just works. But moving that setup to production is a different story. The moment you start managing real traffic, the fragility of a single control plane becomes a liability you can’t afford.</p><p>This section covers why High Availability (HA) isn’t just a “nice-to-have” feature but a baseline requirement for any serious Kubernetes environment.</p><p>Before we talk about redundancy, we need to agree on what we are actually replicating. The Control Plane isn’t a monolith; it’s a collection of specific processes that coordinate the entire cluster.</p><p>When you run a command like kubectl apply -f deployment.yaml, you aren’t talking to the cluster as a whole; you are talking to these specific components:</p><ol><li><strong>kube-apiserver</strong>: The front door. It handles all REST requests, validates them, and updates the state in etcd. It is the only component that talks directly to the database.</li><li><strong>etcd</strong>: The brain. This is a consistent, distributed key-value store where all cluster data lives. If you lose etcd, you lose the cluster.</li><li><strong>kube-scheduler</strong>: The decision maker. It watches for new Pods with no assigned node and selects the best one based on resources and constraints.</li><li><strong>kube-controller-manager</strong>: The reconciler. 
It runs controller loops (like the Node Controller or ReplicaSet Controller) to ensure the current state matches the desired state.</li></ol><p>You can see these components running as static pods on your control plane node. If you are logged into a control plane node, a quick check looks like this.</p><pre># Checking the static pod manifests<br>$ ls /etc/kubernetes/manifests/<br>etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml<br><br># Verifying they are running<br>$ crictl ps --name kube<br>CONTAINER ID   IMAGE                    NAME                      STATE<br>a1b2c3d4e5f6   .../kube-apiserver       kube-apiserver            Running<br>b2c3d4e5f6g7   .../etcd                 etcd                      Running<br>c3d4e5f6g7h8   .../kube-scheduler       kube-scheduler            Running<br>d4e5f6g7h8i9   .../kube-controller-mgr  kube-controller-manager   Running</pre><h3>Why Single Control Plane Is a Risk</h3><p>In a single control plane setup, all the components listed above run on one machine. This creates a classic Single Point of Failure (SPOF).</p><p>If that one machine goes down, the impact is immediate and severe:</p><ul><li><strong>The API goes silent:</strong> You cannot run kubectl commands. CI/CD pipelines fail. Monitoring tools that query the API stop gathering data.</li><li><strong>Scheduling stops:</strong> If a worker node crashes while the control plane is down, the Pods on that worker are gone. No scheduler means no new Pods are created to replace them.</li><li><strong>State is locked:</strong> You cannot scale up, change configurations, or roll back deployments.</li></ul><p>The applications already running on healthy worker nodes <em>might</em> keep running for a while, but the cluster effectively becomes a “zombie” — functioning but brain-dead. 
You are one disk failure or one bad OS patch away from a total outage.</p><h3>Recommended HA Architecture</h3><p>To fix this, we don’t just add a backup; we create a cluster of control plane nodes. The standard production architecture involves three control plane nodes.</p><p>Why three? It comes down to <strong>etcd</strong>.</p><p>Etcd uses the Raft consensus algorithm, which requires a majority (a quorum of floor(n/2) + 1 members) to write data.</p><ul><li><strong>1 node:</strong> Quorum is 1. If it fails, you stop.</li><li><strong>2 nodes:</strong> Quorum is 2 (Majority of 2 is 2). If <em>one</em> fails, you only have 1 left. You lose quorum. This is actually <em>worse</em> than a single node because you have twice the hardware risk for zero gain in availability.</li><li><strong>3 nodes:</strong> Quorum is 2. You can lose 1 node and still have 2 left to form a majority. The cluster keeps writing data.</li></ul><p>Here is what the topology typically looks like in an inventory.ini file for Kubespray.</p><pre>[kube_control_plane]<br>control-node-01 ansible_host=192.168.10.11<br>control-node-02 ansible_host=192.168.10.12<br>control-node-03 ansible_host=192.168.10.13<br><br>[etcd]<br>control-node-01<br>control-node-02<br>control-node-03</pre><p>In this setup, we usually place a Load Balancer (like HAProxy or a cloud LB) in front of the three API servers. The worker nodes and your local kubectl talk to the Load Balancer, which distributes traffic to the healthy control plane nodes.</p><h3>Worker Node HA: Built-in but Dependent on Control Plane</h3><p>People often confuse Worker HA with Control Plane HA. Kubernetes handles Worker HA natively. If a worker node fails, the node controller marks it NotReady, its Pods are evicted, and the ReplicaSet controller spins up replacements on other available nodes.</p><p>But there is a catch: <strong>This relies entirely on the Control Plane.</strong></p><p>If your Control Plane is down, the cluster has no way of knowing a worker node has failed. 
The logic that says “Current Replicas &lt; Desired Replicas” lives in the Controller Manager on the master node.</p><p>So, while Kubernetes is designed to heal worker failures automatically, that self-healing capability is only as reliable as your Control Plane. Without a robust CP setup, your Worker HA guarantees are effectively null and void during a master outage.</p><h3>Common Failure Scenarios Without HA</h3><p>I have seen clusters break in many creative ways. Without HA, minor operational tasks turn into major downtime events.</p><ul><li><strong>OS Patching &amp; Reboots:</strong> You need to apply a security patch to the kernel. On a single master, rebooting means taking the entire API offline for 5–10 minutes. With HA, you can roll through the nodes one by one without anyone noticing.</li><li><strong>Disk Filling Up:</strong> Logs or backups fill the root partition. Etcd is extremely sensitive to disk latency and space. If the disk fills up on a single master, etcd panics and goes read-only or crashes.</li><li><strong>Expired Certificates:</strong> If the API server certificates expire on a single node, you are locked out. In an HA setup, if one node has an issue, you can still access the others to troubleshoot and rotate certificates.</li></ul><p>The goal of High Availability is to make day-to-day maintenance boring and predictable.</p><h3>3. Day-2 Operations: The Case for Regular Upgrades</h3><p>Building a Kubernetes cluster is often the easy part. Keeping it secure and stable over the next two years — that is the real challenge.</p><p>In the world of Kubernetes, “Day-2” refers to everything that happens after the initial install. Among these tasks, upgrades are often the most dreaded. However, treating a cluster as a “set it and forget it” artifact is a guaranteed recipe for future instability.</p><h3>Kubernetes as a Continuously Evolving Platform</h3><p>Kubernetes moves fast. The project releases a new minor version roughly every four months. 
Unlike a traditional Linux distribution where you might sit on a Long Term Support (LTS) release for five years, Kubernetes forces you to keep moving.</p><p>Support windows are short. Generally, only the latest three minor versions are supported (N-2 policy). For example, if you are running v1.28 and the current version is v1.32, only v1.30 through v1.32 still receive patches, so you are already well outside the support window. This means no security patches for critical components like kube-apiserver or kubelet.</p><p>New versions bring more than just patches; they also introduce performance improvements and new features. Stagnation holds your infrastructure back from becoming more efficient and secure.</p><h3>Technical Debt from Deferred Upgrades</h3><p>The longer you wait to upgrade, the harder it gets. I often compare it to skipping dental appointments; what could have been a routine cleaning eventually turns into a root canal.</p><p>If you try to jump across multiple versions at once (say, from v1.30 straight to v1.34), you run into a minefield of breaking changes. Skipping minor versions is also unsupported by kubeadm, so the jump has to be made one minor release at a time. Kubernetes frequently retires old APIs. If you haven’t updated your manifests incrementally, your deployments might simply fail during a major jump.</p><p>Fixing one deprecated API in a manifest is easy. Fixing hundreds across fifty repositories while your production upgrade is stalled? That is the technical debt you want to avoid. Regular, incremental upgrades are the only way to keep this debt manageable.</p><h3>Prerequisites for Zero-Downtime Upgrades</h3><p>You cannot just replace a binary and hope for the best. A production upgrade must be seamless to the end user.</p><p>To achieve this, your cluster needs to meet specific conditions before you even think about running an upgrade playbook:</p><ol><li><strong>High Availability (HA):</strong> As discussed in Section 2, you need multiple Control Plane nodes. Upgrading the API server involves restarting it. If you have only one, your API goes down. 
If you have three, you can upgrade them one by one (Rolling Update) without losing availability.</li><li><strong>Spare Capacity:</strong> When a worker node is being upgraded, it must be drained of all running workloads. You need enough resource headroom on other nodes to accept these displaced Pods.</li><li><strong>Pod Disruption Budgets (PDB):</strong> You should define how many replicas of an application can be down at once. This prevents the upgrade process from accidentally taking down <em>all</em> instances of a critical service.</li></ol><p>The upgrade process conceptually involves isolating a node (cordon), moving its workload elsewhere (drain), upgrading the software, and bringing it back (uncordon). Kubespray automates this entire dance, but the underlying infrastructure must be ready to support it. We will see this in action in <strong>Section 9</strong>.</p><h3>Declarative Cluster Management with Kubespray and Git</h3><p>The most stable way to manage upgrades is through <strong>Infrastructure as Code (IaC)</strong>. We avoid making ad-hoc changes directly on the servers. Instead, we define the desired state of the cluster in configuration files.</p><p>Kubespray fits perfectly into this workflow. Your cluster configuration — inventory, variables, and versions — lives in text files. Ideally, you manage these files in a Git repository. When you want to upgrade, you modify the kube_version variable in a file like group_vars/k8s_cluster/k8s-cluster.yml and commit the change.</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br><br># 🔴 OLD STATE<br># kube_version: v1.32.9<br><br># 🟢 NEW STATE<br>kube_version: v1.32.10</pre><p>This declarative approach provides a permanent record of <em>when</em> an upgrade happened and <em>who</em> triggered it.</p><p><em>Note: In our lab environment later, we will sometimes use the command line (-e kube_version=…) to override this version for speed and simplicity. 
However, in a real production environment, updating the inventory file via Git is the recommended best practice.</em></p><h3>Lab Environment Setup</h3><p>Before we deploy Kubernetes, we need a solid foundation. In a real data center, this would involve racking servers and cabling switches. In our lab, we will simulate this infrastructure using <strong>VirtualBox</strong> and <strong>Vagrant</strong>.</p><p>This section walks through building a 6-node cluster environment. By the end, you will have a fully functional lab that mimics a production-grade HA setup.</p><ul><li><strong>Prerequisites:</strong> Ensure you have <strong>VirtualBox 7.2.4+</strong> and <strong>Vagrant 2.4.9+</strong> installed.</li></ul><h3>Architecture Overview</h3><p>We are building a standard high-availability topology. To make it realistic, we separate the “management” layer from the cluster itself.</p><pre>                    ┌─────────────┐<br>                    │  admin-lb   │<br>                    │ HAProxy/NFS │<br>                    │ .10.10      │<br>                    └──────┬──────┘<br>                           │ :6443 (API Load Balancer)<br>              ┌────────────┼────────────┐<br>              ▼            ▼            ▼<br>        ┌──────────┐ ┌──────────┐ ┌──────────┐<br>        │k8s-node1 │ │k8s-node2 │ │k8s-node3 │<br>        │  CP+etcd │ │  CP+etcd │ │  CP+etcd │<br>        │ .10.11   │ │ .10.12   │ │ .10.13   │<br>        └──────────┘ └──────────┘ └──────────┘<br>              ┌────────────┴────────────┐<br>              ▼                         ▼<br>        ┌──────────┐             ┌──────────┐<br>        │k8s-node4 │             │k8s-node5 │<br>        │  Worker  │             │ (Standby)│<br>        │ .10.14   │             │ .10.15   │<br>        └──────────┘             
└──────────┘</pre><ul><li><strong>Admin/LB Node (1 node):</strong> This node wears multiple hats. It acts as the Ansible controller (to run Kubespray), the external Load Balancer (HAProxy) for the API server, an NFS server for shared storage, and a jump host with management tools (kubectl, k9s, helm).</li><li><strong>Control Plane Nodes (3 nodes):</strong> These three nodes run the core Kubernetes components (kube-apiserver, etcd, controller-manager, scheduler). Three is the magic number for etcd quorum; it allows the cluster to survive the loss of one node.</li><li><strong>Worker Nodes (2 nodes):</strong> These nodes run your actual workloads (Pods). k8s-node4 will be active initially, while k8s-node5 will be provisioned as a VM but added to the cluster later in <strong>Section 7</strong>.</li></ul><h3>VM Specifications and Network Layout</h3><p>We will use <strong>Rocky Linux 10</strong> as the base OS. Each VM needs two network interfaces:</p><ol><li><strong>NAT Network (NIC1, typically enp0s3):</strong> For internet access (downloading packages) and Vagrant management.</li><li><strong>Host-Only Network (NIC2, typically enp0s8):</strong> For internal cluster communication. We will use this interface (enp0s8) for the CNI (Flannel) and API traffic.</li></ol><h3>4.3 Vagrantfile Walkthrough</h3><p>The Vagrantfile defines our entire infrastructure as code. It loops over the five Kubernetes nodes and then defines the admin-lb node, so all 6 VMs are created in one pass.</p><p>Key points to note:</p><ul><li>We use bento/rockylinux-10 as the base box.</li><li>We set nicpromisc2 to allow-all. 
This enables promiscuous mode on the host-only adapter, which is critical for CNI overlay traffic (like Flannel VXLAN) to reach its destination properly.</li><li>We use linked_clone = true to save disk space by sharing the base image across VMs, rather than duplicating it for each node.</li></ul><pre># Vagrantfile<br>BOX_IMAGE = &quot;bento/rockylinux-10&quot;<br>N = 5 # Number of k8s nodes (3 CP + 2 Workers)<br><br>Vagrant.configure(&quot;2&quot;) do |config|<br>  <br>  # 1. Define K8s Nodes (k8s-node1 to k8s-node5)<br>  (1..N).each do |i|<br>    config.vm.define &quot;k8s-node#{i}&quot; do |subconfig|<br>      subconfig.vm.box = BOX_IMAGE<br>      subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>        vb.name = &quot;k8s-node#{i}&quot;<br>        vb.cpus = 4<br>        vb.memory = 2048<br>        # Enable promiscuous mode for CNI traffic<br>        vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>        vb.linked_clone = true<br>      end<br>      subconfig.vm.hostname = &quot;k8s-node#{i}&quot;<br>      subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.1#{i}&quot;<br>      subconfig.vm.provision &quot;shell&quot;, path: &quot;init_cfg.sh&quot;<br>    end<br>  end<br><br>  # 2. Define Admin/LB Node<br>  config.vm.define &quot;admin-lb&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.name = &quot;admin-lb&quot;<br>      vb.cpus = 2<br>      vb.memory = 1024<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.hostname = &quot;admin-lb&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.10&quot;<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;admin-lb.sh&quot;<br>  end<br>end</pre><h3>Admin-LB Node Bootstrap Script (admin-lb.sh)</h3><p>This script configures the admin node. 
It installs HAProxy for load balancing, sets up NFS for storage, prepares SSH keys for Ansible, and installs cluster management tools.</p><pre>#!/bin/bash<br># admin-lb.sh<br><br>echo &quot;[TASK 1] OS Prep (Timezone, Firewall, SELinux, SSH)&quot;<br>timedatectl set-timezone Asia/Seoul<br>systemctl disable --now firewalld<br>setenforce 0<br>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config<br><br># Enable Root Login for consistency (optional for admin-lb, critical for nodes)<br>echo &quot;root:qwe123&quot; | chpasswd<br>sed -i &#39;s/^#PermitRootLogin.*/PermitRootLogin yes/&#39; /etc/ssh/sshd_config<br>sed -i &#39;s/^PasswordAuthentication no/PasswordAuthentication yes/&#39; /etc/ssh/sshd_config<br>systemctl restart sshd<br><br>echo &quot;[TASK 2] Local DNS Setup&quot;<br># Clean up cloud-init entries that might conflict with our static IPs<br>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts<br>cat &lt;&lt;EOF &gt;&gt; /etc/hosts<br>192.168.10.10 k8s-api-srv.admin-lb.com admin-lb<br>192.168.10.11 k8s-node1<br>192.168.10.12 k8s-node2<br>192.168.10.13 k8s-node3<br>192.168.10.14 k8s-node4<br>192.168.10.15 k8s-node5<br>EOF<br><br>echo &quot;[TASK 3] Install HAProxy&quot;<br>dnf install -y haproxy</pre><pre># HAProxy Config: API LB + Stats + Prometheus Metrics<br>cat &lt;&lt;EOF &gt; /etc/haproxy/haproxy.cfg<br>global<br>    log 127.0.0.1 local2<br>    stats socket /var/lib/haproxy/stats</pre><pre>defaults<br>    mode tcp<br>    timeout connect 10s<br>    timeout client 1m<br>    timeout server 1m</pre><pre># Kubernetes API Server LB<br>frontend k8s-api<br>    bind *:6443<br>    default_backend k8s-api-backend</pre><pre>backend k8s-api-backend<br>    option tcp-check<br>    balance roundrobin<br>    server k8s-node1 192.168.10.11:6443 check<br>    server k8s-node2 192.168.10.12:6443 check<br>    server k8s-node3 192.168.10.13:6443 check</pre><pre># HAProxy Stats Dashboard (http://&lt;admin-ip&gt;:9000/haproxy_stats)<br>listen stats<br>    
bind *:9000<br>    mode http<br>    stats enable<br>    stats uri /haproxy_stats</pre><pre># Prometheus Metrics Exporter (http://&lt;admin-ip&gt;:8405/metrics)<br>frontend prometheus<br>    bind *:8405<br>    mode http<br>    http-request use-service prometheus-exporter if { path /metrics }<br>    no log<br>EOF<br>systemctl enable --now haproxy</pre><pre>echo &quot;[TASK 4] Install NFS Server&quot;<br>dnf install -y nfs-utils<br>mkdir -p /srv/nfs/share<br>chown nobody:nobody /srv/nfs/share<br>chmod 755 /srv/nfs/share<br># Note: &#39;async&#39; improves performance for lab environments but risks data loss in production<br>echo &#39;/srv/nfs/share *(rw,async,no_root_squash,no_subtree_check)&#39; &gt; /etc/exports<br>systemctl enable --now nfs-server<br>exportfs -rav</pre><pre>echo &quot;[TASK 5] Install Tools (kubectl, k9s, helm)&quot;<br># Install kubectl<br>cat &lt;&lt; EOF &gt; /etc/yum.repos.d/kubernetes.repo<br>[kubernetes]<br>name=Kubernetes<br>baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/<br>enabled=1<br>gpgcheck=1<br>gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key<br>exclude=kubectl<br>EOF<br>dnf install -y -q kubectl --disableexcludes=kubernetes</pre><pre># Install k9s (using curl since wget might be missing)<br>curl -fsSL -o /tmp/k9s_linux_amd64.tar.gz <a href="https://github.com/derailed/k9s/releases/latest/download/k9s_linux_amd64.tar.gz">https://github.com/derailed/k9s/releases/latest/download/k9s_linux_amd64.tar.gz</a><br>tar -xzf /tmp/k9s_linux_amd64.tar.gz -C /usr/local/bin/ k9s<br>chmod +x /usr/local/bin/k9s</pre><pre># Install Helm<br>curl -fsSL <a href="https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3">https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3</a> | DESIRED_VERSION=v3.16.2 bash</pre><pre>echo &quot;[TASK 6] SSH Key Distribution&quot;<br>dnf install -y python3-pip git sshpass<br>ssh-keygen -t rsa -N &quot;&quot; -f /root/.ssh/id_rsa<br># Distribute key to all k8s nodes 
(admin-lb itself doesn&#39;t need to be in inventory)<br>for i in {1..5}; do<br>  sshpass -p &#39;qwe123&#39; ssh-copy-id -o StrictHostKeyChecking=no root@192.168.10.1$i<br>done</pre><pre>echo &quot;[TASK 7] Clone Kubespray&quot;<br>git clone -b v2.29.1 <a href="https://github.com/kubernetes-sigs/kubespray.git">https://github.com/kubernetes-sigs/kubespray.git</a> /root/kubespray<br># Use --break-system-packages for Rocky 10 (Python 3.12+ PEP 668)<br>pip3 install --break-system-packages -r /root/kubespray/requirements.txt</pre><h3>4.5 Worker/Control Plane Node Init Script (init_cfg.sh)</h3><p>This script prepares the cluster nodes. It disables swap, loads required kernel modules, sets up local DNS, and configures SSH to allow Ansible connections.</p><pre>#!/bin/bash<br># init_cfg.sh</pre><pre>echo &quot;[TASK 1] Timezone &amp; Swap&quot;<br>timedatectl set-timezone Asia/Seoul<br>swapoff -a<br>sed -i &#39;/swap/d&#39; /etc/fstab</pre><pre>echo &quot;[TASK 2] Disable Firewalld &amp; SELinux&quot;<br>systemctl disable --now firewalld<br>setenforce 0<br>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config</pre><pre>echo &quot;[TASK 3] Kernel Modules &amp; Sysctl&quot;<br>cat &lt;&lt;EOF &gt; /etc/modules-load.d/k8s.conf<br>overlay<br>br_netfilter<br>EOF<br>modprobe overlay<br>modprobe br_netfilter</pre><pre>cat &lt;&lt;EOF &gt; /etc/sysctl.d/k8s.conf<br>net.bridge.bridge-nf-call-iptables  = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward                 = 1<br>EOF<br>sysctl --system</pre><pre>echo &quot;[TASK 4] Local DNS&quot;<br># Remove cloud-init generated hostname entries that conflict with our static mappings<br>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts</pre><pre># We also add the admin node alias &#39;k8s-api-srv.admin-lb.com&#39; for use in Section 6 (External LB)<br>cat &lt;&lt;EOF &gt;&gt; /etc/hosts<br>192.168.10.10 k8s-api-srv.admin-lb.com admin-lb<br>192.168.10.11 k8s-node1<br>192.168.10.12 
k8s-node2<br>192.168.10.13 k8s-node3<br>192.168.10.14 k8s-node4<br>192.168.10.15 k8s-node5<br>EOF</pre><pre>echo &quot;[TASK 5] SSH Config (Allow Root Login for Ansible)&quot;<br>echo &quot;root:qwe123&quot; | chpasswd<br># Use sed to safely modify existing config without duplication<br>sed -i &#39;s/^#PermitRootLogin.*/PermitRootLogin yes/&#39; /etc/ssh/sshd_config<br>sed -i &#39;s/^PasswordAuthentication no/PasswordAuthentication yes/&#39; /etc/ssh/sshd_config<br>systemctl restart sshd</pre><h3>Deploying the Environment with Vagrant</h3><p>With the three files (Vagrantfile, admin-lb.sh, init_cfg.sh) in the same directory, we are ready to launch.</p><ol><li><strong>Start the VMs:</strong><br>Run vagrant up. This process will take a few minutes as it downloads the OS image and provisions all 6 nodes.</li><li><strong>Verify the Status:</strong><br>Run vagrant status to confirm all machines are up.</li></ol><pre>$ vagrant status<br>Current machine states:<br><br>k8s-node1                 running (virtualbox)<br>k8s-node2                 running (virtualbox)<br>k8s-node3                 running (virtualbox)<br>k8s-node4                 running (virtualbox)<br>k8s-node5                 running (virtualbox)<br>admin-lb                  running (virtualbox)</pre><p><strong>Log in to the Admin Node:</strong><br>All subsequent commands for deploying Kubernetes will be run from inside this node.</p><pre>$ vagrant ssh admin-lb<br>[vagrant@admin-lb ~]$ sudo -i<br>[root@admin-lb ~]#</pre><p><strong>Test Connectivity:</strong><br>First, verify network connectivity and DNS resolution. Next, verify that HAProxy is running by checking the stats page response (200 OK).</p><pre>ping -c 2 k8s-node1<br>ping -c 2 k8s-node4<br><br>curl -s -I http://192.168.10.10:9000/haproxy_stats | head -1<br># HTTP/1.1 200 OK</pre><p>If everything responds, your infrastructure is ready. 
We can now proceed to <strong>Section 5</strong> to deploy the cluster using Kubespray.</p><h3>Deploying Kubernetes with Kubespray</h3><p>With the lab environment fully provisioned — the admin-lb node running HAProxy and NFS, three control plane VMs, and two worker VMs all booted and initialized — we are now ready to deploy a production-grade Kubernetes cluster using Kubespray.</p><p>This section walks through every step of the deployment process in detail: understanding the Kubespray project structure, configuring the inventory, tuning cluster variables, executing the installation playbook, and performing thorough post-deployment validation.</p><h3>Kubespray Directory Structure and Key Files</h3><p>Kubespray is an Ansible-based project that wraps kubeadm with extensive automation for deploying and managing Kubernetes clusters. During the admin-lb bootstrap phase (covered in Section 4), we cloned the Kubespray v2.29.1 repository into /root/kubespray. Before modifying any configuration, it is essential to understand the project layout.</p><pre>/root/kubespray/<br>├── ansible.cfg                  # Ansible configuration (forks, pipelining, etc.)<br>├── cluster.yml                  # Main deployment playbook<br>├── scale.yml                    # Add-node playbook<br>├── remove-node.yml              # Remove-node playbook<br>├── upgrade-cluster.yml          # Rolling upgrade playbook<br>├── reset.yml                    # Full cluster teardown playbook<br>├── roles/                       # Ansible roles for each component<br>│   ├── download/                # Container image and binary downloads<br>│   ├── kubernetes/              # kubeadm, kubelet, static pod manifests<br>│   ├── kubernetes-apps/         # CoreDNS, metrics-server, ingress, etc.<br>│   ├── etcd/                    # etcd cluster bootstrap and management<br>│   ├── network_plugin/          # CNI plugins (flannel, calico, cilium, etc.)<br>│   └── container-engine/        # containerd, CRI-O runtime 
setup<br>├── inventory/<br>│   └── sample/                  # Template inventory (copy this)<br>│       ├── inventory.ini        # Node definitions and group assignments<br>│       └── group_vars/<br>│           ├── all/<br>│           │   └── all.yml      # Global variables (LB config, DNS, etc.)<br>│           ├── k8s_cluster/<br>│           │   ├── k8s-cluster.yml    # Core K8s settings (version, CNI, proxy mode)<br>│           │   └── addons.yml         # Optional add-ons (metrics-server, ingress, etc.)<br>│           └── etcd.yml         # etcd-specific variables<br>└── contrib/<br>    └── inventory_builder/       # Dynamic inventory generator script</pre><h3>Preparing the Inventory Directory</h3><p>Rather than editing the sample inventory directly, we copy it to create our own cluster-specific configuration:</p><pre>cd /root/kubespray<br>cp -rfp inventory/sample inventory/mycluster</pre><p>This gives us a clean, isolated workspace under inventory/mycluster/ where all our customizations live. The original inventory/sample/ remains untouched, which is important when we later upgrade Kubespray versions via git checkout — our custom inventory directory won&#39;t be overwritten.</p><h3>Inventory Configuration (inventory.ini)</h3><p>The inventory file is the heart of a Kubespray deployment. 
It defines which nodes exist, what roles they play, and how Ansible groups them for targeted playbook execution.</p><h3>The Complete Inventory File</h3><pre># /root/kubespray/inventory/mycluster/inventory.ini</pre><pre>[all]<br>k8s-node1 ansible_host=192.168.10.11 ip=192.168.20.11<br>k8s-node2 ansible_host=192.168.10.12 ip=192.168.20.12<br>k8s-node3 ansible_host=192.168.10.13 ip=192.168.20.13<br>k8s-node4 ansible_host=192.168.10.14 ip=192.168.20.14</pre><pre>[kube_control_plane]<br>k8s-node1<br>k8s-node2<br>k8s-node3</pre><pre>[etcd:children]<br>kube_control_plane</pre><pre>[kube_node]<br>k8s-node4</pre><pre>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><h3>Breakdown of Each Section</h3><p><strong>[all] — Node Definitions</strong></p><p>Every node in the cluster is listed here with two critical variables:</p><ul><li>ansible_host: The IP address Ansible uses to SSH into the node. In our lab, this is the enp0s8 interface on the 192.168.10.0/24 management network.</li><li>ip: The IP address Kubernetes components bind to for intra-cluster communication. This is the enp0s9 interface on the 192.168.20.0/24 cluster network.</li></ul><p>This dual-network design is intentional: management traffic (Ansible SSH, admin access) stays on one network, while Kubernetes API, etcd, and pod-to-pod traffic flows on a separate, dedicated network. In production environments, this separation improves both security and performance.</p><p><strong>[kube_control_plane] — Control Plane Nodes</strong></p><p>Three nodes are assigned to the control plane group: k8s-node1, k8s-node2, and k8s-node3. Each will run the full set of Kubernetes control plane components as static pods:</p><ul><li>kube-apiserver</li><li>kube-controller-manager</li><li>kube-scheduler</li></ul><p>The <strong>order of nodes in this group matters</strong>. The first node (k8s-node1) acts as the initial control plane node during kubeadm init. Subsequent nodes join via kubeadm join. 
When adding a new control plane node later, it must always be appended to the <strong>end</strong> of this group — never inserted in the middle.</p><p><strong>[etcd:children] — etcd Cluster Membership</strong></p><pre>[etcd:children]<br>kube_control_plane</pre><p>By using children, we declare that etcd membership is inherited from the kube_control_plane group. This means all three control plane nodes also run etcd, forming a 3-member etcd cluster. This is the &quot;stacked etcd&quot; topology recommended for most deployments — etcd runs on the same nodes as the control plane, reducing infrastructure complexity while still maintaining a proper quorum.</p><p>With 3 etcd members, the cluster can tolerate 1 node failure. The quorum size is floor(n/2) + 1, so a 3-node etcd cluster requires at least 2 members to be healthy. A 4-node cluster requires 3, which means it still tolerates only 1 failure. This is why odd member counts (3, 5, 7) are recommended for etcd: an even count provides no additional fault tolerance over the odd number below it.</p><p><strong>[kube_node] — Worker Nodes</strong></p><p>Currently only k8s-node4 is listed as a worker. We will add k8s-node5 later using the scale.yml playbook (covered in Section 7). Worker nodes run the kubelet and kube-proxy, but no control plane components.</p><p><strong>[k8s_cluster:children] — Aggregate Group</strong></p><p>This is a convenience group that encompasses all Kubernetes nodes (both control plane and workers). Kubespray uses it to apply cluster-wide configurations such as CNI plugin installation, kubelet settings, and container runtime setup.</p><h3>Ansible Variable Precedence in Kubespray</h3><p>Before customizing any settings, it is critical to understand how Ansible resolves variable conflicts. Kubespray defines variables at multiple levels, and the precedence hierarchy determines which value wins when the same variable is set in multiple places.</p><h3>The Precedence Hierarchy (Lowest to Highest)</h3><pre>1. 
Role defaults         (roles/xxx/defaults/main.yml)          ← Lowest priority<br>2. Inventory group_vars  (inventory/mycluster/group_vars/...)<br>3. Inventory host_vars   (inventory/mycluster/host_vars/...)<br>4. Playbook vars         (vars: section in playbook YAML)<br>5. Role vars             (roles/xxx/vars/main.yml)<br>6. CLI extra-vars        (--extra-vars / -e on command line)    ← Highest priority</pre><h3>What This Means in Practice</h3><p><strong>Role defaults</strong> are the baseline values that Kubespray ships with. For example, the default Kubernetes version, default CNI plugin, default kube-proxy mode, and so on are all defined in role defaults. These are intentionally low-priority — they are meant to be overridden.</p><p><strong>Inventory </strong><strong>group_vars</strong> are where most of our customization happens. The files under inventory/mycluster/group_vars/ are the primary configuration surface for Kubespray:</p><pre>group_vars/<br>├── all/<br>│   └── all.yml              # Global settings: LB config, DNS, timezone<br>├── k8s_cluster/<br>│   ├── k8s-cluster.yml      # K8s core: version, CNI, proxy mode, feature gates<br>│   └── addons.yml           # Add-ons: metrics-server, ingress, dashboard<br>└── etcd.yml                 # etcd settings: metrics, compaction, quotas</pre><p><strong>host_vars</strong> allow per-node overrides. This is rarely needed but can be useful for heterogeneous hardware or special-purpose nodes.</p><p><strong>Role vars</strong> are set by the Kubespray developers for values that should generally not be overridden by users. In Ansible&#39;s precedence order they outrank role defaults, inventory group_vars and host_vars, and play-level vars, so a value defined there cannot be changed from the inventory.</p><p><strong>CLI </strong><strong>--extra-vars</strong> (-e) have the highest precedence and override everything. 
This is the mechanism we use to specify the exact Kubernetes version at deployment time:</p><pre>ansible-playbook cluster.yml -e kube_version=&quot;1.32.9&quot;</pre><p>Even if kube_version is set to a different value in group_vars/k8s_cluster/k8s-cluster.yml, the -e flag on the command line always wins. This is particularly useful during upgrades, where we increment the version via the CLI without modifying any files.</p><p>A common mistake is to set a variable in group_vars and wonder why it has no effect. The usual cause is that the same variable is defined at a higher precedence level (often in a role&#39;s vars/main.yml, which group_vars cannot override). In such cases, the only reliable way to override it is via -e on the command line.</p><h3>Customizing Cluster Settings (CNI, Proxy Mode, Add-ons)</h3><p>With the inventory structure understood, let’s configure the key cluster parameters. We work through four files under inventory/mycluster/group_vars/.</p><h3>Global Settings — all/all.yml</h3><pre># /root/kubespray/inventory/mycluster/group_vars/all/all.yml</pre><pre>## Load Balancer Configuration<br>## (Default: client-side LB via nginx-proxy on each worker)<br>## We will modify these later in Section 6 when configuring external LB<br># apiserver_loadbalancer_domain_name: &quot;k8s-api-srv.admin-lb.com&quot;<br># loadbalancer_apiserver:<br>#   address: 192.168.10.10<br>#   port: 6443<br># loadbalancer_apiserver_localhost: true   # default: enables nginx-proxy on workers</pre><p>For the initial deployment, we leave the load balancer settings at their defaults. This means Kubespray will configure client-side load balancing using nginx static pods on each worker node (Case 1 architecture, detailed in Section 6).</p><h3>Core Kubernetes Settings — k8s_cluster/k8s-cluster.yml</h3><p>This is the most important configuration file. 
Here we define the CNI plugin, kube-proxy mode, service/pod CIDRs, and other core cluster parameters:</p><pre># /root/kubespray/inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><pre>## CNI Plugin<br>kube_network_plugin: flannel</pre><pre>## Flannel Interface Selection<br>## CRITICAL: In multi-NIC environments, flannel must be told which interface to use<br>## for VXLAN overlay traffic. Without this, flannel may pick the wrong NIC<br>## (e.g., the NAT interface enp0s3 instead of the cluster network enp0s9).<br>flannel_interface: enp0s9</pre><pre>## Kube-proxy Mode<br>## Options: iptables (the upstream kube-proxy default on Linux), ipvs<br>## iptables is simpler and sufficient for small-to-medium clusters<br>kube_proxy_mode: iptables</pre><pre>## DNS Configuration<br>## CoreDNS is the default DNS provider<br>## Kubespray default clusterDNS is 10.233.0.3, the third address of the service CIDR<br>## (kubeadm by convention uses the tenth address, which would be 10.233.0.10 here)<br>## This is set via dns_domain and other dns_* variables</pre><pre>## NodeLocal DNS Cache<br>## Disabled in this lab for simplicity<br>enable_nodelocaldns: false</pre><pre>## Service and Pod CIDRs (defaults shown — usually no need to change)<br># kube_service_addresses: 10.233.0.0/18<br># kube_pods_subnet: 10.233.64.0/18</pre><p><strong>Why Flannel?</strong> In this lab environment, we use Flannel for its simplicity. Flannel provides basic L3 overlay networking via VXLAN and is easy to understand and debug. For production environments requiring network policies, Calico or Cilium would be the recommended choices.</p><p><strong>The </strong><strong>flannel_interface Setting Explained:</strong> VirtualBox VMs typically have multiple network interfaces. 
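</p><p>A quick way to confirm which interface sits on a given network is to filter the brief address listing from iproute2 by subnet prefix. The snippet below is only a sketch: iface_for_prefix is a hypothetical helper name, and it assumes the column layout of ip -br -4 addr show (name, state, first IPv4 address), so an interface with several IPv4 addresses is matched on its first one only.</p>

```shell
# Hypothetical helper: print the first interface whose first IPv4 address
# starts with the given prefix. Expects `ip -br -4 addr show` lines on stdin,
# i.e. columns: name, state, address/prefixlen.
iface_for_prefix() {
  awk -v p="$1" 'index($3, p) == 1 { print $1; exit }'
}

# Usage on a node, with the cluster network prefix used in this lab:
#   ip -br -4 addr show | iface_for_prefix 192.168.20.
```

<p>The name it prints is the value to use for flannel_interface. 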
In our lab:</p><ul><li>enp0s3 — NAT interface (VirtualBox default, used for internet access)</li><li>enp0s8 — Host-only adapter #1 (management network, 192.168.10.0/24)</li><li>enp0s9 — Host-only adapter #2 (cluster network, 192.168.20.0/24)</li></ul><p>Without explicitly setting flannel_interface: enp0s9, Flannel might select enp0s3 (the NAT interface) for its VXLAN tunnel endpoints. This would cause pod-to-pod communication to fail because the NAT interface does not provide direct connectivity between VMs. By specifying enp0s9, we ensure Flannel uses the dedicated cluster network.</p><p><strong>Why </strong><strong>iptables Mode?</strong> The iptables mode is kube-proxy&#39;s battle-tested upstream default on Linux. While IPVS mode offers better performance at scale (O(1) vs. O(n) for service routing rules), iptables mode is simpler to debug and perfectly adequate for clusters with fewer than a thousand services. For this lab, iptables is the appropriate choice.</p><p><strong>DNS Address Note:</strong> Kubespray sets the cluster DNS service IP to 10.233.0.3 by default, the third address of the service CIDR. kubeadm instead uses the tenth address by convention, which would be 10.233.0.10 with this CIDR (or 10.96.0.10 with kubeadm&#39;s own default service CIDR). The kubelet on every node is configured with --cluster-dns=10.233.0.3, and the CoreDNS service is created with this ClusterIP. If you&#39;re migrating from a kubeadm-managed cluster, be aware of this difference.</p><h3>Add-ons Configuration — k8s_cluster/addons.yml</h3><pre># /root/kubespray/inventory/mycluster/group_vars/k8s_cluster/addons.yml</pre><pre>## Metrics Server<br>## Enables the Kubernetes Metrics Server for resource usage collection<br>## Required for: kubectl top, HPA (Horizontal Pod Autoscaler), VPA<br>metrics_server_enabled: true</pre><pre>## Other add-ons (disabled in this lab, shown for reference)<br># ingress_nginx_enabled: false<br># dashboard_enabled: false<br># helm_enabled: false</pre><p>The Metrics Server is the only add-on we enable at deployment time. 
It provides the /apis/metrics.k8s.io endpoint that powers kubectl top nodes, kubectl top pods, and the Horizontal Pod Autoscaler. We will install additional monitoring components (Prometheus, Grafana) separately using Helm in Section 8.</p><h3>etcd Settings — etcd.yml</h3><p>For the initial deployment, we keep etcd settings at their defaults. Later (in Section 8.3), we will enable etcd metrics:</p><pre># /root/kubespray/inventory/mycluster/group_vars/etcd.yml</pre><pre>## etcd metrics (will be enabled later for Prometheus monitoring)<br># etcd_metrics: true<br># etcd_listen_metrics_urls: &quot;http://0.0.0.0:2381&quot;</pre><h3>Running cluster.yml and Verifying Deployment</h3><p>With the inventory and variables configured, we are ready to deploy the cluster.</p><h3>Pre-Flight Checks</h3><p>Before running the playbook, verify Ansible connectivity to all nodes:</p><pre>cd /root/kubespray</pre><pre># Test SSH connectivity to all nodes<br>ansible -i inventory/mycluster/inventory.ini all -m ping</pre><p>Expected output:</p><pre>k8s-node1 | SUCCESS =&gt; {<br>    &quot;changed&quot;: false,<br>    &quot;ping&quot;: &quot;pong&quot;<br>}<br>k8s-node2 | SUCCESS =&gt; {<br>    &quot;changed&quot;: false,<br>    &quot;ping&quot;: &quot;pong&quot;<br>}<br>k8s-node3 | SUCCESS =&gt; {<br>    &quot;changed&quot;: false,<br>    &quot;ping&quot;: &quot;pong&quot;<br>}<br>k8s-node4 | SUCCESS =&gt; {<br>    &quot;changed&quot;: false,<br>    &quot;ping&quot;: &quot;pong&quot;<br>}</pre><p>If any node returns UNREACHABLE, check SSH key distribution (the admin-lb.sh bootstrap script should have handled this) and verify network connectivity on the management network (192.168.10.0/24).</p><h3>Executing the Deployment</h3><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  -e kube_version=&quot;1.32.9&quot;</pre><p><strong>Command Breakdown:</strong></p><ul><li>-i inventory/mycluster/inventory.ini — specifies the inventory file</li><li>cluster.yml — the main 
deployment playbook</li><li>-e kube_version=&quot;1.32.9&quot; — overrides the Kubernetes version via CLI (highest precedence)</li></ul><h3>What Happens During Deployment</h3><p>The cluster.yml playbook orchestrates a complex, multi-stage deployment process. At a high level, it proceeds through the following phases:</p><p><strong>Phase 1: Prerequisite Checks and OS Configuration</strong></p><ul><li>Validates the target OS (Rocky Linux 10, Ubuntu, etc.)</li><li>Ensures required kernel modules are loaded (overlay, br_netfilter)</li><li>Verifies sysctl settings (net.bridge.bridge-nf-call-iptables, net.ipv4.ip_forward)</li><li>Disables swap (required by kubelet)</li><li>Configures container runtime prerequisites</li></ul><p><strong>Phase 2: Container Runtime Installation</strong></p><ul><li>Installs containerd as the CRI (Container Runtime Interface)</li><li>Configures containerd with appropriate settings for Kubernetes</li><li>Sets up the containerd systemd service</li></ul><p><strong>Phase 3: Download Binaries and Container Images</strong></p><ul><li>Downloads kubeadm, kubelet, kubectl binaries to /tmp/releases/</li><li>Pulls container images for all Kubernetes components:</li><li>registry.k8s.io/kube-apiserver:v1.32.9</li><li>registry.k8s.io/kube-controller-manager:v1.32.9</li><li>registry.k8s.io/kube-scheduler:v1.32.9</li><li>registry.k8s.io/kube-proxy:v1.32.9</li><li>registry.k8s.io/pause:3.10</li><li>registry.k8s.io/coredns/coredns:v1.12.0</li><li>registry.k8s.io/etcd:3.5.25</li><li>docker.io/flannel/flannel:v0.26.7</li><li>docker.io/flannel/flannel-cni-plugin:v1.6.2</li></ul><p><strong>Phase 4: etcd Cluster Bootstrap</strong></p><ul><li>Initializes etcd on the first control plane node (k8s-node1)</li><li>Joins k8s-node2 and k8s-node3 to the etcd cluster</li><li>Configures etcd with peer and client TLS certificates</li><li>Verifies the 3-member etcd quorum</li></ul><p><strong>Phase 5: Control Plane Initialization</strong></p><ul><li>Runs kubeadm init on k8s-node1 
(the first control plane node)</li><li>Generates static pod manifests for kube-apiserver, kube-controller-manager, kube-scheduler</li><li>Creates the cluster CA certificates and kubeconfig files</li><li>Joins k8s-node2 and k8s-node3 to the control plane via kubeadm join --control-plane</li></ul><p><strong>Phase 6: Worker Node Join</strong></p><ul><li>Installs kubelet on worker nodes</li><li>Uploads control plane CA certificates to workers</li><li>Runs kubeadm join to register workers with the cluster</li><li>Configures nginx-proxy static pods on workers (for client-side load balancing)</li></ul><p><strong>Phase 7: CNI Plugin and Add-ons</strong></p><ul><li>Deploys Flannel DaemonSet across all nodes</li><li>Deploys CoreDNS as a Deployment in kube-system namespace</li><li>Deploys kube-proxy DaemonSet</li><li>Deploys Metrics Server (if metrics_server_enabled: true)</li><li>Applies node labels and taints as configured</li></ul><h3>Deployment Duration</h3><p>The complete deployment takes approximately <strong>8 minutes</strong> in this lab environment. The bulk of the time is spent in Phase 3 (downloading container images) and Phase 5 (waiting for control plane components to become healthy after each join).</p><h3>Verifying Successful Deployment</h3><p>Once the playbook completes without errors, copy the kubeconfig to the admin-lb node and verify:</p><pre># Copy kubeconfig from the first control plane node<br>mkdir -p /root/.kube<br>scp k8s-node1:/root/.kube/config /root/.kube/config</pre><pre># Verify kubectl connectivity<br>kubectl get nodes -o wide</pre><p>Expected output:</p><pre>NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE              KERNEL-VERSION                CONTAINER-RUNTIME<br>k8s-node1   Ready    control-plane   8m    v1.32.9   192.168.20.11   &lt;none&gt;        Rocky Linux 10.0 ..   
6.12.x-xxx.el10.x86_64        containerd://2.1.5<br>k8s-node2   Ready    control-plane   7m    v1.32.9   192.168.20.12   &lt;none&gt;        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5<br>k8s-node3   Ready    control-plane   6m    v1.32.9   192.168.20.13   &lt;none&gt;        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5<br>k8s-node4   Ready    &lt;none&gt;          5m    v1.32.9   192.168.20.14   &lt;none&gt;        Rocky Linux 10.0 ..   6.12.x-xxx.el10.x86_64        containerd://2.1.5</pre><p>Key things to verify:</p><ul><li>All 4 nodes show <strong>Ready</strong> status</li><li>Control plane nodes have the control-plane role label</li><li>All nodes report the correct Kubernetes version (v1.32.9)</li><li>INTERNAL-IP shows the cluster network addresses (192.168.20.x), not the management or NAT addresses</li><li>Container runtime is containerd://2.1.5</li></ul><h3>Post-Deployment Validation (Nodes, Pods, etcd, Certificates)</h3><p>A successful cluster.yml run does not guarantee a fully healthy cluster. 
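</p><p>The node readiness check from the previous step is easy to script as a first gate. As a minimal sketch (count_not_ready is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes --no-headers, where the second column is STATUS), the following counts nodes that are not plainly Ready:</p>

```shell
# Hypothetical helper: count nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is counted as not Ready.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage from admin-lb; a healthy cluster should yield 0:
#   kubectl get nodes --no-headers | count_not_ready
```

<p>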
Thorough post-deployment validation is essential.</p><h3>System Pod Health Check</h3><pre>kubectl get pods -n kube-system -o wide</pre><p>Expected output (abbreviated):</p><pre>NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE<br>coredns-xxxxxxxxx-xxxxx             1/1     Running   0          8m    10.233.64.x     k8s-node1<br>coredns-xxxxxxxxx-xxxxx             1/1     Running   0          8m    10.233.64.x     k8s-node2<br>flannel-xxxxx                       1/1     Running   0          7m    192.168.20.11   k8s-node1<br>flannel-xxxxx                       1/1     Running   0          7m    192.168.20.12   k8s-node2<br>flannel-xxxxx                       1/1     Running   0          7m    192.168.20.13   k8s-node3<br>flannel-xxxxx                       1/1     Running   0          7m    192.168.20.14   k8s-node4<br>kube-apiserver-k8s-node1            1/1     Running   0          8m    192.168.20.11   k8s-node1<br>kube-apiserver-k8s-node2            1/1     Running   0          7m    192.168.20.12   k8s-node2<br>kube-apiserver-k8s-node3            1/1     Running   0          7m    192.168.20.13   k8s-node3<br>kube-controller-manager-k8s-node1   1/1     Running   0          8m    192.168.20.11   k8s-node1<br>kube-controller-manager-k8s-node2   1/1     Running   0          7m    192.168.20.12   k8s-node2<br>kube-controller-manager-k8s-node3   1/1     Running   0          7m    192.168.20.13   k8s-node3<br>kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.11   k8s-node1<br>kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.12   k8s-node2<br>kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.13   k8s-node3<br>kube-proxy-xxxxx                    1/1     Running   0          7m    192.168.20.14   k8s-node4<br>kube-scheduler-k8s-node1            1/1     Running   0          8m    192.168.20.11   k8s-node1<br>kube-scheduler-k8s-node2    
        1/1     Running   0          7m    192.168.20.12   k8s-node2<br>kube-scheduler-k8s-node3            1/1     Running   0          7m    192.168.20.13   k8s-node3<br>metrics-server-xxxxxxxxx-xxxxx      1/1     Running   0          7m    10.233.64.x     k8s-node2<br>nginx-proxy-k8s-node4               1/1     Running   0          6m    192.168.20.14   k8s-node4</pre><p><strong>What to look for:</strong></p><ul><li><strong>Static pods</strong> (apiserver, controller-manager, scheduler): One instance per control plane node. These are not managed by a Deployment or DaemonSet — they are directly managed by the kubelet on each node based on manifest files in /etc/kubernetes/manifests/.</li><li><strong>Flannel DaemonSet</strong>: One pod per node (all 4 nodes). This provides the CNI overlay network.</li><li><strong>kube-proxy DaemonSet</strong>: One pod per node (all 4 nodes). This handles service-to-pod routing.</li><li><strong>CoreDNS Deployment</strong>: Two replicas (default) for DNS high availability.</li><li><strong>Metrics Server</strong>: One pod. Provides the metrics.k8s.io API.</li><li><strong>nginx-proxy</strong>: One static pod on each <strong>worker</strong> node only (k8s-node4). 
This is the client-side load balancer that distributes API requests from the worker to all three control plane nodes (covered in detail in Section 6).</li></ul><h3>etcd Cluster Health</h3><p>SSH into any control plane node and use the etcdctl.sh wrapper script (installed by Kubespray) to inspect the etcd cluster:</p><pre># List all etcd members<br>ssh k8s-node1 etcdctl.sh member list -w table</pre><p>Expected output:</p><pre>+------------------+---------+-----------+----------------------------+----------------------------+<br>|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        |<br>+------------------+---------+-----------+----------------------------+----------------------------+<br>| 1a2b3c4d5e6f7890 | started | etcd1     | https://192.168.20.11:2380 | https://192.168.20.11:2379 |<br>| 2b3c4d5e6f789012 | started | etcd2     | https://192.168.20.12:2380 | https://192.168.20.12:2379 |<br>| 3c4d5e6f78901234 | started | etcd3     | https://192.168.20.13:2380 | https://192.168.20.13:2379 |<br>+------------------+---------+-----------+----------------------------+----------------------------+</pre><pre># Check endpoint status (version, DB size, leader, Raft term)<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><p>Expected output:</p><pre>+----------------------------+------------------+---------+---------+-----------+-----------+<br>|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |<br>+----------------------------+------------------+---------+---------+-----------+-----------+<br>| https://192.168.20.11:2379 | 1a2b3c4d5e6f7890 | 3.5.25  |  5.4 MB |     true  |         2 |<br>| https://192.168.20.12:2379 | 2b3c4d5e6f789012 | 3.5.25  |  5.4 MB |    false  |         2 |<br>| https://192.168.20.13:2379 | 3c4d5e6f78901234 | 3.5.25  |  5.4 MB |    false  |         2 |<br>+----------------------------+------------------+---------+---------+-----------+-----------+</pre><p>Key things to 
verify:</p><ul><li>All 3 members show started status</li><li>etcd version is 3.5.25 (the version bundled with Kubespray v2.29.1)</li><li>Exactly one member is the leader (IS LEADER: true)</li><li>All members share the same RAFT TERM (consistent cluster state)</li><li>Peer and client addresses use the cluster network (192.168.20.x)</li><li>Peer communication uses port 2380 (TLS-encrypted peer-to-peer)</li><li>Client communication uses port 2379 (TLS-encrypted client access)</li></ul><h3>API Server Endpoint Verification</h3><p>Test direct API server connectivity from the admin-lb node:</p><pre># Test each control plane node&#39;s API server individually<br>curl -sk https://192.168.10.11:6443/version<br>curl -sk https://192.168.10.12:6443/version<br>curl -sk https://192.168.10.13:6443/version</pre><p>Expected output (from each):</p><pre>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  &quot;buildDate&quot;: &quot;...&quot;,<br>  &quot;goVersion&quot;: &quot;go1.23.x&quot;,<br>  &quot;compiler&quot;: &quot;gc&quot;,<br>  &quot;platform&quot;: &quot;linux/amd64&quot;<br>}</pre><p>If all three return the correct version, the API servers are healthy and reachable from the management network.</p><h3>Container Images on Nodes</h3><p>Verify that all required container images were downloaded:</p><pre>ssh k8s-node1 crictl images</pre><p>Expected output:</p><pre>IMAGE                                            TAG        IMAGE ID       SIZE<br>docker.io/flannel/flannel                        v0.26.7    xxxxxxxxxxxx   30.0MB<br>docker.io/flannel/flannel-cni-plugin             v1.6.2     xxxxxxxxxxxx   4.28MB<br>registry.k8s.io/coredns/coredns                  v1.12.0    xxxxxxxxxxxx   18.2MB<br>registry.k8s.io/etcd                             3.5.25     xxxxxxxxxxxx   64.8MB<br>registry.k8s.io/kube-apiserver                   v1.32.9    xxxxxxxxxxxx   
30.5MB<br>registry.k8s.io/kube-controller-manager          v1.32.9    xxxxxxxxxxxx   28.7MB<br>registry.k8s.io/kube-proxy                       v1.32.9    xxxxxxxxxxxx   28.5MB<br>registry.k8s.io/kube-scheduler                   v1.32.9    xxxxxxxxxxxx   21.3MB<br>registry.k8s.io/metrics-server/metrics-server    v0.7.x     xxxxxxxxxxxx   17.6MB<br>registry.k8s.io/pause                            3.10       xxxxxxxxxxxx   320kB</pre><h3>Downloaded Binaries</h3><p>Kubespray downloads Kubernetes binaries to /tmp/releases/ on each node:</p><pre>ssh k8s-node1 tree /tmp/releases</pre><p>Expected output:</p><pre>/tmp/releases<br>├── kubeadm-v1.32.9-amd64<br>├── kubectl-v1.32.9-amd64<br>└── kubelet-v1.32.9-amd64</pre><p>These binaries are cached on the node. During upgrades, new version binaries will be downloaded alongside the existing ones, and the active symlinks will be updated.</p><h3>Kubernetes Certificate Inspection</h3><p>Kubespray uses kubeadm to generate all cluster certificates. You can inspect them:</p><pre>ssh k8s-node1 kubeadm certs check-expiration</pre><p>This shows the expiration dates for all certificates. By default, kubeadm generates certificates with a 1-year validity period. 
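</p><p>To get the remaining lifetime of a single certificate as a number, the notAfter date printed by openssl can be converted into days. This is a sketch rather than anything Kubespray ships: days_until is a hypothetical helper, it assumes GNU date (for the -d option), and CERT is a placeholder for whichever certificate file you want to check.</p>

```shell
# Hypothetical helper: whole days from now until the given date string.
# Assumes GNU date, which accepts the date format printed by
# `openssl x509 -enddate` as well as ISO-style dates.
days_until() {
  end=$(date -u -d "$1" +%s) || return 1
  now=$(date -u +%s)
  echo $(( (end - now) / 86400 ))
}

# Usage on a control plane node (CERT is a placeholder path):
#   days_until "$(openssl x509 -enddate -noout -in "$CERT" | cut -d= -f2)"
```

<p>A negative result means the certificate has already expired. 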
Kubespray automatically renews certificates during cluster upgrades, so regular upgrades also serve as a certificate rotation mechanism — another reason not to defer upgrades.</p><h3>CoreDNS Verification</h3><p>Verify that cluster DNS resolution is working:</p><pre>kubectl run dnstest --image=busybox:1.36 --rm -it --restart=Never -- \<br>  nslookup kubernetes.default.svc.cluster.local</pre><p>Expected output:</p><pre>Server:    10.233.0.3<br>Address 1: 10.233.0.3</pre><pre>Name:      kubernetes.default.svc.cluster.local<br>Address 1: 10.233.0.1</pre><p>This confirms:</p><ul><li>The kubelet is correctly passing --cluster-dns=10.233.0.3 to pods</li><li>CoreDNS is running and reachable at the service IP 10.233.0.3</li><li>The kubernetes service (the API server) correctly resolves to 10.233.0.1 (the first IP in the service CIDR)</li></ul><h3>Metrics Server Verification</h3><pre># Check if the metrics API is available<br>kubectl top nodes</pre><p>Expected output:</p><pre>NAME        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%<br>k8s-node1   152m         7%     1124Mi          29%<br>k8s-node2   98m          4%     987Mi           25%<br>k8s-node3   87m          4%     945Mi           24%<br>k8s-node4   45m          2%     512Mi           13%</pre><p>If kubectl top returns results, the Metrics Server is healthy and collecting resource utilization data from the kubelet&#39;s /metrics/resource endpoint on each node.</p><h3>Understanding the K8S API Endpoint</h3><h3>Why API Endpoint Redundancy Matters</h3><p>In a Kubernetes cluster, every component that interacts with the control plane ultimately talks to the <strong>kube-apiserver</strong>. This includes kubelet on every node, kube-proxy, kubectl commands from administrators, and any in-cluster workload that queries the Kubernetes API. 
If there is only a single kube-apiserver and it goes down, the entire cluster becomes unmanageable: new pods cannot be scheduled, services cannot be updated, and node health checks stop functioning.</p><p>When you deploy a multi-control-plane HA cluster (as we did in the previous section with three control plane nodes), you now have <strong>three kube-apiserver instances</strong> running on k8s-node1 (192.168.10.11), k8s-node2 (192.168.10.12), and k8s-node3 (192.168.10.13). But having multiple API servers alone is not enough — every client component on every node needs a strategy for <strong>how to reach these API servers</strong> and <strong>what to do when one of them fails</strong>.</p><p>This is the API endpoint problem, and Kubespray provides three distinct approaches to solving it.</p><h3>Case 1: Client-Side Load Balancing with NGINX Static Pods</h3><p>This is <strong>Kubespray’s default behavior</strong> when no external load balancer is configured. In this mode, Kubespray deploys an NGINX reverse proxy as a <strong>static pod</strong> on every worker node. 
This proxy listens on localhost:6443 and load-balances traffic across all control plane nodes.</p><h3>How Worker-Side NGINX Proxy Works</h3><p>When you deploy a cluster with Kubespray’s default settings, the following variable is implicitly set to true:</p><pre># inventory/mycluster/group_vars/all/all.yml<br>loadbalancer_apiserver_localhost: true  # default</pre><p>With this setting enabled, Kubespray performs the following on <strong>every worker node</strong> during cluster deployment:</p><ol><li><strong>Generates an NGINX configuration file</strong> at /etc/nginx/nginx.conf that defines an upstream block pointing to all control plane nodes.</li><li><strong>Creates a static pod manifest</strong> at /etc/kubernetes/manifests/nginx-proxy.yaml that runs an NGINX container using the host network.</li><li><strong>Configures kubelet</strong> on the worker to use https://localhost:6443 as its API server endpoint.</li><li><strong>Configures kube-proxy</strong> to also connect through https://localhost:6443.</li></ol><p>The resulting architecture looks like this:</p><pre>┌─────────────────────────────────────────────────────────────┐<br>│                        Worker Node (k8s-node4)              │<br>│                                                             │<br>│  ┌──────────┐      ┌────────────────────┐                   │<br>│  │ kubelet   │─────▶│  nginx-proxy       │                   │<br>│  │ kube-proxy│─────▶│  (static pod)      │                   │<br>│  └──────────┘      │  localhost:6443     │                   │<br>│                     └────────┬───────────┘                   │<br>└──────────────────────────────┼───────────────────────────────┘<br>                               │<br>              ┌────────────────┼────────────────┐<br>              ▼                ▼                 ▼<br>     ┌──────────────┐ ┌──────────────┐ ┌──────────────┐<br>     │  k8s-node1   │ │  k8s-node2   │ │  k8s-node3   │<br>     │  
apiserver   │ │  apiserver   │ │  apiserver   │<br>     │ :6443        │ │ :6443        │ │ :6443        │<br>     └──────────────┘ └──────────────┘ └──────────────┘</pre><p>Because the NGINX proxy runs as a static pod directly on the worker node, no external infrastructure is required. Each worker independently manages its own API server connection pool.</p><p>You can verify the nginx-proxy pod is running on each worker node:</p><pre>$ kubectl get pods -A -o wide | grep nginx-proxy<br>kube-system   nginx-proxy-k8s-node4   1/1   Running   0   12m   192.168.10.14   k8s-node4   &lt;none&gt;<br>kube-system   nginx-proxy-k8s-node5   1/1   Running   0   12m   192.168.10.15   k8s-node5   &lt;none&gt;</pre><p>Note that <strong>control plane nodes do not run nginx-proxy</strong> — they connect directly to their own local kube-apiserver instance since kube-apiserver is already running on localhost:6443.</p><h3>Configuration Deep Dive (nginx.conf.j2)</h3><p>SSH into a worker node and inspect the generated NGINX configuration:</p><pre>$ ssh k8s-node4 cat /etc/nginx/nginx.conf</pre><pre>error_log stderr notice;</pre><pre>worker_processes 1;<br>worker_rlimit_nofile 130048;<br>worker_shutdown_timeout 10s;</pre><pre>events {<br>  multi_accept on;<br>  use epoll;<br>  worker_connections 16384;<br>}</pre><pre>stream {<br>  upstream kube_apiserver {<br>    least_conn;<br>    server 192.168.10.11:6443;<br>    server 192.168.10.12:6443;<br>    server 192.168.10.13:6443;<br>  }</pre><pre>  server {<br>    listen        127.0.0.1:6443;<br>    proxy_pass    kube_apiserver;<br>    proxy_timeout 10m;<br>    proxy_connect_timeout 1s;<br>  }<br>}</pre><p>Key points to understand about this configuration:</p><ul><li><strong>stream block (not </strong><strong>http)</strong>: This is a Layer 4 (TCP) proxy, not Layer 7 (HTTP). NGINX forwards raw TCP connections without inspecting HTTP headers or TLS content. 
This matters because the TLS handshake happens directly between the client (kubelet) and kube-apiserver; NGINX never terminates TLS.</li><li><strong>least_conn load balancing</strong>: Distributes connections to the backend with the fewest active connections. This provides better distribution than round-robin when some API requests take longer than others.</li><li><strong>listen 127.0.0.1:6443</strong>: The proxy only listens on localhost. It is not accessible from outside the node.</li><li><strong>proxy_connect_timeout 1s</strong>: If a control plane node does not accept the TCP connection within 1 second, NGINX gives up on that backend and tries the next one.</li><li><strong>proxy_timeout 10m</strong>: A proxied connection is closed only after 10 minutes without traffic in either direction, so long-lived connections (such as kubectl exec or watch streams) are not cut off while they stay active.</li></ul><p>Now inspect the static pod manifest:</p><pre>$ ssh k8s-node4 cat /etc/kubernetes/manifests/nginx-proxy.yaml</pre><pre>apiVersion: v1<br>kind: Pod<br>metadata:<br>  name: nginx-proxy<br>  namespace: kube-system<br>  labels:<br>    addonmanager.kubernetes.io/mode: Reconcile<br>    k8s-app: kube-nginx<br>spec:<br>  hostNetwork: true<br>  dnsPolicy: ClusterFirstWithHostNet<br>  nodeSelector:<br>    kubernetes.io/os: linux<br>  priorityClassName: system-node-critical<br>  containers:<br>    - name: nginx-proxy<br>      image: docker.io/library/nginx:&lt;tag&gt;  # an NGINX image; the exact tag depends on the Kubespray release<br>      resources:<br>        requests:<br>          cpu: 25m<br>          memory: 32M<br>      # ... volume mounts for /etc/nginx/nginx.conf</pre><p>The key setting here is hostNetwork: true, which means the pod uses the node&#39;s network namespace directly. 
This allows the NGINX process to bind to 127.0.0.1:6443 on the host, making it transparently available to kubelet and other local processes.</p><p>Verify that kubelet on the worker is indeed pointing to localhost:</p><pre>$ ssh k8s-node4 cat /etc/kubernetes/kubelet.conf | grep server<br>    server: https://localhost:6443</pre><p>Similarly, check the kube-proxy configuration:</p><pre>$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep server<br>    server: https://localhost:6443</pre><h3>Failure Simulation: Control Plane Node Down</h3><p>To verify that client-side load balancing works correctly during a control plane failure, we can stop one of the control plane nodes and observe that the cluster continues operating normally.</p><p><strong>Step 1: Check current cluster state</strong></p><pre>$ kubectl get nodes<br>NAME        STATUS   ROLES           AGE   VERSION<br>k8s-node1   Ready    control-plane   35m   v1.32.9<br>k8s-node2   Ready    control-plane   33m   v1.32.9<br>k8s-node3   Ready    control-plane   31m   v1.32.9<br>k8s-node4   Ready    &lt;none&gt;          28m   v1.32.9<br>k8s-node5   Ready    &lt;none&gt;          28m   v1.32.9</pre><p><strong>Step 2: Stop kube-apiserver on k8s-node1</strong></p><pre># On k8s-node1, move the static pod manifest to stop kube-apiserver<br>$ ssh k8s-node1 mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/</pre><p><strong>Step 3: Verify the cluster is still functional</strong></p><pre># From admin-lb, run kubectl commands — they should still work<br>$ kubectl get nodes<br>NAME        STATUS     ROLES           AGE   VERSION<br>k8s-node1   Ready      control-plane   36m   v1.32.9<br>k8s-node2   Ready      control-plane   34m   v1.32.9<br>k8s-node3   Ready      control-plane   32m   v1.32.9<br>k8s-node4   Ready      &lt;none&gt;          29m   v1.32.9<br>k8s-node5   Ready      &lt;none&gt;          29m   v1.32.9</pre><p>The cluster remains fully operational because the NGINX proxy on each worker automatically 
routes traffic to the remaining two API servers (k8s-node2 and k8s-node3). The proxy_connect_timeout 1s setting ensures the failover happens almost instantly.</p><p>You can observe this in the NGINX error log on a worker:</p><pre>$ ssh k8s-node4 crictl logs $(ssh k8s-node4 crictl ps --name nginx-proxy -q) 2&gt;&amp;1 | tail<br>... connect() failed (111: Connection refused) while connecting to upstream,<br>    upstream: &quot;192.168.10.11:6443&quot; ...</pre><p>NGINX detected that 192.168.10.11:6443 is unreachable and stopped routing traffic to it.</p><p><strong>Step 4: Restore the API server</strong></p><pre>$ ssh k8s-node1 mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/</pre><p>Within a few seconds, kubelet on k8s-node1 will detect the manifest and restart kube-apiserver. NGINX on the workers will begin routing traffic to all three backends again.</p><p><strong>Pros of Client-Side Load Balancing:</strong></p><ul><li>No external infrastructure required (no separate load balancer to manage or maintain)</li><li>Fast failover (1-second connect timeout per backend)</li><li>Each worker is independently resilient — no shared single point of failure</li><li>Works in air-gapped or restricted network environments</li></ul><p><strong>Cons of Client-Side Load Balancing:</strong></p><ul><li>NGINX configuration must be updated on every worker when control plane nodes are added/removed (Kubespray handles this automatically during playbook runs)</li><li>External clients (such as administrators running kubectl from their laptops) must target a specific control plane node IP — there is no single virtual IP for external access</li><li>Certificate SAN management can become complex if you need external access through multiple entry points</li></ul><h3>Case 2: External LB (HAProxy) + Client-Side LB</h3><p>In production environments, you typically want a <strong>single, stable endpoint</strong> that external clients (developers, CI/CD pipelines, monitoring systems) can use to 
reach the Kubernetes API. This is where an external load balancer comes in.</p><p>In Case 2, we add an <strong>external HAProxy load balancer</strong> on the admin-lb node (192.168.10.10) while <strong>keeping the client-side NGINX proxy on worker nodes</strong>. This gives you the best of both worlds: workers use the local NGINX proxy for resilient internal connectivity, and external users connect through HAProxy.</p><pre>External Clients (kubectl, CI/CD, Monitoring)<br>                    │<br>                    ▼<br>        ┌───────────────────────┐<br>        │   admin-lb (HAProxy)  │<br>        │   192.168.10.10:6443  │<br>        └───────────┬───────────┘<br>                    │<br>       ┌────────────┼────────────┐<br>       ▼            ▼            ▼<br>  ┌──────────┐ ┌──────────┐ ┌──────────┐<br>  │ node1    │ │ node2    │ │ node3    │<br>  │ apiserver│ │ apiserver│ │ apiserver│<br>  └──────────┘ └──────────┘ └──────────┘<br>       ▲            ▲            ▲<br>       └────────────┼────────────┘<br>                    │<br>        ┌───────────────────────┐<br>        │  Worker nodes still   │<br>        │  use nginx-proxy      │<br>        │  (localhost:6443)     │<br>        └───────────────────────┘</pre><h3>Adding the External LB Endpoint</h3><p>In our lab, HAProxy was already configured on the admin-lb node during the initial Vagrant provisioning (see Section 4). The HAProxy configuration in /etc/haproxy/haproxy.cfg includes:</p><pre>frontend kubernetes-api<br>    bind *:6443<br>    mode tcp<br>    option tcplog<br>    default_backend kubernetes-api-backend</pre><pre>backend kubernetes-api-backend<br>    mode tcp<br>    option tcp-check<br>    balance roundrobin<br>    server k8s-node1 192.168.10.11:6443 check fall 3 rise 2<br>    server k8s-node2 192.168.10.12:6443 check fall 3 rise 2<br>    server k8s-node3 192.168.10.13:6443 check fall 3 rise 2</pre><p>HAProxy performs Layer 4 TCP proxying, health-checking each backend every few seconds. 
If a control plane node fails health checks three times (fall 3), it is removed from the pool. When it passes two consecutive checks (rise 2), it is re-added.</p><p>You can verify HAProxy is forwarding correctly:</p><pre>$ curl -sk https://192.168.10.10:6443/version<br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><p>However, at this point, the kube-apiserver TLS certificate does <strong>not</strong> include 192.168.10.10 or the HAProxy domain name in its Subject Alternative Names (SANs). This means that any client performing proper TLS verification will reject the connection:</p><pre>$ curl -v https://192.168.10.10:6443/version<br>* SSL: no alternative certificate subject name matches target host name &#39;192.168.10.10&#39;<br>curl: (60) SSL: no alternative certificate subject name matches target host name &#39;192.168.10.10&#39;</pre><p>We must update the certificates to include the external LB address.</p><h3>Certificate SAN Update for External LB IP/Domain</h3><p>To make the kube-apiserver certificate valid when accessed through the HAProxy IP (192.168.10.10) or a domain name (k8s-api-srv.admin-lb.com), we need to add these as supplementary SANs.</p><p><strong>Step 1: Update the Kubespray group variables</strong></p><p>Edit inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml and add the following:</p><pre>## Supplementary addresses that can be added in kubernetes ssl keys.<br>## That can be useful for example to setup a determinate endpoint<br>## for FQDN registrations and target that endpoint with HAProxy.<br>supplementary_addresses_in_ssl_keys:<br>  - 192.168.10.10<br>  - k8s-api-srv.admin-lb.com</pre><p>This tells Kubespray to include these additional addresses in the SAN field of the kube-apiserver’s TLS certificate.</p><p><strong>Step 2: Apply the certificate update</strong></p><p>You do not need to re-run the entire cluster.yml playbook. 
Instead, target only the control plane tag with a limit to control plane nodes:</p><pre>$ cd /root/kubespray</pre><pre>$ ansible-playbook cluster.yml \<br>    --tags &quot;control-plane&quot; \<br>    --limit kube_control_plane \<br>    -e kube_version=&quot;1.32.9&quot;</pre><p>This playbook run will:</p><ol><li>Regenerate the kube-apiserver TLS certificates with the new SANs</li><li>Restart kube-apiserver on each control plane node to pick up the new certificates</li><li>Leave everything else (etcd, worker nodes, CNI) untouched</li></ol><p>The run typically completes in about <strong>3–4 minutes</strong>.</p><p><strong>Step 3: Verify the updated certificate</strong></p><p>After the playbook completes, inspect the certificate SANs:</p><pre>$ ssh k8s-node1 openssl x509 -in /etc/kubernetes/ssl/apiserver.crt -noout -text \<br>    | grep -A 20 &quot;Subject Alternative Name&quot;<br>            X509v3 Subject Alternative Name:<br>                DNS:k8s-node1, DNS:k8s-node2, DNS:k8s-node3,<br>                DNS:lb-apiserver.kubernetes.local,<br>                DNS:localhost, DNS:kubernetes, DNS:kubernetes.default,<br>                DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local,<br>                DNS:k8s-api-srv.admin-lb.com,<br>                IP Address:10.233.0.1, IP Address:192.168.10.11,<br>                IP Address:192.168.10.12, IP Address:192.168.10.13,<br>                IP Address:127.0.0.1, IP Address:192.168.10.10</pre><p>The new entries k8s-api-srv.admin-lb.com and 192.168.10.10 are now present in the certificate SANs.</p><p><strong>Step 4: Test external access through HAProxy</strong></p><pre># Using IP address<br>$ curl -sk https://192.168.10.10:6443/version<br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><pre># Using domain name (ensure DNS or /etc/hosts resolves this)<br>$ curl -sk <a 
href="https://k8s-api-srv.admin-lb.com:6443/version">https://k8s-api-srv.admin-lb.com:6443/version</a><br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><p>Both endpoints now respond. Note that -k skips certificate verification, so these requests alone do not prove the new SANs are accepted. To confirm that, repeat them without -k and pass the cluster CA via --cacert (Kubespray keeps it at /etc/kubernetes/ssl/ca.crt on control plane nodes).</p><p><strong>Step 5: Update the admin kubeconfig to use the external LB</strong></p><p>Now that HAProxy is a valid entry point, update the admin kubeconfig on the admin-lb node to use it:</p><pre># Copy the kubeconfig from a control plane node<br>$ scp k8s-node1:/etc/kubernetes/admin.conf /root/.kube/config</pre><pre># Update the server URL to point to HAProxy<br>$ kubectl config set-cluster cluster.local \<br>    --server=https://k8s-api-srv.admin-lb.com:6443</pre><p>Or manually edit ~/.kube/config:</p><pre>apiVersion: v1<br>clusters:<br>- cluster:<br>    certificate-authority-data: &lt;base64-encoded-ca-cert&gt;<br>    server: https://k8s-api-srv.admin-lb.com:6443  # Changed from https://192.168.10.11:6443<br>  name: cluster.local</pre><p>Now all kubectl commands from the admin-lb node will go through HAProxy, which distributes them across all three control plane nodes.</p><h3>Failure Simulation with External LB</h3><p>With the external LB in place, let’s simulate a more realistic failure scenario.</p><p><strong>Step 1: Observe HAProxy backend status</strong></p><p>Open the HAProxy stats page in a browser:</p><pre>http://192.168.10.10:9000/haproxy_stats</pre><p>You should see all three backends (k8s-node1, k8s-node2, k8s-node3) in a green/UP state.</p><p><strong>Step 2: Shut down an entire control plane node</strong></p><pre># From the host machine (outside the VMs)<br>$ vagrant halt k8s-node1</pre><p><strong>Step 3: Verify HAProxy detects the failure</strong></p><p>Refresh the HAProxy stats page. Within a few seconds (determined by fall 3 and the health check interval), k8s-node1 will transition to a red/DOWN state. 
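</p><p>To put a rough number on &quot;a few seconds&quot;: HAProxy marks a server DOWN after fall consecutive failed checks, spaced inter apart. The backend shown earlier does not set inter, so the sketch below assumes HAProxy’s default of 2 seconds; detect_ms is a hypothetical helper name used only for this illustration.</p><pre># Hypothetical helper (not part of HAProxy): estimate the failure-detection delay in milliseconds.<br># Assumption: inter=2000 ms, the HAProxy default when &quot;inter&quot; is absent from the server line.<br>detect_ms() {<br>  fall=$1<br>  inter_ms=$2<br>  echo $(( fall * inter_ms ))<br>}<br><br>detect_ms 3 2000   # prints 6000</pre><p>Under these assumptions the estimate is about six seconds. The real figure varies with where the failure lands in the check cycle and with check timeouts, since a halted VM does not refuse connections immediately.</p><p>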
The remaining two backends stay UP.</p><p><strong>Step 4: Verify cluster operations continue</strong></p><pre>$ kubectl get nodes<br>NAME        STATUS     ROLES           AGE   VERSION<br>k8s-node1   NotReady   control-plane   45m   v1.32.9<br>k8s-node2   Ready      control-plane   43m   v1.32.9<br>k8s-node3   Ready      control-plane   41m   v1.32.9<br>k8s-node4   Ready      &lt;none&gt;          38m   v1.32.9<br>k8s-node5   Ready      &lt;none&gt;          38m   v1.32.9</pre><pre>$ kubectl create deployment test-nginx --image=nginx --replicas=2<br>deployment.apps/test-nginx created</pre><pre>$ kubectl get pods -o wide<br>NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE<br>test-nginx-7c79c4bf97-abc12   1/1     Running   0          10s   10.233.90.5   k8s-node4<br>test-nginx-7c79c4bf97-def34   1/1     Running   0          10s   10.233.91.3   k8s-node5</pre><p>The cluster is fully operational. Both HAProxy (for external access) and the NGINX proxy on workers (for internal communication) have seamlessly failed over to the remaining control plane nodes.</p><p><strong>Step 5: Restore the node</strong></p><pre>$ vagrant up k8s-node1</pre><p>After the node boots and kubelet starts, it will rejoin the cluster. 
HAProxy will detect the restored backend via health checks (rise 2) and begin routing traffic to it again.</p><pre># Clean up the test deployment<br>$ kubectl delete deployment test-nginx</pre><p><strong>Summary of Case 2:</strong></p><p>In this configuration:</p><ul><li><strong>Worker nodes</strong> use the local nginx-proxy (localhost:6443) — unchanged from Case 1</li><li><strong>External clients</strong> (kubectl, CI/CD, monitoring) use HAProxy (192.168.10.10:6443)</li><li><strong>Certificates</strong> include both the control plane IPs and the HAProxy IP/domain as SANs</li><li><strong>Two independent failover mechanisms</strong> protect the cluster: HAProxy for external traffic, nginx-proxy for internal traffic</li></ul><p>This is the most common production configuration because it provides maximum resilience without requiring changes to the internal cluster networking model.</p><h3>Case 3: External LB as Single Endpoint for All Components</h3><p>In some environments, teams prefer a <strong>fully centralized approach</strong> where every component — including kubelet and kube-proxy on worker nodes — connects to the API server exclusively through the external load balancer. This eliminates the per-node NGINX proxy and creates a simpler, more uniform architecture.</p><pre>All Clients (kubelet, kube-proxy, kubectl, CI/CD)<br>                    │<br>                    ▼<br>        ┌───────────────────────┐<br>        │   admin-lb (HAProxy)  │<br>        │   192.168.10.10:6443  │<br>        └───────────┬───────────┘<br>                    │<br>       ┌────────────┼────────────┐<br>       ▼            ▼            ▼<br>  ┌──────────┐ ┌──────────┐ ┌──────────┐<br>  │ node1    │ │ node2    │ │ node3    │<br>  │ apiserver│ │ apiserver│ │ apiserver│<br>  └──────────┘ └──────────┘ └──────────┘</pre><p>In this mode, <strong>no nginx-proxy static pods exist on worker nodes</strong>. 
Every API request from every node passes through HAProxy.</p><h3>Disabling Client-Side LB (loadbalancer_apiserver_localhost: false)</h3><p>To switch to external-LB-only mode, update the Kubespray configuration with the following variables:</p><p><strong>Edit </strong><strong>inventory/mycluster/group_vars/all/all.yml:</strong></p><pre>## External LB configuration<br>## This domain name will be used in kubeconfigs, kubelet configs, and<br>## kube-proxy configs on all nodes.<br>apiserver_loadbalancer_domain_name: &quot;k8s-api-srv.admin-lb.com&quot;</pre><pre>## External load balancer address and port<br>loadbalancer_apiserver:<br>  address: 192.168.10.10<br>  port: 6443</pre><pre>## CRITICAL: Disable client-side (localhost) load balancing<br>## When set to false, nginx-proxy static pods will be removed from worker nodes<br>## and all components will use the external LB endpoint instead.<br>loadbalancer_apiserver_localhost: false</pre><p>Make sure the supplementary_addresses_in_ssl_keys from Case 2 is still present in k8s-cluster.yml:</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>supplementary_addresses_in_ssl_keys:<br>  - 192.168.10.10<br>  - k8s-api-srv.admin-lb.com</pre><p><strong>Apply the changes:</strong></p><p>Since we are changing a fundamental cluster networking parameter, this requires a broader playbook run:</p><pre>$ cd /root/kubespray</pre><pre>$ ansible-playbook cluster.yml \<br>    -e kube_version=&quot;1.32.9&quot;</pre><p>This playbook run will:</p><ol><li>Remove the nginx-proxy static pod manifest from all worker nodes</li><li>Remove the /etc/nginx/nginx.conf file from worker nodes</li><li>Regenerate kubelet configuration on all nodes to point to <a href="https://k8s-api-srv.admin-lb.com:6443">https://k8s-api-srv.admin-lb.com:6443</a></li><li>Update the kube-proxy ConfigMap server endpoint</li><li>Regenerate all kubeconfig files (admin.conf, kubelet.conf, scheduler.conf, controller-manager.conf) with the new server 
URL</li><li>Restart kubelet on all nodes to pick up the new configuration</li></ol><blockquote><strong><em>⚠️ Important:</em></strong><em> This is a disruptive change. During the playbook run, worker nodes will be reconfigured to point to the external LB. If HAProxy is not running or the DNS name does not resolve, </em><strong><em>worker nodes will lose connectivity to the API server</em></strong><em>. Always verify HAProxy is healthy before applying this change.</em></blockquote><h3>Verifying kubelet and kube-proxy Endpoint Configuration</h3><p>After the playbook completes, systematically verify that every component is pointing to the external LB.</p><p><strong>1. Verify nginx-proxy pods are gone from worker nodes</strong></p><pre>$ kubectl get pods -A -o wide | grep nginx-proxy<br># No output — nginx-proxy pods have been removed</pre><p>Confirm the static pod manifest no longer exists on worker nodes:</p><pre>$ ssh k8s-node4 ls /etc/kubernetes/manifests/<br># nginx-proxy.yaml should NOT be listed</pre><p>And the NGINX configuration file has been cleaned up:</p><pre>$ ssh k8s-node4 cat /etc/nginx/nginx.conf<br>cat: /etc/nginx/nginx.conf: No such file or directory</pre><p><strong>2. Verify kubelet configuration on worker nodes</strong></p><pre>$ ssh k8s-node4 cat /etc/kubernetes/kubelet.conf | grep server<br>    server: https://k8s-api-srv.admin-lb.com:6443</pre><p>Previously (in Case 1 and Case 2), this was https://localhost:6443. Now it points directly to the external LB domain name.</p><p>Check another worker to confirm consistency:</p><pre>$ ssh k8s-node5 cat /etc/kubernetes/kubelet.conf | grep server<br>    server: https://k8s-api-srv.admin-lb.com:6443</pre><p><strong>3. Verify kubelet configuration on control plane nodes</strong></p><p>Control plane nodes have a slightly different behavior. Depending on the Kubespray version and settings, control plane kubelet may point to localhost:6443 (directly to the local kube-apiserver) or to the external LB. 
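</p><p>Because the same grep is repeated for every node and file, a small filter can make the comparison less error-prone. This is only a convenience sketch: server_of is a hypothetical helper, not something Kubespray or kubectl provides, and it assumes the kubeconfig keeps server: on its own line as in the outputs above.</p><pre># Hypothetical helper: print the API server URL(s) found in a kubeconfig.<br># Reads the files given as arguments, or stdin when none are given.<br>server_of() {<br>  awk &#39;$1 == &quot;server:&quot; { print $2 }&#39; &quot;$@&quot;<br>}<br><br># Usage, with the node names from this lab:<br>#   ssh k8s-node1 cat /etc/kubernetes/kubelet.conf | server_of</pre><p>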
Verify:</p><pre>$ ssh k8s-node1 cat /etc/kubernetes/kubelet.conf | grep server<br>    server: https://k8s-api-srv.admin-lb.com:6443</pre><p>In Case 3, even control plane nodes route through the external LB for consistency (though the request may loop back to the local apiserver via HAProxy).</p><p><strong>4. Verify kube-proxy configuration</strong></p><pre>$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep server<br>    server: https://k8s-api-srv.admin-lb.com:6443</pre><p><strong>5. Verify admin kubeconfig</strong></p><pre>$ ssh k8s-node1 cat /etc/kubernetes/admin.conf | grep server<br>    server: https://k8s-api-srv.admin-lb.com:6443</pre><p><strong>6. Test API access from multiple paths</strong></p><pre># Through HAProxy IP<br>$ curl -sk https://192.168.10.10:6443/version<br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><pre># Through HAProxy domain name<br>$ curl -sk <a href="https://k8s-api-srv.admin-lb.com:6443/version">https://k8s-api-srv.admin-lb.com:6443/version</a><br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><pre># Directly to a control plane node (still works — useful for debugging)<br>$ curl -sk <a href="https://192.168.10.11:6443/version">https://192.168.10.11:6443/version</a><br>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;32&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.32.9&quot;,<br>  ...<br>}</pre><p><strong>7. 
Verify cluster health</strong></p><pre>$ kubectl get nodes -o wide<br>NAME        STATUS   ROLES           AGE   VERSION    INTERNAL-IP      OS-IMAGE<br>k8s-node1   Ready    control-plane   50m   v1.32.9    192.168.10.11    Rocky Linux 10.0<br>k8s-node2   Ready    control-plane   48m   v1.32.9    192.168.10.12    Rocky Linux 10.0<br>k8s-node3   Ready    control-plane   46m   v1.32.9    192.168.10.13    Rocky Linux 10.0<br>k8s-node4   Ready    &lt;none&gt;          43m   v1.32.9    192.168.10.14    Rocky Linux 10.0<br>k8s-node5   Ready    &lt;none&gt;          43m   v1.32.9    192.168.10.15    Rocky Linux 10.0</pre><pre>$ kubectl get pods -A<br>NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE<br>kube-system   coredns-xxxxxxxxx-xxxxx             1/1     Running   0          50m<br>kube-system   coredns-xxxxxxxxx-xxxxx             1/1     Running   0          50m<br>kube-system   kube-apiserver-k8s-node1            1/1     Running   0          50m<br>kube-system   kube-apiserver-k8s-node2            1/1     Running   0          48m<br>kube-system   kube-apiserver-k8s-node3            1/1     Running   0          46m<br>kube-system   kube-controller-manager-k8s-node1   1/1     Running   0          50m<br>kube-system   kube-controller-manager-k8s-node2   1/1     Running   0          48m<br>kube-system   kube-controller-manager-k8s-node3   1/1     Running   0          46m<br>kube-system   kube-proxy-xxxxx                    1/1     Running   0          43m<br>kube-system   kube-proxy-xxxxx                    1/1     Running   0          43m<br>kube-system   kube-proxy-xxxxx                    1/1     Running   0          50m<br>kube-system   kube-proxy-xxxxx                    1/1     Running   0          48m<br>kube-system   kube-proxy-xxxxx                    1/1     Running   0          46m<br>kube-system   kube-scheduler-k8s-node1            1/1     Running   0          50m<br>kube-system   kube-scheduler-k8s-node2            1/1     
Running   0          48m<br>kube-system   kube-scheduler-k8s-node3            1/1     Running   0          46m</pre><p>Notice that there are <strong>no nginx-proxy-* pods</strong> in the listing. All traffic now flows through the external HAProxy.</p><p><strong>Failure Domain Consideration in Case 3:</strong></p><p>In this architecture, HAProxy becomes a <strong>critical single point of failure</strong> for the entire cluster. If the admin-lb node goes down, containers that are already running keep running, but:</p><ul><li>Worker nodes lose API server connectivity (kubelet cannot report node status)</li><li>kube-proxy cannot receive Service/Endpoint updates</li><li>kubectl commands from external clients fail</li><li>New pods cannot be scheduled</li></ul><p>In production, you would mitigate this by:</p><ul><li>Running HAProxy in a highly available pair with <strong>keepalived</strong> (VRRP) for a floating virtual IP</li><li>Using a cloud provider’s managed load balancer (AWS NLB, GCP Internal LB, Azure LB)</li><li>Deploying multiple HAProxy instances behind DNS round-robin</li></ul><h3>Choosing the Right Configuration for Your Environment</h3><p>For most production deployments, <strong>Case 2 (External LB + Client-Side LB)</strong> is the recommended approach. It provides two independent failover paths, keeps in-cluster traffic free of a single point of failure (workers still reach the API servers through their local nginx-proxy even if the external LB is down), and offers a stable external endpoint for administrative and CI/CD access. 
The slight additional complexity of running nginx-proxy on workers is fully managed by Kubespray and requires no manual intervention.</p><p><strong>Case 3 (External LB Only)</strong> is appropriate when:</p><ul><li>You have a highly available load balancer infrastructure (cloud-managed LB, keepalived pair)</li><li>You want a simpler mental model where all traffic follows a single path</li><li>Your organization’s network team manages the LB and prefers centralized control</li></ul><p><strong>Case 1 (Client-Side Only)</strong> is best suited for:</p><ul><li>Development and testing environments</li><li>Air-gapped or isolated networks with no LB infrastructure</li><li>Temporary clusters where simplicity is prioritized over external accessibility</li></ul><p>In the remaining sections of this guide, we will continue with the <strong>Case 2</strong> configuration (external HAProxy + client-side NGINX proxy) as it is the most representative of real-world production setups.</p><h3>Node Management</h3><p>In a production Kubernetes cluster, nodes are not static. Worker nodes need to be scaled out to handle increased workloads, decommissioned when no longer needed, or replaced when hardware fails. Control plane nodes occasionally require rotation — for OS patching, hardware refresh, or disaster recovery. Kubespray provides dedicated playbooks for each of these lifecycle operations, making it possible to manage nodes declaratively and repeatably.</p><p>This section walks through every major node management scenario: adding workers, gracefully removing them, force-removing unhealthy nodes that have gone offline, replacing a control plane node, and performing a full cluster reset.</p><blockquote><strong><em>Lab Context Recap</em></strong></blockquote><blockquote><em>Our cluster currently has 3 control plane nodes (</em><em>k8s-node1, </em><em>k8s-node2, </em><em>k8s-node3) and 1 worker node (</em><em>k8s-node4). The admin/LB node (</em><em>admin-lb) runs HAProxy and Kubespray. 
Kubernetes version is </em><em>v1.32.9.</em></blockquote><h3>Adding a Worker Node (scale.yml)</h3><p>When your cluster needs more compute capacity, Kubespray’s scale.yml playbook lets you add new worker nodes without disrupting existing workloads. Unlike cluster.yml, which operates on the entire cluster, scale.yml is purpose-built for incremental node addition — it only touches the new node and the minimal set of existing resources needed to integrate it.</p><h3>Scale Playbook Walkthrough</h3><p>Before running the playbook, let’s understand what scale.yml does under the hood. The playbook executes the following sequence of operations:</p><ol><li><strong>Download</strong> — Pulls the required container images and binaries to the new node (containerd images, CNI plugins, kubelet binary, kubeadm binary).</li><li><strong>etcd (conditional)</strong> — If the new node is an etcd member (not the case for worker-only additions), it would configure and join the etcd cluster. For worker nodes, this step is skipped entirely.</li><li><strong>Install kubelet</strong> — Installs and configures the kubelet service on the new node, including the kubelet configuration file, systemd unit, and certificate bootstrap token.</li><li><strong>Upload control plane certificates</strong> — Copies the necessary CA certificates from an existing control plane node so the new worker can establish trust with the API server.</li><li><strong>kubeadm join</strong> — Runs kubeadm join on the new node, which registers it with the cluster and starts the kubelet.</li><li><strong>Apply labels and taints</strong> — Applies any node labels or taints defined in the inventory.</li><li><strong>Configure CNI</strong> — Deploys the CNI plugin (Flannel in our case) so the node can participate in the pod network.</li></ol><blockquote><strong><em>Important:</em></strong><em> </em><em>scale.yml can only be used for adding </em><strong><em>worker nodes</em></strong><em>. 
You cannot use </em><em>scale.yml to add new control plane nodes. For control plane additions, you must use </em><em>cluster.yml — this is covered in Section 7.4.</em></blockquote><h3>Step-by-Step Execution and Verification</h3><p><strong>Step 1: Update the Inventory</strong></p><p>First, add the new node to inventory.ini. We&#39;ll add k8s-node5 as a worker:</p><pre># inventory/mycluster/inventory.ini</pre><pre>[all]<br>k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11<br>k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12<br>k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13<br>k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14<br>k8s-node5 ansible_host=192.168.10.15 ip=192.168.10.15  # &lt;-- NEW</pre><pre>[kube_control_plane]<br>k8s-node1<br>k8s-node2<br>k8s-node3</pre><pre>[etcd:children]<br>kube_control_plane</pre><pre>[kube_node]<br>k8s-node4<br>k8s-node5  # &lt;-- NEW</pre><pre>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p><strong>Step 2: Run the Scale Playbook</strong></p><p>Execute scale.yml with the --limit flag to target only the new node. Always specify the Kubernetes version explicitly to ensure consistency:</p><pre>cd /root/kubespray</pre><pre>ansible-playbook scale.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  --limit=k8s-node5 \<br>  -e kube_version=&quot;1.32.9&quot;</pre><p>The playbook typically completes in about <strong>3 minutes</strong> for a single worker node. 
You’ll see output similar to:</p><pre>PLAY RECAP ***********************************************************************<br>k8s-node5                  : ok=198  changed=62   unreachable=0    failed=0    skipped=412  rescued=0    ignored=0</pre><p><strong>Step 3: Verify the New Node</strong></p><p>Once the playbook completes, verify the node has joined the cluster:</p><pre>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP      OS-IMAGE            KERNEL-VERSION<br>k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11    Rocky Linux 10.0    ...<br>k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12    Rocky Linux 10.0    ...<br>k8s-node3   Ready    control-plane   1d     v1.32.9   192.168.10.13    Rocky Linux 10.0    ...<br>k8s-node4   Ready    &lt;none&gt;          1d     v1.32.9   192.168.10.14    Rocky Linux 10.0    ...<br>k8s-node5   Ready    &lt;none&gt;          30s    v1.32.9   192.168.10.15    Rocky Linux 10.0    ...</pre><p>The new k8s-node5 should appear with STATUS: Ready and the correct Kubernetes version. 
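</p><p>As an optional check, the verification above can be scripted. This is a minimal sketch that assumes kubectl is already pointed at this cluster; the 180-second timeout is an arbitrary choice, not a Kubespray default:</p><pre># Block until the new node reports Ready (exits non-zero after the timeout)<br>kubectl wait --for=condition=Ready node/k8s-node5 --timeout=180s</pre><pre># Print the kubelet version the node registered with<br>kubectl get node k8s-node5 -o jsonpath=&#39;{.status.nodeInfo.kubeletVersion}&#39;</pre><p>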
You can also verify that the CNI plugin is running:</p><pre>kubectl get pods -n kube-system -o wide | grep flannel</pre><pre>kube-flannel-xxxxx   1/1   Running   0   45s   192.168.10.15   k8s-node5   ...</pre><p>If you’re using client-side load balancing (the default Kubespray behavior), you’ll also see an nginx-proxy static pod on the new worker:</p><pre>kubectl get pods -n kube-system -o wide | grep nginx-proxy</pre><pre>nginx-proxy-k8s-node4   1/1   Running   0   1d    192.168.10.14   k8s-node4   ...<br>nginx-proxy-k8s-node5   1/1   Running   0   45s   192.168.10.15   k8s-node5   ...</pre><p>This nginx-proxy is configured to load-balance API requests across all three control plane nodes, ensuring the new worker has full HA connectivity to the API server from the moment it joins.</p><h3>Removing a Worker Node (remove-node.yml)</h3><p>When decommissioning a worker node — whether for cost optimization, hardware retirement, or cluster right-sizing — Kubespray’s remove-node.yml playbook handles the full lifecycle: draining workloads, cleaning up cluster metadata, and resetting the node.</p><h3>Graceful Removal with PDB Considerations</h3><p>Before diving into the removal process, it’s critical to understand how PodDisruptionBudget (PDB) objects can affect node removal.</p><p>When Kubespray drains a node, it runs the equivalent of kubectl drain --ignore-daemonsets --delete-emptydir-data. This eviction process respects PDB constraints. 
If a PDB specifies maxUnavailable: 0 for a set of pods, the drain operation will <strong>block indefinitely</strong> because Kubernetes refuses to evict pods that would violate the disruption budget.</p><p><strong>Example scenario:</strong> Suppose you have a deployment with 2 replicas and a PDB that says maxUnavailable: 0:</p><pre>apiVersion: policy/v1<br>kind: PodDisruptionBudget<br>metadata:<br>  name: my-app-pdb<br>spec:<br>  maxUnavailable: 0<br>  selector:<br>    matchLabels:<br>      app: my-app</pre><p>If even one of those replicas is running on the node you’re trying to drain, its eviction is refused, because maxUnavailable: 0 permits no voluntary disruption at all, and the drain command will hang. In the Kubespray playbook output, you’ll see the drain task stall with no progress.</p><p><strong>Mitigation strategies:</strong></p><ul><li>Before removing a node, check for PDBs: kubectl get pdb --all-namespaces</li><li>Temporarily adjust the PDB to allow disruption: maxUnavailable: 1</li><li>Or manually reschedule pods off the target node before starting the removal</li><li>Use kubectl drain &lt;node&gt; --timeout=300s to set a timeout on the drain (Kubespray&#39;s default behavior includes a drain timeout, but verify your configuration)</li></ul><h3>Remove Playbook Walkthrough</h3><p>The remove-node.yml playbook executes the following steps:</p><ol><li><strong>Confirmation prompt</strong> — Unless skip_confirmation=true is passed, the playbook asks for manual confirmation before proceeding. This is a safety mechanism to prevent accidental node removal.</li><li><strong>Drain the node</strong> — Cordons the node (marks it as unschedulable) and evicts all non-DaemonSet pods. Workloads are rescheduled to other available nodes.</li><li><strong>Remove etcd member</strong> — If the node is an etcd member, it is removed from the etcd cluster. 
For worker-only nodes, this step is skipped.</li><li><strong>kubeadm reset</strong> — Runs kubeadm reset on the target node, which tears down the kubelet, removes certificates, and cleans up local Kubernetes state.</li><li><strong>Delete node metadata</strong> — Removes the Node object from the Kubernetes API so it no longer appears in kubectl get nodes.</li></ol><p><strong>Execution:</strong></p><p>Let’s remove the k8s-node5 worker node we just added:</p><pre>cd /root/kubespray</pre><pre>ansible-playbook remove-node.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e node=k8s-node5 \<br>  -e skip_confirmation=true \<br>  -e kube_version=&quot;1.32.9&quot;</pre><p>Key parameters:</p><ul><li><strong>-e node=k8s-node5</strong>: Specifies which node to remove. Can be a comma-separated list for multiple nodes.</li><li><strong>-e skip_confirmation=true</strong>: Bypasses the interactive confirmation prompt. Useful for automation.</li></ul><p>The playbook typically completes in about <strong>2 minutes</strong>. 
Output:</p><pre>PLAY RECAP ***********************************************************************<br>k8s-node1                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0<br>k8s-node5                  : ok=15   changed=9    unreachable=0    failed=0    skipped=18   rescued=0    ignored=0</pre><p>Notice that k8s-node1 (a control plane node) also shows activity — this is because the playbook connects to a control plane node to execute kubectl drain and kubectl delete node commands.</p><p><strong>Verification:</strong></p><pre>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP      OS-IMAGE            KERNEL-VERSION<br>k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11    Rocky Linux 10.0    ...<br>k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12    Rocky Linux 10.0    ...<br>k8s-node3   Ready    control-plane   1d     v1.32.9   192.168.10.13    Rocky Linux 10.0    ...<br>k8s-node4   Ready    &lt;none&gt;          1d     v1.32.9   192.168.10.14    Rocky Linux 10.0    ...</pre><p>k8s-node5 is gone. Any pods that were running on it have been rescheduled to k8s-node4 or one of the control plane nodes (if they tolerate control plane taints).</p><blockquote><strong><em>Don’t forget to update the inventory.</em></strong><em> After removing a node, clean up </em><em>inventory.ini by removing the </em><em>k8s-node5 entry from both the </em><em>[all] and </em><em>[kube_node] groups. While Kubespray won&#39;t break if a removed node remains in the inventory, keeping the inventory in sync with reality is a best practice for Day-2 operations.</em></blockquote><h3>Force-Removing an Unhealthy Node</h3><p>Not all node removals are graceful. In production, you will encounter scenarios where a node has suffered a catastrophic failure — hardware crash, kernel panic, network partition, or a cloud instance that simply vanished. 
In these cases, you can’t SSH into the node to run kubeadm reset, and a normal remove-node.yml run will fail because Ansible can&#39;t reach the target.</p><p>Kubespray handles this with two extra variables: reset_nodes=false and allow_ungraceful_removal=true.</p><h3>Simulating a Node Failure</h3><p>To demonstrate force removal, let’s simulate a hard failure of k8s-node5. First, add it back to the cluster (if you removed it in the previous section), then force-stop the VM:</p><pre># On the Vagrant host (not admin-lb)<br>vagrant halt k8s-node5 --force</pre><p>Or if you’re simulating in a different environment, simply power off the machine:</p><pre># On the target node itself (before it goes down)<br>sudo systemctl poweroff --force</pre><p>After the node goes offline, Kubernetes will eventually mark it as NotReady:</p><pre>kubectl get nodes</pre><pre>NAME        STATUS     ROLES           AGE    VERSION<br>k8s-node1   Ready      control-plane   1d     v1.32.9<br>k8s-node2   Ready      control-plane   1d     v1.32.9<br>k8s-node3   Ready      control-plane   1d     v1.32.9<br>k8s-node4   Ready      &lt;none&gt;          1d     v1.32.9<br>k8s-node5   NotReady   &lt;none&gt;          10m    v1.32.9</pre><p>The node transitions to NotReady once the node controller has seen no kubelet heartbeat for its grace period (--node-monitor-grace-period, roughly 40 to 50 seconds by default depending on the Kubernetes version). Pods on the node are then evicted through taint-based eviction once their toleration for the not-ready or unreachable taint expires (300 seconds by default).</p><h3>Using reset_nodes=false and allow_ungraceful_removal=true</h3><p>Now attempt a normal removal — it will fail because Ansible cannot SSH into the offline node:</p><pre># This will FAIL:<br>ansible-playbook remove-node.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e node=k8s-node5 \<br>  -e skip_confirmation=true</pre><pre>fatal: [k8s-node5]: UNREACHABLE! 
=&gt; {&quot;changed&quot;: false, &quot;msg&quot;: &quot;Failed to connect to the host via ssh...&quot;}</pre><p>To force-remove the node without SSH access, use the ungraceful removal flags:</p><pre>ansible-playbook remove-node.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e node=k8s-node5 \<br>  -e skip_confirmation=true \<br>  -e reset_nodes=false \<br>  -e allow_ungraceful_removal=true</pre><p>Key parameters:</p><ul><li><strong>-e reset_nodes=false</strong>: Skips the kubeadm reset step on the target node. Since the node is unreachable, there&#39;s nothing to reset remotely.</li><li><strong>-e allow_ungraceful_removal=true</strong>: Permits removing the node from the cluster even though it couldn&#39;t be drained or cleaned up. Kubespray will only remove the cluster-side metadata (Node object, etcd member if applicable).</li></ul><p>What happens during ungraceful removal:</p><ol><li><strong>Drain is skipped</strong> — Since the node is unreachable, the drain step is bypassed. Pods that were running on the dead node will remain in Terminating state until the garbage collection timeout expires (or until the Node object is deleted).</li><li><strong>kubeadm reset is skipped</strong> — No SSH to the node, so no local cleanup.</li><li><strong>Node metadata is deleted</strong> — The Node object is removed from the Kubernetes API. Once deleted, all pods that were “stuck” on that node are freed for rescheduling by their respective controllers (Deployments, StatefulSets, etc.).</li></ol><p><strong>Verification:</strong></p><pre>kubectl get nodes</pre><pre>NAME        STATUS   ROLES           AGE    VERSION<br>k8s-node1   Ready    control-plane   1d     v1.32.9<br>k8s-node2   Ready    control-plane   1d     v1.32.9<br>k8s-node3   Ready    control-plane   1d     v1.32.9<br>k8s-node4   Ready    &lt;none&gt;          1d     v1.32.9</pre><p>The NotReady node is gone. 
Workloads are rescheduled.</p><blockquote><strong><em>Caution:</em></strong><em> If the “dead” node later comes back online (e.g., after a network partition heals), its kubelet will try to re-register with the API server using its old certificates. This can cause unexpected behavior. Best practice is to fully wipe and reprovision the node before allowing it to rejoin the cluster, or ensure its kubelet service is disabled/stopped.</em></blockquote><h3>Replacing a Control Plane Node</h3><p>Replacing a control plane node is the most complex node management operation. Unlike workers, control plane nodes host critical static pods (kube-apiserver, kube-controller-manager, kube-scheduler) and may be etcd members. The replacement process involves:</p><ol><li>Removing the old control plane node (including its etcd membership)</li><li>Updating the inventory</li><li>Adding the new node as a control plane member using cluster.yml</li></ol><p>Kubespray handles most of the complexity, but there are important ordering constraints to be aware of.</p><h3>Removing a Control Plane and Worker Node</h3><p>Let’s say k8s-node3 (our third control plane node) needs to be replaced. First, we remove it from the cluster:</p><pre>ansible-playbook remove-node.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e node=k8s-node3 \<br>  -e skip_confirmation=true \<br>  -e kube_version=&quot;1.32.9&quot;</pre><p>This playbook will:</p><ol><li>Drain k8s-node3 — evict all pods (including those scheduled by tolerations)</li><li>Remove k8s-node3 from the etcd cluster — runs etcdctl member remove to safely remove it from the quorum</li><li>Run kubeadm reset on k8s-node3 — cleans up certificates, static pod manifests, and kubelet state</li><li>Delete the Node object from the API</li></ol><p>During etcd member removal, the cluster transitions from a 3-member to a 2-member etcd cluster. 
This is still operational, but you’ve lost fault tolerance: a 2-member etcd cluster needs both members for quorum, so it cannot tolerate any additional failures. This is why you should minimize the time window between removing and re-adding a control plane node.</p><p><strong>Verify the etcd cluster state after removal:</strong></p><pre>ssh k8s-node1 etcdctl.sh member list -w table</pre><pre>+------------------+---------+-----------+---------------------------+---------------------------+<br>|        ID        | STATUS  |   NAME    |        PEER ADDRS         |       CLIENT ADDRS        |<br>+------------------+---------+-----------+---------------------------+---------------------------+<br>| 1a2b3c4d5e6f7890 | started | k8s-node1 | https://192.168.10.11:2380| https://192.168.10.11:2379|<br>| 2b3c4d5e6f789012 | started | k8s-node2 | https://192.168.10.12:2380| https://192.168.10.12:2379|<br>+------------------+---------+-----------+---------------------------+---------------------------+</pre><p>Only 2 members remain. Let’s also check the Kubernetes node list:</p><pre>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP<br>k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11<br>k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12<br>k8s-node4   Ready    &lt;none&gt;          1d     v1.32.9   192.168.10.14</pre><p>k8s-node3 is completely gone from both the Kubernetes cluster and the etcd cluster.</p><blockquote><strong><em>Critical Limitation:</em></strong><em> The </em><strong><em>first node</em></strong><em> listed in the </em><em>[kube_control_plane] group (typically </em><em>k8s-node1) cannot be removed using </em><em>remove-node.yml. This node is treated as the &quot;initial control plane&quot; by Kubespray and kubeadm, and it cannot be removed in place. 
If you need to replace node1, you must first add a new control plane node, then use </em><em>cluster.yml with a reconfigured inventory that lists the new node first.</em></blockquote><h3>Re-adding a Node as Control Plane (cluster.yml)</h3><p>Now we’ll add k8s-node3 back, but this time as a fresh control plane member. After reprovisioning the VM (or ensuring the old state has been cleaned up via kubeadm reset), update the inventory.</p><p><strong>Critical: New control plane nodes must be added at the END of the </strong><strong>[kube_control_plane] group.</strong> Kubespray and kubeadm treat the first node in the group as the &quot;initial&quot; control plane. If you insert a new node before existing members, you risk breaking the cluster.</p><pre># inventory/mycluster/inventory.ini</pre><pre>[all]<br>k8s-node1 ansible_host=192.168.10.11 ip=192.168.10.11<br>k8s-node2 ansible_host=192.168.10.12 ip=192.168.10.12<br>k8s-node3 ansible_host=192.168.10.13 ip=192.168.10.13<br>k8s-node4 ansible_host=192.168.10.14 ip=192.168.10.14</pre><pre>[kube_control_plane]<br>k8s-node1<br>k8s-node2<br>k8s-node3  # &lt;-- Re-added at the END</pre><pre>[etcd:children]<br>kube_control_plane</pre><pre>[kube_node]<br>k8s-node4</pre><pre>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>Now run cluster.yml (not scale.yml!) to add the control plane node:</p><pre>ansible-playbook cluster.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e kube_version=&quot;1.32.9&quot;</pre><blockquote><strong><em>Why </em></strong><strong><em>cluster.yml and not </em></strong><strong><em>scale.yml?</em></strong><em> The </em><em>scale.yml playbook only handles worker node additions. 
Control plane nodes require the full </em><em>cluster.yml playbook because they need:</em></blockquote><ul><li>etcd member addition and data synchronization</li><li>Static pod manifest generation for apiserver, controller-manager, scheduler</li><li>Certificate generation and distribution</li><li>kubeadm control plane join (different from worker join)</li></ul><p>The playbook is idempotent — it will skip tasks for k8s-node1, k8s-node2, and k8s-node4 that are already configured, and only perform meaningful work on k8s-node3. However, running cluster.yml does touch all nodes (unlike scale.yml --limit), so expect it to take longer than a simple worker addition.</p><h3>etcd Member Changes and NGINX Config Updates</h3><p>After cluster.yml completes, verify that the control plane is fully restored:</p><p><strong>etcd cluster — back to 3 members:</strong></p><pre>ssh k8s-node1 etcdctl.sh member list -w table</pre><pre>+------------------+---------+-----------+---------------------------+---------------------------+<br>|        ID        | STATUS  |   NAME    |        PEER ADDRS         |       CLIENT ADDRS        |<br>+------------------+---------+-----------+---------------------------+---------------------------+<br>| 1a2b3c4d5e6f7890 | started | k8s-node1 | https://192.168.10.11:2380| https://192.168.10.11:2379|<br>| 2b3c4d5e6f789012 | started | k8s-node2 | https://192.168.10.12:2380| https://192.168.10.12:2379|<br>| 3c4d5e6f78901234 | started | k8s-node3 | https://192.168.10.13:2380| https://192.168.10.13:2379|<br>+------------------+---------+-----------+---------------------------+---------------------------+</pre><p>The new k8s-node3 etcd member has automatically joined and synchronized data from the existing members.</p><p><strong>Verify etcd endpoint health:</strong></p><pre>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><pre>+---------------------------+------------------+---------+---------+-----------+-----------+<br>|         ENDPOINT          |        
ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |<br>+---------------------------+------------------+---------+---------+-----------+-----------+<br>| https://192.168.10.11:2379| 1a2b3c4d5e6f7890 | 3.5.25  |   25 MB |   false   |        4  |<br>| https://192.168.10.12:2379| 2b3c4d5e6f789012 | 3.5.25  |   25 MB |    true   |        4  |<br>| https://192.168.10.13:2379| 3c4d5e6f78901234 | 3.5.25  |   25 MB |   false   |        4  |<br>+---------------------------+------------------+---------+---------+-----------+-----------+</pre><p>All three endpoints are healthy with consistent DB sizes.</p><p><strong>Kubernetes nodes — 3 control planes restored:</strong></p><pre>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           AGE    VERSION   INTERNAL-IP<br>k8s-node1   Ready    control-plane   1d     v1.32.9   192.168.10.11<br>k8s-node2   Ready    control-plane   1d     v1.32.9   192.168.10.12<br>k8s-node3   Ready    control-plane   60s    v1.32.9   192.168.10.13<br>k8s-node4   Ready    &lt;none&gt;          1d     v1.32.9   192.168.10.14</pre><p><strong>NGINX proxy configuration on workers (automatic update):</strong></p><p>If you’re using client-side load balancing (the default), Kubespray automatically updates the nginx.conf on all worker nodes to include the new control plane node. You can verify this by checking the nginx configuration on k8s-node4:</p><pre>ssh k8s-node4 cat /etc/nginx/nginx.conf</pre><pre>stream {<br>    upstream kube_apiserver {<br>        least_conn;<br>        server 192.168.10.11:6443;<br>        server 192.168.10.12:6443;<br>        server 192.168.10.13:6443;  # &lt;-- Restored<br>    }</pre><pre>    server {<br>        listen 127.0.0.1:6443;<br>        proxy_pass kube_apiserver;<br>        proxy_timeout 10m;<br>        proxy_connect_timeout 1s;<br>    }<br>}</pre><p>All three control plane IPs are present in the upstream block. 
The nginx-proxy static pod on k8s-node4 will automatically reload with this updated configuration, ensuring the worker&#39;s API requests are distributed across all three control plane nodes.</p><p><strong>Static pods on the new control plane node:</strong></p><pre>kubectl get pods -n kube-system -o wide | grep k8s-node3</pre><pre>etcd-k8s-node3                      1/1   Running   0   60s   192.168.10.13   k8s-node3<br>kube-apiserver-k8s-node3             1/1   Running   0   60s   192.168.10.13   k8s-node3<br>kube-controller-manager-k8s-node3    1/1   Running   0   60s   192.168.10.13   k8s-node3<br>kube-scheduler-k8s-node3             1/1   Running   0   60s   192.168.10.13   k8s-node3</pre><p>All four critical static pods are running on the restored control plane node.</p><p><strong>If using HAProxy (external LB):</strong></p><p>Remember that HAProxy on admin-lb is configured independently of Kubespray. If your HAProxy backend already lists all three control plane IPs in /etc/haproxy/haproxy.cfg, no changes are needed — HAProxy will automatically start routing traffic to the new node once its health check passes:</p><pre>backend k8s-api<br>    option  httpchk GET /healthz<br>    http-check expect status 200<br>    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100<br>    server k8s-node1 192.168.10.11:6443 check check-ssl verify none<br>    server k8s-node2 192.168.10.12:6443 check check-ssl verify none<br>    server k8s-node3 192.168.10.13:6443 check check-ssl verify none</pre><p>You can verify HAProxy backend status at http://192.168.10.10:9000/haproxy_stats — all three backends should show as UP (green).</p><h3>Full Cluster Reset (reset.yml)</h3><p>When you need to tear down the entire Kubernetes cluster — for rebuilding from scratch, testing a fresh deployment, or decommissioning the infrastructure — Kubespray provides reset.yml. 
This playbook completely reverses everything that cluster.yml did: it stops all Kubernetes services, removes all binaries, cleans up configuration files, and leaves the nodes in a pre-Kubernetes state.</p><p><strong>What </strong><strong>reset.yml does on every node:</strong></p><ol><li><strong>Drains and deletes</strong> all nodes from the Kubernetes API</li><li><strong>Stops kubelet</strong> and removes the systemd unit</li><li><strong>Stops and removes etcd</strong> data and binaries (on etcd members)</li><li><strong>Runs </strong><strong>kubeadm reset</strong> — removes certificates, static pod manifests, and kubeconfig files</li><li><strong>Removes CNI configuration</strong> — cleans up /etc/cni/net.d/ and CNI binaries</li><li><strong>Cleans up iptables rules</strong> — removes all Kubernetes-related iptables/ipvs rules</li><li><strong>Removes container runtime artifacts</strong> — stops all running containers, removes containerd state</li><li><strong>Deletes configuration directories</strong> — /etc/kubernetes/, /var/lib/kubelet/, /var/lib/etcd/, etc.</li></ol><p><strong>Execution:</strong></p><pre>cd /root/kubespray</pre><pre>ansible-playbook reset.yml \<br>  --become \<br>  -i inventory/mycluster/inventory.ini \<br>  -e skip_confirmation=true</pre><blockquote><strong><em>Warning:</em></strong><em> This is a destructive, irreversible operation. All cluster data, including etcd state (and therefore all Kubernetes objects — deployments, services, secrets, configmaps, PVs, etc.) will be permanently deleted. Ensure you have backups of any critical data before running this playbook.</em></blockquote><p>After reset.yml completes, the nodes are clean and ready for a fresh cluster.yml deployment. 
This makes reset.yml particularly useful in development and testing workflows where you frequently need to iterate on cluster configurations:</p><pre># Tear down<br>ansible-playbook reset.yml --become -i inventory/mycluster/inventory.ini -e skip_confirmation=true</pre><pre># Rebuild with new settings<br>ansible-playbook cluster.yml --become -i inventory/mycluster/inventory.ini -e kube_version=&quot;1.32.9&quot;</pre><p><strong>Verification after reset:</strong></p><p>SSH into any node and confirm that Kubernetes components are gone:</p><pre>ssh k8s-node1</pre><pre># kubelet should not be running<br>systemctl status kubelet<br># Unit kubelet.service could not be found.</pre><pre># No Kubernetes directories<br>ls /etc/kubernetes/<br># ls: cannot access &#39;/etc/kubernetes/&#39;: No such file or directory</pre><pre># No etcd data<br>ls /var/lib/etcd/<br># ls: cannot access &#39;/var/lib/etcd/&#39;: No such file or directory</pre><pre># kubectl should fail (no kubeconfig)<br>kubectl get nodes<br># The connection to the server localhost:8080 was refused</pre><p>The node is now a blank slate, ready for reprovisioning.</p><h3><strong><em>Key takeaways</em></strong></h3><ul><li>Use scale.yml for workers, cluster.yml for control planes.</li><li>Always add new control plane nodes at the <strong>end</strong> of the [kube_control_plane] group.</li><li>The first node in [kube_control_plane] cannot be removed via remove-node.yml.</li><li>For unreachable nodes, use reset_nodes=false + allow_ungraceful_removal=true.</li><li>PodDisruptionBudgets can block node drains — check PDBs before removing nodes.</li><li>Keep your inventory.ini in sync with the actual cluster state after every operation.</li><li>Minimize the time window where etcd has fewer than 3 members to maintain fault tolerance.</li></ul><h3>8. 
Monitoring Setup</h3><p>With a fully operational HA Kubernetes cluster — three control plane nodes, two workers, and an external HAProxy load balancer — the next critical step is observability. Without monitoring, you are flying blind: you cannot detect etcd quorum degradation, API server latency spikes, or node resource exhaustion until they become outages.</p><p>In this section, we will build a production-grade monitoring stack on top of our lab cluster. The setup consists of three layers:</p><ol><li><strong>Persistent storage</strong> — An NFS-based dynamic provisioner so that Prometheus and Grafana retain their data across pod restarts.</li><li><strong>kube-prometheus-stack</strong> — A Helm-based deployment that bundles Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics into a single, cohesive package.</li><li><strong>etcd metrics</strong> — Dedicated scrape configuration so that Prometheus can collect metrics directly from the etcd cluster running on the control plane nodes.</li></ol><p>By the end of this section, you will have Prometheus scraping the Kubernetes API server, kubelet, node-exporter, HAProxy, and etcd, with Grafana dashboards providing real-time visibility into every layer of the cluster.</p><h3>NFS Subdir External Provisioner for Persistent Storage</h3><h3>Why We Need a StorageClass</h3><p>Both Prometheus and Grafana require persistent volumes. Prometheus stores its time-series database (TSDB) on disk, and Grafana persists dashboards, data sources, and user sessions. Without a StorageClass that supports dynamic provisioning, every PersistentVolumeClaim (PVC) would require a cluster administrator to manually create a matching PersistentVolume (PV)—impractical for a monitoring stack that may create multiple PVCs during installation.</p><p>In our lab environment, the admin-lb node (192.168.10.10) already runs an NFS server exporting /srv/nfs/share. This was configured during the initial bootstrap via the admin-lb.sh script. 
We will leverage this NFS export as the backing storage for all dynamic PVCs.</p><h3>Installing the NFS Subdir External Provisioner</h3><p>The <a href="https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner">nfs-subdir-external-provisioner</a> is a Kubernetes controller that watches for new PVCs and automatically creates subdirectories on the NFS share as PVs. Each PVC gets its own directory named ${namespace}-${pvcName}-${pvName}, making it easy to identify which data belongs to which workload.</p><p>First, add the Helm repository and update:</p><pre>helm repo add nfs-subdir-external-provisioner \<br>  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/<br>helm repo update</pre><p>Then install the provisioner, pointing it at our NFS server:</p><pre>helm install nfs-provisioner \<br>  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \<br>  --set nfs.server=192.168.10.10 \<br>  --set nfs.path=/srv/nfs/share \<br>  --set storageClass.defaultClass=true</pre><h3>Verifying the Provisioner</h3><p>After installation, confirm the StorageClass is created and set as the default:</p><pre>kubectl get storageclass</pre><p>Expected output:</p><pre>NAME                   PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE<br>nfs-client (default)   cluster.local/nfs-provisioner-nfs-subdir-...    Delete          Immediate           true                   30s</pre><p>The (default) annotation confirms that any PVC without an explicit storage class will be fulfilled by this provisioner. 
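</p><p>To confirm that dynamic provisioning actually works, one option is to create a throwaway PVC that names no storage class. This is a minimal sketch; the claim name nfs-test-claim is hypothetical and the 1Mi size is arbitrary:</p><pre>cat &lt;&lt;&#39;EOF&#39; | kubectl apply -f -<br>apiVersion: v1<br>kind: PersistentVolumeClaim<br>metadata:<br>  name: nfs-test-claim   # hypothetical name, used only for this check<br>spec:<br>  accessModes: [&quot;ReadWriteOnce&quot;]<br>  resources:<br>    requests:<br>      storage: 1Mi<br>EOF</pre><pre># STATUS should become Bound if the default StorageClass is working<br>kubectl get pvc nfs-test-claim</pre><pre># Clean up the test claim<br>kubectl delete pvc nfs-test-claim</pre><p>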
You can also verify the provisioner pod is running:</p><pre>kubectl get pods -l app=nfs-subdir-external-provisioner</pre><pre>NAME                                              READY   STATUS    RESTARTS   AGE<br>nfs-provisioner-nfs-subdir-external-provisioner-xxx   1/1     Running   0          45s</pre><p>At this point, any component that creates a PVC — including Prometheus and Grafana — will automatically get a dynamically provisioned NFS-backed persistent volume.</p><h3>Installing kube-prometheus-stack with Helm</h3><p>The <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack">kube-prometheus-stack</a> Helm chart is the de facto standard for deploying a complete monitoring solution on Kubernetes. A single helm install gives you:</p><ul><li><strong>Prometheus</strong> — Time-series database and scraping engine</li><li><strong>Grafana</strong> — Visualization and dashboarding platform</li><li><strong>Alertmanager</strong> — Alert routing and deduplication</li><li><strong>node-exporter</strong> — Host-level metrics (CPU, memory, disk, network) from every node</li><li><strong>kube-state-metrics</strong> — Kubernetes object-level metrics (deployments, pods, nodes, PVCs)</li><li><strong>Pre-built Grafana dashboards</strong> — Dozens of dashboards for Kubernetes, node, and Prometheus self-monitoring</li><li><strong>Pre-built PrometheusRules</strong> — Alert rules for common failure conditions</li></ul><h3>Adding the Helm Repository</h3><pre>helm repo add prometheus-community \<br>  https://prometheus-community.github.io/helm-charts<br>helm repo update</pre><h3>Custom Values (Prometheus, Grafana, Node Exporter)</h3><p>Rather than installing with default settings, we create a custom values file to tailor the stack to our lab environment. 
The key customizations are:</p><ol><li><strong>NodePort services</strong> for Prometheus, Grafana, and Alertmanager, so we can access them from the host machine without an Ingress controller.</li><li><strong>Additional scrape configurations</strong> to collect metrics from HAProxy and etcd, which are not Kubernetes-native workloads.</li><li><strong>Grafana admin password</strong>, set explicitly for reproducibility.</li><li><strong>Node exporter tolerations</strong> to ensure node-exporter runs on control plane nodes as well.</li></ol><p>Create the values file:</p><pre>cat &lt;&lt;&#39;EOF&#39; &gt; ~/kubespray/custom-values.yaml<br># ============================================================<br># Prometheus Configuration<br># ============================================================<br>prometheus:<br>  prometheusSpec:<br>    # Pick up all ServiceMonitors and PodMonitors,<br>    # not only those labeled by this Helm release<br>    serviceMonitorSelectorNilUsesHelmValues: false<br>    podMonitorSelectorNilUsesHelmValues: false<br><br>    # Persistent storage for TSDB<br>    storageSpec:<br>      volumeClaimTemplate:<br>        spec:<br>          accessModes: [&quot;ReadWriteOnce&quot;]<br>          resources:<br>            requests:<br>              storage: 10Gi<br><br>    # Additional scrape configs for non-k8s targets<br>    additionalScrapeConfigs:<br>      # --- HAProxy Metrics ---<br>      # HAProxy exposes a Prometheus-compatible metrics endpoint<br>      # on port 8405 (configured in haproxy.cfg via the prometheus-exporter service)<br>      - job_name: &quot;haproxy&quot;<br>        static_configs:<br>          - targets:<br>              - &quot;192.168.10.10:8405&quot;<br>        metrics_path: &quot;/metrics&quot;<br><br>      # --- etcd Metrics ---<br>      # etcd exposes metrics on port 2381 (configured via etcd_listen_metrics_urls)<br>      # We scrape all 3 etcd members individually<br>      - job_name: &quot;etcd&quot;<br>        static_configs:<br>          - targets:<br>              - &quot;192.168.10.11:2381&quot;<br>              - &quot;192.168.10.12:2381&quot;<br>              - &quot;192.168.10.13:2381&quot;<br>        metrics_path: &quot;/metrics&quot;<br><br>  # Expose Prometheus via NodePort so we can access it from outside the cluster<br>  # Access URL: http://&lt;any-node-ip&gt;:30001<br>  service:<br>    type: NodePort<br>    nodePort: 30001<br><br># ============================================================<br># Grafana Configuration<br># ============================================================<br>grafana:<br>  adminPassword: &quot;prom-operator&quot;<br><br>  service:<br>    type: NodePort<br>    nodePort: 30002<br><br>  # Persistent storage for Grafana dashboards and settings<br>  persistence:<br>    enabled: true<br>    size: 5Gi<br><br>  # Sidecar configuration: automatically imports ConfigMaps<br>  # with label grafana_dashboard=&quot;1&quot; as dashboards<br>  sidecar:<br>    dashboards:<br>      enabled: true<br>      label: grafana_dashboard<br>      labelValue: &quot;1&quot;<br>      searchNamespace: ALL<br><br># ============================================================<br># Node Exporter Configuration<br># ============================================================<br># Tolerations are read by the prometheus-node-exporter subchart,<br># so they go under that key rather than under nodeExporter<br>prometheus-node-exporter:<br>  # Ensure node-exporter runs on ALL nodes, including control planes<br>  tolerations:<br>    - effect: NoSchedule<br>      operator: Exists<br><br># ============================================================<br># Alertmanager Configuration<br># ============================================================<br>alertmanager:<br>  service:<br>    type: NodePort<br>    nodePort: 30003<br>EOF</pre><p>Let us walk through each section in detail.</p><h4>Prometheus Additional Scrape Configs</h4><p>The additionalScrapeConfigs field is the mechanism for adding scrape targets that are not discovered via Kubernetes service discovery. 
In our setup, two external targets need explicit configuration:</p><p><strong>HAProxy metrics (192.168.10.10:8405)</strong></p><p>Our HAProxy configuration on the admin-lb node includes a Prometheus metrics endpoint. The relevant section in /etc/haproxy/haproxy.cfg is:</p><pre>frontend stats<br>    bind *:8405<br>    http-request use-service prometheus-exporter if { path /metrics }</pre><p>This exposes standard HAProxy metrics such as haproxy_frontend_current_sessions, haproxy_backend_up, haproxy_server_bytes_in_total, etc. These metrics are critical for monitoring the external load balancer that fronts our Kubernetes API servers.</p><p><strong>etcd metrics (192.168.10.11-13:2381)</strong></p><p>In this Kubespray deployment, etcd does not serve metrics on port 2381 by default. We will enable this in Section 8.3. Once enabled, each etcd member exposes metrics on port 2381, including etcd_server_has_leader, etcd_disk_wal_fsync_duration_seconds, etcd_network_peer_round_trip_time_seconds, and etcd_mvcc_db_total_size_in_bytes.</p><h4>Service Types and NodePorts</h4><p>In a lab environment without an Ingress controller or cloud load balancer, NodePort is the simplest way to expose services externally. The fixed ports make access predictable:</p><pre>Service        NodePort   Access URL<br>Prometheus     30001      http://192.168.10.14:30001<br>Grafana        30002      http://192.168.10.14:30002<br>Alertmanager   30003      http://192.168.10.14:30003</pre><p>You can use any node IP in the cluster (worker or control plane). The NodePort is accessible on every node regardless of where the pod is actually scheduled.</p><h4>Grafana Sidecar Dashboard Loader</h4><p>The sidecar.dashboards configuration tells Grafana to watch for ConfigMaps across all namespaces (searchNamespace: ALL) that carry the label grafana_dashboard: &quot;1&quot;. When such a ConfigMap is created, the Grafana sidecar container automatically loads the JSON dashboard it contains, with no manual import required. 
We will use this mechanism in Section 8.2.2 to add custom dashboards.</p><h3>Installing the Stack</h3><p>With the values file ready, install the chart:</p><pre>helm install kube-prometheus-stack \<br>  prometheus-community/kube-prometheus-stack \<br>  --namespace monitoring \<br>  --create-namespace \<br>  --version 80.13.3 \<br>  -f ~/kubespray/custom-values.yaml</pre><p>We pin the chart version to 80.13.3 for reproducibility. The --create-namespace flag creates the monitoring namespace if it does not already exist.</p><p>The installation takes approximately 2–3 minutes. Monitor the progress:</p><pre>kubectl -n monitoring get pods -w</pre><p>Wait until all pods reach Running or Completed status:</p><pre>NAME                                                        READY   STATUS    RESTARTS   AGE<br>alertmanager-kube-prometheus-stack-alertmanager-0            2/2     Running   0          2m<br>kube-prometheus-stack-grafana-xxxxxxxxx-xxxxx                3/3     Running   0          2m<br>kube-prometheus-stack-kube-state-metrics-xxxxxxxxx-xxxxx     1/1     Running   0          2m<br>kube-prometheus-stack-operator-xxxxxxxxx-xxxxx               1/1     Running   0          2m<br>kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m<br>kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m<br>kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m<br>kube-prometheus-stack-prometheus-node-exporter-xxxxx         1/1     Running   0          2m<br>prometheus-kube-prometheus-stack-prometheus-0                2/2     Running   0          2m</pre><p>Note that prometheus-node-exporter pods appear on every node (including control planes, thanks to our toleration). 
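</p><p>To confirm that coverage without counting pods by eye, a small shell function can compare the number of nodes with the number of Running node-exporter pods. This is only a sketch: the function name is hypothetical, and the label selector assumes the chart applies its default app.kubernetes.io/name=prometheus-node-exporter label.</p><pre>
```shell
# Hypothetical helper: succeeds only when every node has a Running node-exporter pod.
check_node_exporter_coverage() {
  # Count nodes known to the cluster
  nodes=$(kubectl get nodes --no-headers | wc -l)

  # Count Running node-exporter pods (label selector is an assumption)
  pods=$(kubectl -n monitoring get pods \
    -l app.kubernetes.io/name=prometheus-node-exporter \
    --field-selector=status.phase=Running --no-headers | wc -l)

  # Arithmetic expansion strips the padding some wc implementations add
  nodes=$((nodes))
  pods=$((pods))

  echo "nodes=${nodes} node-exporter-pods=${pods}"
  [ "$nodes" -eq "$pods" ]
}
```
</pre><p>Calling check_node_exporter_coverage prints both counts and returns a non-zero exit status when they differ.</p><p>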
Verify services:</p><pre>kubectl -n monitoring get svc</pre><p>Confirm the NodePort assignments match our values file (30001 for Prometheus, 30002 for Grafana, 30003 for Alertmanager).</p><h3>Accessing the Monitoring UIs</h3><p><strong>Prometheus</strong> — Open http://192.168.10.14:30001 in your browser.</p><p>Navigate to <strong>Status → Targets</strong> to verify all scrape targets are healthy. You should see targets for:</p><ul><li>kubernetes-apiservers — the Kubernetes API server metrics endpoint</li><li>kubernetes-nodes — kubelet metrics from each node</li><li>kubernetes-nodes-cadvisor — container-level resource metrics</li><li>node-exporter — host-level metrics from each node</li><li>kube-state-metrics — Kubernetes object metrics</li><li>haproxy — HAProxy load balancer metrics (from additionalScrapeConfigs)</li><li>etcd — will appear as DOWN initially (we enable this in Section 8.3)</li></ul><p><strong>Grafana</strong> — Open http://192.168.10.14:30002 in your browser.</p><p>Login with:</p><ul><li><strong>Username:</strong> admin</li><li><strong>Password:</strong> prom-operator</li></ul><p>The kube-prometheus-stack chart automatically provisions dozens of dashboards. Browse <strong>Dashboards → Browse</strong> to explore them. Key pre-installed dashboards include:</p><ul><li><em>Kubernetes / Compute Resources / Cluster</em> — cluster-wide CPU and memory utilization</li><li><em>Kubernetes / Compute Resources / Node (Pods)</em> — per-node resource breakdown</li><li><em>Node Exporter / Nodes</em> — detailed host metrics</li><li><em>Kubernetes / Networking / Cluster</em> — network throughput and errors</li><li><em>Prometheus / Overview</em> — Prometheus self-monitoring (scrape duration, target health, TSDB size)</li></ul><h3>Adding Grafana Dashboards via ConfigMap</h3><p>While the kube-prometheus-stack ships with many useful dashboards, there are several community dashboards that provide deeper or more specialized visibility. 
We will add three additional dashboards:</p><pre>Dashboard ID   Name                    Purpose<br>12693          Kubernetes Monitoring   Comprehensive cluster overview with resource usage trends<br>15661          Node Exporter Full      Extremely detailed host metrics (disk I/O, network sockets, CPU frequency, thermal zones)<br>Custom         k8s-system-api-server   Kubernetes API Server performance (request latency, error rates, inflight requests)</pre><p>The Grafana sidecar we configured earlier watches for ConfigMaps with the label grafana_dashboard: &quot;1&quot;. To add a dashboard, we simply create a ConfigMap containing the dashboard JSON and apply the label.</p><h4>Downloading Community Dashboards</h4><p>Grafana community dashboards can be downloaded from <a href="https://grafana.com/grafana/dashboards/">grafana.com/grafana/dashboards</a> as JSON files. The download URL pattern is:</p><pre>https://grafana.com/api/dashboards/{DASHBOARD_ID}/revisions/latest/download</pre><p>Download the two community dashboards:</p><pre># Dashboard 12693 - Kubernetes Monitoring<br>curl -fsSL \<br>  https://grafana.com/api/dashboards/12693/revisions/latest/download \<br>  -o /tmp/dashboard-12693.json</pre><pre># Dashboard 15661 - Node Exporter Full<br>curl -fsSL \<br>  https://grafana.com/api/dashboards/15661/revisions/latest/download \<br>  -o /tmp/dashboard-15661.json</pre><h4>Creating ConfigMaps from Dashboard JSON</h4><p>For each dashboard, create a ConfigMap in the monitoring namespace with the appropriate label:</p><pre># Dashboard 12693<br>kubectl create configmap grafana-dashboard-12693 \<br>  --from-file=k8s-monitoring.json=/tmp/dashboard-12693.json \<br>  -n monitoring</pre><pre>kubectl label configmap grafana-dashboard-12693 \<br>  grafana_dashboard=&quot;1&quot; \<br>  -n monitoring</pre><pre># Dashboard 15661<br>kubectl create configmap grafana-dashboard-15661 \<br>  --from-file=node-exporter-full.json=/tmp/dashboard-15661.json \<br>  -n 
monitoring</pre><pre>kubectl label configmap grafana-dashboard-15661 \<br>  grafana_dashboard=&quot;1&quot; \<br>  -n monitoring</pre><h4>Creating the API Server Dashboard via YAML</h4><p>For the custom API Server dashboard, we define it directly as a YAML manifest. This dashboard focuses on Kubernetes API Server performance — a critical metric for HA clusters where API server latency directly impacts cluster operations.</p><pre>cat &lt;&lt;&#39;EOF&#39; &gt; ~/kubespray/grafana-dashboard-apiserver.yaml<br>apiVersion: v1<br>kind: ConfigMap<br>metadata:<br>  name: grafana-dashboard-apiserver<br>  namespace: monitoring<br>  labels:<br>    grafana_dashboard: &quot;1&quot;<br>data:<br>  k8s-system-api-server.json: |-<br>    {<br>      &quot;annotations&quot;: {<br>        &quot;list&quot;: []<br>      },<br>      &quot;description&quot;: &quot;Kubernetes API Server Monitoring&quot;,<br>      &quot;editable&quot;: true,<br>      &quot;gnetId&quot;: null,<br>      &quot;graphTooltip&quot;: 1,<br>      &quot;links&quot;: [],<br>      &quot;panels&quot;: [<br>        {<br>          &quot;title&quot;: &quot;API Server Request Rate&quot;,<br>          &quot;type&quot;: &quot;timeseries&quot;,<br>          &quot;datasource&quot;: &quot;Prometheus&quot;,<br>          &quot;targets&quot;: [<br>            {<br>              &quot;expr&quot;: &quot;sum(rate(apiserver_request_total[5m])) by (verb)&quot;,<br>              &quot;legendFormat&quot;: &quot;{{ verb }}&quot;<br>            }<br>          ],<br>          &quot;gridPos&quot;: { &quot;h&quot;: 8, &quot;w&quot;: 12, &quot;x&quot;: 0, &quot;y&quot;: 0 }<br>        },<br>        {<br>          &quot;title&quot;: &quot;API Server Request Latency (99th percentile)&quot;,<br>          &quot;type&quot;: &quot;timeseries&quot;,<br>          &quot;datasource&quot;: &quot;Prometheus&quot;,<br>          &quot;targets&quot;: [<br>            {<br>              &quot;expr&quot;: &quot;histogram_quantile(0.99, 
sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))&quot;,<br>              &quot;legendFormat&quot;: &quot;{{ verb }}&quot;<br>            }<br>          ],<br>          &quot;gridPos&quot;: { &quot;h&quot;: 8, &quot;w&quot;: 12, &quot;x&quot;: 12, &quot;y&quot;: 0 }<br>        },<br>        {<br>          &quot;title&quot;: &quot;API Server Error Rate&quot;,<br>          &quot;type&quot;: &quot;timeseries&quot;,<br>          &quot;datasource&quot;: &quot;Prometheus&quot;,<br>          &quot;targets&quot;: [<br>            {<br>              &quot;expr&quot;: &quot;sum(rate(apiserver_request_total{code=~\&quot;5..\&quot;}[5m])) by (resource)&quot;,<br>              &quot;legendFormat&quot;: &quot;{{ resource }}&quot;<br>            }<br>          ],<br>          &quot;gridPos&quot;: { &quot;h&quot;: 8, &quot;w&quot;: 12, &quot;x&quot;: 0, &quot;y&quot;: 8 }<br>        },<br>        {<br>          &quot;title&quot;: &quot;API Server Inflight Requests&quot;,<br>          &quot;type&quot;: &quot;timeseries&quot;,<br>          &quot;datasource&quot;: &quot;Prometheus&quot;,<br>          &quot;targets&quot;: [<br>            {<br>              &quot;expr&quot;: &quot;sum(apiserver_current_inflight_requests) by (request_kind)&quot;,<br>              &quot;legendFormat&quot;: &quot;{{ request_kind }}&quot;<br>            }<br>          ],<br>          &quot;gridPos&quot;: { &quot;h&quot;: 8, &quot;w&quot;: 12, &quot;x&quot;: 12, &quot;y&quot;: 8 }<br>        }<br>      ],<br>      &quot;schemaVersion&quot;: 36,<br>      &quot;style&quot;: &quot;dark&quot;,<br>      &quot;tags&quot;: [&quot;kubernetes&quot;, &quot;apiserver&quot;],<br>      &quot;templating&quot;: { &quot;list&quot;: [] },<br>      &quot;time&quot;: { &quot;from&quot;: &quot;now-1h&quot;, &quot;to&quot;: &quot;now&quot; },<br>      &quot;title&quot;: &quot;Kubernetes / System / API Server&quot;,<br>      &quot;uid&quot;: &quot;k8s-system-api-server&quot;<br>    }<br>EOF</pre><p>Apply 
the manifest:</p><pre>kubectl apply -f ~/kubespray/grafana-dashboard-apiserver.yaml</pre><h4>Verifying Dashboard Import</h4><p>The Grafana sidecar detects new ConfigMaps within a few seconds. You do not need to restart Grafana or any pods. Verify by listing the ConfigMaps with the dashboard label:</p><pre>kubectl -n monitoring get configmap -l grafana_dashboard=&quot;1&quot;</pre><p>Expected output (the chart’s own dashboard ConfigMaps carry the same label and are omitted here):</p><pre>NAME                             DATA   AGE<br>grafana-dashboard-12693          1      2m<br>grafana-dashboard-15661          1      2m<br>grafana-dashboard-apiserver      1      30s</pre><p>Now open Grafana (http://192.168.10.14:30002) and navigate to <strong>Dashboards → Browse</strong>. You should see the three new dashboards alongside the pre-installed ones. Open each to confirm they are rendering correctly with live data from Prometheus.</p><p>The API Server dashboard is particularly valuable for HA monitoring. Its queries aggregate across all API servers by verb or resource; adding instance to the by (...) clause breaks the series down per control plane node, which lets you check whether request latency is consistent across the three nodes. Asymmetric latency may indicate a problem with one specific node’s etcd connection or resource exhaustion.</p><h3>Enabling etcd Metrics Collection</h3><p>etcd is the most critical stateful component in a Kubernetes cluster. It stores the entire cluster state: every resource definition, every secret, every lease. If etcd fails, the cluster becomes read-only at best and completely non-functional at worst. Monitoring etcd health is therefore non-negotiable for any production cluster.</p><p>However, <strong>etcd does not expose a metrics endpoint by default in Kubespray deployments</strong>. We must explicitly enable it.</p><h3>Configuring etcd_listen_metrics_urls</h3><p>Kubespray provides dedicated variables to control etcd metrics exposure. 
We need to set two variables in the etcd group variables file.</p><p>Edit the etcd configuration:</p><pre>vi ~/kubespray/inventory/mycluster/group_vars/etcd.yml</pre><p>Add (or modify) the following variables:</p><pre># Enable etcd metrics endpoint<br>etcd_metrics: true</pre><pre># Bind the metrics endpoint to all interfaces on port 2381<br># This allows Prometheus (running on worker nodes) to scrape etcd metrics<br># from the control plane nodes over the network<br>etcd_listen_metrics_urls: &quot;http://0.0.0.0:2381&quot;</pre><p>Let us examine what each variable does:</p><p><strong>etcd_metrics: true</strong></p><p>This variable is Kubespray-specific. When set to true, Kubespray passes the --listen-metrics-urls flag to each etcd member during configuration. Without this flag, etcd simply does not serve any HTTP endpoint for metrics.</p><p><strong>etcd_listen_metrics_urls: &quot;http://0.0.0.0:2381&quot;</strong></p><p>This sets the address and port on which each etcd member will listen for metrics scrape requests. The values are deliberate:</p><ul><li>http:// (not https://) — Metrics are served over plain HTTP. This avoids the complexity of configuring Prometheus with etcd client TLS certificates. Since metrics do not contain sensitive data (only operational telemetry), HTTP is acceptable in a lab environment. In production, you might prefer https:// with mutual TLS.</li><li>0.0.0.0 — Bind to all network interfaces. This is necessary because Prometheus runs on worker nodes, which communicate with control plane nodes over the pod network or host network. Binding to 127.0.0.1 would restrict access to local-only scraping.</li><li>2381 — The conventional port for etcd metrics. 
The standard etcd client port is 2379, and the peer port is 2380, so 2381 follows the natural sequence and avoids conflicts.</li></ul><h4>Applying the Configuration with Kubespray</h4><p>Now apply the configuration change using Kubespray’s cluster.yml playbook with the etcd tag, limited to the etcd node group:</p><pre>cd ~/kubespray</pre><pre>ansible-playbook cluster.yml \<br>  --tags &quot;etcd&quot; \<br>  --limit etcd \<br>  -e kube_version=&quot;1.32.9&quot;</pre><p>This command takes approximately <strong>2 minutes</strong> to complete. Here is what happens during the playbook run:</p><ol><li><strong>Kubespray detects the configuration change</strong>: The etcd_listen_metrics_urls variable modifies the etcd systemd unit file (or static pod manifest, depending on deployment type).</li><li><strong>etcd members are restarted one at a time</strong>: Kubespray performs a rolling restart of the etcd cluster. Each member is stopped, its configuration is updated, and it is restarted before moving to the next member. This preserves quorum throughout the process.</li><li><strong>Automatic backups are created</strong>: Before restarting each etcd member, Kubespray takes a snapshot backup of the etcd data. These backups are stored in /var/backups/ on each control plane node:</li></ol><pre>ssh k8s-node1 ls -la /var/backups/etcd-*<br>-rw------- 1 root root 3145728 Jul 15 10:30 /var/backups/etcd-snapshot-20250715103000.db</pre><p>This automatic backup behavior is a valuable safety net: if the etcd restart somehow corrupts data, you have an immediate point-in-time recovery option.</p><ol start="4"><li><strong>etcd rejoins the cluster</strong>: After restart, each member reconnects to the cluster and synchronizes its data with the other members. 
You can verify cluster health:</li></ol><pre>ssh k8s-node1 etcdctl.sh member list -w table<br><br>+------------------+---------+-----------+----------------------------+----------------------------+------------+<br>|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |<br>+------------------+---------+-----------+----------------------------+----------------------------+------------+<br>| 8e9e05c52164694d | started | k8s-node1 | https://192.168.10.11:2380 | https://192.168.10.11:2379 |      false |<br>| 91bc3c398fb3c146 | started | k8s-node2 | https://192.168.10.12:2380 | https://192.168.10.12:2379 |      false |<br>| fd422379fda50e48 | started | k8s-node3 | https://192.168.10.13:2380 | https://192.168.10.13:2379 |      false |<br>+------------------+---------+-----------+----------------------------+----------------------------+------------+</pre><p>All three members should show started status.</p><h4>Verifying the Metrics Endpoint</h4><p>After the playbook completes, confirm that each etcd member is now serving metrics:</p><pre># Test from the admin-lb node (or any node that can reach the control plane network)<br>curl -s http://192.168.10.11:2381/metrics | head -20</pre><p>Expected output (truncated):</p><pre># HELP etcd_server_has_leader Whether or not a leader exists. 
1 is existence, 0 is not.<br># TYPE etcd_server_has_leader gauge<br>etcd_server_has_leader 1<br># HELP etcd_server_leader_changes_seen_total The number of leader changes seen.<br># TYPE etcd_server_leader_changes_seen_total counter<br>etcd_server_leader_changes_seen_total 2<br># HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by WAL.<br># TYPE etcd_disk_wal_fsync_duration_seconds histogram<br>etcd_disk_wal_fsync_duration_seconds_bucket{le=&quot;0.001&quot;} 542<br>etcd_disk_wal_fsync_duration_seconds_bucket{le=&quot;0.002&quot;} 1203<br>...</pre><p>Repeat for all three control plane nodes to confirm consistency:</p><pre>for ip in 192.168.10.11 192.168.10.12 192.168.10.13; do<br>  echo &quot;=== $ip ===&quot;<br>  curl -s http://$ip:2381/metrics | grep etcd_server_has_leader<br>done</pre><p>All three should report etcd_server_has_leader 1.</p><h3>Adding etcd Scrape Config to Prometheus</h3><p>We already configured the etcd scrape job in Section 8.2.1 as part of additionalScrapeConfigs in our custom values file. Here is the relevant snippet again for reference:</p><pre>additionalScrapeConfigs:<br>  - job_name: &quot;etcd&quot;<br>    static_configs:<br>      - targets:<br>          - &quot;192.168.10.11:2381&quot;<br>          - &quot;192.168.10.12:2381&quot;<br>          - &quot;192.168.10.13:2381&quot;<br>    metrics_path: &quot;/metrics&quot;</pre><p>Since this was included in the initial Helm install, Prometheus has been attempting to scrape these endpoints from the beginning. Before enabling etcd_listen_metrics_urls, the etcd targets appeared as DOWN in Prometheus. Now that the metrics endpoint is active, Prometheus should automatically pick them up.</p><h4>Verifying etcd Targets in Prometheus</h4><p>Open the Prometheus UI at http://192.168.10.14:30001 and navigate to <strong>Status → Targets</strong>. Look for the etcd job. 
You should see three targets, all with state UP:</p><pre>Endpoint                     State    Labels                  Last Scrape    Scrape Duration<br>http://192.168.10.11:2381    UP       instance=&quot;192.168...&quot;   12s ago        23.4ms<br>http://192.168.10.12:2381    UP       instance=&quot;192.168...&quot;   14s ago        21.1ms<br>http://192.168.10.13:2381    UP       instance=&quot;192.168...&quot;   11s ago        22.8ms</pre><p>If any target shows DOWN, common troubleshooting steps include:</p><p><strong>Listening port</strong>: Verify that etcd is listening on port 2381 on the control plane nodes. If it is listening but remote scrapes still fail, check the host firewall:</p><pre>ssh k8s-node1 ss -tlnp | grep 2381</pre><p>Expected: LISTEN 0 4096 *:2381 *:* users:((&quot;etcd&quot;,..))</p><p><strong>Network connectivity</strong>: Test from a worker node (where Prometheus is running):</p><pre>ssh k8s-node4 curl -s http://192.168.10.11:2381/health</pre><p>Expected: {&quot;health&quot;:&quot;true&quot;,&quot;reason&quot;:&quot;&quot;}</p><p><strong>etcd configuration</strong>: Verify the etcd process has the metrics flag:</p><pre>ssh k8s-node1 ps aux | grep etcd | grep listen-metrics</pre><p>You should see --listen-metrics-urls=http://0.0.0.0:2381 in the process arguments.</p><h4>Sample PromQL Queries</h4><p>You can run these queries in the Prometheus UI (http://192.168.10.14:30001/graph) to immediately visualize etcd health:</p><p><strong>etcd cluster leader status:</strong></p><pre>etcd_server_has_leader</pre><p><strong>WAL fsync latency (99th percentile, per member):</strong></p><pre>histogram_quantile(0.99,<br>  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])<br>)</pre><p><strong>Database size across all members:</strong></p><pre>etcd_mvcc_db_total_size_in_bytes</pre><p><strong>Peer network round-trip time (99th percentile):</strong></p><pre>histogram_quantile(0.99,<br>  
rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])<br>)</pre><p><strong>Rate of leader changes (should be 0 in a healthy cluster):</strong></p><pre>rate(etcd_server_leader_changes_seen_total[15m])</pre><h4>Creating an etcd Grafana Dashboard</h4><p>You can also import a community etcd dashboard into Grafana. Dashboard ID <strong>3070</strong> (etcd by etcd.io) is a popular choice. Download and create a ConfigMap following the same pattern as Section 8.2.2:</p><pre># Download the etcd dashboard<br>curl -fsSL \<br>  https://grafana.com/api/dashboards/3070/revisions/latest/download \<br>  -o /tmp/dashboard-etcd-3070.json</pre><pre># Create ConfigMap with the dashboard label<br>kubectl create configmap grafana-dashboard-etcd \<br>  --from-file=etcd.json=/tmp/dashboard-etcd-3070.json \<br>  -n monitoring</pre><pre>kubectl label configmap grafana-dashboard-etcd \<br>  grafana_dashboard=&quot;1&quot; \<br>  -n monitoring</pre><p>After a few seconds, the dashboard will appear in Grafana under <strong>Dashboards → Browse</strong>. 
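</p><p>Because the same download, create, and label steps repeat for every community dashboard, they could be wrapped in a small shell helper. The sketch below is an illustration only: dashboard_url and import_dashboard are hypothetical names, and the monitoring namespace and /tmp path simply mirror the commands used in this walkthrough.</p><pre>
```shell
# Hypothetical helpers (not part of the original walkthrough).

# Build the download URL for a grafana.com dashboard ID,
# following the URL pattern used by the curl commands in this post.
dashboard_url() {
  printf 'https://grafana.com/api/dashboards/%s/revisions/latest/download\n' "$1"
}

# Download a dashboard, wrap it in a ConfigMap, and label it for the sidecar.
import_dashboard() {
  id="$1"     # grafana.com dashboard ID, e.g. 3070
  name="$2"   # ConfigMap name, e.g. grafana-dashboard-etcd
  key="$3"    # file name inside the ConfigMap, e.g. etcd.json
  file="/tmp/dashboard-${id}.json"

  curl -fsSL "$(dashboard_url "$id")" -o "$file" || return 1
  kubectl create configmap "$name" --from-file="${key}=${file}" -n monitoring || return 1
  kubectl label configmap "$name" grafana_dashboard="1" -n monitoring
}
```
</pre><p>Under those assumptions, import_dashboard 3070 grafana-dashboard-etcd etcd.json would run the same three commands shown above for the etcd dashboard.</p><p>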
This dashboard provides real-time visibility into:</p><ul><li>etcd cluster membership and leader status</li><li>Raft proposal commit rates</li><li>Disk I/O performance (WAL fsync, backend commit)</li><li>gRPC request rates and latencies</li><li>Database size and compaction</li></ul><h4>Complete Target Verification</h4><p>At this point, your Prometheus instance should have the following target groups all showing UP:</p><pre>Job                        Targets   Status<br>─────────────────────────────────────────────<br>kubernetes-apiservers      3/3       UP<br>kubernetes-nodes           5/5       UP<br>kubernetes-nodes-cadvisor  5/5       UP<br>node-exporter              5/5       UP<br>kube-state-metrics         1/1       UP<br>haproxy                    1/1       UP<br>etcd                       3/3       UP</pre><p>The HAProxy stats page is also accessible at http://192.168.10.10:9000/haproxy_stats for a quick visual check of API server backend health, independent of Prometheus.</p><h3>Kubernetes Upgrade with Kubespray</h3><p>Upgrading Kubernetes is one of the most critical Day-2 operations in any production environment. Unlike initial cluster deployment, upgrades must be performed carefully — a single misconfiguration can lead to API downtime, workload disruption, or even data loss in etcd. Kubespray provides well-structured playbooks that automate the upgrade process while maintaining high availability, but understanding what happens under the hood is essential for operating confidently.</p><p>Each upgrade type introduces progressively more complexity. A patch upgrade touches only the Kubernetes binaries. A minor upgrade may involve API deprecations and behavioral changes. 
A major upgrade combined with a Kubespray version bump means new Ansible roles, new default variables, updated container runtime versions, and potentially new etcd releases — all changing simultaneously.</p><p>Before proceeding with any upgrade, let’s first address a prerequisite that many operators overlook: the CNI plugin.</p><h3>Pre-Upgrade: Flannel CNI Plugin Update</h3><p>Before upgrading Kubernetes itself, it is important to verify that the CNI (Container Network Interface) plugin is compatible with the target Kubernetes version. In our lab, we use Flannel as the CNI. Flannel is deployed as a DaemonSet, which means it runs on every node in the cluster. Unlike node-level components (kubelet, kube-proxy), <strong>you cannot upgrade Flannel on a per-node basis</strong> — a DaemonSet update rolls out to all nodes at once.</p><p>This is a critical consideration: if you upgrade Flannel after upgrading only half of your nodes, you risk running incompatible versions of the CNI across different nodes, which can cause intermittent pod networking failures.</p><p><strong>Best practice:</strong> Update the Flannel CNI plugin <em>before</em> starting the Kubernetes upgrade, while all nodes are still running the same Kubernetes version.</p><h3>Checking the Current Flannel Version</h3><pre># Check the current Flannel DaemonSet image<br>kubectl -n kube-system get daemonset kube-flannel -o jsonpath=&#39;{.spec.template.spec.containers[0].image}&#39;</pre><p>Output:</p><pre>docker.io/flannel/flannel:v0.26.7</pre><h3>Updating Flannel via Kubespray</h3><p>Kubespray manages the Flannel version through its role defaults. To update Flannel, you can either change the variable in your inventory or pass it as an extra variable. 
The relevant variable is flannel_image_tag in roles/network_plugin/flannel/defaults/main.yml.</p><p>However, the simplest approach is to let Kubespray handle it by running the cluster playbook with the network plugin tag:</p><pre>cd /root/kubespray</pre><pre># Check what Flannel version Kubespray v2.29.1 ships with<br>grep -r &quot;flannel_image_tag&quot; roles/network_plugin/flannel/defaults/main.yml</pre><p>Output:</p><pre>flannel_image_tag: &quot;v0.26.7&quot;</pre><p>If the current version is already the latest supported by your Kubespray release, no action is needed. If you need to update it manually:</p><pre>ansible-playbook cluster.yml \<br>  --tags &quot;network&quot; \<br>  --limit &quot;kube_node,kube_control_plane&quot; \<br>  -e kube_version=&quot;1.32.9&quot;</pre><blockquote><strong><em>Important:</em></strong><em> Because Flannel is a DaemonSet, the </em><em>--limit flag does not restrict which nodes receive the update. The DaemonSet controller will roll out the new image to all nodes regardless. The </em><em>--limit flag only controls which nodes Ansible connects to for running tasks, but the Kubernetes DaemonSet update propagates cluster-wide.</em></blockquote><p>After the update, verify:</p><pre># Verify Flannel pods are running the new version on all nodes<br>kubectl -n kube-system get pods -l app=flannel -o wide</pre><pre># Check that all Flannel pods are in Running state<br>kubectl -n kube-system get pods -l app=flannel -o jsonpath=&#39;{range .items[*]}{.metadata.name}{&quot;\t&quot;}{.status.phase}{&quot;\t&quot;}{.spec.nodeName}{&quot;\n&quot;}{end}&#39;</pre><h3>Kubespray Upgrade Strategies (Unsafe vs. Graceful)</h3><p>Kubespray provides two fundamentally different approaches to cluster upgrades. 
Understanding the difference is critical for choosing the right strategy based on your environment’s tolerance for risk and downtime.</p><h3>Strategy 1: Unsafe Upgrade (cluster.yml)</h3><p>The cluster.yml playbook is the same playbook used for initial cluster deployment. When used for upgrades, it applies changes to all nodes simultaneously without draining workloads first. This is called an &quot;unsafe&quot; upgrade because:</p><ul><li>Nodes are <strong>not cordoned</strong> before the upgrade</li><li>Pods are <strong>not drained</strong> — they continue running during the upgrade</li><li>If a kubelet restart kills running pods, they are simply restarted by their controllers</li><li>All nodes of the same type (control plane or worker) may be upgraded in parallel</li></ul><p>To use cluster.yml for upgrades, you must set upgrade_cluster_setup: true:</p><pre>ansible-playbook cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  -e upgrade_cluster_setup=true</pre><p><strong>When to use:</strong> Development environments, non-production clusters, or situations where speed matters more than zero-downtime guarantees.</p><h3>Strategy 2: Graceful Upgrade (upgrade-cluster.yml)</h3><p>The upgrade-cluster.yml playbook is purpose-built for production upgrades. 
It performs a rolling upgrade, processing one node at a time (or a configurable batch size), with proper workload migration:</p><ol><li><strong>Cordon</strong> the node (mark it as unschedulable)</li><li><strong>Drain</strong> the node (gracefully evict all pods, respecting PodDisruptionBudgets)</li><li><strong>Upgrade</strong> the node (update containerd, kubelet, kube-proxy, static pod manifests)</li><li><strong>Uncordon</strong> the node (mark it as schedulable again)</li><li><strong>Repeat</strong> for the next node</li></ol><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot;</pre><h3>Controlling the Rolling Upgrade Behavior</h3><p>Several variables control how the rolling upgrade proceeds:</p><h4>Serial Execution</h4><p>The serial parameter in the playbook controls how many nodes are upgraded simultaneously:</p><pre># Default: 20% of nodes at a time<br>serial: &quot;20%&quot;</pre><pre># Conservative: one node at a time<br>serial: 1</pre><pre># Aggressive: all nodes at once (effectively unsafe)<br>serial: &quot;100%&quot;</pre><p>For production environments, serial: 1 is the safest option. It ensures that at any point during the upgrade, at most one node is unavailable.</p><h4>Upgrade Confirmation Prompt</h4><p>If you want manual confirmation before each node upgrade:</p><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  -e upgrade_node_confirm=true</pre><p>This pauses the playbook before each node and waits for you to press Enter. 
This is useful for verifying cluster health between node upgrades.</p><h4>Timed Pause Between Nodes</h4><p>Alternatively, you can set an automatic pause between node upgrades:</p><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  -e upgrade_node_pause_seconds=60</pre><p>This gives the cluster 60 seconds to stabilize after each node upgrade before proceeding to the next one.</p><h3>PodDisruptionBudget (PDB) Considerations</h3><p>During the drain phase, Kubernetes respects PodDisruptionBudgets. If a workload has a PDB with maxUnavailable: 0, the drain will hang indefinitely waiting for permission to evict pods. This is a common source of upgrade failures.</p><p>Before starting an upgrade, audit your PDBs:</p><pre># List all PDBs and their disruption settings<br>kubectl get pdb --all-namespaces -o wide</pre><pre># Check for PDBs that might block drain<br>kubectl get pdb --all-namespaces -o jsonpath=&#39;{range .items[*]}{.metadata.namespace}/{.metadata.name}: maxUnavailable={.spec.maxUnavailable}, disruptionsAllowed={.status.disruptionsAllowed}{&quot;\n&quot;}{end}&#39;</pre><p>If you find PDBs with disruptionsAllowed: 0, either temporarily adjust them or ensure the corresponding workloads have enough replicas to tolerate one pod being evicted.</p><h3>Patch Upgrade: v1.32.9 → v1.32.10</h3><p>A patch upgrade is the simplest type of Kubernetes upgrade. It only includes bug fixes and security patches — no new features, no API changes, no deprecations. 
The Kubespray version remains the same (v2.29.1), and no supporting tools (etcd, containerd) need to change.</p><p>Despite its simplicity, a patch upgrade is the perfect opportunity to validate your upgrade procedures and tooling before attempting more complex minor or major upgrades.</p><h3>Pre-Upgrade Cluster State</h3><p>Before starting, document the current state:</p><pre># Current node versions<br>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           AGE   VERSION   INTERNAL-IP      OS-IMAGE                       KERNEL-VERSION<br>k8s-node1   Ready    control-plane   1d    v1.32.9   192.168.10.11    Rocky Linux 10.0 (Obsidian)    6.12.x<br>k8s-node2   Ready    control-plane   1d    v1.32.9   192.168.10.12    Rocky Linux 10.0 (Obsidian)    6.12.x<br>k8s-node3   Ready    control-plane   1d    v1.32.9   192.168.10.13    Rocky Linux 10.0 (Obsidian)    6.12.x<br>k8s-node4   Ready    &lt;none&gt;          1d    v1.32.9   192.168.10.14    Rocky Linux 10.0 (Obsidian)    6.12.x<br>k8s-node5   Ready    &lt;none&gt;          1d    v1.32.9   192.168.10.15    Rocky Linux 10.0 (Obsidian)    6.12.x</pre><pre># Current etcd cluster health<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><pre># Current component versions (static pods on control plane)<br>kubectl -n kube-system get pods -l tier=control-plane -o jsonpath=&#39;{range .items[*]}{.metadata.name}{&quot;\t&quot;}{.spec.containers[0].image}{&quot;\n&quot;}{end}&#39;</pre><pre>kube-apiserver-k8s-node1            registry.k8s.io/kube-apiserver:v1.32.9<br>kube-apiserver-k8s-node2            registry.k8s.io/kube-apiserver:v1.32.9<br>kube-apiserver-k8s-node3            registry.k8s.io/kube-apiserver:v1.32.9<br>kube-controller-manager-k8s-node1   registry.k8s.io/kube-controller-manager:v1.32.9<br>kube-controller-manager-k8s-node2   registry.k8s.io/kube-controller-manager:v1.32.9<br>kube-controller-manager-k8s-node3   registry.k8s.io/kube-controller-manager:v1.32.9<br>kube-scheduler-k8s-node1      
      registry.k8s.io/kube-scheduler:v1.32.9<br>kube-scheduler-k8s-node2            registry.k8s.io/kube-scheduler:v1.32.9<br>kube-scheduler-k8s-node3            registry.k8s.io/kube-scheduler:v1.32.9</pre><pre># Current kube-proxy version<br>kubectl -n kube-system get daemonset kube-proxy -o jsonpath=&#39;{.spec.template.spec.containers[0].image}&#39;</pre><pre>registry.k8s.io/kube-proxy:v1.32.9</pre><h3>Control Plane Rolling Upgrade</h3><p>We start the upgrade with the control plane nodes and etcd. This is always done first because worker nodes depend on the API server, and Kubernetes guarantees backward compatibility: a newer control plane can manage older worker nodes, but not vice versa.</p><pre>cd /root/kubespray</pre><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  --limit &quot;kube_control_plane:etcd&quot;</pre><p>This command takes approximately <strong>14 minutes</strong> for three control plane nodes.</p><h4>What Happens During the Control Plane Upgrade</h4><p>The upgrade-cluster.yml playbook executes the following sequence for each control plane node, one at a time:</p><p><strong>Phase 1: Pre-upgrade Downloads</strong></p><p>Before touching any node, Kubespray downloads all required container images and binaries to every node in the --limit scope. 
This ensures that the actual upgrade step (which involves restarting components) completes as quickly as possible.</p><pre>TASK [download : Download containers if pull is required or told to always pull] ****</pre><p>The images downloaded for a patch upgrade:</p><pre>registry.k8s.io/kube-apiserver:v1.32.10<br>registry.k8s.io/kube-controller-manager:v1.32.10<br>registry.k8s.io/kube-scheduler:v1.32.10<br>registry.k8s.io/kube-proxy:v1.32.10</pre><p>Images that do <strong>not</strong> change in a patch upgrade (and are therefore skipped):</p><pre>registry.k8s.io/coredns/coredns:v1.12.0         # CoreDNS version unchanged<br>registry.k8s.io/pause:3.10                        # Pause container unchanged<br>quay.io/coreos/etcd:v3.5.25                       # etcd version unchanged</pre><p><strong>Phase 2: Rolling Upgrade per Node (k8s-node1 first)</strong></p><pre>TASK [kubernetes/control-plane : Kubeadm | Cordon node] *************************<br>changed: [k8s-node1]</pre><pre>TASK [kubernetes/control-plane : Kubeadm | Drain node] **************************<br>changed: [k8s-node1]</pre><p>At this point, k8s-node1 is cordoned (no new pods will be scheduled) and drained (existing pods are gracefully evicted). Since this is a control plane node, only the static pods (apiserver, controller-manager, scheduler) and DaemonSet pods remain — they cannot be evicted.</p><pre>TASK [container-engine/containerd : Containerd | Ensure containerd is installed] ***<br>ok: [k8s-node1]</pre><p>For a patch upgrade, the containerd version typically does not change. Kubespray verifies the installed version matches the expected version and skips reinstallation if they match.</p><pre>TASK [kubernetes/control-plane : Kubeadm | Upgrade first control plane] *********<br>changed: [k8s-node1]</pre><p>This is the core step. Kubespray runs kubeadm upgrade apply v1.32.10 on the first control plane node. 
This command:</p><ol><li>Validates the upgrade path (v1.32.9 → v1.32.10 is allowed)</li><li>Downloads any missing component images</li><li>Updates the static pod manifests in /etc/kubernetes/manifests/</li><li>Relies on the kubelet to detect the manifest changes and restart the static pods</li><li>Waits for the new API server to become healthy</li></ol><p>For subsequent control plane nodes (k8s-node2, k8s-node3), kubeadm upgrade node is used instead of kubeadm upgrade apply:</p><pre>TASK [kubernetes/control-plane : Kubeadm | Upgrade subsequent control planes] ***<br>changed: [k8s-node2]<br>...<br>changed: [k8s-node3]</pre><p>The difference:</p><ul><li>kubeadm upgrade apply: updates the cluster-level configuration and upgrades the first node</li><li>kubeadm upgrade node: upgrades a single node using the cluster-level configuration already applied</li></ul><p><strong>Phase 3: kube-proxy DaemonSet Update</strong></p><p>After the first control plane node is upgraded, Kubespray updates the kube-proxy DaemonSet image. 
Since kube-proxy is a DaemonSet, this update propagates to <strong>all nodes</strong> in the cluster, including worker nodes that have not been upgraded yet:</p><pre>TASK [kubernetes/control-plane : Kubeadm | Update kube-proxy DaemonSet] *********<br>changed: [k8s-node1]</pre><p>This is safe because kube-proxy v1.32.10 is backward-compatible with kubelet v1.32.9 running on the worker nodes.</p><p><strong>Phase 4: Uncordon</strong></p><pre>TASK [kubernetes/control-plane : Kubeadm | Uncordon node] ***********************<br>changed: [k8s-node1]</pre><p>The node is marked schedulable again, and the playbook moves to the next control plane node.</p><h4>Monitoring the Upgrade in Real-Time</h4><p>While the upgrade is running, you can monitor progress from another terminal:</p><pre># Watch node versions change in real-time<br>watch -n 2 &#39;kubectl get nodes -o wide&#39;</pre><pre># Watch static pod restarts<br>watch -n 2 &#39;kubectl -n kube-system get pods -l tier=control-plane -o wide&#39;</pre><pre># Monitor etcd cluster health (should remain healthy throughout)<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><pre># Check API server availability through the load balancer<br>while true; do curl -sk -o /dev/null -w &quot;%{http_code}\n&quot; https://192.168.10.10:6443/healthz; sleep 1; done</pre><p>During the rolling upgrade, you will briefly see mixed versions:</p><pre>NAME        STATUS                     ROLES           VERSION<br>k8s-node1   Ready                      control-plane   v1.32.10   ← upgraded<br>k8s-node2   Ready,SchedulingDisabled   control-plane   v1.32.9    ← being upgraded<br>k8s-node3   Ready                      control-plane   v1.32.9    ← waiting<br>k8s-node4   Ready                      &lt;none&gt;          v1.32.9<br>k8s-node5   Ready                      &lt;none&gt;          v1.32.9</pre><p>The HAProxy load balancer on admin-lb (192.168.10.10) continues to route traffic to the 
two healthy API servers while one is being upgraded. The API remains available throughout.</p><h4>Post-Control-Plane-Upgrade Verification</h4><p>After the control plane upgrade completes:</p><pre># All control plane nodes should show v1.32.10<br>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           VERSION<br>k8s-node1   Ready    control-plane   v1.32.10<br>k8s-node2   Ready    control-plane   v1.32.10<br>k8s-node3   Ready    control-plane   v1.32.10<br>k8s-node4   Ready    &lt;none&gt;          v1.32.9    ← workers still on old version<br>k8s-node5   Ready    &lt;none&gt;          v1.32.9</pre><pre># Verify static pod images<br>kubectl -n kube-system get pods -l tier=control-plane -o jsonpath=&#39;{range .items[*]}{.metadata.name}{&quot;\t&quot;}{.spec.containers[0].image}{&quot;\n&quot;}{end}&#39;</pre><p>All control plane static pods should now show v1.32.10.</p><pre># Verify the kube-proxy DaemonSet image (it applies to ALL nodes, including workers)<br>kubectl -n kube-system get daemonset kube-proxy -o jsonpath=&#39;{.spec.template.spec.containers[0].image}&#39;</pre><pre>registry.k8s.io/kube-proxy:v1.32.10</pre><pre># Verify etcd cluster health (etcd version unchanged for patch upgrade)<br>ssh k8s-node1 etcdctl.sh member list -w table<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><pre># Verify API server responds correctly through all endpoints<br>curl -sk https://192.168.10.11:6443/version | jq .gitVersion<br>curl -sk https://192.168.10.12:6443/version | jq .gitVersion<br>curl -sk https://192.168.10.13:6443/version | jq .gitVersion<br>curl -sk https://192.168.10.10:6443/version | jq .gitVersion  # via HAProxy</pre><p>All should return &quot;v1.32.10&quot;.</p><h3>Worker Node Individual Upgrade</h3><p>With the control plane upgraded, we now upgrade the worker nodes. Unlike the control plane, which a single playbook run handled node by node, <strong>worker nodes should be upgraded individually</strong>, with one playbook run per node, to maintain workload availability. 
If you have workloads with replicas: 2 spread across two workers, upgrading both simultaneously would cause a full outage.</p><pre># Upgrade worker node k8s-node4<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  --limit &quot;k8s-node4&quot;</pre><p>This takes approximately <strong>2 minutes</strong> per worker node.</p><h4>What Happens During a Worker Node Upgrade</h4><p>The process is simpler than control plane upgrades:</p><ol><li><strong>Cordon</strong> k8s-node4</li><li><strong>Drain</strong> k8s-node4 (all non-DaemonSet pods are evicted to other nodes)</li><li><strong>Upgrade containerd</strong> (if version changed — not in patch upgrades)</li><li><strong>Run </strong><strong>kubeadm upgrade node</strong> (updates kubelet configuration)</li><li><strong>Restart kubelet</strong> service</li><li><strong>Uncordon</strong> k8s-node4</li></ol><p>During the drain, you can observe pods migrating:</p><pre># Watch pods being rescheduled from node4 to node5<br>watch -n 1 &#39;kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node4&#39;</pre><p>After k8s-node4 is done, upgrade k8s-node5:</p><pre># Upgrade worker node k8s-node5<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.32.10&quot; \<br>  --limit &quot;k8s-node5&quot;</pre><h4>Post-Worker-Upgrade Verification</h4><pre># All nodes should now be v1.32.10<br>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           VERSION<br>k8s-node1   Ready    control-plane   v1.32.10<br>k8s-node2   Ready    control-plane   v1.32.10<br>k8s-node3   Ready    control-plane   v1.32.10<br>k8s-node4   Ready    &lt;none&gt;          v1.32.10<br>k8s-node5   Ready    &lt;none&gt;          v1.32.10</pre><pre># Verify all system pods are healthy<br>kubectl -n kube-system get pods -o wide</pre><pre># Verify workloads are running normally<br>kubectl get pods --all-namespaces -o wide | grep -v Running | grep -v Completed</pre><h3>Updating Admin kubectl and 
kubeconfig</h3><p>After the cluster upgrade, the admin node (admin-lb) should also have its kubectl binary updated to match the cluster version. While older kubectl versions can communicate with newer API servers (within one minor version skew), it is best practice to keep them aligned.</p><pre># On the admin-lb node (192.168.10.10)</pre><pre># Check current kubectl version<br>kubectl version --client</pre><pre>Client Version: v1.32.9</pre><pre># Update kubectl to match the cluster version<br># First, ensure the Kubernetes yum repo points to v1.32<br>cat /etc/yum.repos.d/kubernetes.repo</pre><pre>[kubernetes]<br>name=Kubernetes<br>baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/<br>enabled=1<br>gpgcheck=1<br>gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key</pre><pre># Install the latest kubectl in the v1.32 series<br>dnf install kubectl --disableexcludes=kubernetes -y</pre><pre># Verify the new version<br>kubectl version --client</pre><pre>Client Version: v1.32.10</pre><p>Also refresh the kubeconfig from the first control plane node, as the certificate or cluster configuration may have been updated:</p><pre># Copy the updated kubeconfig from the first control plane<br>scp k8s-node1:/root/.kube/config /root/.kube/config</pre><pre># Verify connectivity<br>kubectl cluster-info<br>kubectl get nodes -o wide</pre><p>The patch upgrade from v1.32.9 to v1.32.10 is now complete.</p><h3>Minor Upgrade: v1.32.10 → v1.33.7</h3><p>A minor version upgrade is more significant than a patch upgrade. Minor releases introduce new features, new APIs, deprecate old APIs, and may change default behaviors. 
However, the upgrade procedure with Kubespray is nearly identical — the difference lies primarily in what changes inside the cluster.</p><h3>Pre-Upgrade: Review Release Notes</h3><p>Before any minor upgrade, review the Kubernetes changelog for breaking changes:</p><pre># Check what version Kubespray v2.29.1 supports for v1.33<br>grep -r &quot;kube_version&quot; roles/kubespray-defaults/defaults/main/download.yml | head -5</pre><p>Kubernetes follows a strict version skew policy: <strong>you can only upgrade one minor version at a time</strong>. Jumping from v1.32 to v1.34 directly is not supported. You must go v1.32 → v1.33 → v1.34.</p><h3>Running the Minor Upgrade</h3><p>The commands are identical to the patch upgrade — only the kube_version value changes.</p><h4>Step 1: Upgrade Control Plane</h4><pre>cd /root/kubespray</pre><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.33.7&quot; \<br>  --limit &quot;kube_control_plane:etcd&quot;</pre><p>This takes approximately <strong>18 minutes</strong> for three control plane nodes. 
The longer duration compared to the patch upgrade is because:</p><ul><li>More container images need to be downloaded (some support components have new versions)</li><li>kubeadm upgrade apply performs more extensive preflight checks when crossing minor versions</li><li>Static pod restarts may take slightly longer as new features are initialized</li></ul><p>The rolling upgrade process is identical:</p><pre>Node 1: cordon → drain → kubeadm upgrade apply v1.33.7 → uncordon<br>Node 2: cordon → drain → kubeadm upgrade node → uncordon<br>Node 3: cordon → drain → kubeadm upgrade node → uncordon</pre><p>During the upgrade, monitor the API server availability:</p><pre># Continuous health check through HAProxy<br>while true; do<br>  HTTP_CODE=$(curl -sk -o /dev/null -w &quot;%{http_code}&quot; https://192.168.10.10:6443/healthz)<br>  echo &quot;$(date &#39;+%H:%M:%S&#39;) - API Server: $HTTP_CODE&quot;<br>  sleep 1<br>done</pre><p>You should see uninterrupted 200 responses throughout the control plane upgrade, thanks to the HAProxy load balancer distributing requests across the three API servers.</p><h4>Step 2: Upgrade Worker Nodes</h4><pre># Upgrade workers one at a time<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.33.7&quot; \<br>  --limit &quot;k8s-node4&quot;</pre><pre>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.33.7&quot; \<br>  --limit &quot;k8s-node5&quot;</pre><p>Each worker takes approximately <strong>3 minutes</strong>.</p><p>Alternatively, you can upgrade all workers at once if you have sufficient capacity:</p><pre># Upgrade all workers (serial behavior controlled by playbook)<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.33.7&quot; \<br>  --limit &quot;kube_node&quot;</pre><p>With the default serial: 20%, both workers would be upgraded sequentially.</p><h4>Step 3: Verify the Minor Upgrade</h4><pre># All nodes should show v1.33.7<br>kubectl get nodes -o wide</pre><pre>NAME        STATUS   
ROLES           VERSION<br>k8s-node1   Ready    control-plane   v1.33.7<br>k8s-node2   Ready    control-plane   v1.33.7<br>k8s-node3   Ready    control-plane   v1.33.7<br>k8s-node4   Ready    &lt;none&gt;          v1.33.7<br>k8s-node5   Ready    &lt;none&gt;          v1.33.7</pre><pre># Verify API version<br>curl -sk https://192.168.10.10:6443/version | jq .</pre><pre>{<br>  &quot;major&quot;: &quot;1&quot;,<br>  &quot;minor&quot;: &quot;33&quot;,<br>  &quot;gitVersion&quot;: &quot;v1.33.7&quot;,<br>  &quot;buildDate&quot;: &quot;...&quot;,<br>  &quot;goVersion&quot;: &quot;go1.23.x&quot;,<br>  &quot;compiler&quot;: &quot;gc&quot;,<br>  &quot;platform&quot;: &quot;linux/amd64&quot;<br>}</pre><pre># Verify system pods<br>kubectl -n kube-system get pods -o wide</pre><pre># Check for any deprecated API usage in your workloads<br>kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis</pre><h4>Step 4: Update Admin kubectl</h4><p>For a minor version upgrade, the kubectl repo must be updated to the new minor version:</p><pre># On admin-lb node</pre><pre># Update the Kubernetes yum repo to point to v1.33<br>cat &gt; /etc/yum.repos.d/kubernetes.repo &lt;&lt; &#39;EOF&#39;<br>[kubernetes]<br>name=Kubernetes<br>baseurl=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/<br>enabled=1<br>gpgcheck=1<br>gpgkey=https://pkgs.k8s.io/core:/stable:/v1.33/rpm/repodata/repomd.xml.key<br>EOF</pre><pre># Install kubectl v1.33<br>dnf install kubectl --disableexcludes=kubernetes -y</pre><pre># Verify<br>kubectl version --client</pre><pre>Client Version: v1.33.7</pre><pre># Refresh kubeconfig<br>scp k8s-node1:/root/.kube/config /root/.kube/config</pre><pre># Confirm connectivity<br>kubectl cluster-info<br>kubectl get nodes</pre><p>The minor upgrade from v1.32.10 to v1.33.7 is now complete.</p><h3>Major Upgrade with Kubespray Version Bump: v1.33.7 → v1.34.3</h3><p>This is the most complex upgrade scenario. 
Not only are we upgrading Kubernetes by another minor version (v1.33 → v1.34), but we are also upgrading Kubespray itself from v2.29.1 to v2.30.0. A Kubespray version bump means:</p><ul><li>New Ansible roles and tasks</li><li>Updated default variable values</li><li>New versions of supporting components (etcd, containerd, CoreDNS)</li><li>Potentially new Python dependencies for Ansible</li><li>Changed Ansible collection requirements</li></ul><p>This is the type of upgrade that most closely mirrors what happens in real production environments, where the infrastructure tooling evolves alongside the target platform.</p><h3>Switching Kubespray Tags (v2.29.1 → v2.30.0)</h3><p>First, we need to switch the Kubespray repository to the new release tag.</p><pre>cd /root/kubespray</pre><pre># Check current Kubespray version<br>git describe --tags</pre><pre>v2.29.1</pre><pre># Fetch the latest tags<br>git fetch --all --tags</pre><pre># Check available v2.30.x tags<br>git tag -l &quot;v2.30*&quot;</pre><pre>v2.30.0</pre><pre># Switch to the new version<br>git checkout v2.30.0</pre><pre>Note: switching to &#39;v2.30.0&#39;.<br>You are in &#39;detached HEAD&#39; state...<br>HEAD is now at &lt;commit-hash&gt; Release v2.30.0</pre><p>After switching, verify what Kubernetes versions this Kubespray release supports:</p><pre># Check the supported K8s version range<br>grep -r &quot;kube_version_min_required\|kube_version&quot; roles/kubespray-defaults/defaults/main/*.yml | grep -i version | head -10</pre><p>Also check what component versions ship with this release:</p><pre># etcd version<br>grep &quot;etcd_version&quot; roles/download/defaults/main/main.yml</pre><pre>etcd_version: v3.5.26</pre><pre># containerd version<br>grep &quot;containerd_version&quot; roles/download/defaults/main/main.yml</pre><pre>containerd_version: 2.2.1</pre><pre># CoreDNS version<br>grep &quot;coredns_version&quot; roles/download/defaults/main/main.yml</pre><p>This reveals that upgrading to Kubespray v2.30.0 
will also upgrade:</p><ul><li><strong>etcd</strong>: 3.5.25 → 3.5.26</li><li><strong>containerd</strong>: 2.1.5 → 2.2.1</li></ul><p>These component upgrades happen automatically as part of the Kubernetes upgrade because Kubespray controls their versions through role defaults.</p><h3>Python Virtual Environment for Dependency Isolation</h3><p>Kubespray v2.30.0 may require different Python package versions than v2.29.1. Installing new dependencies directly on the system can break existing Ansible setups. A Python virtual environment provides isolation.</p><pre># Check the new requirements<br>cat /root/kubespray/requirements.txt</pre><pre>ansible==10.7.0<br>cryptography==46.0.3<br>jinja2==3.1.5<br>jmespath==1.1.0<br>MarkupSafe==3.0.2<br>netaddr==1.3.0<br>pbr==6.1.1<br>ruamel.yaml==0.18.14</pre><p>Compare with what is currently installed:</p><pre>pip3 list 2&gt;/dev/null | grep -i -E &quot;ansible|cryptography|jmespath|netaddr&quot;</pre><p>If versions differ, install the new requirements:</p><pre># Option 1: Direct installation (simpler, our lab approach)<br>pip3 install -r requirements.txt --break-system-packages</pre><pre># Option 2: Virtual environment (recommended for production admin nodes)<br>python3 -m venv /root/kubespray-venv<br>source /root/kubespray-venv/bin/activate<br>pip3 install -r requirements.txt</pre><p>If using a virtual environment, remember to activate it before running any Ansible commands:</p><pre>source /root/kubespray-venv/bin/activate</pre><p>Verify the key packages:</p><pre>pip3 list | grep -i -E &quot;ansible|cryptography|jmespath|netaddr&quot;</pre><pre>ansible          10.7.0<br>cryptography     46.0.3<br>jmespath         1.1.0<br>netaddr          1.3.0</pre><p>Also verify that Ansible can reach all nodes:</p><pre>cd /root/kubespray<br>ansible -i inventory/mycluster/inventory.ini all -m ping</pre><p>All nodes should return pong.</p><h3>etcd Version Upgrade (3.5.25 → 3.5.26)</h3><p>One of the most critical parts of a Kubespray version bump 
is the etcd upgrade. etcd is the single source of truth for all cluster state — a failed etcd upgrade can be catastrophic. Kubespray handles this carefully:</p><ol><li>Before modifying any etcd member, Kubespray creates a <strong>backup snapshot</strong></li><li>Each etcd member is upgraded one at a time</li><li>After each member upgrade, cluster health is verified before proceeding</li></ol><h4>Pre-Upgrade etcd Status</h4><pre># etcd member list<br>ssh k8s-node1 etcdctl.sh member list -w table</pre><pre>+------------------+---------+-----------+---------------------------+---------------------------+<br>|        ID        | STATUS  |   NAME    |       PEER ADDRS          |      CLIENT ADDRS         |<br>+------------------+---------+-----------+---------------------------+---------------------------+<br>| 8e9e05c52164694d | started | etcd1     | https://192.168.10.11:2380| https://192.168.10.11:2379|<br>| 91bc3c398fb3c146 | started | etcd2     | https://192.168.10.12:2380| https://192.168.10.12:2379|<br>| fd422379fda50e48 | started | etcd3     | https://192.168.10.13:2380| https://192.168.10.13:2379|<br>+------------------+---------+-----------+---------------------------+---------------------------+</pre><pre># etcd version before upgrade<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><p>The version column should show 3.5.25.</p><h4>etcd Upgrade Process (Automatic during K8s upgrade)</h4><p>The etcd upgrade happens automatically when you run the upgrade playbook. 
Kubespray detects that the new release expects etcd v3.5.26 and upgrades each member:</p><pre>TASK [etcd : Backup etcd data] **************************************************<br>changed: [k8s-node1]<br>changed: [k8s-node2]<br>changed: [k8s-node3]</pre><pre>TASK [etcd : Upgrade etcd member] ***********************************************<br>changed: [k8s-node1]</pre><pre>TASK [etcd : Wait for etcd cluster health] **************************************<br>ok: [k8s-node1]</pre><pre>TASK [etcd : Upgrade etcd member] ***********************************************<br>changed: [k8s-node2]</pre><pre>TASK [etcd : Wait for etcd cluster health] **************************************<br>ok: [k8s-node2]</pre><pre>TASK [etcd : Upgrade etcd member] ***********************************************<br>changed: [k8s-node3]</pre><p>Each etcd member is restarted with the new binary. During the restart of a single member, the etcd cluster maintains quorum with the remaining two members (2/3 = majority), so there is no data availability interruption.</p><h4>etcd Backup Verification</h4><p>After the upgrade, verify that backups were created:</p><pre>ssh k8s-node1 tree /var/backups/ | head -20</pre><pre>/var/backups/<br>├── etcd-20260207-before-upgrade/<br>│   └── member/<br>│       ├── snap/<br>│       │   └── db<br>│       └── wal/<br>│           └── ...</pre><p>These backups are critical — if anything goes wrong during the upgrade, you can restore etcd from these snapshots. 
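</p><p>As a hedged sketch of how you might sanity-check a backup before relying on it, the commands below list the backup directory and inspect a snapshot file. The directory name is taken from the example output above, and the snapshot.db file name is an assumption (the tree listing only shows the member/ directory), so adjust both to whatever actually exists on your nodes. In etcd 3.5, etcdctl snapshot status still works but prints a deprecation notice pointing to etcdutl.</p><pre># Assumed path: substitute the backup directory present on your node<br>BACKUP_DIR=/var/backups/etcd-20260207-before-upgrade<br><br># List what the backup actually contains<br>ssh k8s-node1 ls -lh &quot;$BACKUP_DIR&quot;<br><br># If a snapshot.db file exists (assumption), print its hash, revision, key count, and size<br>ssh k8s-node1 etcdctl.sh snapshot status &quot;$BACKUP_DIR/snapshot.db&quot; -w table</pre><p>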
Kubespray creates timestamped backup directories under /var/backups/ on each etcd member.</p><h4>Post-Upgrade etcd Verification</h4><pre># Verify etcd version is now 3.5.26<br>ssh k8s-node1 etcdctl.sh endpoint status -w table</pre><pre># Verify cluster health<br>ssh k8s-node1 etcdctl.sh endpoint health -w table</pre><pre>+---------------------------+--------+-------+-------+<br>|         ENDPOINT          | HEALTH |  TOOK | ERROR |<br>+---------------------------+--------+-------+-------+<br>| https://192.168.10.11:2379| true   |  12ms |       |<br>| https://192.168.10.12:2379| true   |  11ms |       |<br>| https://192.168.10.13:2379| true   |  13ms |       |<br>+---------------------------+--------+-------+-------+</pre><h3>containerd Upgrade (2.1.5 → 2.2.1)</h3><p>The container runtime upgrade is another significant change that comes with the Kubespray version bump. containerd is the CRI (Container Runtime Interface) implementation that actually runs containers on each node. Upgrading it requires restarting the containerd service, which briefly affects container operations on that node.</p><p>Kubespray handles this as part of the rolling upgrade:</p><ol><li>The node is already cordoned and drained (no user workloads running)</li><li>containerd binary is replaced</li><li>containerd service is restarted</li><li>kubelet is restarted (reconnects to the new containerd socket)</li><li>Static pods are re-created by the kubelet</li></ol><h4>containerd Upgrade is Transparent</h4><p>You do not need to run separate commands for containerd — it is upgraded automatically during the upgrade-cluster.yml execution:</p><pre>TASK [container-engine/containerd : Containerd | Download containerd] ***********<br>changed: [k8s-node1]</pre><pre>TASK [container-engine/containerd : Containerd | Install containerd] ************<br>changed: [k8s-node1]</pre><pre>TASK [container-engine/containerd : Containerd | Restart containerd] ************<br>changed: [k8s-node1]</pre><h4>Running the 
Full Upgrade</h4><p>Now, execute the complete upgrade:</p><pre>cd /root/kubespray</pre><pre># Step 1: Upgrade control plane and etcd (includes etcd + containerd upgrades)<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.34.3&quot; \<br>  --limit &quot;kube_control_plane:etcd&quot;</pre><p>This takes approximately <strong>15 minutes</strong> for three control plane nodes. The additional time compared to the patch upgrade comes from the etcd restart/upgrade and the containerd binary replacement on each node.</p><pre># Step 2: Upgrade worker nodes<br>ansible-playbook upgrade-cluster.yml \<br>  -e kube_version=&quot;1.34.3&quot; \<br>  --limit &quot;kube_node&quot;</pre><p>Worker nodes take approximately <strong>4 minutes</strong> each because containerd also needs to be upgraded on workers.</p><h4>Verifying containerd Version</h4><p>After the upgrade completes:</p><pre># Check containerd version on all nodes<br>ansible -i inventory/mycluster/inventory.ini all -m shell -a &quot;containerd --version&quot;</pre><pre>k8s-node1 | SUCCESS | containerd containerd.io 2.2.1 ...<br>k8s-node2 | SUCCESS | containerd containerd.io 2.2.1 ...<br>k8s-node3 | SUCCESS | containerd containerd.io 2.2.1 ...<br>k8s-node4 | SUCCESS | containerd containerd.io 2.2.1 ...<br>k8s-node5 | SUCCESS | containerd containerd.io 2.2.1 ...</pre><pre># Verify container images on a node<br>ssh k8s-node1 crictl images</pre><p>You should see images for Kubernetes v1.34.3 alongside the upgraded etcd v3.5.26 image.</p><h4>Post-Upgrade Cluster Verification</h4><pre># All nodes should show v1.34.3<br>kubectl get nodes -o wide</pre><pre>NAME        STATUS   ROLES           VERSION<br>k8s-node1   Ready    control-plane   v1.34.3<br>k8s-node2   Ready    control-plane   v1.34.3<br>k8s-node3   Ready    control-plane   v1.34.3<br>k8s-node4   Ready    &lt;none&gt;          v1.34.3<br>k8s-node5   Ready    &lt;none&gt;          v1.34.3</pre><pre># API version check through all endpoints<br>for ip 
in 192.168.10.{11,12,13,10}; do<br>  echo -n &quot;$ip: &quot;<br>  curl -sk https://$ip:6443/version | jq -r .gitVersion<br>done</pre><pre>192.168.10.11: v1.34.3<br>192.168.10.12: v1.34.3<br>192.168.10.13: v1.34.3<br>192.168.10.10: v1.34.3</pre><pre># Verify all system components<br>kubectl -n kube-system get pods -o wide</pre><pre># Check downloaded release binaries<br>ssh k8s-node1 tree /tmp/releases</pre><h3>Helm and kubectl Version Updates</h3><p>After a major upgrade, the admin tooling must be updated to match. This includes both kubectl and helm.</p><h4>Updating kubectl to v1.34</h4><pre># On admin-lb node</pre><pre># Update the Kubernetes yum repo to v1.34<br>cat &gt; /etc/yum.repos.d/kubernetes.repo &lt;&lt; &#39;EOF&#39;<br>[kubernetes]<br>name=Kubernetes<br>baseurl=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/<br>enabled=1<br>gpgcheck=1<br>gpgkey=https://pkgs.k8s.io/core:/stable:/v1.34/rpm/repodata/repomd.xml.key<br>EOF</pre><pre># Install kubectl<br>dnf clean metadata<br>dnf install kubectl --disableexcludes=kubernetes -y</pre><pre># Verify<br>kubectl version --client</pre><pre>Client Version: v1.34.3</pre><h4>Updating Helm</h4><p>Kubespray v2.30.0 may also ship with or recommend a newer Helm version. 
In our case, Helm is updated to v3.20.0:</p><pre># Check current Helm version<br>helm version --short</pre><pre>v3.17.3+...</pre><pre># Install Helm v3.20.0<br>curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | DESIRED_VERSION=v3.20.0 bash</pre><pre>Downloading https://get.helm.sh/helm-v3.20.0-linux-amd64.tar.gz<br>Verifying checksum...<br>Preparing to install helm into /usr/local/bin<br>helm installed into /usr/local/bin/helm</pre><pre># Verify the new version<br>helm version --short</pre><pre>v3.20.0+...</pre><h4>Refreshing kubeconfig</h4><p>As with previous upgrades, refresh the kubeconfig from the first control plane node:</p><pre>scp k8s-node1:/root/.kube/config /root/.kube/config</pre><pre># Final verification<br>kubectl cluster-info<br>kubectl get nodes -o wide<br>kubectl -n kube-system get pods</pre><h4>HAProxy Stats Verification</h4><p>Finally, verify that the HAProxy load balancer still shows all three control plane backends as healthy:</p><pre># Open HAProxy stats page<br># URL: http://192.168.10.10:9000/haproxy_stats</pre><pre># Or check via CLI<br>curl -s <a href="http://192.168.10.10:9000/haproxy_stats\;csv">http://192.168.10.10:9000/haproxy_stats\;csv</a> | grep k8s_api | cut -d&#39;,&#39; -f1-3,18</pre><p>All three backends should show status UP.</p><h4>Prometheus/Grafana Monitoring Check</h4><p>If you have the kube-prometheus-stack installed (see Section 8), verify that all monitoring targets are still healthy after the upgrade:</p><pre># Prometheus targets (check via NodePort)<br># URL: http://192.168.10.14:30001/targets</pre><pre># Grafana dashboards (check via NodePort)<br># URL: <a href="http://192.168.10.14:30002">http://192.168.10.14:30002</a><br># Login: admin / prom-operator</pre><p>Key dashboards to check:</p><ul><li><strong>Kubernetes / API server</strong> (dashboard 15661): Verify API server latency and error rates returned to normal</li><li><strong>etcd</strong> (dashboard 12693): Verify etcd leader 
elections did not spike, DB size is stable</li><li><strong>Node Exporter</strong> (dashboard from kube-prometheus-stack): Verify resource utilization across all nodes</li></ul><h3>Upgrade Summary</h3><p>Over the course of this section, we performed three distinct upgrade paths, each building on the complexity of the previous one:</p><p><strong>Patch Upgrade (v1.32.9 → v1.32.10):</strong> The simplest case: only Kubernetes binaries changed. No etcd, no containerd, no Kubespray changes. Demonstrated the core rolling upgrade mechanics: cordon, drain, kubeadm upgrade, uncordon.</p><p><strong>Minor Upgrade (v1.32.10 → v1.33.7):</strong> Same procedure, but with potential API deprecations and longer upgrade times. Reinforced the importance of upgrading one minor version at a time and updating the admin kubectl repository pointer.</p><p><strong>Major Upgrade with Kubespray Bump (v1.33.7 → v1.34.3):</strong> The full production scenario: new Kubespray tag, new Python dependencies, new etcd version, new containerd version, new Helm version, and the Kubernetes upgrade itself. This is the upgrade type that production teams encounter most frequently and the one that requires the most preparation.</p><p>Throughout all three upgrades, the cluster remained available. The HAProxy load balancer ensured API server continuity, the rolling upgrade strategy preserved workload availability, and etcd quorum was maintained during member restarts. This is the power of combining Kubespray’s automation with a properly designed HA architecture.</p><h3>Conclusion</h3><h3>Key Takeaways</h3><p>This guide walked through the complete lifecycle of a production-oriented Kubernetes cluster managed by Kubespray. Here are the most important lessons:</p><p><strong>HA is not optional; it is the foundation.</strong> A 3-node control plane with a 3-member etcd cluster (quorum of 2) is the minimum viable production architecture. Without it, the failure of a single control plane node renders the entire cluster unmanageable. 
The cost of 2 additional control plane nodes is negligible compared to the cost of downtime.</p><p><strong>The API endpoint configuration has real consequences.</strong> The three cases we explored (client-side NGINX proxy, external LB + client-side proxy, external LB only) each have distinct trade-offs. Client-side proxies give workers independence from external infrastructure but add per-node complexity. External-only LB simplifies worker configuration but makes the load balancer a critical dependency. Choose based on your infrastructure constraints and failure tolerance.</p><p><strong>Node management is a routine operation, not an emergency.</strong> Adding workers, removing failed nodes, and replacing control plane nodes should all be practiced regularly. The distinction between scale.yml (workers only) and cluster.yml (control plane additions) is critical — using the wrong playbook leads to incomplete configurations. Always add new control plane nodes at the end of the inventory group.</p><p><strong>etcd is the heart of the cluster.</strong> Every piece of cluster state lives in etcd. Always use an odd number of members (3, 5, 7) for proper quorum. Always verify etcd health after any cluster modification. Always ensure backups are being created before upgrades.</p><p><strong>Upgrades should be frequent and incremental.</strong> Patch upgrades (v1.32.9 → v1.32.10) are low-risk and take minutes. Minor upgrades (v1.32 → v1.33) require more attention but follow the same pattern. Skipping versions creates compounding technical debt and makes each subsequent upgrade more painful. Kubernetes only supports upgrading one minor version at a time.</p><p><strong>Kubespray version matters.</strong> Each Kubespray release supports a specific range of Kubernetes versions. When crossing into a new Kubernetes minor version, check whether your Kubespray version supports it. 
Switching Kubespray versions (v2.29.1 → v2.30.0) also requires updating Python dependencies, so always run pip install -r requirements.txt after switching tags.</p><p><strong>Monitoring is not a nice-to-have.</strong> You cannot upgrade with confidence if you cannot see what is happening. kube-prometheus-stack with Prometheus, Grafana, and etcd metrics gives you the observability needed to verify cluster health before, during, and after upgrades.</p><h3>Kubespray as a Production Lifecycle Management Framework</h3><p>Kubespray is more than a deployment tool; it is a lifecycle management framework. The five core playbooks (cluster.yml, scale.yml, remove-node.yml, upgrade-cluster.yml, reset.yml) cover the full operational spectrum:</p><table><tr><th>Phase</th><th>Playbook</th><th>Use Case</th></tr><tr><td><strong>Deploy</strong></td><td>cluster.yml</td><td>Initial deployment, adding control plane nodes</td></tr><tr><td><strong>Scale</strong></td><td>scale.yml</td><td>Adding worker nodes</td></tr><tr><td><strong>Shrink</strong></td><td>remove-node.yml</td><td>Removing any node (graceful or forced)</td></tr><tr><td><strong>Upgrade</strong></td><td>upgrade-cluster.yml</td><td>Rolling upgrades with drain/cordon/uncordon</td></tr><tr><td><strong>Teardown</strong></td><td>reset.yml</td><td>Full cluster reset</td></tr></table><p>Because Kubespray is built on Ansible, every operation is:</p><ul><li><strong>Idempotent</strong>: Running a playbook twice produces the same result. 
Safe to retry after failures.</li><li><strong>Declarative</strong>: The inventory and variables define the desired state; Kubespray converges the cluster to match.</li><li><strong>Auditable</strong>: Every playbook run can be logged, and the inventory can be version-controlled in Git.</li><li><strong>Extensible</strong>: Custom roles and tasks can be added for organization-specific requirements (certificate management, custom CNI configurations, security hardening).</li></ul><p>The combination of Git-managed inventory, Ansible’s idempotency, and Kubespray’s rolling upgrade capabilities makes it a solid foundation for teams that need to manage Kubernetes clusters without relying on a managed cloud service.</p><h3>Next Steps: Disaster Recovery, Automation, and Beyond</h3><p>This guide covered the core operational workflows, but several advanced topics extend naturally from here:</p><p><strong>Disaster Recovery (DR):</strong></p><ul><li>Implement automated etcd snapshot backups (e.g., CronJob running etcdctl snapshot save to an off-cluster storage backend)</li><li>Practice etcd restoration from snapshots: etcdctl snapshot restore (or etcdutl snapshot restore on etcd v3.5 and later, where the etcdctl form is deprecated)</li><li>Document and test the full cluster recovery procedure: restore etcd → restart control plane → verify state</li><li>Consider a cluster backup tool like Velero to back up Kubernetes resources and persistent volumes alongside the etcd snapshots</li></ul><p><strong>CI/CD-Driven Upgrades:</strong></p><ul><li>Integrate Kubespray playbooks into a CI/CD pipeline (Jenkins, GitLab CI, GitHub Actions)</li><li>Use a Git branch per environment (dev, staging, production)</li><li>Automate the upgrade sequence: run in staging → run health checks → promote to production</li><li>Add pre-upgrade and post-upgrade verification steps (API health, etcd quorum, workload status) as pipeline gates</li></ul><p><strong>Security Hardening:</strong></p><ul><li>Enforce Pod Security Admission (or OPA Gatekeeper) for workload policy enforcement</li><li>Implement NetworkPolicies to restrict pod-to-pod 
communication</li><li>Configure certificate rotation and monitor certificate expiration dates</li><li>Harden etcd with TLS client authentication and restrict access to control plane nodes only</li></ul><p><strong>Advanced HA:</strong></p><ul><li>Deploy HAProxy in HA with keepalived for virtual IP failover (eliminates LB as SPOF)</li><li>Consider running 5 etcd members for higher fault tolerance in large clusters</li><li>Implement pod topology spread constraints and anti-affinity rules for critical workloads</li></ul><p><strong>Multi-Cluster Management:</strong></p><ul><li>Use Kubespray to manage multiple clusters from a single admin node</li><li>Maintain separate inventories per cluster in the same Git repository</li><li>Standardize configurations across clusters using shared group_vars with cluster-specific overrides</li></ul><p>The techniques in this guide (HA deployment, rolling upgrades, node lifecycle management, and observability) form the operational foundation upon which all of these advanced topics are built. Master these fundamentals first, and the rest follows naturally.</p><blockquote><strong><em>Lab Cleanup:</em></strong><em> When you’re done, tear down the entire environment with </em><em>vagrant destroy -f. All VMs and their data will be removed.</em></blockquote>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubernetes, The Kubespray Way]]></title>
            <link>https://sigridjin.medium.com/kubernetes-the-kubespray-way-89111e0e1636?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/89111e0e1636</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[kubespray]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 31 Jan 2026 21:59:53 GMT</pubDate>
            <atom:updated>2026-01-31T21:59:53.547Z</atom:updated>
            <content:encoded><![CDATA[<h4>You can optimize Kubernetes infrastructure provisioning and lifecycle management through Kubespray’s Ansible-based automation framework, leveraging codified community expertise to deliver robust, production-grade environments.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4N4fTJ8pPsl4SH-esMqKSg.jpeg" /></figure><h3>Chapter 1: Why Kubespray? Understanding the Automation Spectrum</h3><p>If you have been following the Kubernetes learning path properly, you have probably gone through something like Kubernetes The Hard Way first. You manually provisioned VMs, generated certificates with OpenSSL, created kubeconfig files by hand, configured etcd as a systemd service, and set up each control plane component one by one. It was painful. It took hours. But you understood exactly what was happening.</p><p>Then you discovered kubeadm. Suddenly, all that certificate generation, etcd bootstrapping, and control plane deployment collapsed into a single command:</p><blockquote>kubeadm init --config=kubeadm-config.yaml</blockquote><p>The hours of manual work became minutes. But you still had to prepare each node yourself: disable swap, load kernel modules, install containerd, configure kubelet. You still had to SSH into each machine and run commands.</p><p>Now we arrive at Kubespray. One command, and everything happens:</p><blockquote>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml</blockquote><p>Fifteen minutes later, you have a production-ready Kubernetes cluster. The OS is configured. Containerd is installed. Certificates are generated. etcd is running. The control plane is up. Workers have joined. CNI is deployed. CoreDNS is answering queries.</p><p>This is the automation spectrum. Each step up the ladder makes deployment easier and faster. 
But there is a cost that nobody talks about until it bites them.</p><h4>What Kubespray Actually Is</h4><p>The Kubespray GitHub repository introduces itself with a simple tagline:</p><blockquote>Deploy a Production Ready Kubernetes Cluster</blockquote><p>This is not marketing speak. Kubespray genuinely aims to give you a cluster that you can run workloads on immediately, configured according to best practices that the Kubernetes community has learned through years of production experience.</p><p>The official documentation describes Kubespray as:</p><blockquote>Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks.</blockquote><p>That phrase “domain knowledge” is the key. Kubespray is not just a script that runs commands in order. It encodes decisions about how to configure NTP synchronization, what kernel parameters to set, how to structure etcd clusters, when to use IPVS versus iptables for kube-proxy, and hundreds of other operational details.</p><p>Kubespray is maintained by the Kubernetes SIG Cluster Lifecycle group. Since version 2.3, it uses kubeadm internally for the actual cluster bootstrapping. This means Kubespray handles everything that kubeadm does not: the OS preparation, container runtime installation, CNI deployment, and post-installation configuration.</p><p>Think of it as layers:</p><ul><li>Manual installation: You do everything</li><li>kubeadm: Automates cluster bootstrapping, you do the rest</li><li>Kubespray: Automates everything, you configure variables</li></ul><p>The Kubespray repository is organized as a standard Ansible project.</p><p>Each directory under roles contains the logic for one piece of the cluster. The bootstrap-os role configures the operating system. The container-engine role installs containerd. The etcd role sets up the etcd cluster. 
And so on.</p><h4>The Double-Edged Sword</h4><p>Here is the uncomfortable truth about automation: the more it does for you, the less you understand about what is actually happening.</p><p>When you run cluster.yml, Kubespray executes hundreds of tasks across dozens of roles. It loads kernel modules, writes configuration files, downloads binaries, generates certificates, creates systemd units, applies Kubernetes manifests, and configures networking. All of this happens behind a progress bar showing task names scrolling by.</p><p>If the playbook completes successfully, you get a working cluster. You can run kubectl get nodes and see your machines. You can deploy applications. Everything works.</p><p>But what happens when something goes wrong six months later?</p><p>A pod is not scheduling. You check the events, and they mention resource pressure. You look at the node, and something is consuming all the memory. Is it the kubelet configuration? The container runtime? A kernel parameter? Where do you even look?</p><p>Network traffic between pods is failing. Packets are being dropped somewhere. Is it the CNI plugin? An iptables rule? A kernel module that is not loaded? Which configuration file controls this behavior?</p><p>Certificates have expired. The API server is refusing connections. How do you renew them? Where are they stored? What tool generated them in the first place?</p><p>If you installed everything manually, you would know the answers because you typed every command yourself. If you used kubeadm with understanding, you would know that kubeadm certs renew exists and where the certificates live.</p><p>But if you just ran ansible-playbook cluster.yml without understanding what it does, you are now staring at a broken cluster with no idea how to fix it.</p><p>This is the black box problem. The automation that saved you hours during installation now costs you days during troubleshooting.</p><h4>The Upgrade Trap</h4><p>Kubernetes moves fast. 
A new minor version releases every four months. Security patches come out regularly. Running an outdated cluster is a security risk and eventually a compatibility problem as client tools and container images stop supporting old API versions.</p><p>Kubespray provides an upgrade playbook:</p><blockquote>ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.31.0</blockquote><p>In theory, you run this command and your cluster upgrades. In practice, upgrades are where black box automation fails hardest.</p><p>The upgrade playbook modifies your cluster in place. It drains nodes, updates binaries, restarts services, and hopes everything comes back up. If something fails partway through, you have a partially upgraded cluster in an inconsistent state.</p><p>Without understanding what the playbook is doing, you cannot:</p><ul><li>Diagnose why a particular step failed</li><li>Manually complete the upgrade if automation cannot</li><li>Roll back to the previous state safely</li><li>Verify that the upgrade completed correctly</li></ul><p>The result is that people who do not understand their automation become afraid to upgrade. They skip versions. They fall behind on security patches. The tool that was supposed to help them maintain their cluster becomes the reason they cannot maintain it.</p><h4>Kubespray Itself Changes</h4><p>This problem compounds because Kubespray is also evolving. New versions add roles, rename variables, change default values, and restructure configuration files.</p><p>If you learned Kubespray 2.20 and then try to use 2.28, you will find that:</p><ul><li>Some variable names have changed</li><li>New configuration options exist that you have never seen</li><li>Default behaviors are different</li><li>The role structure may have been reorganized</li></ul><p>Your muscle memory and mental model from the old version no longer apply. 
You need to relearn the tool, but you never really learned it deeply in the first place — you just ran the playbook.</p><h4>The Temptation of “Just Make It Work”</h4><p>Every developer has done this at some point. A deadline is approaching. Something needs to be running by tomorrow. You find a tool that claims to solve your problem, you run it, and it works. You move on.</p><p>For short-lived systems, this is fine. But Kubernetes clusters are not short-lived. They run for years. They get upgraded. They have nodes added and removed. They suffer hardware failures and network partitions. They need certificates renewed and configurations tuned.</p><p>If you do not understand how your cluster was built, you do not really own it. You are at the mercy of the tool. When the tool cannot help you, you are stuck.</p><p>I have seen this happen. At a previous job, we used Ansible extensively for infrastructure automation. Nobody on the team really understood Ansible deeply. We copied playbooks from examples, modified them until they worked, and ran them. When something broke, we spent hours debugging YAML indentation and variable precedence because we had never learned how Ansible actually evaluates variables.</p><p>The playbooks worked, but we did not understand why they worked. That meant we also did not understand why they failed.</p><h4>The Philosophy of This Tutorial</h4><p>This brings us to why this tutorial exists and how it approaches Kubespray.</p><p>We are not going to treat Kubespray as a magic button that produces clusters. We are going to open it up, look inside, and understand what it does.</p><p>When we run cluster.yml, we will trace through the plays and roles to see what tasks execute. When we configure variables, we will understand where those variables come from and how they affect the final cluster. 
When we deploy a cluster, we will verify that each component is running correctly and know where to look if it is not.</p><p>This takes longer than just running the playbook and hoping for the best. But it means that six months from now, when something breaks at 3 AM, you will have a chance of fixing it.</p><p>The goal is not to memorize every variable and task in Kubespray. That would be impossible: the kubespray_defaults role alone has over 800 lines of variables, and there are dozens of other roles. The goal is to understand the structure well enough that you know where to look when you need to find something.</p><p>Think of it like learning a new city. You do not memorize every street. But you learn the main neighborhoods, the major roads, and how the transit system works. Then when you need to get somewhere specific, you can figure it out.</p><h4>What Kubespray Automates</h4><p>To understand Kubespray, it helps to see exactly what manual work it replaces. If you have done Kubernetes The Hard Way, you will recognize these steps:</p><p>Kubernetes The Hard Way had you:</p><ol><li>Provision compute resources (VMs or physical machines)</li><li>Generate TLS certificates for every component with OpenSSL</li><li>Create kubeconfig files for kubelet, controller-manager, scheduler, and admin</li><li>Generate a data encryption configuration</li><li>Bootstrap the etcd cluster with systemd units</li><li>Bootstrap the control plane components as systemd services</li><li>Configure kubelet and kube-proxy on worker nodes</li><li>Set up pod networking routes manually</li></ol><p>kubeadm automated steps 2 through 6 into kubeadm init and step 7 into kubeadm join. 
But you still had to:</p><ul><li>Prepare the OS (disable swap, load kernel modules, set kernel parameters)</li><li>Install the container runtime (containerd)</li><li>Install kubeadm, kubelet, and kubectl</li><li>Configure the container runtime for Kubernetes</li><li>Install a CNI plugin</li><li>Set up high availability if you wanted it</li></ul><p>Kubespray automates all of this. When you run cluster.yml, its roles execute in sequence across your inventory hosts. The entire process typically takes 15–30 minutes depending on network speed and the number of nodes.</p><h4>The Comparison That Matters</h4><p>Here is a concrete comparison of deploying a basic cluster with each approach:</p><p>With Kubernetes The Hard Way, you run approximately 200 commands across multiple machines. You SSH into each node, copy files, edit configurations, and start services. The process takes several hours if you are careful and understand what you are doing.</p><p>With kubeadm, you still prepare each machine manually, but the cluster bootstrapping is reduced to:</p><pre># On the first control plane node<br>kubeadm init --config=kubeadm-config.yaml<br><br># On each worker node<br>kubeadm join 192.168.10.10:6443 --token &lt;token&gt; --discovery-token-ca-cert-hash sha256:&lt;hash&gt;</pre><p>You save significant time, but you still need to visit each machine.</p><p>With Kubespray, you run a single command from your Ansible control node:</p><blockquote>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b</blockquote><p>Kubespray connects to all machines simultaneously, configures them in parallel where possible, and handles the sequencing of operations automatically.</p><p>The time savings are dramatic. But remember: you are trading understanding for convenience. This tutorial exists to help you get both.</p><h3>Chapter 2: Kubespray Core Concepts and Features</h3><p>Let me start by reading you the official description from the Kubespray GitHub repository. 
The tagline says “Deploy a Production Ready Kubernetes Cluster.” This is not just marketing speak. It means Kubespray is designed to give you a cluster that you can actually run workloads on immediately, not a toy setup that needs hours of additional configuration.</p><p>The official documentation describes Kubespray as “a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks.” Pay attention to that phrase “domain knowledge.” This is what separates Kubespray from a simple shell script that runs kubeadm commands. The Kubespray maintainers have spent years figuring out what works in production environments, what kernel parameters matter, which CNI configurations cause problems, and how to handle edge cases. All of that accumulated wisdom is baked into the playbooks.</p><p>Kubespray is maintained by the Kubernetes SIG Cluster Lifecycle group, which means it is an official Kubernetes project. This is not some random GitHub repository that might disappear tomorrow.</p><p>One thing that confuses people is how Kubespray relates to kubeadm. They are not competing tools. Since version 2.3, Kubespray has used kubeadm internally for the actual cluster bootstrapping. The official documentation explains this decision:</p><blockquote>“Kubespray has started using kubeadm internally for cluster creation since v2.3 in order to consume life cycle management domain knowledge from it and offload generic OS configuration things from it, which hopefully benefits both sides.”</blockquote><p>So when you run Kubespray, it is not reinventing the wheel. It is using kubeadm to handle certificate generation, etcd configuration, and control plane bootstrapping. 
What Kubespray adds on top is everything else that kubeadm does not handle.</p><p>Think about what you still need to do manually when using kubeadm alone.</p><ul><li>Machine provisioning</li><li>OS pre-configuration like time synchronization, SELinux settings, swap disable, kernel parameters</li><li>Installing containerd and kubelet</li><li>Installing CNI plugins</li><li>Setting up load balancers for HA configuration</li></ul><p>Kubespray automates almost all of this; machine provisioning and external load balancers stay outside its scope. When you run the cluster.yml playbook, you get a complete cluster, not a half-finished one that requires more work.</p><p>Kubespray is not just for initial deployment. It handles the entire lifecycle of your cluster through different playbooks.</p><blockquote>For creating a new cluster: ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b</blockquote><blockquote>For upgrading the cluster to a new Kubernetes version: ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -b</blockquote><blockquote>For adding new worker nodes: ansible-playbook -i inventory/mycluster/inventory.ini scale.yml -b</blockquote><blockquote>For removing specific nodes: ansible-playbook -i inventory/mycluster/inventory.ini remove-node.yml -b -e node=worker-3</blockquote><blockquote>For completely resetting the cluster back to a clean state: ansible-playbook -i inventory/mycluster/inventory.ini reset.yml -b</blockquote><p>Each of these playbooks is idempotent thanks to Ansible. If your deployment fails halfway through, you can run the same command again: tasks whose state is already correct report no change, so the run effectively continues from where it left off. This is enormously valuable when you are dealing with flaky networks or transient failures.</p><p>One of Kubespray’s strongest features is its ability to deploy clusters anywhere. 
The same inventory structure and playbooks work whether you are deploying to AWS, GCP, Azure, a VMware vSphere environment, or bare metal servers in your data center.</p><p>For public cloud environments, the typical pattern is to use Terraform for infrastructure provisioning and then Kubespray for Kubernetes deployment:</p><blockquote>terraform apply<br>ansible-playbook cluster.yml</blockquote><p>The Kubespray repository includes Terraform samples for various cloud providers if you want to see how this integration works.</p><p>For air-gapped environments where there is no internet access, Kubespray provides offline deployment support. This requires pre-downloading container images and binaries, then hosting them on an internal registry and file server.</p><p>The group_vars files include settings for configuring registry mirrors and custom download URLs:</p><pre># In group_vars/all/containerd.yml<br>containerd_registries_mirrors:<br>  - prefix: docker.io<br>    mirrors:<br>      - host: https://registry.internal.company.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]</pre><p>The only hard requirements are that your Ansible control node can reach the target machines via SSH and that Python is installed on those machines.</p><p>Since Ansible is agentless and uses SSH, you do not need to install any agents or daemons on the cluster nodes before running Kubespray.</p><p>Production clusters need high availability. Kubespray supports HA configurations for both the control plane and etcd.</p><p>For control plane HA, you simply list multiple nodes in your inventory:</p><pre>[kube_control_plane]<br>controller-0<br>controller-1<br>controller-2<br><br>[etcd]<br>controller-0<br>controller-1<br>controller-2<br><br>[kube_node]<br>worker-0<br>worker-1<br>worker-2</pre><p>Kubespray will configure all three control plane nodes with the API server, controller manager, and scheduler. 
The controller manager and scheduler use leader election, so only one instance is active at a time while the others stand by.</p><p>For etcd, Kubespray deploys a proper cluster using the Raft consensus protocol. You should always use an odd number of etcd nodes, typically three or five, to ensure quorum can be reached even if one node fails.</p><p>Now here is something important to understand about load balancing. Kubespray automates client-side load balancing but not external load balancing. What does this mean?</p><p>Client-side load balancing means each worker node runs a local nginx or haproxy that proxies requests to the available API servers. This is configured automatically by Kubespray. The kubelet on each worker connects to localhost, and the local proxy distributes requests across the control plane nodes.</p><p>External load balancing means having a load balancer in front of your control plane that external clients like kubectl on your laptop can connect to. Kubespray does not automatically configure this because external load balancing is completely different depending on your environment:</p><ul><li>On AWS you would use an ELB or NLB</li><li>On GCP you would use a GCP Load Balancer</li><li>On bare metal you might use HAProxy with keepalived and a virtual IP</li><li>In some environments you might use MetalLB or kube-vip</li></ul><p>Kubespray is a tool for configuring software on top of your operating system. It is not an infrastructure provisioning tool. Configuring virtual IPs, DNS entries, and cloud load balancers is outside its scope.</p><p>The expectation is that you handle external load balancing through Terraform or manual configuration, and Kubespray handles everything from the OS level up.</p><p>That said, Kubespray does support kube-vip as an option for control plane VIP, which can provide external load balancing without requiring separate infrastructure. 
You can enable it in your addons configuration:</p><pre># In group_vars/k8s_cluster/addons.yml<br>kube_vip_enabled: true</pre><p>When you deploy a cluster with Kubespray, you get production-grade defaults without having to think about them. Remember all those manual steps from the kubeadm documentation about preparing your nodes? Kubespray handles them automatically.</p><p>Time synchronization is configured using chrony or ntp. In a distributed system like Kubernetes, clock skew between nodes causes all sorts of subtle problems with certificates, leases, and log correlation.</p><p>Kernel parameters are set correctly for container networking:</p><pre>net.bridge.bridge-nf-call-iptables = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward = 1</pre><p>The necessary kernel modules, overlay and br_netfilter, are loaded.</p><p>Swap is disabled, which kubelet requires in its default configuration.</p><p>These settings are defined in the kubernetes/preinstall role and applied automatically during cluster deployment. 
You do not need to SSH into each node and run sysctl commands manually.</p><p>When you run cluster.yml, here is what gets installed and configured on your nodes.</p><p>On all nodes:</p><ul><li>Kernel modules and sysctl parameters for container networking</li><li>Container runtime (containerd by default)</li><li>kubelet and kubectl binaries</li><li>CNI plugins</li></ul><p>On control plane nodes:</p><ul><li>kube-apiserver as a static pod</li><li>kube-controller-manager as a static pod</li><li>kube-scheduler as a static pod</li><li>etcd (either as a systemd service or static pod depending on configuration)</li></ul><p>On worker nodes:</p><ul><li>kube-proxy (as a DaemonSet)</li><li>Client-side load balancer for API server access (nginx or haproxy)</li></ul><p>Cluster-wide:</p><ul><li>CoreDNS for cluster DNS</li><li>CNI plugin DaemonSet (Calico, Flannel, etc.)</li><li>Optional addons like metrics-server, ingress controllers, helm</li></ul><p>The configuration for all of this lives in your inventory directory. Once you have a working inventory, deploying the same cluster configuration repeatedly is just a matter of running the playbook against new infrastructure. This is infrastructure as code applied to Kubernetes deployment.</p><h3>Chapter 3: Project Structure Overview</h3><p>When you first clone the Kubespray repository and run ls, you&#39;re greeted with a wall of files and directories that can feel overwhelming. But here&#39;s the thing — Kubespray follows standard Ansible project conventions, so once you understand how Ansible organizes things, the whole structure clicks into place. Let me walk you through it piece by piece.</p><p>First, let’s get a bird’s eye view. 
Clone the repository and take a look:</p><pre>git clone https://github.com/kubernetes-sigs/kubespray.git<br>cd kubespray<br>git checkout release-2.28<br>ls -la</pre><p>You’ll see something like this at the root level:</p><pre>kubespray/<br>├── ansible.cfg<br>├── cluster.yml<br>├── reset.yml<br>├── scale.yml<br>├── upgrade-cluster.yml<br>├── remove-node.yml<br>├── recover-control-plane.yml<br>├── inventory/<br>├── roles/<br>├── playbooks/<br>├── library/<br>├── docs/<br>└── ...</pre><p>Those YAML files sitting at the root — they’re your entry points. Each one represents a major cluster operation, and you’ll be running these directly with ansible-playbook.</p><p>The most important one is cluster.yml. This is what you run to create a new cluster from scratch. When you execute ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml, Kubespray kicks off the entire deployment process — from configuring the operating system to installing containerd, setting up etcd, deploying the Kubernetes control plane, joining worker nodes, installing CNI, and deploying addons like CoreDNS. We&#39;ll dissect this playbook in detail in Chapter 7.</p><p>Then there’s reset.yml. This one tears everything down. It&#39;s the nuclear option — it removes all Kubernetes components, wipes etcd data, and leaves your nodes in a clean state. I&#39;ve used this more times than I&#39;d like to admit during testing when something went sideways and I just needed a fresh start.</p><p>scale.yml is for adding new nodes to an existing cluster. Say you deployed with three workers and now you need five. You update your inventory file with the new nodes, then run scale.yml. It&#39;s smart enough to skip all the work that&#39;s already done on existing nodes and only configure the new ones.</p><p>remove-node.yml does the opposite. When you need to decommission a node — maybe the hardware is failing or you&#39;re downsizing — this playbook handles the graceful removal. 
It drains the node, removes it from the cluster, and cleans up.</p><p>upgrade-cluster.yml handles version upgrades. Kubernetes releases new versions every few months, and running this playbook walks your cluster through the upgrade process node by node, respecting the proper order (etcd first, then control plane, then workers).</p><p>recover-control-plane.yml is your emergency playbook. If a control plane node dies and you need to restore it, this is what you reach for.</p><p>Now, here’s something that might confuse you at first. You’ll notice some files exist in both hyphen and underscore versions:</p><pre>ls -la | grep -E &quot;remove|scale&quot;</pre><pre>remove-node.yml<br>remove_node.yml<br>scale.yml</pre><p>Both remove-node.yml and remove_node.yml exist. This isn&#39;t a mistake — it&#39;s backward compatibility. At some point, Kubespray standardized on kebab-case (hyphens), but they kept the old snake_case versions around so existing scripts and documentation wouldn&#39;t break. If you&#39;re writing new automation, use the hyphen versions.</p><h4>The inventory Directory</h4><p>This is where you define what your cluster looks like. Kubespray ships with a sample inventory that you’re meant to copy and customize:</p><pre>ls inventory/</pre><pre>local/<br>sample/</pre><p>The local/ directory is for single-node testing on your local machine. The sample/ directory is the template you&#39;ll copy for real deployments. 
Let&#39;s look inside:</p><pre>tree inventory/sample/</pre><pre>inventory/sample/<br>├── inventory.ini<br>└── group_vars/<br>    ├── all/<br>    │   ├── all.yml<br>    │   ├── aws.yml<br>    │   ├── azure.yml<br>    │   ├── containerd.yml<br>    │   ├── coreos.yml<br>    │   ├── cri-o.yml<br>    │   ├── docker.yml<br>    │   ├── etcd.yml<br>    │   ├── gcp.yml<br>    │   ├── hcloud.yml<br>    │   ├── huaweicloud.yml<br>    │   ├── oci.yml<br>    │   ├── offline.yml<br>    │   ├── openstack.yml<br>    │   ├── upcloud.yml<br>    │   └── vsphere.yml<br>    └── k8s_cluster/<br>        ├── addons.yml<br>        ├── k8s-cluster.yml<br>        ├── k8s-net-calico.yml<br>        ├── k8s-net-cilium.yml<br>        ├── k8s-net-custom-cni.yml<br>        ├── k8s-net-flannel.yml<br>        ├── k8s-net-kube-ovn.yml<br>        ├── k8s-net-kube-router.yml<br>        ├── k8s-net-macvlan.yml<br>        └── kube_control_plane.yml</pre><p>The inventory.ini file is where you list your actual hosts and assign them to groups. A typical production inventory might look like:</p><pre>[all]<br>node1 ansible_host=10.0.1.10 ip=10.0.1.10<br>node2 ansible_host=10.0.1.11 ip=10.0.1.11<br>node3 ansible_host=10.0.1.12 ip=10.0.1.12<br>node4 ansible_host=10.0.1.20 ip=10.0.1.20<br>node5 ansible_host=10.0.1.21 ip=10.0.1.21<br><br>[kube_control_plane]<br>node1<br>node2<br>node3<br><br>[etcd:children]<br>kube_control_plane<br><br>[kube_node]<br>node4<br>node5<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>Notice the [etcd:children] syntax. This is Ansible&#39;s way of creating nested groups. Instead of listing hosts directly under [etcd], we&#39;re saying &quot;the etcd group contains all hosts from the kube_control_plane group.&quot; This pattern is convenient when your etcd nodes and control plane nodes are the same machines, which is the common &quot;stacked etcd&quot; topology.</p><p>The group_vars/ directory is where the magic happens. 
Ansible automatically loads variables from files in this directory based on which groups a host belongs to. Files under group_vars/all/ apply to every host. Files under group_vars/k8s_cluster/ apply only to hosts in the k8s_cluster group.</p><p>This is why you see so many YAML files in there. containerd.yml has container runtime settings. etcd.yml has etcd-specific configuration. k8s-cluster.yml has core Kubernetes settings like the network plugin, service CIDR, and pod CIDR. The cloud provider files (aws.yml, gcp.yml, azure.yml) contain settings that only matter if you&#39;re deploying to those environments; they&#39;re mostly commented out by default.</p><h4>The roles Directory</h4><p>Here’s where the actual work gets defined. Kubespray is essentially a collection of Ansible roles, and each role handles a specific piece of the cluster setup:</p><pre>ls roles/</pre><pre>adduser/<br>bootstrap_os/<br>bootstrap-os/<br>container-engine/<br>download/<br>etcd/<br>etcdctl_etcdutl/<br>kubernetes/<br>kubernetes-apps/<br>kubespray_defaults/<br>kubespray-defaults/<br>network_plugin/<br>recover_control_plane/<br>remove-node/<br>reset/<br>upgrade/<br>win_nodes/<br>...</pre><p>Let me explain the key ones.</p><p>bootstrap-os handles the initial operating system configuration. It installs Python (which Ansible needs), prepares the package manager, and installs the packages that Ansible&#39;s own modules depend on. The kernel modules, sysctl parameters, and swap handling covered earlier are applied afterwards by the Kubernetes roles, mainly kubernetes/preinstall, once bootstrap-os has made the host manageable. Together, these roles automate the tedious pre-work you&#39;d do manually before running kubeadm.</p><p>container-engine installs the container runtime. By default that&#39;s containerd, but Kubespray supports Docker (via cri-dockerd) and CRI-O as well. This role downloads the binaries, creates the configuration files, sets up systemd units, and ensures the runtime is running.</p><p>etcd deploys the etcd cluster. 
Depending on your settings, it either installs etcd as a systemd service directly on the host or lets kubeadm manage it as a static pod. It handles certificate generation, cluster membership, and health checking.</p><p>kubernetes is actually a directory containing multiple sub-roles:</p><pre>ls roles/kubernetes/</pre><pre>control-plane/<br>kubeadm/<br>node/<br>node-label/<br>node-taint/<br>preinstall/</pre><p>The preinstall sub-role does Kubernetes-specific preparation — things like creating the kubernetes user, setting up directories, and configuring kubelet. The node sub-role installs kubelet, kubectl, and kubeadm on all nodes. The control-plane sub-role runs the kubeadm init process on control plane nodes. The kubeadm sub-role handles both the initial cluster bootstrap and joining additional nodes.</p><p>network_plugin installs your chosen CNI. Look inside and you&#39;ll see sub-directories for each supported CNI:</p><pre>ls roles/network_plugin/</pre><pre>calico/<br>cilium/<br>flannel/<br>kube-ovn/<br>kube-router/<br>macvlan/<br>multus/<br>weave/<br>...</pre><p>When you set kube_network_plugin: calico in your group_vars, Kubespray runs the tasks in roles/network_plugin/calico/. Switch it to flannel, and it runs roles/network_plugin/flannel/ instead.</p><p>kubernetes-apps deploys the addons — CoreDNS for cluster DNS, metrics-server for resource metrics, helm if you enable it, ingress controllers, cert-manager, and so on.</p><p>download is a critical role that handles fetching all the binaries and container images. It&#39;s designed to be idempotent and can work in both online and offline (air-gap) scenarios.</p><p>Now, about those confusingly similar names — kubespray_defaults versus kubespray-defaults. These are not duplicates. They serve different purposes, and understanding this distinction matters.</p><p>kubespray_defaults (with underscore) contains the actual variable definitions. 
Look at its structure:</p><pre>tree roles/kubespray_defaults/</pre><pre>roles/kubespray_defaults/<br>├── defaults/<br>│   └── main/<br>│       ├── download.yml<br>│       └── main.yml<br>└── vars/<br>    └── main/<br>        ├── checksums.yml<br>        └── main.yml</pre><p>The defaults/main/main.yml file alone is over 800 lines. It defines default values for nearly every configurable aspect of Kubespray — Kubernetes version, network settings, paths, timeouts, feature flags. The defaults/main/download.yml adds another 1100+ lines covering download URLs and version mappings. The vars/main/checksums.yml contains SHA256 checksums for all downloadable binaries — these are in the vars/ directory (higher priority) because you really shouldn&#39;t be changing checksums unless you know exactly what you&#39;re doing.</p><p>kubespray-defaults (with hyphen), on the other hand, contains tasks:</p><pre>tree roles/kubespray-defaults/</pre><pre>roles/kubespray-defaults/<br>└── tasks/<br>    └── main.yml</pre><p>This role’s job is to load and process those variables. It’s a pattern you’ll see throughout Kubespray — separate the data (variables) from the logic (tasks).</p><p>You’ll find similar underscore/hyphen pairs elsewhere. bootstrap_os holds variables while bootstrap-os holds tasks. It&#39;s a bit confusing at first, but once you recognize the pattern, it makes navigating the codebase easier.</p><h4>The playbooks Directory</h4><p>Remember those root-level playbooks like cluster.yml? They’re mostly thin wrappers that import the real playbooks from this directory:</p><pre>ls playbooks/</pre><pre>ansible_version.yml<br>cluster.yml<br>facts.yml<br>install_etcd.yml<br>reset.yml<br>scale.yml<br>upgrade_cluster.yml<br>...</pre><p>When you run the root cluster.yml, it imports playbooks/cluster.yml which contains the actual play definitions. 
This separation keeps the root directory clean while allowing the playbook logic to be more complex.</p><p>Let’s peek at what playbooks/cluster.yml actually does:</p><pre>head -30 playbooks/cluster.yml</pre><pre>---<br>- name: Check Ansible version<br>  import_playbook: ansible_version.yml<br><br>- name: Add kube-master nodes to kube_control_plane<br>  # Backward compatibility<br>  import_playbook: legacy_groups.yml<br><br>- name: Common tasks for every playbooks<br>  import_playbook: boilerplate.yml<br><br>- name: Gather facts<br>  import_playbook: facts.yml<br><br>- name: Prepare for etcd install<br>  hosts: &quot;{{ hostvars[groups[&#39;etcd&#39;][0]][&#39;etcd_retries&#39;] | default(groups[&#39;etcd&#39;][0]) }}:kube_control_plane&quot;<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, tags: preinstall }<br>    - { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>    - { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>You can see it’s a sequence of import_playbook statements and play definitions. The import paths are relative to the playbooks/ directory, which is why they carry no prefix. The structure is methodical: check Ansible version, handle legacy group names, run common boilerplate, gather facts, then proceed through each phase of cluster setup.</p><p>Here’s the complete picture of how these pieces connect when you run a deployment:</p><pre>You run:<br>  ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml<br><br>Ansible reads:<br>  1. ansible.cfg (from current directory)<br>  2. inventory/mycluster/inventory.ini (hosts and groups)<br>  3. inventory/mycluster/group_vars/all/*.yml (global variables)<br>  4. 
inventory/mycluster/group_vars/k8s_cluster/*.yml (cluster variables)<br><br>cluster.yml imports and runs:<br>  → playbooks/ansible_version.yml (version check)<br>  → playbooks/boilerplate.yml (common setup)<br>  → playbooks/facts.yml (gather system info)<br>  → plays that execute roles:<br>      → roles/kubespray_defaults (load variables)<br>      → roles/bootstrap-os (OS setup)<br>      → roles/container-engine (containerd)<br>      → roles/etcd (etcd cluster)<br>      → roles/kubernetes/node (kubelet)<br>      → roles/kubernetes/control-plane (kubeadm init)<br>      → roles/network_plugin/calico (or flannel, etc.)<br>      → roles/kubernetes-apps (addons)</pre><p>Each role follows the standard Ansible role structure:</p><pre>roles/some-role/<br>├── defaults/<br>│   └── main.yml      # Default variable values (lowest priority)<br>├── vars/<br>│   └── main.yml      # Role variables (higher priority)<br>├── tasks/<br>│   └── main.yml      # The actual work<br>├── templates/<br>│   └── config.j2     # Jinja2 templates<br>├── handlers/<br>│   └── main.yml      # Handlers (restart services, etc.)<br>└── files/<br>    └── something     # Static files to copy</pre><p>The roles execute in dependency order. You can’t install Kubernetes before containerd is running. You can’t join worker nodes before the control plane is up. 
Kubespray handles this orchestration through careful ordering in the playbooks.</p><p>For a quick reference, here’s how to list what’s in each major directory:</p><pre># Count files in each major area<br>echo &quot;Roles:&quot; &amp;&amp; ls roles/ | wc -l<br>echo &quot;Role defaults files:&quot; &amp;&amp; find roles -path &quot;*/defaults/*.yml&quot; | wc -l  <br>echo &quot;Role vars files:&quot; &amp;&amp; find roles -path &quot;*/vars/*.yml&quot; | wc -l<br>echo &quot;Sample group_vars/all:&quot; &amp;&amp; ls inventory/sample/group_vars/all/ | wc -l<br>echo &quot;Sample group_vars/k8s_cluster:&quot; &amp;&amp; ls inventory/sample/group_vars/k8s_cluster/ | wc -l</pre><p>On the 2.28 release, you’ll see something like:</p><pre>Roles: 40+<br>Role defaults files: 77<br>Role vars files: 50<br>Sample group_vars/all: 16<br>Sample group_vars/k8s_cluster: 10</pre><p>That’s a lot of moving parts, but they’re organized logically. When you need to understand or modify something specific — say, how containerd gets configured — you know to look in roles/container-engine/containerd/. When you want to change a default timeout or version, you check roles/kubespray_defaults/defaults/. When you want to customize your deployment, you edit files in inventory/mycluster/group_vars/.</p><p>The key insight is that Kubespray isn’t magic. It’s just a well-organized collection of Ansible automation following established patterns. Once you internalize the structure, you can navigate it confidently, troubleshoot issues effectively, and customize it for your environment without fear of breaking things you don’t understand.</p><h3>Chapter 4: ansible.cfg Configuration Analysis</h3><p>When you run ansible-playbook from the Kubespray directory, Ansible doesn’t just use its global defaults. It picks up a project-specific configuration file that Kubespray ships with. 
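</p><p>Which file gets picked up depends on where you run the command. A simplified Python model of that lookup, assuming the search order Ansible documents; find_config is a hypothetical name, and this is an illustration, not Ansible&#39;s own code:</p><pre>import os
from pathlib import Path


def find_config(env=None, cwd=None, home=None, system="/etc/ansible/ansible.cfg"):
    """Return the first candidate config file that exists, or None."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)

    candidates = []
    if env.get("ANSIBLE_CONFIG"):
        candidates.append(Path(env["ANSIBLE_CONFIG"]))  # 1. explicit override
    candidates.append(cwd / "ansible.cfg")              # 2. current directory
    candidates.append(home / ".ansible.cfg")            # 3. per-user file
    candidates.append(Path(system))                     # 4. system-wide file

    for path in candidates:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(find_config())</pre><p>Run from the Kubespray checkout with ANSIBLE_CONFIG unset, this sketch should return the ansible.cfg in that directory. Run from anywhere else, it falls through to a per-user or system-wide file, or to None. Real Ansible applies extra checks the sketch leaves out, such as refusing a config file in a world-writable directory.</p><p>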
Understanding this file is important because it controls how Ansible behaves during cluster deployment — things like SSH connection handling, fact caching, output formatting, and performance optimizations.</p><p>Ansible looks for configuration files in a specific order. First, it checks if you’ve set the ANSIBLE_CONFIG environment variable. If that’s not set, it looks for ansible.cfg in your current working directory. Failing that, it checks your home directory for .ansible.cfg, and finally falls back to the system-wide /etc/ansible/ansible.cfg.</p><p>This is exactly why every Kubespray command in the documentation tells you to run ansible-playbook from the kubespray directory. If you run it from somewhere else, Ansible won’t find Kubespray’s ansible.cfg and will use different settings, which can lead to unexpected behavior or failures.</p><p>Let’s look at what Kubespray’s ansible.cfg actually contains:</p><pre>[ssh_connection]<br>pipelining=True<br>ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null<br>#control_path = ~/.ssh/ansible-%%r@%%h:%%p<br><br>[defaults]<br>force_valid_group_names = ignore<br>host_key_checking=False<br>gathering = smart<br>fact_caching = jsonfile<br>fact_caching_connection = /tmp<br>fact_caching_timeout = 86400<br>timeout = 300<br>stdout_callback = default<br>display_skipped_hosts = no<br>library = ./library<br>callbacks_enabled = profile_tasks<br>roles_path = roles:$VIRTUAL_ENV/usr/local/share/kubespray/roles:$VIRTUAL_ENV/usr/local/share/ansible/roles:/usr/share/kubespray/roles<br>deprecation_warnings=False<br>inventory_ignore_extensions = ~, .orig, .bak, .ini, .cfg, .retry, .pyc, .pyo, .creds, .gpg<br><br>[inventory]<br>ignore_patterns = artifacts, credentials</pre><p>That’s a lot to unpack. Let’s go through each section.</p><h4>The SSH Connection Section</h4><p>The [ssh_connection] section controls how Ansible connects to your target nodes. 
Kubespray deploys to multiple machines and runs hundreds of tasks, so SSH performance matters a lot here.</p><p>The first setting is pipelining=True. Without pipelining, Ansible runs each task in several SSH operations: it creates a temporary directory on the remote machine, copies the module file over, executes it, and cleans up. With pipelining enabled, Ansible instead streams the module to the remote Python interpreter over the SSH session, which cuts the number of SSH operations per task. For a playbook with 500+ tasks running across multiple nodes, this makes a real difference.</p><p>The ssh_args line packs several SSH client options together:</p><pre>ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null</pre><p><strong>ControlMaster=auto</strong> enables SSH connection multiplexing. When you first connect to a host, SSH creates a master connection. Subsequent connections to the same host reuse this master connection instead of establishing new ones. This eliminates the authentication overhead for each connection.</p><p><strong>ControlPersist=30m</strong> keeps that master connection alive for 30 minutes after the last session ends. During a Kubespray run, Ansible connects to the same nodes over and over. Keeping the master connection alive means those reconnections are nearly instant.</p><p><strong>ConnectionAttempts=100</strong> tells SSH to try up to 100 times, one attempt per second, before giving up on a connection. This sounds excessive, but network hiccups happen, especially in cloud environments. You don’t want your 20-minute deployment to fail because of a momentary network blip.</p><p>The last one, <strong>UserKnownHostsFile=/dev/null</strong>, is worth understanding properly. Normally, SSH maintains a file called known_hosts in your ~/.ssh directory. The first time you connect to a server, SSH shows you a fingerprint and asks if you want to trust this host. If you say yes, it saves the fingerprint to known_hosts. 
On subsequent connections, SSH checks that the fingerprint matches. If it doesn’t match, SSH refuses to connect because this could indicate a man-in-the-middle attack.</p><p>Setting <strong>UserKnownHostsFile=/dev/null</strong> means SSH writes the fingerprint to /dev/null — the black hole device that discards everything written to it. The fingerprint is never saved, so every connection is treated as a first-time connection. Combined with host_key_checking=False (which we’ll see in the defaults section), this means Ansible never prompts you to verify host keys and never complains about changed fingerprints.</p><p>Why would Kubespray want this? In cloud and VM environments, you often tear down and recreate machines. The same IP address might point to completely different machines over time. If Ansible kept checking host keys, you’d constantly get “HOST KEY VERIFICATION FAILED” errors and have to manually clean up your known_hosts file. For automated deployment, this is a practical necessity.</p><p>That said, this does reduce security. In a production environment where hosts are long-lived and you want to detect if someone is intercepting your SSH traffic, you might want to reconsider this setting. But for most Kubespray use cases — spinning up clusters, tearing them down, trying different configurations — the convenience outweighs the security tradeoff.</p><p>The control_path line is commented out:</p><pre>#control_path = ~/.ssh/ansible-%%r@%%h:%%p</pre><p>This would specify where SSH stores its control socket files for connection multiplexing. The %% patterns expand to the remote user, hostname, and port. Kubespray leaves this commented because Ansible manages control paths automatically, and the default behavior works fine for most setups.</p><h4>The Defaults Section</h4><p>The [defaults] section contains most of the interesting configuration. Let’s break it down.</p><pre>force_valid_group_names = ignore</pre><p>This one has a story behind it. 
In Ansible 2.8, the developers added stricter validation for inventory group names. They decided that group names should follow Python variable naming rules — no hyphens, no dots. The reasoning was that group names sometimes get used in Jinja2 templates or as variable names, and special characters could cause problems.</p><p>The thing is, Kubernetes uses hyphens everywhere. Namespaces like kube-system, pod names like coredns-12345, node names with hyphens. If you’re using Kubespray with an inventory that reflects your Kubernetes naming conventions, you’ll have groups with hyphens in their names.</p><p>When Ansible 2.8 came out, people started getting warnings about invalid group names. Some people’s playbooks broke entirely. There was a GitHub issue (#56930) where users complained, pointing out that group names aren’t variable names — they’re just strings that happen to be used as dictionary keys. The Kubernetes ecosystem had already standardized on hyphen-separated names, and forcing everyone to use underscores instead wasn’t practical.</p><p>Setting force_valid_group_names = ignore tells Ansible to accept any group name without complaining. Kubespray needs this to work seamlessly with Kubernetes naming conventions.</p><pre>host_key_checking=False</pre><p>This complements the UserKnownHostsFile=/dev/null setting from the SSH section. When SSH connects to a new host, it normally asks “Are you sure you want to continue connecting?” and waits for you to type yes. With host_key_checking disabled, Ansible automatically accepts new host keys without prompting. Combined with the /dev/null known_hosts file, this makes SSH connections completely non-interactive.</p><pre>gathering = smart<br>fact_caching = jsonfile<br>fact_caching_connection = /tmp<br>fact_caching_timeout = 86400</pre><p>These four lines work together to optimize fact gathering. 
Facts are system information that Ansible collects from each host — things like the operating system, IP addresses, memory size, CPU count. Collecting facts requires running commands on each host, which takes time.</p><p>With gathering = smart, Ansible only collects facts when it doesn’t already have them cached. The fact_caching = jsonfile setting tells Ansible to cache facts as JSON files. fact_caching_connection = /tmp specifies where to store those JSON files. And fact_caching_timeout = 86400 means cached facts are valid for 86400 seconds — that’s 24 hours.</p><p>During a Kubespray run, Ansible connects to the same hosts multiple times across different plays. Without caching, it would collect facts every single time, adding minutes to the total runtime. With smart gathering and caching, facts are collected once and reused throughout the run.</p><pre>timeout = 300</pre><p>This sets the SSH connection timeout to 300 seconds (5 minutes). If a host doesn’t respond within 5 minutes, Ansible gives up on that connection. This is pretty generous, but it accommodates slow networks or hosts that take a while to become available.</p><pre>stdout_callback = default<br>display_skipped_hosts = no</pre><p>These control how Ansible displays output. The default callback gives you the standard Ansible output format. You could change this to yaml or json if you preferred different formatting.</p><p>display_skipped_hosts = no is a nice quality-of-life setting. Kubespray has many tasks that only run on certain host types. A task that configures etcd will be skipped on worker nodes. A task that installs kubelet runs everywhere. With display_skipped_hosts enabled, you’d see endless “skipping: [worker-1]” messages cluttering your output. Disabling it keeps the output focused on what’s actually happening.</p><pre>library = ./library</pre><p>Ansible lets you write custom modules when the built-in ones don’t do what you need. 
Kubespray includes a custom module called kube.py in its library directory. This module wraps kubectl commands, letting Ansible tasks manage Kubernetes resources directly. The library = ./library setting tells Ansible to look in that directory for custom modules.</p><pre>callbacks_enabled = profile_tasks</pre><p>This enables the profile_tasks callback plugin. When your playbook finishes, you’ll see a summary showing how long each task took:</p><pre>PLAY RECAP ****<br>Tuesday 28 January 2026  15:23:45 +0900 (0:00:02.456)<br><br>===============================================================================<br>Download containerd -------------------------------- 45.23s<br>Configure kubelet ---------------------------------- 12.67s<br>Download images ------------------------------------ 89.45s</pre><p>This is incredibly useful for understanding where time goes during deployment. If your cluster deployment is taking too long, this output tells you which tasks are the bottleneck.</p><pre>roles_path = roles:$VIRTUAL_ENV/usr/local/share/kubespray/roles:$VIRTUAL_ENV/usr/local/share/ansible/roles:/usr/share/kubespray/roles</pre><p>This tells Ansible where to look for roles, in order of priority. The paths are separated by colons. Ansible checks the first path first, then moves on if it doesn’t find the role.</p><p>The first path, roles, is the local roles directory in the Kubespray project. This is where all of Kubespray’s roles live, and it has the highest priority.</p><p>The paths containing $VIRTUAL_ENV are for when you install Kubespray as a Python package into a virtual environment. The $VIRTUAL_ENV variable gets set automatically when you activate a Python virtual environment. 
If you’re just running Kubespray from a git clone (which is the most common approach), these paths won’t match anything and get ignored.</p><p>The last path, /usr/share/kubespray/roles, is for system-wide installations.</p><pre>deprecation_warnings=False</pre><p>Ansible loves to warn you about deprecated features. While these warnings are useful when you’re developing playbooks, they add noise when you’re just running Kubespray. Kubespray’s maintainers keep the playbooks updated, so you can trust that deprecated features will be fixed in future releases.</p><pre>inventory_ignore_extensions = ~, .orig, .bak, .ini, .cfg, .retry, .pyc, .pyo, .creds, .gpg</pre><p>When Ansible scans your inventory directory, it ignores files with these extensions. This prevents backup files, compiled Python files, and credential files from being accidentally parsed as inventory.</p><h4>The Inventory Section</h4><pre>[inventory]<br>ignore_patterns = artifacts, credentials</pre><p>Similar to inventory_ignore_extensions, this tells Ansible to skip directories matching these patterns when scanning for inventory files. The artifacts directory typically contains deployment outputs, and credentials contains sensitive files. Neither should be treated as inventory sources.</p><h4>Customizing ansible.cfg</h4><p>Kubespray’s default ansible.cfg works well for most situations, but you might want to adjust it for your environment. 
The key is understanding that you can override settings without modifying the original file.</p><p>One approach is to use the ANSIBLE_CONFIG environment variable:</p><pre>ANSIBLE_CONFIG=ansible.cfg.custom ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml</pre><p>You could maintain different configuration files for different environments:</p><pre>kubespray/<br>├── ansible.cfg           # default<br>├── ansible.cfg.dev       # development (more verbose, shorter timeouts)<br>└── ansible.cfg.prod      # production (stricter security settings)</pre><p>Some settings you might want to change:</p><p>For a more secure environment where you do want host key checking:</p><pre>[defaults]<br>host_key_checking = True<br><br>[ssh_connection]<br># Removed UserKnownHostsFile=/dev/null<br>ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100</pre><p>For CI/CD pipelines where you need machine-parseable output:</p><pre>[defaults]<br>stdout_callback = json</pre><p>For debugging, to see which hosts are being skipped:</p><pre>[defaults]<br>display_skipped_hosts = yes</pre><p>For large clusters where you want more parallelism:</p><pre>[defaults]<br># default is 5<br>forks = 20</pre><p>The forks setting controls how many hosts Ansible manages simultaneously. The default of 5 is conservative. If you have the network bandwidth and your control machine has enough resources, increasing this can speed up deployments to large clusters significantly.</p><p>If you modify ansible.cfg, keep a few things in mind. First, always keep a copy of the original so you can revert if something breaks. Second, document your changes with comments explaining why you made them, and put each comment on its own line: in ansible.cfg a # only starts a comment at the beginning of a line, so a trailing # comment becomes part of the value. Third, if you’re version-controlling your Kubespray setup, include your modified ansible.cfg so team members get the same behavior.</p><p>The ansible.cfg file might seem like a minor detail compared to the actual playbooks and roles, but these settings affect every single task Ansible runs. 
Understanding them helps you troubleshoot problems, optimize performance, and adapt Kubespray to your specific environment.</p><h3>Chapter 5: Variable System and Precedence Strategy</h3><p>If you have ever spent hours debugging why a variable in Kubespray does not behave the way you expected, you are not alone. Kubespray has hundreds of variables spread across dozens of files, and understanding where to look and what takes priority over what is essential before you start customizing anything. This chapter will walk you through Ansible’s variable precedence system and show you exactly how Kubespray leverages it to create a maintainable, layered configuration architecture.</p><p>The fundamental principle is simple. Ansible has a 22-level priority system, and Kubespray deliberately places its variables at specific levels so that you, the operator, can override them without touching the core codebase. Once you understand this design, everything clicks into place.</p><p>Ansible evaluates variables from multiple sources and applies them in a strict order. When the same variable appears in multiple places, the one with higher priority wins. Here is the complete list, numbered from lowest to highest priority:</p><pre>1.  command line values (for example, -u my_user)<br>2.  role defaults (roles/*/defaults/main.yml)<br>3.  inventory file or script group vars<br>4.  inventory group_vars/all<br>5.  playbook group_vars/all<br>6.  inventory group_vars/*<br>7.  playbook group_vars/*<br>8.  inventory file or script host vars<br>9.  inventory host_vars/*<br>10. playbook host_vars/*<br>11. host facts / cached set_facts<br>12. play vars<br>13. play vars_prompt<br>14. play vars_files<br>15. role vars (roles/*/vars/main.yml)<br>16. block vars<br>17. task vars<br>18. include_vars<br>19. set_facts / registered vars<br>20. role parameters<br>21. include parameters<br>22. extra vars (-e) (always win)</pre><p>The magic number to remember is 22. 
Extra vars passed via the command line with -e always win. The second thing to remember is that role defaults sit at priority 2, the lowest level that actually holds variables (level 1 covers command line options such as -u, which are connection settings rather than variables). This is not an accident. Ansible designed it this way so that role authors could provide sensible defaults while making it trivial for users to override them from almost anywhere.</p><p>If you visualize this as a stack, it looks like this:</p><pre>┌─────────────────────────┐<br>│   extra_vars (-e)       │  22 ← highest priority<br>├─────────────────────────┤<br>│   task vars             │  17<br>├─────────────────────────┤<br>│   role/vars/            │  15<br>├─────────────────────────┤<br>│   play vars             │  12<br>├─────────────────────────┤<br>│   host_vars/            │  9<br>├─────────────────────────┤<br>│   group_vars/*          │  6<br>├─────────────────────────┤<br>│   group_vars/all        │  4<br>├─────────────────────────┤<br>│   role/defaults/        │  2  ← lowest priority<br>└─────────────────────────┘</pre><p>When you set kube_version: v1.30.0 in a role&#39;s defaults and then set kube_version: v1.31.0 in your inventory&#39;s group_vars, the group_vars value wins because priority 4 or 6 beats priority 2. If you then run the playbook with -e &quot;kube_version=v1.32.0&quot;, that value wins over everything because priority 22 is king.</p><h4>Mapping Precedence to Kubespray Files</h4><p>Now let us connect these abstract priority levels to actual files in the Kubespray repository. This mapping is crucial for debugging and for knowing where to put your customizations.</p><p>At priority level 2, you have role defaults. In Kubespray, these live in paths like:</p><pre>roles/kubespray_defaults/defaults/main/main.yml<br>roles/kubespray_defaults/defaults/main/download.yml<br>roles/etcd/defaults/main.yml<br>roles/kubernetes/node/defaults/main.yml</pre><p>The kubespray_defaults role alone contains over 800 lines of variables in main.yml and over 1100 lines in download.yml. 
These are the sensible defaults that Kubespray provides out of the box.</p><p>At priority level 4, you have inventory group_vars/all. In a typical Kubespray setup, this maps to:</p><pre>inventory/mycluster/group_vars/all/all.yml<br>inventory/mycluster/group_vars/all/etcd.yml<br>inventory/mycluster/group_vars/all/containerd.yml</pre><p>At priority level 6, you have inventory group_vars for specific groups:</p><pre>inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>inventory/mycluster/group_vars/k8s_cluster/addons.yml<br>inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml</pre><p>At priority level 15, you have role vars. These are internal fixed values that Kubespray does not want you to accidentally override:</p><pre>roles/kubespray_defaults/vars/main/checksums.yml<br>roles/kubespray_defaults/vars/main/main.yml</pre><p>Let me show you how many files exist at each level. Run these commands from the Kubespray root directory:</p><pre>find roles -type f -name &quot;*.yml&quot; | grep &quot;/defaults/&quot; | wc -l</pre><p>This returns around 77 files. These are all the default variable files across every role.</p><pre>find inventory/sample -path &quot;*/group_vars/all/*.yml&quot; | wc -l</pre><p>This returns around 16 files in the sample inventory’s all directory.</p><pre>find inventory/sample -path &quot;*/group_vars/k8s_cluster/*.yml&quot; | wc -l</pre><p>This returns around 10 files for cluster-specific settings.</p><pre>find roles -type f -name &quot;*.yml&quot; | grep &quot;/vars/&quot; | wc -l</pre><p>This returns around 50 files containing internal fixed values.</p><h4>Kubespray’s Intentional Design</h4><p>Here is where it gets interesting. Kubespray deliberately places variables at specific priority levels to create a clean separation of concerns.</p><p>Role defaults at priority 2 contain the base configuration. These are values that work for most deployments. 
The Kubespray maintainers chose priority 2 because it is the lowest level, meaning you can override these values from almost anywhere.</p><p>Inventory group_vars at priority 4 and 6 are where you, the operator, put your customizations. When you copy the sample inventory to create your own cluster configuration, you are creating your override layer. Anything you define here beats the role defaults.</p><p>Role vars at priority 15 contain values that should not be overridden casually. The checksums.yml file is a perfect example. It contains SHA256 hashes for every binary that Kubespray downloads. If you accidentally override a checksum, the integrity verification fails and your deployment breaks. By placing checksums at priority 15, Kubespray ensures that your group_vars settings at priority 4–6 cannot accidentally clobber them.</p><p>The flow looks like this:</p><pre>[role defaults] ──override──→ [inventory group_vars] ──override──→ [extra_vars]<br>  base values                     your settings                   temporary settings</pre><p>Let me show you a concrete example. The variable override_system_hostname controls whether Kubespray sets the hostname on your nodes. In the role defaults, you will find:</p><pre># roles/bootstrap_os/defaults/main.yml<br>override_system_hostname: true</pre><p>This is the default behavior. If you want to disable it for your cluster, you add this to your inventory:</p><pre># inventory/mycluster/group_vars/all/all.yml<br>override_system_hostname: false</pre><p>Your value at priority 4 overrides the default at priority 2. You did not have to modify any Kubespray source code. 
When you upgrade Kubespray to a newer version, your customization remains intact in your inventory directory while the role defaults get updated.</p><h4>The kubespray_defaults Role</h4><p>The kubespray_defaults role deserves special attention because it is the single largest collection of variables in the entire project.</p><p>Look at its structure:</p><pre>roles/kubespray_defaults/<br>├── defaults/main/<br>│   ├── main.yml        # 801 lines - core defaults<br>│   └── download.yml    # 1139 lines - download URLs, versions<br>└── vars/main/<br>    ├── main.yml        # internal paths<br>    └── checksums.yml   # binary integrity checksums</pre><p>The defaults/main/main.yml file contains the core configuration for your cluster. Open it and you will see variables like:</p><pre>kube_version: v1.31.0<br>container_manager: containerd<br>kube_network_plugin: calico<br>kube_proxy_mode: ipvs<br>etcd_deployment_type: host</pre><p>The defaults/main/download.yml file contains download URLs and version mappings for every component. It tells Kubespray where to fetch containerd, runc, crictl, etcd, and dozens of other binaries.</p><p>The vars/main/checksums.yml file is massive. It contains checksums for every supported version of every component, organized by architecture.</p><p>A small excerpt looks like:</p><pre>crictl_checksums:<br>  amd64:<br>    v1.29.0: sha256:abc123...<br>    v1.30.0: sha256:def456...<br>  arm64:<br>    v1.29.0: sha256:789abc...<br>    v1.30.0: sha256:012def...</pre><p>The reason checksums live in vars/ instead of defaults/ is protection. At priority 15, they are harder to accidentally override. If you need to add a new version that Kubespray does not support yet, you would need to use extra_vars at priority 22 or modify the file directly. This friction is intentional.</p><h4>Play Vars and Task Vars in Kubespray</h4><p>Beyond role defaults and group_vars, Kubespray also uses play vars and task vars in specific situations. 
These are less common but worth understanding.</p><p>Play vars appear at priority 12. In Kubespray, they are used sparingly. I searched through the playbooks and found only four variables defined this way:</p><pre># playbooks/cluster.yml (around line 20-22)<br>- name: Install etcd<br>  vars:<br>    etcd_cluster_setup: true<br>    etcd_events_cluster_setup: &quot;{{ etcd_events_cluster_enabled }}&quot;<br>  import_playbook: install_etcd.yml</pre><p>The same pattern appears in upgrade_cluster.yml and scale.yml. Why does Kubespray define etcd_cluster_setup as a play var when there is already a default value in the role?</p><p>The answer lies in scale.yml. When you add a new node to an existing cluster, you do not want to reinstall etcd. The scale playbook sets etcd_cluster_setup: false to skip etcd installation. The cluster playbook sets it to true to perform the installation. By using play vars, Kubespray can reuse the same install_etcd.yml playbook with different behavior depending on which parent playbook invoked it.</p><pre># playbooks/scale.yml<br>- name: Install etcd<br>  vars:<br>    etcd_cluster_setup: false<br>    etcd_events_cluster_setup: false<br>  import_playbook: install_etcd.yml</pre><p>The ansible_version.yml playbook uses play vars to define minimum and maximum supported Ansible versions:</p><pre># playbooks/ansible_version.yml<br>- name: Check Ansible version<br>  hosts: localhost<br>  vars:<br>    minimal_ansible_version: &quot;2.14.0&quot;<br>    maximal_ansible_version: &quot;2.17.99&quot;</pre><p>Task vars at priority 17 appear more frequently but serve a narrow purpose. 
Most task vars in Kubespray are parameters passed to included tasks.</p><p>For example:</p><pre># roles/container-engine/containerd/tasks/main.yml<br>- name: Download containerd<br>  include_tasks: &quot;../download/tasks/download_file.yml&quot;<br>  vars:<br>    download: &quot;{{ containerd_download }}&quot;</pre><p>The download variable is scoped to that specific task inclusion. It does not override any global variable because nothing else uses a variable simply named download at higher levels. This is a parameter-passing pattern, not a global override pattern.</p><p>I checked whether any task vars conflict with group_vars or defaults. They do not. The variable names used in task vars like download, iface, and etcd_peer_addresses are local parameters that exist only within the scope of that task.</p><h4>Variable Categories from Documentation</h4><p>The official Kubespray documentation at docs/ansible/vars.md categorizes variables into several groups. While I recommend reading the actual files rather than memorizing documentation, knowing these categories helps when you are searching for a specific setting.</p><p>Generic Ansible Variables are facts that Ansible collects automatically. The most commonly referenced one is:</p><pre>ansible_default_ipv4.address</pre><p>This is the IP address that Ansible detects as your node’s primary address. Kubespray uses this as a fallback when you do not specify the ip variable in your inventory. You can see how Ansible determines this by running:</p><pre>ip -4 route get 8.8.8.8</pre><p>On a typical VirtualBox VM, this returns the NAT interface (10.0.2.15), which is why you must explicitly set the ip variable in multi-NIC environments.</p><p>Common Vars are the variables you will modify most often:</p><pre>kube_version: v1.31.0<br>kube_network_plugin: calico<br>kube_proxy_mode: ipvs<br>container_manager: containerd<br>etcd_version: v3.5.25</pre><p>Container runtime variables depend on which runtime you choose. 
If you set container_manager: containerd, Kubespray uses containerd_version. If you set container_manager: docker, it uses docker_version and docker_containerd_version.</p><p>Addressing Variables control how nodes communicate:</p><pre>ip: 192.168.10.10<br>access_ip: 192.168.10.10<br>loadbalancer_apiserver: 192.168.10.100</pre><p>The ip variable is what the node uses for internal cluster communication. The access_ip is what other nodes use to reach this node. In most cases they are the same, but in environments with separate management and data networks, they might differ.</p><p>Calico-specific variables live in a separate category if you use Calico as your CNI:</p><pre>calico_ipip_mode: Always<br>calico_vxlan_mode: Never<br>calico_network_backend: bird</pre><h4>The Customization Workflow</h4><p>Now that you understand where variables live and how priority works, let me walk you through the workflow for customizing a Kubespray deployment.</p><p>First, recognize what you should and should not modify. This table is critical:</p><pre>Location                              | Modify? | Reason<br>--------------------------------------|---------|----------------------------------<br>inventory/mycluster/group_vars/       | YES     | Your customization area<br>inventory/mycluster/host_vars/        | YES     | Host-specific overrides<br>-e command line option                | YES     | Temporary overrides<br>playbooks/*.yml                       | NO      | Kubespray code, gets overwritten<br>roles/*/defaults/                     | NO      | Kubespray code, gets overwritten<br>roles/*/vars/                         | NO      | Internal fixed values</pre><p>When you clone Kubespray and run git pull or switch to a newer release, everything in playbooks/ and roles/ gets updated. But your inventory/mycluster/ directory is separate. 
It survives upgrades because you created it by copying from inventory/sample/.</p><p>The three-step workflow goes like this.</p><p><strong>Step one: Copy the sample inventory.</strong></p><pre>cp -rfp inventory/sample inventory/mycluster</pre><p>The sample inventory is a template. It contains some variables with values, some commented out as hints, and many variables not present at all (meaning they use role defaults).</p><p><strong>Step two: Find the variable you want to change by checking role defaults.</strong></p><pre>grep -r &quot;kube_version&quot; roles/*/defaults/ | head -5</pre><p>This shows you where kube_version is defined and what its default value is:</p><pre>roles/kubespray_defaults/defaults/main/main.yml:kube_version: v1.31.0</pre><p><strong>Step three: Override it in your group_vars.</strong></p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>kube_version: v1.30.0</pre><p>You only define the variables you want to change. Everything else uses the defaults. This keeps your configuration minimal and makes upgrades easier because you are not duplicating values that you did not need to modify.</p><p>Let me show you a complete example. Suppose you want to change the Kubernetes version from the default v1.31.0 to v1.30.0.</p><p><strong>First, verify the default:</strong></p><pre>grep &quot;kube_version&quot; roles/kubespray_defaults/defaults/main/main.yml</pre><p>Output:</p><pre>kube_version: v1.31.0</pre><p>Next, check if the sample inventory already has this variable:</p><pre>grep &quot;kube_version&quot; inventory/sample/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>Output:</p><pre># kube_version: v1.31.0</pre><p>The sample has it commented out, which means it is using the default. 
The comment serves as documentation showing you that this variable exists and can be changed.</p><p>Now add your override:</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>kube_version: v1.30.0</pre><p>When you run ansible-playbook cluster.yml, Ansible loads the role default (v1.31.0) at priority 2, then loads your group_vars (v1.30.0) at priority 6, and the higher priority wins. Kubernetes 1.30.0 gets installed.</p><p>If you need to temporarily test a different version without modifying files, use extra vars:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  -e &quot;kube_version=v1.32.0&quot;</pre><p>Priority 22 beats everything. The cluster deploys with v1.32.0 regardless of what your files say.</p><h4>Key Files You Will Modify</h4><p>Let me give you a practical reference for the files you will interact with most often.</p><p>The inventory/mycluster/group_vars/all/all.yml file contains global settings that apply to every node, including etcd nodes. Common settings here include:</p><pre>bin_dir: /usr/local/bin<br>loadbalancer_apiserver_port: 6443<br>ntp_enabled: true<br>ntp_servers:<br>  - &quot;0.pool.ntp.org iburst&quot;<br>  - &quot;1.pool.ntp.org iburst&quot;<br>unsafe_show_logs: false</pre><p>The inventory/mycluster/group_vars/all/etcd.yml file controls etcd deployment:</p><pre>etcd_data_dir: /var/lib/etcd<br>etcd_deployment_type: host</pre><p>The etcd_deployment_type is particularly important. Setting it to host means etcd runs as a systemd service outside of Kubernetes. Setting it to kubeadm means etcd runs as a static pod managed by kubeadm. The host option gives you more control and independence from kubeadm&#39;s lifecycle management.</p><p>The inventory/mycluster/group_vars/all/containerd.yml file configures the container runtime. 
Most options are commented out in the sample, meaning you use defaults:</p><pre># containerd_storage_dir: &quot;/var/lib/containerd&quot;<br># containerd_state_dir: &quot;/run/containerd&quot;<br># containerd_oom_score: 0<br># containerd_default_runtime: &quot;runc&quot;</pre><p>The inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml file is the most important one. This is where you configure the Kubernetes cluster itself:</p><pre>kube_config_dir: /etc/kubernetes<br>kube_network_plugin: calico<br>kube_service_addresses: 10.233.0.0/18<br>kube_pods_subnet: 10.233.64.0/18<br>kube_proxy_mode: ipvs<br>dns_mode: coredns<br>enable_nodelocaldns: true<br>container_manager: containerd<br>auto_renew_certificates: false</pre><p>The inventory/mycluster/group_vars/k8s_cluster/addons.yml file controls which add-ons get installed:</p><pre>helm_enabled: false<br>metrics_server_enabled: false<br>ingress_nginx_enabled: false<br>cert_manager_enabled: false<br>metallb_enabled: false<br>argocd_enabled: false</pre><p>Every addon has additional configuration options below its enable flag. For example, if you enable metrics_server, you can configure its resource requests, replica count, and other settings in the same file.</p><p>The network plugin files like k8s-net-calico.yml and k8s-net-flannel.yml contain CNI-specific settings. The file that gets used depends on your kube_network_plugin value. If you set kube_network_plugin: flannel, then k8s-net-flannel.yml becomes relevant.</p><h4>Understanding Scope: all vs k8s_cluster</h4><p>A subtle but important detail is the difference between group_vars/all/ and group_vars/k8s_cluster/.</p><p>Variables in group_vars/all/ apply to every host in your inventory. This includes etcd nodes, control plane nodes, and worker nodes.</p><p>Variables in group_vars/k8s_cluster/ apply only to hosts that belong to the k8s_cluster group. 
Looking at a typical inventory:</p><pre>[kube_control_plane]<br>k8s-ctr<br><br>[etcd]<br>k8s-ctr<br><br>[kube_node]<br>k8s-ctr<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>The k8s_cluster group is defined using the :children suffix, which means it includes all hosts from kube_control_plane and kube_node. In this single-node example, k8s-ctr belongs to all groups. But in a multi-node setup where you have dedicated etcd nodes that are not part of the Kubernetes cluster, the etcd nodes would be in the etcd group but not in k8s_cluster.</p><p>This distinction matters because some settings only make sense for Kubernetes nodes. The kube_proxy_mode variable, for example, should only apply to nodes running kube-proxy. Putting it in group_vars/k8s_cluster/ ensures it does not get applied to standalone etcd nodes.</p><p>The priority difference also matters. Remember that group_vars/all is priority 4 and group_vars/k8s_cluster is priority 6. If you somehow define the same variable in both places, the k8s_cluster value wins for nodes that belong to that group.</p><p>One practical question that comes up is how to know which Kubernetes versions Kubespray supports. The answer is in the checksums file.</p><pre>cat roles/kubespray_defaults/vars/main/checksums.yml | grep -A 20 &quot;kubelet_checksums:&quot;</pre><p>This shows you the kubelet checksums, and each version listed is a supported version. If you try to set kube_version to a value that does not have a corresponding checksum, the download task will fail because Kubespray cannot verify the binary integrity.</p><p>When a new Kubernetes version comes out, the Kubespray maintainers add its checksums to this file. 
Until they do, you cannot use that version unless you bypass the checksum verification (not recommended) or add the checksums yourself using extra_vars at priority 22.</p><p>When a variable does not behave as expected, here is how to debug it.</p><p>First, check what value Ansible is actually using. Add a debug task to your playbook or run:</p><pre>ansible -i inventory/mycluster/inventory.ini all -m debug -a &quot;var=kube_version&quot;</pre><p>This shows you the value of kube_version that your inventory resolves to for each host. An ad-hoc command does not load role defaults, so a variable you have not overridden in the inventory shows up as undefined here.</p><p>Second, trace where the value comes from. Ansible does not have a built-in “show me which file set this variable” feature, but you can search:</p><pre>grep -r &quot;kube_version&quot; inventory/mycluster/<br>grep -r &quot;kube_version&quot; roles/*/defaults/<br>grep -r &quot;kube_version&quot; roles/*/vars/</pre><p>Compare the search results with the priority list. The highest priority source wins.</p><p>Third, check for typos. A common mistake is defining kube_versions (plural) when the variable is kube_version (singular). Ansible does not validate variable names, so the misspelled variable is simply never read and your setting has no effect.</p><p>Fourth, verify the file is being loaded. Ansible only reads group_vars files that end in .yml, .yaml, or .json, or that have no extension at all. If you created a new file with any other extension, such as .txt or .bak, Ansible skips it without warning.</p><h3>Chapter 6: Group Variables Deep Dive</h3><p>In the previous chapter, I explained how Kubespray organizes its variable system and why the precedence matters. Now we get into the actual files you will be editing. When you copy inventory/sample to inventory/mycluster, you inherit a carefully structured set of configuration files that control every aspect of your cluster. 
Understanding what each file does and which variables live where saves you from the frustrating experience of changing something in the wrong place and wondering why nothing happened.</p><h4>The group_vars Directory Structure</h4><p>After copying the sample inventory, your mycluster directory looks like this:</p><pre>inventory/mycluster/<br>├── group_vars<br>│   ├── all<br>│   │   ├── all.yml<br>│   │   ├── aws.yml<br>│   │   ├── azure.yml<br>│   │   ├── containerd.yml<br>│   │   ├── coreos.yml<br>│   │   ├── cri-o.yml<br>│   │   ├── docker.yml<br>│   │   ├── etcd.yml<br>│   │   ├── gcp.yml<br>│   │   ├── hcloud.yml<br>│   │   ├── huaweicloud.yml<br>│   │   ├── oci.yml<br>│   │   ├── offline.yml<br>│   │   ├── openstack.yml<br>│   │   ├── upcloud.yml<br>│   │   └── vsphere.yml<br>│   └── k8s_cluster<br>│       ├── addons.yml<br>│       ├── k8s-cluster.yml<br>│       ├── k8s-net-calico.yml<br>│       ├── k8s-net-cilium.yml<br>│       ├── k8s-net-custom-cni.yml<br>│       ├── k8s-net-flannel.yml<br>│       ├── k8s-net-kube-ovn.yml<br>│       ├── k8s-net-kube-router.yml<br>│       ├── k8s-net-macvlan.yml<br>│       └── kube_control_plane.yml<br>└── inventory.ini</pre><p>Two directories exist under group_vars: all and k8s_cluster. This split is not arbitrary. It follows Ansible’s group variable system where variables defined in a directory apply only to hosts belonging to that group.</p><h4>How Ansible Group Variables Work</h4><p>The group_vars/all directory contains variables that apply to every single host in your inventory. When Ansible runs a play against any host, it loads all the YAML files from group_vars/all and makes those variables available.</p><p>The group_vars/k8s_cluster directory is more specific. It only applies to hosts that belong to the k8s_cluster group. 
If you look at a typical Kubespray inventory, you will see something like this:</p><pre>[kube_control_plane]<br>k8s-ctr<br><br>[etcd]<br>k8s-ctr<br><br>[kube_node]<br>k8s-ctr<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>The k8s_cluster group is defined using the :children suffix, which means it includes all hosts from kube_control_plane and kube_node groups. So any host that is either a control plane node or a worker node gets the variables from group_vars/k8s_cluster.</p><p>This distinction matters for etcd. If you run etcd on separate dedicated nodes that are not part of kube_control_plane or kube_node, those etcd hosts would get variables from group_vars/all but not from group_vars/k8s_cluster. The separation allows you to configure etcd-specific settings without accidentally applying Kubernetes cluster settings to standalone etcd nodes.</p><h4>Variable Precedence Between Groups</h4><p>Ansible has a rule: more specific groups override less specific ones. The group_vars/all directory has priority 4 in Ansible’s precedence system, while group_vars/k8s_cluster has priority 6. This means if the same variable is defined in both places, the k8s_cluster version wins for hosts that belong to that group.</p><p>In practice, you rarely define the same variable in both places. Kubespray’s sample files are organized so that truly global settings go in all and cluster-specific settings go in k8s_cluster. But knowing this precedence helps when you need to override something for just the Kubernetes nodes without affecting standalone etcd hosts.</p><h4>Conditional File Application</h4><p>Not every file in group_vars gets used in every deployment. Many files are conditional based on other variable values.</p><p>For cloud provider files like aws.yml, azure.yml, and gcp.yml, the contents only matter if you set cloud_provider to that specific provider. 
If you are running on bare metal or a different cloud, these files are essentially ignored even though they exist.</p><p>Similarly, the container runtime files are conditional. The containerd.yml settings apply when container_manager is set to containerd. The cri-o.yml settings apply when you choose CRI-O instead. The docker.yml file is there for legacy Docker support through cri-dockerd, though this path is deprecated since Kubernetes 1.24 removed native Docker support.</p><p>The network plugin files in k8s_cluster follow the same pattern. The k8s-net-calico.yml file only matters when kube_network_plugin is calico. Switch to flannel and suddenly k8s-net-flannel.yml becomes relevant while the Calico file is ignored.</p><p>This conditional system is elegant because you can see all available options in your inventory without them interfering with each other. You pick your choices in k8s-cluster.yml and the corresponding configuration files automatically become active.</p><h4>The group_vars/all Directory</h4><p>This directory contains settings that every node needs regardless of its role.</p><p><em>all.yml — Global Settings</em></p><p>The all.yml file is where you configure settings that genuinely apply everywhere. 
Running grep to see only the non-commented lines:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/all/all.yml</pre><p>You get output like:</p><pre>---<br>bin_dir: /usr/local/bin<br>loadbalancer_apiserver_port: 6443<br>loadbalancer_apiserver_healthcheck_port: 8081<br>no_proxy_exclude_workers: false<br>kube_webhook_token_auth: false<br>kube_webhook_token_auth_url_skip_tls_verify: false<br>ntp_enabled: false<br>ntp_manage_config: false<br>ntp_servers:<br>  - &quot;0.pool.ntp.org iburst&quot;<br>  - &quot;1.pool.ntp.org iburst&quot;<br>  - &quot;2.pool.ntp.org iburst&quot;<br>  - &quot;3.pool.ntp.org iburst&quot;<br>unsafe_show_logs: false<br>allow_unsupported_distribution_setup: false</pre><p>The bin_dir setting tells Kubespray where to install binaries like kubectl, kubelet, and kubeadm. The default /usr/local/bin works for most systems. Change this if your organization has a different standard location for locally installed binaries.</p><p>The loadbalancer_apiserver_port defaults to 6443, which is the standard Kubernetes API server port. In HA setups with an external load balancer, this is the port your load balancer listens on.</p><p>The ntp_enabled setting is false by default, which might surprise you. Time synchronization is critical for Kubernetes, especially for certificate validation and etcd consistency. Kubespray assumes your nodes already have NTP configured through your base OS provisioning. If you want Kubespray to manage NTP, set this to true and it will configure chrony or ntpd with the servers listed in ntp_servers.</p><p>The unsafe_show_logs setting controls whether sensitive information appears in Ansible output. Keep this false in production. Setting it true during debugging can help you see what values are being used, but you risk exposing secrets in your logs.</p><p>The allow_unsupported_distribution_setup is a safety valve. Kubespray has a list of tested and supported operating systems. 
If you try to deploy on something not on that list, it fails with an error. Setting this to true bypasses that check. Use this only if you know what you are doing and accept that things might break in unexpected ways.</p><p><em>etcd.yml — etcd Configuration</em></p><p>The etcd configuration is surprisingly minimal in the sample file:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/all/etcd.yml<br><br>---<br>etcd_data_dir: /var/lib/etcd<br>etcd_deployment_type: host</pre><p>The etcd_data_dir is where etcd stores its database. This directory contains the entire state of your Kubernetes cluster. Losing this directory means losing your cluster. Make sure this path is on reliable storage and included in your backup strategy.</p><p>The etcd_deployment_type setting is more interesting. It has two possible values: host and kubeadm.</p><p>When set to host, Kubespray installs etcd as a systemd service running directly on the host. This is independent of Kubernetes itself. The etcd binary runs under the etcd user, managed by systemd, with its own certificates in /etc/ssl/etcd/ssl. This approach keeps etcd completely separate from Kubernetes, which some operators prefer because etcd can be managed, backed up, and recovered independently.</p><p>When set to kubeadm, etcd runs as a static pod managed by kubelet. This is what you get with a standard kubeadm init. The etcd container runs in the kube-system namespace, and kubeadm handles certificate generation and renewal. 
This approach is simpler but couples etcd lifecycle to Kubernetes.</p><p>Kubespray defaults to host because it provides more operational flexibility, especially for etcd upgrades and disaster recovery scenarios.</p><p><em>containerd.yml — Container Runtime Settings</em></p><p>Opening the containerd configuration file reveals mostly comments:</p><pre>cat inventory/mycluster/group_vars/all/containerd.yml<br><br>---<br># Please see roles/container-engine/containerd/defaults/main.yml <br># for more configuration options<br><br># containerd_storage_dir: &quot;/var/lib/containerd&quot;<br># containerd_state_dir: &quot;/run/containerd&quot;<br># containerd_oom_score: 0<br><br># containerd_default_runtime: &quot;runc&quot;<br># containerd_snapshotter: &quot;native&quot;<br>...</pre><p>The file is almost entirely commented out because the defaults work for most deployments. This is a pattern you will see throughout Kubespray: the sample files show you what options exist without forcing you to understand them all upfront.</p><p>If you need to change containerd’s storage location, perhaps because /var/lib is on a small root partition and you want container images on a larger disk, uncomment containerd_storage_dir and set your preferred path.</p><p>The comment at the top pointing to roles/container-engine/containerd/defaults/main.yml is valuable. 
When you need to customize something not shown in the sample file, that is where you look to find the variable name and its default value.</p><h4>Cloud Provider Files</h4><p>The group_vars/all directory contains configuration files for various cloud providers: aws.yml, azure.yml, gcp.yml, openstack.yml, vsphere.yml, and several others.</p><p>These files configure cloud-specific integrations like:</p><ul><li>Cloud controller manager settings</li><li>CSI driver configurations for persistent volumes</li><li>Load balancer integration</li><li>Node metadata services</li></ul><p>For example, if you deploy on AWS and want to use EBS volumes for persistent storage, you would configure the AWS CSI driver settings in aws.yml. If you deploy on bare metal, these files sit unused.</p><p>Most of the content in these files is commented out. The comments serve as documentation showing what options are available. When you need a specific cloud integration, you uncomment and configure the relevant settings.</p><p><em>docker.yml — Legacy Docker Support</em></p><p>The docker.yml file exists for deployments that still use Docker as the container runtime. Since Kubernetes 1.24 removed the built-in dockershim, using Docker now requires cri-dockerd as a shim layer.</p><pre>docker_container_storage_setup: false<br>docker_dns_servers_strict: false<br>docker_daemon_graph: &quot;/var/lib/docker&quot;<br>docker_iptables_enabled: &quot;false&quot;<br>docker_log_opts: &quot;--log-opt max-size=50m --log-opt max-file=5&quot;<br>docker_bin_dir: &quot;/usr/bin&quot;<br>docker_rpm_keepcache: 1</pre><p>Unless you have a specific reason to use Docker, stick with containerd. It is the default and the direction the Kubernetes ecosystem has moved.</p><p><em>offline.yml — Air-Gap Deployment</em></p><p>The offline.yml file contains settings for deploying Kubernetes in environments without internet access. 
Air-gap deployments require pre-downloading all container images and binaries, then serving them from internal repositories.</p><p>This file is where you configure your internal registry mirrors and binary download locations. In a connected environment, you can ignore this file entirely.</p><p><em>The group_vars/k8s_cluster Directory</em></p><p>This directory contains the real meat of your cluster configuration. The settings here define what kind of Kubernetes cluster you are building.</p><p><em>k8s-cluster.yml — The Heart of Your Cluster Configuration</em></p><p>This is the most important configuration file in Kubespray. It controls fundamental cluster characteristics that are difficult or impossible to change after deployment.</p><p>Running grep to see active settings:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>The output is substantial:</p><pre>---<br>kube_config_dir: /etc/kubernetes<br>kube_script_dir: &quot;/kubernetes-scripts&quot;<br>kube_manifest_dir: &quot;/manifests&quot;<br>kube_cert_dir: &quot;/ssl&quot;<br>kube_token_dir: &quot;/tokens&quot;<br>kube_api_anonymous_auth: true<br>local_release_dir: &quot;/tmp/releases&quot;<br>retry_stagger: 5<br>kube_owner: kube<br>kube_cert_group: kube-cert<br>kube_log_level: 2<br>credentials_dir: &quot;/credentials&quot;<br>kube_network_plugin: calico<br>kube_network_plugin_multus: false<br>kube_service_addresses: 10.233.0.0/18<br>kube_pods_subnet: 10.233.64.0/18<br>kube_network_node_prefix: 24<br>kube_service_addresses_ipv6: fd85:ee78:d8a6:8607::1000/116<br>kube_pods_subnet_ipv6: fd85:ee78:d8a6:8607::1:0000/112<br>kube_network_node_prefix_ipv6: 120<br>kube_apiserver_ip: &quot;&quot;<br>kube_apiserver_port: 6443<br>kube_proxy_mode: ipvs<br>kube_proxy_strict_arp: false<br>kube_encrypt_secret_data: false<br>cluster_name: cluster.local<br>ndots: 2<br>dns_mode: coredns<br>enable_nodelocaldns: true<br>enable_nodelocaldns_secondary: false<br>nodelocaldns_ip: 
169.254.25.10<br>resolvconf_mode: host_resolvconf<br>deploy_netchecker: false<br>dns_domain: &quot;&quot;<br>container_manager: containerd<br>kata_containers_enabled: false<br>k8s_image_pull_policy: IfNotPresent<br>kubernetes_audit: false<br>volume_cross_zone_attachment: false<br>persistent_volumes_enabled: false<br>event_ttl_duration: &quot;1h0m0s&quot;<br>auto_renew_certificates: false<br>kubeadm_patches_dir: &quot;/patches&quot;<br>kubeadm_patches: []<br>remove_anonymous_access: false</pre><h4>Directory Settings</h4><p>The first group of settings defines where Kubernetes stores its files:</p><pre>kube_config_dir: /etc/kubernetes<br>kube_script_dir: &quot;/kubernetes-scripts&quot;<br>kube_manifest_dir: &quot;/manifests&quot;<br>kube_cert_dir: &quot;/ssl&quot;<br>kube_token_dir: &quot;/tokens&quot;</pre><p>The values shown are suffixes: in the sample file each one is prefixed with a templated base directory, which the listing above omits. For the manifest, certificate, and token directories that base is kube_config_dir, so the actual manifest directory ends up at /etc/kubernetes/manifests. This is where static pod manifests for control plane components live.</p><p>You rarely need to change these unless your organization has specific filesystem layout requirements.</p><h4>Network Settings</h4><p>The network configuration is critical and mostly immutable after cluster creation:</p><pre>kube_network_plugin: calico<br>kube_service_addresses: 10.233.0.0/18<br>kube_pods_subnet: 10.233.64.0/18<br>kube_network_node_prefix: 24</pre><p>The kube_network_plugin setting chooses your CNI provider. Calico is the default and supports network policies out of the box. Other options include flannel for simplicity, cilium for eBPF-based networking, and several others.</p><p>The kube_service_addresses CIDR defines the IP range for Kubernetes Services. When you create a Service of type ClusterIP, it gets an IP from this range. The default 10.233.0.0/18 provides 16,384 addresses.</p><p>The kube_pods_subnet CIDR is where pod IPs come from. With 10.233.64.0/18, you have 16,384 pod addresses to work with. 
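</p><p>As a rough worked example of the arithmetic behind these defaults, the fragment below annotates the same settings. The comments are illustrative annotations rather than text from the sample file, and the per-node figure assumes a plugin that hands each node one block of the configured size:</p><pre># A /18 leaves 32 - 18 = 14 host bits: 2^14 = 16,384 addresses<br>kube_service_addresses: 10.233.0.0/18<br>kube_pods_subnet: 10.233.64.0/18<br><br># A /24 per node leaves 8 host bits: 2^8 = 256 addresses per node<br># 2^(24 - 18) = 64 such blocks fit inside the /18 pod range<br>kube_network_node_prefix: 24</pre><p>One consequence worth noticing: under that assumption the default pod range has room for about 64 node-sized blocks, so a larger cluster would need a wider kube_pods_subnet chosen before deployment.</p><p>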
The kube_network_node_prefix of 24 means each node gets a /24 subnet (256 IPs) from this range for its pods.</p><p>These CIDR ranges must not overlap with your existing network infrastructure. If your corporate network uses 10.x.x.x addressing, you need to carefully plan these ranges to avoid conflicts. Changing these after deployment requires recreating the cluster.</p><h4>Proxy Mode</h4><p>The kube_proxy_mode setting determines how Kubernetes implements service networking:</p><pre>kube_proxy_mode: ipvs</pre><p>IPVS (IP Virtual Server) is the default and provides better performance than the older iptables mode, especially in clusters with many services. IPVS uses hash tables for service lookup, giving O(1) performance regardless of how many services exist. The iptables mode uses linear rule chains, so performance degrades as service count grows.</p><p>For most deployments, stick with ipvs. The iptables mode still works and might be preferred in very small clusters or when debugging networking issues since iptables rules are more familiar to many operators.</p><h4>DNS Configuration</h4><pre>dns_mode: coredns<br>enable_nodelocaldns: true<br>nodelocaldns_ip: 169.254.25.10<br>cluster_name: cluster.local</pre><p>CoreDNS is the standard Kubernetes DNS server and the only real option these days. The dns_mode setting exists for historical reasons when kube-dns was an alternative.</p><p>The enable_nodelocaldns setting deploys NodeLocal DNSCache, which runs a DNS caching agent on every node. Pods make DNS queries to the nodelocaldns_ip address (169.254.25.10, a link-local address), and the local agent either answers from cache or forwards to CoreDNS. This significantly reduces DNS latency and load on CoreDNS pods.</p><p>The cluster_name becomes your cluster’s DNS domain. Services are accessible at servicename.namespace.svc.cluster.local.</p><h4>Container Runtime</h4><pre>container_manager: containerd</pre><p>This chooses between containerd and CRI-O. 
Containerd is the dominant choice in the Kubernetes ecosystem now. It is what Docker uses internally, it is the default for most managed Kubernetes services, and it has excellent tooling support.</p><h4>Certificate Management</h4><pre>auto_renew_certificates: false</pre><p>Kubernetes component certificates expire after one year by default. When auto_renew_certificates is true, Kubespray configures a systemd timer to automatically renew certificates monthly using kubeadm certs renew.</p><p>The sample defaults to false, which means you need to manually renew certificates or implement your own automation. For any cluster running longer than a few months, enabling this is strongly recommended. Expired certificates cause control plane components to stop communicating, effectively bringing down your cluster.</p><p><em>addons.yml — Cluster Add-ons</em></p><p>The addons file controls which additional components get deployed:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/k8s_cluster/addons.yml</pre><pre>---<br>helm_enabled: false<br>registry_enabled: false<br>metrics_server_enabled: false<br>local_path_provisioner_enabled: false<br>local_volume_provisioner_enabled: false<br>gateway_api_enabled: false<br>ingress_nginx_enabled: false<br>ingress_publish_status_address: &quot;&quot;<br>ingress_alb_enabled: false<br>cert_manager_enabled: false<br>metallb_enabled: false<br>metallb_speaker_enabled: &quot;&quot;<br>metallb_namespace: &quot;metallb-system&quot;<br>argocd_enabled: false<br>kube_vip_enabled: false<br>node_feature_discovery_enabled: false</pre><p>Each boolean controls whether that component gets installed. The defaults are conservative, deploying a minimal cluster. You enable what you need.</p><p>For most production clusters, you probably want:</p><pre>helm_enabled: true<br>metrics_server_enabled: true</pre><p>Helm is the de facto package manager for Kubernetes applications. 
Metrics Server is required for kubectl top commands and Horizontal Pod Autoscaler functionality.</p><p>For bare metal clusters without a cloud load balancer, MetalLB provides LoadBalancer service support:</p><pre>metallb_enabled: true</pre><p>For clusters that need ingress routing:</p><pre>ingress_nginx_enabled: true</pre><p>The file contains many commented-out sections with detailed configuration options for each addon. When you enable an addon, scroll through those comments to see what customization options exist.</p><p><em>kube_control_plane.yml — Control Plane Resource Reservation</em></p><p>This file is entirely commented out in the sample:</p><pre>cat inventory/mycluster/group_vars/k8s_cluster/kube_control_plane.yml<br><br># Reservation for control plane kubernetes components<br># kube_memory_reserved: 512Mi<br># kube_cpu_reserved: 200m<br># kube_ephemeral_storage_reserved: 2Gi<br># kube_pid_reserved: &quot;1000&quot;<br><br># Reservation for control plane host system<br># system_memory_reserved: 256Mi<br># system_cpu_reserved: 250m<br># system_ephemeral_storage_reserved: 2Gi<br># system_pid_reserved: &quot;1000&quot;</pre><p>These settings reserve resources on control plane nodes for Kubernetes components and the host operating system. Without reservations, pods could consume all available resources, starving critical system processes.</p><p>In production, especially on nodes that run both control plane components and workload pods, configuring these reservations prevents resource exhaustion scenarios. The commented values are reasonable starting points.</p><h4>Network Plugin Configuration Files</h4><p>The k8s_cluster directory contains configuration files for each supported CNI plugin. Only the file matching your kube_network_plugin choice gets used.</p><p><em>For Calico (k8s-net-calico.yml):</em></p><pre>calico_ipip_mode: Always<br>calico_vxlan_mode: Never<br>calico_network_backend: bird</pre><p>These control Calico’s encapsulation mode. 
IPIP mode wraps pod traffic in IP-in-IP packets, which works across most networks. VXLAN is an alternative encapsulation. The bird backend uses BGP for routing, which can peer with your physical network infrastructure for advanced deployments.</p><p><em>For Flannel (k8s-net-flannel.yml):</em></p><pre>flannel_backend_type: vxlan</pre><p>Flannel is simpler than Calico. It uses VXLAN overlay networking by default. The main thing you might configure is flannel_interface if you have multiple network interfaces and need Flannel to use a specific one.</p><p><em>For Cilium (k8s-net-cilium.yml):</em></p><p>Cilium has extensive configuration options for its eBPF-based networking. The defaults work for getting started, but Cilium’s advanced features like transparent encryption, Hubble observability, and service mesh capabilities all have configuration options here.</p><h4>Checking Supported Kubernetes Versions</h4><p>The group_vars files define what you want, but Kubespray can only install versions it knows about. The roles/kubespray_defaults/vars/main/checksums.yml file contains SHA256 checksums for all supported binaries.</p><p>To see what Kubernetes versions are available:</p><pre>grep -i kube -A40 roles/kubespray_defaults/vars/main/checksums.yml</pre><p>If a version does not have checksums in this file, Kubespray cannot install it. This is a security feature ensuring binary integrity. When Kubernetes releases a new version, Kubespray maintainers add the checksums in a subsequent release.</p><p>This is why you cannot simply set kube_version to any arbitrary version. 
You need a Kubespray release that includes that version’s checksums.</p><h4>Practical Workflow</h4><p>When setting up a new cluster, the typical workflow is:</p><p>First, copy the sample inventory:</p><pre>cp -rfp inventory/sample inventory/mycluster</pre><p>Second, edit inventory.ini with your hosts and groups.</p><p>Third, review k8s-cluster.yml and change the settings that matter for your environment:</p><ul><li>Choose your CNI plugin</li><li>Verify the CIDR ranges do not conflict with your network</li><li>Enable certificate auto-renewal</li><li>Adjust any other settings based on requirements</li></ul><p>Fourth, review addons.yml and enable the add-ons you need.</p><p>Fifth, if using a specific cloud provider, configure the corresponding file in group_vars/all.</p><p>Variables you do not change keep their defaults from roles/*/defaults/main.yml. You do not need to understand every variable to deploy a working cluster. Start with the defaults, deploy, and then iterate as you learn what needs customization.</p><p>The group_vars files in your inventory are yours to modify freely. Kubespray updates will not overwrite them because they live in your custom inventory directory, not in the sample directory or the roles themselves.</p><h3>Chapter 7: cluster.yml Playbook Flow Analysis</h3><p>When you run ansible-playbook cluster.yml, you are kicking off a carefully orchestrated sequence of operations that transforms a bunch of Linux machines into a functioning Kubernetes cluster.</p><p>Understanding this flow is not optional if you want to operate Kubespray in production. Let me walk you through the entire cluster.yml playbook, explaining what happens at each stage and why it is structured the way it is.</p><p>The cluster.yml file sits at the root of the Kubespray repository. It is the main entry point for cluster deployment. But here is the thing that confused me when I first looked at it: the file itself is surprisingly short. 
Most of the actual work happens in sub-playbooks and roles that get imported.</p><p>The overall flow looks like this:</p><pre><br>Common tasks (boilerplate.yml)<br>↓<br>Fact gathering (internal_facts.yml)<br>↓<br>etcd installation preparation (preinstall, container-engine, download)<br>↓<br>etcd installation (install_etcd.yml)<br>↓<br>Kubernetes node installation (kubernetes/node)<br>↓<br>Control plane installation (kubernetes/control-plane)<br>↓<br>kubeadm execution and CNI installation (kubernetes/kubeadm, network_plugin)<br>↓<br>Calico Route Reflector (optional)<br>↓<br>Windows node patching (optional)<br>↓<br>Kubernetes apps installation (kubernetes-apps)<br>↓<br>resolv.conf finalization</pre><p>Each of these stages corresponds to one or more plays in the playbook. Let me show you the actual structure.</p><h4>The Playbook Structure</h4><p>Here is the cluster.yml file in its entirety. I am including the whole thing because it is worth reading through:</p><pre>---<br>- name: Common tasks for every playbooks<br>  import_playbook: boilerplate.yml<br><br>- name: Gather facts<br>  import_playbook: internal_facts.yml<br><br>- name: Prepare for etcd install<br>  hosts: &quot;{{ groups[&#39;etcd&#39;] | default([]) | union(groups[&#39;k8s_cluster&#39;] | default([])) }}&quot;<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, tags: preinstall }<br>    - { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>    - { role: download, tags: download, when: &quot;not skip_downloads&quot; }<br><br>- name: Install etcd<br>  vars:<br>    etcd_cluster_setup: true<br>    etcd_events_cluster_setup: &quot;{{ etcd_events_cluster_enabled }}&quot;<br>  import_playbook: install_etcd.yml<br><br>- name: Install Kubernetes nodes<br>  hosts: k8s_cluster<br>  gather_facts: 
false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/node, tags: node }<br><br>- name: Install the control plane<br>  hosts: kube_control_plane<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/control-plane, tags: control-plane }<br><br>- name: Invoke kubeadm and install a CNI<br>  hosts: k8s_cluster<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/kubeadm, tags: kubeadm }<br>    - { role: kubernetes/node-label, tags: node-label }<br>    - { role: kubernetes/node-taint, tags: node-taint, when: node_taints is defined }<br>    - { role: network_plugin, tags: network }<br><br>- name: Install Calico Route Reflector<br>  hosts: calico_rr<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: network_plugin/calico/rr, tags: [&#39;network&#39;, &#39;calico_rr&#39;] }<br><br>- name: Patch Kubernetes for Windows<br>  hosts: kube_control_plane[0]<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: win_nodes/kubernetes_patch, tags: k8s-windows }<br>- name: Install Kubernetes apps<br>  hosts: kube_control_plane<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ 
proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes-apps, tags: apps }<br>- name: Apply resolv.conf changes now that cluster DNS is up<br>  hosts: k8s_cluster<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, when: &quot;dns_mode != &#39;none&#39; and resolvconf_mode == &#39;host_resolvconf&#39;&quot;, tags: resolvconf, dns_late: true }</pre><h4>Play 1 and 2: Boilerplate and Fact Gathering</h4><p>The first two plays handle common setup tasks:</p><pre>- name: Common tasks for every playbooks<br>  import_playbook: boilerplate.yml<br><br>- name: Gather facts<br>  import_playbook: internal_facts.yml</pre><p>The boilerplate.yml playbook does things like validating your Ansible version, checking that your inventory is properly configured, and setting up bastion host SSH config if you are using one. It is the “sanity check before we do anything destructive” phase.</p><p>The internal_facts.yml playbook runs Ansible’s fact gathering across all nodes. This collects information about each machine: what operating system it runs, what IP addresses it has, how much memory, what network interfaces exist, and so on. This information becomes available as variables that later plays can use.</p><p>Notice that these use import_playbook rather than defining roles directly. This is a common pattern in Kubespray. 
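</p><p>As a minimal sketch of the same pattern, a wrapper playbook could chain its own steps around the Kubespray entry point. The file names wrapper.yml and site_checks.yml are hypothetical and not part of Kubespray, and the sketch assumes the wrapper sits next to cluster.yml at the repository root:</p><pre>---<br># wrapper.yml (hypothetical file, not shipped with Kubespray)<br>- name: Run site-specific checks first<br>  import_playbook: site_checks.yml  # hypothetical playbook<br><br>- name: Deploy the cluster<br>  import_playbook: cluster.yml</pre><p>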
The import_playbook directive pulls in another playbook file and executes it as if it were part of the current file.</p><h4>Play 3: Preparing for etcd Installation</h4><p>This play is where things start getting interesting:</p><pre>- name: Prepare for etcd install<br>  hosts: &quot;{{ groups[&#39;etcd&#39;] | default([]) | union(groups[&#39;k8s_cluster&#39;] | default([])) }}&quot;<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, tags: preinstall }<br>    - { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>    - { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>Let me explain each line.</p><p>The hosts directive uses Jinja2 templating to target both the etcd group and the k8s_cluster group. The union filter combines these two groups, and the default([]) ensures the playbook does not fail if either group is undefined. This means the play runs on all nodes that will either run etcd or be part of the Kubernetes cluster.</p><p>The gather_facts: false setting might seem strange. Why would we skip fact gathering? The answer is that we already gathered facts in the previous play (internal_facts.yml), and Kubespray caches these facts. Look at the ansible.cfg settings:</p><pre>gathering = smart<br>fact_caching = jsonfile<br>fact_caching_connection = /tmp<br>fact_caching_timeout = 86400</pre><p>The <em>gathering = smart</em> setting tells Ansible to use cached facts if they exist and are still valid. Since we gathered facts just moments ago and they are cached in /tmp as JSON files, there is no need to gather them again. This saves time, especially when you have many nodes.</p><p>The any_errors_fatal setting deserves attention. When set to true, if any single host fails a task, the entire playbook stops immediately. 
This is critical for cluster deployment because a half-configured cluster is worse than no cluster at all. If etcd fails to install on one node, you do not want the playbook to continue and try to bootstrap Kubernetes against a broken etcd cluster.</p><p>The environment directive sets environment variables for all tasks in this play. The proxy_disable_env variable is defined in roles/kubespray_defaults/defaults/main/main.yml:</p><pre>proxy_disable_env:<br>  http_proxy: &quot;&quot;<br>  HTTP_PROXY: &quot;&quot;<br>  https_proxy: &quot;&quot;<br>  HTTPS_PROXY: &quot;&quot;<br>  no_proxy: &quot;&quot;<br>  NO_PROXY: &quot;&quot;</pre><p>This clears all proxy-related environment variables. Cluster components communicate directly with each other, and you do not want that traffic going through an HTTP proxy.</p><h4>Understanding Role Import Syntax</h4><p>Look at how roles are imported:</p><pre>roles:<br>  - { role: kubespray_defaults }<br>  - { role: kubernetes/preinstall, tags: preinstall }<br>  - { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>  - { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>This is YAML shorthand syntax. It is equivalent to writing:</p><pre>roles:<br>  - role: kubespray_defaults<br>  <br>  - role: kubernetes/preinstall<br>    tags: preinstall<br>  <br>  - role: container-engine<br>    tags: container-engine<br>    when: deploy_container_engine<br>  <br>  - role: download<br>    tags: download<br>    when: &quot;not skip_downloads&quot;</pre><p>The curly brace syntax is more compact but can be harder to read. Kubespray uses it throughout.</p><p>Each role can have:</p><ul><li>A name (the path under the roles/ directory)</li><li>Tags for selective execution</li><li>Conditions (when clauses) that determine if the role runs</li></ul><p>The kubespray_defaults role always runs first and has no tags or conditions. This is intentional. 
It loads all the default variable values that other roles depend on. If you skip it, nothing else will work.</p><h4>Why kubespray_defaults Runs First in Every Play</h4><p>You will notice that every single play in cluster.yml starts with:</p><pre>roles:<br>  - { role: kubespray_defaults }</pre><p>This pattern is not accidental. The kubespray_defaults role lives at roles/kubespray_defaults/ and contains:</p><pre>roles/kubespray_defaults/<br>├── defaults/main/<br>│   ├── main.yml        # 800+ lines of default values<br>│   └── download.yml    # 1100+ lines of download-related defaults<br>└── vars/main/<br>    ├── main.yml<br>    └── checksums.yml   # Binary checksums for integrity verification</pre><p>This role does not run any tasks. It just loads variables. Every other role in Kubespray depends on these variables being available. Things like kube_version, container_manager, kube_network_plugin, and hundreds of other settings are defined here.</p><p>When Ansible runs a play, it loads role defaults at the beginning. By putting kubespray_defaults first, Kubespray ensures that all default values are loaded before any other role tries to use them.</p><h4>Tag-Based Selective Execution</h4><p>Tags are powerful. They let you run only specific parts of the playbook. Look at the tags assigned to each role:</p><pre>- { role: kubernetes/preinstall, tags: preinstall }<br>- { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>- { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>If you want to reinstall just the network plugin without rerunning the entire playbook, you can do:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags network</pre><p>This runs only the roles and tasks tagged with “network”. 
Everything else gets skipped.</p><p>Here is a quick reference of the major tags in cluster.yml:</p><pre>preinstall       → kubernetes/preinstall role<br>container-engine → container-engine role<br>download         → download role<br>node             → kubernetes/node role<br>control-plane    → kubernetes/control-plane role<br>kubeadm          → kubernetes/kubeadm role<br>network          → network_plugin role<br>apps             → kubernetes-apps role</pre><p>You can also combine tags:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags preinstall,container-engine</pre><p>This runs only the preinstall and container-engine stages.</p><h4>Play 4: Installing etcd</h4><pre>- name: Install etcd<br>  vars:<br>    etcd_cluster_setup: true<br>    etcd_events_cluster_setup: &quot;{{ etcd_events_cluster_enabled }}&quot;<br>  import_playbook: install_etcd.yml</pre><p>This play imports the install_etcd.yml playbook and passes two variables using the vars directive.</p><p>Here is something subtle but important. The etcd_cluster_setup variable is set to true here. But wait, is this variable not already defined somewhere? Yes, it is. In roles/etcd_defaults/defaults/main.yml, you will find:</p><pre>etcd_cluster_setup: true</pre><p>So why set it again in the play?</p><p>The answer becomes clear when you look at scale.yml:</p><pre>- name: Install etcd<br>  vars:<br>    etcd_cluster_setup: false<br>    etcd_events_cluster_setup: false<br>  import_playbook: install_etcd.yml</pre><p>In scale.yml, when adding new nodes to an existing cluster, etcd_cluster_setup is set to false. The etcd cluster already exists; we do not want to reinitialize it. By setting this variable at the play level, Kubespray controls how install_etcd.yml behaves differently in cluster.yml versus scale.yml.</p><p>The etcd_events_cluster_enabled variable controls whether to set up a separate etcd cluster for Kubernetes events. 
This is an advanced feature for large clusters where event traffic can overwhelm the main etcd cluster. Most deployments leave this disabled.</p><h4>Play 5: Installing Kubernetes Nodes</h4><pre>- name: Install Kubernetes nodes<br>  hosts: k8s_cluster<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/node, tags: node }</pre><p>This play targets the k8s_cluster group, which includes both control plane nodes and worker nodes. The kubernetes/node role installs kubelet, kubectl, and kubeadm on all nodes that will be part of the cluster.</p><p>Notice the pattern repeating: gather_facts: false (using cached facts), any_errors_fatal (fail fast), environment (no proxy), and kubespray_defaults first.</p><h4>Play 6: Installing the Control Plane</h4><pre>- name: Install the control plane<br>  hosts: kube_control_plane<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/control-plane, tags: control-plane }</pre><p>This play targets only kube_control_plane nodes. 
The kubernetes/control-plane role brings up the Kubernetes control plane components (kube-apiserver, kube-controller-manager, and kube-scheduler): it runs kubeadm init on the first control plane node and then joins any additional control plane nodes.</p><h4>Play 7: Running kubeadm and Installing CNI</h4><pre>- name: Invoke kubeadm and install a CNI<br>  hosts: k8s_cluster<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/kubeadm, tags: kubeadm }<br>    - { role: kubernetes/node-label, tags: node-label }<br>    - { role: kubernetes/node-taint, tags: node-taint, when: node_taints is defined }<br>    - { role: network_plugin, tags: network }</pre><p>This is where the rest of the cluster comes together. The kubernetes/kubeadm role runs kubeadm join on the remaining nodes, so the workers register with the control plane that the previous play initialized.</p><p>After the cluster is bootstrapped, the node-label role applies any custom labels you have defined. The node-taint role applies taints, but only if you have defined node_taints in your variables.</p><p>Finally, the network_plugin role installs your chosen CNI plugin. If you set kube_network_plugin: calico in your variables, this role installs Calico. If you set kube_network_plugin: flannel, it installs Flannel. The role reads your configuration and acts accordingly.</p><h4>Play 8: Calico Route Reflector</h4><pre>- name: Install Calico Route Reflector<br>  hosts: calico_rr<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: network_plugin/calico/rr, tags: [&#39;network&#39;, &#39;calico_rr&#39;] }</pre><p>This play only runs if you have a calico_rr group defined in your inventory. Calico Route Reflector is an advanced feature for large clusters using BGP routing. 
Most deployments do not use this, and if the calico_rr group is empty or undefined, this play does nothing.</p><p>Notice the tags syntax here: tags: [&#39;network&#39;, &#39;calico_rr&#39;]. This assigns two tags to the role, so you can target it with either --tags network or --tags calico_rr.</p><h4>Play 9: Windows Node Support</h4><pre>- name: Patch Kubernetes for Windows<br>  hosts: kube_control_plane[0]<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: win_nodes/kubernetes_patch, tags: k8s-windows }</pre><p>This play applies patches needed for Windows node support. It targets only the first control plane node (kube_control_plane[0]) because it just needs to apply some manifests to the cluster.</p><p>If you are not running Windows workers, this play effectively does nothing. The win_nodes/kubernetes_patch role checks whether Windows support is enabled and skips itself if not.</p><h4>Play 10: Installing Kubernetes Apps</h4><pre>- name: Install Kubernetes apps<br>  hosts: kube_control_plane<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes-apps, tags: apps }</pre><p>This play installs cluster addons: CoreDNS, metrics-server, Helm, Ingress controllers, and whatever else you have enabled in your addons.yml file. 
It runs on control plane nodes because it needs kubectl access to apply manifests.</p><h4>Play 11: Finalizing DNS Configuration</h4><pre>- name: Apply resolv.conf changes now that cluster DNS is up<br>  hosts: k8s_cluster<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, when: &quot;dns_mode != &#39;none&#39; and resolvconf_mode == &#39;host_resolvconf&#39;&quot;, tags: resolvconf, dns_late: true }</pre><p>This final play updates /etc/resolv.conf on all cluster nodes now that CoreDNS is running. The kubernetes/preinstall role is called again, but this time with dns_late: true. This variable changes the role’s behavior, making it update DNS configuration rather than doing general preinstall tasks.</p><p>The when clause ensures this only runs if cluster DNS is enabled (dns_mode != &#39;none&#39;) and Kubespray is managing the host’s resolv.conf (resolvconf_mode == &#39;host_resolvconf&#39;).</p><h4>Mapping to Kubernetes The Hard Way</h4><p>If you have gone through Kubernetes The Hard Way, you might wonder how Kubespray’s stages correspond to those manual steps. 
Here is the mapping:</p><pre>Kubespray Role/Play              | The Hard Way Equivalent<br>---------------------------------|----------------------------------<br>kubernetes/preinstall            | OS configuration (sysctl, modules)<br>container-engine/containerd      | Containerd installation<br>download                         | Downloading binaries<br>etcd                             | Bootstrapping etcd cluster<br>kubernetes/node                  | Worker node setup (kubelet)<br>kubernetes/control-plane         | Control plane bootstrapping (kubeadm init)<br>kubernetes/kubeadm               | Joining nodes to the cluster (kubeadm join)<br>network_plugin                   | Pod network configuration<br>kubernetes-apps                  | DNS and addon installation</pre><p>The certificate generation that took an entire chapter in The Hard Way? Kubespray delegates that to kubeadm, which handles it automatically during kubeadm init.</p><p>The kubeconfig file generation? Also handled by kubeadm.</p><p>The data encryption configuration? Kubespray has options for that in the variables, and kubeadm configures it.</p><p>This is the power of Kubespray. It takes all those manual steps and encodes them into Ansible roles that execute reliably and repeatably.</p><h4>Conditional Variables That Control Behavior</h4><p>Several variables act as switches that change what cluster.yml does:</p><p>The deploy_container_engine variable defaults to true. If you set it to false, Kubespray skips container runtime installation entirely, assuming you have already installed containerd or another runtime yourself.</p><pre>- { role: container-engine, tags: container-engine, when: deploy_container_engine }</pre><p>The skip_downloads variable does what it says. 
Set it to true if you have pre-downloaded all binaries and images (useful for air-gap deployments):</p><pre>- { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>The etcd_events_cluster_enabled variable controls whether to create a separate etcd cluster for Kubernetes events:</p><pre>etcd_events_cluster_setup: &quot;{{ etcd_events_cluster_enabled }}&quot;</pre><p>The dns_mode variable determines which cluster DNS to install. The default is coredns. Setting it to none skips DNS installation:</p><pre>when: &quot;dns_mode != &#39;none&#39; and resolvconf_mode == &#39;host_resolvconf&#39;&quot;</pre><p>The resolvconf_mode variable controls how Kubespray manages DNS resolution on nodes. The default host_resolvconf means Kubespray will modify /etc/resolv.conf.</p><h4>Running the Playbook</h4><p>Now that you understand the structure, running the playbook is straightforward:</p><pre>cd ~/kubespray<br><br>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v</pre><p>The -i flag specifies your inventory file. The -b flag enables become (sudo). The -v flag enables verbose output.</p><p>For a full deployment, expect this to take 15–30 minutes depending on your network speed and hardware. The download stage often takes the longest as it pulls container images and binaries.</p><p>If something fails, you can usually fix the issue and rerun the same command. 
Ansible’s idempotency means tasks that already completed successfully will show as “ok” and run quickly, while the failed tasks retry.</p><p>To run only specific stages:</p><pre># Just reinstall the network plugin<br>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --tags network<br><br># Reinstall apps only<br>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --tags apps<br><br># Run everything except downloads (useful after fixing a download issue)<br>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml -b -v --skip-tags download</pre><p>The --skip-tags option is the inverse of --tags. It runs everything except the specified tags.</p><h4>Watching the Execution</h4><p>When you run cluster.yml, you will see output like this:</p><pre>PLAY [Common tasks for every playbooks] ****************************************<br><br>TASK [Check that python netaddr is installed] **********************************<br>ok: [localhost]<br>PLAY [Gather facts] ************************************************************<br>TASK [Gather minimal facts] ****************************************************<br>ok: [k8s-ctr]<br>PLAY [Prepare for etcd install] ************************************************<br>TASK [kubespray_defaults : Load kubespray defaults] ****************************<br>ok: [k8s-ctr]<br>TASK [kubernetes/preinstall : Set facts] ***************************************<br>ok: [k8s-ctr]<br>...</pre><p>Each PLAY corresponds to a play in the playbook. Each TASK corresponds to a task within a role. 
The output shows which host the task runs on and whether it changed anything.</p><p>At the end, you get a PLAY RECAP:</p><pre>PLAY RECAP *********************************************************************<br>k8s-ctr    : ok=523  changed=147  unreachable=0  failed=0  skipped=892  rescued=0  ignored=2</pre><p>This tells you:</p><ul><li>ok: Tasks that completed successfully (this count includes the changed tasks)</li><li>changed: Tasks that made changes to the system</li><li>unreachable: Hosts that could not be contacted</li><li>failed: Tasks that failed</li><li>skipped: Tasks that were skipped due to conditions</li><li>rescued: Tasks that failed but were rescued by a rescue block</li><li>ignored: Tasks that failed but had ignore_errors: true</li></ul><p>A successful run has failed=0 and unreachable=0.</p><p>The profile_tasks callback enabled in ansible.cfg also shows timing information:</p><pre>Tuesday 28 January 2026  15:23:45 +0900 (0:00:02.456)<br><br>===============================================================================<br>download : Download_container | Download image if required ----------- 48.84s<br>download : Download_container | Download image if required ----------- 33.23s<br>kubernetes/kubeadm : Join to cluster if needed ----------------------- 15.97s<br>container-engine : Containerd | Unpack containerd archive ------------- 8.34s<br>...</pre><p>This helps identify which tasks take the longest, useful for optimization or debugging slow deployments.</p><h3>Chapter 8: Lab Environment Setup</h3><p>Before we can deploy a Kubernetes cluster with Kubespray, we need machines to deploy it on. In this chapter, we will set up a lab environment using Vagrant and VirtualBox. By the end of this chapter, you will have a Rocky Linux virtual machine ready to become a single-node Kubernetes cluster.</p><h4>Prerequisites</h4><p>You will need the following software installed on your host machine:</p><p>VirtualBox version 7.2.4 or later. 
Earlier versions have compatibility issues with newer Linux kernels, particularly Rocky Linux 10, which ships with kernel 6.x. If you are running an older version of VirtualBox, upgrade it before proceeding.</p><p>Vagrant version 2.4.x or later. Vagrant will orchestrate the virtual machine lifecycle and handle the initial provisioning.</p><p>Your host machine should have at least 8 GB of RAM because we will allocate 4 GB to the virtual machine. You also need approximately 20 GB of free disk space for the VM disk image and downloaded artifacts.</p><p>To verify your installed versions, run these commands:</p><pre>VBoxManage --version<br>vagrant --version</pre><p>If you see version numbers that meet the requirements, you are ready to proceed.</p><h4>Windows Users: Disable Hyper-V First</h4><p>If you are running Windows, there is a critical step you must complete before anything else. Windows has a feature called Hyper-V, which is a Type-1 hypervisor built into the operating system. When Hyper-V is enabled, it takes exclusive control of the CPU’s hardware virtualization features (Intel VT-x or AMD-V). This prevents VirtualBox from using native virtualization, forcing it to fall back to a compatibility mode called NEM (Native Execution Manager), which runs guests through the Windows hypervisor instead.</p><p>The problem is that Rocky Linux 10 uses a modern 6.x kernel that does not play well with NEM. When VirtualBox tries to boot Rocky Linux under NEM, you will see a kernel panic within seconds of boot:</p><pre>[    1.511197] ---[ end trace 0000000000000000 ]---<br>[    1.513571] RIP: 0010:wait_for_xmitr+0x61/0xc0<br>[    1.559573] Kernel panic - not syncing: Fatal exception<br>[    1.563919] ---[ end Kernel panic - not syncing: Fatal exception ]---</pre><p>I spent an embarrassing amount of time staring at this error before discovering the root cause. 
The VirtualBox log file contained a telltale message: “HM: HMR3Init: Attempting fall back to NEM: VT-x is not available” which indicated that Hyper-V was claiming the virtualization hardware.</p><p>You might wonder why Hyper-V was enabled in the first place. Several Windows features depend on it: WSL2 (Windows Subsystem for Linux 2), Docker Desktop when using the WSL2 backend, and Windows Sandbox. If you have used any of these features, Hyper-V is probably active on your system.</p><p>To disable Hyper-V and its related components, open PowerShell as Administrator and run these commands:</p><pre>Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart<br>Disable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart<br>Disable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -NoRestart</pre><p>Then disable the hypervisor launch at boot:</p><pre>bcdedit /set hypervisorlaunchtype off</pre><p>Finally, restart your computer:</p><pre>Restart-Computer</pre><p>After the reboot, verify that Hyper-V is truly disabled. Run this command which should produce no output if Hyper-V is off:</p><pre>Get-WindowsOptionalFeature -Online | Where-Object {$_.FeatureName -like &quot;*Hyper-V*&quot; -and $_.State -eq &quot;Enabled&quot;}</pre><p>Also verify the boot configuration:</p><pre>bcdedit /enum | Select-String hypervisorlaunchtype</pre><p>The output should show “hypervisorlaunchtype Off”. If it shows “Auto” or anything else, the disable did not take effect and you need to troubleshoot further.</p><p>Be aware that disabling Hyper-V means WSL2 and Docker Desktop (with WSL2 backend) will stop working. 
You can re-enable everything after completing this lab, which I will explain at the end of the chapter.</p><h4>Creating the Vagrantfile</h4><p>Create a new directory for this lab and create a file named Vagrantfile with the following content:</p><pre>BOX_IMAGE = &quot;bento/rockylinux-10.0&quot;<br>BOX_VERSION = &quot;202510.26.0&quot;<br><br>Vagrant.configure(&quot;2&quot;) do |config|<br>  config.vm.define &quot;k8s-ctr&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;k8s-ctr&quot;<br>      vb.cpus = 4<br>      vb.memory = 4096<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;k8s-ctr&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.10&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: &quot;60100&quot;, auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;init_cfg.sh&quot;<br>  end<br>end</pre><p>Let me explain what each section does.</p><p>The BOX_IMAGE and BOX_VERSION variables specify the Vagrant box to use. We are using the Bento project’s Rocky Linux 10.0 image, which is a well-maintained community box. Pinning the version ensures reproducibility.</p><p>The vm.provider block configures VirtualBox-specific settings. The --groups option organizes the VM into a folder called “Kubespray-Lab” in the VirtualBox Manager GUI. The --nicpromisc2 option enables promiscuous mode on the second network adapter, which is necessary for some CNI plugins to function correctly. 
We allocate 4 CPU cores and 4096 MB of memory, which is sufficient for a single-node cluster. The linked_clone option saves disk space by creating a copy-on-write clone rather than a full copy of the base image.</p><p>The network configuration creates two network interfaces. VirtualBox always creates a NAT interface as the first adapter, which provides outbound internet connectivity. We add a private_network with a static IP of 192.168.10.10, which creates a Host-Only network that allows the host machine and VM to communicate directly. The forwarded_port directive maps port 60100 on the host to port 22 on the guest, providing an alternative way to SSH into the VM.</p><p>The synced_folder line disables the default Vagrant folder synchronization. We do not need it and disabling it avoids potential permission issues.</p><p>Finally, the provision directive tells Vagrant to run a shell script called init_cfg.sh after the VM boots for the first time.</p><h4>The Initialization Script</h4><p>Create a file named init_cfg.sh in the same directory as your Vagrantfile:</p><pre>#!/usr/bin/env bash<br><br>echo &quot;&gt;&gt;&gt;&gt; Initial Config Start &lt;&lt;&lt;&lt;&quot;<br>echo &quot;[TASK 1] Change Timezone and Enable NTP&quot;<br>timedatectl set-local-rtc 0<br>timedatectl set-timezone Asia/Seoul<br>echo &quot;[TASK 2] Disable firewalld and selinux&quot;<br>systemctl disable --now firewalld &gt;/dev/null 2&gt;&amp;1<br>setenforce 0<br>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config<br>echo &quot;[TASK 3] Disable and turn off SWAP &amp; Delete swap partitions&quot;<br>swapoff -a<br>sed -i &#39;/swap/d&#39; /etc/fstab<br>sfdisk --delete /dev/sda 2 &gt;/dev/null 2&gt;&amp;1<br>partprobe /dev/sda &gt;/dev/null 2&gt;&amp;1<br>echo &quot;[TASK 4] Config kernel &amp; module&quot;<br>cat &lt;&lt; EOF &gt; /etc/modules-load.d/k8s.conf<br>overlay<br>br_netfilter<br>EOF<br>modprobe overlay &gt;/dev/null 2&gt;&amp;1<br>modprobe br_netfilter &gt;/dev/null 
2&gt;&amp;1<br>cat &lt;&lt; EOF &gt;/etc/sysctl.d/k8s.conf<br>net.bridge.bridge-nf-call-iptables  = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward                 = 1<br>EOF<br>sysctl --system &gt;/dev/null 2&gt;&amp;1<br>echo &quot;[TASK 5] Setting Local DNS Using Hosts file&quot;<br>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts<br>cat &lt;&lt; EOF &gt;&gt; /etc/hosts<br>192.168.10.10 k8s-ctr<br>EOF<br>echo &quot;[TASK 6] Delete default routing - Secondary NIC&quot;<br>SECONDARY_NIC=$(ip -o -4 addr show | grep &quot;192.168.10&quot; | awk &#39;{print $2}&#39;)<br>if [ -n &quot;$SECONDARY_NIC&quot; ]; then<br>  echo &quot;Found secondary NIC: $SECONDARY_NIC, disabling default route...&quot;<br>  nmcli connection modify &quot;$SECONDARY_NIC&quot; ipv4.never-default yes 2&gt;/dev/null || true<br>  nmcli connection up &quot;$SECONDARY_NIC&quot; 2&gt;/dev/null || true<br>else<br>  echo &quot;No secondary NIC found, skipping...&quot;<br>fi<br>echo &quot;sudo su -&quot; &gt;&gt; /home/vagrant/.bashrc<br>echo &quot;&gt;&gt;&gt;&gt; Initial Config End &lt;&lt;&lt;&lt;&quot;</pre><p>This script performs essential pre-configuration that Kubernetes requires. Let me walk through each task in detail.</p><p>Task 1 sets the timezone and configures the system clock. The timedatectl set-local-rtc 0 command tells the system to use UTC for the hardware clock, which is the recommended setting for servers. Change Asia/Seoul to your preferred timezone. Accurate time is important for Kubernetes because certificate validation, log correlation, and distributed coordination all depend on synchronized clocks.</p><p>Task 2 disables the firewall and sets SELinux to permissive mode. In a production environment, you would configure proper firewall rules and SELinux policies. For a lab environment, disabling them eliminates potential networking issues that could complicate troubleshooting. The firewalld service is stopped and disabled. 
SELinux is set to permissive mode, which logs policy violations without enforcing them. The sed command makes this change persistent across reboots by modifying the SELinux configuration file.</p><p>Task 3 disables swap. Kubernetes has traditionally required swap to be disabled: by default, the kubelet refuses to start when it detects active swap. Swapping can also cause unpredictable latency spikes and interfere with resource management. The swapoff -a command immediately disables all swap. The sed command removes any swap entries from /etc/fstab so swap does not re-enable after a reboot. The sfdisk command attempts to delete the swap partition entirely, though this may fail depending on the disk layout.</p><p>Task 4 loads kernel modules and sets sysctl parameters that container networking requires. The overlay module enables OverlayFS, which containerd uses for efficient container image layer management. The br_netfilter module enables netfilter to process traffic traversing network bridges, which is necessary for iptables rules to work correctly with container networks. The sysctl parameters enable IPv4 forwarding and make bridged traffic visible to both iptables and ip6tables.</p><p>Task 5 adds a hosts file entry so the VM can resolve its own hostname to the correct IP address. This is important because Kubernetes components need to resolve node names to IP addresses.</p><p>Task 6 is where things get interesting and where I encountered problems on my first attempt. The VM in this lab has two network interfaces: the NAT interface (usually eth0 or enp0s3) and the Host-Only interface (usually eth1 or enp0s8). By default, both interfaces might try to set a default route, and the NAT interface typically wins. 
This causes problems later when Kubespray tries to determine the node’s IP address because it might pick up the NAT IP (10.0.2.15) instead of the Host-Only IP (192.168.10.10).</p><p>The original version of this script hardcoded the interface name as enp0s9, but interface names can vary depending on the VirtualBox version and the order of network adapter configuration. I modified the script to dynamically detect the interface by looking for the one that has an IP address in the 192.168.10.0/24 range:</p><pre>SECONDARY_NIC=$(ip -o -4 addr show | grep &quot;192.168.10&quot; | awk &#39;{print $2}&#39;)</pre><p>The ip -o -4 addr show command outputs IPv4 address information in a single-line format. We grep for our known IP range and extract the interface name with awk. Once we have the interface name, we tell NetworkManager to never use this interface as the default route with the ipv4.never-default option. This ensures that outbound traffic to the internet goes through the NAT interface while still allowing direct communication over the Host-Only network.</p><p>The last line adds “sudo su -” to the vagrant user’s bashrc so you automatically become root when you SSH into the VM. This is purely a convenience for lab environments.</p><h4>Launching the Virtual Machine</h4><p>With both files in place, launch the VM:</p><pre>vagrant up k8s-ctr</pre><p>Vagrant will download the Rocky Linux box if it is not already cached, create the VM, and run the initialization script. The output will show each task being executed. When you see “Initial Config End” followed by the Vagrant completion message, the VM is ready.</p><p>Verify the VM is running:</p><pre>vagrant status</pre><p>You should see:</p><pre>Current machine states:<br>k8s-ctr                   running (virtualbox)</pre><p>SSH into the VM to verify everything is working:</p><pre>vagrant ssh k8s-ctr</pre><p>You should be logged in as the vagrant user and immediately switched to root due to the bashrc modification. 
Verify the network configuration:</p><pre>ip addr show</pre><p>You should see at least two interfaces with IP addresses. One will have 10.0.2.15 (the NAT interface) and another will have 192.168.10.10 (the Host-Only interface). Verify the hostname resolves correctly:</p><pre>hostname<br>ping -c 1 k8s-ctr</pre><p>Both commands should work without errors.</p><h4>Windows Troubleshooting: VERR_ALREADY_EXISTS Error</h4><p>If you are on Windows and encounter an error containing VERR_ALREADY_EXISTS when running vagrant up a second time or after a failed first attempt, the problem is leftover VM directories. The --groups setting causes VirtualBox to move the VM into a folder, and if that folder already exists from a previous attempt, VirtualBox throws an error.</p><p>To fix this, destroy the VM and clean up the directories:</p><pre>vagrant destroy -f k8s-ctr</pre><p>Then remove the leftover directories in PowerShell:</p><pre>Remove-Item -Recurse -Force &quot;C:\Users\$env:USERNAME\VirtualBox VMs\Kubespray-Lab&quot; -ErrorAction SilentlyContinue<br>Remove-Item -Recurse -Force &quot;C:\Users\$env:USERNAME\VirtualBox VMs\k8s-ctr&quot; -ErrorAction SilentlyContinue<br>Remove-Item -Recurse -Force .vagrant -ErrorAction SilentlyContinue</pre><p>Now try vagrant up again. If the problem persists, you can work around it by commenting out the --groups line in the Vagrantfile:</p><pre># vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]</pre><p>The --groups option is purely cosmetic. It organizes VMs into folders in the VirtualBox Manager UI but has no effect on VM operation.</p><h4>Windows Troubleshooting: Unknown Connection Error</h4><p>If the init_cfg.sh script fails with “Error: unknown connection &#39;enp0s9&#39;” or a similar message, it means the original script’s hardcoded interface name does not match your system. 
Make sure you are using the modified version of the script that dynamically detects the interface name using the IP address pattern:</p><pre>SECONDARY_NIC=$(ip -o -4 addr show | grep &quot;192.168.10&quot; | awk &#39;{print $2}&#39;)</pre><p>This approach works regardless of what VirtualBox names the interfaces on your particular system.</p><h4>Setting Up SSH Keys for Ansible</h4><p>Kubespray uses Ansible, which connects to target machines via SSH. For password-less authentication, we need to set up SSH key-based access from the Ansible control node to the target nodes. In this single-node lab, the VM will act as both the Ansible control node and the target node, so it needs to be able to SSH to itself.</p><p>First, generate an SSH key pair if you do not already have one. From your host machine or from within the VM (depending on where you plan to run Ansible from):</p><pre>ssh-keygen -t ed25519 -N &quot;&quot; -f ~/.ssh/id_ed25519</pre><p>The -N &quot;&quot; option creates a key without a passphrase. In production, you would use a passphrase and an SSH agent, but for a lab environment, a passphrase-less key simplifies automation.</p><p>Copy the public key to the VM. If you are running Ansible from your host machine, use the forwarded port:</p><pre>ssh-copy-id -o StrictHostKeyChecking=no -p 60100 vagrant@127.0.0.1</pre><p>Enter “vagrant” when prompted for the password.</p><p>If you plan to run Ansible from within the VM itself (which is what we will do in the deployment chapter), SSH into the VM and set up local SSH access:</p><pre>vagrant ssh k8s-ctr</pre><p>Then generate a key and copy it to localhost:</p><pre>ssh-keygen -t ed25519 -N &quot;&quot; -f ~/.ssh/id_ed25519<br>ssh-copy-id -o StrictHostKeyChecking=no root@192.168.10.10</pre><p>Enter the root password when prompted. 
On this Vagrant box, the root password is typically “vagrant”.</p><p>Verify password-less SSH works:</p><pre>ssh root@192.168.10.10 &quot;hostname&quot;</pre><p>If this prints “k8s-ctr” without asking for a password, SSH key authentication is working correctly.</p><h4>Re-enabling Hyper-V After the Lab</h4><p>If you disabled Hyper-V on Windows and want to restore WSL2 and Docker Desktop functionality after completing this lab, run these commands in an Administrator PowerShell:</p><pre>Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All -NoRestart<br>Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -NoRestart<br>Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -NoRestart<br>bcdedit /set hypervisorlaunchtype auto</pre><p>Then restart your computer:</p><pre>Restart-Computer</pre><p>Your WSL2 distributions and their data will still be intact. Disabling and re-enabling Hyper-V does not affect the stored data.</p><h4>Verifying the Environment</h4><p>Before moving on, let us verify that the VM is correctly configured for Kubernetes:</p><p>Check that swap is disabled:</p><pre>swapon --show</pre><p>This should produce no output. If it shows swap partitions, the disable did not work and you need to troubleshoot.</p><p>Check that the kernel modules are loaded:</p><pre>lsmod | grep -E &quot;overlay|br_netfilter&quot;</pre><p>You should see both overlay and br_netfilter in the output.</p><p>Check the sysctl parameters:</p><pre>sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward</pre><p>All three should show a value of 1.</p><p>Check the SELinux status:</p><pre>getenforce</pre><p>This should show “Permissive”.</p><p>Check that the firewall is disabled:</p><pre>systemctl status firewalld</pre><p>This should show “inactive (dead)”.</p><p>If all these checks pass, your lab environment is ready for Kubespray. 
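</p><p>If you would rather not run the checks one by one, they can be bundled into a small script. This is only a sketch, assuming you are root inside the VM; the expect and run_lab_checks function names are my own and are not part of Kubespray or the lab files. It compares each command's output against the values listed above and prints PASS or FAIL per item.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original lab files): bundles the
# verification steps from this section into one function.

failures=0

# expect DESCRIPTION EXPECTED ACTUAL
# Compares two strings, prints PASS/FAIL, and counts failures.
expect() {
  local desc="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc (expected '$expected', got '$actual')"
    failures=$((failures + 1))
  fi
}

run_lab_checks() {
  # Swap: no output from swapon means no active swap.
  expect "swap is off" "" "$(swapon --show 2>/dev/null)"

  # Kernel modules: the first column of lsmod is the module name.
  for mod in overlay br_netfilter; do
    expect "module $mod loaded" "$mod" \
      "$(lsmod | awk -v m="$mod" '$1 == m {print $1}')"
  done

  # Sysctl parameters: all three should be 1.
  for key in net.bridge.bridge-nf-call-iptables \
             net.bridge.bridge-nf-call-ip6tables \
             net.ipv4.ip_forward; do
    expect "$key = 1" "1" "$(sysctl -n "$key" 2>/dev/null)"
  done

  # SELinux and firewalld state.
  expect "SELinux is permissive" "Permissive" "$(getenforce 2>/dev/null)"
  expect "firewalld is inactive" "inactive" \
    "$(systemctl is-active firewalld 2>/dev/null)"

  echo "failures: $failures"
}
```

<p>To use it, save the block to a file, source it in a root shell on the VM, and call run_lab_checks. Any FAIL line points back to the matching manual check above.</p><p>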
In the next chapter, we will configure the Kubespray inventory and customize the cluster variables before deploying Kubernetes.</p><h3>Chapter 9: Inventory Configuration and Variable Modification</h3><p>With your lab environment up and running, you now have a Rocky Linux VM sitting at 192.168.10.10, waiting to become a Kubernetes node. But Kubespray does not know anything about your infrastructure yet. You need to tell it what machines exist, what roles they should play, and how the cluster should be configured. This is where inventory configuration comes in.</p><p>Kubespray ships with a sample inventory that serves as both a template and a learning resource. The sample contains sensible defaults and commented examples that show you what options are available. Your first step is to copy this sample and make it your own.</p><pre>cp -rfp /root/kubespray/inventory/sample /root/kubespray/inventory/mycluster</pre><p>The flags here matter. The -r flag copies recursively since the inventory is a directory containing subdirectories. The -f flag forces the copy without prompting. The -p flag preserves the original file attributes including permissions, ownership, and timestamps. This last flag is often overlooked but can save you from mysterious permission issues later.</p><p>After running this command, you have a complete inventory structure at /root/kubespray/inventory/mycluster that mirrors the sample. Now you need to edit it to match your actual infrastructure.</p><h4>The Inventory File</h4><p>The heart of any Ansible-based deployment is the inventory file. This file tells Ansible which machines to manage and how to connect to them. 
In Kubespray, the inventory also determines which machines become control plane nodes, which become workers, and which host etcd.</p><p>Open the inventory file and replace its contents with your cluster definition:</p><pre>cat &lt;&lt; EOF &gt; /root/kubespray/inventory/mycluster/inventory.ini<br>k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10<br><br>[kube_control_plane]<br>k8s-ctr<br><br>[etcd:children]<br>kube_control_plane<br><br>[kube_node]<br>k8s-ctr<br>EOF</pre><p>This inventory defines a single-node cluster where one machine plays all roles. Let me walk through each part because understanding this structure will save you hours of debugging later.</p><p>The first line defines a host:</p><pre>k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10</pre><p>Here, k8s-ctr is the hostname that Ansible will use internally. It is an alias, a convenient name you can reference throughout your playbooks and inventory. The ansible_host variable tells Ansible the actual IP address or hostname to use when establishing SSH connections. When Ansible needs to connect to k8s-ctr, it will SSH to 192.168.10.10.</p><p>The ip variable is specific to Kubespray, not a standard Ansible variable. Kubespray uses this to determine which IP address the Kubernetes components should bind to and advertise. This distinction between ansible_host and ip becomes critical in environments with multiple network interfaces.</p><p>You might wonder why both variables are set to the same value. In simple environments, they often are identical. But consider a scenario where you have a bastion host or jump server. You might SSH through one IP address but want Kubernetes to communicate over a different internal network. Or consider our VirtualBox environment where each VM has multiple network interfaces.</p><h4>The VirtualBox NAT IP Trap</h4><p>If you have been following this tutorial series from the beginning, you have encountered this problem before. 
VirtualBox creates a NAT interface as the first network adapter on each VM. This interface gets the IP address 10.0.2.15 on every single VM. It is designed for outbound internet access from the VM, not for inter-VM communication.</p><p>When Ansible gathers facts about a host, it collects information about all network interfaces. Kubespray then tries to determine which IP address to use for Kubernetes components. If you do not explicitly set the ip variable, Kubespray might pick 10.0.2.15 because it appears first in the interface list. The result is a cluster where the API server advertises itself at 10.0.2.15, and every worker node tries to contact that address, which is their own localhost NAT interface. Nothing works.</p><p>I have made this mistake more times than I care to admit. Even after documenting it thoroughly in previous tutorials, I still forget to set the ip variable when setting up new test clusters. The symptom is always the same: kubeadm init succeeds on the control plane, but kubeadm join fails on workers with connection refused errors pointing to 10.0.2.15.</p><p>By explicitly setting ip=192.168.10.10, you tell Kubespray to use the Host-Only network interface for all Kubernetes communication. This is the interface where your VMs can actually reach each other.</p><h4>Understanding Ansible Groups</h4><p>The inventory file defines several groups using the bracket notation. Each group serves a specific purpose in Kubespray.</p><pre>[kube_control_plane]<br>k8s-ctr</pre><p>This group lists all nodes that should run Kubernetes control plane components. These are the nodes that will run kube-apiserver, kube-controller-manager, and kube-scheduler. In a production cluster, you would list three or five nodes here for high availability. For this single-node lab, only k8s-ctr appears in the list.</p><pre>[kube_node]<br>k8s-ctr</pre><p>This group lists all nodes that should run workloads. In Kubernetes terminology, these are the worker nodes. 
They run kubelet and kube-proxy, and the scheduler can place pods on them. Notice that k8s-ctr appears here as well. This means your control plane node will also accept workloads. In production, you often want control plane nodes dedicated to cluster management, but for a learning environment, having one node do everything is perfectly fine.</p><p>The etcd group is more interesting:</p><pre>[etcd:children]<br>kube_control_plane</pre><p>This uses Ansible’s nested group feature. Instead of listing hosts directly, the :children suffix tells Ansible that the entries below are names of other groups, not hosts. This line says “the etcd group contains all hosts that are members of kube_control_plane.”</p><p>This pattern is elegant because it creates an automatic relationship. If you add a node to kube_control_plane, it automatically becomes an etcd member too. For the common case where etcd runs on control plane nodes, this reduces duplication and prevents inconsistencies.</p><p>But this is not the only way to configure etcd. In production environments with strict performance requirements, you might want dedicated etcd nodes that do nothing else. In that case, you would define the etcd group directly:</p><pre>[etcd]<br>etcd-1 ansible_host=192.168.10.20<br>etcd-2 ansible_host=192.168.10.21<br>etcd-3 ansible_host=192.168.10.22</pre><p>Kubespray supports both patterns. The sample inventory uses the children approach because it matches the common deployment model where etcd is colocated with control plane nodes.</p><h4>The Implicit all Group</h4><p>You might have noticed that the host definition line appears outside any group:</p><pre>k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10</pre><p>In Ansible, any host defined outside of explicit groups automatically belongs to the special all group. This group is reserved by Ansible and always contains every host in the inventory. Kubespray uses this property extensively. 
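</p><p>To make the group mechanics concrete, here is a minimal Python sketch of how an inventory like the one above resolves into groups. It is a toy model, not Ansible’s parser: the function names parse_inventory and members are my own, and it deliberately ignores :vars sections, host ranges, and everything else a real inventory can contain. It only illustrates two ideas from this section: entries under a :children header are group names rather than hosts, and every host ends up in all.</p><pre>
from collections import defaultdict


def parse_inventory(text):
    """Parse a simplified INI inventory into direct hosts and child groups."""
    hosts = defaultdict(list)      # group name mapped to hosts listed directly
    children = defaultdict(list)   # group name mapped to names of child groups
    all_hosts = []                 # every host seen, in first-seen order
    section, is_children = None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            is_children = name.endswith(":children")
            section = name.split(":")[0]
            continue
        entry = line.split()[0]    # drop ansible_host=..., ip=...
        if is_children:
            children[section].append(entry)
            continue
        if section is not None:
            hosts[section].append(entry)
        if entry not in all_hosts:
            all_hosts.append(entry)
    return hosts, children, all_hosts


def members(group, hosts, children, all_hosts):
    """Resolve a group to its hosts, following :children references."""
    if group == "all":
        return list(all_hosts)
    found = list(hosts.get(group, []))
    for child in children.get(group, []):
        for host in members(child, hosts, children, all_hosts):
            if host not in found:
                found.append(host)
    return found


INVENTORY = """
k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10

[kube_control_plane]
k8s-ctr

[etcd:children]
kube_control_plane

[kube_node]
k8s-ctr
"""

hosts, children, all_hosts = parse_inventory(INVENTORY)
for group in ("etcd", "kube_control_plane", "kube_node", "all"):
    print(group, members(group, hosts, children, all_hosts))
</pre><p>Each of the four groups resolves to the single host k8s-ctr, which is what you would expect for a single-node inventory.</p><p>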
Variables defined in group_vars/all/ apply to every single node, regardless of what other groups they belong to.</p><h4>Verifying Your Inventory</h4><p>Before running any playbooks, verify that Ansible interprets your inventory correctly. Mistakes in inventory files often produce confusing errors during playbook execution. It is much easier to catch them early.</p><p>Check the etcd group:</p><pre>ansible -i /root/kubespray/inventory/mycluster/inventory.ini etcd --list-hosts</pre><p>The output should show:</p><pre>hosts (1):<br>    k8s-ctr</pre><p>Check the control plane group:</p><pre>ansible -i /root/kubespray/inventory/mycluster/inventory.ini kube_control_plane --list-hosts</pre><pre>hosts (1):<br>    k8s-ctr</pre><p>Check the worker node group:</p><pre>ansible -i /root/kubespray/inventory/mycluster/inventory.ini kube_node --list-hosts</pre><pre>hosts (1):<br>    k8s-ctr</pre><p>And verify the all group contains your host:</p><pre>ansible -i /root/kubespray/inventory/mycluster/inventory.ini all --list-hosts</pre><pre>hosts (1):<br>    k8s-ctr</pre><p>All four groups resolve to the same single host, which is exactly what we expect for a single-node cluster. If any of these commands show unexpected results, go back and check your inventory file syntax carefully. A misplaced bracket or typo can cause hosts to end up in the wrong groups.</p><h4>Modifying Cluster Configuration</h4><p>With the inventory defining your infrastructure, the next step is configuring how Kubespray should build the cluster. These settings live in the group_vars directory, which contains YAML files that set Ansible variables.</p><p>Kubespray provides sensible defaults for everything, but you will almost always want to customize some settings. For this tutorial, we will make several changes to k8s-cluster.yml, the file that controls core Kubernetes settings.</p><p><em>The changes we need to make are:</em></p><p>First, switch the CNI plugin from Calico to Flannel. 
Calico is the default and is an excellent choice for production, but Flannel is simpler and easier to understand when learning:</p><pre>sed -i &#39;s|kube_network_plugin: calico|kube_network_plugin: flannel|g&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>Second, switch kube-proxy from IPVS mode to iptables mode. IPVS offers better performance at scale, but iptables is the traditional mode and is easier to debug:</p><pre>sed -i &#39;s|kube_proxy_mode: ipvs|kube_proxy_mode: iptables|g&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>Third, disable NodeLocal DNSCache. This feature improves DNS performance by running a caching DNS server on each node, but it adds complexity that we do not need for learning:</p><pre>sed -i &#39;s|enable_nodelocaldns: true|enable_nodelocaldns: false|g&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>Fourth, enable automatic certificate renewal. The certificates that kubeadm issues for cluster components expire after one year by default, and this setting ensures they are renewed automatically:</p><pre>sed -i &#39;s|auto_renew_certificates: false|auto_renew_certificates: true|g&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>Fifth, uncomment the certificate renewal schedule. 
The setting exists in the file but is commented out by default:</p><pre>sed -i &#39;s|# auto_renew_certificates_systemd_calendar|auto_renew_certificates_systemd_calendar|g&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>After running all these commands, verify the changes took effect:</p><pre>grep -iE &#39;kube_network_plugin:|kube_proxy_mode|enable_nodelocaldns:|^auto_renew_certificates&#39; \<br>  inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml</pre><p>You should see:</p><pre>kube_network_plugin: flannel<br>kube_proxy_mode: iptables<br>enable_nodelocaldns: false<br>auto_renew_certificates: true<br>auto_renew_certificates_systemd_calendar: &quot;Mon *-*-1,2,3,4,5,6,7 03:00:00&quot;</pre><p>The calendar expression for certificate renewal looks cryptic at first glance. It follows the systemd timer calendar format. “Mon *-*-1,2,3,4,5,6,7 03:00:00” means “every Monday that falls on the 1st through 7th of any month, at 3:00 AM.” The Mon prefix restricts the weekday, and *-*-1,2,3,4,5,6,7 matches days 1 through 7 of any year and any month. In practice, this triggers certificate renewal once a month on the first Monday, at a time when the cluster is likely idle.</p><h4>Configuring Flannel</h4><p>Since we switched from Calico to Flannel, we need to configure Flannel-specific settings. Kubespray maintains separate configuration files for each CNI plugin. The Flannel settings live in k8s-net-flannel.yml.</p><p>One critical setting for VirtualBox environments is specifying which network interface Flannel should use for VXLAN traffic. Without this setting, Flannel might pick the wrong interface, just like Kubespray picking the wrong IP address when the ip variable is missing.</p><p>Check the current contents of the Flannel configuration:</p><pre>cat inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml</pre><p>The file contains various Flannel settings, mostly commented out because the defaults work for typical deployments. 
Add the interface specification:</p><pre>echo &quot;flannel_interface: enp0s8&quot; &gt;&gt; inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml</pre><p>The interface name enp0s8 corresponds to the Host-Only network adapter in VirtualBox. This is where your VMs can communicate with each other. If you used different network settings in your Vagrantfile, you might need to adjust this value. You can check available interfaces on your VM by running ip addr.</p><p>Verify the setting was added:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/k8s_cluster/k8s-net-flannel.yml</pre><p>This grep command shows only non-comment lines, filtering out the extensive documentation that Kubespray includes in its configuration files.</p><h4>What These Settings Actually Do</h4><p>Let me explain why we chose these particular settings, beyond just simplicity for learning.</p><p>Flannel versus Calico represents a fundamental tradeoff in Kubernetes networking. Calico uses BGP routing and supports NetworkPolicy enforcement, making it the preferred choice for production clusters that need fine-grained network security. Flannel uses simpler VXLAN encapsulation and does not support NetworkPolicy at all. For a single-node learning cluster where you are not going to test network policies, Flannel’s simplicity is an advantage. There is less to go wrong, and the networking behavior is easier to understand.</p><p>The kube-proxy mode setting affects how Kubernetes implements Service load balancing. In iptables mode, kube-proxy programs iptables rules that distribute traffic across pod endpoints. Every packet traverses the iptables rules, which works fine for moderate traffic loads. In IPVS mode, kube-proxy uses the Linux kernel’s IPVS (IP Virtual Server) module, which is designed for load balancing and handles high connection rates more efficiently. For a learning cluster, iptables mode is easier to inspect and debug. 
You can see the rules with iptables -L -t nat, which is helpful when troubleshooting service connectivity.</p><p>NodeLocal DNSCache improves DNS performance by running a DNS caching agent on every node. Instead of sending all DNS queries to CoreDNS pods that might be running on different nodes, queries go to the local cache first. This reduces latency and network traffic. However, it adds another moving part to the cluster, and for learning purposes, the extra complexity is not worth the performance gain.</p><p>Certificate auto-renewal is something you absolutely want in production. Kubernetes uses many certificates: the API server certificate, the kubelet client certificates, the etcd certificates, and more. These certificates typically have a one-year validity period. If they expire, your cluster stops working. The auto-renewal feature uses a systemd timer to run kubeadm certs renew automatically before expiration. The monthly schedule gives plenty of margin since certificates are valid for a full year.</p><h4>The Variable Hierarchy Reminder</h4><p>Remember from earlier chapters that Kubespray uses Ansible’s variable precedence system. The files you just modified in group_vars/k8s_cluster/ have a priority of 6 in Ansible’s hierarchy. They will override the default values defined in roles/kubespray_defaults/defaults/ (priority 2) but can themselves be overridden by host_vars files (priority 9) or command-line extra-vars (priority 22).</p><p>This means if you ever need to test a different setting without permanently changing your inventory, you can use the -e flag:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  -e &quot;kube_network_plugin=cilium&quot;</pre><p>This would deploy with Cilium instead of Flannel, overriding the inventory setting just for this run. The inventory files remain unchanged, so subsequent runs without -e would use Flannel again.</p><p>This hierarchy is intentional. 
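</p><p>As a rough mental model, you can picture the precedence rules as layers applied from lowest priority to highest, with each layer overwriting the ones before it. The sketch below is a simplification written for illustration, not how Ansible implements variable resolution: the LAYERS list and the resolve function are hypothetical, and only the priority numbers and variable values are taken from this chapter.</p><pre>
# Toy model of variable precedence: higher priority numbers win.
# The priorities (2, 6, 9, 22) are the ones quoted in the text above.
LAYERS = [
    (2, "role defaults", {"kube_network_plugin": "calico", "kube_proxy_mode": "ipvs"}),
    (6, "inventory group_vars", {"kube_network_plugin": "flannel", "kube_proxy_mode": "iptables"}),
    (9, "inventory host_vars", {}),
    (22, "extra vars (-e)", {"kube_network_plugin": "cilium"}),
]


def resolve(layers):
    """Return a dict of variable name to (value, source) after applying all layers."""
    resolved = {}
    # Apply layers from lowest to highest priority so later ones overwrite.
    for _priority, source, values in sorted(layers, key=lambda layer: layer[0]):
        for name, value in values.items():
            resolved[name] = (value, source)
    return resolved


for name, (value, source) in resolve(LAYERS).items():
    print(f"{name} = {value}  (from {source})")
</pre><p>With all four layers, kube_network_plugin resolves to cilium from the -e flag, while kube_proxy_mode stays at iptables from the inventory because no higher layer sets it.</p><p>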
Kubespray defaults represent best practices that work for most deployments. Your inventory customizations represent your organization’s requirements. Command-line overrides let you experiment without changing anything permanently.</p><h4>Preparing for Deployment</h4><p>At this point, your inventory is configured and validated. You have defined which machine will be your cluster node, specified the correct IP address to avoid VirtualBox networking pitfalls, and customized the cluster configuration to use Flannel with iptables-mode kube-proxy.</p><p>The configuration is complete. In the next chapter, we will execute the deployment and watch Kubespray transform a bare Rocky Linux VM into a functioning Kubernetes cluster. Before proceeding, it is worth double-checking that SSH connectivity still works:</p><pre>ansible -i inventory/mycluster/inventory.ini all -m ping</pre><p>If this returns success, you are ready to deploy. If it fails, troubleshoot SSH connectivity before moving on. No amount of correct inventory configuration will help if Ansible cannot reach your nodes.</p><h3>Chapter 10: Kubernetes The Kubespray Way — A Comparative Tutorial</h3><p>This chapter walks through deploying a Kubernetes cluster using Kubespray, deliberately structured to mirror the steps from Kubernetes The Hard Way. The goal here is not just to get a cluster running, but to understand exactly which manual steps Kubespray automates for you. By the end, you will have a working cluster and a clear mental map of what happens behind that single ansible-playbook command.</p><p>The tutorial follows the official Kubespray documentation “Setting up your first cluster,” which was itself inspired by Kubernetes The Hard Way. 
The difference is that instead of manually executing hundreds of commands across multiple nodes, you will define your desired state in inventory files and let Ansible do the work.</p><h4>Prerequisites</h4><p>In Kubernetes The Hard Way, the prerequisites chapter covered installing VirtualBox and Vagrant, verifying host system requirements, and planning the network topology. The same applies here, but Kubespray adds a few more requirements on the control node side.</p><p>For Kubespray, you need a machine that will run Ansible. This is called the Ansible Control Node, and it sits outside the cluster. From this machine, you will SSH into all the target nodes and orchestrate the installation. The target nodes themselves only need SSH access and Python installed.</p><p>Here is the requirement summary:</p><pre>Ansible Control Node: Linux or Mac with Python 3 installed<br>Target Nodes: SSH accessible, Python installed<br>Network: Nodes can communicate with each other, internet access for downloading images<br>Privileges: Root or sudo access on all target nodes</pre><p>The original documentation uses Google Cloud Platform for the infrastructure. This tutorial uses Vagrant instead, for several reasons. Vagrant is free, runs locally without network latency, and you have been using it throughout the previous tutorials. There is no reason to change now.</p><h4>Provisioning Compute Resources</h4><p>In Kubernetes The Hard Way, you created four VMs: a Jumpbox for running commands, a Server for the control plane, and two worker nodes. 
The Kubespray setup follows a similar pattern, but the Jumpbox becomes the Ansible Control Node.</p><p>Here is the node layout for this tutorial:</p><pre>controller (192.168.10.10) - Ansible Control Node, equivalent to the Jumpbox<br>controller-0 (192.168.10.100) - Control Plane node, equivalent to the Server<br>worker-0 (192.168.10.101) - Worker node<br>worker-1 (192.168.10.102) - Worker node</pre><p>Create a directory for this lab and add the following Vagrantfile:</p><pre>BOX_IMAGE = &quot;bento/rockylinux-10.0&quot;<br>BOX_VERSION = &quot;202510.26.0&quot;<br><br>Vagrant.configure(&quot;2&quot;) do |config|<br>  config.vm.define &quot;controller&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;controller&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;controller&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.10&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>  end<br><br>  config.vm.define &quot;controller-0&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;controller-0&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;controller-0&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.100&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>  end<br><br>  config.vm.define &quot;worker-0&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;worker-0&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;worker-0&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.101&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>  end<br><br>  config.vm.define &quot;worker-1&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/Kubespray-Lab&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;worker-1&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.host_name = &quot;worker-1&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.102&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>  end<br>end</pre><p>Bring up all four VMs:</p><pre>vagrant up</pre><p>This takes a few minutes. Once complete, verify all machines are running:</p><pre>vagrant status</pre><p>You should see all four VMs in the “running” state.</p><h4>Configuring SSH Access</h4><p>Just like in Kubernetes The Hard Way where you set up SSH from the Jumpbox to the other nodes, Ansible needs passwordless SSH access to all target nodes. Log into the controller node:</p><pre>vagrant ssh controller</pre><p>Generate an SSH key pair:</p><pre>ssh-keygen -t rsa -b 4096 -N &quot;&quot; -f ~/.ssh/id_rsa</pre><p>Now copy the public key to each target node. 
The default password for Vagrant boxes is “vagrant”:</p><pre>ssh-copy-id root@192.168.10.100<br>ssh-copy-id root@192.168.10.101<br>ssh-copy-id root@192.168.10.102</pre><p>Test that passwordless SSH works:</p><pre>ssh root@192.168.10.100 &quot;hostname&quot;<br>ssh root@192.168.10.101 &quot;hostname&quot;<br>ssh root@192.168.10.102 &quot;hostname&quot;</pre><p>Each command should return the hostname without prompting for a password. If this works, you are ready to set up Kubespray.</p><h4>Setting Up Kubespray</h4><p>This is where things diverge dramatically from The Hard Way. In The Hard Way, the next several hours would be spent manually generating certificates, creating kubeconfig files, writing systemd unit files, and bootstrapping etcd. With Kubespray, you will spend the next few minutes configuring inventory files, and then a single command handles the rest.</p><p>Still on the controller node, create a Python virtual environment. Ansible is a Python application, and using a virtual environment keeps dependencies isolated:</p><pre>python3 -m venv venv<br>source venv/bin/activate</pre><p>Clone the Kubespray repository and check out a stable release:</p><pre>git clone https://github.com/kubernetes-sigs/kubespray.git<br>cd kubespray<br>git checkout release-2.28</pre><p>Install the required Python packages:</p><pre>pip install -r requirements.txt</pre><p>You will see Ansible and its dependencies being installed. The output ends with something like:</p><pre>Successfully installed MarkupSafe-3.0.3 PyYAML-6.0.3 ansible-9.13.0 ansible-core-2.16.15 …</pre><h4>Configuring the Inventory</h4><p>Kubespray uses Ansible inventory files to define which hosts belong to which groups. Copy the sample inventory to create your own.</p><pre>cp -rfp inventory/sample inventory/mycluster</pre><p>Now edit the inventory file to match your node layout. 
Open inventory/mycluster/inventory.ini and replace its contents with:</p><pre>[all]<br>controller-0 ansible_host=192.168.10.100 ip=192.168.10.100<br>worker-0 ansible_host=192.168.10.101 ip=192.168.10.101<br>worker-1 ansible_host=192.168.10.102 ip=192.168.10.102<br><br>[kube_control_plane]<br>controller-0<br><br>[etcd]<br>controller-0<br><br>[kube_node]<br>worker-0<br>worker-1<br><br>[calico_rr]<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node<br>calico_rr</pre><p>Let me explain what each section does.</p><p>The [all] section lists every host with its connection details. The ansible_host variable tells Ansible which IP address to SSH to. The ip variable is critical and often misunderstood. It tells Kubespray which IP address to use for Kubernetes internal communication. If you omit ip, Kubespray will try to auto-detect it, and on VirtualBox it often picks the NAT interface (10.0.2.15) instead of the host-only network. This causes the cluster to fail because nodes cannot reach each other on 10.0.2.15.</p><p>The [kube_control_plane] group contains nodes that will run the Kubernetes control plane components: kube-apiserver, kube-controller-manager, and kube-scheduler.</p><p>The [etcd] group contains nodes that will run the etcd cluster. In this setup, etcd runs on the same node as the control plane, which is called “stacked etcd.” For production, you might run etcd on dedicated nodes.</p><p>The [kube_node] group contains worker nodes where your application pods will run.</p><p>The [calico_rr] group is for Calico Route Reflectors, used in large-scale deployments. Leave it empty for now.</p><p>The [k8s_cluster:children] group is a meta-group that includes all Kubernetes cluster members. The :children suffix means this group inherits members from the listed child groups.</p><p>Notice how the sample inventory uses [etcd:children] with kube_control_plane listed under it. 
This is a shorthand that says “the etcd group consists of all hosts in the kube_control_plane group.” In this tutorial, we list controller-0 directly under [etcd] for clarity, but both approaches produce the same result.</p><h4>Understanding the Default Configuration</h4><p>Before deploying, take a moment to look at the default configuration. The most important file is inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml. You can view the active settings with:</p><pre>grep &quot;^[^#]&quot; inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml | head -30</pre><p>Some key defaults you will notice:</p><pre>kube_network_plugin: calico              # The CNI plugin for pod networking<br>kube_service_addresses: 10.233.0.0/18    # The CIDR range for Kubernetes services<br>kube_pods_subnet: 10.233.64.0/18         # The CIDR range for pods<br>container_manager: containerd            # The container runtime<br>kube_proxy_mode: ipvs                    # How kube-proxy handles service routing</pre><p>For this tutorial, keep the defaults. In production, you would carefully review and customize these settings before deployment.</p><p>One optional but recommended setting is enabling the Metrics Server. Edit inventory/mycluster/group_vars/k8s_cluster/addons.yml and find the metrics_server_enabled line. Change it to:</p><pre>metrics_server_enabled: true</pre><p>The Metrics Server provides resource usage data for commands like kubectl top and is required for Horizontal Pod Autoscaler.</p><h4>Deploying the Cluster</h4><p>Everything up to this point has been preparation. Now comes the moment of truth. 
From the kubespray directory, run:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  -u root -b -v \<br>  --private-key=~/.ssh/id_rsa \<br>  cluster.yml</pre><p>Let me break down this command:</p><pre>-i inventory/mycluster/inventory.ini specifies the inventory file<br>-u root tells Ansible to connect as the root user<br>-b enables &quot;become&quot; mode for privilege escalation (though root does not need it)<br>-v enables verbose output so you can see what is happening<br>--private-key specifies the SSH key to use<br>cluster.yml is the main playbook that deploys everything</pre><p>Press enter and watch the output scroll by. This takes 15 to 30 minutes depending on your hardware and internet speed. You will see hundreds of tasks execute across multiple plays.</p><p>If everything succeeds, the final output looks like:</p><pre>PLAY RECAP *********************************************************************<br>controller-0 : ok=XXX changed=XX unreachable=0 failed=0 …<br>worker-0 : ok=XXX changed=XX unreachable=0 failed=0 …<br>worker-1 : ok=XXX changed=XX unreachable=0 failed=0 …</pre><p>The key metric is failed=0 for all hosts.</p><h4>Troubleshooting: Inventory Path Issues</h4><p>If you see warnings like “Unable to parse inventory” and all plays show “skipping: no hosts matched,” check your inventory path. The official documentation uses a directory path:</p><pre>-i inventory/mycluster/</pre><p>However, this does not work reliably in all environments. Specify the file directly:</p><pre>-i inventory/mycluster/inventory.ini</pre><h4>Troubleshooting: USERNAME Variable</h4><p>If you get an error about expecting an argument for -u, or if the ansible-playbook help is displayed instead of running the playbook, the $USERNAME environment variable is not set. 
The official documentation assumes you ran:</p><pre>USERNAME=$(whoami)</pre><p>Either set this variable or replace $USERNAME with root directly in the command.</p><h4>Troubleshooting: VirtualBox NAT IP Problem</h4><p>This is the most common issue with VirtualBox environments, and you have encountered it before in previous tutorials. The deployment fails during the worker node join phase with an error like:</p><pre>error execution phase preflight: couldn&#39;t validate the identity of the API Server: Get &quot;https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s&quot;: dial tcp 10.0.2.15:6443: connect: connection refused</pre><p>The problem is that Kubespray detected the wrong IP address. VirtualBox VMs have a NAT interface as their first network adapter, which gets the IP 10.0.2.15. This IP is not routable between VMs. The fix is to explicitly set the ip variable in your inventory, which you already did in the configuration above.</p><p>If you hit this error, verify your inventory.ini has ip= set for every host:</p><pre>[all]<br>controller-0 ansible_host=192.168.10.100 ip=192.168.10.100<br>worker-0 ansible_host=192.168.10.101 ip=192.168.10.101<br>worker-1 ansible_host=192.168.10.102 ip=192.168.10.102</pre><p>Then run the playbook again. Because the playbook is idempotent, tasks whose results are already in place report no changes, so the run effectively picks up where it failed.</p><h4>Troubleshooting: etcd Health Check Failure After IP Change</h4><p>Here is a more insidious problem. Suppose you initially ran the playbook without the ip variable set, the control plane was configured with 10.0.2.15, then you fixed the inventory and ran again. This time you might see:</p><pre>TASK [etcd : Configure | Wait for etcd cluster to be healthy] *********************<br>fatal: [controller-0]: FAILED! 
=&gt; { &quot;cmd&quot;: &quot;… /usr/local/bin/etcdctl endpoint --cluster health …&quot;, &quot;stderr&quot;: &quot;… dial tcp 192.168.10.100:2379: connect: connection refused …&quot; }</pre><p>The etcd service was already installed and configured to bind to 10.0.2.15. Changing the ip variable does not automatically reconfigure already-running services. In practice, the rerun found a service that was already installed and running and left its existing configuration in place.</p><p>The solution is to reset the cluster and start fresh:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  -u root -b -v \<br>  --private-key=~/.ssh/id_rsa \<br>  reset.yml</pre><p>When prompted with “Are you sure you want to reset cluster state?”, type yes. This removes all Kubernetes components and etcd data, returning the nodes to a clean state.</p><p>After the reset completes, run cluster.yml again:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  -u root -b -v \<br>  --private-key=~/.ssh/id_rsa \<br>  cluster.yml</pre><p>This time, everything installs correctly with the right IP addresses.</p><p>Note: When you run cluster.yml after reset.yml, you might see an error during the etcd version check:</p><pre>fatal: [controller-0]: FAILED! =&gt; {&quot;msg&quot;: &quot;[Errno 2] No such file or directory: b&#39;/usr/local/bin/etcd&#39;&quot;}</pre><p>This is expected. The reset removed the etcd binary, so the version check fails. Kubespray handles this gracefully and proceeds to install etcd. Check the final PLAY RECAP: if failed=0, everything is fine.</p><h4>Accessing the Cluster</h4><p>The cluster is deployed, but you still need kubectl access from outside the cluster. Kubespray created a kubeconfig file on the control plane node at /etc/kubernetes/admin.conf. 
You need to copy this to your Ansible control node.</p><p>Still on the controller node, create the kubectl configuration directory:</p><pre>mkdir -p ~/.kube</pre><p>Copy the kubeconfig from the control plane:</p><pre>scp root@192.168.10.100:/etc/kubernetes/admin.conf ~/.kube/config</pre><p>There is one more step. The kubeconfig file was generated for use on the control plane node itself, so it points to localhost:</p><pre>grep server ~/.kube/config</pre><p>You will see server: <a href="https://127.0.0.1:6443">https://127.0.0.1:6443</a>. This does not work from the controller node because the API server is not running locally. Fix it by replacing 127.0.0.1 with the actual control plane IP:</p><pre>sed -i &#39;s/127.0.0.1/192.168.10.100/g&#39; ~/.kube/config</pre><p>Now install kubectl if your controller node does not have it already:</p><pre>curl -LO &quot;https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl&quot;<br>chmod +x kubectl<br>sudo mv kubectl /usr/local/bin/</pre><p>Test the connection:</p><pre>kubectl get nodes</pre><p>You should see all three nodes:</p><pre>NAME           STATUS   ROLES           AGE   VERSION<br>controller-0   Ready    control-plane   11m   v1.32.11<br>worker-0       Ready    &lt;none&gt;          10m   v1.32.11<br>worker-1       Ready    &lt;none&gt;          10m   v1.32.11</pre><p>The cluster is ready.</p><h4>Smoke Tests</h4><p>Run the same verification tests from Kubernetes The Hard Way to confirm everything works.</p><p>If you enabled Metrics Server, check that resource metrics are available:</p><pre>kubectl top nodes</pre><p>After a few minutes, once metrics collection has started, you should see something like:</p><pre>NAME           CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)<br>controller-0   98m          7%       1939Mi          66%<br>worker-0       40m          2%       977Mi           99%<br>worker-1       43m          3%       994Mi           101%</pre><h4>Network: Pod to Pod Communication</h4><p>Open two terminal windows, both connected to the controller node.</p><p>In the first terminal, create a pod and note its IP:</p><pre>kubectl run myshell1 -it --rm --image busybox -- sh</pre><p>Once inside the 
pod:</p><pre>hostname -i</pre><p>Note the IP address, something like 10.233.107.4.</p><p>In the second terminal, create another pod and ping the first one:</p><pre>kubectl run myshell2 -it --rm --image busybox -- sh</pre><p>Inside the second pod:</p><pre>ping 10.233.107.4</pre><p>You should see successful ping responses:</p><pre>64 bytes from 10.233.107.4: seq=0 ttl=62 time=0.686 ms<br>64 bytes from 10.233.107.4: seq=1 ttl=62 time=0.620 ms</pre><p>This confirms pod-to-pod networking works. Calico is doing its job.</p><p>Exit both pods with exit or Ctrl-D.</p><h4>Deployments</h4><p>Create a simple deployment:</p><pre>kubectl create deployment nginx --image=nginx</pre><p>Watch the pod come up:</p><pre>kubectl get pods -l app=nginx</pre><p>Wait until STATUS shows Running.</p><h4>Port Forwarding</h4><p>Forward a local port to the nginx pod:</p><pre>POD_NAME=$(kubectl get pods -l app=nginx -o jsonpath=&quot;{.items[0].metadata.name}&quot;)<br>kubectl port-forward $POD_NAME 8080:80</pre><p>In another terminal:</p><pre>curl --head http://127.0.0.1:8080</pre><p>You should see HTTP/1.1 200 OK and nginx headers.</p><h4>Logs</h4><p>Retrieve container logs. POD_NAME only exists in the shell where you set it, so either stop the port-forward with Ctrl-C and continue there, or set the variable again in the new terminal:</p><pre>kubectl logs $POD_NAME</pre><p>You will see nginx startup logs and, if you ran the curl command, the access log entry.</p><h4>Exec</h4><p>Execute a command inside the container:</p><pre>kubectl exec -ti $POD_NAME -- nginx -v</pre><p>Output: nginx version: nginx/1.x.x</p><h4>Services: NodePort</h4><p>Expose the deployment as a NodePort service:</p><pre>kubectl expose deployment nginx --port 80 --type NodePort</pre><p>Find the assigned node port:</p><pre>NODE_PORT=$(kubectl get svc nginx -o jsonpath=&#39;{.spec.ports[0].nodePort}&#39;)<br>echo $NODE_PORT</pre><p>Access the service through a worker node’s IP:</p><pre>curl -I http://192.168.10.101:$NODE_PORT</pre><p>You should get HTTP/1.1 200 OK.</p><h4>Local DNS</h4><p>Test cross-namespace DNS resolution. 
First, create a namespace and deploy nginx there:</p><pre>kubectl create namespace dev<br>kubectl create deployment nginx --image=nginx -n dev<br>kubectl expose deployment nginx --port 80 --type ClusterIP -n dev</pre><p>Now create a pod in the default namespace and access the service in the dev namespace by DNS name:</p><p>kubectl run curly -it --rm --image curlimages/curl:7.70.0 -- /bin/sh</p><p>Inside the pod:</p><p>curl --head <a href="http://nginx.dev:80">http://nginx.dev:80</a></p><p>This resolves nginx.dev to nginx.dev.svc.cluster.local and returns the nginx response. CoreDNS is working correctly.</p><h4>Cleaning Up</h4><p>When you are done experimenting, clean up the Kubernetes resources:</p><pre>kubectl delete namespace dev<br>kubectl delete deployment nginx<br>kubectl delete svc nginx</pre><p>To reset the cluster but keep the VMs (for example, to redeploy with different settings):</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini \<br>  -u root -b -v \<br>  --private-key=~/.ssh/id_rsa \<br>  reset.yml</pre><p>To completely remove everything:</p><p>exit # Leave the controller VM<br>vagrant destroy -f</p><h3>Chapter 11: containerd Configuration File Explained</h3><p>I kept running into the same problem. Every time I needed to configure containerd, I found myself searching “containerd config.toml” and piecing together information from five different sources. After the third time doing this in a month, I decided to sit down and actually understand the configuration file properly. This chapter is the result of that effort.</p><h4>Understanding TOML Before We Start</h4><p>containerd uses TOML for its configuration file. If you’ve never encountered TOML before, you might wonder why not just use JSON or YAML like everything else in the Kubernetes ecosystem. The answer comes down to readability and structure.</p><p>TOML stands for Tom’s Obvious, Minimal Language. It looks similar to INI files but has a more rigorous specification. 
Here’s what basic TOML looks like:</p><pre># Comments start with hash<br>key = &quot;value&quot;<br>number = 42<br>boolean = true<br><br>[section]<br>  nested_key = &quot;nested_value&quot;<br><br>[section.subsection]<br>  deep_key = &quot;deep_value&quot;</pre><p>containerd chose TOML because its configuration is deeply hierarchical. Plugins have sub-configurations, which have their own sub-configurations. TOML handles this nesting elegantly while remaining readable.</p><p>JSON would work but becomes a mess of brackets. YAML works but its significant whitespace causes endless debugging sessions when someone mixes tabs and spaces.</p><p>The bracket notation in TOML defines what are called tables, which are essentially sections or namespaces. When you see [grpc], everything that follows belongs to the grpc configuration until the next table declaration. When you see [plugins.&quot;io.containerd.grpc.v1.cri&quot;], that&#39;s a table with a quoted key because the key contains dots that would otherwise be interpreted as nested tables.</p><p>This distinction matters. Consider the difference:</p><pre># This creates nested tables: plugins.io.containerd.grpc.v1.cri<br>[plugins.io.containerd.grpc.v1.cri]<br><br># This creates a single table with a key that happens to contain dots<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]</pre><p>containerd uses the second form. The entire string “io.containerd.grpc.v1.cri” is the plugin identifier, not a hierarchy.</p><h4>Where containerd Looks for Configuration</h4><p>When containerd starts, it looks for configuration in this order. First, it checks if you passed a --config flag:</p><pre>containerd --config /path/to/custom-config.toml</pre><p>If you did, it uses that file and ignores everything else. If you didn’t pass --config, it looks at the default path:</p><pre>/etc/containerd/config.toml</pre><p>If that file doesn’t exist either, containerd runs with built-in defaults. 
You can see these defaults anytime:</p><pre>containerd config default</pre><p>This command dumps the complete default configuration to stdout. It’s incredibly useful when you’re trying to figure out what options exist. If you want to see what configuration containerd is actually running with right now, use:</p><pre>containerd config dump</pre><p>The difference matters. The default command shows what containerd would use if you had no config file. The dump command shows the merged result of your config file plus defaults for anything you didn’t specify.</p><h4>Configuration File Versions</h4><p>The config file has a version field, and this trips people up constantly. If you don’t specify a version, containerd assumes version 1, which is deprecated and removed in containerd 2.x. You should always explicitly set the version.</p><p>For containerd 1.x, use version 2:</p><pre>version = 2</pre><p>For containerd 2.x, use version 3:</p><pre>version = 3</pre><p>The version numbers refer to the configuration schema, not the containerd version. This naming is confusing but we’re stuck with it.</p><p>What changed between versions? Version 1 to 2 added the “io.containerd.” prefix to all plugin identifiers. Version 2 to 3 reorganized the CRI plugin structure and changed some default values.</p><p>containerd performs automatic migration when it starts. If you have a version 2 config and run containerd 2.x, it converts the config in memory to version 3 format. Your original file stays unchanged, but there’s a small performance cost at startup. If you want to avoid this, you can pre-migrate your config:</p><pre>containerd config migrate /etc/containerd/config.toml &gt; /etc/containerd/config.toml.new</pre><p>Review the output before replacing your original. 
Migrated configs aren’t backward compatible, so you can’t easily roll back to an older containerd version.</p><h4>Global Settings at the Top</h4><p>The top of config.toml contains settings that apply to containerd as a whole, not to any specific plugin.</p><pre>version = 3<br>root = &quot;/var/lib/containerd&quot;<br>state = &quot;/run/containerd&quot;<br>temp = &quot;&quot;<br>oom_score = 0</pre><p>The root directory is where containerd stores persistent data. Images, container metadata, snapshots — everything that should survive a reboot lives here. Don’t put this on a tmpfs.</p><p>The state directory holds runtime state. Sockets, PID files, information about running containers. This data is ephemeral and gets recreated on restart. That’s why it defaults to /run, which is typically a tmpfs.</p><p>The temp setting specifies where containerd creates temporary files. If empty, it uses the system default (usually /tmp).</p><p>The oom_score adjusts how the Linux OOM killer prioritizes containerd. Values range from -1000 to 1000. Lower values mean the process is less likely to be killed. A value of 0 means no adjustment from the default.</p><p>There’s also an imports field that lets you split configuration across multiple files:</p><pre>version = 3<br>imports = [&quot;/etc/containerd/conf.d/*.toml&quot;]</pre><p>This works like nginx’s conf.d pattern. You can drop additional config files into that directory and they’ll be merged. In version 3, the default imports path is /etc/containerd/conf.d/*.toml. Files are processed in glob order, and later files override earlier ones.</p><p>This is useful for managing configuration in production. Your base config stays in config.toml, while environment-specific settings go in conf.d. 
For example:</p><pre>/etc/containerd/config.toml              # Base configuration<br>/etc/containerd/conf.d/50-registry.toml  # Private registry settings<br>/etc/containerd/conf.d/99-nvidia.toml    # GPU runtime configuration</pre><h4>The gRPC Section</h4><p>containerd communicates with clients through gRPC. This is not an implementation choice — it’s a requirement. The Container Runtime Interface (CRI) that Kubernetes uses is defined as a gRPC service using Protocol Buffers. Any container runtime that wants to work with Kubernetes must speak gRPC.</p><pre>[grpc]<br>  address = &quot;/run/containerd/containerd.sock&quot;<br>  uid = 0<br>  gid = 0<br>  max_recv_message_size = 16777216<br>  max_send_message_size = 16777216</pre><p>The address is the Unix socket path where containerd listens. The kubelet connects to this socket to create and manage containers. The uid and gid control who owns the socket file. Setting both to 0 means root owns it, so only root can connect.</p><p>The message size limits are 16MB by default. You rarely need to change these unless you’re doing something unusual with very large container specs.</p><p>There’s also support for TCP with TLS if you need remote access:</p><pre>[grpc]<br>  address = &quot;/run/containerd/containerd.sock&quot;<br>  tcp_address = &quot;0.0.0.0:10000&quot;<br>  tcp_tls_cert = &quot;/etc/containerd/cert.pem&quot;<br>  tcp_tls_key = &quot;/etc/containerd/key.pem&quot;<br>  tcp_tls_ca = &quot;/etc/containerd/ca.pem&quot;</pre><p>Most people leave tcp_address empty because exposing containerd over the network is a significant security risk.</p><h4>The ttrpc Section</h4><p>You’ll also see a ttrpc section:</p><pre>[ttrpc]<br>  address = &quot;&quot;<br>  uid = 0<br>  gid = 0</pre><p>TTRPC is a lightweight RPC protocol that containerd developed for communication with shims. A shim is the process that sits between containerd and the actual container process. 
When containerd needs to talk to a shim, it uses TTRPC instead of gRPC because the overhead is lower.</p><p>In most deployments, you can ignore this section entirely. An empty address does not disable TTRPC: containerd falls back to a socket derived from the gRPC address with .ttrpc appended (for example /run/containerd/containerd.sock.ttrpc) and reuses the gRPC uid and gid.</p><h4>Debug and Metrics</h4><p>For troubleshooting, there’s a debug section:</p><pre>[debug]<br>  address = &quot;/run/containerd/debug.sock&quot;<br>  uid = 0<br>  gid = 0<br>  level = &quot;info&quot;<br>  format = &quot;text&quot;</pre><p>The level can be trace, debug, info, warn, error, fatal, or panic. During initial setup, setting this to debug helps immensely. In production, info or warn is appropriate.</p><p>The format can be text or json. Text is readable for humans watching logs. JSON is better when you’re feeding logs into a log aggregation system.</p><p>containerd can also expose Prometheus metrics:</p><pre>[metrics]<br>  address = &quot;127.0.0.1:1338&quot;<br>  grpc_histogram = false</pre><p>By default, address is empty and metrics are disabled. Set an address to enable the metrics endpoint. The grpc_histogram option adds detailed gRPC latency histograms but increases cardinality significantly.</p><h4>The Plugins Section</h4><p>Now we get to the heart of containerd configuration. Almost everything in containerd is a plugin. The CRI implementation that Kubernetes talks to? A plugin. The snapshotter that manages image layers? A plugin. The runtime that actually runs containers? A plugin.</p><p>Each plugin has its own configuration section under [plugins]:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  # CRI plugin configuration</pre><pre>[plugins.&quot;io.containerd.gc.v1.scheduler&quot;]<br>  # Garbage collection scheduler</pre><pre>[plugins.&quot;io.containerd.snapshotter.v1.overlayfs&quot;]<br>  # OverlayFS snapshotter</pre><p>The plugin identifiers follow a pattern: io.containerd.&lt;type&gt;.&lt;version&gt;.&lt;name&gt;. 
For example, io.containerd.grpc.v1.cri is a gRPC service plugin, version 1, named cri.</p><p>Let me walk through the important plugins.</p><p>The garbage collection scheduler (io.containerd.gc.v1.scheduler) cleans up unused content:</p><pre>[plugins.&quot;io.containerd.gc.v1.scheduler&quot;]<br>  pause_threshold = 0.02<br>  deletion_threshold = 0<br>  mutation_threshold = 100<br>  schedule_delay = &quot;0s&quot;<br>  startup_delay = &quot;100ms&quot;</pre><p>The thresholds control when garbage collection runs, and they do not all point the same way. The pause_threshold caps the share of time GC may spend holding the database lock (0.02 means roughly 2%), so a higher value allows more frequent collection. The deletion_threshold and mutation_threshold are counts of deletions and database mutations after which a collection is scheduled, so for these two a higher value means less frequent collection; a deletion_threshold of 0 means deletions alone never trigger GC. The startup_delay prevents GC from running immediately when containerd starts, giving the system time to stabilize.</p><p>The metadata plugin (io.containerd.metadata.v1.bolt) stores containerd’s metadata in a BoltDB database:</p><pre>[plugins.&quot;io.containerd.metadata.v1.bolt&quot;]<br>  content_sharing_policy = &quot;shared&quot;</pre><p>The content_sharing_policy determines whether content already present in one containerd namespace can be reused by another. Shared is the default and appropriate for most cases; the alternative, isolated, makes each namespace ingest its own copy.</p><p>The overlayfs snapshotter (io.containerd.snapshotter.v1.overlayfs) manages container filesystem layers. On most Linux systems, this is the default snapshotter and usually needs no configuration.</p><h4>The CRI Plugin in Detail</h4><p>For Kubernetes integration, the CRI plugin is what matters most. In version 2 configs, everything is under one plugin:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]</pre><p>In version 3, it’s split into two:</p><pre>[plugins.&quot;io.containerd.cri.v1.images&quot;]<br>[plugins.&quot;io.containerd.cri.v1.runtime&quot;]</pre><p>I’ll focus on version 2 since that’s what most people are still using. 
The concepts translate directly to version 3.</p><p>The most commonly configured options:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  sandbox_image = &quot;registry.k8s.io/pause:3.10&quot;<br>  max_concurrent_downloads = 3<br>  max_container_log_line_size = 16384</pre><p>The sandbox_image is the pause container that Kubernetes uses to hold network namespaces for pods. If you’re in an air-gapped environment, you need to change this to point to your internal registry.</p><p>The max_concurrent_downloads limits how many image layers containerd downloads in parallel. Increase this if you have good network bandwidth and want faster image pulls. Decrease it if you’re overwhelming your registry or network.</p><p>Underneath the CRI plugin, there’s a containerd subsection for runtime configuration:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>  snapshotter = &quot;overlayfs&quot;<br>  default_runtime_name = &quot;runc&quot;</pre><p>The snapshotter setting determines which snapshotter plugin manages image layers. OverlayFS is the standard choice on modern Linux.</p><p>The default_runtime_name specifies which runtime to use when one isn’t explicitly requested. This points to a runtime defined in the runtimes subsection.</p><h4>Configuring Container Runtimes</h4><p>The runtimes subsection is where you define the OCI runtimes that containerd can use:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>  runtime_type = &quot;io.containerd.runc.v2&quot;<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>    BinaryName = &quot;&quot;<br>    SystemdCgroup = true</pre><p>This defines a runtime named “runc”. The runtime_type specifies which shim to use. The io.containerd.runc.v2 shim is the standard choice for runc.</p><p>The options section contains runtime-specific settings. For runc, the important ones are:</p><p>BinaryName specifies the path to the runc binary. 
If empty, containerd searches PATH for “runc”. You can set an explicit path like “/usr/local/bin/runc” if you have multiple runc versions installed.</p><p>SystemdCgroup is critical for Kubernetes. When true, runc uses systemd to manage cgroups instead of the filesystem directly. This must match your kubelet configuration. If kubelet is configured with cgroupDriver: systemd (the kubeadm default, and the recommended setting on systemd-based hosts), then SystemdCgroup must be true. If the two disagree, kubelet and the runtime end up managing separate cgroup hierarchies, which typically shows up as node instability under resource pressure.</p><p>You can define multiple runtimes for different use cases:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>  default_runtime_name = &quot;runc&quot;<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>  runtime_type = &quot;io.containerd.runc.v2&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>    SystemdCgroup = true<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.nvidia]<br>  runtime_type = &quot;io.containerd.runc.v2&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.nvidia.options]<br>    BinaryName = &quot;/usr/local/nvidia/toolkit/nvidia-container-runtime&quot;<br>    SystemdCgroup = true<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.kata]<br>  runtime_type = &quot;io.containerd.kata.v2&quot;</pre><p>Here I’ve defined three runtimes. The default is runc for normal containers. There’s also an nvidia runtime for GPU workloads and a kata runtime for VM-isolated containers.</p><p>To use a non-default runtime in Kubernetes, you create a RuntimeClass:</p><pre>apiVersion: node.k8s.io/v1<br>kind: RuntimeClass<br>metadata:<br>  name: nvidia<br>handler: nvidia</pre><p>The handler field must match the runtime name in your containerd config. 
Then pods can request this runtime:</p><pre>apiVersion: v1<br>kind: Pod<br>metadata:<br>  name: gpu-pod<br>spec:<br>  runtimeClassName: nvidia<br>  containers:<br>  - name: cuda<br>    image: nvidia/cuda:12.0-base</pre><h4>CNI Configuration</h4><p>The CRI plugin also manages CNI (Container Network Interface) settings:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.cni]<br>  bin_dir = &quot;/opt/cni/bin&quot;<br>  conf_dir = &quot;/etc/cni/net.d&quot;</pre><p>The bin_dir is where CNI plugin binaries live. The conf_dir is where CNI configuration files go. These are the standard paths that most CNI plugins expect.</p><p>In a Kubernetes cluster, you typically don’t configure CNI through containerd directly. Instead, you install a CNI plugin like Calico or Flannel, which drops its own configuration files into conf_dir. But understanding these paths helps when troubleshooting network issues.</p><h4>Registry Configuration</h4><p>Registry configuration changed significantly between versions. In version 2, there were two approaches. 
The old way used inline configuration:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.registry.mirrors]<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.registry.mirrors.&quot;docker.io&quot;]<br>    endpoint = [&quot;https://registry-1.docker.io&quot;]</pre><p>The new way (recommended) uses a config_path:</p><pre>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.registry]<br>  config_path = &quot;/etc/containerd/certs.d&quot;</pre><p>With config_path, you create a directory structure:</p><pre>/etc/containerd/certs.d/<br>├── docker.io/<br>│   └── hosts.toml<br>├── gcr.io/<br>│   └── hosts.toml<br>└── my-registry.example.com/<br>    ├── hosts.toml<br>    └── ca.crt</pre><p>Each registry gets its own directory with a hosts.toml file:</p><pre># /etc/containerd/certs.d/docker.io/hosts.toml<br>server = &quot;https://docker.io&quot;<br><br>[host.&quot;https://registry-1.docker.io&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;]<br>[host.&quot;https://mirror.example.com&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;]<br>  skip_verify = false</pre><p>This approach is more flexible. You can add registries without editing the main config file, and you can include CA certificates alongside the host configuration.</p><p>The capabilities field specifies what operations are allowed. Pull means downloading images, resolve means looking up image metadata, push means uploading images. Most configurations only need pull and resolve.</p><p>The skip_verify option disables TLS certificate verification. Never set this to true in production. 
If you’re using a private CA, put the CA certificate in the same directory and reference it:</p><pre>[host.&quot;https://my-registry.example.com&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;, &quot;push&quot;]<br>  ca = &quot;/etc/containerd/certs.d/my-registry.example.com/ca.crt&quot;</pre><h4>A Complete Configuration Example</h4><p>Let me put this all together into a complete version 2 configuration that you might use in production:</p><pre>version = 2<br><br>root = &quot;/var/lib/containerd&quot;<br>state = &quot;/run/containerd&quot;<br>[grpc]<br>  address = &quot;/run/containerd/containerd.sock&quot;<br>  uid = 0<br>  gid = 0<br>[debug]<br>  level = &quot;info&quot;<br>[metrics]<br>  address = &quot;127.0.0.1:1338&quot;<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  sandbox_image = &quot;registry.k8s.io/pause:3.10&quot;<br>  max_concurrent_downloads = 3<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>    snapshotter = &quot;overlayfs&quot;<br>    default_runtime_name = &quot;runc&quot;<br>    [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>      runtime_type = &quot;io.containerd.runc.v2&quot;<br>      <br>      [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>        SystemdCgroup = true<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.cni]<br>    bin_dir = &quot;/opt/cni/bin&quot;<br>    conf_dir = &quot;/etc/cni/net.d&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.registry]<br>    config_path = &quot;/etc/containerd/certs.d&quot;</pre><p>This configuration enables metrics, uses systemd cgroups (matching a typical kubelet configuration), and uses the hosts.toml approach for registry configuration.</p><h4>K3s and containerd</h4><p>If you use K3s, you’ll encounter a different configuration pattern. 
K3s embeds containerd and generates its own configuration file at:</p><pre>/var/lib/rancher/k3s/agent/etc/containerd/config.toml</pre><p>This file has a comment at the top:</p><pre># File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.</pre><p>K3s regenerates this file on every start. If you need to customize containerd in K3s, you create a template file:</p><pre>/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl</pre><p>K3s uses Go templates, so you can include the base configuration and add your customizations:</p><pre>{{ template &quot;base&quot; . }}<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.nvidia]<br>  runtime_type = &quot;io.containerd.runc.v2&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.nvidia.options]<br>    BinaryName = &quot;/usr/local/nvidia/toolkit/nvidia-container-runtime&quot;<br>    SystemdCgroup = true</pre><p>The {{ template &quot;base&quot; . }} line includes K3s&#39;s default configuration. The base template already declares the config version, so the template file should not set it again; a duplicate version key makes the generated TOML invalid. Your additions come after.</p><p>For simple registry configuration changes in K3s, there’s an easier approach. 
Edit /etc/rancher/k3s/registries.yaml:</p><pre>mirrors:<br>  docker.io:<br>    endpoint:<br>      - &quot;https://mirror.example.com&quot;<br>configs:<br>  &quot;my-registry.example.com&quot;:<br>    tls:<br>      ca_file: &quot;/etc/rancher/k3s/certs/ca.crt&quot;</pre><p>K3s translates this YAML into the appropriate containerd configuration automatically.</p><h4>Verifying Your Configuration</h4><p>After changing containerd configuration, restart the service and verify everything is working:</p><pre>systemctl restart containerd<br>systemctl status containerd</pre><p>Check that the configuration loaded correctly:</p><pre>containerd config dump | grep -A5 &quot;runtimes.runc&quot;</pre><p>Test that you can pull images:</p><pre>crictl pull docker.io/library/alpine:latest</pre><p>If you’re running Kubernetes, verify that kubelet can communicate with containerd:</p><pre>crictl info</pre><p>This command returns JSON with containerd’s current state. Look for the runtime configuration to confirm your changes took effect.</p><p>When troubleshooting, the containerd logs are essential:</p><pre>journalctl -u containerd -f</pre><p>Watch these logs while performing operations to understand what containerd is doing and where failures occur.</p><p>The configuration might seem overwhelming at first, but most of it is things you’ll never touch. In practice, you’ll spend 90% of your configuration time in the CRI plugin section, dealing with runtimes and registries. The rest is set once and forgotten.</p><h3>Chapter 12: High Availability Configuration</h3><p>When you run a single control plane node, you have a single point of failure. The API server goes down, and suddenly kubectl stops working, new pods cannot be scheduled, and your cluster is effectively dead. In production, this is unacceptable. You need high availability.</p><p>Kubespray was built with HA in mind from the start. 
Unlike kubeadm where you have to manually configure load balancers, set up additional control plane nodes, and worry about etcd quorum, Kubespray handles most of this automatically. But “automatically” does not mean “magically.” You still need to understand what is happening under the hood, because when something breaks at 3 AM, you will be the one fixing it.</p><p>Let me walk you through how Kubespray implements HA, starting with the control plane.</p><h4>Control Plane High Availability</h4><p>A Kubernetes control plane consists of three main components: the API server, the controller manager, and the scheduler. Each of these has different HA characteristics.</p><p>The API server is stateless. It reads from and writes to etcd, but it does not maintain any state itself. This means you can run multiple API server instances simultaneously, and they will all work correctly. Every API server instance is active and can handle requests. This is called active-active configuration.</p><p>The controller manager and scheduler are different. They maintain internal state and make decisions that could conflict if multiple instances ran simultaneously. Imagine two controller managers both deciding to create a replacement pod for a failed one.</p><p>You would end up with two replacement pods instead of one. To prevent this, Kubernetes uses leader election. Only one instance is active at any time, while the others wait in standby. If the leader fails, one of the standby instances takes over. This is active-standby configuration.</p><p>When you configure multiple control plane nodes in Kubespray, here is what happens. On each control plane node, Kubespray deploys the API server, controller manager, and scheduler as static pods. The API servers all run and accept requests. The controller managers and schedulers all run, but only one of each is the leader. 
The others continuously try to acquire the leadership lease stored in etcd.</p><p>You can see the leader election in action by checking the lease objects:</p><pre>kubectl get lease -n kube-system</pre><p>You will see leases named kube-controller-manager and kube-scheduler. The holderIdentity field shows which node currently holds the leadership.</p><p>To configure an HA control plane with Kubespray, you simply list multiple nodes in the kube_control_plane group. Here is an example inventory for a three-node HA setup:</p><pre>[all]<br>k8s-ctr1 ansible_host=192.168.10.11 ip=192.168.10.11<br>k8s-ctr2 ansible_host=192.168.10.12 ip=192.168.10.12<br>k8s-ctr3 ansible_host=192.168.10.13 ip=192.168.10.13<br>k8s-w1 ansible_host=192.168.10.21 ip=192.168.10.21<br>k8s-w2 ansible_host=192.168.10.22 ip=192.168.10.22<br><br>[kube_control_plane]<br>k8s-ctr1<br>k8s-ctr2<br>k8s-ctr3<br>[etcd]<br>k8s-ctr1<br>k8s-ctr2<br>k8s-ctr3<br>[kube_node]<br>k8s-w1<br>k8s-w2<br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>Notice that all three control plane nodes are also in the etcd group. This is a common pattern called stacked etcd, where etcd runs on the same nodes as the control plane. I will discuss the alternative, external etcd, later in this chapter.</p><p>With this inventory, Kubespray will deploy everything needed for HA. But there is one critical piece missing from this picture: how do clients know which API server to talk to?</p><h4>The Load Balancing Problem</h4><p>When you have three API servers running on three different nodes, you need some way to distribute traffic among them. If a client always connects to k8s-ctr1:6443 and that node goes down, the client cannot reach the cluster even though k8s-ctr2 and k8s-ctr3 are perfectly healthy.</p><p>There are two approaches to solving this: external load balancers and client-side load balancing.</p><p>External load balancers sit in front of your API servers and distribute incoming connections. 
In AWS, you would use an ELB or NLB. In GCP, you would use a GCP Load Balancer. On-premises, you might use HAProxy with keepalived and a virtual IP. The external load balancer provides a single endpoint that clients connect to, and it forwards requests to healthy API servers.</p><p>Kubespray does not automatically configure external load balancers. This is a deliberate design decision, not an oversight. The reason is simple: external load balancers are infrastructure-level components that vary dramatically between environments. An AWS NLB requires AWS API calls to provision. A GCP Load Balancer requires GCP API calls. An on-premises HAProxy setup requires virtual IP configuration, which involves network infrastructure that Kubespray cannot possibly know about.</p><p>Kubespray is a tool for configuring software on top of your operating system. It is not an infrastructure provisioning tool. That is what Terraform is for. The expected workflow in production is to use Terraform to provision your VMs and external load balancer, then use Kubespray to deploy Kubernetes on those VMs.</p><p>However, Kubespray does provide one option for external load balancing that works without infrastructure provisioning: kube-vip. This is a software-based solution that creates a virtual IP address using either ARP (for layer 2 networks) or BGP (for layer 3 networks). If you enable kube-vip in Kubespray, it will deploy as a static pod on your control plane nodes and manage a floating VIP that always points to a healthy API server.</p><p>To enable kube-vip, you set these variables in your group_vars:</p><pre>kube_vip_enabled: true<br>kube_vip_arp_enabled: true<br>kube_vip_address: 192.168.10.100</pre><p>The kube_vip_address should be an unused IP in your network that will become the virtual IP for your API server endpoint.</p><h4>Client-Side Load Balancing</h4><p>For internal cluster communication, Kubespray uses a different approach: client-side load balancing. 
Instead of routing traffic through an external load balancer, each worker node runs its own local load balancer that distributes requests across all API servers.</p><p>Here is how it works. On each worker node, Kubespray deploys nginx as a lightweight TCP load balancer. This nginx listens on localhost:6443. The kubelet on that worker node is configured to connect to <a href="https://localhost:6443">https://localhost:6443</a> as its API server endpoint. When kubelet makes a request, nginx receives it and forwards it to one of the actual API servers.</p><p>This might seem like an unnecessary extra hop, but it has significant advantages. First, there is no single point of failure in the load balancing layer. If the nginx on one worker node crashes, only that worker node is affected. Other worker nodes continue operating normally. Second, you do not need to provision or maintain any external infrastructure. Third, the latency added by the local nginx proxy is negligible since it is all localhost communication.</p><p>The nginx configuration that Kubespray generates looks something like this:</p><pre>stream {<br>    upstream kube_apiserver {<br>        least_conn;<br>        server 192.168.10.11:6443 max_fails=3 fail_timeout=30s;<br>        server 192.168.10.12:6443 max_fails=3 fail_timeout=30s;<br>        server 192.168.10.13:6443 max_fails=3 fail_timeout=30s;<br>    }<br><br>    server {<br>        listen 127.0.0.1:6443;<br>        proxy_pass kube_apiserver;<br>        proxy_timeout 10m;<br>        proxy_connect_timeout 1s;<br>    }<br>}</pre><p>The upstream block lists all your API servers. The least_conn directive tells nginx to send new connections to the server with the fewest active connections, which provides good load distribution. The max_fails and fail_timeout parameters control health checking. 
If an API server fails three times within 30 seconds, nginx temporarily removes it from the pool.</p><p>The server block listens only on 127.0.0.1, meaning only processes on the local machine can connect. This is intentional. The nginx proxy is not meant to be accessed from other machines.</p><p>The kubelet configuration on worker nodes points to this local proxy:</p><pre>apiVersion: v1<br>kind: Config<br>clusters:<br>- cluster:<br>    certificate-authority-data: &lt;base64-encoded-ca&gt;<br>    server: https://localhost:6443<br>  name: cluster.local</pre><p>When you look at this kubeconfig, you might wonder how the kubelet can connect to localhost:6443 and reach the API server. The answer is nginx. Every request to localhost:6443 goes through nginx, which forwards it to a real API server.</p><p>Kubespray calls this approach “localhost load balancing” or “nginx proxy.” You can choose between nginx and haproxy as the local proxy by setting the loadbalancer_apiserver_type variable:</p><pre>loadbalancer_apiserver_type: nginx</pre><p>or</p><pre>loadbalancer_apiserver_type: haproxy</pre><p>Both work well. nginx is the default and is lighter weight. haproxy provides more sophisticated load balancing options if you need them.</p><p>One important detail: control plane nodes do not need this local proxy. They can connect directly to their local API server at localhost:6443 because the API server is running on the same machine. Kubespray only deploys the nginx proxy on nodes that are in kube_node but not in kube_control_plane.</p><p>If you have nodes that are both control plane and worker nodes (which is common in smaller clusters), Kubespray handles this correctly. The node uses its local API server directly rather than going through nginx.</p><h4>etcd High Availability</h4><p>The API server is stateless, but etcd is not. etcd stores all cluster state: every pod definition, every service, every secret, every configmap. If you lose etcd data, you lose your cluster. 
This makes etcd HA critically important.</p><p>etcd uses the Raft consensus algorithm to replicate data across multiple nodes. I will not go into the full details of Raft here, but the key concept you need to understand is quorum. A quorum is the minimum number of nodes that must agree for a write to be committed. In Raft, the quorum is a majority: more than half of the total nodes.</p><p>For a 3-node etcd cluster, the quorum is 2. This means writes require at least 2 nodes to acknowledge them before they are considered committed. If one node is down, the remaining 2 nodes can still form a quorum and the cluster continues operating. But if 2 nodes are down, the single remaining node cannot form a quorum, and the cluster stops accepting writes.</p><p>For a 5-node etcd cluster, the quorum is 3. The cluster can tolerate 2 node failures.</p><p>This is why etcd clusters should always have an odd number of nodes. Consider what happens with 4 nodes. The quorum is 3 (more than half of 4). The cluster can tolerate only 1 failure, same as a 3-node cluster. But you have to maintain an extra node. You get no additional fault tolerance for the extra operational burden.</p><p>Here is the fault tolerance for different cluster sizes:</p><pre>Nodes    Quorum    Tolerated Failures<br>1        1         0<br>3        2         1<br>5        3         2<br>7        4         3</pre><p>Notice that going from 1 to 3 nodes gives you the ability to survive 1 failure. Going from 3 to 5 nodes lets you survive 2 failures. In practice, 3 nodes is sufficient for most production environments. You use 5 nodes when you need to perform rolling maintenance on etcd nodes while still tolerating a failure.</p><p>Never run 2 or 4 nodes. With 2 nodes, quorum is 2, so a single node failure makes the cluster unavailable. 
You would be better off with a single node at that point, since the operational complexity of 2 nodes buys you nothing.</p><p>In the inventory example I showed earlier, all three control plane nodes were also etcd nodes:</p><pre>[etcd]<br>k8s-ctr1<br>k8s-ctr2<br>k8s-ctr3</pre><p>This is stacked etcd, where etcd runs on the same machines as the control plane. The alternative is external etcd, where etcd runs on dedicated machines separate from the control plane.</p><p>Stacked etcd is simpler to set up and requires fewer machines. For a minimum HA setup with stacked etcd, you need 3 nodes (which are both control plane and etcd). With external etcd, you would need 3 control plane nodes plus 3 etcd nodes, totaling 6 machines.</p><p>External etcd has advantages in larger environments. etcd can be resource-intensive under heavy load, and isolating it on dedicated machines ensures that control plane components do not compete with etcd for CPU and memory. External etcd also allows you to scale and maintain the etcd cluster independently from the control plane.</p><p>Kubespray supports both configurations. For stacked etcd, you put the same nodes in both kube_control_plane and etcd groups. For external etcd, you put different nodes in each group:</p><pre>[kube_control_plane]<br>k8s-ctr1<br>k8s-ctr2<br>k8s-ctr3<br><br>[etcd]<br>k8s-etcd1<br>k8s-etcd2<br>k8s-etcd3</pre><p>With this configuration, Kubespray will deploy the control plane on k8s-ctr1 through k8s-ctr3, and etcd on k8s-etcd1 through k8s-etcd3. The API servers on the control plane nodes will be configured to connect to the external etcd cluster.</p><h4>etcd Deployment Types in Kubespray</h4><p>Kubespray supports two ways to run etcd, controlled by the etcd_deployment_type variable.</p><p>When etcd_deployment_type is set to “host”, etcd runs as a systemd service directly on the host. Kubespray downloads the etcd binary, generates all necessary certificates, creates the systemd unit file, and starts the service. 
This is the default in Kubespray and what I have been describing throughout this chapter.</p><p>The systemd unit file looks like this:</p><pre>[Unit]<br>Description=etcd<br>Documentation=https://github.com/etcd-io/etcd<br>After=network.target<br><br>[Service]<br>User=etcd<br>Type=notify<br>EnvironmentFile=/etc/etcd.env<br>ExecStart=/usr/local/bin/etcd<br>Restart=always<br>RestartSec=10s<br>LimitNOFILE=65536<br><br>[Install]<br>WantedBy=multi-user.target</pre><p>The environment file /etc/etcd.env contains all the etcd configuration:</p><pre>ETCD_NAME=k8s-ctr1<br>ETCD_DATA_DIR=/var/lib/etcd<br>ETCD_LISTEN_PEER_URLS=https://192.168.10.11:2380<br>ETCD_LISTEN_CLIENT_URLS=https://192.168.10.11:2379,https://127.0.0.1:2379<br>ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.10.11:2380<br>ETCD_ADVERTISE_CLIENT_URLS=https://192.168.10.11:2379<br>ETCD_INITIAL_CLUSTER=k8s-ctr1=https://192.168.10.11:2380,k8s-ctr2=https://192.168.10.12:2380,k8s-ctr3=https://192.168.10.13:2380<br>ETCD_INITIAL_CLUSTER_STATE=new<br>ETCD_INITIAL_CLUSTER_TOKEN=k8s-etcd-cluster<br>ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1.pem<br>ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem<br>ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem<br>ETCD_CLIENT_CERT_AUTH=true<br>ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1.pem<br>ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem<br>ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem<br>ETCD_PEER_CLIENT_CERT_AUTH=true</pre><p>When etcd_deployment_type is set to “kubeadm”, etcd runs as a static pod managed by kubeadm. This is how kubeadm normally deploys etcd when you run kubeadm init. The etcd manifest goes in /etc/kubernetes/manifests/etcd.yaml, and kubelet manages the pod.</p><p>The host deployment type gives you more control over etcd lifecycle. You can start, stop, and manage etcd independently of Kubernetes. 
The kubeadm deployment type integrates etcd more tightly with the Kubernetes control plane, which can simplify some operations but makes etcd management dependent on kubelet.</p><p>Most Kubespray users stick with the default host deployment type.</p><h4>etcd Certificates</h4><p>etcd uses TLS for all communication, both between etcd nodes (peer communication) and between clients and etcd (client communication). Kubespray generates a complete certificate hierarchy for etcd.</p><p>The CA certificate and key are stored at:</p><pre>/etc/ssl/etcd/ssl/ca.pem<br>/etc/ssl/etcd/ssl/ca-key.pem</pre><p>For each etcd node, Kubespray generates member certificates used for peer communication:</p><pre>/etc/ssl/etcd/ssl/member-k8s-ctr1.pem<br>/etc/ssl/etcd/ssl/member-k8s-ctr1-key.pem</pre><p>For administrative access to etcd (like running etcdctl commands), Kubespray generates admin certificates:</p><pre>/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem<br>/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem</pre><p>For the Kubernetes API server to connect to etcd, Kubespray generates node certificates:</p><pre>/etc/ssl/etcd/ssl/node-k8s-ctr1.pem<br>/etc/ssl/etcd/ssl/node-k8s-ctr1-key.pem</pre><p>When you need to run etcdctl commands, you must provide the appropriate certificates:</p><pre>ETCDCTL_API=3 etcdctl \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \<br>  --endpoints=https://192.168.10.11:2379 \<br>  endpoint health</pre><p>If you forget the certificates, etcdctl will fail to connect because etcd requires mutual TLS authentication for all connections.</p><p>To check the health of your entire etcd cluster:</p><pre>ETCDCTL_API=3 etcdctl \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \<br>  --endpoints=https://192.168.10.11:2379,https://192.168.10.12:2379,https://192.168.10.13:2379 \<br>  endpoint 
health</pre><p>This will show the health status of each endpoint:</p><pre>https://192.168.10.11:2379 is healthy: successfully committed proposal: took = 2.5ms<br>https://192.168.10.12:2379 is healthy: successfully committed proposal: took = 2.7ms<br>https://192.168.10.13:2379 is healthy: successfully committed proposal: took = 2.4ms</pre><p>To see which node is the current etcd leader:</p><pre>ETCDCTL_API=3 etcdctl \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \<br>  --endpoints=https://192.168.10.11:2379,https://192.168.10.12:2379,https://192.168.10.13:2379 \<br>  endpoint status --write-out=table</pre><p>The output includes a column showing whether each node is the leader.</p><h4>Putting It All Together</h4><p>Let me walk through what happens when you deploy an HA cluster with Kubespray.</p><p>First, Kubespray reads your inventory and identifies which nodes belong to which groups. It sees three nodes in kube_control_plane, three nodes in etcd (the same three), and two nodes in kube_node.</p><p>Kubespray begins by bootstrapping all nodes: installing Python, configuring the OS, loading kernel modules, setting kernel parameters, and disabling swap.</p><p>Next, Kubespray installs the container runtime (containerd) on all nodes. This is needed because the Kubernetes components run in containers (as static pods) on control plane nodes.</p><p>Then Kubespray sets up the etcd cluster. It generates the CA certificate, generates member certificates for each etcd node, creates the systemd unit files, and starts etcd on all three nodes. The etcd nodes discover each other using the ETCD_INITIAL_CLUSTER configuration and form a cluster.</p><p>With etcd running, Kubespray initializes the first control plane node using kubeadm init. 
The kubeadm configuration specifies the external etcd endpoints:</p><pre>etcd:<br>  external:<br>    endpoints:<br>      - https://192.168.10.11:2379<br>      - https://192.168.10.12:2379<br>      - https://192.168.10.13:2379<br>    caFile: /etc/ssl/etcd/ssl/ca.pem<br>    certFile: /etc/ssl/etcd/ssl/node-k8s-ctr1.pem<br>    keyFile: /etc/ssl/etcd/ssl/node-k8s-ctr1-key.pem</pre><p>This tells kubeadm not to deploy its own etcd, but instead connect to the external etcd cluster that Kubespray already set up.</p><p>After the first control plane node is ready, Kubespray joins the other two control plane nodes using kubeadm join with the --control-plane flag. These nodes receive copies of the certificates and become additional control plane members.</p><p>For the worker nodes, Kubespray first deploys the nginx load balancer. It generates the nginx configuration with all three API server addresses in the upstream block, starts nginx, and verifies it is working. Then it runs kubeadm join on the worker nodes with their kubeconfig pointing to localhost:6443.</p><p>Finally, Kubespray deploys the CNI plugin, CoreDNS, and any other addons you have enabled.</p><p>The result is a fully functional HA cluster. You can lose any one control plane node and the cluster continues operating. You can lose any one etcd node and data remains safe. Worker nodes automatically fail over to healthy API servers through their local nginx proxies.</p><h4>Testing Your HA Setup</h4><p>After deploying an HA cluster, you should test that failover actually works. 
Here is how to do it.</p><p>First, verify the cluster is healthy:</p><pre>kubectl get nodes<br>kubectl get pods -A</pre><p>All nodes should be Ready and all system pods should be Running.</p><p>Check that all API servers are responding:</p><pre>kubectl get --raw=&#39;/healthz&#39; --server=https://192.168.10.11:6443 --insecure-skip-tls-verify<br>kubectl get --raw=&#39;/healthz&#39; --server=https://192.168.10.12:6443 --insecure-skip-tls-verify<br>kubectl get --raw=&#39;/healthz&#39; --server=https://192.168.10.13:6443 --insecure-skip-tls-verify</pre><p>Each should return “ok”.</p><p>Now simulate a control plane failure. SSH into one of your control plane nodes and stop the kubelet:</p><pre>systemctl stop kubelet</pre><p>This takes the kubelet on that node offline, so the node stops reporting to the cluster. Note that stopping the kubelet does not by itself stop containers that are already running, so to take the API server, controller manager, and scheduler on that node down as well, stop their containers with crictl or power the node off.</p><p>From another machine, verify the cluster is still operational:</p><pre>kubectl get nodes<br>kubectl run test --image=nginx --restart=Never<br>kubectl get pods</pre><p>The node you stopped will show NotReady, but you can still create pods. They will be scheduled on the remaining nodes.</p><p>Check whether leader election moved to another node:</p><pre>kubectl get lease -n kube-system kube-controller-manager -o yaml<br>kubectl get lease -n kube-system kube-scheduler -o yaml</pre><p>If the stopped node was the leader, you will see a different holderIdentity now.</p><p>Start the kubelet again:</p><pre>systemctl start kubelet</pre><p>The node will rejoin the cluster and become Ready again.</p><p>For etcd, the test is similar but more dangerous. Do not do this on a production cluster without proper backups. 
Stop etcd on one node:</p><pre>systemctl stop etcd</pre><p>Verify the etcd cluster is still healthy from another node:</p><pre>ETCDCTL_API=3 etcdctl \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr2.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr2-key.pem \<br>  --endpoints=https://192.168.10.12:2379,https://192.168.10.13:2379 \<br>  endpoint health</pre><p>The two remaining endpoints should be healthy. Create a configmap to verify writes work:</p><pre>kubectl create configmap test-ha --from-literal=key=value<br>kubectl get configmap test-ha -o yaml</pre><p>Start etcd again:</p><pre>systemctl start etcd</pre><p>The node rejoins the cluster and syncs data from the other members.</p><p>These tests give you confidence that your HA setup actually works. Run them during initial deployment and periodically afterward to catch any configuration drift.</p><h3>Chapter 13: Certificate Management and Auto-Renewal</h3><p>If you have ever run a Kubernetes cluster in production for more than a few months, you have probably experienced that sinking feeling when you realize your cluster certificates are about to expire. Or worse, you have already experienced the chaos that ensues when they do expire and suddenly nothing works. The API server refuses connections, kubelet cannot authenticate, and your perfectly healthy cluster becomes completely unresponsive.</p><p>Kubernetes relies heavily on TLS certificates for secure communication between components. The API server needs certificates. The kubelet needs certificates. The controller-manager, scheduler, etcd — they all need certificates. And by default, kubeadm generates certificates with a one-year validity period. That sounds like plenty of time until you realize that a year passes quickly, and nobody remembered to put certificate renewal on the calendar.</p><p>Kubespray addresses this problem elegantly by providing automatic certificate renewal out of the box. 
In this chapter, we will explore how Kubernetes certificates work, how Kubespray configures automatic renewal, and how you can manually manage certificates when needed.</p><h4>Understanding Kubernetes Certificates</h4><p>Before diving into renewal, let us understand what certificates exist in a Kubernetes cluster. When you run kubeadm init, it generates a whole tree of certificates under /etc/kubernetes/pki. Each certificate serves a specific purpose in the authentication and encryption chain.</p><p>The certificate authority sits at the root. Everything else chains back to it.</p><pre>/etc/kubernetes/pki/<br>├── ca.crt                    # Kubernetes CA certificate<br>├── ca.key                    # Kubernetes CA private key<br>├── apiserver.crt             # API server serving certificate<br>├── apiserver.key<br>├── apiserver-kubelet-client.crt    # API server → kubelet client cert<br>├── apiserver-kubelet-client.key<br>├── apiserver-etcd-client.crt       # API server → etcd client cert<br>├── apiserver-etcd-client.key<br>├── front-proxy-ca.crt        # Front proxy CA (for API aggregation)<br>├── front-proxy-ca.key<br>├── front-proxy-client.crt    # Front proxy client certificate<br>├── front-proxy-client.key<br>├── etcd/<br>│   ├── ca.crt                # etcd CA certificate<br>│   ├── ca.key<br>│   ├── server.crt            # etcd server certificate<br>│   ├── server.key<br>│   ├── peer.crt              # etcd peer communication<br>│   ├── peer.key<br>│   ├── healthcheck-client.crt<br>│   └── healthcheck-client.key<br>├── sa.pub                    # Service account public key<br>└── sa.key                    # Service account private key</pre><p>The CA certificates (ca.crt, etcd/ca.crt, front-proxy-ca.crt) have a default validity of 10 years. You generally do not need to worry about these expiring anytime soon. 
However, all the other certificates — the ones actually used for day-to-day communication — expire after just one year.</p><p>When Kubespray deploys a cluster with etcd_deployment_type set to host (which runs etcd as a systemd service rather than a static pod), etcd gets its own separate certificate infrastructure under /etc/ssl/etcd/ssl:</p><pre>/etc/ssl/etcd/ssl/<br>├── ca.pem                    # etcd CA<br>├── ca-key.pem<br>├── member-k8s-ctr.pem        # etcd member certificate<br>├── member-k8s-ctr-key.pem<br>├── admin-k8s-ctr.pem         # etcd admin certificate (for etcdctl)<br>├── admin-k8s-ctr-key.pem<br>├── node-k8s-ctr.pem          # client cert for API server → etcd<br>└── node-k8s-ctr-key.pem</pre><p>This separation exists because Kubespray manages etcd independently from kubeadm when using host deployment. The etcd certificates are generated by Kubespray’s etcd role, not by kubeadm.</p><h4>Checking Certificate Expiration</h4><p>Before we talk about renewal, you should know how to check when your certificates expire. 
The kubeadm tool provides a convenient command for this:</p><pre>kubeadm certs check-expiration</pre><p>The output looks something like this:</p><pre>CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED<br>admin.conf                 Jan 28, 2027 12:34 UTC   364d            ca                      no<br>apiserver                  Jan 28, 2027 12:34 UTC   364d            ca                      no<br>apiserver-etcd-client      Jan 28, 2027 12:34 UTC   364d            etcd-ca                 no<br>apiserver-kubelet-client   Jan 28, 2027 12:34 UTC   364d            ca                      no<br>controller-manager.conf    Jan 28, 2027 12:34 UTC   364d            ca                      no<br>front-proxy-client         Jan 28, 2027 12:34 UTC   364d            front-proxy-ca          no<br>scheduler.conf             Jan 28, 2027 12:34 UTC   364d            ca                      no<br>CERTIFICATE AUTHORITY      EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED<br>ca                         Jan 25, 2036 12:34 UTC   9y              no<br>etcd-ca                    Jan 25, 2036 12:34 UTC   9y              no<br>front-proxy-ca             Jan 25, 2036 12:34 UTC   9y              no</pre><p>Notice the RESIDUAL TIME column. When that number starts getting small (say, less than 30 days), you should be concerned. 
When it hits zero, you have a problem.</p><p>If you want to check a specific certificate file directly, openssl works well:</p><pre>openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates</pre><p>This outputs:</p><pre>notBefore=Jan 28 12:34:00 2026 GMT<br>notAfter=Jan 28 12:34:00 2027 GMT</pre><p>For the etcd certificates managed by Kubespray:</p><pre>openssl x509 -in /etc/ssl/etcd/ssl/member-k8s-ctr.pem -noout -dates</pre><p>You can also check the certificate’s subject and what names it is valid for:</p><pre>openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 &quot;Subject:&quot;<br>openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 &quot;Subject Alternative Name&quot;</pre><p>The Subject Alternative Name (SAN) field is particularly important for the API server certificate. It must include all the names and IPs that clients use to connect to the API server. If you add a new control plane node or change the API server’s external DNS name, you might need to regenerate this certificate with the new SANs.</p><h4>Enabling Automatic Renewal in Kubespray</h4><p>Kubespray makes certificate renewal straightforward. In your inventory’s group_vars, you enable automatic renewal with a single variable:</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>auto_renew_certificates: true</pre><p>That is it. When this variable is set to true, Kubespray configures a systemd timer on each control plane node that automatically renews certificates on a monthly schedule.</p><p>But what exactly does “monthly” mean? 
Kubespray allows you to customize the schedule using systemd’s calendar syntax:</p><pre># inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml<br>auto_renew_certificates: true<br>auto_renew_certificates_systemd_calendar: &quot;Mon *-*-1,2,3,4,5,6,7 03:00:00&quot;</pre><p>This particular schedule means “every Monday that falls on the 1st through 7th of any month, at 3:00 AM.” In other words, the first Monday of every month at 3 AM. This is a sensible default: it runs monthly, during off-peak hours, on a predictable day.</p><p>You can adjust this to whatever schedule fits your operational needs. Maybe you want it to run weekly:</p><pre>auto_renew_certificates_systemd_calendar: &quot;Mon *-*-* 03:00:00&quot;</pre><p>Or perhaps you want more control and prefer to run it only on specific months:</p><pre>auto_renew_certificates_systemd_calendar: &quot;*-1,4,7,10-01 03:00:00&quot;</pre><p>The systemd calendar syntax is flexible but has a learning curve. You can test your calendar expressions with the systemd-analyze tool:</p><pre>systemd-analyze calendar &quot;Mon *-*-1,2,3,4,5,6,7 03:00:00&quot;</pre><p>This prints the next time the expression would trigger, helping you verify your schedule is correct.</p><h4>How the Auto-Renewal System Works</h4><p>When Kubespray runs with auto_renew_certificates enabled, it creates two systemd units on each control plane node: a timer and a service.</p><p>The timer unit lives at /etc/systemd/system/k8s-certs-renew.timer:</p><pre>[Unit]<br>Description=Renew K8S control plane certificates<br>[Timer]<br>OnCalendar=Mon *-*-1,2,3,4,5,6,7 03:00:00<br>Persistent=true<br>[Install]<br>WantedBy=timers.target</pre><p>The Persistent=true setting is important. It means that if the system was powered off when the timer should have fired, systemd will run the service immediately upon next boot. 
This prevents missed renewals due to maintenance windows or unexpected downtime.</p><p>The service unit at /etc/systemd/system/k8s-certs-renew.service defines what actually happens when the timer fires:</p><pre>[Unit]<br>Description=Renew K8S control plane certificates<br>After=kubelet.service<br><br>[Service]<br>Type=oneshot<br>ExecStart=/usr/local/bin/k8s-certs-renew.sh</pre><p>The service is of type oneshot, meaning it runs once and exits rather than staying resident. The After=kubelet.service line orders this unit after the kubelet, so renewal does not start ahead of it.</p><p>The actual renewal logic lives in the shell script at /usr/local/bin/k8s-certs-renew.sh:</p><pre>#!/bin/bash<br>set -e<br><br># Renew all certificates<br>/usr/local/bin/kubeadm certs renew all<br><br># Touch the static pod manifests to trigger kubelet to restart them<br># kubelet watches these files and restarts pods when they change<br><br>find /etc/kubernetes/manifests/ -type f -exec touch {} \;</pre><p>The script does three things. First, it calls kubeadm certs renew all to generate new certificates. Second, it touches the static pod manifest files. This is a clever trick: the kubelet watches these manifest files for changes, and when it detects a modification (even just a timestamp change), it restarts the corresponding static pods. This ensures the API server, controller-manager, and scheduler pick up the new certificates without requiring a full node reboot. 
Third, it restarts the kubelet service itself to ensure it loads any renewed client certificates (the excerpt above does not show that step).</p><p>You can verify the timer is active and check when it last ran:</p><pre>systemctl list-timers --all | grep k8s-certs-renew</pre><p>The output shows the next scheduled run and the last time it triggered:</p><pre>NEXT                        LEFT          LAST                        PASSED       UNIT                    ACTIVATES<br>Mon 2026-02-02 03:00:00 KST 4 days left   Mon 2026-01-05 03:00:00 KST 3 weeks ago  k8s-certs-renew.timer   k8s-certs-renew.service</pre><p>To see detailed status of the timer:</p><pre>systemctl status k8s-certs-renew.timer</pre><p>And to check the logs from the last renewal:</p><pre>journalctl -u k8s-certs-renew.service</pre><h4>Manual Certificate Renewal</h4><p>Sometimes you need to renew certificates manually. Perhaps you are approaching expiration and do not want to wait for the next scheduled run. Or maybe you need to regenerate certificates after adding new SANs.</p><p>The kubeadm certs renew command handles this. 
To renew all certificates at once:</p><pre>kubeadm certs renew all</pre><p>The output confirms each certificate was renewed:</p><pre>[renew] Reading configuration from the cluster...<br>[renew] FYI: You can look at this config file with &#39;kubectl -n kube-system get cm kubeadm-config -o yaml&#39;<br>certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed<br>certificate for serving the Kubernetes API renewed<br>certificate the apiserver uses to access etcd renewed<br>certificate for the API server to connect to kubelet renewed<br>certificate embedded in the kubeconfig file for the controller manager to use renewed<br>certificate for the front proxy client renewed<br>certificate embedded in the kubeconfig file for the scheduler to use renewed</pre><p>If you only want to renew specific certificates, kubeadm supports that too:</p><pre>kubeadm certs renew apiserver<br>kubeadm certs renew apiserver-kubelet-client<br>kubeadm certs renew apiserver-etcd-client<br>kubeadm certs renew front-proxy-client<br>kubeadm certs renew admin.conf<br>kubeadm certs renew controller-manager.conf<br>kubeadm certs renew scheduler.conf</pre><p>After manual renewal, you must restart the control plane components to pick up the new certificates. 
The simplest approach is to restart the kubelet, which will restart all static pods:</p><pre>systemctl restart kubelet</pre><p>Alternatively, you can use the touch trick to trigger individual pod restarts:</p><pre>touch /etc/kubernetes/manifests/kube-apiserver.yaml<br>touch /etc/kubernetes/manifests/kube-controller-manager.yaml<br>touch /etc/kubernetes/manifests/kube-scheduler.yaml</pre><p>Give it a minute, then verify the components are back up:</p><pre>kubectl get pods -n kube-system</pre><p>You should see the control plane pods with recent start times. If the start times have not changed, the pods were not restarted; in that case, temporarily move the manifest out of /etc/kubernetes/manifests, wait for the pod to stop, and move it back.</p><h4>Renewing etcd Certificates</h4><p>When Kubespray deploys etcd using etcd_deployment_type: host, the etcd certificates are separate from the Kubernetes PKI and are not managed by kubeadm. Kubespray generates these certificates using its own etcd role.</p><p>To renew etcd certificates, you would typically re-run the Kubespray etcd role. However, for manual renewal, you need to understand the certificate structure.</p><p>The etcd certificates under /etc/ssl/etcd/ssl serve different purposes:</p><p>The member certificate (member-hostname.pem) is used by the etcd server for both serving clients and communicating with peer etcd nodes. The admin certificate (admin-hostname.pem) is used by etcdctl for administrative operations. 
The node certificate (node-hostname.pem) is used by the Kubernetes API server to authenticate when connecting to etcd.</p><p>To check etcd certificate expiration:</p><pre>openssl x509 -in /etc/ssl/etcd/ssl/member-k8s-ctr.pem -noout -dates<br>openssl x509 -in /etc/ssl/etcd/ssl/admin-k8s-ctr.pem -noout -dates<br>openssl x509 -in /etc/ssl/etcd/ssl/node-k8s-ctr.pem -noout -dates</pre><p>If you need to regenerate etcd certificates, the safest approach is to re-run Kubespray’s etcd role:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --tags etcd</pre><p>This regenerates the certificates and restarts etcd as needed.</p><p>For testing etcd connectivity with the admin certificate:</p><pre>ETCDCTL_API=3 etcdctl \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \<br>  --endpoints=https://192.168.10.10:2379 \<br>  endpoint health</pre><p>A healthy response looks like:</p><pre>https://192.168.10.10:2379 is healthy: successfully committed proposal: took = 12.345ms</pre><h4>Certificate Renewal and Kubeconfig Files</h4><p>One thing that catches people off guard is that kubeadm certs renew all also regenerates the kubeconfig files for admin, controller-manager, and scheduler. These kubeconfig files contain embedded client certificates.</p><p>After renewal, if you have copied the admin.conf to your local machine (as ~/.kube/config), your local copy is now outdated. The embedded certificate in your local copy is the old one, which will eventually expire.</p><p>You need to copy the renewed admin.conf again:</p><pre># On the control plane node<br>cat /etc/kubernetes/admin.conf<br><br># Or copy it to your local machine<br>scp root@192.168.10.10:/etc/kubernetes/admin.conf ~/.kube/config</pre><p>Remember to update the server address if needed. 
The default admin.conf generated by kubeadm points to 127.0.0.1:6443, which only works from the control plane node itself:</p><pre>sed -i &#39;s/127.0.0.1/192.168.10.10/g&#39; ~/.kube/config</pre><h4>Kubelet Certificate Rotation</h4><p>The kubelet has its own certificate rotation mechanism that works independently from the control plane certificates. When you look at the kubelet configuration, you will see:</p><pre># /etc/kubernetes/kubelet-config.yaml<br>rotateCertificates: true<br>serverTLSBootstrap: true</pre><p>With rotateCertificates: true, the kubelet automatically requests new client certificates from the API server before the current ones expire. This happens transparently without any manual intervention.</p><p>The kubelet’s certificates are stored in /var/lib/kubelet/pki:</p><pre>/var/lib/kubelet/pki/<br>├── kubelet.crt          # kubelet serving certificate<br>├── kubelet.key<br>├── kubelet-client-current.pem    # current client certificate (symlink)<br>├── kubelet-client-2026-01-28-12-34-56.pem    # actual client cert file<br>└── ...</pre><p>The kubelet-client-current.pem is a symlink that points to the currently active client certificate. When rotation occurs, a new certificate file is created and the symlink is updated. Old certificate files are kept around for a while before being cleaned up.</p><p>You can check the kubelet’s client certificate expiration:</p><pre>openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates</pre><p>If serverTLSBootstrap is enabled, the kubelet’s serving certificate (used when the API server connects to the kubelet for logs, exec, port-forward, etc.) is also managed automatically. However, these certificates require manual approval by default. 
You can check pending CSRs:</p><pre>kubectl get csr</pre><p>If you see pending CSRs for kubelet serving certificates, approve them:</p><pre>kubectl certificate approve &lt;csr-name&gt;</pre><p>In a Kubespray-deployed cluster, this is typically handled automatically or approved in bulk during initial cluster setup.</p><h4>Troubleshooting Certificate Issues</h4><p>When certificates expire or become invalid, the symptoms are often confusing. The API server might refuse connections. kubectl commands fail with TLS errors. Pods cannot be scheduled. Here are some common scenarios and how to diagnose them.</p><p>If kubectl fails with a certificate error:</p><pre>kubectl get nodes<br>Unable to connect to the server: x509: certificate has expired or is not yet valid</pre><p>First, check if the issue is your local kubeconfig or the server certificates:</p><pre># Check your local kubeconfig&#39;s embedded certificate<br>kubectl config view --raw -o jsonpath=&#39;{.users[0].user.client-certificate-data}&#39; | base64 -d | openssl x509 -noout -dates<br><br># SSH to the control plane and check server certificates<br>kubeadm certs check-expiration</pre><p>If the API server certificate has expired, you cannot use kubectl to check things. 
You need to SSH directly to the control plane node and work locally.</p><p>If etcd refuses connections:</p><pre>ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health<br>Error: context deadline exceeded</pre><p>Check etcd logs for certificate errors:</p><pre>journalctl -u etcd | grep -iE &quot;certificate|tls|x509&quot;</pre><p>If you see certificate expiration errors, you need to regenerate the etcd certificates and restart etcd.</p><p>If the API server cannot connect to etcd:</p><pre>kubectl get pods -n kube-system<br><br># or check API server logs<br>journalctl -u kubelet | grep apiserver</pre><p>Look for errors like:</p><pre>connection error: desc = &quot;transport: authentication handshake failed: x509: certificate has expired&quot;</pre><p>This indicates the API server’s client certificate for etcd (apiserver-etcd-client.crt) has expired. Renew it with:</p><pre>kubeadm certs renew apiserver-etcd-client</pre><p>Then restart the API server. It runs as a static pod, and touching the manifest does not change its contents, so the kubelet may not recreate the pod. Moving the manifest out of the directory, waiting for the pod to stop, and moving it back is more reliable:</p><pre>mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/<br>sleep 20<br>mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/</pre><h4>Best Practices for Certificate Management</h4><p>After going through all this, here are some practical recommendations for managing certificates in production.</p><p>First, always enable automatic renewal. There is no good reason not to have auto_renew_certificates: true in your Kubespray configuration. The overhead is negligible, and it prevents the most common certificate disaster.</p><p>Second, monitor certificate expiration. Even with auto-renewal, you should monitor expiration dates. Set up alerts that fire when any certificate has less than 30 days of validity remaining. This gives you time to investigate if auto-renewal is not working for some reason.</p><p>Third, test renewal before it matters. Do not wait until certificates are about to expire to find out if your renewal process works. Periodically run manual renewal in a test environment. Verify that services come back up correctly after renewal.</p><p>Fourth, keep your kubeconfig files updated. 
After any certificate renewal, remember to update the kubeconfig files on any machine that accesses the cluster remotely. This includes CI/CD systems, monitoring tools, and developer workstations.</p><p>Fifth, document your certificate architecture. Know which certificates exist, what they are for, and how they are renewed. When something goes wrong at 3 AM, you want this information readily available, not scattered across blog posts and Stack Overflow answers.</p><p>Finally, consider the CA certificate expiration. While the CA certificates are valid for 10 years by default, that day will eventually come. CA rotation is significantly more complex than regular certificate renewal because every certificate signed by the old CA must be regenerated. Plan for this well in advance — years in advance, ideally.</p><p>The peace of mind that comes from properly configured certificate management is worth every minute spent understanding how it works. When your cluster crosses the one-year mark and everything continues running smoothly because certificates renewed themselves automatically, you will appreciate having set this up correctly from the beginning.</p><h3>Chapter 14: Cluster Operations — Upgrades, Scaling, and Backup</h3><p>Running a Kubernetes cluster is not a one-time event. Once you have a production cluster humming along, you will inevitably need to upgrade it when new versions come out, add or remove nodes as your workload changes, and maintain backups in case something goes catastrophically wrong.</p><p>Kubespray provides playbooks for all of these operations, but as we discussed in the opening chapters, understanding what these playbooks actually do is crucial. Blindly running upgrade-cluster.yml without understanding the implications can leave you with a half-upgraded cluster in an inconsistent state.</p><p>Let me walk you through each of these operational tasks in detail.</p><h4>Cluster Upgrades</h4><p>Kubernetes moves fast. 
A new minor version drops roughly every four months, and security patches come even more frequently. If you want to stay current and benefit from bug fixes, performance improvements, and new features, you need to upgrade regularly. The alternative is running an increasingly outdated cluster that drifts further from community support.</p><p>Kubespray handles upgrades through the upgrade-cluster.yml playbook. The basic invocation looks like this:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.34.0</pre><p>That single command kicks off a carefully orchestrated sequence of operations across your entire cluster. But before you hit enter, you need to understand what is about to happen.</p><p>The upgrade proceeds in a specific order. First, Kubespray upgrades etcd if a new version is required. Then it moves to the control plane nodes, upgrading them one at a time. After the control plane is fully upgraded, it proceeds to the worker nodes, again one at a time. Finally, it upgrades the CNI plugin and any addons that need updating.</p><p>This sequenced approach is not arbitrary. Kubernetes has a version skew policy that dictates which component versions are supported together. The kube-apiserver must be the newest component. The kubelet on nodes can be up to three minor versions behind the apiserver (two for kubelets older than 1.25), but never newer. The controller-manager and scheduler must not be newer than the apiserver and may lag it by at most one minor version. If you violate these constraints, you are running an unsupported combination, and components may fail to communicate or misbehave in subtle ways.</p><p>The upgrade-cluster.yml playbook respects these constraints automatically, but you still need to follow the one-minor-version-at-a-time rule. If you are running Kubernetes 1.30 and want to get to 1.33, you cannot jump directly. You must upgrade to 1.31 first, verify everything works, then upgrade to 1.32, verify again, and finally upgrade to 1.33. 
Each hop requires a separate playbook execution:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.31.0</pre><p>Wait for completion, run your smoke tests, check that all pods are running, and verify that your applications are healthy. Then proceed to the next version:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini upgrade-cluster.yml -e kube_version=v1.32.0</pre><p>And so on. This is tedious but necessary. Skipping versions is asking for trouble.</p><p>Before any upgrade, you should verify the current state of your cluster. Check the existing versions:</p><pre>kubectl get nodes -o wide</pre><p>This shows you the kubelet version on each node. Check the control plane component versions:</p><pre>kubectl get pods -n kube-system \<br>  -o custom-columns=&quot;NAME:.metadata.name,IMAGE:.spec.containers[*].image&quot; \<br>  | grep -E &quot;apiserver|controller|scheduler&quot;</pre><p>Make sure everything is healthy before you start:</p><pre>kubectl get nodes<br>kubectl get pods -A | grep -v Running | grep -v Completed</pre><p>Any pods stuck in CrashLoopBackOff or Pending state should be investigated and fixed before upgrading. Upgrades will not magically fix existing problems; they are more likely to make them worse.</p><p>You should also take an etcd backup before upgrading. I will cover the backup procedure in detail later in this chapter, but the short version is:</p><pre>ETCDCTL_API=3 etcdctl snapshot save /backup/pre-upgrade-snapshot.db \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \<br>  --key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem \<br>  --endpoints=https://127.0.0.1:2379</pre><p>With your backup secured and your cluster verified healthy, you can proceed with the upgrade. Watch the output carefully. Ansible will show you each task as it executes, and the profile_tasks callback will tell you how long each step takes. If something fails, the playbook will stop. Do not panic. 
Read the error message carefully, fix the underlying issue, and re-run the playbook. Because Kubespray’s tasks are idempotent, steps that already completed will report no changes on the second run, so execution effectively resumes from where it left off.</p><p>After the upgrade completes, run your verification checks again:</p><pre>kubectl get nodes -o wide<br>kubectl get pods -A<br>kubectl get cs</pre><p>That last command checks component status, though note that it has been deprecated in newer Kubernetes versions and may not show useful information.</p><p>One more thing about upgrades: test them in a non-production environment first. Always. Set up a dev or staging cluster that mirrors your production configuration, run the upgrade there, exercise your applications, look for problems. Only after you have confidence that the upgrade is safe should you touch production. This advice seems obvious, but the number of teams that skip this step because they are in a hurry is distressingly high.</p><h4>Node Scaling</h4><p>Clusters are not static. Your workload grows, and you need more capacity. A node fails catastrophically, and you need to replace it. You are migrating to new hardware and need to gracefully transition. All of these scenarios involve adding or removing nodes from your cluster.</p><p>Adding nodes is straightforward with Kubespray. First, update your inventory file to include the new node:</p><pre>[all]<br>k8s-ctr ansible_host=192.168.10.10 ip=192.168.10.10<br>k8s-w1 ansible_host=192.168.10.21 ip=192.168.10.21<br>k8s-w2 ansible_host=192.168.10.22 ip=192.168.10.22<br># New node<br>k8s-w3 ansible_host=192.168.10.23 ip=192.168.10.23<br><br>[kube_control_plane]<br>k8s-ctr<br><br>[etcd]<br>k8s-ctr<br><br>[kube_node]<br>k8s-w1<br>k8s-w2<br># New node added to worker group<br>k8s-w3</pre><p>Make sure the new node meets all the prerequisites. It needs SSH access configured with your Ansible control node’s public key. It needs Python installed. 
The network needs to allow communication with the existing cluster nodes on the required ports.</p><p>Then run the scale playbook with the limit option to target only the new node:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini scale.yml \<br>  --limit=k8s-w3</pre><p>The --limit flag is important. Without it, the playbook would run against all nodes, which is unnecessary and time-consuming. With the limit, Ansible only processes the new node, installing the container runtime, kubelet, and kube-proxy, then joining it to the cluster.</p><p>After the playbook completes, verify that the new node has joined:</p><pre>kubectl get nodes</pre><p>You should see k8s-w3 in the Ready state. Pods will start scheduling on it automatically based on resource availability and any node selectors or affinities in your workloads.</p><p>Removing nodes requires more care. You cannot just yank a node out of the cluster. There might be pods running on it, including pods that are the only replicas of their deployment. If you remove the node without preparation, those workloads will be disrupted.</p><p>The correct procedure starts with draining the node. This tells Kubernetes to gracefully evict all pods from the node and mark it as unschedulable so no new pods land on it:</p><pre>kubectl drain k8s-w1 --ignore-daemonsets --delete-emptydir-data</pre><p>The --ignore-daemonsets flag is necessary because DaemonSet pods run on every node by design and cannot be evicted in the normal sense. The --delete-emptydir-data flag allows draining even if pods are using emptyDir volumes, acknowledging that data in those volumes will be lost.</p><p>Watch the drain operation:</p><pre>kubectl get pods -A -o wide | grep k8s-w1</pre><p>You should see pods terminating and being recreated on other nodes. 
Wait until no pods remain on the node except for DaemonSet pods.</p><p>Once the node is drained, you can remove it from the cluster using Kubespray:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini remove-node.yml \<br>  -e node=k8s-w1</pre><p>This playbook cleans up the node, removing Kubernetes components and resetting it to a pre-Kubernetes state. It also removes the node object from the Kubernetes API.</p><p>After removal, update your inventory file to remove the node entry. Keeping stale entries in your inventory will cause confusion and errors in future playbook runs.</p><p>If you are removing a control plane node or an etcd node, the process is more complex. You need to ensure you maintain quorum in etcd (never go below three nodes in a production etcd cluster), and you need to handle the control plane components carefully. Kubespray’s remove-node.yml handles most of this, but you should understand the implications. Removing a control plane node from a three-node HA setup leaves you with two nodes, which is no better than a single node for fault tolerance: a two-member etcd cluster still needs both members for quorum, so it cannot survive the loss of either one. Plan your control plane sizing carefully.</p><h4>Backup and Recovery</h4><p>etcd is the brain of your Kubernetes cluster. Every object you create, every configuration change you make, every secret you store ends up in etcd. If you lose etcd data without a backup, you lose your cluster. Full stop. You will have to rebuild everything from scratch.</p><p>This is not a theoretical concern. Disk failures happen. Corruption happens. Operator errors happen. Someone might accidentally delete critical data. Without backups, you are one bad day away from a very long night of rebuilding.</p><p>Creating an etcd backup is straightforward with etcdctl. 
The exact command depends on your etcd deployment, but for a Kubespray-deployed etcd using the host deployment type, it looks like this:</p><pre>ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \<br>  --endpoints=https://192.168.10.10:2379</pre><p>Let me break down these options. The ETCDCTL_API=3 environment variable ensures you are using the v3 API, which is required for modern etcd. The snapshot save command creates a point-in-time snapshot of the entire etcd database. The certificate options authenticate you to etcd, which requires TLS client authentication. The endpoints option specifies which etcd member to connect to.</p><p>After creating a snapshot, verify its integrity:</p><pre>ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db</pre><p>This shows you the hash, revision, total keys, and total size of the snapshot. If the command succeeds, your snapshot is valid.</p><p>You should automate backups. Running manual backups occasionally is not sufficient for production. Here is a simple script that takes daily backups and retains them for seven days:</p><pre>#!/bin/bash<br>BACKUP_DIR=&quot;/backup/etcd&quot;<br>DATE=$(date +%Y%m%d-%H%M%S)<br>SNAPSHOT=&quot;$BACKUP_DIR/etcd-snapshot-$DATE.db&quot;<br><br>mkdir -p &quot;$BACKUP_DIR&quot;<br><br>ETCDCTL_API=3 etcdctl snapshot save &quot;$SNAPSHOT&quot; \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \<br>  --endpoints=https://192.168.10.10:2379<br><br>if [ $? -eq 0 ]; then<br>  echo &quot;Backup successful: $SNAPSHOT&quot;<br>  find &quot;$BACKUP_DIR&quot; -name &quot;etcd-snapshot-*.db&quot; -mtime +7 -delete<br>  echo &quot;Old backups cleaned up&quot;<br>else<br>  echo &quot;Backup failed!&quot;<br>  exit 1<br>fi</pre><p>Save this as /usr/local/bin/etcd-backup.sh, make it executable with chmod +x, and set up a cron job or systemd timer to run it daily.</p><p>For a systemd timer approach, create two files. First, the service unit:</p><pre>[Unit]<br>Description=etcd backup service<br>After=etcd.service<br><br>[Service]<br>Type=oneshot<br>ExecStart=/usr/local/bin/etcd-backup.sh<br>User=root</pre><p>Save this as /etc/systemd/system/etcd-backup.service. Then create the timer unit:</p><pre>[Unit]<br>Description=Daily etcd backup<br><br>[Timer]<br>OnCalendar=daily<br>Persistent=true<br><br>[Install]<br>WantedBy=timers.target</pre><p>Save this as /etc/systemd/system/etcd-backup.timer. Enable and start the timer:</p><pre>systemctl daemon-reload<br>systemctl enable etcd-backup.timer<br>systemctl start etcd-backup.timer</pre><p>Verify the timer is active:</p><pre>systemctl list-timers | grep etcd</pre><p>Store your backups off-node. A backup sitting on the same disk as the data it protects is not really a backup. Copy snapshots to remote storage, whether that is an NFS share, an S3 bucket, or another server entirely. The specific mechanism depends on your infrastructure, but the principle is universal: backups must be stored separately from the primary data.</p><p>Now let us talk about recovery, which you hopefully will never need but absolutely must understand. Restoring from an etcd snapshot is a destructive operation. You are replacing the current etcd state with the state from the snapshot. Any changes made after the snapshot was taken will be lost.</p><p>The recovery procedure varies depending on whether you have a single etcd node or a multi-node cluster. 
For a single node, the process is:</p><p>First, stop etcd:</p><pre>systemctl stop etcd</pre><p>Back up the current data directory, just in case:</p><pre>mv /var/lib/etcd /var/lib/etcd.broken</pre><p>Restore the snapshot to a new data directory:</p><pre>ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \<br>  --data-dir=/var/lib/etcd</pre><p>Set the correct ownership on the restored data:</p><pre>chown -R etcd:etcd /var/lib/etcd</pre><p>Start etcd:</p><pre>systemctl start etcd</pre><p>Verify etcd is healthy:</p><pre>ETCDCTL_API=3 etcdctl endpoint health \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr-key.pem \<br>  --endpoints=https://192.168.10.10:2379</pre><p>Then check that Kubernetes can communicate with etcd:</p><pre>kubectl get nodes</pre><p>If the nodes appear and the cluster responds, your recovery was successful.</p><p>For multi-node etcd clusters, the procedure is more involved because you need to restore each member and reconfigure them to form a new cluster with the same data. The etcdctl snapshot restore command has additional flags for this:</p><pre>ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \<br>  --data-dir=/var/lib/etcd \<br>  --name=k8s-ctr1 \<br>  --initial-cluster=k8s-ctr1=https://192.168.10.11:2380,k8s-ctr2=https://192.168.10.12:2380,k8s-ctr3=https://192.168.10.13:2380 \<br>  --initial-cluster-token=etcd-cluster-1 \<br>  --initial-advertise-peer-urls=https://192.168.10.11:2380</pre><p>You run this command on each etcd node, adjusting the name and initial-advertise-peer-urls for each member. 
The initial-cluster-token should be different from the original cluster token to prevent the restored cluster from accidentally joining the old cluster if any members are still running.</p><p>After restoring all members, start etcd on each node and verify the cluster formed correctly:</p><pre>ETCDCTL_API=3 etcdctl member list \<br>  --cacert=/etc/ssl/etcd/ssl/ca.pem \<br>  --cert=/etc/ssl/etcd/ssl/admin-k8s-ctr1.pem \<br>  --key=/etc/ssl/etcd/ssl/admin-k8s-ctr1-key.pem \<br>  --endpoints=https://192.168.10.11:2379</pre><p>You should see all three members listed with their correct peer URLs.</p><p>Multi-node recovery is complex enough that I strongly recommend practicing it in a test environment before you ever need to do it in production. Set up a throwaway cluster, take a backup, intentionally break something, and practice the recovery. Do this until the procedure is second nature. When disaster strikes production at 3 AM, you do not want to be reading documentation for the first time.</p><p>One final note on backups: etcd snapshots capture the Kubernetes state, but they do not capture everything. Persistent volume data is not in etcd. Application data in databases running on your cluster is not in etcd. Those need their own backup strategies. etcd backup is necessary but not sufficient for complete disaster recovery. Plan accordingly.</p><h3>Chapter 15: Air-Gap (Offline) Deployment</h3><p>Not every Kubernetes cluster lives in an environment with internet access. Banks, government agencies, defense contractors, healthcare organizations, and many manufacturing facilities run their infrastructure in isolated networks. These air-gapped environments present a unique challenge: how do you deploy a Kubernetes cluster when you cannot reach Docker Hub, GitHub releases, or the official Kubernetes registry?</p><p>Kubespray handles this scenario, but it requires preparation. 
You need to stage all the binaries, container images, and packages inside your isolated network before running the deployment. This chapter walks through the entire process.</p><h4>The fundamental problem with offline deployment</h4><p>When Kubespray runs in a normal environment, it downloads dozens of components from the internet. The containerd binary comes from GitHub releases. The runc binary comes from the OpenContainers project. CNI plugins, etcd, kubelet, kubectl, kubeadm — all of these get pulled during the deployment process. On top of that, Kubernetes itself needs container images: the pause image, CoreDNS, kube-proxy, and whatever CNI plugin you chose.</p><p>In an air-gapped network, none of these downloads work. The deployment fails immediately when Ansible tries to fetch the first binary. You need to solve two separate problems: getting the binaries into the network, and getting the container images into the network.</p><h4>What you need inside the air-gap</h4><p>Before starting, you need infrastructure inside the isolated network:</p><p>A private container registry is mandatory. Harbor and Nexus are popular choices, but any OCI-compliant registry works. This registry will hold all the container images that Kubernetes needs. You need to pre-populate it with images before deployment.</p><p>An HTTP file server is necessary for binaries. Kubespray downloads binaries like containerd, runc, etcd, and the Kubernetes components from URLs. You need to host these files on an internal server and tell Kubespray where to find them.</p><p>An Ansible control node must exist inside the air-gapped network. You cannot run Ansible from outside and expect it to reach into the isolated environment. The machine running ansible-playbook needs network access to all the target nodes.</p><h4>Figuring out what to download</h4><p>Kubespray maintains a list of everything it needs in the variable files. 
The file roles/kubespray_defaults/defaults/main/download.yml contains over a thousand lines of download URLs and version mappings. You need to examine this file to understand what to stage.</p><p>For a specific version of Kubespray, you can generate the complete list of required files. The download role has a feature that helps with this:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  --tags download \<br>  -e download_run_once=true \<br>  -e download_localhost=true</pre><p>This command runs only the download tasks and stores everything locally on the Ansible control node. In a connected environment, you would run this on a machine with internet access, then transfer the downloaded files into the air-gapped network.</p><p>The downloaded files end up in the directory specified by local_release_dir, which defaults to /tmp/releases. You will find binaries for containerd, runc, crictl, nerdctl, CNI plugins, etcd, and the Kubernetes components.</p><h4>Staging the binaries</h4><p>Once you have the binaries, you need to host them on an HTTP server inside your air-gapped network. The structure matters because Kubespray constructs download URLs based on patterns.</p><p>Create a directory structure on your internal web server:</p><pre>/var/www/files/<br>├── kubernetes/<br>│   └── v1.32.0/<br>│       ├── kubeadm<br>│       ├── kubectl<br>│       └── kubelet<br>├── containerd/<br>│   └── v2.0.0/<br>│       └── containerd-2.0.0-linux-amd64.tar.gz<br>├── runc/<br>│   └── v1.2.0/<br>│       └── runc.amd64<br>├── cni-plugins/<br>│   └── v1.6.0/<br>│       └── cni-plugins-linux-amd64-v1.6.0.tgz<br>└── etcd/<br>    └── v3.5.15/<br>        └── etcd-v3.5.15-linux-amd64.tar.gz</pre><p>Then configure Kubespray to use your internal server. 
In inventory/mycluster/group_vars/all/offline.yml, you override the download URLs:</p><pre>files_repo: &quot;http://files.internal.example.com&quot;<br><br>kubeadm_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubeadm&quot;<br>kubectl_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubectl&quot;<br>kubelet_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubelet&quot;<br><br>containerd_download_url: &quot;{{ files_repo }}/containerd/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz&quot;<br><br>runc_download_url: &quot;{{ files_repo }}/runc/v{{ runc_version }}/runc.{{ image_arch }}&quot;<br><br>cni_download_url: &quot;{{ files_repo }}/cni-plugins/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz&quot;<br><br>etcd_download_url: &quot;{{ files_repo }}/etcd/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz&quot;</pre><h4>Handling container images</h4><p>Container images are the trickier part. Kubernetes pulls images constantly during normal operation, not just during initial deployment. The pause image gets pulled for every pod. CoreDNS images, CNI plugin images, metrics-server images — they all need to come from somewhere.</p><p>The solution is a private registry that mirrors the public registries. You need to populate this registry with every image Kubespray deploys, then configure containerd to use your registry instead of the public ones.</p><p>First, identify all required images. Kubespray provides a way to list them:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  --tags download \<br>  -e download_container=true \<br>  -e skip_downloads=false \<br>  --list-tasks 2&gt;&amp;1 | grep -i image</pre><p>For a more complete list, check the download.yml defaults file. 
The images typically include:</p><pre>registry.k8s.io/pause:3.10<br>registry.k8s.io/coredns/coredns:v1.11.3<br>registry.k8s.io/kube-proxy:v1.32.0<br>registry.k8s.io/metrics-server/metrics-server:v0.7.2<br>quay.io/coreos/flannel:v0.26.1<br>docker.io/flannel/flannel-cni-plugin:v1.6.0</pre><p>The exact list depends on your configuration choices. If you use Calico instead of Flannel, you need Calico images. If you enable the Nginx ingress controller, you need those images too.</p><h4>Populating the private registry</h4><p>On a machine with internet access, pull all the required images, retag them for your private registry, and push them. This is tedious but straightforward:</p><pre># Pull from public registry<br>docker pull registry.k8s.io/pause:3.10<br><br># Tag for your private registry<br>docker tag registry.k8s.io/pause:3.10 registry.internal.example.com/pause:3.10<br><br># Push to your private registry<br>docker push registry.internal.example.com/pause:3.10</pre><p>For dozens of images, you want a script. Here is a simple approach:</p><pre>#!/bin/bash<br><br>PRIVATE_REGISTRY=&quot;registry.internal.example.com&quot;<br><br>IMAGES=(<br>  &quot;registry.k8s.io/pause:3.10&quot;<br>  &quot;registry.k8s.io/coredns/coredns:v1.11.3&quot;<br>  &quot;registry.k8s.io/kube-proxy:v1.32.0&quot;<br>  &quot;quay.io/coreos/flannel:v0.26.1&quot;<br>  &quot;docker.io/flannel/flannel-cni-plugin:v1.6.0&quot;<br>)<br><br>for IMAGE in &quot;${IMAGES[@]}&quot;; do<br>  # Strip only the registry host so the repository path<br>  # (for example coredns/coredns) matches what a mirror lookup expects<br>  NAME=&quot;${IMAGE#*/}&quot;<br>  <br>  docker pull &quot;$IMAGE&quot;<br>  docker tag &quot;$IMAGE&quot; &quot;${PRIVATE_REGISTRY}/${NAME}&quot;<br>  docker push &quot;${PRIVATE_REGISTRY}/${NAME}&quot;<br>done</pre><p>The result of this process is a populated registry (or, if you export the images with docker save, a tarball) that you can transfer into the air-gapped network.</p><h4>Configuring containerd to use your registry</h4><p>Containerd needs to know where to find images. 
The modern approach uses the registry configuration directory at /etc/containerd/certs.d/. Kubespray can set this up automatically if you configure the right variables.</p><p>In inventory/mycluster/group_vars/all/containerd.yml, configure registry mirrors:</p><pre>containerd_registries_mirrors:<br>  - prefix: registry.k8s.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>  - prefix: docker.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>  - prefix: quay.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false</pre><p>This configuration tells containerd to redirect all image pulls to your internal registry. When Kubernetes tries to pull registry.k8s.io/pause:3.10, containerd intercepts that request and fetches from registry.internal.example.com/pause:3.10 instead.</p><p>If your internal registry uses a self-signed certificate, you have two options. You can set skip_verify: true, which is insecure but works. The better approach is to distribute your CA certificate to all nodes and reference it:</p><pre>containerd_registries_mirrors:<br>  - prefix: registry.k8s.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        ca_file: /etc/containerd/certs.d/ca.crt</pre><p>You need to ensure the CA certificate file exists on all nodes before containerd starts. 
Kubespray can copy this file using the containerd role if you configure it properly, or you can handle it in your pre-deployment automation.</p><p>The resulting directory structure on each node looks like this:</p><pre>/etc/containerd/certs.d/<br>├── registry.k8s.io/<br>│   └── hosts.toml<br>├── docker.io/<br>│   └── hosts.toml<br>└── quay.io/<br>    └── hosts.toml</pre><p>Each hosts.toml file contains the mirror configuration:</p><pre>server = &quot;https://registry.k8s.io&quot;<br><br>[host.&quot;https://registry.internal.example.com&quot;]<br>  capabilities = [&quot;pull&quot;, &quot;resolve&quot;]<br>  ca = &quot;/etc/containerd/certs.d/ca.crt&quot;</pre><h4>Overriding the sandbox image</h4><p>One image deserves special attention: the pause image. Kubernetes uses this image for every pod sandbox. The default location is registry.k8s.io/pause:3.10, but in an air-gapped environment, this pull fails.</p><p>You must override the sandbox image location in your Kubespray configuration. In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml:</p><pre>kube_image_repo: &quot;registry.internal.example.com&quot;</pre><p>Or more specifically for just the pause image:</p><pre>pod_infra_image_repo: &quot;registry.internal.example.com&quot;<br>pod_infra_image_tag: &quot;3.10&quot;</pre><p>Kubespray constructs the full pause image path from these variables. Getting this wrong causes every pod creation to fail, so verify it carefully.</p><h4>Running the offline deployment</h4><p>With binaries staged on your internal HTTP server and images populated in your private registry, you can run the deployment. The command looks almost identical to a normal deployment:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  -e download_run_once=false \<br>  -e download_localhost=false</pre><p>The download_run_once=false flag tells Kubespray to download binaries on each node rather than downloading once and distributing. 
In an offline setup with a local file server, this is fine because the downloads hit your internal server.</p><p>If you pre-staged all binaries directly on the nodes (perhaps using a separate Ansible playbook or image), you can skip downloads entirely:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml \<br>  -e skip_downloads=true</pre><p>This assumes the binaries already exist in the expected locations on each node.</p><h4>Handling package manager dependencies</h4><p>Kubespray installs some components through the operating system package manager. On RHEL-based systems, this means yum or dnf. On Debian-based systems, apt. In an air-gapped environment, these package repositories are unreachable.</p><p>You have a few options. The cleanest is to set up an internal mirror of the required repositories. Tools like reposync for RHEL or apt-mirror for Debian can create local copies of public repositories. Then configure your nodes to point at the internal mirror.</p><p>On Rocky Linux, for example, you would modify /etc/yum.repos.d/ files to point at your internal server:</p><pre>[baseos]<br>name=Rocky Linux $releasever - BaseOS<br>baseurl=http://repo.internal.example.com/rocky/$releasever/BaseOS/$basearch/os/<br>gpgcheck=1<br>enabled=1<br>gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-Rocky-9</pre><p>Alternatively, you can pre-install all required packages on a golden image and use that image for your Kubernetes nodes. Kubespray tries to install packages like conntrack, socat, and other utilities. If they are already present, the installation tasks succeed without reaching external repositories.</p><p>The third option is to have Kubespray skip package management entirely if you have prepared the nodes. Take care with the variable you use: the setting below only stops Kubespray from changing each node’s hostname, so on its own it does not skip any package installation. Check the variables and tags of your Kubespray version for the switch that actually applies:</p><pre>override_system_hostname: false</pre><p>This is risky because you might miss required dependencies. 
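</p><p>One way to reduce that risk is to assert up front that the utilities are really installed. The play below is a minimal sketch rather than anything Kubespray ships: the file name, the required_packages variable, and the package list are assumptions for illustration, and package names differ between distributions (conntrack on Debian-family systems is typically conntrack-tools on RHEL-family systems).</p><pre># verify-packages.yml (hypothetical helper playbook, not part of Kubespray)<br>- name: Verify pre-installed packages on prepared nodes<br>  hosts: all<br>  become: true<br>  vars:<br>    # Assumed list for illustration; adjust names for your distribution<br>    required_packages:<br>      - conntrack<br>      - socat<br>  tasks:<br>    - name: Collect the list of installed packages<br>      ansible.builtin.package_facts:<br>        manager: auto<br><br>    - name: Fail early if a required package is missing<br>      ansible.builtin.assert:<br>        that:<br>          - item in ansible_facts.packages<br>        fail_msg: &quot;{{ item }} is not installed on {{ inventory_hostname }}&quot;<br>      loop: &quot;{{ required_packages }}&quot;</pre><p>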
The safer approach is proper repository mirroring.</p><h4>Testing the offline deployment</h4><p>Before deploying to production, test your offline setup thoroughly. A common mistake is missing one image or one binary. The deployment proceeds for thirty minutes, then fails when it cannot pull a specific component.</p><p>Run the deployment in check mode first:</p><pre>ansible-playbook -i inventory/mycluster/inventory.ini cluster.yml --check</pre><p>This does not catch all issues because some tasks behave differently in check mode, but it catches obvious configuration problems.</p><p>Then run a real deployment against a test environment. Watch the logs carefully. Any “connection refused” or “timeout” errors indicate something reaching out to the internet. Any “image pull failed” errors indicate missing images in your registry.</p><p>After deployment, verify that no external connections occurred. On each node, check the containerd logs:</p><pre>journalctl -u containerd | grep -i &quot;registry.k8s.io\|docker.io\|quay.io&quot;</pre><p>If you see successful pulls to external registries, your mirror configuration is wrong. The pulls should go to your internal registry.</p><h4>Maintaining an air-gapped cluster</h4><p>Initial deployment is only half the battle. You need to maintain the cluster over time, which means upgrades, new workloads, and security patches.</p><p>For upgrades, repeat the staging process. Download the new versions of binaries and images on a connected machine, transfer them into the air-gap, update your internal servers, then run the Kubespray upgrade playbook.</p><p>For new workloads, every container image your applications need must exist in your private registry. This requires a process: developers request images, someone pulls them on a connected machine, transfers them in, and pushes to the internal registry. Without this workflow, deployments fail.</p><p>Security patches for the underlying operating system require repository updates. 
Your internal mirror needs regular synchronization with upstream repositories, done through a controlled process that transfers packages into the air-gapped network.</p><p>None of this is automatic. Air-gapped environments trade convenience for isolation. The security benefit comes with an operational cost that you must plan for.</p><h4>Common pitfalls</h4><p>The most frequent mistake is forgetting one image. Kubespray deploys successfully, but later a DaemonSet fails to start because its image is missing from the private registry. Always verify your image list against the actual deployment configuration.</p><p>Certificate issues cause endless frustration. If your private registry uses TLS with a certificate signed by an internal CA, that CA must be trusted by containerd on every node. Missing CA distribution causes “x509: certificate signed by unknown authority” errors.</p><p>DNS resolution inside the air-gap requires attention. Your nodes need to resolve the internal registry hostname. If DNS is not available, use IP addresses or populate /etc/hosts.</p><p>Binary architecture mismatches occur when you download amd64 binaries but deploy to arm64 nodes, or vice versa. Check the architecture of your target nodes and download the correct binaries.</p><p>Version mismatches between staged files and Kubespray configuration cause subtle failures. If you stage containerd 2.0.0 but Kubespray expects 2.0.1, the checksum verification fails. 
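</p><p>A pre-flight check can catch this kind of mismatch before the deployment does. The play below is a sketch under stated assumptions, not Kubespray code: the file name, the staged_files variable, the path, and the placeholder checksum are all hypothetical, and you would fill in the checksum your Kubespray version expects. Run it on the machine that hosts the staged files.</p><pre># verify-staged-files.yml (hypothetical helper playbook, not part of Kubespray)<br>- name: Compare staged binaries against expected checksums<br>  hosts: localhost<br>  gather_facts: false<br>  vars:<br>    staged_files:<br>      # Placeholder path and checksum; replace both with your real values<br>      - path: /srv/files/containerd/containerd.tar.gz<br>        sha256: REPLACE_WITH_EXPECTED_SHA256<br>  tasks:<br>    - name: Compute the SHA-256 of each staged file<br>      ansible.builtin.stat:<br>        path: &quot;{{ item.path }}&quot;<br>        checksum_algorithm: sha256<br>      loop: &quot;{{ staged_files }}&quot;<br>      register: staged_stats<br><br>    - name: Fail if a file is missing or its checksum differs<br>      ansible.builtin.assert:<br>        that:<br>          - item.stat.exists<br>          - item.stat.checksum == item.item.sha256<br>        fail_msg: &quot;{{ item.item.path }} is missing or does not match&quot;<br>      loop: &quot;{{ staged_stats.results }}&quot;</pre><p>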
Keep your staged files synchronized with the version variables in your inventory.</p><h4>A working offline.yml example</h4><p>Here is a more complete example of the offline configuration file that ties everything together:</p><pre># inventory/mycluster/group_vars/all/offline.yml<br><br># Internal file server for binaries<br>files_repo: &quot;http://files.internal.example.com&quot;<br><br># Kubernetes binaries<br>kubeadm_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubeadm&quot;<br>kubectl_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubectl&quot;<br>kubelet_download_url: &quot;{{ files_repo }}/kubernetes/{{ kube_version }}/kubelet&quot;<br># Container runtime binaries<br>containerd_download_url: &quot;{{ files_repo }}/containerd/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz&quot;<br>runc_download_url: &quot;{{ files_repo }}/runc/v{{ runc_version }}/runc.{{ image_arch }}&quot;<br>nerdctl_download_url: &quot;{{ files_repo }}/nerdctl/v{{ nerdctl_version }}/nerdctl-{{ nerdctl_version }}-linux-{{ image_arch }}.tar.gz&quot;<br>crictl_download_url: &quot;{{ files_repo }}/crictl/v{{ crictl_version }}/crictl-v{{ crictl_version }}-linux-{{ image_arch }}.tar.gz&quot;<br># CNI plugins<br>cni_download_url: &quot;{{ files_repo }}/cni-plugins/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz&quot;<br># etcd<br>etcd_download_url: &quot;{{ files_repo }}/etcd/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-{{ image_arch }}.tar.gz&quot;<br># Helm (if enabled)<br>helm_download_url: &quot;{{ files_repo }}/helm/v{{ helm_version }}/helm-v{{ helm_version }}-linux-{{ image_arch }}.tar.gz&quot;<br># Internal container registry<br>kube_image_repo: &quot;registry.internal.example.com&quot;<br>gcr_image_repo: &quot;registry.internal.example.com&quot;<br>docker_image_repo: &quot;registry.internal.example.com&quot;<br>quay_image_repo: 
&quot;registry.internal.example.com&quot;</pre><p>And the corresponding containerd configuration:</p><pre># inventory/mycluster/group_vars/all/containerd.yml<br><br>containerd_registries_mirrors:<br>  - prefix: registry.k8s.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt<br>        <br>  - prefix: docker.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt<br>        <br>  - prefix: quay.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt<br>        <br>  - prefix: ghcr.io<br>    mirrors:<br>      - host: https://registry.internal.example.com<br>        capabilities: [&quot;pull&quot;, &quot;resolve&quot;]<br>        skip_verify: false<br>        ca_file: /etc/pki/ca-trust/source/anchors/internal-ca.crt</pre><p>This configuration assumes you have distributed your internal CA certificate to /etc/pki/ca-trust/source/anchors/ on all nodes and run update-ca-trust to install it.</p><p>The effort required for air-gapped deployment is substantial, but it works. Organizations running Kubernetes in isolated networks do this every day. The key is methodical preparation: identify every external dependency, stage it internally, and verify the configuration before deployment.</p><h3>Chapter 16: Kubespray Execution Flow — The Complete Picture</h3><p>When you run ansible-playbook cluster.yml, you’re not just executing a single script. 
You’re kicking off a carefully orchestrated sequence of 15 PLAYs containing roughly 559 TASKs that transform bare Linux machines into a production-ready Kubernetes cluster. Understanding this flow isn’t just academic curiosity — it’s essential knowledge for troubleshooting failures, optimizing deployment times, and knowing exactly what changed on your systems.</p><p>Let me walk you through the entire journey, from the moment you hit enter to when your cluster becomes ready.</p><h4>The Starting Point</h4><p>Everything begins with cluster.yml in the Kubespray root directory. This file doesn’t contain much actual logic — it’s primarily an orchestrator that imports other playbooks and assigns roles to host groups. If you open it up, you’ll see something like this:</p><pre>- name: Check Ansible version<br>  hosts: localhost<br>  gather_facts: false<br>  tasks:<br>    - name: Check minimal Ansible version<br>      assert:<br>        that:<br>          - ansible_version.full is version(minimal_ansible_version, &#39;&gt;=&#39;)<br>        msg: &quot;Ansible version must be &gt;= {{ minimal_ansible_version }}&quot;<br><br>- name: Add kube-master nodes to kube_control_plane<br>  hosts: kube-master<br>  gather_facts: false<br>  tasks:<br>    - name: Add nodes to kube_control_plane group<br>      group_by:<br>        key: kube_control_plane<br>- name: Gather facts<br>  import_playbook: facts.yml<br>- name: Prepare for etcd install<br>  hosts: &quot;{{ etcd_hosts | default(&#39;etcd&#39;) }}&quot;<br>  gather_facts: false<br>  any_errors_fatal: &quot;{{ any_errors_fatal | default(true) }}&quot;<br>  environment: &quot;{{ proxy_disable_env }}&quot;<br>  roles:<br>    - { role: kubespray_defaults }<br>    - { role: kubernetes/preinstall, tags: preinstall }<br>    - { role: container-engine, tags: container-engine, when: deploy_container_engine }<br>    - { role: download, tags: download, when: &quot;not skip_downloads&quot; }</pre><p>Notice how every play starts with the 
kubespray_defaults role. This isn’t optional decoration — it loads hundreds of default variables that subsequent roles depend on. Without it, you’d get undefined variable errors everywhere.</p><h4>Play 1: Ansible Version Verification</h4><p>The very first thing Kubespray does is check whether your Ansible version is compatible. This happens on localhost, not on any remote nodes. The check is simple but critical:</p><pre>- name: Check minimal Ansible version<br>  assert:<br>    that:<br>      - ansible_version.full is version(&#39;2.14.0&#39;, &#39;&gt;=&#39;)<br>      - ansible_version.full is version(&#39;2.18.0&#39;, &#39;&lt;&#39;)<br>    msg: &quot;Ansible version must be &gt;= 2.14.0 and &lt; 2.18.0&quot;</pre><p>If you’re running Ansible 2.12 or Ansible 2.19, the playbook stops right here. I’ve seen people waste hours troubleshooting weird failures only to discover they had an incompatible Ansible version. This check saves you from that.</p><h4>Play 2: Inventory Validation</h4><p>Before touching any remote machine, Kubespray validates your inventory structure. It checks things like:</p><ul><li>Do you have at least one node in kube_control_plane?</li><li>Is your etcd node count an odd number (1, 3, or 5)?</li><li>Are there any duplicate hostnames?</li></ul><p>The etcd check is particularly important:</p><pre>- name: Check that etcd hosts are odd<br>  assert:<br>    that:<br>      - groups[&#39;etcd&#39;] | length is odd<br>    msg: &quot;etcd hosts must be an odd number (1, 3, or 5)&quot;<br>  when: etcd_cluster_setup | default(true)</pre><p>This isn’t Kubespray being pedantic. etcd uses Raft consensus, which requires a majority (quorum) to operate. With 3 nodes, you can lose 1 and still function. With 4 nodes, you can still only lose 1 — so that extra node bought you nothing but more complexity. Stick to odd numbers.</p><h4>Play 3: Bootstrap OS</h4><p>Now we start touching remote machines. 
The bootstrap-os role prepares the operating system for Kubernetes installation. This is where all those “prerequisites” you’d normally configure manually get handled automatically.</p><p>First, it ensures Python is available (Ansible needs it):</p><pre>- name: Install python3<br>  raw: dnf install -y python3 python3-dnf<br>  when: ansible_os_family == &quot;RedHat&quot;<br>  changed_when: false</pre><p>Then it loads kernel modules required for container networking:</p><pre>- name: Load kernel modules<br>  modprobe:<br>    name: &quot;{{ item }}&quot;<br>    state: present<br>  loop:<br>    - overlay<br>    - br_netfilter</pre><p>The overlay module enables OverlayFS, which containerd uses for efficient container image layer management. The br_netfilter module allows iptables to see bridged traffic — without this, your pod networking won’t work properly.</p><p>Next come the sysctl parameters:</p><pre>- name: Set sysctl parameters<br>  sysctl:<br>    name: &quot;{{ item.name }}&quot;<br>    value: &quot;{{ item.value }}&quot;<br>    state: present<br>    reload: yes<br>  loop:<br>    - { name: &#39;net.bridge.bridge-nf-call-iptables&#39;, value: &#39;1&#39; }<br>    - { name: &#39;net.bridge.bridge-nf-call-ip6tables&#39;, value: &#39;1&#39; }<br>    - { name: &#39;net.ipv4.ip_forward&#39;, value: &#39;1&#39; }</pre><p>That ip_forward setting is crucial. Without it, your node can’t route packets between pods. I’ve debugged many “pods can’t reach each other” issues that came down to this single parameter being 0.</p><p>Swap gets disabled too:</p><pre>- name: Disable swap<br>  command: swapoff -a<br>  when: ansible_swaptotal_mb &gt; 0<br><br>- name: Remove swap from fstab<br>  lineinfile:<br>    path: /etc/fstab<br>    regexp: &#39;^([^#].*\s+swap\s+.*)$&#39;<br>    line: &#39;# \1&#39;<br>    backrefs: yes</pre><p>Kubernetes historically required swap to be off because the scheduler’s resource calculations assumed no swap. 
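</p><p>To confirm that these prerequisites actually landed on a node, a small verification play can be quicker than checking by hand. The following is a sketch under stated assumptions, not a Kubespray playbook: the file name and the selection of checks are illustrative, and it only reads state without changing anything.</p><pre># verify-bootstrap.yml (hypothetical helper playbook, not part of Kubespray)<br>- name: Verify bootstrap prerequisites<br>  hosts: all<br>  become: true<br>  tasks:<br>    - name: Read the forwarding and bridge sysctl values<br>      # This task fails outright if br_netfilter is not loaded<br>      ansible.builtin.command: &quot;sysctl -n {{ item }}&quot;<br>      loop:<br>        - net.ipv4.ip_forward<br>        - net.bridge.bridge-nf-call-iptables<br>      register: sysctl_values<br>      changed_when: false<br><br>    - name: Assert each sysctl is set to 1<br>      ansible.builtin.assert:<br>        that:<br>          - item.stdout == &quot;1&quot;<br>        fail_msg: &quot;{{ item.item }} is {{ item.stdout }}, expected 1&quot;<br>      loop: &quot;{{ sysctl_values.results }}&quot;<br><br>    - name: Assert swap is off<br>      ansible.builtin.assert:<br>        that:<br>          - ansible_facts.swaptotal_mb == 0<br>        fail_msg: &quot;swap is still enabled on {{ inventory_hostname }}&quot;</pre><p>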
While newer versions support swap (with feature gates), Kubespray still disables it by default for compatibility.</p><h4>Play 4: Gather Facts</h4><p>Ansible’s fact gathering collects detailed information about each node — IP addresses, CPU count, memory, disk space, OS version, network interfaces. Kubespray uses this data extensively:</p><pre>- name: Gather facts<br>  hosts: all<br>  gather_facts: true<br>  tasks:<br>    - name: Gather minimal facts<br>      setup:<br>        gather_subset:<br>          - &#39;!all&#39;<br>          - &#39;!min&#39;<br>          - &#39;network&#39;<br>          - &#39;hardware&#39;<br>          - &#39;virtual&#39;</pre><p>The gathered facts get cached to /tmp (as configured in ansible.cfg):</p><pre>gathering = smart<br>fact_caching = jsonfile<br>fact_caching_connection = /tmp<br>fact_caching_timeout = 86400</pre><p>This caching means subsequent plays don’t need to re-gather facts, significantly speeding up the deployment. The 86400-second timeout (24 hours) means you can re-run the playbook multiple times during initial setup without the overhead of fresh fact gathering each time.</p><h4>Play 5: Container Engine Installation</h4><p>This is where things get interesting. The container-engine role installs containerd, runc, and CNI plugins. Let me break down the sequence.</p><p>First, runc gets downloaded and installed:</p><pre>- name: Download runc binary<br>  get_url:<br>    url: &quot;https://github.com/opencontainers/runc/releases/download/v{{ runc_version }}/runc.{{ image_arch }}&quot;<br>    dest: &quot;{{ local_release_dir }}/runc-{{ runc_version }}&quot;<br>    mode: &#39;0755&#39;<br>    checksum: &quot;sha256:{{ runc_checksums[image_arch] }}&quot;<br><br>- name: Copy runc to bin directory<br>  copy:<br>    src: &quot;{{ local_release_dir }}/runc-{{ runc_version }}&quot;<br>    dest: &quot;{{ bin_dir }}/runc&quot;<br>    mode: &#39;0755&#39;<br>    remote_src: yes</pre><p>Notice the checksum verification. 
Kubespray maintains checksums for every binary it downloads in roles/kubespray_defaults/vars/main/checksums.yml. This prevents both download corruption and supply chain attacks.</p><p>CNI plugins come next:</p><pre>- name: Download CNI plugins<br>  get_url:<br>    url: &quot;https://github.com/containernetworking/plugins/releases/download/v{{ cni_version }}/cni-plugins-linux-{{ image_arch }}-v{{ cni_version }}.tgz&quot;<br>    dest: &quot;{{ local_release_dir }}/cni-plugins-{{ cni_version }}.tgz&quot;<br>    checksum: &quot;sha256:{{ cni_binary_checksums[image_arch] }}&quot;<br><br>- name: Extract CNI plugins<br>  unarchive:<br>    src: &quot;{{ local_release_dir }}/cni-plugins-{{ cni_version }}.tgz&quot;<br>    dest: /opt/cni/bin<br>    remote_src: yes</pre><p>The CNI plugins land in /opt/cni/bin, which is the standard location that kubelet and containerd expect. You’ll find binaries like bridge, host-local, loopback, portmap there after installation.</p><p>Now containerd itself:</p><pre>- name: Download containerd<br>  get_url:<br>    url: &quot;https://github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz&quot;<br>    dest: &quot;{{ local_release_dir }}/containerd-{{ containerd_version }}.tar.gz&quot;<br>    checksum: &quot;sha256:{{ containerd_checksums[image_arch] }}&quot;<br><br>- name: Extract containerd<br>  unarchive:<br>    src: &quot;{{ local_release_dir }}/containerd-{{ containerd_version }}.tar.gz&quot;<br>    dest: /usr/local<br>    remote_src: yes</pre><p>The containerd tarball extracts to /usr/local, putting binaries in /usr/local/bin. 
This includes containerd itself, containerd-shim-runc-v2, and ctr (the containerd CLI).</p><p>The configuration file gets generated from a template:</p><pre>- name: Generate containerd config<br>  template:<br>    src: config.toml.j2<br>    dest: /etc/containerd/config.toml<br>    mode: &#39;0644&#39;<br>  notify: restart containerd</pre><p>That config.toml is critical. Here’s what a typical generated configuration looks like:</p><pre>version = 2<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  sandbox_image = &quot;registry.k8s.io/pause:3.9&quot;<br>  <br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>    default_runtime_name = &quot;runc&quot;<br>    <br>    [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>      runtime_type = &quot;io.containerd.runc.v2&quot;<br>      <br>      [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>        SystemdCgroup = true<br>        <br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.registry]<br>    config_path = &quot;/etc/containerd/certs.d&quot;</pre><p>That SystemdCgroup = true setting is vital. It tells containerd to use systemd’s cgroup driver instead of cgroupfs. This must match what kubelet uses, and since modern systemd-based distributions (basically everything now) work better with the systemd driver, this is the correct choice.</p><p>Finally, the systemd service gets installed and started:</p><pre>- name: Install containerd systemd service<br>  template:<br>    src: containerd.service.j2<br>    dest: /etc/systemd/system/containerd.service<br>  notify:<br>    - reload systemd<br>    - restart containerd<br><br>- name: Enable and start containerd<br>  systemd:<br>    name: containerd<br>    enabled: yes<br>    state: started<br>    daemon_reload: yes</pre><h4>Play 6: Download Binaries and Images</h4><p>The download role is one of the longer-running parts of deployment. 
It fetches all the binaries and container images needed for the cluster:</p><pre>- name: Download kubeadm<br>  get_url:<br>    url: &quot;https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubeadm&quot;<br>    dest: &quot;{{ local_release_dir }}/kubeadm-{{ kube_version }}&quot;<br>    checksum: &quot;sha256:{{ kubeadm_checksums[image_arch] }}&quot;<br><br>- name: Download kubelet<br>  get_url:<br>    url: &quot;https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubelet&quot;<br>    dest: &quot;{{ local_release_dir }}/kubelet-{{ kube_version }}&quot;<br>    checksum: &quot;sha256:{{ kubelet_checksums[image_arch] }}&quot;<br>- name: Download kubectl<br>  get_url:<br>    url: &quot;https://dl.k8s.io/release/v{{ kube_version }}/bin/linux/{{ image_arch }}/kubectl&quot;<br>    dest: &quot;{{ local_release_dir }}/kubectl-{{ kube_version }}&quot;<br>    checksum: &quot;sha256:{{ kubectl_checksums[image_arch] }}&quot;</pre><p>Container images get pulled to each node:</p><pre>- name: Pull required images<br>  command: &quot;{{ bin_dir }}/crictl pull {{ item }}&quot;<br>  loop:<br>    - &quot;registry.k8s.io/pause:3.9&quot;<br>    - &quot;registry.k8s.io/kube-apiserver:v{{ kube_version }}&quot;<br>    - &quot;registry.k8s.io/kube-controller-manager:v{{ kube_version }}&quot;<br>    - &quot;registry.k8s.io/kube-scheduler:v{{ kube_version }}&quot;<br>    - &quot;registry.k8s.io/kube-proxy:v{{ kube_version }}&quot;<br>    - &quot;registry.k8s.io/coredns/coredns:v{{ coredns_version }}&quot;<br>  when: inventory_hostname in groups[&#39;kube_control_plane&#39;]</pre><p>Kubespray optimizes this by downloading once and distributing, controlled by the download_run_once variable. 
In large clusters, this prevents hammering external registries with duplicate requests from every node.</p><h4>Play 7: etcd Installation</h4><p>If you’re using the default etcd_deployment_type: host, etcd gets installed as a systemd service rather than a Kubernetes static pod. This gives you independent etcd management — you can restart etcd without affecting the kubelet, and etcd logs go to journald rather than container logs.</p><p>The etcd role starts by creating the etcd user and directories:</p><pre>- name: Create etcd user<br>  user:<br>    name: etcd<br>    shell: /sbin/nologin<br>    system: yes<br>    create_home: no<br><br>- name: Create etcd directories<br>  file:<br>    path: &quot;{{ item }}&quot;<br>    state: directory<br>    owner: etcd<br>    group: etcd<br>    mode: &#39;0700&#39;<br>  loop:<br>    - /var/lib/etcd<br>    - /etc/ssl/etcd/ssl</pre><p>Certificate generation is the complex part. etcd uses mutual TLS for all communications, meaning both client and server authenticate each other with certificates. 
Kubespray generates several certificate types:</p><pre># CA certificate - signs all other etcd certificates<br>/etc/ssl/etcd/ssl/ca.pem<br>/etc/ssl/etcd/ssl/ca-key.pem<br><br># Member certificate - used by etcd servers for peer communication<br>/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem<br>/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem<br># Admin certificate - used by etcdctl for management<br>/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem<br>/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem<br># Node certificate - used by kube-apiserver to connect to etcd<br>/etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem<br>/etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem</pre><p>The certificate generation uses openssl commands:</p><pre>- name: Generate etcd member certificate<br>  command: &gt;<br>    openssl req -new -key {{ etcd_cert_dir }}/member-{{ inventory_hostname }}-key.pem<br>    -out {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.csr<br>    -subj &quot;/CN=etcd-member-{{ inventory_hostname }}&quot;<br>    -config {{ etcd_cert_dir }}/openssl.conf<br><br>- name: Sign etcd member certificate<br>  command: &gt;<br>    openssl x509 -req -in {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.csr<br>    -CA {{ etcd_cert_dir }}/ca.pem<br>    -CAkey {{ etcd_cert_dir }}/ca-key.pem<br>    -CAcreateserial<br>    -out {{ etcd_cert_dir }}/member-{{ inventory_hostname }}.pem<br>    -days 3650<br>    -extensions v3_req<br>    -extfile {{ etcd_cert_dir }}/openssl.conf</pre><p>The etcd environment file contains all the configuration:</p><pre># /etc/etcd.env<br>ETCD_NAME={{ inventory_hostname }}<br>ETCD_DATA_DIR=/var/lib/etcd</pre><pre>ETCD_LISTEN_PEER_URLS=https://{{ etcd_address }}:2380<br>ETCD_LISTEN_CLIENT_URLS=https://{{ etcd_address }}:2379,https://127.0.0.1:2379<br>ETCD_ADVERTISE_CLIENT_URLS=https://{{ etcd_address }}:2379<br>ETCD_INITIAL_ADVERTISE_PEER_URLS=https://{{ etcd_address }}:2380</pre><pre>ETCD_INITIAL_CLUSTER={{ 
etcd_initial_cluster }}<br>ETCD_INITIAL_CLUSTER_STATE={{ etcd_cluster_state }}<br>ETCD_INITIAL_CLUSTER_TOKEN=k8s-etcd-cluster</pre><pre># TLS configuration<br>ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem<br>ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem<br>ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem<br>ETCD_CLIENT_CERT_AUTH=true</pre><pre>ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem<br>ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem<br>ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem<br>ETCD_PEER_CLIENT_CERT_AUTH=true</pre><p>The systemd service file:</p><pre># /etc/systemd/system/etcd.service<br>[Unit]<br>Description=etcd<br>Documentation=https://github.com/etcd-io/etcd<br>After=network.target</pre><pre>[Service]<br>User=etcd<br>Type=notify<br>EnvironmentFile=/etc/etcd.env<br>ExecStart=/usr/local/bin/etcd<br>Restart=always<br>RestartSec=10s<br>LimitNOFILE=65536</pre><pre>[Install]<br>WantedBy=multi-user.target</pre><p>After starting etcd, Kubespray verifies the cluster is healthy:</p><pre>- name: Wait for etcd cluster to be healthy<br>  command: &gt;<br>    {{ bin_dir }}/etcdctl endpoint health<br>    --endpoints=https://{{ etcd_address }}:2379<br>    --cacert={{ etcd_cert_dir }}/ca.pem<br>    --cert={{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem<br>    --key={{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem<br>  register: etcd_health<br>  until: etcd_health.rc == 0<br>  retries: 10<br>  delay: 5</pre><h4>Play 8: Kubernetes Node Components</h4><p>The kubernetes/node role installs components needed on ALL nodes, both control plane and workers. 
This includes kubelet, kubectl, and kubeadm.</p><p>Binary installation:</p><pre>- name: Install kubelet<br>  copy:<br>    src: &quot;{{ local_release_dir }}/kubelet-{{ kube_version }}&quot;<br>    dest: &quot;{{ bin_dir }}/kubelet&quot;<br>    mode: &#39;0755&#39;<br>    remote_src: yes</pre><pre>- name: Install kubectl<br>  copy:<br>    src: &quot;{{ local_release_dir }}/kubectl-{{ kube_version }}&quot;<br>    dest: &quot;{{ bin_dir }}/kubectl&quot;<br>    mode: &#39;0755&#39;<br>    remote_src: yes</pre><pre>- name: Install kubeadm<br>  copy:<br>    src: &quot;{{ local_release_dir }}/kubeadm-{{ kube_version }}&quot;<br>    dest: &quot;{{ bin_dir }}/kubeadm&quot;<br>    mode: &#39;0755&#39;<br>    remote_src: yes</pre><p>The kubelet configuration gets generated:</p><pre># /etc/kubernetes/kubelet-config.yaml<br>apiVersion: kubelet.config.k8s.io/v1beta1<br>kind: KubeletConfiguration<br>cgroupDriver: systemd<br>clusterDNS:<br>  - {{ kube_dns_service_ip }}<br>clusterDomain: {{ dns_domain }}<br>containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock<br>maxPods: {{ kubelet_max_pods }}<br>podCIDR: {{ kube_pods_subnet }}<br>resolvConf: /etc/resolv.conf<br>rotateCertificates: true<br>serverTLSBootstrap: true</pre><p>The kubelet systemd service:</p><pre># /etc/systemd/system/kubelet.service<br>[Unit]<br>Description=kubelet: The Kubernetes Node Agent<br>Documentation=https://kubernetes.io/docs/<br>Wants=network-online.target<br>After=network-online.target</pre><pre>[Service]<br>ExecStart={{ bin_dir }}/kubelet \<br>  --config=/etc/kubernetes/kubelet-config.yaml \<br>  --kubeconfig=/etc/kubernetes/kubelet.conf \<br>  --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \<br>  --hostname-override={{ kube_override_hostname | default(inventory_hostname) }}<br>Restart=always<br>StartLimitInterval=0<br>RestartSec=10</pre><pre>[Install]<br>WantedBy=multi-user.target</pre><p>At this point, kubelet is installed but not yet started on most nodes. 
It needs the cluster to be initialized first.</p><h4>Play 9: Control Plane Initialization</h4><p>This is where the cluster actually comes to life. The kubernetes/control-plane role runs kubeadm init on the first control plane node.</p><p>Kubespray generates a kubeadm configuration file:</p><pre># /etc/kubernetes/kubeadm-config.yaml<br>apiVersion: kubeadm.k8s.io/v1beta3<br>kind: InitConfiguration<br>localAPIEndpoint:<br>  advertiseAddress: {{ ip | default(ansible_default_ipv4.address) }}<br>  bindPort: 6443<br>nodeRegistration:<br>  criSocket: unix:///var/run/containerd/containerd.sock<br>  name: {{ inventory_hostname }}<br>  kubeletExtraArgs:<br>    node-ip: {{ ip | default(ansible_default_ipv4.address) }}<br>---<br>apiVersion: kubeadm.k8s.io/v1beta3<br>kind: ClusterConfiguration<br>apiServer:<br>  certSANs:<br>    - kubernetes<br>    - kubernetes.default<br>    - kubernetes.default.svc<br>    - kubernetes.default.svc.{{ dns_domain }}<br>    - localhost<br>    - 127.0.0.1<br>    - {{ ip | default(ansible_default_ipv4.address) }}<br>  extraArgs:<br>    etcd-cafile: /etc/ssl/etcd/ssl/ca.pem<br>    etcd-certfile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem<br>    etcd-keyfile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem<br>    etcd-servers: {{ etcd_endpoints }}<br>certificatesDir: /etc/kubernetes/pki<br>clusterName: {{ cluster_name }}<br>controlPlaneEndpoint: {{ kube_apiserver_endpoint }}<br>etcd:<br>  external:<br>    endpoints:<br>      {% for host in groups[&#39;etcd&#39;] %}<br>      - https://{{ hostvars[host][&#39;ip&#39;] }}:2379<br>      {% endfor %}<br>    caFile: /etc/ssl/etcd/ssl/ca.pem<br>    certFile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}.pem<br>    keyFile: /etc/ssl/etcd/ssl/node-{{ inventory_hostname }}-key.pem<br>imageRepository: {{ kube_image_repo }}<br>kubernetesVersion: v{{ kube_version }}<br>networking:<br>  dnsDomain: {{ dns_domain }}<br>  podSubnet: {{ kube_pods_subnet }}<br>  serviceSubnet: {{ kube_service_addresses 
}}</pre><p>Notice the etcd.external section. Because Kubespray deployed etcd separately (with etcd_deployment_type: host), kubeadm doesn’t manage etcd; it just connects to the existing external cluster.</p><p>The actual initialization:</p><pre>- name: Initialize first control plane node<br>  command: &gt;<br>    {{ bin_dir }}/kubeadm init<br>    --config=/etc/kubernetes/kubeadm-config.yaml<br>    --upload-certs<br>    --skip-phases=addon/coredns<br>  register: kubeadm_init<br>  when: inventory_hostname == groups[&#39;kube_control_plane&#39;][0]</pre><p>The --skip-phases=addon/coredns flag is interesting. Kubespray installs CoreDNS separately with its own configuration rather than using kubeadm’s default CoreDNS deployment. This gives more control over DNS configuration.</p><p>After initialization, kubeadm creates static pod manifests in /etc/kubernetes/manifests/:</p><pre>/etc/kubernetes/manifests/<br>├── kube-apiserver.yaml<br>├── kube-controller-manager.yaml<br>└── kube-scheduler.yaml</pre><p>The kubelet watches this directory and automatically starts these pods. 
Within seconds, your API server should be responding.</p><p>For additional control plane nodes (in HA setups), Kubespray runs kubeadm join with the --control-plane flag:</p><pre>- name: Join additional control plane nodes<br>  command: &gt;<br>    {{ bin_dir }}/kubeadm join {{ kubeadm_join_endpoint }}<br>    --token {{ kubeadm_token }}<br>    --discovery-token-ca-cert-hash sha256:{{ kubeadm_ca_hash }}<br>    --control-plane<br>    --certificate-key {{ kubeadm_certificate_key }}<br>  when: <br>    - inventory_hostname in groups[&#39;kube_control_plane&#39;]<br>    - inventory_hostname != groups[&#39;kube_control_plane&#39;][0]</pre><h4>Play 10: Worker Node Join</h4><p>Worker nodes join the cluster using kubeadm join without the --control-plane flag:</p><pre>- name: Join worker nodes to cluster<br>  command: &gt;<br>    {{ bin_dir }}/kubeadm join {{ kubeadm_join_endpoint }}<br>    --token {{ kubeadm_token }}<br>    --discovery-token-ca-cert-hash sha256:{{ kubeadm_ca_hash }}<br>  when: inventory_hostname in groups[&#39;kube_node&#39;]</pre><p>After joining, kubelet on each worker starts and registers the node with the API server. You can verify with:</p><pre>kubectl get nodes<br>NAME       STATUS     ROLES           AGE   VERSION<br>k8s-ctr    NotReady   control-plane   2m    v1.32.0<br>k8s-w1     NotReady   &lt;none&gt;          1m    v1.32.0<br>k8s-w2     NotReady   &lt;none&gt;          1m    v1.32.0</pre><p>The nodes show NotReady because there’s no CNI plugin yet. Pods can’t get IP addresses, so the node isn’t considered ready for workloads.</p><h4>Play 11: CNI Plugin Installation</h4><p>The network_plugin role installs your chosen CNI. 
Let’s look at Flannel as an example since that’s what the lab environment uses.</p><p>Kubespray generates a Flannel configuration:</p><pre>- name: Create Flannel configmap<br>  template:<br>    src: flannel-config.yaml.j2<br>    dest: /etc/kubernetes/flannel/flannel-config.yaml</pre><p>The template produces something like:</p><pre>kind: ConfigMap<br>apiVersion: v1<br>metadata:<br>  name: kube-flannel-cfg<br>  namespace: kube-flannel<br>data:<br>  cni-conf.json: |<br>    {<br>      &quot;name&quot;: &quot;cbr0&quot;,<br>      &quot;cniVersion&quot;: &quot;0.3.1&quot;,<br>      &quot;plugins&quot;: [<br>        {<br>          &quot;type&quot;: &quot;flannel&quot;,<br>          &quot;delegate&quot;: {<br>            &quot;hairpinMode&quot;: true,<br>            &quot;isDefaultGateway&quot;: true<br>          }<br>        },<br>        {<br>          &quot;type&quot;: &quot;portmap&quot;,<br>          &quot;capabilities&quot;: {<br>            &quot;portMappings&quot;: true<br>          }<br>        }<br>      ]<br>    }<br>  net-conf.json: |<br>    {<br>      &quot;Network&quot;: &quot;{{ kube_pods_subnet }}&quot;,<br>      &quot;Backend&quot;: {<br>        &quot;Type&quot;: &quot;{{ flannel_backend_type }}&quot;<br>      }<br>    }</pre><p>The DaemonSet gets applied:</p><pre>- name: Apply Flannel manifests<br>  command: &quot;{{ bin_dir }}/kubectl apply -f /etc/kubernetes/flannel/&quot;<br>  environment:<br>    KUBECONFIG: /etc/kubernetes/admin.conf</pre><p>Once Flannel pods are running on each node, they configure the CNI and nodes become Ready:</p><pre>kubectl get nodes<br>NAME       STATUS   ROLES           AGE   VERSION<br>k8s-ctr    Ready    control-plane   5m    v1.32.0<br>k8s-w1     Ready    &lt;none&gt;          4m    v1.32.0<br>k8s-w2     Ready    &lt;none&gt;          4m    v1.32.0</pre><p>If you’re using Calico instead, the process is similar but with more components — Calico deploys a DaemonSet for the node agent plus a Deployment for the Typha component 
(in larger clusters) and potentially a BGP route reflector.</p><h4>Play 12: Kubernetes Apps (Addons)</h4><p>The kubernetes-apps role installs cluster addons. CoreDNS is the critical one:</p><pre>- name: Apply CoreDNS manifests<br>  command: &quot;{{ bin_dir }}/kubectl apply -f /etc/kubernetes/coredns/&quot;<br>  environment:<br>    KUBECONFIG: /etc/kubernetes/admin.conf</pre><p>The CoreDNS configuration includes the Corefile:</p><pre>apiVersion: v1<br>kind: ConfigMap<br>metadata:<br>  name: coredns<br>  namespace: kube-system<br>data:<br>  Corefile: |<br>    .:53 {<br>        errors<br>        health {<br>          lameduck 5s<br>        }<br>        ready<br>        kubernetes {{ dns_domain }} in-addr.arpa ip6.arpa {<br>          pods insecure<br>          fallthrough in-addr.arpa ip6.arpa<br>        }<br>        prometheus :9153<br>        forward . /etc/resolv.conf {<br>          max_concurrent 1000<br>        }<br>        cache 30<br>        loop<br>        reload<br>        loadbalance<br>    }</pre><p>If metrics_server_enabled is true, Metrics Server gets deployed:</p><pre>- name: Apply Metrics Server manifests<br>  command: &quot;{{ bin_dir }}/kubectl apply -f /etc/kubernetes/metrics-server/&quot;<br>  when: metrics_server_enabled</pre><p>Similarly for other addons like Helm, Ingress controllers, or cert-manager.</p><h4>Play 13: Node Labels and Taints</h4><p>After everything is running, Kubespray applies any configured node labels and taints:</p><pre>- name: Apply node labels<br>  command: &gt;<br>    {{ bin_dir }}/kubectl label node {{ inventory_hostname }}<br>    {{ item.key }}={{ item.value }}<br>    --overwrite<br>  loop: &quot;{{ node_labels | dict2items }}&quot;<br>  when: node_labels is defined</pre><pre>- name: Apply node taints<br>  command: &gt;<br>    {{ bin_dir }}/kubectl taint node {{ inventory_hostname }}<br>    {{ item }}<br>    --overwrite<br>  loop: &quot;{{ node_taints }}&quot;<br>  when: node_taints is defined</pre><p>Control plane nodes 
get the standard taint by default:</p><pre>node-role.kubernetes.io/control-plane:NoSchedule</pre><p>This prevents regular workloads from running on control plane nodes. In a single-node cluster, Kubespray removes this taint so pods can actually schedule somewhere.</p><h4>Play 14: DNS Configuration Finalization</h4><p>The final play cleans up DNS configuration on all nodes:</p><pre>- name: Update resolv.conf<br>  template:<br>    src: resolv.conf.j2<br>    dest: /etc/resolv.conf<br>  when: <br>    - dns_mode != &#39;none&#39;<br>    - resolvconf_mode == &#39;host_resolvconf&#39;</pre><p>This ensures nodes can resolve cluster DNS names if needed.</p><h4>The Complete Timeline</h4><p>From start to finish, here’s approximately what happens during a typical single-node deployment:</p><pre>0:00  - Start playbook<br>0:01  - Ansible version check, inventory validation<br>0:02  - Bootstrap OS begins (kernel modules, sysctl)<br>0:03  - Fact gathering<br>0:04  - Container engine installation starts<br>0:06  - Runc, containerd installed and running<br>0:07  - Binary downloads begin (kubeadm, kubelet, kubectl)<br>0:10  - Image pulls begin<br>0:15  - etcd certificate generation<br>0:16  - etcd installation and startup<br>0:17  - etcd health check passes<br>0:18  - Kubernetes node components installed<br>0:19  - kubeadm init runs<br>0:20  - Control plane static pods starting<br>0:21  - API server responding<br>0:22  - CNI plugin deployment<br>0:23  - Nodes become Ready<br>0:24  - CoreDNS deployment<br>0:25  - Addon deployment<br>0:26  - Final DNS configuration<br>0:27  - Playbook complete</pre><p>A single-node cluster typically takes 25–30 minutes. A three-node HA cluster might take 35–45 minutes. 
The bulk of the time is usually spent downloading binaries and container images.</p><h4>Watching the Execution</h4><p>When you run the playbook, you’ll see output like this:</p><pre>PLAY [Check Ansible version] ***************************************************</pre><pre>TASK [Check minimal Ansible version] *******************************************<br>ok: [localhost]</pre><pre>PLAY [Add kube-master nodes to kube_control_plane] *****************************</pre><pre>TASK [Add nodes to kube_control_plane group] ***********************************<br>ok: [k8s-ctr]</pre><pre>PLAY [Gather facts] ************************************************************</pre><pre>TASK [Gathering Facts] *********************************************************<br>ok: [k8s-ctr]</pre><pre>PLAY [Prepare for etcd install] ************************************************</pre><pre>TASK [kubespray_defaults : Load defaults] **************************************<br>ok: [k8s-ctr]</pre><pre>TASK [kubernetes/preinstall : Load kernel modules] *****************************<br>changed: [k8s-ctr] =&gt; (item=overlay)<br>changed: [k8s-ctr] =&gt; (item=br_netfilter)</pre><pre>...</pre><pre>PLAY RECAP *********************************************************************<br>k8s-ctr    : ok=487  changed=112  unreachable=0  failed=0  skipped=847<br>localhost  : ok=4    changed=0    unreachable=0  failed=0  skipped=0</pre><p>That PLAY RECAP tells you what happened. 487 tasks ran successfully, 112 made changes, 847 were skipped (because they didn’t apply to this configuration), and critically, zero failed.</p><p>The skipped count is high because Kubespray supports many configurations: Calico, Cilium, Docker, multiple cloud providers. 
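</p><p>A recap in the format shown above can also be checked mechanically, for instance in a script that wraps the playbook run. The following is a minimal Python sketch, not part of Kubespray: the parse_recap helper and its regular expression are hypothetical, and the sample text is copied from the recap above.</p><pre>
```python
import re

# Hypothetical pattern for "host : key=value key=value ..." recap lines.
RECAP_LINE = re.compile(r"^(\S+)\s*:\s*(.*)$")


def parse_recap(text):
    """Return {host: {counter: value}} for every recap line in the text."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match is None or "=" not in match.group(2):
            continue  # not a per-host recap line
        host, counters = match.groups()
        results[host] = {
            key: int(value)
            for key, value in (pair.split("=", 1) for pair in counters.split())
        }
    return results


recap = """\
k8s-ctr    : ok=487  changed=112  unreachable=0  failed=0  skipped=847
localhost  : ok=4    changed=0    unreachable=0  failed=0  skipped=0
"""
hosts = parse_recap(recap)
# A host needs attention if any task failed or the host was unreachable.
broken = [name for name, c in hosts.items() if c["failed"] or c["unreachable"]]
print("hosts needing attention:", broken)
```
</pre><p>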
Tasks for features you’re not using get skipped.</p><p>If something fails, you’ll see:</p><pre>TASK [etcd : Wait for etcd cluster to be healthy] ******************************<br>FAILED - RETRYING: Wait for etcd cluster to be healthy (4 retries left).<br>FAILED - RETRYING: Wait for etcd cluster to be healthy (3 retries left).<br>FAILED - RETRYING: Wait for etcd cluster to be healthy (2 retries left).<br>FAILED - RETRYING: Wait for etcd cluster to be healthy (1 retries left).<br>fatal: [k8s-ctr]: FAILED! =&gt; {&quot;attempts&quot;: 4, &quot;changed&quot;: false, &quot;cmd&quot;: [&quot;etcdctl&quot;, &quot;endpoint&quot;, &quot;health&quot;...</pre><p>The task name tells you exactly where it failed. In this case, etcd didn’t become healthy. You’d then check etcd logs:</p><pre>journalctl -u etcd -f</pre><p>One of the best things about Kubespray is that you can fix the issue and re-run the playbook. Because the tasks are written to be idempotent, steps that already succeeded are re-checked but make no further changes (they report “ok” instead of “changed”), so the run effectively picks up where it left off.</p><p>Understanding the flow (what happens in what order and why) makes troubleshooting dramatically easier. When you know that CNI comes after kubeadm init, you won’t waste time debugging pod networking on a cluster that hasn’t finished initializing yet.</p><h3>From Cluster Operations to Service Lifecycle Management</h3><p>The Kubernetes infrastructure ran on a conventional setup. An Ansible server sat at the center, establishing passwordless SSH authentication to every node in the cluster. A typical four-node cluster looked like this:</p><pre>Ansible Server<br>      │<br>      ├── Passwordless Auth ──► k8s-node1 (Master + Node + etcd)<br>      ├── Passwordless Auth ──► k8s-node2 (Master + Node + etcd)<br>      ├── Passwordless Auth ──► k8s-node3 (Node)<br>      └── Passwordless Auth ──► k8s-node4 (Node)</pre><p>The first two nodes pulled double duty as both masters and workers while also running etcd. 
The remaining nodes served purely as workers. Underneath all of this sat the Infrastructure Layer, with the Container Layer running on top of it.</p><p>Kubespray handled provisioning and configuration management across these machines.</p><h4>The Problems with Kubespray</h4><p>Managing cluster nodes through Kubespray introduced several pain points that compounded over time. Here is a snippet of what a typical Kubespray playbook looks like:</p><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: bastion-ssh-config, tags: [&quot;localhost&quot;, &quot;bastion&quot;] }</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: bootstrap-os, tags: bootstrap-os }</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - role: etcd</pre><pre>roles:<br>  - { role: kubespray-defaults, when: &quot;not skip_downloads and download_run_once and not download_localhost&quot; }<br>  - { role: kubernetes/preinstall, tags: preinstall, when: &quot;not skip_downloads and download_run_once and not download_localhost&quot; }<br>  - { role: download, tags: download, when: &quot;not skip_downloads and download_run_once and not download_localhost&quot; }</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: kubernetes/preinstall, tags: preinstall }<br>  - { role: container-engine, tags: &quot;container-engine&quot;, when: deploy_container_engine }<br>  - { role: download, tags: download, when: &quot;not skip_downloads&quot; }<br>  - role: etcd</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: kubernetes/node, tags: node }</pre><pre>roles:<br>  - { role: kubespray-defaults }</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: kubernetes/kubeadm, tags: kubeadm }<br>  - { role: kubernetes/node-label, tags: node-label }<br>  - { role: network_plugin, tags: network }</pre><pre>roles:<br>  - { role: kubespray-defaults }<br>  - { role: kubernetes/preinstall, when: &quot;dns_mode != &#39;none&#39; and resolvconf_mode == 
&#39;host_resolvconf&#39;&quot;, tags: resolvconf, dns_late: true }</pre><p>Every role has conditions, tags, and dependencies. When something breaks, tracing through this maze takes time.</p><h4>The Node Exclusion Workflow</h4><p>Whenever a node needed to be removed from the cluster for maintenance, the process turned into a coordination dance between two teams:</p><pre>┌─────────────────────────────────────────────────────────────────────┐<br>│                         PRE-WORK (Infra Team)                       │<br>├─────────────────────────────────────────────────────────────────────┤<br>│  Node Exclusion Request                                             │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  Select Exclusion Group                                             │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  Perform Upgrade                                                    │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  Node Exclusion                                                     │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  Hand Over Node Information ─────────────────────────────────────┐  │<br>└─────────────────────────────────────────────────────────────────────┘<br>                                                                   │<br>┌─────────────────────────────────────────────────────────────────────┐<br>│                       MAIN WORK (DevOps Team)                    │◄─┘<br>├─────────────────────────────────────────────────────────────────────┤<br>│  Analysis &amp; Remediation                          
                   │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  Node Join (takes ~1 hour)                                          │<br>│         │                                                           │<br>│         ▼                                                           │<br>│  System Check ──────► Normal ──────► Done                           │<br>│         │                                                           │<br>│         ▼                                                           │<br>│      Abnormal                                                       │<br>│         │                                                           │<br>│         └──────► Loop back to Analysis                              │<br>└─────────────────────────────────────────────────────────────────────┘</pre><p>The Infra team handled the pre-work: accepting the exclusion request, selecting the right group, performing the upgrade, actually excluding the node, then handing off node information. The DevOps team then analyzed any issues, attempted to join the node back (which alone took about an hour), ran system checks, and if anything came back abnormal, looped back through analysis and remediation.</p><h4>The Kubernetes Version Upgrade Problem</h4><p>Kubernetes enforces a strict policy: upgrades can only happen one minor version at a time. 
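</p><p>As a rough illustration of that rule, the Python sketch below lists every release an upgrade has to pass through. The minor_upgrade_path helper is hypothetical and assumes both versions share the same major number.</p><pre>
```python
def minor_upgrade_path(current, target):
    """List every version to pass through, one minor release at a time."""
    cur_major, cur_minor = (int(part) for part in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles a single major version")
    minors = range(cur_minor, tgt_minor + 1)
    if len(minors) == 0:
        # An empty range means the target is older than the current version.
        raise ValueError("target must not be older than the current version")
    return [f"v{cur_major}.{minor}" for minor in minors]


path = minor_upgrade_path("v1.17", "v1.20")
print(path)
# Every hop between neighbouring entries is one separate upgrade run.
print(len(path) - 1, "upgrade runs required")
```
</pre><p>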
Running v1.17 and needing to reach v1.20 means stepping through each intermediate version:</p><pre>v1.17 ──► v1.18 ──► v1.19 ──► v1.20<br>(current)                    (target)</pre><p>Attempting to skip versions fails:</p><pre>v1.17 ────────────X────────────► v1.20<br>       (not allowed)</pre><p>Sequential upgrades are mandatory:</p><pre>v1.17 ──✓──► v1.18 ──✓──► v1.19 ──✓──► v1.20</pre><h4>Time Costs with Kubespray</h4><p>Using Kubespray, each operation had predictable but painful time costs:</p><pre>Operation            Time Required<br>Node Join            30 minutes per node group<br>OS/Kernel Upgrade    30 minutes per node group<br>Kubernetes Upgrade   2 hours per version</pre><p>Simple operations took at least 30 minutes. Complex multi-version upgrades could consume 6 to 12 hours. The question became: how can this be reduced?</p><h4>Evaluating Amazon EKS</h4><p>Managed Kubernetes offerings like Amazon EKS have clear appeal:</p><pre>┌─────────────────────────────────────────────────────────────────┐<br>│                         AWS Cloud                               │<br>├─────────────────────────────────────────────────────────────────┤<br>│  ┌─────────────────────────────────────────────────────────┐    │<br>│  │                    EKS VPC                              │    │<br>│  │  ┌──────────────────────────────────────────────────┐   │    │<br>│  │  │         EKS Managed Control Plane                │   │    │<br>│  │  │                                                  │   │    │<br>│  │  │    ┌─────┐    ┌─────┐    ┌─────┐                │   │    │<br>│  │  │    │etcd │    │ api │    │     │                │   │    │<br>│  │  │    └─────┘    └─────┘    └─────┘                │   │    │<br>│  │  │      AZ-1       AZ-2       AZ-3                 │   │    │<br>│  │  └──────────────────────────────────────────────────┘   │    │<br>│  └─────────────────────────────────────────────────────────┘    │<br>│                              │                                  │<br>│  
┌─────────────────────────────────────────────────────────┐    │<br>│  │                   Customer VPC                          │    │<br>│  │  ┌────────────────────┐  ┌────────────────────┐         │    │<br>│  │  │ Managed Node Group │  │ Managed Node Group │         │    │<br>│  │  │    EKS Add-ons     │  │    EKS Add-ons     │         │    │<br>│  │  └────────────────────┘  └────────────────────┘         │    │<br>│  │           EKS Managed Data Plane                        │    │<br>│  └─────────────────────────────────────────────────────────┘    │<br>└─────────────────────────────────────────────────────────────────┘</pre><p>AWS manages the control plane (etcd, API server) in their own VPC. Workloads run in managed node groups within the customer VPC. EKS add-ons handle integration, and multi-AZ distribution comes without managing the complexity directly.</p><h4>The Requirements</h4><p>Before choosing a solution, three requirements guided the decision:</p><ol><li><strong>Automated node joining without DevOps engineer involvement</strong> — Nodes should join clusters without someone running playbooks manually every time.</li><li><strong>Simple architecture that is easy to debug</strong> — When things go wrong, tracing problems should be quick without wading through layers of abstraction.</li><li><strong>GitOps-capable configuration</strong> — All cluster configuration should live in Git, enabling version control, code review, and automated deployment.</li></ol><h4>Building a Custom Bootstrap System</h4><p>Rather than adopting a fully managed solution, a custom bootstrap system addressed these specific requirements:</p><pre>┌────────────────────────────────────────────────────────────────────┐<br>│                         Kubernetes Cluster                         │<br>├────────────────────────────────────────────────────────────────────┤<br>│                                                                    │<br>│   ┌─────────┐  ┌─────────┐  ┌─────────┐     
                      │<br>│   │  node   │  │  node   │  │  node   │   (existing workers)      │<br>│   └─────────┘  └─────────┘  └─────────┘                           │<br>│                                                                    │<br>│   ┌────────────────────────────────────────────────────────────┐  │<br>│   │                 public namespace                           │  │<br>│   │   ┌─────┐  ┌─────────┐  ┌───────┐                          │  │<br>│   │   │ api │  │ cronjob │  │ sched │                          │  │<br>│   │   └─────┘  └─────────┘  └───────┘                          │  │<br>│   └────────────────────────────────────────────────────────────┘  │<br>│                                                                    │<br>│   ┌────────────────────────────────────────────────────────────┐  │<br>│   │            master (control plane)                          │  │<br>│   │                                                            │  │<br>│   │   create-k8s-material                                      │  │<br>│   │          │                                                 │  │<br>│   │          ▼                                                 │  │<br>│   │   kubernetes-bootstrap.datacenter1.example.com             │  │<br>│   │          │                                                 │  │<br>│   │          ▼                                                 │  │<br>│   │   kubernetes-material (ConfigMap)                          │  │<br>│   │                                                            │  │<br>│   └────────────────────────────────────────────────────────────┘  │<br>│                                                                    │<br>│   ┌────────────────────────────────────────────────────────────┐  │<br>│   │         kubernetes-bootstrap namespace                     │  │<br>│   │                                                            │  │<br>│   │   /kubernetes    /store                                    │  
│<br>│   │                                                            │  │<br>│   └────────────────────────────────────────────────────────────┘  │<br>│                                                                    │<br>│   ┌─────────────────────────────────────────────────────────────┐ │<br>│   │        New Node Group               │  ◄── pulls bootstrap  │ │<br>│   │   ┌──────┐  ┌──────┐  ┌──────┐      │      materials and    │ │<br>│   │   │ node │  │ node │  │ node │      │      joins auto       │ │<br>│   │   └──────┘  └──────┘  └──────┘      │                       │ │<br>│   └─────────────────────────────────────────────────────────────┘ │<br>│                                                                    │<br>└────────────────────────────────────────────────────────────────────┘</pre><p>A CronJob periodically generates the Kubernetes materials needed for nodes to join: certificates, tokens, and configuration files. These materials get stored in a ConfigMap within the kubernetes-bootstrap namespace. The bootstrap endpoint is accessible at a predictable URL like kubernetes-bootstrap.datacenter1.example.com.</p><p>When a new node group needs to join, it pulls the necessary bootstrap materials from this endpoint and completes the join process automatically. No DevOps engineer needs to intervene.</p><h4>Node Naming Convention</h4><p>A naming convention encodes cluster membership directly in the hostname:</p><pre>kube-{ClusterA}-{DataCenter}-worker01  =  {DataCenter}.{ClusterA}.example.com</pre><p>For example, a node named kube-serviceA-dc1-worker01 automatically maps to dc1.serviceA.example.com. The bootstrap process determines which cluster a node should join based purely on its hostname. 
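</p><p>A minimal Python sketch of that mapping follows. The regular expression, the cluster_domain helper, and the optional extra hostname segment before the worker suffix are assumptions made for illustration; the original bootstrap code is not shown in this post.</p><pre>
```python
import re

# Assumed hostname shape: kube-{cluster}-{datacenter}-[extra-]workerNN
HOSTNAME_PATTERN = re.compile(
    r"^kube-([A-Za-z0-9]+)-([A-Za-z0-9]+)-(?:[A-Za-z0-9]+-)?worker\d+$"
)


def cluster_domain(hostname, base_domain="example.com"):
    """Map a node hostname to the cluster domain it should join."""
    match = HOSTNAME_PATTERN.match(hostname)
    if match is None:
        raise ValueError(f"hostname does not follow the convention: {hostname}")
    cluster, datacenter = match.groups()
    return f"{datacenter}.{cluster}.{base_domain}"


print(cluster_domain("kube-serviceA-dc1-worker01"))  # dc1.serviceA.example.com
```
</pre><p>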
No manual configuration required.</p><h4>File System Structure</h4><p>The bootstrap materials are organized by Kubernetes version:</p><pre>- v1.21.6<br>  - etc<br>  - NetworkManager/conf.d<br>  - calico/certs<br>  - cni/net.d<br>  - containerd<br>  - kubernetes<br>  - nginx<br>  - systemd/system</pre><pre>- v1.25.6<br>  - etc<br>  - NetworkManager/conf.d<br>  - cni/net.d<br>  - containerd<br>  - kubernetes<br>  - nginx<br>  - systemd/system</pre><p>Each version directory contains the complete set of configuration files needed for that Kubernetes release. Note that v1.21.6 includes calico/certs while v1.25.6 does not — different versions may have different requirements. When a node bootstraps, it pulls the appropriate version’s files based on its configured target.</p><h4>Cluster Configuration</h4><p>A central configuration file defines cluster-wide settings and per-node overrides:</p><pre>common:<br>  cluster: 1.25.6<br>  runcVersion: 1.x.x<br>  criVersion: 1.x.x<br>  cniPluginVersion: 1.x.x<br>  cudaVersion: &quot;12.2.1_535.86.10_linux&quot;<br>  clusterDNS: &quot;8.8.8.8&quot;</pre><pre>master:<br>  k8s-master-dc2:<br>    address: x.x.x.x</pre><pre>worker:<br>  kube-serviceA-dc1-gpu-worker01:<br>    gpu: true<br>  previous-version-node:<br>    version:<br>      kubernetes: 1.21.6</pre><p>The common section sets defaults for the entire cluster: target Kubernetes version, versions for runc, CRI, CNI plugins, and CUDA drivers (for GPU nodes), plus the cluster DNS server.</p><p>The master section lists master nodes with their addresses.</p><p>The worker section allows per-node overrides. A GPU worker can be flagged with gpu: true to ensure CUDA drivers get installed. A node that needs to stay on an older Kubernetes version can specify that explicitly—see previous-version-node which overrides the cluster default of 1.25.6 with 1.21.6.</p><h4>Automated Notifications</h4><p>The system sends Slack notifications whenever a node join succeeds or fails. 
A successful join looks like:</p><pre>Node A                                                    10:10 AM<br>─────────────────────────────────────────────────────────────────────<br>[Cluster A] Successfully joined the cluster as a service node.<br>kubelet: v1.25.6</pre><p>When something goes wrong, the failure notification includes the actual error output:</p><pre>Node B                                                    10:10 AM<br>─────────────────────────────────────────────────────────────────────<br>[Cluster A] Failed to join the cluster.<br><br>[preflight] Running pre-flight checks<br>error execution phase preflight: [preflight] Some fatal errors occurred:<br>  [ERROR CRI]: container runtime is not running: output: <br>  time=&quot;2024-04-02T00:14:27+09:00&quot; level=fatal msg=&quot;unable to determine <br>  runtime API version: rpc error: code = Unavailable desc = connection <br>  error: desc = \&quot;transport: Error while dialing dial unix <br>  /var/run/containerd/containerd.sock: connect: no such file or directory\&quot;&quot;<br>, error: exit status 1<br><br>[preflight] If you know what you are doing, you can make a check <br>non-fatal with --ignore-preflight-errors=...<br><br>To see the stack trace of this error execute with --v=5 or higher</pre><p>Engineers can diagnose the problem directly from the notification without SSH-ing into the machine. In this example, containerd is not running: the socket file does not exist.</p><h4>Results from the New Bootstrap System</h4><pre>Operation            Before              After               Improvement<br>Node Join            30 min/node group   0 min/node group    100%<br>OS/Kernel Upgrade    30 min/node group   2 min/node group    94%<br>Kubernetes Upgrade   2 hr/version        30 min/version      75%</pre><p>Node joining went from 30 minutes to zero because it happens automatically without human intervention. OS and kernel upgrades dropped from 30 minutes to about 2 minutes, because nodes upgrade, reboot, and rejoin on their own. 
Kubernetes version upgrades went from 2 hours to 30 minutes per version.</p><p>Every time a new service came online, a predictable set of tasks followed:</p><ul><li>Development and production clusters each needed their own deployment pipelines</li><li>Different communication interfaces (HTTP, gRPC, Kafka) required different deployment configurations</li><li>Services written in different languages needed different build settings</li></ul><p>The combinatorial explosion meant a lot of pipeline creation work for each new service.</p><p>A pipeline-as-code approach puts deployment configuration in each service repository. A service at git@git.example.com:example/ includes a pipeline definition file:</p><pre>name: serviceA<br>repository:<br>  name: serviceA<br>  branch: develop<br>pipeline:<br>  type: service<br>  platform: linux/arm64<br>build:<br>  environments:<br>    JAVA_HOME: /opt/jdk-21<br>  batch:<br>    enabled: false<br>  docker:<br>    dockerfile: Dockerfile<br>resources:<br>  alpha:<br>    instance: 1<br>    cpu: &#39;0.05&#39;<br>    memory: 1024MB</pre><p>This configuration captures everything needed to build and deploy the service:</p><ul><li>repository: which repo and branch to track</li><li>pipeline: service type and target platform architecture</li><li>build.environments: build-time environment variables like JAVA_HOME</li><li>build.batch.enabled: whether batch processing is enabled</li><li>build.docker.dockerfile: which Dockerfile to use</li><li>resources: resource allocations for different environments (alpha shown here)</li></ul><p>The UI reads these files directly and generates appropriate pipelines without manual configuration. Developers define their service requirements; the system handles the rest.</p><p>The deployment story had grown complicated. Service developers would request tag updates for batch jobs. DevOps engineers would manually update tags. 
A service might have multiple versions deployed simultaneously:</p><pre>Service A<br>    │<br>    ├── Version 1<br>    ├── Version 2<br>    └── Version 3</pre><p>Each version required its own tag management, and coordination between developers and DevOps created delays.</p><p>Two main deployment strategies were supported:</p><p><strong>Blue-Green Deployment</strong>: Zero-downtime releases with instant rollback capability. The new version (green) runs alongside the old version (blue). Once validated, traffic switches entirely to green. If something goes wrong, switching back to blue is immediate.</p><p><strong>Canary Deployment</strong>: Gradual rollouts that test new versions with a subset of traffic before full deployment. Maybe 5% of traffic goes to the new version initially. If metrics look good, the percentage increases until the new version handles all traffic.</p><p>One problem that emerged was lag between canary and blue-green environments. A canary deployment might run successfully, but the corresponding blue-green deployment would lag behind by days or even weeks. 
This created inconsistency in what code was actually serving production traffic.</p><p>An automated system detects and alerts on these situations:</p><pre>vivasystem                                                10:10 AM<br>─────────────────────────────────────────────────────────────────────<br>Lag detected between canary and blue-green deployments.</pre><pre>Service A - 12 day lag  @kim-A<br>Deployment owners: @kim-B, @kim-C, @kim-D<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service B - 12 day lag  @kim-B<br>Deployment owners: @kim-E, @kim-F, @kim-G<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service C - 12 day lag  @kim-C<br>Deployment owners: @kim-H, @kim-I<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service D - 12 day lag  @kim-D<br>Deployment owners: @kim-J, @kim-K, @kim-L<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service E - 12 day lag  @kim-E<br>Deployment owners: @kim-M, @kim-N, @kim-O<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service F - 12 day lag  @kim-F<br>Deployment owners: @kim-P<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service G - 12 day lag  @kim-G<br>Deployment owners: @kim-Q, @kim-R, @kim-S<br>[Blue-green page]  [Trigger blue-green]</pre><pre>Service H - 12 day lag  @kim-H<br>Deployment owners: @kim-T, @kim-U<br>[Blue-green page]  [Trigger blue-green]</pre><p>Each entry shows the service name, how many days it has lagged, who originally deployed the canary, and the deployment owners responsible for the service. Direct links let teams jump to the blue-green page or trigger the deployment immediately.</p><p>Over time, services accumulate. Some get deprecated, others get replaced, and some just stop receiving traffic. 
Without active cleanup, these zombie services continue consuming cluster resources.</p><p>A systematic approach identifies services marked for deletion versus active services:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EDwqLuXikMCMbroLgMd8BQ.png" /></figure><p>Services communicate in different ways, so multiple detection methods are needed:</p><pre>┌────────────────────────────────────────────────────────────────────┐<br>│                                                                    │<br>│   ┌──────────────────┐                                            │<br>│   │  Consumer Group  │──┐                                         │<br>│   └──────────────────┘  │                                         │<br>│                         │                                         │<br>│   ┌──────────────────┐  │     ┌─────────┐     ┌─────────────┐    │<br>│   │  Consumer Group  │──┼────►│  batch  │────►│ API Server  │    │<br>│   │  (not deleted)   │  │     └─────────┘     └─────────────┘    │<br>│   └──────────────────┘  │                            │            │<br>│                         │                            ▼            │<br>│   ┌──────────────────┐  │                     ┌───────────┐       │<br>│   │  Consumer Group  │──┘                     │   Kafka   │       │<br>│   └──────────────────┘                        └───────────┘       │<br>│                                                                    │<br>└────────────────────────────────────────────────────────────────────┘</pre><p>Consumer groups pull messages from Kafka, trigger batch processing, and make API calls. 
Any of these communication paths could indicate an active service.</p><p>For HTTP-based services, the Istio service mesh handles service-to-service communication:</p><pre>┌────────────────────────────────────────────────────────────────────┐<br>│                        Istio Mesh                                  │<br>├────────────────────────────────────────────────────────────────────┤<br>│                                                                    │<br>│  Ingress        ┌─────────────────────────────────┐    Egress     │<br>│  traffic ──────►│         Data Plane              │───► traffic   │<br>│                 │                                 │                │<br>│                 │  ┌───────────┐  ┌───────────┐  │                │<br>│                 │  │ Service A │  │ Service B │  │                │<br>│                 │  │  ┌─────┐  │  │  ┌─────┐  │  │                │<br>│                 │  │  │Proxy│  │  │  │Proxy│  │  │                │<br>│                 │  │  └─────┘  │  │  └─────┘  │  │                │<br>│                 │  └───────────┘  └───────────┘  │                │<br>│                 │         │              │       │                │<br>│                 │         └──────┬───────┘       │                │<br>│                 │                │               │                │<br>│                 │         Mesh traffic           │                │<br>│                 └─────────────────────────────────┘                │<br>│                                  │                                 │<br>│  ┌───────────────────────────────────────────────────────────────┐│<br>│  │                      Control Plane                            ││<br>│  │  ┌─────────────────────────────────────────────────────────┐  ││<br>│  │  │                       istiod                            │  ││<br>│  │  │                                                         │  ││<br>│  │  │   Discovery    Configuration    Certificates           │  
││<br>│  │  └─────────────────────────────────────────────────────────┘  ││<br>│  └───────────────────────────────────────────────────────────────┘│<br>└────────────────────────────────────────────────────────────────────┘</pre><p>Each service has an Envoy sidecar proxy that handles mesh traffic. The control plane (istiod) manages discovery, configuration, and certificates. Istio collects metrics on all traffic flowing through the mesh.</p><p>The check aggregates the istio_requests_total metric over a 7-day window:</p><pre>istio_requests_total aggregated for 7 days<br>    │<br>    ├── Metric present ──► Service is receiving traffic<br>    │<br>    └── No metric ──► Service has no HTTP communication history</pre><p>Any service showing zero requests during that period gets flagged as potentially unused.</p><p>Services communicating via Kafka rather than HTTP need different detection. The Kafka consumer metrics exposed through Spring Boot Actuator work here:</p><pre>kafka_consumer_fetch_manager_records_consumed_rate</pre><p>This metric shows the rate at which a consumer processes messages. A sustained rate of zero indicates an idle consumer that is not actually doing any work.</p><p>Batch jobs run on schedules rather than responding to traffic, so traffic-based metrics do not help. Instead, the check reads the kube_cronjob_spec_suspend metric:</p><pre>kube_cronjob_spec_suspend<br>    │<br>    ├── Value = 1 ──► CronJob is suspended (not running)<br>    │<br>    └── Value = 0 ──► CronJob is active</pre><p>Services whose batch jobs have been suspended for extended periods are candidates for cleanup.</p><p>The system generates weekly notifications identifying potentially unused services. For Kafka-based services:</p><pre>vivasystem                                                10:10 AM<br>─────────────────────────────────────────────────────────────────────<br>Unused Service Verification Request<br><br>Based on KAFKA requests, the following services have had no traffic <br>for one week. 
Please verify whether these services are still in use.</pre><pre>[@kim-A, @kim-B, @kim-C, @kim-D] Service A<br>[@kim-E, @kim-F, @kim-G, @kim-H] Service B<br>[@kim-I, @kim-J, @kim-K, @kim-L] Service C<br>[@kim-M, @kim-N, @kim-O] Service D<br>[@kim-P, @kim-Q] Service E<br>[@kim-R, @kim-S, @kim-T] Service F<br>[@kim-U, @kim-V, @kim-W, @kim-X] Service G<br>[@kim-Y, @kim-Z] Service H</pre><p>For HTTP-based services:</p><pre>vivasystem                                                10:10 AM<br>─────────────────────────────────────────────────────────────────────<br>Unused HTTP Service Verification Request<br><br>Based on HTTP requests, the following services have had no traffic <br>for one week. Please verify whether these services are still in use.</pre><pre>[@kim-A, @kim-B, @kim-C, @kim-D] Service I<br>[@kim-E, @kim-F, @kim-G, @kim-H] Service J<br>[@kim-I, @kim-J, @kim-K, @kim-L] Service K<br>[@kim-M, @kim-N, @kim-O] Service L<br>[@kim-P, @kim-Q] Service M<br>[@kim-R, @kim-S, @kim-T] Service N<br>[ ] Service O<br>[ ] Service P</pre><p>Each notification tags the service owners so they can confirm whether the service should be decommissioned. Services without owner information (empty brackets) require additional investigation.</p><p>Eighty services removed. Over a thousand CPU cores and a terabyte of memory returned to the cluster. Fewer services also means fewer potential alert sources, simpler dependency graphs, and easier capacity planning.</p><h3>Summary</h3><p>The improvements span two levels.</p><p>At the cluster level, replacing Kubespray with a custom bootstrap system eliminated hours of manual node management work. Node join time dropped by 100%, OS upgrades by 94%, and Kubernetes upgrades by 75%.</p><p>At the service level, pipeline-as-code reduced the friction of onboarding new services. Automated deployment lag detection kept environments consistent. 
Systematic unused service detection recovered resources that would otherwise sit idle — 80 services removed, 1,280 cores and 1,280 GB of memory reclaimed.</p><p>The bootstrap system runs as a CronJob. The notification system queries existing metrics stores. The pipeline definitions are YAML files in Git repositories. Identifying specific pain points, measuring them, and building targeted solutions made the difference.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Kubernetes Cluster with kubeadm: From Theory to Practice]]></title>
            <link>https://sigridjin.medium.com/building-a-kubernetes-cluster-with-kubeadm-from-theory-to-practice-faebea2eebd5?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/faebea2eebd5</guid>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 24 Jan 2026 19:58:57 GMT</pubDate>
            <atom:updated>2026-01-24T19:58:57.196Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KpRGGO8eWW4EOTIyZ9lQ3Q.jpeg" /></figure><h3>1. Introduction</h3><p>If you’ve ever tried setting up a Kubernetes cluster manually, following something like Kelsey Hightower’s “Kubernetes The Hard Way”, you know the pain. Generating certificates with OpenSSL, writing systemd unit files, configuring etcd, setting up each control plane component one by one… It’s educational, sure, but it’s also tedious and error-prone. Nobody wants to do that every time they need a new cluster.</p><p>That’s where kubeadm comes in.</p><p>kubeadm is the official cluster bootstrapping tool provided by the Kubernetes project. It’s maintained by the Cluster Lifecycle SIG (Special Interest Group), which means it’s not some third-party tool — it’s part of the Kubernetes ecosystem itself.</p><p>The official documentation describes it like this:</p><blockquote><em>“Kubeadm is a tool built to provide kubeadm init and kubeadm join as best-practice ‘fast paths’ for creating Kubernetes clusters.”</em></blockquote><p>In plain English: kubeadm lets you spin up a working Kubernetes cluster with just two commands. kubeadm init on your control plane node, kubeadm join on your worker nodes, and you&#39;re done. Well, mostly done—but we&#39;ll get to that.</p><p>Here’s the thing about kubeadm that trips people up at first: it’s intentionally limited in scope.</p><p>The official docs put it this way:</p><blockquote><em>“kubeadm performs the actions necessary to get a minimum viable cluster up and running.”</em></blockquote><p>Notice the phrase “minimum viable cluster.” kubeadm isn’t trying to be an all-in-one solution. It’s not Ansible. It’s not Terraform. It doesn’t provision your VMs, install Docker, or set up your network plugin. It does exactly one thing: bootstrap a Kubernetes cluster. Nothing more, nothing less.</p><p>This design choice is deliberate. 
By keeping kubeadm focused on bootstrapping, the Kubernetes team made it flexible enough to work anywhere — on-premises data centers, cloud VMs, bare metal servers, your laptop, whatever. The idea is that higher-level tools (like kubespray, Cluster API, or even minikube) can build on top of kubeadm to provide more complete solutions.</p><p>So what exactly happens when you run kubeadm init? Quite a lot, actually:</p><p><strong>Certificate Generation</strong></p><p>kubeadm creates all the TLS certificates your cluster needs. This includes:</p><ul><li>The cluster CA (Certificate Authority)</li><li>API server certificates</li><li>etcd certificates (server, peer, and client certs)</li><li>Front proxy certificates for API aggregation</li><li>Service account signing keys</li></ul><p>If you’ve done this manually before, you know this alone saves hours of work.</p><p><strong>kubeconfig Files</strong></p><p>It generates kubeconfig files for various components:</p><ul><li>admin.conf for cluster administrators</li><li>kubelet.conf for the kubelet</li><li>controller-manager.conf for the controller manager</li><li>scheduler.conf for the scheduler</li></ul><p>Each of these contains the credentials and cluster information needed for that component to talk to the API server.</p><p><strong>Static Pod Manifests</strong></p><p>Here’s something interesting: kubeadm doesn’t run the control plane components as regular system services. Instead, it creates Static Pod manifests in /etc/kubernetes/manifests/. The kubelet watches this directory and automatically runs whatever Pod definitions it finds there.</p><p>This means etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all run as Pods managed by the kubelet — even though the kubelet itself runs as a systemd service. 
It’s a clever approach that makes the control plane self-healing to some degree.</p><p><strong>Bootstrap Token Setup</strong></p><p>kubeadm sets up the infrastructure for worker nodes to join the cluster securely. This includes creating bootstrap tokens and configuring the RBAC rules needed for the TLS bootstrap process. We’ll dive deeper into this when we cover kubeadm join.</p><p><strong>Core Addons</strong></p><p>Finally, kubeadm installs two essential addons:</p><ul><li><strong>CoreDNS</strong>: The cluster DNS service</li><li><strong>kube-proxy</strong>: The network proxy that runs on every node</li></ul><p>These are deployed as regular Kubernetes workloads (Deployment and DaemonSet respectively), not as Static Pods.</p><p>This is equally important to understand. kubeadm explicitly does NOT handle:</p><p><strong>Machine Provisioning</strong></p><p>kubeadm won’t create VMs or provision bare metal servers for you. You need to have your machines ready before running kubeadm. Whether you’re using Vagrant, Terraform, cloud console, or physically racking servers — that’s on you.</p><p><strong>Container Runtime Installation</strong></p><p>Before kubeadm can do anything, you need a CRI-compatible container runtime already installed and running. containerd, CRI-O, whatever you prefer — kubeadm expects it to be there. It won’t install it for you.</p><p><strong>kubelet Installation</strong></p><p>Same story with kubelet. You need to install the kubelet binary and set up its systemd service yourself. kubeadm will configure the kubelet, but it won’t install it.</p><p><strong>CNI Plugin Installation</strong></p><p>This one catches a lot of people off guard. After kubeadm init completes, your node will show as &quot;NotReady&quot; and CoreDNS pods will be stuck in &quot;Pending&quot; state. Why? Because kubeadm doesn&#39;t install a CNI (Container Network Interface) plugin.</p><p>You need to install Flannel, Calico, Cilium, or another CNI plugin yourself. 
Without it, pods can’t get IP addresses and the cluster network doesn’t work.</p><p><strong>Additional Addons</strong></p><p>Things like the Kubernetes Dashboard, monitoring solutions (Prometheus, Grafana), ingress controllers, storage provisioners — none of these are kubeadm’s concern. Install them yourself after the cluster is up.</p><p>The documentation explicitly states:</p><blockquote><em>“Instead, we expect higher-level and more tailored tooling to be built on top of kubeadm.”</em></blockquote><p>And that’s exactly what happened. Tools like minikube, kind, kubespray, and Cluster API all use kubeadm under the hood while adding their own automation layers on top.</p><p>If you’ve gone through Kubernetes The Hard Way, it’s helpful to understand how kubeadm maps to that manual process:</p><p>Manual Step (Hard Way) → kubeadm Equivalent:</p><ul><li>Generate CA and TLS certificates with OpenSSL → kubeadm init phase certs</li><li>Create kubeconfig files manually → kubeadm init phase kubeconfig</li><li>Set up etcd as a systemd service → kubeadm init phase etcd (Static Pod)</li><li>Configure and run control plane as systemd services → kubeadm init phase control-plane (Static Pods)</li><li>Configure kubelet → kubeadm init phase kubelet-start</li><li>Distribute certificates to worker nodes manually → TLS Bootstrap with bootstrap tokens</li></ul><p>There are some notable differences beyond just automation:</p><p><strong>Static Pods vs systemd Services</strong></p><p>The Hard Way runs control plane components directly as systemd services. kubeadm uses Static Pods instead. This means the kubelet manages these components, which has implications for how you troubleshoot and manage them.</p><p><strong>etcd Communication</strong></p><p>The Hard Way (for educational simplicity) often configures etcd with plain HTTP. kubeadm always uses HTTPS with mutual TLS. 
Every etcd connection requires proper certificates — even health checks.</p><p><strong>Worker Node Authentication</strong></p><p>The Hard Way has you manually copying certificates to worker nodes. kubeadm uses a much more elegant approach called TLS Bootstrap, where worker nodes use a temporary token to authenticate and then request their own certificates through a CSR (Certificate Signing Request) process.</p><p><strong>Control Plane kubelet</strong></p><p>Here’s a subtle but important difference: in The Hard Way, the control plane node doesn’t necessarily need kubelet since you’re running everything as systemd services. With kubeadm, kubelet is required on the control plane because it’s responsible for running the control plane components as Static Pods.</p><p>kubeadm’s command structure is pretty straightforward. The commands fall into three categories:</p><p><strong>Lifecycle Commands</strong></p><p>These manage the cluster lifecycle:</p><ul><li>kubeadm init — Initialize a control plane node</li><li>kubeadm join — Join a node to the cluster (worker or additional control plane)</li><li>kubeadm upgrade — Upgrade the cluster to a new Kubernetes version</li><li>kubeadm reset — Tear down what init or join created</li></ul><p><strong>Security Commands</strong></p><p>These handle certificates and authentication:</p><ul><li>kubeadm certs — Manage cluster certificates (renew, check expiration, etc.)</li><li>kubeadm kubeconfig — Manage kubeconfig files</li><li>kubeadm token — Manage bootstrap tokens</li></ul><p><strong>Utility Commands</strong></p><p>These are helpers:</p><ul><li>kubeadm config — View or migrate cluster configuration</li><li>kubeadm version — Print kubeadm version</li></ul><p>The relationship between these commands looks something like this:</p><pre>init ──→ join ──→ upgrade<br>                      │<br>                      ↓<br>                    reset</pre><p>You start with init, expand with join, maintain with upgrade, and clean up with reset if 
needed.</p><p>Understanding kubeadm’s scope and design philosophy saves you from frustration later. When your node shows “NotReady” after init, you won’t panic — you’ll know it’s because you haven’t installed a CNI plugin yet. When you need to set up monitoring, you won’t waste time looking for a kubeadm flag that doesn’t exist.</p><p>kubeadm is a foundation, not a complete house. It gives you a solid base to build on, but the finishing work is up to you, or the higher-level tools you choose to use.</p><h3>2. Environment &amp; Prerequisites</h3><p>Before we can run kubeadm init, there’s quite a bit of groundwork to lay. Kubernetes has specific requirements about the underlying system, and skipping any of these steps will cause problems later. Everything in this section needs to be done on all nodes, both the control plane and workers.</p><p>For this setup, I’m using three Rocky Linux 10 virtual machines provisioned with Vagrant. Rocky Linux is a RHEL derivative, so if you’re on CentOS, AlmaLinux, or RHEL itself, the commands will be nearly identical.</p><p>Here’s what the node setup looks like:</p><pre>k8s-ctr    192.168.10.100    Control Plane<br>k8s-w1     192.168.10.101    Worker Node<br>k8s-w2     192.168.10.102    Worker Node</pre><p>The target versions are containerd 2.1.5 and Kubernetes 1.32.11. I chose containerd 2.x specifically because it supports Kubernetes versions from 1.32 through 1.35, which gives us room for future upgrades without changing the container runtime.</p><p>First, let’s verify that our system meets the minimum requirements. kubeadm needs at least 2 CPUs and 2GB of RAM per node.</p><pre>lscpu<br># CPU(s): 4<br># Architecture: aarch64<br><br>free -h<br># Mem: 2.8Gi</pre><p>Four CPUs and 2.8GB of memory. That’s more than enough. Note that the architecture here is aarch64 because I’m running on Apple Silicon. 
If you’re on x86_64, the commands are the same but container images will be pulled for a different architecture automatically.</p><p>For disk space, check that you have enough room for container images and logs:</p><pre>df -hT /<br># /dev/sda3   xfs   60G   2.5G   58G   5%   /</pre><p>58GB free on the root partition is plenty.</p><p>This is where things get a bit tricky, especially in Vagrant environments. Most VMs have multiple network interfaces, and you need to know which one to use for cluster communication.</p><pre>ip -br -c -4 addr<br># lo        UNKNOWN   127.0.0.1/8<br># enp0s8    UP        10.0.2.15/24<br># enp0s9    UP        192.168.10.100/24</pre><p>In my setup, enp0s8 is the NAT interface that Vagrant uses for internet access, and enp0s9 is the host-only network for communication between VMs. The cluster should use enp0s9 (192.168.10.x addresses), not the NAT interface. This becomes important later when we configure kubeadm init and join.</p><p>Kubernetes uses Linux cgroups to manage container resources like CPU and memory limits. There are two versions of cgroups, and Kubernetes 1.25+ prefers cgroup v2.</p><pre>stat -fc %T /sys/fs/cgroup<br># cgroup2fs</pre><p>If you see cgroup2fs, you&#39;re on cgroup v2. If it says tmpfs, you&#39;re on v1. Rocky Linux 10 uses v2 by default, so we&#39;re good.</p><p>The next question is which cgroup driver to use. There are two options: cgroupfs and systemd. Since Rocky Linux uses systemd as its init system, we should use the systemd cgroup driver. This avoids conflicts where both systemd and kubelet try to manage cgroups independently. We’ll configure this later when setting up containerd.</p><p>All nodes in a Kubernetes cluster must have synchronized time. 
Certificate validation, log timestamps, and various distributed system operations depend on it.</p><pre>timedatectl status<br># Time zone: UTC<br># NTP service: active<br># RTC in local TZ: yes</pre><p>If you see a warning about RTC being in local timezone, fix it by setting RTC to UTC:</p><pre>timedatectl set-local-rtc 0</pre><p>Set your timezone (I’m using Seoul):</p><pre>timedatectl set-timezone Asia/Seoul</pre><p>Make sure NTP is enabled:</p><pre>timedatectl set-ntp true</pre><p>You can verify that time synchronization is working by checking chrony:</p><pre>chronyc sources -v<br># ^* 175.210.18.47   2   9   377   152   +490us[+478us] +/-   11ms</pre><p>The ^* marker indicates the currently selected NTP source. A Stratum 2 server with sub-millisecond offset is perfectly fine for our purposes.</p><p>SELinux provides mandatory access control on Linux systems. While it’s great for security, it can interfere with Kubernetes operations because containers need to access host filesystems for things like pod networking and volumes.</p><p>The recommended approach is to set SELinux to permissive mode. This logs policy violations without blocking them, so you can still see what would have been denied.</p><pre>getenforce<br># Enforcing</pre><pre>setenforce 0<br>getenforce<br># Permissive</pre><p>That changes the runtime setting, but it won’t survive a reboot. To make it permanent:</p><pre>sed -i &#39;s/^SELINUX=enforcing/SELINUX=permissive/&#39; /etc/selinux/config</pre><p>For a lab environment, the simplest approach is to disable the firewall entirely:</p><pre>systemctl disable --now firewalld</pre><p>If you’re setting this up in production and need the firewall enabled, you’ll need to open these ports:</p><p>On the control plane: 6443 (API server), 2379–2380 (etcd), 10250 (kubelet), 10259 (scheduler), 10257 (controller manager).</p><p>On worker nodes: 10250 (kubelet), 30000–32767 (NodePort services).</p><p>This is a hard requirement. 
With its default settings, the kubelet refuses to start if swap is enabled.</p><p>The reasoning is straightforward. The scheduler makes decisions based on how much memory each node has available. If swap is enabled, a node might appear to have free memory when it’s actually swapping, which leads to poor scheduling decisions. There’s also a performance argument: when a container starts swapping, it becomes extremely slow, and you’d rather have the OOM killer terminate it so Kubernetes can reschedule it elsewhere.</p><p>Check current swap status:</p><pre>free -h | grep -i swap<br># Swap:   3.8Gi   0B   3.8Gi</pre><p>Disable it:</p><pre>swapoff -a</pre><p>To make this permanent across reboots, remove the swap entry from /etc/fstab:</p><pre>sed -i &#39;/swap/d&#39; /etc/fstab</pre><p>Verify it’s gone:</p><pre>free -h | grep -i swap<br># Swap:   0B   0B   0B</pre><p>Kubernetes networking requires two kernel modules: overlay for container filesystem layering, and br_netfilter for bridge network traffic to pass through iptables.</p><p>Load them:</p><pre>modprobe overlay<br>modprobe br_netfilter</pre><p>Verify they’re loaded:</p><pre>lsmod | grep -E &#39;overlay|br_netfilter&#39;<br># br_netfilter   32768   0<br># bridge         327680  1 br_netfilter<br># overlay        200704  0</pre><p>To ensure they load automatically on boot, create a configuration file:</p><pre>cat &lt;&lt;EOF | tee /etc/modules-load.d/k8s.conf<br>overlay<br>br_netfilter<br>EOF</pre><p>With the modules loaded, we need to configure some kernel parameters for Kubernetes networking to work properly.</p><pre>cat &lt;&lt;EOF | tee /etc/sysctl.d/k8s.conf<br>net.bridge.bridge-nf-call-iptables  = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward                 = 1<br>EOF</pre><p>The bridge-nf-call settings make sure that traffic crossing a bridge (which is how container networking works) goes through iptables rules. This is essential for Service networking and network policies to function. 
The ip_forward setting allows the node to act as a router, forwarding packets between pods on different nodes.</p><p>Apply the settings:</p><pre>sysctl --system</pre><p>You should see your k8s.conf file being applied in the output. Verify the values took effect:</p><pre>sysctl net.bridge.bridge-nf-call-iptables<br># net.bridge.bridge-nf-call-iptables = 1<br><br>sysctl net.ipv4.ip_forward<br># net.ipv4.ip_forward = 1</pre><p>Each node needs to be able to resolve the hostnames of other nodes. The simplest way is to add entries to /etc/hosts.</p><p>First, there’s a quirk with Vagrant that we need to handle. Vagrant often adds an entry like 127.0.1.1 hostname to /etc/hosts, which causes the hostname to resolve to localhost instead of the actual IP. This will break kubelet registration.</p><pre>cat /etc/hosts<br># 127.0.0.1   localhost localhost.localdomain<br># 127.0.1.1   k8s-ctr k8s-ctr    &lt;- problematic line</pre><p>Remove it:</p><pre>sed -i &#39;/^127\.0\.\(1\|2\)\.1/d&#39; /etc/hosts</pre><p>Now add entries for all cluster nodes:</p><pre>cat &lt;&lt;EOF &gt;&gt; /etc/hosts<br>192.168.10.100 k8s-ctr<br>192.168.10.101 k8s-w1<br>192.168.10.102 k8s-w2<br>EOF</pre><p>Test connectivity:</p><pre>ping -c 1 k8s-ctr<br>ping -c 1 k8s-w1<br>ping -c 1 k8s-w2</pre><p>All three should respond from their respective 192.168.10.x addresses.</p><p>At this point, the system is ready for installing the container runtime and Kubernetes components. All the kernel-level prerequisites are in place: cgroups are configured, necessary modules are loaded, networking parameters are set, and the nodes can communicate with each other. In the next section, we’ll install containerd and the kubeadm toolchain.</p><h3>3. Component Installation</h3><p>With the system prerequisites configured, we can now install the core components needed to run Kubernetes. 
This section covers two main parts: the container runtime (containerd) and the Kubernetes tools (kubeadm, kubelet, kubectl).</p><p>All commands in this section should be run on every node in your cluster, both control plane and workers.</p><p>Kubernetes needs a container runtime that implements the Container Runtime Interface (CRI). We’re using containerd, which is a CNCF graduated project and the industry standard container runtime. It focuses on simplicity, robustness, and portability.</p><p>Before we dive into installation, let’s talk about version compatibility. This is something that can bite you later if you don’t pay attention. Since we’re installing Kubernetes 1.32 and planning to upgrade to 1.33 and 1.34 later, we need a containerd version that supports all of them.</p><p>Looking at the compatibility matrix, containerd 2.1.5 works with Kubernetes 1.32 through 1.35. That gives us plenty of room for future upgrades.</p><p>One thing to watch out for: containerd 2.x uses a different configuration file format (version 3) compared to containerd 1.x (version 2). If you’re coming from an older setup or following outdated tutorials, this can cause confusion. We’re starting fresh with 2.x, so we don’t have to worry about migration.</p><h4>Adding the Docker Repository</h4><p>containerd packages are distributed through Docker’s repository. Let’s add it first.</p><pre>dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo</pre><p>Verify the repository was added:</p><pre>dnf repolist</pre><p>You should see docker-ce-stable in the list. Now refresh the metadata cache:</p><pre>dnf makecache</pre><p>Let’s check what versions are available:</p><pre>dnf list --showduplicates containerd.io</pre><p>You’ll see both 1.7.x and 2.x versions listed. 
We want 2.1.5.</p><h4>Installing containerd</h4><pre>dnf install -y containerd.io-2.1.5-1.el10</pre><p>After installation, verify the components:</p><pre>containerd --version<br>runc --version<br>ctr --version</pre><p>You should see containerd v2.1.5 and runc v1.3.3 (runc is the low-level container runtime that containerd uses internally).</p><p>Now here’s something important. The default configuration file that ships with the package has a problem for our use case:</p><pre>cat /etc/containerd/config.toml | grep disabled_plugins</pre><p>You’ll see disabled_plugins = [&quot;cri&quot;]. This means the CRI plugin is disabled by default, which makes containerd useless for Kubernetes. We need to generate a proper configuration.</p><h4>Configuring containerd</h4><p>Generate the default configuration:</p><pre>containerd config default | tee /etc/containerd/config.toml</pre><p>This creates a complete configuration file with all the defaults. Now check the version:</p><pre>head /etc/containerd/config.toml</pre><p>You should see version = 3 at the top, confirming we&#39;re using the containerd 2.x configuration format.</p><p>The critical setting we need to change is the cgroup driver. By default, SystemdCgroup is set to false, but we need it to be true. Why? Because our system uses systemd as the init system, and having both systemd and containerd trying to manage cgroups independently leads to conflicts. When SystemdCgroup is true, containerd delegates cgroup management to systemd, keeping everything consistent.</p><pre>cat /etc/containerd/config.toml | grep -i systemdcgroup</pre><p>You’ll see SystemdCgroup = false. 
Let&#39;s fix that:</p><pre>sed -i &#39;s/SystemdCgroup = false/SystemdCgroup = true/g&#39; /etc/containerd/config.toml</pre><p>Verify the change:</p><pre>cat /etc/containerd/config.toml | grep -i systemdcgroup</pre><p>Now it should show SystemdCgroup = true.</p><h4>Starting containerd</h4><pre>systemctl daemon-reload<br>systemctl enable --now containerd<br>systemctl status containerd --no-pager</pre><p>The service should be active and running. Let’s check the logs to make sure our SystemdCgroup setting is being applied:</p><pre>journalctl -u containerd.service --no-pager | grep -i systemdcgroup</pre><p>You should see something like &quot;SystemdCgroup&quot;:true in the output.</p><p>You might also notice an error about CNI in the logs:</p><pre>failed to load cni during init... error=&quot;cni config load failed: no network config found in /etc/cni/net.d&quot;</pre><p>Don’t worry about this. It’s expected because we haven’t installed a CNI plugin yet. That comes after kubeadm init.</p><h4>Verifying the Installation</h4><p>Let’s make sure everything is working. 
Check the socket:</p><pre>ls -l /run/containerd/containerd.sock</pre><p>This Unix socket is how kubelet, ctr, and crictl communicate with containerd.</p><p>Check if the socket is listening:</p><pre>ss -xl | grep containerd</pre><p>You should see two entries: one for the main socket and one for the ttrpc socket.</p><p>Now let’s verify the plugins are loaded correctly:</p><pre>ctr plugins ls</pre><p>Look for these key plugins with “ok” status:</p><ul><li>io.containerd.cri.v1 (images and runtime) - this is the CRI plugin that kubelet uses</li><li>io.containerd.snapshotter.v1 overlayfs - manages container filesystem layers</li><li>io.containerd.runtime.v2 task - handles actual container execution via runc</li></ul><p>Finally, let’s check the cgroup hierarchy to confirm containerd is running under systemd’s management:</p><pre>systemd-cgls --no-pager | grep -A2 containerd.service</pre><p>You should see containerd running under system.slice/containerd.service, which confirms systemd is managing its cgroup.</p><p>Now we install the three essential Kubernetes binaries:</p><ul><li><strong>kubeadm</strong>: the cluster bootstrapping tool we’ll use to initialize the control plane and join worker nodes</li><li><strong>kubelet</strong>: the node agent that runs on every node and manages pods</li><li><strong>kubectl</strong>: the command-line tool for interacting with the cluster</li></ul><h4>Adding the Kubernetes Repository</h4><p>Kubernetes packages are distributed through their own repository. We need to add it and configure version pinning to prevent accidental upgrades.</p><pre>cat &lt;&lt;EOF | tee /etc/yum.repos.d/kubernetes.repo<br>[kubernetes]<br>name=Kubernetes<br>baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/<br>enabled=1<br>gpgcheck=1<br>gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key<br>exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni<br>EOF</pre><p>Notice the exclude line at the end. 
This prevents dnf from automatically upgrading these packages when you run dnf update. Kubernetes version upgrades should be deliberate and controlled, not accidental.</p><p>Refresh the cache:</p><pre>dnf makecache</pre><h4>Checking Available Versions</h4><p>Because of the exclude setting, a normal dnf list won&#39;t show the packages:</p><pre>dnf list kubelet</pre><p>This returns nothing. To see what’s available, temporarily bypass the exclusion:</p><pre>dnf list --showduplicates kubelet --disableexcludes=kubernetes</pre><p>You’ll see all the 1.32.x versions available. The same works for kubeadm and kubectl.</p><h4>Installing the Packages</h4><pre>dnf install -y kubelet kubeadm kubectl --disableexcludes=kubernetes</pre><p>This also installs cri-tools (which provides crictl) and kubernetes-cni (the basic CNI binaries) as dependencies.</p><p>Enable the kubelet service:</p><pre>systemctl enable --now kubelet</pre><p>Now, here’s something that confuses a lot of people. If you check the kubelet status right now:</p><pre>systemctl status kubelet --no-pager</pre><p>You’ll see it’s in a crash loop, constantly restarting. Check the logs:</p><pre>journalctl -u kubelet --no-pager | tail -20</pre><p>You’ll see errors like:</p><pre>failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: open /var/lib/kubelet/config.yaml: no such file or directory</pre><p>This is completely normal. The kubelet needs configuration files that don’t exist yet. These files are created by kubeadm init on the control plane or kubeadm join on worker nodes. Until then, kubelet will keep trying and failing. 
Don&#39;t try to fix this; it resolves itself after we run kubeadm.</p><h4>Verifying the Installation</h4><p>Check the versions:</p><pre>kubeadm version -o yaml<br>kubectl version --client=true<br>kubelet --version</pre><p>All three should show v1.32.11 (or whatever the latest patch version is in the 1.32 series).</p><h4>Configuring crictl</h4><p>crictl is a CLI tool for CRI-compatible container runtimes. It’s useful for debugging and inspecting containers at a lower level than kubectl.</p><p>If you run crictl right now:</p><pre>crictl version</pre><p>You’ll see warnings about missing configuration. Let’s fix that:</p><pre>cat &lt;&lt; EOF &gt; /etc/crictl.yaml<br>runtime-endpoint: unix:///run/containerd/containerd.sock<br>image-endpoint: unix:///run/containerd/containerd.sock<br>EOF</pre><p>Now crictl knows where to find containerd. Run it again:</p><pre>crictl version</pre><p>No more warnings. You can also check the runtime info:</p><pre>crictl info | head -30</pre><p>Look for these important values:</p><ul><li>runtimeReady: true - containerd is working</li><li>networkReady: false - CNI not installed yet (expected)</li></ul><h4>CNI Binaries</h4><p>The kubernetes-cni package installed basic CNI plugins in /opt/cni/bin/:</p><pre>ls /opt/cni/bin/</pre><p>You’ll see plugins like bridge, loopback, portmap, and others. These are the building blocks that CNI plugins like Flannel or Calico use under the hood.</p><p>The configuration directory is empty for now:</p><pre>ls /etc/cni/net.d/</pre><p>This directory will be populated when we install a CNI plugin after kubeadm init.</p><h4>Understanding the kubelet Service Configuration</h4><p>Let’s look at how the kubelet service is set up:</p><pre>cat /usr/lib/systemd/system/kubelet.service</pre><p>This is a basic service definition. The interesting part is the drop-in file:</p><pre>cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf</pre><p>This file shows how kubeadm integrates with kubelet. 
It references several files that kubeadm will create:</p><ul><li>/etc/kubernetes/bootstrap-kubelet.conf - used during TLS bootstrap</li><li>/etc/kubernetes/kubelet.conf - the main kubeconfig for kubelet</li><li>/var/lib/kubelet/config.yaml - kubelet configuration</li><li>/var/lib/kubelet/kubeadm-flags.env - additional flags from kubeadm</li></ul><p>None of these exist yet:</p><pre>tree /etc/kubernetes<br>tree /var/lib/kubelet</pre><p>Both directories are essentially empty. This explains why kubelet is crash-looping. After we run kubeadm init or kubeadm join, these directories will be populated with all the necessary configuration.</p><p>At this point, we have everything installed and ready. The containerd runtime is running and waiting for containers. The kubelet is installed and trying to start (but failing because it has no configuration yet). In the next section, we’ll run kubeadm init to bring up the control plane, which will generate all the missing configuration and finally let kubelet do its job.</p><h3>4. Control Plane Initialization (kubeadm init)</h3><p>Now we get to the main event. With all the prerequisites in place, it’s time to actually initialize the control plane. This is where kubeadm does its magic, automating all those tedious steps we would otherwise have to do manually.</p><p>You have two ways to configure kubeadm init: command line flags or a YAML configuration file. The command line approach is quick and dirty, good for testing:</p><pre>kubeadm init \<br>  --apiserver-advertise-address=192.168.10.100 \<br>  --pod-network-cidr=10.244.0.0/16 \<br>  --service-cidr=10.96.0.0/16 \<br>  --kubernetes-version=1.32.11</pre><p>But I prefer the configuration file approach. It’s easier to version control, reproduce, and understand what you’re actually deploying. 
Here’s what I use:</p><pre>cat &lt;&lt; EOF &gt; kubeadm-init.yaml<br>apiVersion: kubeadm.k8s.io/v1beta4<br>kind: InitConfiguration<br>bootstrapTokens:<br>- token: &quot;123456.1234567890123456&quot;<br>  ttl: &quot;0s&quot;<br>  usages:<br>  - signing<br>  - authentication<br>nodeRegistration:<br>  kubeletExtraArgs:<br>    - name: node-ip<br>      value: &quot;192.168.10.100&quot;<br>  criSocket: &quot;unix:///run/containerd/containerd.sock&quot;<br>localAPIEndpoint:<br>  advertiseAddress: &quot;192.168.10.100&quot;<br>---<br>apiVersion: kubeadm.k8s.io/v1beta4<br>kind: ClusterConfiguration<br>kubernetesVersion: &quot;1.32.11&quot;<br>networking:<br>  podSubnet: &quot;10.244.0.0/16&quot;<br>  serviceSubnet: &quot;10.96.0.0/16&quot;<br>EOF</pre><p>Let me explain the important bits here.</p><p>The advertiseAddress is the IP address the API Server will advertise to other components. In my Vagrant environment, there are multiple network interfaces. If I don&#39;t specify this, kubeadm might pick the wrong one (like the NAT interface at 10.0.2.15), and then nothing works because other nodes can&#39;t reach the API Server.</p><p>The node-ip in kubeletExtraArgs serves a similar purpose but for the kubelet. Again, this is critical in multi-NIC environments. Without it, the kubelet might register itself with the wrong IP address, and you&#39;ll spend hours debugging why nodes can&#39;t communicate.</p><p>The podSubnet is set to 10.244.0.0/16 because I&#39;m planning to use Flannel as my CNI plugin, and that&#39;s Flannel&#39;s default. If you&#39;re using Calico, you&#39;d typically use 192.168.0.0/16 instead. Make sure this doesn&#39;t overlap with your actual node network.</p><p>The bootstrap token is something I hardcoded for this lab environment. Normally kubeadm generates a random token, but having a fixed token makes it easier to script the worker node join later. 
In production, you’d let kubeadm generate a random one and set a proper TTL (like 24 hours) instead of “0s”, which means the token never expires.</p><p>Before we run the init, let me highlight a few options that often confuse people.</p><p>The --control-plane-endpoint option (controlPlaneEndpoint in the config file) isn&#39;t in my config because this is a single control plane setup. But if you ever plan to expand to a high availability setup with multiple control planes, you must set this from the beginning. It should point to your load balancer address. You can&#39;t easily add this later because it gets baked into certificates and kubeconfig files. So plan ahead.</p><p>The --apiserver-cert-extra-sans option lets you add additional names or IPs to the API Server certificate&#39;s Subject Alternative Name field. You need this if you plan to access the API Server through a load balancer, custom domain name, or external IP that isn&#39;t automatically detected. For example:</p><pre>kubeadm init --apiserver-cert-extra-sans=kubernetes.example.com,10.0.0.100</pre><p>Pre-pulling the control plane images is optional, but I recommend it, especially if your network is slow or you want faster initialization. First, list the images:</p><pre>kubeadm config images list</pre><p>This shows you what images kubeadm needs:</p><pre>registry.k8s.io/kube-apiserver:v1.32.11<br>registry.k8s.io/kube-controller-manager:v1.32.11<br>registry.k8s.io/kube-scheduler:v1.32.11<br>registry.k8s.io/kube-proxy:v1.32.11<br>registry.k8s.io/coredns/coredns:v1.11.3<br>registry.k8s.io/pause:3.10<br>registry.k8s.io/etcd:3.5.24-0</pre><p>Then pull them:</p><pre>kubeadm config images pull</pre><p>The benefit is that when you actually run init, it doesn’t have to wait for image downloads. 
This is especially useful during upgrades where you want to minimize control plane downtime.</p><p>You can do a dry run first to see what would happen without actually doing anything:</p><pre>kubeadm init --config=&quot;kubeadm-init.yaml&quot; --dry-run</pre><p>This creates temporary files in /etc/kubernetes/tmp/kubeadm-init-dryrun* so you can inspect the certificates and manifests that would be generated. Useful for debugging or understanding what&#39;s going on.</p><p>When you’re ready for real:</p><pre>kubeadm init --config=&quot;kubeadm-init.yaml&quot;</pre><p>Now let’s walk through what happens. The output tells the whole story.</p><p>kubeadm init runs through about 14 phases. Understanding these helps tremendously when troubleshooting.</p><p><strong>Preflight checks</strong> come first:</p><pre>[init] Using Kubernetes version: v1.32.11<br>[preflight] Running pre-flight checks<br>[preflight] Pulling images required for setting up a Kubernetes cluster</pre><p>This validates that your system meets requirements. It checks ports, container runtime, kernel parameters, and so on. 
If you didn’t pre-pull images, it downloads them here.</p><p><strong>Certificate generation</strong> is next:</p><pre>[certs] Using certificateDir folder &quot;/etc/kubernetes/pki&quot;<br>[certs] Generating &quot;ca&quot; certificate and key<br>[certs] Generating &quot;apiserver&quot; certificate and key<br>[certs] apiserver serving cert is signed for DNS names [k8s-ctr kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.10.100]<br>[certs] Generating &quot;apiserver-kubelet-client&quot; certificate and key<br>[certs] Generating &quot;front-proxy-ca&quot; certificate and key<br>[certs] Generating &quot;front-proxy-client&quot; certificate and key<br>[certs] Generating &quot;etcd/ca&quot; certificate and key<br>[certs] Generating &quot;etcd/server&quot; certificate and key<br>[certs] Generating &quot;etcd/peer&quot; certificate and key<br>[certs] Generating &quot;etcd/healthcheck-client&quot; certificate and key<br>[certs] Generating &quot;apiserver-etcd-client&quot; certificate and key<br>[certs] Generating &quot;sa&quot; key and public key</pre><p>This is all the stuff you’d do manually with OpenSSL in Kubernetes The Hard Way. kubeadm creates three separate CA hierarchies: one for the main cluster components, one for etcd, and one for the front proxy (used by API aggregation). 
The apiserver certificate output is worth noting because it shows all the SANs that clients can use to connect.</p><p><strong>Kubeconfig files</strong> get created:</p><pre>[kubeconfig] Using kubeconfig folder &quot;/etc/kubernetes&quot;<br>[kubeconfig] Writing &quot;admin.conf&quot; kubeconfig file<br>[kubeconfig] Writing &quot;super-admin.conf&quot; kubeconfig file<br>[kubeconfig] Writing &quot;kubelet.conf&quot; kubeconfig file<br>[kubeconfig] Writing &quot;controller-manager.conf&quot; kubeconfig file<br>[kubeconfig] Writing &quot;scheduler.conf&quot; kubeconfig file</pre><p>Each component needs its own kubeconfig to authenticate with the API Server. The admin.conf is what you’ll copy to ~/.kube/config for kubectl access.</p><p><strong>Static Pod manifests</strong> for etcd and control plane components:</p><pre>[etcd] Creating static Pod manifest for local etcd in &quot;/etc/kubernetes/manifests&quot;<br>[control-plane] Using manifest folder &quot;/etc/kubernetes/manifests&quot;<br>[control-plane] Creating static Pod manifest for &quot;kube-apiserver&quot;<br>[control-plane] Creating static Pod manifest for &quot;kube-controller-manager&quot;<br>[control-plane] Creating static Pod manifest for &quot;kube-scheduler&quot;</pre><p>This is a key design choice in kubeadm. Instead of running etcd and control plane components as systemd services (like in Kubernetes The Hard Way), kubeadm deploys them as Static Pods. The kubelet watches the /etc/kubernetes/manifests directory and automatically runs any pod manifests it finds there. 
This means the kubelet manages the control plane components, which is elegant but also means you need a running kubelet on control plane nodes.</p><p><strong>Kubelet starts and waits for control plane</strong>:</p><pre>[kubelet-start] Writing kubelet environment file with flags to file &quot;/var/lib/kubelet/kubeadm-flags.env&quot;<br>[kubelet-start] Writing kubelet configuration to file &quot;/var/lib/kubelet/config.yaml&quot;<br>[kubelet-start] Starting the kubelet<br>[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods<br>[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz<br>[kubelet-check] The kubelet is healthy after 1.002214971s<br>[api-check] Waiting for a healthy API server. This can take up to 4m0s<br>[api-check] The API server is healthy after 3.003031359s</pre><p>Remember how the kubelet was crashlooping before init? That’s because /var/lib/kubelet/config.yaml didn’t exist. Now kubeadm creates it, and the kubelet can finally start properly. It then waits for the API Server (running as a Static Pod) to become healthy.</p><p><strong>Configuration upload to the cluster</strong>:</p><pre>[upload-config] Storing the configuration used in ConfigMap &quot;kubeadm-config&quot; in the &quot;kube-system&quot; Namespace<br>[kubelet] Creating a ConfigMap &quot;kubelet-config&quot; in namespace kube-system</pre><p>Once the API Server is up, kubeadm stores its configuration as ConfigMaps. 
This is important because when you later run kubeadm join on worker nodes, they fetch these ConfigMaps to get consistent settings.</p><p><strong>Control plane marking</strong>:</p><pre>[mark-control-plane] Marking the node k8s-ctr as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]<br>[mark-control-plane] Marking the node k8s-ctr as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]</pre><p>The label makes the node show up as “control-plane” in kubectl get nodes. The taint prevents regular workload pods from being scheduled on control plane nodes. You want your control plane dedicated to running cluster infrastructure, not your applications.</p><p><strong>Bootstrap token setup</strong>:</p><pre>[bootstrap-token] Using token: 123456.1234567890123456<br>[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles<br>[bootstrap-token] Creating the &quot;cluster-info&quot; ConfigMap in the &quot;kube-public&quot; namespace</pre><p>This creates the infrastructure for worker nodes to join the cluster. The cluster-info ConfigMap in kube-public is special because it’s readable without authentication. This allows new nodes (which don’t have certificates yet) to discover the cluster’s CA certificate and API Server address.</p><p><strong>Addon installation</strong>:</p><pre>[addons] Applied essential addon: CoreDNS<br>[addons] Applied essential addon: kube-proxy</pre><p>kubeadm installs two essential addons. 
CoreDNS provides cluster DNS (so pods can resolve service names), and kube-proxy handles service networking (the iptables or IPVS rules that make Services work).</p><p><strong>Finally, the completion message</strong>:</p><pre>To start using your cluster, you need to run the following as a regular user:</pre><pre>  mkdir -p $HOME/.kube<br>  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config<br>  sudo chown $(id -u):$(id -g) $HOME/.kube/config</pre><pre>You should now deploy a pod network to the cluster.<br>Run &quot;kubectl apply -f [podnetwork].yaml&quot; with one of the options listed at:<br>  <a href="https://kubernetes.io/docs/concepts/cluster-administration/addons/">https://kubernetes.io/docs/concepts/cluster-administration/addons/</a></pre><pre>Then you can join any number of worker nodes by running the following on each as root:</pre><pre>kubeadm join 192.168.10.100:6443 --token 123456.1234567890123456 \<br>        --discovery-token-ca-cert-hash sha256:bd763182471f1ed47780644230f234a89061a29041a922a74c849a48342c797d</pre><p>Save that join command. You’ll need it for the worker nodes.</p><h4>Post-init Setup</h4><p>First thing: set up kubectl access. Since I’m running as root:</p><pre>mkdir -p /root/.kube<br>cp -i /etc/kubernetes/admin.conf /root/.kube/config<br>chown $(id -u):$(id -g) /root/.kube/config</pre><p>Verify it works:</p><pre>kubectl cluster-info</pre><pre>Kubernetes control plane is running at https://192.168.10.100:6443<br>CoreDNS is running at https://192.168.10.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy</pre><h4>Initial State Verification</h4><p>Let’s see what we’ve got:</p><pre>kubectl get nodes -o wide</pre><pre>NAME      STATUS     ROLES           AGE   VERSION    INTERNAL-IP      OS-IMAGE                        CONTAINER-RUNTIME<br>k8s-ctr   NotReady   control-plane   2m    v1.32.11   192.168.10.100   Rocky Linux 10.0 (Red Quartz)   containerd://2.1.5</pre><p>The node shows NotReady. 
This is expected because we haven’t installed a CNI plugin yet. Without CNI, the node can’t set up pod networking, so Kubernetes considers it not ready.</p><p>Check the pods:</p><pre>kubectl get pod -n kube-system -o wide</pre><pre>NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE<br>coredns-668d6bf9bc-n8jxf          0/1     Pending   0          2m    &lt;none&gt;           &lt;none&gt;<br>coredns-668d6bf9bc-z6h69          0/1     Pending   0          2m    &lt;none&gt;           &lt;none&gt;<br>etcd-k8s-ctr                      1/1     Running   0          2m    192.168.10.100   k8s-ctr<br>kube-apiserver-k8s-ctr            1/1     Running   0          2m    192.168.10.100   k8s-ctr<br>kube-controller-manager-k8s-ctr   1/1     Running   0          2m    192.168.10.100   k8s-ctr<br>kube-proxy-5p6jx                  1/1     Running   0          2m    192.168.10.100   k8s-ctr<br>kube-scheduler-k8s-ctr            1/1     Running   0          2m    192.168.10.100   k8s-ctr</pre><p>The Static Pods (etcd, apiserver, controller-manager, scheduler) are running because the kubelet starts them directly and they use hostNetwork. kube-proxy is running too: it also uses hostNetwork, and its DaemonSet tolerates the taint that a NotReady node carries. The CoreDNS pods are Pending because the scheduler won’t place them on a NotReady node, and they would need pod network IPs from a CNI plugin anyway.</p><p>You can also verify with crictl directly:</p><pre>crictl ps</pre><pre>CONTAINER      IMAGE          CREATED         STATE     NAME                      POD<br>dc8f81e24dff7  dcdb790dc2bfe  3 minutes ago   Running   kube-proxy                kube-proxy-5p6jx<br>28856e606823f  58951ea1a0b5d  3 minutes ago   Running   kube-apiserver            kube-apiserver-k8s-ctr<br>e7593756117ad  1211402d28f58  3 minutes ago   Running   etcd                      etcd-k8s-ctr<br>61a09c44673c6  cfa17ff3d6634  3 minutes ago   Running   kube-scheduler            kube-scheduler-k8s-ctr<br>65d20308c4200  82766e5f2d560  3 minutes ago   Running   kube-controller-manager   kube-controller-manager-k8s-ctr</pre><p>All the control plane containers are running. 
The control plane is up, but we need CNI before we can call it fully functional.</p><h4>What Got Created</h4><p>After init, here’s what the /etc/kubernetes directory looks like:</p><pre>/etc/kubernetes/<br>├── admin.conf<br>├── controller-manager.conf<br>├── kubelet.conf<br>├── scheduler.conf<br>├── super-admin.conf<br>├── manifests/<br>│   ├── etcd.yaml<br>│   ├── kube-apiserver.yaml<br>│   ├── kube-controller-manager.yaml<br>│   └── kube-scheduler.yaml<br>└── pki/<br>    ├── ca.crt<br>    ├── ca.key<br>    ├── apiserver.crt<br>    ├── apiserver.key<br>    ├── apiserver-kubelet-client.crt<br>    ├── apiserver-kubelet-client.key<br>    ├── front-proxy-ca.crt<br>    ├── front-proxy-ca.key<br>    ├── front-proxy-client.crt<br>    ├── front-proxy-client.key<br>    ├── sa.key<br>    ├── sa.pub<br>    └── etcd/<br>        ├── ca.crt<br>        ├── ca.key<br>        ├── server.crt<br>        ├── server.key<br>        ├── peer.crt<br>        ├── peer.key<br>        ├── healthcheck-client.crt<br>        └── healthcheck-client.key</pre><p>The kubeconfig files are at the top level. The manifests directory contains the Static Pod definitions that kubelet is running. The pki directory has all the certificates and keys.</p><p>Compare this to what you’d create manually in Kubernetes The Hard Way. kubeadm did all of that in about 10 seconds.</p><h3>5. CNI Plugin Installation</h3><p>At this point, we have a control plane that is technically running, but if you check the node status, you will see something that might concern you at first.</p><pre>kubectl get nodes<br>NAME      STATUS     ROLES           AGE   VERSION<br>k8s-ctr   NotReady   control-plane   2m    v1.32.11</pre><p>The node shows NotReady. 
And if you look at the pods in the kube-system namespace, you will notice the CoreDNS pods are stuck in the Pending state.</p><pre>kubectl get pod -n kube-system<br>NAME                              READY   STATUS    RESTARTS   AGE<br>coredns-668d6bf9bc-n8jxf          0/1     Pending   0          2m<br>coredns-668d6bf9bc-z6h69          0/1     Pending   0          2m<br>etcd-k8s-ctr                      1/1     Running   0          2m<br>kube-apiserver-k8s-ctr            1/1     Running   0          2m<br>kube-controller-manager-k8s-ctr   1/1     Running   0          2m<br>kube-proxy-5p6jx                  1/1     Running   0          2m<br>kube-scheduler-k8s-ctr            1/1     Running   0          2m</pre><p>This is completely expected and not a bug. The reason is simple: we have not installed a CNI plugin yet.</p><h4>Why CNI is Required</h4><p>Kubernetes itself does not implement pod networking. Instead, it delegates this responsibility to CNI (Container Network Interface) plugins. The CNI plugin is responsible for assigning IP addresses to pods and setting up the network so that pods can communicate with each other across nodes.</p><p>Without a CNI plugin, the kubelet cannot configure networking for pods. When the kubelet reports the node status to the API server, it includes a condition called Ready. 
Until a CNI configuration is present, the container runtime reports that its network is not ready, so the kubelet sets this condition to False and the node is marked as NotReady.</p><p>You can verify this by describing the node.</p><pre>kubectl describe node k8s-ctr | grep -A6 Conditions<br>Conditions:<br>  Type                 Status  Reason                       Message<br>  ----                 ------  ------                       -------<br>  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory<br>  DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure<br>  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID<br>  Ready                False   KubeletNotReady              container runtime network not ready</pre><p>The message is clear: container runtime network not ready. The separate NetworkUnavailable condition is set by the network plugin rather than by the kubelet, so on a fresh kubeadm node it typically only appears once a plugin such as Flannel is running.</p><p>Now, why is CoreDNS stuck in Pending? CoreDNS is deployed as a regular Deployment, which means it needs to be scheduled on a node and run as a normal pod. Unlike the control plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) that run as Static Pods with hostNetwork: true, CoreDNS needs a pod IP address from the pod network. While the node is NotReady, it carries a not-ready taint that CoreDNS does not tolerate, so the scheduler will not place it there. Even if it were placed, there is no CNI to allocate its pod IP.</p><p>The control plane components and kube-proxy can run because they use the host network directly. They bind to the node’s IP address rather than requiring a pod IP. The Static Pods are also started by the kubelet without going through the scheduler, and the kube-proxy DaemonSet tolerates the not-ready taint.</p><h4>Choosing a CNI Plugin</h4><p>There are many CNI plugins available: Calico, Flannel, Cilium, Weave, and others. Each has different features and trade-offs. For this setup, we will use Flannel because it is simple and lightweight. It creates a VXLAN overlay network that allows pods on different nodes to communicate with each other.</p><p>Remember when we ran kubeadm init with the pod network CIDR (podSubnet) set to 10.244.0.0/16? This value is the default for Flannel. 
If you are using Calico, the default is 192.168.0.0/16. The pod network CIDR must match what the CNI plugin expects, or you need to modify the CNI configuration accordingly.</p><h4>Installing Flannel</h4><p>Flannel installation is straightforward. We just need to apply the manifest from the official repository.</p><pre>kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml</pre><p>This will create several resources: a namespace called kube-flannel, a ServiceAccount, ConfigMap with the Flannel configuration, and most importantly, a DaemonSet that runs the Flannel agent on every node.</p><pre>namespace/kube-flannel created<br>serviceaccount/flannel created<br>clusterrole.rbac.authorization.k8s.io/flannel created<br>clusterrolebinding.rbac.authorization.k8s.io/flannel created<br>configmap/kube-flannel-cfg created<br>daemonset.apps/kube-flannel-ds created</pre><p>After a few seconds, you can check the Flannel pods.</p><pre>kubectl get pod -n kube-flannel<br>NAME                    READY   STATUS    RESTARTS   AGE<br>kube-flannel-ds-xxxxx   1/1     Running   0          30s</pre><p>Now let’s check if the node status has changed.</p><pre>kubectl get nodes<br>NAME      STATUS   ROLES           AGE   VERSION<br>k8s-ctr   Ready    control-plane   5m    v1.32.11</pre><p>The node is now Ready. 
Let’s also verify that CoreDNS pods are running.</p><pre>kubectl get pod -n kube-system<br>NAME                              READY   STATUS    RESTARTS   AGE<br>coredns-668d6bf9bc-n8jxf          1/1     Running   0          5m<br>coredns-668d6bf9bc-z6h69          1/1     Running   0          5m<br>etcd-k8s-ctr                      1/1     Running   0          5m<br>kube-apiserver-k8s-ctr            1/1     Running   0          5m<br>kube-controller-manager-k8s-ctr   1/1     Running   0          5m<br>kube-proxy-5p6jx                  1/1     Running   0          5m<br>kube-scheduler-k8s-ctr            1/1     Running   0          5m</pre><p>Both CoreDNS pods are now Running. They have been assigned pod IPs from the 10.244.0.0/24 range (the subnet allocated to this node).</p><pre>kubectl get pod -n kube-system -o wide | grep coredns<br>coredns-668d6bf9bc-n8jxf   1/1   Running   0   5m   10.244.0.2   k8s-ctr   &lt;none&gt;   &lt;none&gt;<br>coredns-668d6bf9bc-z6h69   1/1   Running   0   5m   10.244.0.3   k8s-ctr   &lt;none&gt;   &lt;none&gt;</pre><p>You can also check the NetworkUnavailable condition again.</p><pre>kubectl describe node k8s-ctr | grep NetworkUnavailable<br>  NetworkUnavailable   False   FlannelIsUp   Flannel is running on this node</pre><p>The condition is now False, meaning the network is available.</p><p>Flannel creates a virtual network interface called flannel.1 on each node. This interface is used for the VXLAN overlay network.</p><pre>ip addr show flannel.1<br>4: flannel.1: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1450 qdisc noqueue state UNKNOWN group default <br>    link/ether 3a:2b:1c:4d:5e:6f brd ff:ff:ff:ff:ff:ff<br>    inet 10.244.0.0/32 scope global flannel.1<br>       valid_lft forever preferred_lft forever</pre><p>Flannel also adds routes to the kernel routing table. Right now, with only one node, you will not see routes to other nodes yet. 
But once you add worker nodes, Flannel will automatically add routes for their pod subnets.</p><pre>ip route | grep -E &#39;cni0|flannel&#39;<br>10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1</pre><p>The cni0 interface is a bridge that connects pods on the same node. It is created by the bridge CNI plugin, which Flannel delegates to. Pods on this node get IP addresses from 10.244.0.0/24 and are connected to this bridge.</p><p>When Flannel starts, it writes CNI configuration files to /etc/cni/net.d/. Before Flannel installation, this directory was empty.</p><pre>ls -la /etc/cni/net.d/<br>total 8<br>drwxr-xr-x. 2 root root   54 Jan 23 11:00 .<br>drwxr-xr-x. 3 root root   19 Jan 23 01:19 ..<br>-rw-r--r--. 1 root root  292 Jan 23 11:00 10-flannel.conflist</pre><p>The 10-flannel.conflist file tells the container runtime how to set up networking for new pods. When a pod is created, containerd reads this configuration and calls the appropriate CNI binaries (which were installed earlier in /opt/cni/bin/) to configure the pod’s network namespace.</p><pre>cat /etc/cni/net.d/10-flannel.conflist<br>{<br>  &quot;name&quot;: &quot;cbr0&quot;,<br>  &quot;cniVersion&quot;: &quot;0.3.1&quot;,<br>  &quot;plugins&quot;: [<br>    {<br>      &quot;type&quot;: &quot;flannel&quot;,<br>      &quot;delegate&quot;: {<br>        &quot;hairpinMode&quot;: true,<br>        &quot;isDefaultGateway&quot;: true<br>      }<br>    },<br>    {<br>      &quot;type&quot;: &quot;portmap&quot;,<br>      &quot;capabilities&quot;: {<br>        &quot;portMappings&quot;: true<br>      }<br>    }<br>  ]<br>}</pre><h4>Summary</h4><p>The control plane is now fully functional. The node is Ready, CoreDNS is running, and we have a working pod network. At this point, you could deploy workloads to the cluster, but they would all run on the control plane node. In a production environment, you typically do not want to run application workloads on the control plane. 
That is why the control plane node has a taint (node-role.kubernetes.io/control-plane:NoSchedule) that prevents regular pods from being scheduled there.</p><p>In the next section, we will add worker nodes to the cluster using kubeadm join. Once the worker nodes join, Flannel will automatically configure routes so that pods on different nodes can communicate with each other through the VXLAN overlay network.</p><h3>6. Worker Node Join (kubeadm join)</h3><p>With the control plane up and running, we now need to add worker nodes to the cluster. This is where kubeadm join comes in.</p><p>At first glance, joining a node might seem straightforward: just run a command and the node becomes part of the cluster. But there&#39;s actually a lot happening under the hood, and understanding this process will help you troubleshoot issues when things go wrong.</p><h4>The Trust Problem</h4><p>When a new node wants to join the cluster, we face an interesting chicken-and-egg problem. Think about it from both perspectives.</p><p>From the new node’s perspective: “How do I know this API server is legitimate? What if someone is running a fake control plane trying to steal my credentials?”</p><p>From the control plane’s perspective: “How do I know this node is allowed to join? What if someone unauthorized is trying to sneak into my cluster?”</p><p>This is why kubeadm implements what’s called bidirectional trust. Both sides need to verify each other before the node can join.</p><h4>Discovery: The Node Validates the Control Plane</h4><p>The first half of bidirectional trust is called Discovery. 
This is where the new node confirms that the API server it’s connecting to is actually the legitimate control plane for the cluster.</p><p>When you run kubeadm join, the command typically looks something like this:</p><pre>kubeadm join 192.168.10.100:6443 --token 123456.1234567890123456 \<br>    --discovery-token-ca-cert-hash sha256:bd763182471f1ed47780644230f234a89061a29041a922a74c849a48342c797d</pre><p>That --discovery-token-ca-cert-hash flag is doing the heavy lifting here. The new node connects to the API server and retrieves the cluster&#39;s CA certificate. Then it computes a SHA-256 hash of that certificate&#39;s public key and compares it against the hash you provided. If they match, the node knows it&#39;s talking to the real control plane.</p><p>You might wonder where this hash comes from. It’s computed from the control plane’s CA public key. You can calculate it yourself on the control plane:</p><pre>openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \<br>    openssl rsa -pubin -outform der 2&gt;/dev/null | \<br>    openssl dgst -sha256 -hex | sed &#39;s/^.* //&#39;</pre><p>This approach is sometimes called CA pinning. Even if someone manages to intercept your network traffic, they can’t impersonate the control plane because they cannot present a CA whose public key matches the pinned hash.</p><p>There’s also an unsafe option called --discovery-token-unsafe-skip-ca-verification that skips this check entirely. As the name suggests, you really shouldn&#39;t use this in production. It opens you up to man-in-the-middle attacks where an attacker could set up a fake API server and trick your nodes into joining the wrong cluster.</p><h4>TLS Bootstrap: The Control Plane Validates the Node</h4><p>The second half of bidirectional trust is TLS Bootstrap. Now we need to solve the opposite problem: how does the control plane know this new node is authorized to join?</p><p>Here’s where it gets interesting. The new node needs a kubelet certificate to communicate securely with the API server. 
But to request that certificate, it needs to authenticate to the API server first. And to authenticate, it would normally need… a certificate. See the chicken-and-egg problem?</p><p>The solution is the bootstrap token. This is a temporary credential that allows a new node to authenticate just long enough to request its real certificate.</p><p>When you ran kubeadm init, it created a bootstrap token and stored it as a Secret in the kube-system namespace. The token format is [token-id].[token-secret], like 123456.1234567890123456. The new node presents this token to the API server, which validates it against the stored Secret.</p><p>Once authenticated with the bootstrap token, the new node generates a key pair locally and submits a Certificate Signing Request (CSR) to the API server. The kube-controller-manager has been configured to automatically approve CSRs from nodes authenticated via bootstrap tokens. After approval, the node receives its signed certificate and can now communicate with the API server using proper mTLS.</p><p>The flow looks like this:</p><ol><li>New node authenticates with bootstrap token</li><li>Node generates local key pair</li><li>Node submits CSR to API server</li><li>Controller manager auto-approves the CSR</li><li>Node receives signed certificate</li><li>Node switches to using the real certificate for all future communication</li></ol><h4>The cluster-info ConfigMap</h4><p>You might be wondering how the new node even knows where to find the CA certificate for validation. The answer is the cluster-info ConfigMap in the kube-public namespace.</p><p>During kubeadm init, a ConfigMap called cluster-info is created that contains the API server address and CA certificate. What&#39;s special about this ConfigMap is that it&#39;s readable by unauthenticated users. This is intentional. 
A new node that hasn&#39;t joined yet obviously can&#39;t authenticate, but it still needs to retrieve this bootstrap information.</p><p>You can verify this yourself from a worker node before joining:</p><pre>curl -s -k https://192.168.10.100:6443/api/v1/namespaces/kube-public/configmaps/cluster-info | jq &#39;.data | keys&#39;</pre><p>This should return:</p><pre>[<br>  &quot;jws-kubeconfig-123456&quot;,<br>  &quot;kubeconfig&quot;<br>]</pre><p>The kubeconfig field contains the CA certificate and API server address. The jws-kubeconfig-123456 field is a JSON Web Signature that proves this ConfigMap hasn&#39;t been tampered with.</p><p>Try reading almost any other resource without authentication, such as Pods or Secrets, and you’ll get a 403 Forbidden. The cluster-info ConfigMap is one of the few deliberate exceptions, alongside non-resource endpoints like /healthz and /version.</p><h4>Preparing the Worker Node</h4><p>Before running kubeadm join, the worker node needs the same prerequisites as the control plane: containerd installed, kubelet installed, swap disabled, kernel modules loaded, and so on. I&#39;m assuming you&#39;ve already done this based on the earlier sections.</p><p>Let’s save some baseline information so we can compare before and after:</p><pre>crictl images | tee -a crictl_images-1.txt<br>crictl ps -a | tee -a crictl_ps-1.txt<br>tree /etc/kubernetes | tee -a etc_kubernetes-1.txt<br>tree /var/lib/kubelet | tee -a var_lib_kubelet-1.txt</pre><p>At this point, both directories should be essentially empty.</p><h4>Join Configuration</h4><p>Just like kubeadm init, you can run kubeadm join with command-line flags or with a configuration file. The configuration file approach is better for reproducibility and version control.</p><p>First, figure out which IP address this worker node should use for cluster communication. 
In my Vagrant environment, there are multiple network interfaces, so I need to be explicit:</p><pre>NODEIP=$(ip -4 addr show enp0s9 | grep -oP &#39;(?&lt;=inet\s)\d+(\.\d+){3}&#39;)<br>echo $NODEIP</pre><p>Now create the JoinConfiguration file:</p><pre>cat &lt;&lt; EOF &gt; kubeadm-join.yaml<br>apiVersion: kubeadm.k8s.io/v1beta4<br>kind: JoinConfiguration<br>discovery:<br>  bootstrapToken:<br>    token: &quot;123456.1234567890123456&quot;<br>    apiServerEndpoint: &quot;192.168.10.100:6443&quot;<br>    unsafeSkipCAVerification: true<br>nodeRegistration:<br>  criSocket: &quot;unix:///run/containerd/containerd.sock&quot;<br>  kubeletExtraArgs:<br>    - name: node-ip<br>      value: &quot;$NODEIP&quot;<br>EOF</pre><p>A few things to note about this configuration.</p><p>The token must match what was created during kubeadm init. In my setup, I hardcoded it to 123456.1234567890123456 for simplicity. In production, you&#39;d typically use kubeadm token list on the control plane to find valid tokens, or create a new one with kubeadm token create.</p><p>I’m using unsafeSkipCAVerification: true here for lab convenience. In production, you should use caCertHashes instead and provide the actual CA hash.</p><p>The node-ip setting is critical in multi-interface environments like Vagrant. Without it, kubelet might pick the wrong IP address (like the NAT interface at 10.0.2.15) and nodes won&#39;t be able to communicate properly.</p><h4>Running the Join</h4><p>You can do a dry run first to see what would happen without actually making changes:</p><pre>kubeadm join --config=&quot;kubeadm-join.yaml&quot; --dry-run</pre><p>This shows you the API calls that would be made and the files that would be created. 
It’s a good sanity check.</p><p>When you’re ready, run the actual join:</p><pre>kubeadm join --config=&quot;kubeadm-join.yaml&quot;</pre><p>The output walks through the phases:</p><pre>[preflight] Running pre-flight checks<br>[preflight] Reading configuration from the &quot;kubeadm-config&quot; ConfigMap in namespace &quot;kube-system&quot;...</pre><p>First, preflight checks validate that the system meets requirements and can reach the API server.</p><pre>[kubelet-start] Writing kubelet configuration to file &quot;/var/lib/kubelet/config.yaml&quot;<br>[kubelet-start] Writing kubelet environment file with flags to file &quot;/var/lib/kubelet/kubeadm-flags.env&quot;<br>[kubelet-start] Starting the kubelet<br>[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s<br>[kubelet-check] The kubelet is healthy after 501.164948ms<br>[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap</pre><p>Then kubelet starts up and performs the TLS Bootstrap we discussed earlier. It authenticates with the bootstrap token, submits a CSR, and receives its certificate.</p><pre>This node has joined the cluster:<br>* Certificate signing request was sent to apiserver and a response was received.<br>* The Kubelet was informed of the new secure connection details.<br><br>Run &#39;kubectl get nodes&#39; on the control-plane to see this node join the cluster.</pre><p>That’s it. The node is now part of the cluster.</p><h4>What Changed on the Worker Node</h4><p>Let’s look at what files were created. The /etc/kubernetes directory now has content:</p><pre>tree /etc/kubernetes</pre><pre>/etc/kubernetes<br>├── kubelet.conf<br>├── manifests<br>└── pki<br>    └── ca.crt</pre><p>Compare this to the control plane, which has admin.conf, controller-manager.conf, scheduler.conf, and a full pki directory with many certificates and keys. 
The worker node only needs kubelet.conf for the kubelet to talk to the API server, and ca.crt to validate the API server&#39;s certificate. The manifests directory is empty because worker nodes don&#39;t run Static Pods.</p><p>The kubelet configuration lives in /var/lib/kubelet:</p><pre>tree /var/lib/kubelet</pre><p>You’ll see config.yaml (kubelet configuration), kubeadm-flags.env (extra flags including node-ip), and a pki directory containing the kubelet&#39;s client certificate that was issued through TLS Bootstrap.</p><p>Looking at kubelet.conf, you can see it references the certificate files:</p><pre>cat /etc/kubernetes/kubelet.conf</pre><pre>apiVersion: v1<br>clusters:<br>- cluster:<br>    certificate-authority-data: LS0tLS1CRUdJTi...<br>    server: https://192.168.10.100:6443<br>  name: default-cluster<br>contexts:<br>- context:<br>    cluster: default-cluster<br>    namespace: default<br>    user: default-auth<br>  name: default-context<br>current-context: default-context<br>kind: Config<br>users:<br>- name: default-auth<br>  user:<br>    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem<br>    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem</pre><p>The kubelet-client-current.pem file is a symlink that gets updated when certificates are rotated. This supports automatic certificate renewal.</p><h4>Verifying from the Control Plane</h4><p>Back on the control plane, check the node status:</p><pre>kubectl get nodes</pre><pre>NAME      STATUS   ROLES           AGE     VERSION<br>k8s-ctr   Ready    control-plane   30h     v1.32.11<br>k8s-w1    Ready    &lt;none&gt;          7m29s   v1.32.11<br>k8s-w2    Ready    &lt;none&gt;          119s    v1.32.11</pre><p>Both worker nodes show Ready status. 
This is because Flannel (the CNI plugin we installed earlier) is deployed to new nodes automatically by its DaemonSet, so the network comes up on each node shortly after it joins.</p><p>Check what pods are running on the worker nodes:</p><pre>kubectl get pod -A -owide | grep k8s-w1</pre><pre>kube-flannel   kube-flannel-ds-8vmb6   1/1   Running   0   10m   192.168.10.101   k8s-w1<br>kube-system    kube-proxy-dkczx        1/1   Running   0   10m   192.168.10.101   k8s-w1</pre><p>The kube-proxy and kube-flannel pods were automatically scheduled because they’re DaemonSets. Every node gets one.</p><p>Each node has been assigned its own Pod CIDR:</p><pre>kubectl get nodes -o jsonpath=&#39;{range .items[*]}{.metadata.name}{&quot;\t&quot;}{.spec.podCIDR}{&quot;\n&quot;}{end}&#39;</pre><pre>k8s-ctr	10.244.0.0/24<br>k8s-w1	10.244.1.0/24<br>k8s-w2	10.244.2.0/24</pre><p>The kube-controller-manager allocated these from the cluster’s pod network CIDR (10.244.0.0/16), giving each node a /24 subnet.</p><p>You can also verify cross-node networking is working. On the control plane, check the routing table:</p><pre>ip -c route | grep flannel</pre><pre>10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink<br>10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink</pre><p>Flannel has set up routes so traffic destined for other nodes’ pod networks goes through the VXLAN overlay. You can test connectivity by pinging the flannel.1 address on k8s-w1 (the cni0 bridge at 10.244.1.1 only appears once a pod using the pod network runs on that node):</p><pre>ping -c 1 10.244.1.0</pre><p>If that works, your cluster networking is healthy.</p><h4>Troubleshooting Join Issues</h4><p>The most common problem I’ve seen is kubeadm join hanging at the preflight stage. It just sits there doing nothing. This is almost always a network connectivity issue.</p><p>First, check if you can reach the API server from the worker node:</p><pre>curl -k https://192.168.10.100:6443/healthz</pre><p>If this times out or fails to connect, you have a network problem. Check the basics: can you ping the control plane? 
Is the firewall blocking port 6443?</p><p>In my lab, I once spent an embarrassing amount of time debugging a join failure. Ping worked fine, but the API server was unreachable. Turned out firewalld had been re-enabled on the control plane after a VM snapshot restore. The fix was simple:</p><pre># On control plane<br>systemctl disable --now firewalld</pre><p>Or if you want to keep the firewall, open the required ports:</p><pre>firewall-cmd --permanent --add-port=6443/tcp<br>firewall-cmd --permanent --add-port=10250/tcp<br>firewall-cmd --reload</pre><p>Another issue is when the node shows up but stays in NotReady status. This usually means the CNI plugin isn’t working. Check if the flannel or calico pods are running on that node. If they’re crashing, look at their logs.</p><h4>Summary</h4><p>Adding worker nodes to a kubeadm cluster involves a trust negotiation between the new node and the control plane. The node validates the control plane through CA certificate pinning (Discovery), and the control plane validates the node through bootstrap tokens and automatic CSR approval (TLS Bootstrap).</p><p>Once joined, the worker node has a minimal footprint compared to the control plane: just a kubelet.conf, a CA certificate, and its own kubelet client certificate. DaemonSets like kube-proxy and your CNI plugin automatically deploy to the new node, and it becomes ready to schedule workloads.</p><h3>7. Cluster Verification</h3><p>At this point, we have a control plane running and two worker nodes joined to the cluster. Before we call it done, let’s walk through a proper verification to make sure everything is working as expected. This is not just about running kubectl get nodes and seeing &quot;Ready&quot; status. We want to understand what actually happened under the hood.</p><p>The first thing most people do after setting up a cluster is check the nodes. 
Fair enough.</p><pre>kubectl get nodes -o wide</pre><pre>NAME      STATUS   ROLES           AGE     VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                        KERNEL-VERSION                  CONTAINER-RUNTIME<br>k8s-ctr   Ready    control-plane   30h     v1.32.11   192.168.10.100   &lt;none&gt;        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5<br>k8s-w1    Ready    &lt;none&gt;          7m29s   v1.32.11   192.168.10.101   &lt;none&gt;        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5<br>k8s-w2    Ready    &lt;none&gt;          119s    v1.32.11   192.168.10.102   &lt;none&gt;        Rocky Linux 10.0 (Red Quartz)   6.12.0-55.39.1.el10_0.aarch64   containerd://2.1.5</pre><p>All three nodes are showing Ready status. Notice how the control plane node has the control-plane role while worker nodes show &lt;none&gt;. This is just a label thing. The worker nodes are perfectly functional, they just don&#39;t have a role label assigned by default.</p><p>The INTERNAL-IP column is important here. Remember when we set node-ip in our kubeadm configuration? This is where it shows up. If you see 10.0.2.15 here instead of your expected IP, something went wrong with the network configuration. In Vagrant environments with multiple network interfaces, this is a common gotcha.</p><p>The kubelet on each node reports its capacity to the API server. 
The scheduler uses this information when deciding where to place pods.</p><pre>kubectl get nodes -o json | jq &quot;.items[] | {name:.metadata.name} + .status.capacity&quot;</pre><pre>{<br>  &quot;name&quot;: &quot;k8s-ctr&quot;,<br>  &quot;cpu&quot;: &quot;4&quot;,<br>  &quot;ephemeral-storage&quot;: &quot;60970Mi&quot;,<br>  &quot;memory&quot;: &quot;2893976Ki&quot;,<br>  &quot;pods&quot;: &quot;110&quot;<br>}<br>{<br>  &quot;name&quot;: &quot;k8s-w1&quot;,<br>  &quot;cpu&quot;: &quot;4&quot;,<br>  &quot;ephemeral-storage&quot;: &quot;60970Mi&quot;,<br>  &quot;memory&quot;: &quot;2893976Ki&quot;,<br>  &quot;pods&quot;: &quot;110&quot;<br>}<br>{<br>  &quot;name&quot;: &quot;k8s-w2&quot;,<br>  &quot;cpu&quot;: &quot;4&quot;,<br>  &quot;ephemeral-storage&quot;: &quot;60970Mi&quot;,<br>  &quot;memory&quot;: &quot;2893976Ki&quot;,<br>  &quot;pods&quot;: &quot;110&quot;<br>}</pre><p>Each node can run up to 110 pods by default. This limit is the kubelet’s maxPods setting, which can be changed if needed. The memory and CPU values reflect what we provisioned for our VMs.</p><p>This is where things get interesting. When we ran kubeadm init, we specified podSubnet: 10.244.0.0/16 in the configuration. The kube-controller-manager takes this big /16 block and carves out smaller /24 subnets for each node.</p><pre>kubectl get nodes -o jsonpath=&#39;{range .items[*]}{.metadata.name}{&quot;\t&quot;}{.spec.podCIDR}{&quot;\n&quot;}{end}&#39;</pre><pre>k8s-ctr	10.244.0.0/24<br>k8s-w1	10.244.1.0/24<br>k8s-w2	10.244.2.0/24</pre><p>So the control plane got 10.244.0.0/24, the first worker got 10.244.1.0/24, and the second worker got 10.244.2.0/24. Any pod scheduled on k8s-w1 will get an IP address from the 10.244.1.0/24 range. 
This is how Kubernetes ensures that every pod gets a unique IP address across the entire cluster.</p><p>You can also see this information in the node’s annotations, which Flannel populates:</p><pre>kubectl describe node k8s-w1 | grep -A5 Annotations</pre><p>Flannel writes its VXLAN configuration here, including the public IP and MAC address it uses for the overlay network.</p><p>Now here is where the CNI plugin earns its keep. For pods on different nodes to talk to each other, there needs to be a route. Let’s check the routing table on the control plane node:</p><pre>ip -c route | grep flannel</pre><pre>10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink<br>10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink</pre><p>Flannel has automatically added routes for the other nodes’ pod CIDRs. Traffic destined for 10.244.1.0/24 (k8s-w1’s pods) goes through the flannel.1 interface, which is a VXLAN tunnel endpoint. The same applies for k8s-w2’s pod network.</p><p>Let’s verify that we can actually reach another node’s pod CIDR:</p><pre>ping -c 1 10.244.1.0</pre><pre>PING 10.244.1.0 (10.244.1.0) 56(84) bytes of data.<br>64 bytes from 10.244.1.0: icmp_seq=1 ttl=64 time=0.765 ms</pre><p>The ping works because the VXLAN overlay is functioning correctly. 
This is the foundation that allows pods to communicate across nodes.</p><p>Let’s see what pods are running across the cluster:</p><pre>kubectl get pod -A -o wide</pre><pre>NAMESPACE      NAME                              READY   STATUS    RESTARTS   AGE   IP               NODE      <br>kube-flannel   kube-flannel-ds-8vmb6             1/1     Running   0          10m   192.168.10.101   k8s-w1<br>kube-flannel   kube-flannel-ds-wtdsc             1/1     Running   0          4m    192.168.10.102   k8s-w2<br>kube-flannel   kube-flannel-ds-xxxxx             1/1     Running   0          30h   192.168.10.100   k8s-ctr<br>kube-system    coredns-668d6bf9bc-n8jxf          1/1     Running   0          30h   10.244.0.2       k8s-ctr<br>kube-system    coredns-668d6bf9bc-z6h69          1/1     Running   0          30h   10.244.0.3       k8s-ctr<br>kube-system    etcd-k8s-ctr                      1/1     Running   0          30h   192.168.10.100   k8s-ctr<br>kube-system    kube-apiserver-k8s-ctr            1/1     Running   0          30h   192.168.10.100   k8s-ctr<br>kube-system    kube-controller-manager-k8s-ctr   1/1     Running   0          30h   192.168.10.100   k8s-ctr<br>kube-system    kube-proxy-5p6jx                  1/1     Running   0          30h   192.168.10.100   k8s-ctr<br>kube-system    kube-proxy-dkczx                  1/1     Running   0          10m   192.168.10.101   k8s-w1<br>kube-system    kube-proxy-frb9n                  1/1     Running   0          4m    192.168.10.102   k8s-w2<br>kube-system    kube-scheduler-k8s-ctr            1/1     Running   0          30h   192.168.10.100   k8s-ctr</pre><p>A few things to notice here. The control plane components (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) are only running on k8s-ctr and their IP addresses are the node’s IP, not pod network IPs. This is because they run with hostNetwork: true as Static Pods.</p><p>The kube-proxy and kube-flannel pods are running on every node. 
That’s the DaemonSet doing its job. Every node needs kube-proxy for service networking and kube-flannel for pod networking.</p><p>CoreDNS pods have IP addresses from the 10.244.0.0/24 range, which means they are running on the control plane node. These are the only “normal” pods using the pod network on the control plane. They landed there because the control plane was the only node in the cluster when they were first scheduled.</p><p>You might wonder why regular workloads don’t get scheduled on the control plane. The answer is taints:</p><pre>kubectl describe node k8s-ctr | grep Taints</pre><pre>Taints:             node-role.kubernetes.io/control-plane:NoSchedule</pre><p>The control plane has a taint that prevents normal pods from being scheduled there. The system pods like CoreDNS tolerate this taint, which is why they can run on the control plane. Worker nodes have no taints:</p><pre>kubectl describe node k8s-w1 | grep Taints</pre><pre>Taints:             &lt;none&gt;</pre><p>This is why when you deploy a regular application, it will land on worker nodes and not on the control plane.</p><p>Next, let’s look at the services in the cluster:</p><pre>kubectl get svc -A</pre><pre>NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE<br>default       kubernetes   ClusterIP   10.96.0.1    &lt;none&gt;        443/TCP                  30h<br>kube-system   kube-dns     ClusterIP   10.96.0.10   &lt;none&gt;        53/UDP,53/TCP,9153/TCP   30h</pre><p>Two services exist by default. The kubernetes service in the default namespace is how pods talk to the API server without knowing its actual IP. The kube-dns service points to the CoreDNS pods and provides cluster DNS resolution.</p><p>Notice that the service IPs come from the 10.96.0.0/12 range, which is the service CIDR we configured. The first IP (10.96.0.1) always goes to the kubernetes API service, and kubeadm assigns the tenth IP (10.96.0.10) to cluster DNS. 
If you exec into any pod and check /etc/resolv.conf, you will see 10.96.0.10 as the nameserver.</p><p>After everything is set up, here is what the key directories look like on each node type.</p><p>On the control plane:</p><pre>tree /etc/kubernetes</pre><pre>/etc/kubernetes<br>├── admin.conf<br>├── controller-manager.conf<br>├── kubelet.conf<br>├── scheduler.conf<br>├── super-admin.conf<br>├── manifests<br>│   ├── etcd.yaml<br>│   ├── kube-apiserver.yaml<br>│   ├── kube-controller-manager.yaml<br>│   └── kube-scheduler.yaml<br>└── pki<br>    ├── ca.crt<br>    ├── ca.key<br>    ├── apiserver.crt<br>    ├── apiserver.key<br>    ├── ... (many more certificates)<br>    └── etcd<br>        ├── ca.crt<br>        ├── ca.key<br>        └── ... (etcd certificates)</pre><p>On a worker node:</p><pre>tree /etc/kubernetes</pre><pre>/etc/kubernetes<br>├── kubelet.conf<br>├── manifests<br>└── pki<br>    └── ca.crt</pre><p>The difference is significant. Worker nodes only need the kubelet.conf to talk to the API server and the CA certificate to verify the API server’s identity. They don’t have the private keys, admin credentials, or Static Pod manifests. This is by design. A compromised worker node should not have the keys to the kingdom.</p><p>Finally, let’s do a quick sanity check that everything is actually working:</p><pre>kubectl run test-pod --image=nginx --restart=Never<br>kubectl get pod test-pod -o wide</pre><pre>NAME       READY   STATUS    RESTARTS   AGE   IP           NODE     <br>test-pod   1/1     Running   0          30s   10.244.1.5   k8s-w1</pre><p>The pod got scheduled on a worker node (not the control plane, because of the taint) and received an IP from that node’s pod CIDR. If you can see this, your cluster is working properly.</p><p>Clean up:</p><pre>kubectl delete pod test-pod</pre><p>At this point, we have verified that our cluster is functioning correctly. 
The nodes are healthy, pod networking is working, services are reachable, and workloads can be scheduled. The cluster is ready for actual use.</p><h3>8. Troubleshooting Tips</h3><p>When things go wrong during cluster setup, the error messages are not always helpful. Here are the most common issues you will encounter and how to diagnose them.</p><h4>kubeadm join hangs at preflight checks</h4><p>This is probably the most frustrating issue because there is no error message. You run the join command and it just sits there doing nothing.</p><pre>kubeadm join --config=&quot;kubeadm-join.yaml&quot;<br>[preflight] Running pre-flight checks</pre><p>And then nothing happens. The cursor just blinks.</p><p>The problem is almost always network connectivity. The worker node cannot reach the API server on port 6443. Let me walk you through how to diagnose this.</p><p>First, test the API server connection directly from the worker node.</p><pre>curl -k https://192.168.10.100:6443/healthz</pre><p>If everything is working, you should see “ok” as the response. If you see “Could not connect to server” or the command hangs, you have a network problem.</p><p>Next, check if basic network connectivity exists. Can you even ping the control plane?</p><pre>ping -c 3 192.168.10.100</pre><p>If ping works but the curl command fails, the issue is likely a firewall blocking port 6443. This is exactly what happened in my lab environment. Ping was successful but the API server was unreachable.</p><pre># This worked fine<br>ping -c 3 192.168.10.100<br># 64 bytes from 192.168.10.100: icmp_seq=1 ttl=64 time=0.882 ms<br><br># But this failed<br>curl -k https://192.168.10.100:6443/healthz<br># curl: (7) Failed to connect to 192.168.10.100 port 6443</pre><p>On the control plane node, check if the firewall is active.</p><pre># For Rocky/CentOS<br>systemctl is-active firewalld<br><br># For Ubuntu<br>ufw status</pre><p>If firewalld shows “active”, that is your culprit. 
You can either disable it completely for lab environments or open the specific ports needed.</p><pre># Option 1: Disable firewall entirely (lab only)<br>systemctl disable --now firewalld<br><br># Option 2: Open only required ports (better for production)<br>firewall-cmd --permanent --add-port=6443/tcp<br>firewall-cmd --permanent --add-port=10250/tcp<br>firewall-cmd --reload</pre><p>One more thing to check is whether the API server is binding to the correct interface. On the control plane, run this command.</p><pre>ss -tlnp | grep 6443</pre><p>You want to see it listening on all interfaces, shown as *:6443 or 0.0.0.0:6443. If it shows 127.0.0.1:6443, the API server is only accepting local connections and external nodes cannot reach it.</p><p>In Vagrant environments with multiple network interfaces, there is another gotcha. If you did not specify node-ip in your kubeadm configuration, kubelet might register with the wrong IP address (typically 10.0.2.15 from the NAT interface instead of your host-only network IP). This causes all sorts of communication problems between nodes.</p><h4>Node stays in NotReady status</h4><p>After kubeadm init completes successfully, you check the node status and see this.</p><pre>kubectl get nodes<br># NAME      STATUS     ROLES           AGE   VERSION<br># k8s-ctr   NotReady   control-plane   5m    v1.32.11</pre><p>NotReady. This is not an error. This is expected behavior when you have not installed a CNI plugin yet.</p><p>Kubernetes needs a network plugin to assign IP addresses to pods and enable communication between them. Without CNI, the node reports that its network is not ready. You can confirm this by checking the node conditions.</p><pre>kubectl describe node k8s-ctr | grep -A5 Conditions</pre><p>Look for the NetworkUnavailable condition. It will show True, meaning the network is not configured.</p><p>The fix is simple. Install a CNI plugin like Flannel or Calico. 
Once the CNI DaemonSet is running on the node, the NetworkUnavailable condition changes to False and the node becomes Ready.</p><h4>CoreDNS pods stuck in Pending</h4><p>Right after running kubeadm init, you check the pods in the kube-system namespace.</p><pre>kubectl get pod -n kube-system<br># NAME                              READY   STATUS    RESTARTS   AGE<br># coredns-668d6bf9bc-n8jxf          0/1     Pending   0          3m<br># coredns-668d6bf9bc-z6h69          0/1     Pending   0          3m<br># etcd-k8s-ctr                      1/1     Running   0          3m<br># kube-apiserver-k8s-ctr            1/1     Running   0          3m<br># kube-controller-manager-k8s-ctr   1/1     Running   0          3m<br># kube-proxy-5p6jx                  1/1     Running   0          3m<br># kube-scheduler-k8s-ctr            1/1     Running   0          3m</pre><p>CoreDNS is Pending while everything else is Running. Why?</p><p>The control plane components (etcd, apiserver, controller-manager, scheduler) and kube-proxy all run with hostNetwork: true. They use the node&#39;s network directly, so they do not need pod IP addresses.</p><p>CoreDNS is different. It is a regular Deployment that needs a pod network IP. Without a CNI plugin, the only node stays NotReady and carries the node.kubernetes.io/not-ready:NoSchedule taint, so the scheduler has nowhere to place the pods.</p><p>This is the same root cause as the NotReady node issue. Install your CNI plugin and CoreDNS will start running within seconds.</p><h4>kubelet keeps restarting before init or join</h4><p>You installed kubeadm and kubelet, enabled the kubelet service, and now the logs are full of errors.</p><pre>systemctl status kubelet<br># Active: activating (auto-restart)<br><br>journalctl -u kubelet --no-pager<br># failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, <br># error: open /var/lib/kubelet/config.yaml: no such file or directory</pre><p>The kubelet is crashlooping. It starts, fails, waits 10 seconds, and tries again. 
Over and over.</p><p>This is completely normal. Do not try to fix it.</p><p>The kubelet needs configuration files that do not exist yet. Specifically, it needs /var/lib/kubelet/config.yaml and /etc/kubernetes/kubelet.conf. These files are created by kubeadm init on the control plane or kubeadm join on worker nodes.</p><p>The systemd service is configured with Restart=always, so kubelet will keep trying until the configuration exists. Once you run kubeadm init or kubeadm join, the files get created and kubelet starts properly.</p><p>If you look at the kubelet service drop-in file, you can see why this happens.</p><pre>cat /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf</pre><p>The service references environment files and configuration paths that kubeadm generates. Before kubeadm runs, those paths point to nothing.</p><h3>9. Conclusion</h3><p>We now have a working three-node Kubernetes cluster. One control plane node running etcd, the API server, controller manager, and scheduler. Two worker nodes ready to run application workloads. Flannel providing pod networking across all nodes.</p><p>The control plane components run as static pods, managed directly by kubelet through manifest files in /etc/kubernetes/manifests/. This is different from how you might run them as systemd services, but it has a nice property. The kubelet monitors these manifests and automatically restarts the containers if they crash.</p><p>Each node got its own pod CIDR allocation from the cluster’s 10.244.0.0/16 range. The control plane has 10.244.0.0/24, the first worker has 10.244.1.0/24, and the second worker has 10.244.2.0/24. Flannel set up VXLAN tunnels between nodes so pods can communicate across the cluster.</p><p>All the TLS certificates were generated automatically. The cluster CA, API server certificates, etcd certificates, front-proxy certificates for API aggregation, and service account signing keys. 
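</p><p>To see what was generated, here is a quick sketch to run on the control plane node. The paths are kubeadm&#39;s defaults, and the exact file list can vary with version and configuration.</p><pre># List the PKI material kubeadm wrote<br>ls /etc/kubernetes/pki /etc/kubernetes/pki/etcd<br><br># Show expiry dates for the kubeadm-managed certificates<br>kubeadm certs check-expiration<br><br># Inspect the SANs on the API server certificate<br>openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 &quot;Subject Alternative Name&quot;</pre><p>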
kubeadm even set up certificate rotation for kubelet so the credentials get renewed before they expire.</p><p>If you have done Kubernetes The Hard Way, you know how much work goes into setting up a cluster manually. Let me highlight what kubeadm automated for us.</p><p>For certificates, we would have written OpenSSL configuration files, generated a CA, then generated and signed certificates for every component. That is about 10 different certificate pairs with specific SANs and key usages. kubeadm did this in the certs phase with a single command.</p><p>For kubeconfig files, we would have used kubectl to create five separate files with embedded certificates and cluster endpoints. kubeadm generated all of them in the kubeconfig phase.</p><p>For etcd, we would have downloaded the binary, written a systemd service file, configured the data directory, and set up peer communication. kubeadm created a static pod manifest instead, which is actually easier to manage.</p><p>The same story for the API server, controller manager, and scheduler. Instead of systemd services, we get static pods with all the correct flags and volume mounts.</p><p>The worker node join process is where kubeadm really shines. In a manual setup, you would need to copy certificates to each worker, write kubeconfig files, configure kubelet, and start services. With kubeadm, the bootstrap token mechanism handles all of this. A new node can join with just a token and a CA hash. It fetches the cluster CA from a public ConfigMap, submits a certificate signing request, gets a signed certificate back, and configures itself. All automated.</p><p>Here is a quick comparison.</p><p>Manual setup for certificates: write ca.conf, generate CA, generate 10+ certificate pairs with correct SANs. kubeadm: kubeadm init phase certs all</p><p>Manual setup for etcd: download binary, write systemd unit, configure flags, manage data directory. 
kubeadm: static pod manifest in /etc/kubernetes/manifests/etcd.yaml</p><p>Manual worker join: copy certificates, write kubeconfig, configure kubelet, start services. kubeadm: kubeadm join with a token</p><p>One notable difference is how etcd communication is secured. In Kubernetes The Hard Way, etcd often runs with HTTP for simplicity during learning. kubeadm always configures etcd with mutual TLS. Every connection to etcd requires a client certificate, including health checks.</p><pre># Even health checks need certificates with kubeadm&#39;s etcd<br>etcdctl \<br>  --endpoints=https://127.0.0.1:2379 \<br>  --cacert=/etc/kubernetes/pki/etcd/ca.crt \<br>  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \<br>  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \<br>  endpoint health</pre><p>The cluster is functional but not production ready. Here are the natural next steps.</p><p>Monitoring is essential. You cannot operate what you cannot observe. kube-prometheus-stack gives you Prometheus for metrics collection, Grafana for dashboards, and Alertmanager for notifications. You should also monitor certificate expiration. kubeadm certificates are valid for one year by default, and you do not want to discover they expired when your cluster stops working.</p><p>Cluster upgrades are something kubeadm handles well. The kubeadm upgrade command can move your cluster from one Kubernetes version to the next. You upgrade the control plane first, then the worker nodes one by one. The process is documented but requires careful planning, especially around workload disruption.</p><p>High availability is the big gap in our current setup. We have a single control plane node, which means a single point of failure. For production, you want at least three control plane nodes with a load balancer in front. kubeadm supports this through the --control-plane-endpoint flag and the ability to join additional control plane nodes. 
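</p><p>A hedged sketch of what that could look like at init time. The DNS name k8s-api.example.internal is a hypothetical placeholder for a load balancer or virtual IP in front of the control plane nodes, and the pod CIDR matches the Flannel range used in this post.</p><pre># Assumption: k8s-api.example.internal resolves to your load balancer or VIP<br>kubeadm init \<br>  --control-plane-endpoint &quot;k8s-api.example.internal:6443&quot; \<br>  --upload-certs \<br>  --pod-network-cidr=10.244.0.0/16<br><br># Additional control plane nodes then join with the printed command, which has the form:<br># kubeadm join k8s-api.example.internal:6443 --token &lt;token&gt; \<br>#   --discovery-token-ca-cert-hash sha256:&lt;hash&gt; \<br>#   --control-plane --certificate-key &lt;key&gt;</pre><p>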
If you plan to go HA later, you should set the control plane endpoint from the beginning, because changing it afterward is painful.</p><p>Storage is another consideration. We did not configure any persistent storage. For stateful applications, you will need a storage solution like local-path-provisioner for simple cases, or something like Longhorn or Rook-Ceph for distributed storage.</p><h3>Appendix: Claude Code Skills for kubeadm</h3><p><a href="https://github.com/sigridjineth/claude-kubeadm-skills">GitHub - sigridjineth/claude-kubeadm-skills: A collection of Claude Code skills for setting up and troubleshooting Kubernetes clusters using kubeadm.</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Dop9s4RgMPGPRbvJ6jQ6DQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZP1OcGkZXSfPONwrmgfVuw.png" /></figure><p>This appendix explains a collection of skills designed for Claude Code that provide systematic guidance for Kubernetes cluster setup and troubleshooting. If you are not familiar with Claude Code skills, think of them as reference documents that Claude can consult when helping you with specific tasks.</p><p>Skills are markdown files that live in your ~/.claude/skills/ directory. When you ask Claude Code to help with a task, it can discover and read relevant skills to provide more accurate and contextual assistance. Instead of relying solely on general knowledge, Claude gets access to curated, tested procedures for specific scenarios.</p><p>For kubeadm operations, this matters because cluster setup involves many steps where the order matters and small mistakes cause cascading failures. A skill document captures the exact sequence of commands, the verification steps, and the common failure modes all in one place.</p><p>The collection contains four skills that map to the cluster lifecycle.</p><p>The first skill is kubeadm-prerequisites. This covers everything you need to do before running kubeadm. 
Disabling swap, loading kernel modules, configuring containerd with systemd cgroup driver, installing kubeadm and kubelet. When you tell Claude &quot;prepare this node for Kubernetes&quot; or &quot;I am getting CRI connection errors&quot;, this skill provides the reference.</p><p>The second skill is kubeadm-init. This covers control plane initialization. The 14 phases of kubeadm init, the critical configuration options like apiserver-advertise-address and pod-network-cidr, certificate generation, and post-init verification. When you need to understand why a particular certificate was created or what the control-plane-endpoint option does for HA, this skill has the answers.</p><p>The third skill is kubeadm-join. This covers adding nodes to an existing cluster. The bidirectional trust model, how TLS bootstrap works, token management, and the differences between joining a worker node versus an additional control plane node. When your join command hangs or you need to regenerate an expired token, this skill explains the process.</p><p>The fourth skill is kubeadm-troubleshooting. This is a systematic debugging guide organized by symptom. Node stuck in NotReady, CoreDNS pending, kubelet crashlooping, certificate errors. Each problem has a diagnostic sequence and solution.</p><p>Each skill follows a consistent format that makes it easy for both humans and Claude to find information quickly.</p><p>The frontmatter at the top contains the skill name and description. This is what triggers skill discovery. When you mention “kubeadm init” or “node won’t join the cluster”, Claude matches these phrases against skill descriptions.</p><pre>---<br>name: kubeadm-init<br>description: Control plane initialization with kubeadm init, covering the 14 phases, certificate setup, and configuration options<br>---</pre><p>The overview section states the core principle in one or two sentences. 
No fluff, just the essential concept.</p><p>The “when to use” section lists specific scenarios and symptoms. This is critical for skill discovery. Instead of vague descriptions like “helps with kubeadm”, it lists concrete situations like “kubeadm init fails with port 6443 already in use” or “need to add extra SANs to apiserver certificate”.</p><p>Quick reference tables provide scannable information for common operations. You do not need to read paragraphs of text to find the command for listing tokens.</p><pre>| Task | Command |<br>|------|---------|<br>| List tokens | kubeadm token list |<br>| Create new token | kubeadm token create --print-join-command |<br>| Delete token | kubeadm token delete &lt;token-id&gt; |</pre><p>The common issues section maps symptoms to causes to fixes. This is the most valuable part for troubleshooting.</p><pre>Symptom: kubeadm join hangs at preflight<br>Cause: Firewall blocking port 6443<br>Diagnosis: curl -k https://&lt;api-server&gt;:6443/healthz<br>Fix: systemctl disable --now firewalld</pre><p>Finally, a verification checklist tells you how to confirm the operation succeeded. After running kubeadm init, what should you check? Node status, pod status in kube-system, certificate files in /etc/kubernetes/pki.</p><p>When setting up a new cluster, you would work through the skills in order.</p><p>Start with kubeadm-prerequisites on all nodes. This ensures every machine has the correct kernel parameters, container runtime configuration, and kubeadm installation. The skill includes verification commands so you can confirm each step before moving on.</p><pre># Verify containerd is configured correctly<br>cat /etc/containerd/config.toml | grep SystemdCgroup<br># Should show: SystemdCgroup = true</pre><pre># Verify kernel modules are loaded<br>lsmod | grep br_netfilter<br># Should show br_netfilter in output</pre><p>Next, use kubeadm-init on your control plane node. 
The skill walks through configuration file creation, explains which options matter for your environment, and shows what to expect at each phase. After init completes, it tells you exactly what to verify.</p><pre># Post-init verification<br>kubectl get nodes<br># Should show control plane node (NotReady is expected before CNI)</pre><pre>kubectl get pods -n kube-system<br># Should show etcd, apiserver, controller-manager, scheduler running<br># CoreDNS will be Pending until CNI is installed</pre><p>Install your CNI plugin. The kubeadm-init skill includes the common CNI installation commands for Flannel and Calico.</p><p>Finally, use kubeadm-join on your worker nodes. The skill explains the join command options, what the bootstrap token does, and how to verify the node joined successfully.</p><p>The troubleshooting skill is organized by symptom because that is how you encounter problems. You do not know the cause yet. You just know that the node is NotReady or that kubeadm join is hanging.</p><p>Each symptom section follows a diagnostic sequence. Check this first, then this, then this. The sequence is ordered by likelihood based on real-world experience.</p><p>For example, the “kubeadm join hangs” section starts with API server connectivity because that is the most common cause. It provides the exact curl command to test, explains what each response means, and gives the fix for each scenario.</p><pre># Test 1: Can you reach the API server?<br>curl -k https://192.168.10.100:6443/healthz<br># &quot;ok&quot; = API server reachable, problem is elsewhere<br># Connection refused = API server not running or wrong address<br># Timeout = Firewall or routing issue<br><br># Test 2: Is it a firewall?<br># On control plane:<br>ss -tlnp | grep 6443<br># Should show LISTEN on *:6443<br><br>systemctl is-active firewalld<br># If &quot;active&quot;, firewall may be blocking</pre><p>The skill also includes less common issues that are harder to diagnose. 
Certificate SAN mismatches, token expiration, cgroup driver conflicts between kubelet and containerd. These problems produce confusing error messages, and having the symptom-to-cause mapping saves significant debugging time.</p><p>Skills are only useful if they reflect reality. The tested versions section documents which Kubernetes, containerd, and OS versions the procedures were validated against.</p><pre>| Component | Version |<br>|-----------|---------|<br>| Kubernetes | 1.32.x |<br>| containerd | 2.1.x |<br>| OS | Rocky Linux 10 / RHEL 9 |</pre><p>When Kubernetes or containerd releases a new version that changes behavior, the skills need updates. The most common changes are configuration file format changes (like containerd’s config.toml version 2 versus version 3) and new kubeadm phases or options.</p><p>If you encounter a scenario that the skill does not cover, that is feedback for improvement. Document the exact error message, the diagnostic steps you followed, and the solution you found. This becomes a new entry in the common issues section.</p><p>These skills follow principles from the superpowers skill methodology.</p><p>First, skills are optimized for discovery. The description and “when to use” sections contain the exact phrases someone would use when they have a problem. “Node shows NotReady” rather than “node status troubleshooting”. Claude searches skill content, so including the actual error messages and symptoms improves matching.</p><p>Second, skills are reference documents, not tutorials. They assume you have context and need specific information quickly. Tables for scanning, code blocks for copying, symptom-cause-fix mappings for troubleshooting. No lengthy explanations of concepts unless necessary for the fix.</p><p>Third, verification is mandatory. Every procedure ends with commands to confirm success. 
This catches problems early and builds confidence that the procedure worked.</p><p>Fourth, skills are tested with realistic scenarios before deployment. You do not write a skill from documentation alone. You set up a cluster, break it in various ways, and document what actually works to fix it.</p><p>Claude Code discovers skills automatically based on your request, but you can also reference them explicitly.</p><pre>Use the kubeadm-prerequisites skill to check if this node is ready for Kubernetes.<br>Check the kubeadm-troubleshooting skill for why this node is NotReady.</pre><p>This is useful when Claude’s automatic discovery does not find the right skill, or when you want to be explicit about which reference to use.</p><p>The skills reference several command-line tools that you should be familiar with.</p><p>kubeadm is the cluster bootstrapping tool. It runs on each node during setup.</p><p>kubectl is the cluster interaction tool. It runs from any machine with network access to the API server and a valid kubeconfig.</p><p>crictl is the container runtime debugging tool. It talks directly to containerd (or other CRI runtimes) without going through Kubernetes. It is useful for checking whether containers are actually running when kubectl cannot reach the API server.</p><p>systemctl and journalctl are the systemd service management and logging tools. kubelet and containerd run as systemd services, so these are essential for checking service status and reading logs.</p><pre># Check kubelet service status<br>systemctl status kubelet<br><br># Read kubelet logs<br>journalctl -u kubelet -f<br><br># Check containerd service<br>systemctl status containerd<br><br># Read containerd logs<br>journalctl -u containerd --no-pager</pre><p>These skills are derived from the tutorial content in this post. The step-by-step cluster setup, the certificate explanations, the troubleshooting scenarios. 
All of it was tested on Rocky Linux 10 with Kubernetes 1.32 and containerd 2.1.</p><p>The advantage of converting tutorial content into skills is reusability. A tutorial is read once during learning. A skill is consulted repeatedly during actual work. The format is different because the use case is different.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ralph is a form of porn]]></title>
            <link>https://sigridjin.medium.com/ralph-is-a-form-of-porn-ffc027b93c95?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/ffc027b93c95</guid>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 00:20:22 GMT</pubDate>
            <atom:updated>2026-01-22T00:20:22.638Z</atom:updated>
            <content:encoded><![CDATA[<h4>the ralph technique is a form of engineering porn because it presents a seductive, almost hyper-real vision of the future where software builds itself while you sleep</h4><p>the ralph technique, popularized by geoffrey huntley, represents a fundamental shift from manual software development toward autonomous ai orchestration. named after the simpsons character who is often misunderstood, the concept suggests that even a seemingly clumsy or non-linear ai can achieve professional results if it is placed inside a continuous, automated loop.</p><p>technically, ralph is often just a simple bash loop that feeds a set of instructions into a coding agent and allows it to run indefinitely. it operates on the philosophy of eventual consistency, where the ai writes code, runs tests, fails, and tries again until the software settles into a functional state. huntley describes this as treating software like clay on a pottery wheel; the engineer’s job is not to shape the clay by hand, but to tune the speed and direction of the wheel by refining the prompts that guide the ai.</p><p>while the speed and autonomy of this method are impressive, it faces significant criticism regarding its lack of human oversight. many argue that unless a goal is strictly deterministic, meaning there is a clear, mathematical way to verify success like a passing test suite, an unsupervised loop can quickly spiral out of control. without a human in the loop to provide sanity checks, an ai might find “hacky” ways to pass tests while producing code that is unmaintainable, insecure, or structurally flawed. critics point out that an ai lacks the long-term intuition to recognize when a specific architectural path is a dead end. 
it might spend hours refining a solution that a human engineer would have discarded in seconds for being fundamentally wrong.</p><p>for ralph to be truly effective in a professional environment, most believe there must be human intervention at critical decision points. while the loop can handle the repetitive labor of writing and fixing code, a human is still needed to act as the strategic judge. this human-in-the-loop approach ensures that the ai doesn’t just produce something that “works” on a surface level, but something that actually aligns with the broader goals of the project. in this new era, the role of the software engineer is evolving from a builder of code into an orchestrator of systems who knows exactly when to let the loop run and when to step in to provide the moral and technical compass necessary for high-quality engineering.</p><p><a href="https://github.com/anthropics/original_performance_takehome">GitHub - anthropics/original_performance_takehome: Anthropic&#39;s original performance take-home, now open for you to try!</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible Basics Lab: A Hands-On Guide for Infrastructure Automation]]></title>
            <link>https://sigridjin.medium.com/ansible-basics-lab-a-hands-on-guide-for-infrastructure-automation-7bd6ec84e07c?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/7bd6ec84e07c</guid>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 17 Jan 2026 12:26:18 GMT</pubDate>
            <atom:updated>2026-01-17T12:26:18.546Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/875/0*GYCEpy973rZ5-Ck1.jpeg" /></figure><h3>Before Reading This Lab: Use Ansible Claude Skills</h3><p><a href="https://github.com/sigridjineth/hello-ansible-skills">GitHub - sigridjineth/hello-ansible-skills: Ansible Playbook Tutorial with Basic Claude Skills</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lI17rcFmqDb30kaaJU1pZg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AnzY1Jx2_mL2ETvernLahQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PoItZZ3mmR1upBDi-HrmDQ.png" /></figure><p>this repository contains custom skills for claude code that help you work with ansible more effectively. when you place skill files in the .claude/skills/ directory of your project, claude code reads them and uses that knowledge to provide better assistance when you&#39;re working on ansible-related tasks.</p><p>the idea is simple: instead of explaining ansible best practices every time you ask for help, you encode that knowledge into skill files once, and claude code applies it automatically. this means more consistent outputs, fewer mistakes, and playbooks that follow established patterns from the start.</p><h4>what skills are included</h4><p>the repository includes four main skill files, each covering a different aspect of ansible development.</p><p>the first skill is ansible-playbook.md, which covers playbook authoring best practices. it explains how to structure your project directories, including where to put inventory files, group variables, host variables, and roles. it describes naming conventions for tasks and variables, how to use tags effectively, and when to use handlers versus regular tasks. 
it also covers security topics like using ansible-vault for sensitive data, and testing strategies including check mode, diff mode, and molecule for role testing.</p><p>the second skill is ansible-modules.md, which serves as a quick reference for commonly used ansible modules. instead of looking up documentation every time, claude code can reference this skill to use the right module for each task. it covers file operations like creating directories, copying files, and modifying configuration files. it covers package management across different distributions, service management with systemd, user and group management, downloading files from the web, making api calls, and working with archives.</p><p>the third skill is shell-to-ansible.md, which helps convert legacy shell scripts into proper ansible playbooks. this is probably the most practically useful skill for teams migrating from manual scripts to infrastructure as code. it provides a mapping table showing which shell command corresponds to which ansible module. for example, mkdir -p becomes the file module with state: directory, apt-get install becomes the apt module, and systemctl restart becomes the service module. the skill includes a complete worked example showing a 70-line deployment script being converted into a structured playbook with proper error handling, handlers, and templates.</p><p>the fourth skill is ansible-interactive.md, which defines a conversational workflow for building ansible projects step by step. this is useful when you&#39;re starting from scratch and want claude code to guide you through the process. 
it breaks down the development into phases: first analyzing your environment and gathering information about your servers, then creating the initial project structure and inventory, testing connectivity, writing a simple playbook, gradually adding features based on your requirements, refactoring into roles when the code gets complex enough, and finally documenting everything properly.</p><h4>how to use it</h4><p>first clone the repository to your local machine. then open the directory with claude code. once you’re in the project, you can start asking for ansible-related help in natural language.</p><p>for example, you might say “create a playbook that installs nginx and configures it as a reverse proxy”. claude code will read the skills and generate a playbook that follows the directory structure conventions, uses appropriate modules instead of shell commands, includes proper error handling, and sets up handlers for service restarts.</p><p>another example: you could paste a shell script and ask “convert this to ansible”. claude code will use the shell-to-ansible skill to map each command to the appropriate module, extract hardcoded values into variables, add idempotency where the original script lacked it, and structure the result as a proper playbook or role.</p><p>you can also take the interactive approach by saying something like “help me set up ansible for my servers step by step”. claude code will walk you through the phases defined in the interactive skill, asking about your server inventory, testing connections, and building up the automation gradually.</p><h4>why this matters</h4><p>the combination of claude code and ansible is powerful because it addresses the main pain points of infrastructure automation. writing ansible playbooks requires knowing which modules exist, what parameters they accept, and how to structure everything properly. this knowledge takes time to acquire and is easy to forget. 
by encoding it into skills, you get consistent, high-quality outputs without having to remember everything yourself.</p><p>the shell-to-ansible conversion skill is particularly valuable for teams with legacy automation. most organizations have accumulated shell scripts over the years that work but are fragile, not idempotent, and hard to maintain. converting them to ansible manually is tedious and error-prone. having claude code do the heavy lifting while following a consistent conversion methodology speeds up the migration significantly.</p><p>the interactive workflow skill recognizes that infrastructure automation is often an iterative process. you don’t always know exactly what you need upfront. being able to develop incrementally, testing each step before moving on, reduces the risk of building something that doesn’t work in your specific environment.</p><h3>Introduction</h3><p>This post comes from Week 2 of the K8S Deploy study hosted by the CloudNet@ community. The focus this week is <strong>Ansible Basics</strong>, and I’ve put together everything we covered into this comprehensive guide.</p><p>Last week, we went through Kubernetes The Hard Way, manually building a Kubernetes cluster from scratch. It was painful, tedious, and honestly? Eye-opening. You really start to appreciate automation tools when you’ve had to do everything by hand.</p><p>This week, we flip the script. Instead of doing things manually, we’re diving into Ansible, an open-source automation tool that lets you manage multiple servers over SSH without installing agents on them. You write playbooks (basically scripts in YAML format), and Ansible handles the rest. 
It’s declarative, it’s idempotent, and once you get the hang of it, you’ll wonder how you ever lived without it.</p><p>By the end of this guide, you’ll understand:</p><ul><li>What Ansible is and how it works</li><li>How to write playbooks with variables, loops, and conditionals</li><li>How to handle errors gracefully</li><li>How to structure reusable code with roles</li><li>How to leverage Ansible Galaxy for community-contributed roles</li></ul><p>Let’s get started.</p><h3>Lab Environment Setup</h3><p>Before we touch any Ansible commands, we need an environment to play with. For this tutorial, we’re spinning up four virtual machines using Vagrant:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AQyH9lvyZBVAfKeCGMPhtg.png" /></figure><p>The <strong>server</strong> node is our Ansible control node: this is where we’ll install Ansible and run all our commands from. The three <strong>tnode</strong> machines are managed nodes: these are the servers Ansible will configure.</p><p>Notice that tnode3 runs Rocky Linux instead of Ubuntu. This is intentional. In the real world, you often deal with heterogeneous environments, and Ansible needs to handle that gracefully. 
We’ll see how to write playbooks that work across different operating systems.</p><h3>Spinning Up the Environment</h3><p>Create a directory and download the necessary files:</p><pre>mkdir ansible<br>cd ansible<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/Vagrantfile<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/init_cfg.sh<br>curl -O https://raw.githubusercontent.com/gasida/vagrant-lab/refs/heads/main/ansible/init_cfg2.sh<br>vagrant up</pre><p>Here’s what each file does:</p><p><strong>Vagrantfile</strong>: The Vagrant configuration that defines all four VMs: their resources, network settings, and which initialization script to run.</p><p><strong>init_cfg.sh</strong>: Initialization script for Ubuntu nodes (server, tnode1, tnode2). It sets the timezone, disables AppArmor and the ufw firewall, installs essential packages, configures passwords, enables SSH password authentication and root login, and populates /etc/hosts with all node entries.</p><p><strong>init_cfg2.sh</strong>: Same thing, but for Rocky Linux (tnode3). 
Uses yum instead of apt, disables firewalld and SELinux instead of AppArmor and ufw.</p><p>After vagrant up completes, you should see all four VMs running in VirtualBox.</p><h3>Verifying the Environment</h3><p>SSH into the server node and run some basic checks:</p><pre># Check current user<br>whoami<br># root<br><br># Check user ID and groups<br>id<br># uid=0(root) gid=0(root) groups=0(root)<br><br># Check kernel version<br>uname -r<br># 6.8.0-86-generic<br><br># Check system info<br>hostnamectl<br># Static hostname: server<br># Operating System: Ubuntu 24.04.3 LTS<br># Kernel: Linux 6.8.0-86-generic<br># Architecture: arm64<br><br># Check memory<br>free -h<br># total: 1.3Gi, used: 226Mi, free: 515Mi, available: 1.1Gi<br><br># Check disk<br>lsblk<br>df -hT /<br># Filesystem: /dev/mapper/ubuntu--vg-ubuntu--lv<br># Type: ext4, Size: 30G, Used: 5.3G, Avail: 24G<br><br># Check network interfaces<br>ip -c addr<br># eth0: 10.0.2.15/24 (NAT network)<br># eth1: 10.10.1.10/24 (private network - this is what we&#39;ll use)<br><br># Verify /etc/hosts has all nodes<br>cat /etc/hosts<br># 127.0.0.1 localhost<br># 10.10.1.10 server<br># 10.10.1.11 tnode1<br># 10.10.1.12 tnode2<br># 10.10.1.13 tnode3<br><br># Test connectivity to all nodes<br>for i in {1..3}; do ping -c 1 tnode$i; done<br># All should respond with 0% packet loss</pre><p>If everything checks out, you’re ready to install Ansible.</p><h3>Ansible Overview</h3><p>Before we start typing commands, let’s understand what we’re working with.</p><p>Ansible follows a simple architecture:</p><pre>┌─────────────────────────────────────┐<br>│         Control Node (server)       │<br>│  ┌───────────┐  ┌────────────────┐  │<br>│  │ Inventory │  │   Playbooks    │  │<br>│  │ (hosts)   │  │ (YAML scripts) │  │<br>│  └───────────┘  └────────────────┘  │<br>│         │              │            │<br>│         └──────┬───────┘            │<br>│                ▼                    │<br>│          ┌──────────┐               │<br>│       
   │ Modules  │               │<br>│          └──────────┘               │<br>└────────────────┬────────────────────┘<br>                 │ SSH<br>    ┌────────────┼────────────┐<br>    ▼            ▼            ▼<br>┌────────┐  ┌────────┐  ┌────────┐<br>│ tnode1 │  │ tnode2 │  │ tnode3 │<br>└────────┘  └────────┘  └────────┘<br>    Managed Nodes (no agent needed)</pre><p>The <strong>control node</strong> is where Ansible lives. It has three main components:</p><ul><li><strong>Inventory</strong>: A list of servers you want to manage</li><li><strong>Playbooks</strong>: YAML files that describe what you want to do</li><li><strong>Modules</strong>: The actual code that performs tasks (install packages, copy files, restart services, etc.)</li></ul><p>The <strong>managed nodes</strong> are the servers you’re configuring. Here’s the beautiful part: they don’t need any special software installed. As long as they have SSH access and Python available, Ansible can manage them.</p><p>A few things make Ansible stand out:</p><p><strong>Agentless</strong>: No daemon running on managed nodes. No ports to open (besides SSH). No agent to update. This alone eliminates a whole class of operational headaches.</p><p><strong>Idempotent</strong>: Run the same playbook ten times, and you get the same result. Ansible checks the current state before making changes. If a package is already installed, it won’t try to install it again. If a file already has the right content, it won’t rewrite it. This makes Ansible safe to run repeatedly.</p><p><strong>YAML-based</strong>: Playbooks are written in YAML, which is human-readable. No need to learn a new programming language. 
If you can read a config file, you can read a playbook.</p><p><strong>Massive module library</strong>: There are modules for just about everything — managing packages, files, services, users, cloud resources, databases, network devices, you name it.</p><p>There are two flavors of Ansible:</p><p><strong>Community Edition</strong>: Free, open-source, what we’re using in this tutorial. You install it on any Linux machine and you’re good to go.</p><p><strong>Red Hat Ansible Automation Platform</strong>: The enterprise version. Adds features like a web UI, role-based access control, centralized logging, and official support. Requires a subscription.</p><p>For learning purposes and small-to-medium deployments, the community edition is more than enough.</p><h3>Key Ansible Concepts</h3><p>Let’s define the terminology you’ll encounter throughout this guide.</p><p>An inventory is simply a list of hosts that Ansible manages. At its most basic, it’s a text file with hostnames or IP addresses, one per line:</p><pre>web1.example.com<br>web2.example.com<br>db1.example.com<br>192.168.1.50</pre><p>But inventories can get sophisticated. You can organize hosts into groups, define variables per host or per group, and even nest groups within groups. We’ll explore all of this shortly.</p><p>A playbook is a YAML file containing one or more <strong>plays</strong>. Each play maps a set of hosts to a set of tasks. When you run a playbook, Ansible executes the tasks on the specified hosts in order.</p><p>Think of a playbook as a recipe. It describes the desired state of your infrastructure, “these packages should be installed, these files should exist, these services should be running”, and Ansible figures out how to get there.</p><p>A play is a single unit within a playbook. It specifies:</p><ul><li>Which hosts to target</li><li>What tasks to run on those hosts</li><li>Any variables, handlers, or roles to use</li></ul><p>A playbook can contain multiple plays. 
For example, you might have one play that configures web servers and another play that configures database servers, all in the same file.</p><p>A task is a single action: install a package, copy a file, restart a service, create a user. Each task calls a module with specific arguments.</p><pre>- name: Install nginx<br>  ansible.builtin.apt:<br>    name: nginx<br>    state: present</pre><p>The name field is a human-readable description. The rest specifies which module to use (ansible.builtin.apt) and what arguments to pass.</p><p>Modules are the workhorses of Ansible. They’re units of code that perform specific actions. When you write a task, you’re really just telling Ansible which module to run and with what parameters.</p><p>Some commonly used modules:</p><ul><li>ansible.builtin.apt / ansible.builtin.yum: Package management</li><li>ansible.builtin.copy: Copy files to remote hosts</li><li>ansible.builtin.template: Copy files with variable substitution</li><li>ansible.builtin.service: Manage services (start, stop, restart)</li><li>ansible.builtin.user: Manage user accounts</li><li>ansible.builtin.file: Manage files and directories</li><li>ansible.builtin.shell: Run shell commands</li></ul><p>Handlers are special tasks that only run when notified. They’re typically used for actions that should only happen if something changed — like restarting a service after its config file was modified.</p><pre>tasks:<br>  - name: Copy nginx config<br>    ansible.builtin.copy:<br>      src: nginx.conf<br>      dest: /etc/nginx/nginx.conf<br>    notify: Restart nginx<br><br>handlers:<br>  - name: Restart nginx<br>    ansible.builtin.service:<br>      name: nginx<br>      state: restarted</pre><p>If the config file changes, the handler runs. If the file was already identical, the handler is skipped. This prevents unnecessary service restarts.</p><p>Roles are a way to organize playbooks into reusable components. 
Instead of having one giant playbook, you break it into roles like “webserver”, “database”, “monitoring”, each with its own tasks, handlers, variables, and files.</p><p>Roles follow a standard directory structure, making them easy to share and reuse. You can publish roles to Ansible Galaxy for others to use, or download roles that others have published.</p><p>Plugins extend Ansible’s core functionality. There are plugins for:</p><ul><li><strong>Connection</strong>: How Ansible connects to hosts (SSH, WinRM, Docker, etc.)</li><li><strong>Lookup</strong>: Retrieve data from external sources</li><li><strong>Filter</strong>: Transform data in templates</li><li><strong>Callback</strong>: Customize output and logging</li></ul><p>Collections are the distribution format for Ansible content. A collection can include modules, plugins, roles, and playbooks, everything packaged together. This allows vendors and developers to distribute their Ansible content independently from Ansible’s release cycle.</p><h3>Installing Ansible</h3><p>Alright, enough theory. 
Let’s install Ansible on our control node.</p><p>SSH into the server VM and run:</p><pre># Confirm we&#39;re root in the right directory<br>whoami<br># root<br><br>pwd<br># /root<br><br># Check Python version (Ansible requires Python)<br>python3 --version<br># Python 3.12.3<br><br># Install prerequisite for adding PPAs<br>apt install software-properties-common -y<br><br># Add the official Ansible PPA<br>add-apt-repository --yes --update ppa:ansible/ansible<br><br># Install Ansible<br>apt install ansible -y<br><br># Verify installation<br>ansible --version<br># ansible [core 2.19.5]<br>#   config file = /etc/ansible/ansible.cfg<br>#   python version = 3.12.3</pre><p>The output shows key information:</p><ul><li>Ansible core version (2.19.5)</li><li>Default config file location (/etc/ansible/ansible.cfg)</li><li>Python version Ansible is using</li></ul><p>Let’s also create a working directory for our exercises:</p><pre>mkdir my-ansible<br>cd my-ansible</pre><p>This is where we’ll store our inventory, playbooks, and roles.</p><h3>SSH Authentication Setup</h3><p>Ansible connects to managed nodes via SSH. While you can use password authentication, SSH key-based authentication is the way to go for automation. Let’s set it up.</p><p>First, a quick primer on what’s in the ~/.ssh directory:</p><pre>tree ~/.ssh<br># /root/.ssh<br># ├── authorized_keys  # Public keys allowed to connect TO this machine<br># ├── id_rsa           # Your private key (never share this!)<br># ├── id_rsa.pub       # Your public key (safe to distribute)<br># └── known_hosts      # Fingerprints of servers you&#39;ve connected to</pre><p>The id_rsa (private key) and id_rsa.pub (public key) pair work together. 
You put your public key on servers you want to access, and you use your private key to prove your identity.</p><pre>ssh-keygen -t rsa -N &quot;&quot; -f /root/.ssh/id_rsa<br># Generating public/private rsa key pair.<br># Your identification has been saved in /root/.ssh/id_rsa<br># Your public key has been saved in /root/.ssh/id_rsa.pub</pre><p>The -N &quot;&quot; flag sets an empty passphrase. For automated systems, this is common (though in high-security environments, you might use ssh-agent instead).</p><p>Now we copy our public key to all three managed nodes:</p><pre>for i in {1..3}; do <br>  sshpass -p &#39;qwe123&#39; ssh-copy-id -o StrictHostKeyChecking=no root@tnode$i<br>done<br># tnode1: Number of key(s) added: 1<br># tnode2: Number of key(s) added: 1<br># tnode3: Number of key(s) added: 1</pre><p>The sshpass utility provides the password non-interactively (useful for bootstrapping, not recommended for production). The StrictHostKeyChecking=no flag accepts host keys automatically.</p><p>Let’s confirm we can now SSH without passwords:</p><pre>for i in {1..3}; do <br>  echo &quot;&gt;&gt; tnode$i &lt;&lt;&quot;<br>  ssh tnode$i hostname<br>  echo<br>done<br># &gt;&gt; tnode1 &lt;&lt;<br># tnode1<br>#<br># &gt;&gt; tnode2 &lt;&lt;<br># tnode2<br>#<br># &gt;&gt; tnode3 &lt;&lt;<br># tnode3</pre><p>All three nodes respond without prompting for a password.</p><p>Ansible needs Python on managed nodes to execute modules:</p><pre>for i in {1..3}; do <br>  echo &quot;&gt;&gt; tnode$i &lt;&lt;&quot;<br>  ssh tnode$i python3 -V<br>  echo<br>done<br># &gt;&gt; tnode1 &lt;&lt;<br># Python 3.12.3<br>#<br># &gt;&gt; tnode2 &lt;&lt;<br># Python 3.12.3<br>#<br># &gt;&gt; tnode3 &lt;&lt;<br># Python 3.9.21</pre><p>All nodes have Python installed. Note that tnode3 (Rocky Linux) has a slightly older version — that’s fine, Ansible handles it.</p><h3>Inventory Basics</h3><p>The inventory tells Ansible which hosts to manage. 
Let’s explore different ways to structure it.</p><p>The most basic inventory is just a list of hosts:</p><pre>cat &lt;&lt;EOT &gt; inventory<br>10.10.1.11<br>10.10.1.12<br>10.10.1.13<br>EOT</pre><p>Or using hostnames (since we have /etc/hosts configured):</p><pre>cat &lt;&lt;EOT &gt; inventory<br>tnode1<br>tnode2<br>tnode3<br>EOT</pre><p>Verify the inventory parses correctly:</p><pre>ansible-inventory -i ./inventory --list | jq</pre><p>In real environments, you’ll want to organize hosts by role. Groups are defined with square brackets:</p><pre>[webservers]<br>web1.example.com<br>web2.example.com<br><br>[dbservers]<br>db1.example.com<br>db2.example.com</pre><p>A host can belong to multiple groups:</p><pre>[webservers]<br>web1.example.com<br>web2.example.com<br><br>[dbservers]<br>db01.example.com<br>db02.example.com<br><br>[production]<br>web1.example.com<br>web2.example.com<br>db01.example.com<br>db02.example.com<br><br>[development]<br>192.168.0.42</pre><p>You can create groups of groups using the :children suffix:</p><pre>[webservers]<br>web1.example.com<br>web2.example.com<br><br>[dbservers]<br>db01.example.com<br>db02.example.com<br><br>[datacenter:children]<br>webservers<br>dbservers</pre><p>Now datacenter contains all hosts from both webservers and dbservers.</p><h3>Real-World Example: Kubespray</h3><p>Here’s what a production inventory might look like (from Kubespray):</p><pre>[all]<br>master01 ansible_host=192.168.10.10 ip=192.168.10.10 ansible_user=root<br>worker01 ansible_host=192.168.10.11 ip=192.168.10.11 ansible_user=root<br>worker02 ansible_host=192.168.10.12 ip=192.168.10.12 ansible_user=root<br><br>[kube_control_plane]<br>master01<br><br>[etcd]<br>master01<br><br>[kube_node]<br>worker01<br>worker02<br><br>[k8s_cluster:children]<br>kube_control_plane<br>kube_node</pre><p>This inventory defines individual hosts with variables, groups them by function (control plane, etcd, worker nodes), and creates a parent group for the entire cluster.</p><p>For large 
inventories, you can use ranges instead of listing every host:</p><pre>[webservers]<br>web[1:20].example.com      # web1 through web20<br><br>[databases]<br>db[01:05].example.com      # db01 through db05 (zero-padded)<br><br>[alphabet]<br>[a:f].example.com          # a.example.com through f.example.com</pre><h3>Our Lab Inventory</h3><p>Let’s create the inventory we’ll use for the rest of this tutorial:</p><pre>cat &lt;&lt;EOT &gt; inventory<br>[web]<br>tnode1 ansible_python_interpreter=/usr/bin/python3<br>tnode2 ansible_python_interpreter=/usr/bin/python3<br><br>[db]<br>tnode3 ansible_python_interpreter=/usr/bin/python3<br><br>[all:children]<br>web<br>db<br>EOT</pre><p>A few things to note:</p><ul><li>We explicitly set ansible_python_interpreter to avoid warnings about Python auto-detection</li><li>tnode1 and tnode2 are in the web group</li><li>tnode3 is in the db group</li><li>The all group (which is implicit anyway) contains both groups</li></ul><p>Verify the inventory:</p><pre>ansible-inventory -i ./inventory --list | jq<br># Shows all hosts organized by group<br><br>ansible-inventory -i ./inventory --graph<br># @all:<br>#   |--@ungrouped:<br>#   |--@web:<br>#   |  |--tnode1<br>#   |  |--tnode2<br>#   |--@db:<br>#   |  |--tnode3</pre><h3>Ansible Configuration (ansible.cfg)</h3><p>Every time we run an Ansible command, we’ve been specifying -i ./inventory. That gets old fast. Let&#39;s create a configuration file to set defaults.</p><pre>cat &lt;&lt;EOT &gt; ansible.cfg<br>[defaults]<br>inventory = ./inventory<br>remote_user = root<br>ask_pass = false<br><br>[privilege_escalation]<br>become = true<br>become_method = sudo<br>become_user = root<br>become_ask_pass = false<br>EOT</pre><p><strong>[defaults] </strong>Basic operational settings:</p><ul><li>inventory: Path to the inventory file. Now we can skip the -i flag.</li><li>remote_user: The user Ansible logs in as on remote hosts.</li><li>ask_pass: Whether to prompt for SSH password. 
We use key-based auth, so false.</li></ul><p><strong>[privilege_escalation] </strong>How Ansible gains elevated privileges:</p><ul><li>become: Enable privilege escalation (like sudo).</li><li>become_method: How to escalate (sudo, su, etc.).</li><li>become_user: Which user to become (usually root).</li><li>become_ask_pass: Whether to prompt for the sudo password.</li></ul><p>Ansible looks for configuration in this order (first match wins):</p><ol><li>ANSIBLE_CONFIG environment variable (if set)</li><li>./ansible.cfg (current directory)</li><li>~/.ansible.cfg (home directory)</li><li>/etc/ansible/ansible.cfg (system-wide default)</li></ol><p>This is why we created ansible.cfg in our working directory — it takes precedence over the system-wide config.</p><p>Now verify that the configuration is being picked up:</p><pre>ansible-inventory --list | jq<br># Should work without -i flag now</pre><h3>Ad-Hoc Commands</h3><p>Ad-hoc commands let you run quick, one-off tasks without writing a playbook. They’re great for testing, troubleshooting, or simple operations.</p><pre>ansible &lt;host-pattern&gt; -m &lt;module&gt; -a &quot;&lt;arguments&gt;&quot;</pre><p>The ping module checks if Ansible can connect to hosts and execute Python:</p><pre>ansible -m ping web<br># tnode1 | SUCCESS =&gt; {<br>#     &quot;changed&quot;: false,<br>#     &quot;ping&quot;: &quot;pong&quot;<br># }<br># tnode2 | SUCCESS =&gt; {<br>#     &quot;changed&quot;: false,<br>#     &quot;ping&quot;: &quot;pong&quot;<br># }<br><br>ansible -m ping db<br># tnode3 | SUCCESS =&gt; {<br>#     &quot;changed&quot;: false,<br>#     &quot;ping&quot;: &quot;pong&quot;<br># }<br><br>ansible -m ping all<br># All three nodes respond</pre><p>Note: This is <strong>not</strong> an ICMP ping. 
It’s an Ansible module that connects via SSH and runs a small Python script to verify everything works.</p><p>The shell module executes arbitrary commands:</p><pre>ansible -m shell -a &quot;uptime&quot; db<br># tnode3 | CHANGED | rc=0 &gt;&gt;<br>#  00:04:22 up  1:54,  1 user,  load average: 0.00, 0.00, 0.00<br><br>ansible -m shell -a &quot;free -h&quot; web<br># tnode1 | CHANGED | rc=0 &gt;&gt;<br>#                total        used        free      shared  buff/cache   available<br># Mem:           1.3Gi       230Mi       909Mi       4.8Mi       266Mi       1.1Gi<br># Swap:          3.7Gi          0B       3.7Gi<br># tnode2 | CHANGED | rc=0 &gt;&gt;<br># ...<br><br>ansible -m shell -a &quot;tail -n 3 /etc/passwd&quot; all<br># Shows last 3 lines of /etc/passwd on all hosts</pre><p>By default, Ansible uses the remote_user from ansible.cfg. You can override it with -u:</p><pre># Try connecting as vagrant without password<br>ansible -m ping web -u vagrant<br># Permission denied (publickey,password).<br><br># Now with password authentication<br>ansible -m ping web -u vagrant --ask-pass<br># SSH password: [enter qwe123]<br># tnode1 | SUCCESS =&gt; ...<br># tnode2 | SUCCESS =&gt; ...</pre><h3>Your First Playbook</h3><p>Ad-hoc commands are useful, but playbooks are where Ansible really shines. Let’s write our first one.</p><p>A playbook is a YAML file containing a list of plays:</p><pre>---<br>- hosts: all           # Which hosts to target<br>  tasks:               # List of tasks to run<br>    - name: Task 1     # Human-readable description<br>      module_name:     # Which module to use<br>        arg1: value1   # Module arguments<br>        arg2: value2</pre><p>Here’s a minimal concrete example:</p><pre>cat &gt; first-playbook.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Print message<br>      debug:<br>        msg: Hello CloudNet@ Ansible Study<br>EOF</pre><p>The debug module simply prints a message. 
It&#39;s useful for troubleshooting and learning.</p><p>Before running, let’s verify the syntax:</p><pre>ansible-playbook --syntax-check first-playbook.yml<br># playbook: first-playbook.yml</pre><p>No errors. Now let’s see what happens with broken YAML:</p><pre>cat &gt; first-playbook-with-error.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Print message<br>      debug:<br>      msg: Hello CloudNet@ Ansible Study<br>EOF<br><br>ansible-playbook --syntax-check first-playbook-with-error.yml<br># ERROR! conflicting action statements: debug, msg</pre><p>The error shows exactly where the problem is (the msg should be indented under debug).</p><pre>ansible-playbook first-playbook.yml</pre><p>Output:</p><pre>PLAY [all] ******************************************************************<br><br>TASK [Gathering Facts] ******************************************************<br>ok: [tnode1]<br>ok: [tnode2]<br>ok: [tnode3]<br><br>TASK [Print message] ********************************************************<br>ok: [tnode1] =&gt; {<br>    &quot;msg&quot;: &quot;Hello CloudNet@ Ansible Study&quot;<br>}<br>ok: [tnode2] =&gt; {<br>    &quot;msg&quot;: &quot;Hello CloudNet@ Ansible Study&quot;<br>}<br>ok: [tnode3] =&gt; {<br>    &quot;msg&quot;: &quot;Hello CloudNet@ Ansible Study&quot;<br>}<br><br>PLAY RECAP ******************************************************************<br>tnode1    : ok=2    changed=0    unreachable=0    failed=0    skipped=0<br>tnode2    : ok=2    changed=0    unreachable=0    failed=0    skipped=0<br>tnode3    : ok=2    changed=0    unreachable=0    failed=0    skipped=0</pre><p>Let’s break down what happened:</p><ol><li><strong>Gathering Facts</strong>: Ansible automatically collects information about each host</li><li><strong>Print message</strong>: Our task runs on each host</li><li><strong>Play Recap</strong>: Summary showing success/failure counts</li></ol><p>Let’s write something more useful — a playbook that restarts the SSH 
service:</p><pre>cat &gt; restart-sshd.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Restart sshd service<br>      ansible.builtin.service:<br>        name: ssh<br>        state: restarted<br>EOF</pre><p>But wait! There’s a problem. On Debian/Ubuntu, the service is called ssh. On RHEL/Rocky, it&#39;s called sshd. Let&#39;s run it and see what happens:</p><pre>ansible-playbook restart-sshd.yml<br># ...<br># fatal: [tnode3]: FAILED! =&gt; {&quot;changed&quot;: false, &quot;msg&quot;: &quot;Could not find the requested service ssh: host&quot;}<br># tnode1 and tnode2 succeed, tnode3 fails</pre><h3>Handling OS Differences with Conditionals</h3><p>We need to use different service names based on the OS. Ansible collects OS information as “facts” that we can use in conditionals:</p><pre>cat &gt; restart-sshd.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Restart SSH on Debian<br>      ansible.builtin.service:<br>        name: ssh<br>        state: restarted<br>      when: ansible_facts[&#39;os_family&#39;] == &#39;Debian&#39;<br>    - name: Restart SSH on RedHat<br>      ansible.builtin.service:<br>        name: sshd<br>        state: restarted<br>      when: ansible_facts[&#39;os_family&#39;] == &#39;RedHat&#39;<br>EOF<br><br>ansible-playbook restart-sshd.yml</pre><p>Output:</p><pre>TASK [Restart SSH on Debian] ************************************************<br>skipping: [tnode3]<br>changed: [tnode1]<br>changed: [tnode2]<br><br>TASK [Restart SSH on RedHat] ************************************************<br>skipping: [tnode1]<br>skipping: [tnode2]<br>changed: [tnode3]</pre><p>Now each task runs only on the appropriate OS. Combining facts with conditionals like this is a common Ansible pattern for handling heterogeneous environments.</p><h3>Variables</h3><p>Variables make playbooks flexible and reusable. 
Instead of hardcoding values, you define them once and reference them throughout your playbook.</p><p>Ansible has many places where you can define variables. Here’s the precedence order (highest to lowest):</p><ol><li><strong>Extra vars</strong> (-e on command line) — Always wins</li><li><strong>Play vars</strong> (vars: in playbook)</li><li><strong>Host vars</strong> (defined per-host in inventory)</li><li><strong>Group vars</strong> (defined per-group in inventory)</li></ol><p>Understanding this precedence is crucial. If you define the same variable in multiple places, the higher-precedence one wins.</p><p>Group variables apply to all hosts in a group. Define them in the inventory:</p><pre>cat &gt;&gt; inventory &lt;&lt;&#39;EOF&#39;<br><br>[all:vars]<br>user=ansible<br>EOF</pre><p>Now every host has access to the user variable.</p><p>Let’s create a playbook that uses it:</p><pre>cat &gt; create-user.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Create User {{ user }}<br>      ansible.builtin.user:<br>        name: &quot;{{ user }}&quot;<br>        state: present<br>EOF</pre><p>Variable references use double curly braces: {{ variable_name }}. Note the quotes around &quot;{{ user }}&quot; — this is required when a value starts with a variable reference.</p><p>Run it:</p><pre>ansible-playbook create-user.yml<br># TASK [Create User ansible] **************************************************<br># changed: [tnode1]<br># changed: [tnode2]<br># changed: [tnode3]</pre><p>The task name shows “Create User ansible” — the variable was substituted.</p><p>Run it again:</p><pre>ansible-playbook create-user.yml<br># TASK [Create User ansible] **************************************************<br># ok: [tnode1]<br># ok: [tnode2]<br># ok: [tnode3]</pre><p>Notice changed=0 this time. The user already exists, so Ansible doesn&#39;t create it again. 
This is <strong>idempotency</strong> in action.</p><p>Verify the user was created:</p><pre>for i in {1..3}; do ssh tnode$i tail -n 1 /etc/passwd; done<br># ansible:x:1001:1001::/home/ansible:/bin/sh<br># ansible:x:1001:1001::/home/ansible:/bin/sh<br># ansible:x:1001:1001::/home/ansible:/bin/bash</pre><h3>Host Variables</h3><p>Host variables apply to a single host and override group variables:</p><pre>cat &gt; inventory &lt;&lt;&#39;EOF&#39;<br>[web]<br>tnode1 ansible_python_interpreter=/usr/bin/python3<br>tnode2 ansible_python_interpreter=/usr/bin/python3<br><br>[db]<br>tnode3 ansible_python_interpreter=/usr/bin/python3 user=ansible1<br><br>[all:children]<br>web<br>db<br><br>[all:vars]<br>user=ansible<br>EOF</pre><p>Now tnode3 has user=ansible1 while others have user=ansible.</p><pre>cat &gt; create-user1.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  tasks:<br>    - name: Create User {{ user }}<br>      ansible.builtin.user:<br>        name: &quot;{{ user }}&quot;<br>        state: present<br>EOF<br><br>ansible-playbook create-user1.yml<br># TASK [Create User ansible1] *************************************************<br># changed: [tnode3]</pre><p>The host variable ansible1 overrode the group variable ansible.</p><h3>Play Variables</h3><p>Play variables are defined within the playbook and override both host and group variables:</p><pre>cat &gt; create-user2.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  vars:<br>    user: ansible2<br>  tasks:<br>    - name: Create User {{ user }}<br>      ansible.builtin.user:<br>        name: &quot;{{ user }}&quot;<br>        state: present<br>EOF<br><br>$ ansible-playbook create-user2.yml<br># TASK [Create User ansible2] *************************************************<br># changed: [tnode1]<br># changed: [tnode2]<br># changed: [tnode3]</pre><p>Even though tnode3 has user=ansible1 in inventory, the play variable ansible2 wins.</p><h3>External Variable Files</h3><p>For better organization, you can store variables 
in separate files:</p><pre>mkdir -p vars<br>cat &gt; vars/users.yml &lt;&lt;&#39;EOF&#39;<br>user: ansible3<br>EOF<br><br>cat &gt; create-user3.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  vars_files:<br>    - vars/users.yml<br>  tasks:<br>    - name: Create User {{ user }}<br>      ansible.builtin.user:<br>        name: &quot;{{ user }}&quot;<br>        state: present<br>EOF<br><br>$ ansible-playbook create-user3.yml<br># TASK [Create User ansible3] *************************************************<br># changed on all hosts</pre><h3>Extra Variables (Command Line)</h3><p>Extra variables have the highest precedence. Pass them with -e:</p><pre>ansible-playbook -e user=ansible4 create-user3.yml<br># TASK [Create User ansible4] *************************************************<br># changed on all hosts</pre><p>Even though the playbook uses vars_files with user: ansible3, the -e user=ansible4 overrides it.</p><h3>Registered Variables</h3><p>You can capture task output in a variable using register:</p><pre>cat &gt; create-user4.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  tasks:<br>    - name: Create User {{ user }}<br>      ansible.builtin.user:<br>        name: &quot;{{ user }}&quot;<br>        state: present<br>      register: result<br>    - name: Display result<br>      ansible.builtin.debug:<br>        var: result<br>EOF<br><br>$ ansible-playbook -e user=ansible5 create-user4.yml</pre><p>Output:</p><pre>TASK [Display result] *******************************************************<br>ok: [tnode3] =&gt; {<br>    &quot;result&quot;: {<br>        &quot;changed&quot;: true,<br>        &quot;comment&quot;: &quot;&quot;,<br>        &quot;create_home&quot;: true,<br>        &quot;group&quot;: 1006,<br>        &quot;home&quot;: &quot;/home/ansible5&quot;,<br>        &quot;name&quot;: &quot;ansible5&quot;,<br>        &quot;shell&quot;: &quot;/bin/sh&quot;,<br>        &quot;state&quot;: &quot;present&quot;,<br>        &quot;uid&quot;: 1006<br>    
}<br>}</pre><p>The registered variable contains detailed information about what the task did. This is useful for conditional logic, debugging, or passing information between tasks.</p><h3>Facts</h3><p>Facts are variables that Ansible automatically discovers about managed hosts. They include hardware info, network configuration, OS details, and more.</p><pre>cat &gt; facts.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  tasks:<br>    - name: Print all facts<br>      ansible.builtin.debug:<br>        var: ansible_facts<br>EOF<br><br>ansible-playbook facts.yml</pre><p>This dumps a huge amount of information: hostname, IP addresses, memory, CPU, disk, OS version, Python version, and much more.</p><p>You don’t usually need all facts. Here’s how to access specific ones:</p><pre>cat &gt; facts1.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  tasks:<br>    - name: Print specific facts<br>      ansible.builtin.debug:<br>        msg: &gt;<br>          The default IPv4 address of {{ ansible_facts.hostname }}<br>          is {{ ansible_facts.default_ipv4.address }}<br>EOF<br><br>ansible-playbook facts1.yml<br># &quot;msg&quot;: &quot;The default IPv4 address of tnode3 is 10.10.1.13&quot;</pre><p>Some commonly used facts:</p><ul><li>ansible_facts.hostname: Short hostname</li><li>ansible_facts.fqdn: Fully qualified domain name</li><li>ansible_facts.default_ipv4.address: Primary IPv4 address</li><li>ansible_facts.os_family: OS family (Debian, RedHat, etc.)</li><li>ansible_facts.distribution: Specific distribution (Ubuntu, Rocky, etc.)</li><li>ansible_facts.distribution_version: Distribution version</li><li>ansible_facts.kernel: Kernel version</li><li>ansible_facts.memtotal_mb: Total memory in MB</li><li>ansible_facts.processor_cores: Number of CPU cores</li></ul><h3>Disabling Fact Gathering</h3><p>Fact gathering adds time to playbook runs. 
If you don’t need facts, disable them:</p><pre>cat &gt; facts3.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  gather_facts: no<br>  tasks:<br>    - name: Print message<br>      ansible.builtin.debug:<br>        msg: Hello Ansible World<br>EOF<br><br>$ ansible-playbook facts3.yml<br># Note: no &quot;Gathering Facts&quot; task runs</pre><p>But if you try to use facts after disabling gathering:</p><pre>cat &gt; facts3-error.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  gather_facts: no<br>  tasks:<br>    - name: Print facts<br>      ansible.builtin.debug:<br>        msg: &quot;Hostname is {{ ansible_facts.hostname }}&quot;<br>EOF<br><br>$ ansible-playbook facts3-error.yml<br># ERROR! &#39;dict object&#39; has no attribute &#39;hostname&#39;</pre><p>You can disable automatic gathering but collect facts later when needed:</p><pre>cat &gt; facts4.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  gather_facts: no<br>  tasks:<br>    - name: Manually gather facts<br>      ansible.builtin.setup:<br>    - name: Print facts<br>      ansible.builtin.debug:<br>        msg: &quot;Hostname is {{ ansible_facts.hostname }}&quot;<br>EOF<br><br><br>ansible-playbook facts4.yml<br># Works now because we explicitly gathered facts</pre><p>You can create your own facts on managed hosts. 
Create .fact files in /etc/ansible/facts.d/:</p><pre># On the control node (for local testing)<br>mkdir -p /etc/ansible/facts.d<br><br>cat &gt; /etc/ansible/facts.d/my-custom.fact &lt;&lt;&#39;EOF&#39;<br>[packages]<br>web_package = httpd<br>db_package = mariadb-server<br><br>[users]<br>user1 = ansible<br>user2 = devlos<br>EOF</pre><p>Access custom facts via ansible_local:</p><pre>cat &gt; facts5.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: localhost<br>  tasks:<br>    - name: Print custom facts<br>      ansible.builtin.debug:<br>        var: ansible_local<br>EOF<br><br>ansible-playbook facts5.yml<br># &quot;ansible_local&quot;: {<br>#     &quot;my-custom&quot;: {<br>#         &quot;packages&quot;: {<br>#             &quot;db_package&quot;: &quot;mariadb-server&quot;,<br>#             &quot;web_package&quot;: &quot;httpd&quot;<br>#         },<br>#         &quot;users&quot;: {<br>#             &quot;user1&quot;: &quot;ansible&quot;,<br>#             &quot;user2&quot;: &quot;devlos&quot;<br>#         }<br>#     }<br># }</pre><p>Custom facts are useful for site-specific configuration that you want Ansible to discover automatically.</p><h3>Loops</h3><p>Loops let you repeat a task multiple times with different values. 
This eliminates the need to write the same task over and over.</p><pre>cat &gt; check-services.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Check services are running<br>      ansible.builtin.service:<br>        name: &quot;{{ item }}&quot;<br>        state: started<br>      loop:<br>        - rsyslog<br>        - cron<br>EOF<br><br>ansible-playbook check-services.yml<br># TASK [Check services are running] *******************************************<br># ok: [tnode1] =&gt; (item=rsyslog)<br># ok: [tnode1] =&gt; (item=cron)<br># ok: [tnode2] =&gt; (item=rsyslog)<br># ...</pre><p>The {{ item }} variable holds the current value from the loop list.</p><p>Instead of hardcoding the list, use a variable:</p><pre>cat &gt; check-services2.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  vars:<br>    services:<br>      - rsyslog<br>      - cron<br>  tasks:<br>    - name: Check services are running<br>      ansible.builtin.service:<br>        name: &quot;{{ item }}&quot;<br>        state: started<br>      loop: &quot;{{ services }}&quot;<br>EOF</pre><p>Sometimes you need multiple values per iteration. Use a list of dictionaries:</p><pre>cat &gt; make-file.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Create files<br>      ansible.builtin.file:<br>        path: &quot;{{ item[&#39;log-path&#39;] }}&quot;<br>        mode: &quot;{{ item[&#39;log-mode&#39;] }}&quot;<br>        state: touch<br>      loop:<br>        - log-path: /var/log/test1.log<br>          log-mode: &#39;0644&#39;<br>        - log-path: /var/log/test2.log<br>          log-mode: &#39;0600&#39;<br>EOF<br><br>$ ansible-playbook make-file.yml</pre><p>Verify:</p><pre>ansible -m shell -a &quot;ls -l /var/log/test*.log&quot; all<br># -rw-r--r-- 1 root root 0 ... /var/log/test1.log<br># -rw------- 1 root root 0 ... 
/var/log/test2.log</pre><p>Each file was created with its specified permissions.</p><p>You might encounter older playbooks using with_* syntax:</p><pre># Old syntax (still works, but loop is preferred)<br>- name: Print items<br>  debug:<br>    msg: &quot;{{ item }}&quot;<br>  with_items:<br>    - one<br>    - two<br>    - three<br><br># Modern syntax<br>- name: Print items<br>  debug:<br>    msg: &quot;{{ item }}&quot;<br>  loop:<br>    - one<br>    - two<br>    - three</pre><p>The loop keyword is the modern approach. with_items still works but is considered legacy.</p><p>Capture output from all loop iterations:</p><pre>cat &gt; loop_register.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: localhost<br>  tasks:<br>    - name: Loop echo test<br>      ansible.builtin.shell: &quot;echo &#39;I can speak {{ item }}&#39;&quot;<br>      loop:<br>        - Korean<br>        - English<br>      register: result<br>    - name: Show all results<br>      ansible.builtin.debug:<br>        var: result<br>    - name: Show just stdout from each iteration<br>      ansible.builtin.debug:<br>        msg: &quot;Stdout: {{ item.stdout }}&quot;<br>      loop: &quot;{{ result.results }}&quot;<br>EOF<br><br>$ ansible-playbook loop_register.yml</pre><p>When you register inside a loop, the variable contains a results array with output from each iteration.</p><h3>Conditionals</h3><p>The when keyword lets you run tasks only when certain conditions are met.</p><pre>cat &gt; when_task.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: localhost<br>  vars:<br>    run_my_task: true<br>  tasks:<br>    - name: Echo message<br>      ansible.builtin.shell: &quot;echo test&quot;<br>      when: run_my_task<br>      register: result<br>    - name: Show result<br>      ansible.builtin.debug:<br>        var: result<br>EOF<br><br>$ ansible-playbook when_task.yml<br># Task runs because run_my_task is true</pre><p>Change run_my_task: false:</p><pre>cat &gt; when_task_false.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: 
localhost<br>  vars:<br>    run_my_task: false<br>  tasks:<br>    - name: Echo message<br>      ansible.builtin.shell: &quot;echo test&quot;<br>      when: run_my_task<br>      register: result<br>    - name: Show result<br>      ansible.builtin.debug:<br>        var: result<br>EOF<br><br><br>$ ansible-playbook when_task_false.yml<br># TASK [Echo message] *********************************************************<br># skipping: [localhost]</pre><p>The task is skipped.</p><h3>Comparison Operators</h3><ul><li><strong>==</strong>: Equal</li><li><strong>!=</strong>: Not equal</li><li><strong>&gt;, &gt;=, &lt;, &lt;=</strong>: Numeric comparisons</li><li><strong>in</strong>: Value in list</li><li><strong>is defined</strong>: Variable exists</li><li><strong>is not defined</strong>: Variable doesn&#39;t exist</li><li><strong>not</strong>: Negation</li><li><strong>and, or</strong>: Combine conditions</li></ul><pre>cat &gt; check-os.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  vars:<br>    supported_distros:<br>      - Ubuntu<br>      - CentOS<br>  tasks:<br>    - name: Print supported os<br>      ansible.builtin.debug:<br>        msg: &quot;This {{ ansible_facts[&#39;distribution&#39;] }} needs to use apt&quot;<br>      when: ansible_facts[&#39;distribution&#39;] in supported_distros<br>EOF<br><br>$ ansible-playbook check-os.yml<br># tnode1, tnode2: prints message (Ubuntu)<br># tnode3: skipped (Rocky not in list)</pre><h3>Multiple Conditions</h3><p>Combine conditions with and / or, or use the list syntax:</p><pre># Using &#39;and&#39;<br>when: ansible_facts[&#39;distribution&#39;] == &quot;Ubuntu&quot; and ansible_facts[&#39;distribution_version&#39;] == &quot;24.04&quot;</pre><pre># List syntax (implicit AND)<br>when:<br>  - ansible_facts[&#39;distribution&#39;] == &quot;Ubuntu&quot;<br>  - ansible_facts[&#39;distribution_version&#39;] == &quot;24.04&quot;</pre><pre># Combining AND and OR<br>when: &gt;<br>  ( ansible_facts[&#39;distribution&#39;] == &quot;Rocky&quot; and<br>    ansible_facts[&#39;distribution_version&#39;] == &quot;9.6&quot; )<br>  or<br>  ( ansible_facts[&#39;distribution&#39;] == &quot;Ubuntu&quot; and<br>    ansible_facts[&#39;distribution_version&#39;] == 
&quot;24.04&quot; )</pre><h3>Loop + Condition</h3><p>You can combine loops and conditionals:</p><pre>cat &gt; check-mount.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: db<br>  tasks:<br>    - name: Print Root Directory Size<br>      ansible.builtin.debug:<br>        msg: &quot;Directory {{ item.mount }} size is {{ item.size_available }}&quot;<br>      loop: &quot;{{ ansible_facts[&#39;mounts&#39;] }}&quot;<br>      when: item[&#39;mount&#39;] == &quot;/&quot; and item[&#39;size_available&#39;] &gt; 300000000<br>EOF<br><br>$ ansible-playbook check-mount.yml<br># Only prints for root mount if it has &gt;300MB available</pre><h3>Register + Condition</h3><pre>cat &gt; register-when.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: all<br>  tasks:<br>    - name: Get rsyslog service status<br>      ansible.builtin.command: systemctl is-active rsyslog<br>      register: result<br>      changed_when: false   # read-only check<br>      failed_when: false    # is-active exits non-zero when the service is not active<br><br>    - name: Print rsyslog status<br>      ansible.builtin.debug:<br>        msg: &quot;Rsyslog status is {{ result.stdout }}&quot;<br>      when: result.stdout == &quot;active&quot;<br>EOF<br><br>ansible-playbook register-when.yml<br># Only prints if rsyslog is active</pre><h3>Handlers and Error Handling</h3><p>Handlers are tasks that only run when notified. 
They’re perfect for actions that should happen only if something changed:</p><pre>cat &gt; handler-sample.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode2<br>  tasks:<br>    - name: Restart rsyslog<br>      ansible.builtin.service:<br>        name: rsyslog<br>        state: restarted<br>      notify:<br>        - print msg<br>  handlers:<br>    - name: print msg<br>      ansible.builtin.debug:<br>        msg: &quot;rsyslog is restarted&quot;<br>EOF<br><br>$ ansible-playbook handler-sample.yml<br># TASK [Restart rsyslog] ******************************************************<br># changed: [tnode2]<br>#<br># RUNNING HANDLER [print msg] *************************************************<br># ok: [tnode2] =&gt; {<br>#     &quot;msg&quot;: &quot;rsyslog is restarted&quot;<br># }</pre><p>The handler ran because the service task returned changed. If you run it again, the service restarts (always changed for state: restarted), so the handler runs again.</p><p>In practice, handlers are commonly used to restart services only when their configuration files change.</p><p>By default, if a task fails, Ansible stops executing on that host. Use ignore_errors to continue:</p><pre>cat &gt; ignore-example.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  tasks:<br>    - name: Install nonexistent package<br>      ansible.builtin.apt:<br>        name: apache3      # This doesn&#39;t exist<br>        state: latest<br>      ignore_errors: yes<br><br>    - name: Print msg<br>      ansible.builtin.debug:<br>        msg: &quot;This still runs despite the error above&quot;<br>EOF<br><br>ansible-playbook ignore-example.yml<br># First task fails but second task still runs</pre><p>Normally, if a task fails on a host, handlers that were already notified on that host don’t run. 
Use force_handlers: yes to change this:</p><pre>cat &gt; force-handler.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode2<br>  force_handlers: yes<br>  tasks:<br>    - name: Restart rsyslog<br>      ansible.builtin.service:<br>        name: rsyslog<br>        state: restarted<br>      notify:<br>        - print msg<br>    - name: Install nonexistent package<br>      ansible.builtin.apt:<br>        name: apache3<br>        state: latest<br>  handlers:<br>    - name: print msg<br>      ansible.builtin.debug:<br>        msg: &quot;rsyslog is restarted&quot;<br>EOF<br><br>ansible-playbook force-handler.yml<br># Second task fails, but handler still runs because of force_handlers</pre><p>Use failed_when to define custom failure criteria:</p><pre>cat &gt; failed-when.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  tasks:<br>    - name: Run user add script<br>      ansible.builtin.shell: /root/adduser-script.sh<br>      register: command_result<br>      failed_when: &quot;&#39;Please input user id and password&#39; in command_result.stdout&quot;<br>    - name: Print msg<br>      ansible.builtin.debug:<br>        msg: &quot;{{ command_result.stdout }}&quot;<br>EOF</pre><p>Even though the shell command returns exit code 0 (success), we mark it as failed if it prints the usage message.</p><h3>Block / Rescue / Always</h3><p>For try-catch-finally style error handling:</p><pre>cat &gt; block-example.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode2<br>  vars:<br>    logdir: /var/log/daily_log<br>    logfile: todays.log<br>  tasks:<br>    - name: Configure Log Env<br>      block:<br>        - name: Find Directory<br>          ansible.builtin.find:<br>            paths: &quot;{{ logdir }}&quot;<br>          register: result<br>          failed_when: &quot;&#39;Not all paths&#39; in result.msg&quot;<br>      rescue:<br>        - name: Make Directory when Not found<br>          ansible.builtin.file:<br>            path: &quot;{{ logdir }}&quot;<br>            state: 
directory<br>            mode: &#39;0755&#39;<br>      always:<br>        - name: Create File<br>          ansible.builtin.file:<br>            path: &quot;{{ logdir }}/{{ logfile }}&quot;<br>            state: touch<br>            mode: &#39;0644&#39;<br>EOF<br><br>$ ansible-playbook block-example.yml</pre><p>First run (directory doesn’t exist):</p><ul><li>block task fails (directory not found)</li><li>rescue task runs (creates directory)</li><li>always task runs (creates file)</li></ul><p>Second run (directory exists):</p><ul><li>block task succeeds</li><li>rescue is skipped</li><li>always task runs</li></ul><h3>Roles</h3><p>As playbooks grow, they become hard to manage. Roles let you organize your automation into reusable components.</p><p>A role follows a standard structure:</p><pre>my-role/<br>├── defaults/       # Default variables (can be overridden)<br>│   └── main.yml<br>├── files/          # Static files to copy<br>├── handlers/       # Handler definitions<br>│   └── main.yml<br>├── meta/           # Role metadata (dependencies, etc.)<br>│   └── main.yml<br>├── tasks/          # Main task list<br>│   └── main.yml<br>├── templates/      # Jinja2 templates<br>├── tests/          # Test playbooks<br>│   ├── inventory<br>│   └── test.yml<br>└── vars/           # Variables (high precedence, hard to override)<br>    └── main.yml</pre><p>Use ansible-galaxy to scaffold a new role:</p><pre>ansible-galaxy role init my-role<br>tree my-role/</pre><p>Let’s create a role that:</p><ol><li>Checks if the OS is supported</li><li>Installs Apache</li><li>Deploys a custom index.html</li><li>Restarts Apache when the file changes</li></ol><p><strong>tasks/main.yml:</strong></p><pre>cat &gt; my-role/tasks/main.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- name: Install service {{ service_title }}<br>  ansible.builtin.apt:<br>    name: &quot;{{ item }}&quot;<br>    state: latest<br>  loop: &quot;{{ httpd_packages }}&quot;<br>  when: ansible_facts.distribution in supported_distros<br><br>- 
name: Copy html file<br>  ansible.builtin.copy:<br>    src: &quot;{{ src_file_path }}&quot;<br>    dest: &quot;{{ dest_file_path }}&quot;<br>  notify: <br>    - restart service<br>EOF</pre><p><strong>files/index.html:</strong></p><pre>echo &quot;Hello! Ansible&quot; &gt; my-role/files/index.html</pre><p><strong>handlers/main.yml:</strong></p><pre>cat &gt; my-role/handlers/main.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- name: restart service<br>  ansible.builtin.service:<br>    name: &quot;{{ service_name }}&quot;<br>    state: restarted<br>EOF</pre><p><strong>defaults/main.yml</strong> (overridable variables):</p><pre>cat &gt; my-role/defaults/main.yml &lt;&lt;&#39;EOF&#39;<br>---<br>service_title: &quot;Apache Web Server&quot;<br>EOF</pre><p><strong>vars/main.yml</strong> (internal variables):</p><pre>cat &gt; my-role/vars/main.yml &lt;&lt;&#39;EOF&#39;<br>---<br>service_name: apache2<br>src_file_path: index.html<br>dest_file_path: /var/www/html/index.html<br>httpd_packages:<br>  - apache2<br>  - apache2-doc<br>supported_distros:<br>  - Ubuntu<br>EOF</pre><h3>Using the Role</h3><p>Create a playbook that calls the role:</p><pre>cat &gt; role-example.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  tasks:<br>    - name: Print start play<br>      ansible.builtin.debug:<br>        msg: &quot;Let&#39;s start role play&quot;<br>    - name: Install Service by role<br>      ansible.builtin.import_role:<br>        name: my-role<br>EOF<br><br>$ ansible-playbook role-example.yml</pre><p>Output:</p><pre>TASK [Print start play] *******************************************************<br>ok: [tnode1] =&gt; {<br>    &quot;msg&quot;: &quot;Let&#39;s start role play&quot;<br>}<br><br>TASK [my-role : install service Apache Web Server] ****************************<br>changed: [tnode1] =&gt; (item=apache2)<br>changed: [tnode1] =&gt; (item=apache2-doc)<br><br>TASK [my-role : copy html file] ***********************************************<br>changed: [tnode1]<br><br>RUNNING HANDLER 
[my-role : restart service] ***********************************<br>changed: [tnode1]</pre><p>Verify:</p><pre>curl tnode1<br># Hello! Ansible</pre><h3>Overriding Role Variables</h3><p>Override the default service_title:</p><pre>cat &gt; role-example.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  tasks:<br>    - name: Install Service by role<br>      ansible.builtin.import_role:<br>        name: my-role<br>      vars:<br>        service_title: Httpd<br>EOF<br><br>$ ansible-playbook role-example.yml<br># TASK [my-role : install service Httpd] **************************************</pre><h4>Using the roles: Section</h4><p>An alternative to import_role:</p><pre>cat &gt; role-example2.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  roles:<br>    - role: my-role<br>      service_title: &quot;Custom Apache&quot;<br>  tasks:<br>    - name: Print finish<br>      ansible.builtin.debug:<br>        msg: &quot;Finished role play&quot;<br>EOF</pre><h4>Special Task Sections</h4><p>When using roles:, you can add tasks that run before or after:</p><pre>cat &gt; special_role.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  pre_tasks:<br>    - name: Print Start<br>      ansible.builtin.debug:<br>        msg: &quot;Starting role play&quot;<br>  roles:<br>    - my-role<br>  tasks:<br>    - name: Curl test<br>      ansible.builtin.uri:<br>        url: http://tnode1<br>        return_content: true<br>      register: curl_result<br>      notify: Print result<br>      changed_when: true<br>  post_tasks:<br>    - name: Print Finish<br>      ansible.builtin.debug:<br>        msg: &quot;Finished role play&quot;<br>  handlers:<br>    - name: Print result<br>      ansible.builtin.debug:<br>        msg: &quot;{{ curl_result.content }}&quot;<br>EOF</pre><p>Execution order:</p><ol><li>pre_tasks</li><li>roles</li><li>tasks → triggers handler</li><li>post_tasks</li></ol><h3>Tags</h3><p>Tags let you selectively run or skip tasks. 
They are useful when you want to run just a portion of a large playbook.</p><h4>Adding Tags to Tasks</h4><pre>cat &gt; tags1.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: web<br>  tasks:<br>    - name: Install packages<br>      ansible.builtin.apt:<br>        name: htop<br>        state: present<br>      tags:<br>        - packages<br><br>    - name: Restart service<br>      ansible.builtin.service:<br>        name: rsyslog<br>        state: restarted<br>      tags:<br>        - service<br>EOF</pre><h4>Running with Tags</h4><pre># List available tags<br>ansible-playbook tags1.yml --list-tags<br># TASK TAGS: [packages, service]<br><br># Run only tasks tagged &#39;packages&#39;<br>ansible-playbook tags1.yml --tags &quot;packages&quot;<br># Only &quot;Install packages&quot; runs<br><br># Skip tasks tagged &#39;packages&#39;<br>ansible-playbook tags1.yml --skip-tags &quot;packages&quot;<br># Only &quot;Restart service&quot; runs<br><br># Run all tagged tasks<br>ansible-playbook tags1.yml --tags tagged</pre><h3>Special Tags</h3><p><strong>always</strong>: Task always runs (unless explicitly skipped)</p><pre>- name: Always run this<br>  debug:<br>    msg: &quot;I always run&quot;<br>  tags:<br>    - always</pre><p><strong>never</strong>: Task never runs (unless explicitly requested)</p><pre>- name: Debug task<br>  debug:<br>    msg: &quot;Only runs with --tags debug or --tags never&quot;<br>  tags:<br>    - never<br>    - debug</pre><h4>Tags on Blocks, Plays, and Roles</h4><pre># Tag a block<br>- block:<br>    - name: Task 1<br>      ...<br>    - name: Task 2<br>      ...<br>  tags: myblock<br><br># Tag a play<br>- hosts: all<br>  tags: deploy<br>  tasks:<br>    ...<br><br># Tag a role<br>roles:<br>  - role: my-role<br>    tags: [ web, setup ]</pre><h3>Ansible Galaxy</h3><p>Ansible Galaxy is a community hub for sharing roles and collections. 
Why write everything from scratch when someone has already done the work? You can browse it on the web at <a href="https://galaxy.ansible.com">https://galaxy.ansible.com</a> or use the command line:</p><pre>ansible-galaxy role search postgresql --platforms Ubuntu<br># Found 282 roles matching your search</pre><pre>ansible-galaxy role info geerlingguy.postgresql<br># Role: geerlingguy.postgresql<br>#     description: PostgreSQL server for Linux.<br>#     download_count: 3758702<br>#     ...</pre><pre># Install to default location<br>ansible-galaxy role install geerlingguy.postgresql<br><br># Install to specific directory<br>ansible-galaxy role install -p roles geerlingguy.postgresql<br><br># List installed roles<br>ansible-galaxy role list</pre><pre>cat &gt; role-galaxy.yml &lt;&lt;&#39;EOF&#39;<br>---<br>- hosts: tnode1<br>  become: yes<br>  roles:<br>    - geerlingguy.postgresql<br>EOF<br><br>ansible-playbook role-galaxy.yml</pre><pre>ansible-galaxy role remove geerlingguy.postgresql</pre><p>Collections are the modern way to distribute Ansible content. They can include modules, plugins, roles, and playbooks, all packaged together.</p><p>Before collections, all modules were bundled with Ansible core. 
This created problems:</p><ul><li>Module updates tied to Ansible releases</li><li>Namespace collisions</li><li>Everything installed whether you need it or not</li></ul><p>Collections solve this by allowing vendors and developers to distribute content independently.</p><pre># List installed collections<br>ansible-galaxy collection list<br><br># Install a collection<br>ansible-galaxy collection install openstack.cloud:2.1.0<br><br># Install from a tarball (for offline)<br>ansible-galaxy collection download -p ./collection openstack.cloud<br>ansible-galaxy collection install ./collection/openstack-cloud-2.2.0.tar.gz<br><br># Remove a collection (manual deletion)<br>rm -rf ~/.ansible/collections/ansible_collections/openstack</pre><p>When referencing modules from collections, use the fully qualified name:</p><pre># Full name<br>- name: Create server<br>  openstack.cloud.server:<br>    name: myserver<br>    ...<br><br># Or import the collection<br>collections:<br>  - openstack.cloud<br><br>tasks:<br>  - name: Create server<br>    server:<br>      name: myserver</pre><h3>Conclusion</h3><p>We’ve covered a lot of ground. 
Let’s recap what you’ve learned:</p><ul><li><strong>Ansible basics</strong>: Agentless architecture, SSH-based, idempotent</li><li><strong>Inventories</strong>: Organizing hosts into groups</li><li><strong>Playbooks</strong>: YAML files describing desired state</li><li><strong>Variables</strong>: Multiple scopes with clear precedence rules</li><li><strong>Facts</strong>: Auto-discovered host information</li><li><strong>Loops</strong>: Repeating tasks efficiently</li><li><strong>Conditionals</strong>: Running tasks based on conditions</li><li><strong>Handlers</strong>: Triggered tasks for “changed” events</li><li><strong>Error handling</strong>: ignore_errors, failed_when, block/rescue/always</li><li><strong>Roles</strong>: Reusable, modular automation components</li><li><strong>Tags</strong>: Selective task execution</li><li><strong>Galaxy</strong>: Sharing and reusing community content</li><li><strong>Collections</strong>: Modern content distribution</li></ul><p>The manual approach to server management, SSH’ing in, running commands, copying files, simply doesn’t scale. With Ansible, you describe what you want, and it figures out how to get there. Your infrastructure becomes code that you can version, review, and reproduce.</p><p>Next week, we’ll put all this knowledge to use with Kubespray, using Ansible to deploy a production-ready Kubernetes cluster. The foundation you’ve built here will make that journey much smoother.</p><p>Thanks for reading! See you in the next post.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Kubernetes Cluster From Scratch: Part 3 & Part 4]]></title>
            <link>https://sigridjin.medium.com/building-a-kubernetes-cluster-from-scratch-setting-up-etcd-and-control-plane-0719698f0182?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/0719698f0182</guid>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 10 Jan 2026 18:00:37 GMT</pubDate>
            <atom:updated>2026-01-10T18:27:34.849Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PsbXYWJ-M6v6LGS80xnr-A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7OoPKctY6s47qAacmFWaSQ.png" /></figure><blockquote>See the repository if you want to follow my tutorial! <a href="https://github.com/sigridjineth/k8s-hard-way">https://github.com/sigridjineth/k8s-hard-way</a></blockquote><p>In Part 2, we built the control plane: etcd for storage, the API server for cluster access, the controller manager for reconciliation loops, and the scheduler for pod placement. We verified that everything was running and that we could interact with the cluster using kubectl.</p><p>But right now, if you try to create a pod, nothing will happen. The scheduler will look for a node to place it on and find none. The pod will sit in Pending state forever. We need worker nodes.</p><p>In this part, we will configure both worker nodes with everything they need to run containers. This includes containerd as the container runtime, kubelet as the node agent, kube-proxy for service networking, and CNI plugins for pod networking. We will also set up routing so that pods on different nodes can talk to each other.</p><p>This is where Kubernetes starts to feel real. By the end of this part, you will be able to deploy actual workloads.</p><p>Before we dive into configuration, let me explain what each component does and how they interact.</p><p>The kubelet is the primary node agent. It watches the API server for pods that are assigned to its node and ensures those pods are running. When a new pod is scheduled to the node, kubelet tells the container runtime to start the containers. It also reports node status back to the API server, including resource capacity, running pods, and health conditions.</p><p>The container runtime actually runs containers. 
We use containerd, which is the industry standard runtime used by most Kubernetes distributions including Docker Desktop, EKS, GKE, and AKS. Containerd handles pulling images, creating containers, managing their lifecycle, and cleaning up when they exit. It implements the Container Runtime Interface (CRI) that kubelet uses to communicate with it.</p><p>Under containerd, there is runc. This is a low-level runtime that does the actual work of creating Linux containers using namespaces and cgroups. Containerd calls runc to create each container, and runc sets up the isolated environment where the container process runs.</p><p>The kube-proxy handles service networking. When you create a Service in Kubernetes, kube-proxy sets up iptables rules (or IPVS rules, depending on configuration) that route traffic destined for the service’s ClusterIP to one of the backing pods. It watches the API server for service and endpoint changes and updates the routing rules accordingly.</p><p>The CNI plugins handle pod networking. When kubelet creates a pod, it calls a CNI plugin to set up the network namespace, assign an IP address, and configure routing. We will use the bridge plugin, which creates a Linux bridge on each node and connects all pods to it. Each pod gets an IP from the node’s pod CIDR range.</p><p>Let us start with node-0. From the jumpbox, SSH into it:</p><pre>ssh root@node-0</pre><p>First, let us verify that swap is disabled. Kubernetes requires this:</p><pre>swapon --show</pre><p>This should produce no output. If you see swap listed, disable it:</p><pre>swapoff -a<br>sed -i &#39;/swap/s/^/#/&#39; /etc/fstab</pre><p>Install the required system packages:</p><pre>apt-get update<br>apt-get install -y socat conntrack ipset</pre><p>Let me explain why we need these.</p><ol><li>socat is used by kubectl port-forward to create the tunnels between your machine and pods.</li><li>conntrack is the connection tracking tool. 
Kube-proxy uses it to track network connections for proper load balancing and NAT.</li><li>ipset is used by kube-proxy in IPVS mode to efficiently manage large sets of IP addresses. Even if we use iptables mode, having it installed does not hurt.</li></ol><p>Now load the kernel modules that Kubernetes networking requires:</p><pre>cat &gt; /etc/modules-load.d/kubernetes.conf &lt;&lt; EOF<br>overlay<br>br_netfilter<br>EOF</pre><pre>modprobe overlay<br>modprobe br_netfilter</pre><p>The overlay module is used by containerd’s overlayfs storage driver, which efficiently layers container filesystem images.</p><p>The br_netfilter module is critical for Kubernetes networking. By default, traffic that passes through a Linux bridge stays at layer 2 and does not go through iptables. But kube-proxy’s service routing relies on iptables. The br_netfilter module makes bridged traffic visible to iptables so that service routing works correctly for pod-to-pod communication.</p><p>Verify the modules are loaded:</p><pre>lsmod | grep br_netfilter<br>lsmod | grep overlay</pre><p>Both should show output.</p><p>Now configure the required sysctl parameters:</p><pre>cat &gt; /etc/sysctl.d/kubernetes.conf &lt;&lt; EOF<br>net.bridge.bridge-nf-call-iptables = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward = 1<br>EOF</pre><pre>sysctl --system</pre><p>The bridge-nf-call-iptables settings ensure that bridged traffic goes through iptables. 
The ip_forward setting enables IP forwarding, which is necessary for routing traffic between pods on different nodes.</p><p>Create the directories we will need:</p><pre>mkdir -p \<br>  /etc/cni/net.d \<br>  /opt/cni/bin \<br>  /var/lib/kubelet \<br>  /var/lib/kube-proxy \<br>  /var/lib/kubernetes \<br>  /var/run/kubernetes \<br>  /etc/containerd</pre><p>Exit back to jumpbox to copy the binaries:</p><pre>exit</pre><p>From jumpbox, copy all the worker node binaries to node-0:</p><pre>scp ~/downloads/kubelet ~/downloads/kube-proxy ~/downloads/kubectl \<br>  ~/downloads/runc ~/downloads/crictl \<br>  ~/downloads/containerd ~/downloads/containerd-shim-runc-v2 ~/downloads/ctr \<br>  root@node-0:~/<br><br>scp -r ~/downloads/cni-plugins root@node-0:~/</pre><p>SSH back to node-0:</p><pre>ssh root@node-0</pre><p>Install the binaries to their proper locations:</p><pre>mv kubelet kube-proxy kubectl /usr/local/bin/<br>mv runc crictl /usr/local/bin/<br>mv containerd containerd-shim-runc-v2 ctr /bin/<br>mv cni-plugins/* /opt/cni/bin/</pre><pre>chmod +x /usr/local/bin/kubelet /usr/local/bin/kube-proxy /usr/local/bin/kubectl<br>chmod +x /usr/local/bin/runc /usr/local/bin/crictl<br>chmod +x /bin/containerd /bin/containerd-shim-runc-v2 /bin/ctr<br>chmod +x /opt/cni/bin/*</pre><p>Verify the binaries work:</p><pre>containerd --version<br>kubelet --version<br>kube-proxy --version</pre><p>All three should print version information.</p><p>The CNI configuration tells kubelet how to set up networking for each pod. We will use the bridge plugin, which creates a Linux bridge and connects each pod to it.</p><p>Each node needs its own pod CIDR. Node-0 uses 10.200.0.0/24, which gives it 254 possible pod IPs. 
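</p><p>As a quick sanity check on that number, the shell arithmetic below counts the usable host addresses in a /24. It is only a sketch: the prefix variable is a name chosen for this illustration, and the count covers the whole subnet, so any address the cni0 bridge takes as its gateway comes out of the same pool.</p><pre># Usable host addresses in a /24: 2^(32-24) total, minus network and broadcast<br>prefix=24   # hypothetical variable, taken from 10.200.0.0/24<br>echo $(( (1 &lt;&lt; (32 - prefix)) - 2 ))<br># 254</pre><p>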
Create the bridge configuration:</p><pre>cat &gt; /etc/cni/net.d/10-bridge.conf &lt;&lt; EOF<br>{<br>  &quot;cniVersion&quot;: &quot;1.0.0&quot;,<br>  &quot;name&quot;: &quot;bridge&quot;,<br>  &quot;type&quot;: &quot;bridge&quot;,<br>  &quot;bridge&quot;: &quot;cni0&quot;,<br>  &quot;isGateway&quot;: true,<br>  &quot;ipMasq&quot;: true,<br>  &quot;ipam&quot;: {<br>    &quot;type&quot;: &quot;host-local&quot;,<br>    &quot;ranges&quot;: [<br>      [{&quot;subnet&quot;: &quot;10.200.0.0/24&quot;}]<br>    ],<br>    &quot;routes&quot;: [<br>      {&quot;dst&quot;: &quot;0.0.0.0/0&quot;}<br>    ]<br>  }<br>}<br>EOF</pre><p>Let me explain this configuration:</p><ol><li>The type is “bridge”, which means we use the bridge CNI plugin.</li><li>The bridge field specifies the name of the Linux bridge to create. All pods on this node will be connected to cni0.</li><li>The isGateway setting means the bridge will have an IP address and act as the default gateway for pods. Pods will send traffic to this bridge, which will then route it appropriately.</li><li>The ipMasq setting enables IP masquerading (NAT) for traffic leaving the node. When a pod sends traffic to an external destination, the source IP will be rewritten to the node’s IP.</li><li>The ipam section configures IP address management. We use host-local, which allocates IPs from a local pool. 
The subnet is 10.200.0.0/24 for node-0.</li></ol><p>Now create the loopback configuration:</p><pre>cat &gt; /etc/cni/net.d/99-loopback.conf &lt;&lt; EOF<br>{<br>  &quot;cniVersion&quot;: &quot;1.0.0&quot;,<br>  &quot;name&quot;: &quot;lo&quot;,<br>  &quot;type&quot;: &quot;loopback&quot;<br>}<br>EOF</pre><p>This ensures that every pod has a working loopback interface.</p><p>Create the containerd configuration file:</p><pre>cat &gt; /etc/containerd/config.toml &lt;&lt; EOF<br>version = 2<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>    snapshotter = &quot;overlayfs&quot;<br>    default_runtime_name = &quot;runc&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>    runtime_type = &quot;io.containerd.runc.v2&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>    SystemdCgroup = true<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.cni]<br>  bin_dir = &quot;/opt/cni/bin&quot;<br>  conf_dir = &quot;/etc/cni/net.d&quot;<br>EOF</pre><p>The key settings here.</p><ol><li>The snapshotter is overlayfs, which is the most efficient storage driver for container layers on Linux.</li><li>The SystemdCgroup setting is important. It tells containerd to use systemd as the cgroup driver. This must match what kubelet uses, and since systemd is the init system on our Debian nodes, we should use systemd for cgroups too. 
Mismatched cgroup drivers cause all sorts of mysterious issues.</li><li>The cni section tells containerd where to find CNI plugins and configuration.</li></ol><p>Create the systemd unit file for containerd.</p><pre>cat &gt; /etc/systemd/system/containerd.service &lt;&lt; EOF<br>[Unit]<br>Description=containerd container runtime<br>Documentation=https://containerd.io<br>After=network.target<br><br>[Service]<br>ExecStartPre=/sbin/modprobe overlay<br>ExecStart=/bin/containerd<br>Restart=always<br>RestartSec=5<br>Delegate=yes<br>KillMode=process<br>OOMScoreAdjust=-999<br>LimitNOFILE=1048576<br>LimitNPROC=infinity<br>LimitCORE=infinity<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>The Delegate=yes setting is important. It allows containerd to manage cgroups for containers it creates. Without this, systemd might interfere with container resource limits.</p><p>Move the certificates and kubeconfig into place. First, let us check what we have in the home directory:</p><pre>ls -la ~/*.pem ~/node-0.kubeconfig</pre><p>You should see ca.pem, node-0.pem, node-0-key.pem, and node-0.kubeconfig.</p><p>Move them to the appropriate locations:</p><pre>mv ~/ca.pem /var/lib/kubernetes/<br>mv ~/node-0.pem ~/node-0-key.pem /var/lib/kubelet/<br>mv ~/node-0.kubeconfig /var/lib/kubelet/kubeconfig</pre><p>Create the kubelet configuration file:</p><pre>cat &gt; /var/lib/kubelet/kubelet-config.yaml &lt;&lt; EOF<br>kind: KubeletConfiguration<br>apiVersion: kubelet.config.k8s.io/v1beta1<br>authentication:<br>  anonymous:<br>    enabled: false<br>  webhook:<br>    enabled: true<br>  x509:<br>    clientCAFile: &quot;/var/lib/kubernetes/ca.pem&quot;<br>authorization:<br>  mode: Webhook<br>cgroupDriver: systemd<br>clusterDomain: &quot;cluster.local&quot;<br>clusterDNS:<br>  - &quot;10.32.0.10&quot;<br>podCIDR: &quot;10.200.0.0/24&quot;<br>resolvConf: &quot;/run/systemd/resolve/resolv.conf&quot;<br>runtimeRequestTimeout: &quot;15m&quot;<br>tlsCertFile: 
&quot;/var/lib/kubelet/node-0.pem&quot;<br>tlsPrivateKeyFile: &quot;/var/lib/kubelet/node-0-key.pem&quot;<br>registerNode: true<br>EOF</pre><p>Let me explain the important settings.</p><ol><li>The authentication section configures how kubelet authenticates incoming requests. We disable anonymous access and enable webhook authentication, which means kubelet will check with the API server to verify tokens. We also configure x509 authentication using our CA.</li><li>The authorization mode is Webhook, which means kubelet will ask the API server whether a given request is authorized. This is more secure than AlwaysAllow and integrates with Kubernetes RBAC.</li><li>The cgroupDriver must be systemd to match containerd’s configuration.</li><li>The clusterDNS setting points to 10.32.0.10, which is where CoreDNS will run. We have not set up DNS yet, but kubelet needs to know the address to configure pods correctly.</li><li>The podCIDR tells kubelet which IP range to use for pods on this node.</li><li>The resolvConf setting points to the systemd-resolved configuration. 
This is used as a base for pod DNS configuration.</li><li>The tlsCertFile and tlsPrivateKeyFile are the node’s certificate and key, used when kubelet serves its HTTPS API.</li></ol><p>Create the kubelet systemd unit file.</p><pre>cat &gt; /etc/systemd/system/kubelet.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Kubelet<br>Documentation=https://github.com/kubernetes/kubernetes<br>After=containerd.service<br>Requires=containerd.service<br><br>[Service]<br>ExecStart=/usr/local/bin/kubelet \\<br>  --config=/var/lib/kubelet/kubelet-config.yaml \\<br>  --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock \\<br>  --kubeconfig=/var/lib/kubelet/kubeconfig \\<br>  --v=2<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>The container-runtime-endpoint tells kubelet how to communicate with containerd via its Unix socket.</p><p>Move the kube-proxy kubeconfig:</p><pre>mv ~/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig</pre><p>Create the kube-proxy configuration file:</p><pre>cat &gt; /var/lib/kube-proxy/kube-proxy-config.yaml &lt;&lt; EOF<br>kind: KubeProxyConfiguration<br>apiVersion: kubeproxy.config.k8s.io/v1alpha1<br>clientConnection:<br>  kubeconfig: &quot;/var/lib/kube-proxy/kubeconfig&quot;<br>mode: &quot;iptables&quot;<br>clusterCIDR: &quot;10.200.0.0/16&quot;<br>EOF</pre><p>The mode is set to iptables, which is the traditional and most compatible mode. Kube-proxy will create iptables rules for service routing.</p><p>The clusterCIDR is the overall pod network range. 
This is used for determining which traffic is internal to the cluster.</p><p>Create the systemd unit file:</p><pre>cat &gt; /etc/systemd/system/kube-proxy.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Kube Proxy<br>Documentation=https://github.com/kubernetes/kubernetes<br>After=network.target<br><br>[Service]<br>ExecStart=/usr/local/bin/kube-proxy \\<br>  --config=/var/lib/kube-proxy/kube-proxy-config.yaml<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Now let us start all the services:</p><pre>systemctl daemon-reload<br>systemctl enable containerd kubelet kube-proxy<br>systemctl start containerd kubelet kube-proxy</pre><p>Wait a few seconds, then check their status:</p><pre>systemctl status containerd --no-pager</pre><p>It should show active (running).</p><pre>systemctl status kubelet --no-pager</pre><p>This should also show active (running). If it shows failed or you see errors, check the logs:</p><pre>journalctl -u kubelet --no-pager -n 50</pre><p>Common issues at this stage include certificate problems (wrong paths or permissions), kubeconfig issues, or containerd not running.</p><pre>systemctl status kube-proxy --no-pager</pre><p>This should be running as well.</p><p>Now we need to repeat the process for node-1. 
I will go through it more quickly since the steps are the same, just with different node-specific values.</p><p>Exit from node-0:</p><pre>exit</pre><p>From jumpbox, copy the binaries and configuration to node-1:</p><pre>scp ~/downloads/kubelet ~/downloads/kube-proxy ~/downloads/kubectl \<br>  ~/downloads/runc ~/downloads/crictl \<br>  ~/downloads/containerd ~/downloads/containerd-shim-runc-v2 ~/downloads/ctr \<br>  root@node-1:~/</pre><pre>scp -r ~/downloads/cni-plugins root@node-1:~/</pre><p>SSH into node-1:</p><pre>ssh root@node-1</pre><p>Install system dependencies:</p><pre>apt-get update<br>apt-get install -y socat conntrack ipset</pre><p>Load kernel modules:</p><pre>cat &gt; /etc/modules-load.d/kubernetes.conf &lt;&lt; EOF<br>overlay<br>br_netfilter<br>EOF<br><br>modprobe overlay<br>modprobe br_netfilter</pre><p>Configure sysctl:</p><pre>cat &gt; /etc/sysctl.d/kubernetes.conf &lt;&lt; EOF<br>net.bridge.bridge-nf-call-iptables = 1<br>net.bridge.bridge-nf-call-ip6tables = 1<br>net.ipv4.ip_forward = 1<br>EOF<br><br>sysctl --system</pre><p>Create directories:</p><pre>mkdir -p \<br>  /etc/cni/net.d \<br>  /opt/cni/bin \<br>  /var/lib/kubelet \<br>  /var/lib/kube-proxy \<br>  /var/lib/kubernetes \<br>  /var/run/kubernetes \<br>  /etc/containerd</pre><p>Install binaries:</p><pre>mv kubelet kube-proxy kubectl /usr/local/bin/<br>mv runc crictl /usr/local/bin/<br>mv containerd containerd-shim-runc-v2 ctr /bin/<br>mv cni-plugins/* /opt/cni/bin/</pre><pre>chmod +x /usr/local/bin/kubelet /usr/local/bin/kube-proxy /usr/local/bin/kubectl<br>chmod +x /usr/local/bin/runc /usr/local/bin/crictl<br>chmod +x /bin/containerd /bin/containerd-shim-runc-v2 /bin/ctr<br>chmod +x /opt/cni/bin/*</pre><p>Create CNI configuration. 
Note the different subnet for node-1:</p><pre>cat &gt; /etc/cni/net.d/10-bridge.conf &lt;&lt; EOF<br>{<br>  &quot;cniVersion&quot;: &quot;1.0.0&quot;,<br>  &quot;name&quot;: &quot;bridge&quot;,<br>  &quot;type&quot;: &quot;bridge&quot;,<br>  &quot;bridge&quot;: &quot;cni0&quot;,<br>  &quot;isGateway&quot;: true,<br>  &quot;ipMasq&quot;: true,<br>  &quot;ipam&quot;: {<br>    &quot;type&quot;: &quot;host-local&quot;,<br>    &quot;ranges&quot;: [<br>      [{&quot;subnet&quot;: &quot;10.200.1.0/24&quot;}]<br>    ],<br>    &quot;routes&quot;: [<br>      {&quot;dst&quot;: &quot;0.0.0.0/0&quot;}<br>    ]<br>  }<br>}<br>EOF</pre><pre>cat &gt; /etc/cni/net.d/99-loopback.conf &lt;&lt; EOF<br>{<br>  &quot;cniVersion&quot;: &quot;1.0.0&quot;,<br>  &quot;name&quot;: &quot;lo&quot;,<br>  &quot;type&quot;: &quot;loopback&quot;<br>}<br>EOF</pre><p>Create containerd configuration (same as node-0):</p><pre>cat &gt; /etc/containerd/config.toml &lt;&lt; EOF<br>version = 2<br><br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;]<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd]<br>    snapshotter = &quot;overlayfs&quot;<br>    default_runtime_name = &quot;runc&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc]<br>    runtime_type = &quot;io.containerd.runc.v2&quot;<br>  [plugins.&quot;io.containerd.grpc.v1.cri&quot;.containerd.runtimes.runc.options]<br>    SystemdCgroup = true<br>[plugins.&quot;io.containerd.grpc.v1.cri&quot;.cni]<br>  bin_dir = &quot;/opt/cni/bin&quot;<br>  conf_dir = &quot;/etc/cni/net.d&quot;<br>EOF<br><br>cat &gt; /etc/systemd/system/containerd.service &lt;&lt; EOF<br>[Unit]<br>Description=containerd container runtime<br>Documentation=https://containerd.io<br>After=network.target<br><br>[Service]<br>ExecStartPre=/sbin/modprobe 
overlay<br>ExecStart=/bin/containerd<br>Restart=always<br>RestartSec=5<br>Delegate=yes<br>KillMode=process<br>OOMScoreAdjust=-999<br>LimitNOFILE=1048576<br>LimitNPROC=infinity<br>LimitCORE=infinity<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Move certificates and kubeconfig:</p><pre>mv ~/ca.pem /var/lib/kubernetes/<br>mv ~/node-1.pem ~/node-1-key.pem /var/lib/kubelet/<br>mv ~/node-1.kubeconfig /var/lib/kubelet/kubeconfig</pre><p>Create kubelet configuration. Note the different certificate filenames and podCIDR:</p><pre>cat &gt; /var/lib/kubelet/kubelet-config.yaml &lt;&lt; EOF<br>kind: KubeletConfiguration<br>apiVersion: kubelet.config.k8s.io/v1beta1<br>authentication:<br>  anonymous:<br>    enabled: false<br>  webhook:<br>    enabled: true<br>  x509:<br>    clientCAFile: &quot;/var/lib/kubernetes/ca.pem&quot;<br>authorization:<br>  mode: Webhook<br>cgroupDriver: systemd<br>clusterDomain: &quot;cluster.local&quot;<br>clusterDNS:<br>  - &quot;10.32.0.10&quot;<br>podCIDR: &quot;10.200.1.0/24&quot;<br>resolvConf: &quot;/run/systemd/resolve/resolv.conf&quot;<br>runtimeRequestTimeout: &quot;15m&quot;<br>tlsCertFile: &quot;/var/lib/kubelet/node-1.pem&quot;<br>tlsPrivateKeyFile: &quot;/var/lib/kubelet/node-1-key.pem&quot;<br>registerNode: true<br>EOF<br><br>cat &gt; /etc/systemd/system/kubelet.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Kubelet<br>Documentation=https://github.com/kubernetes/kubernetes<br>After=containerd.service<br>Requires=containerd.service<br><br>[Service]<br>ExecStart=/usr/local/bin/kubelet \\<br>  --config=/var/lib/kubelet/kubelet-config.yaml \\<br>  --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock \\<br>  --kubeconfig=/var/lib/kubelet/kubeconfig \\<br>  --v=2<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Move kube-proxy kubeconfig and create configuration:</p><pre>mv ~/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig</pre><pre>cat &gt; 
/var/lib/kube-proxy/kube-proxy-config.yaml &lt;&lt; EOF<br>kind: KubeProxyConfiguration<br>apiVersion: kubeproxy.config.k8s.io/v1alpha1<br>clientConnection:<br>  kubeconfig: &quot;/var/lib/kube-proxy/kubeconfig&quot;<br>mode: &quot;iptables&quot;<br>clusterCIDR: &quot;10.200.0.0/16&quot;<br>EOF</pre><pre>cat &gt; /etc/systemd/system/kube-proxy.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Kube Proxy<br>Documentation=https://github.com/kubernetes/kubernetes<br>After=network.target<br><br>[Service]<br>ExecStart=/usr/local/bin/kube-proxy \\<br>  --config=/var/lib/kube-proxy/kube-proxy-config.yaml<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Start the services:</p><pre>systemctl daemon-reload<br>systemctl enable containerd kubelet kube-proxy<br>systemctl start containerd kubelet kube-proxy</pre><p>Verify they are running:</p><pre>systemctl status containerd kubelet kube-proxy --no-pager</pre><p>All three should show active (running).</p><p>Exit from node-1 and go back to jumpbox:</p><pre>exit</pre><p>Now let us check if the nodes have registered with the cluster:</p><pre>kubectl get nodes</pre><p>You should see both nodes, possibly still in NotReady status right after registration:</p><pre>NAME     STATUS     ROLES    AGE   VERSION<br>node-0   NotReady   &lt;none&gt;   1m    v1.32.0<br>node-1   NotReady   &lt;none&gt;   30s   v1.32.0</pre><p>A node reports NotReady until kubelet and the container runtime finish initializing, and kubelet keeps it NotReady for as long as the runtime reports that the network plugin is not ready. Missing routes between nodes do not by themselves cause NotReady, because the bridge CNI plugin only needs to work for local pods.</p><p>If a node stays NotReady, check its conditions:</p><pre>kubectl describe node node-0 | grep -A8 Conditions</pre><p>Look for the Ready condition. If it says “container runtime network not ready”, that is a CNI issue. 
If it says something about kubelet, check the kubelet logs on that node.</p><p>In some cases, the nodes might show Ready even before we set up routes, because the CNI plugin is configured correctly for local pods. Let us proceed with setting up routes anyway, as they are necessary for pods on different nodes to communicate.</p><p>Here is the problem we need to solve. Node-0 has pods in the 10.200.0.0/24 range. Node-1 has pods in the 10.200.1.0/24 range. When a pod on node-0 wants to talk to a pod on node-1, the packet needs to get from node-0 to node-1. But node-0 does not know that 10.200.1.0/24 is reachable via node-1’s IP address.</p><p>We need to add static routes on each machine so they know how to reach each other’s pod networks.</p><p>On the server (control plane), we need routes to both worker nodes’ pod networks in case we want to access pods from there. SSH to server:</p><pre>ssh root@server</pre><p>Add the routes:</p><pre>ip route add 10.200.0.0/24 via 192.168.10.101<br>ip route add 10.200.1.0/24 via 192.168.10.102</pre><p>Verify:</p><pre>ip route</pre><p>You should see the new routes listed.</p><p>Exit and go to node-0:</p><pre>exit<br>ssh root@node-0</pre><p>Add the route to node-1’s pod network:</p><pre>ip route add 10.200.1.0/24 via 192.168.10.102</pre><p>Verify:</p><pre>ip route</pre><p>Exit and go to node-1:</p><pre>exit<br>ssh root@node-1</pre><p>Add the route to node-0’s pod network:</p><pre>ip route add 10.200.0.0/24 via 192.168.10.101</pre><p>Verify:</p><pre>ip route</pre><p>Exit back to jumpbox:</p><pre>exit</pre><p>These routes are not persistent. If you reboot a machine, the routes will be gone. For a production setup, you would add them to the network configuration files or use a CNI plugin like Calico or Flannel that handles routing automatically. 
For our lab, we can just re-add them if needed.</p><p>Now let us check the nodes again:</p><pre>kubectl get nodes</pre><p>You should see:</p><pre>NAME     STATUS   ROLES    AGE   VERSION<br>node-0   Ready    &lt;none&gt;   5m    v1.32.0<br>node-1   Ready    &lt;none&gt;   4m    v1.32.0</pre><p>Both nodes should now show Ready. If they still show NotReady, check the kubelet logs on each node for errors.</p><p>Let us get more details about the nodes:</p><pre>kubectl get nodes -o wide</pre><p>This shows additional information including the internal IP, OS, kernel version, and container runtime.</p><pre>kubectl describe node node-0</pre><p>This shows detailed information about node-0, including its capacity, allocatable resources, conditions, and events. Look at the Conditions section to verify that all conditions are healthy.</p><p>Let us create a simple pod to verify everything works. Create a test deployment:</p><pre>kubectl create deployment nginx --image=nginx --replicas=2</pre><p>Watch the pods come up:</p><pre>kubectl get pods -o wide -w</pre><p>Press Ctrl+C after you see both pods running. You should see something like:</p><pre>NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE<br>nginx-77b4fdf86c-abc12   1/1     Running   0          30s   10.200.0.2   node-0<br>nginx-77b4fdf86c-def34   1/1     Running   0          30s   10.200.1.2   node-1</pre><p>The scheduler has placed one pod on each node, and each pod has an IP from its node’s pod CIDR.</p><p>Let us verify the pods can communicate with each other. First, get the pod IPs:</p><pre>kubectl get pods -o wide</pre><p>Note the IP addresses. 
Let us say the pod on node-0 has IP 10.200.0.2 and the pod on node-1 has IP 10.200.1.2.</p><p>Exec into the pod on node-0 so we can send a request to the pod on node-1. The field selector makes sure we pick a pod that is actually running on node-0:</p><pre>kubectl exec -it $(kubectl get pods --field-selector spec.nodeName=node-0 -o jsonpath=&#39;{.items[0].metadata.name}&#39;) -- /bin/bash</pre><p>Inside the pod, try to reach the other pod:</p><pre>apt-get update &amp;&amp; apt-get install -y curl<br>curl -s 10.200.1.2</pre><p>You should see the nginx welcome page HTML. If this works, pod-to-pod networking across nodes is functional.</p><p>Exit the pod:</p><pre>exit</pre><p>Let me walk through what happens when you create a pod, so you can see how all the pieces fit together.</p><ol><li>When you run kubectl create deployment, kubectl sends a request to the API server to create a Deployment object.</li><li>The API server authenticates your request using the admin certificate, authorizes it via RBAC (the system:masters group has full access), and stores the Deployment in etcd.</li><li>The Deployment controller, running in kube-controller-manager, notices the new Deployment and creates a ReplicaSet.</li><li>The ReplicaSet controller notices the ReplicaSet and creates Pod objects (two of them, since we specified replicas=2).</li><li>The scheduler notices the new Pods that have no node assigned. It evaluates each node’s capacity and constraints, selects a node for each pod, and updates the Pod objects with the node assignment.</li><li>The kubelet on each selected node notices that it has a new pod to run. It calls containerd via CRI to pull the nginx image and create a container.</li><li>Containerd pulls the image layers, creates an overlayfs mount for the container filesystem, and calls runc to create the actual container process with the appropriate namespaces and cgroups.</li><li>Before the application container starts, containerd creates the pod sandbox and invokes the CNI plugin to set up networking. 
The bridge plugin creates a veth pair, puts one end in the container’s network namespace, connects the other end to the cni0 bridge, and assigns an IP address from the node’s pod CIDR.</li><li>The container starts and nginx begins listening on port 80.</li><li>Kubelet reports the pod status back to the API server, and you can see it with kubectl get pods.</li></ol><p>This whole process happens in seconds, but there is a lot of machinery involved. Every piece we configured plays a role.</p><p>Let us remove the test deployment:</p><pre>kubectl delete deployment nginx</pre><p>Verify the pods are gone:</p><pre>kubectl get pods</pre><p>You should see “No resources found”.</p><p>We have accomplished a lot in this part. Let us recap:</p><ol><li>We installed containerd as the container runtime on both worker nodes, configuring it with the proper cgroup driver and CNI settings.</li><li>We installed and configured kubelet on each node, setting up authentication, authorization, and integration with containerd.</li><li>We installed and configured kube-proxy on each node to handle service routing via iptables.</li><li>We set up CNI networking using the bridge plugin, giving each node its own pod CIDR range.</li><li>We configured static routes so that pods on different nodes can communicate with each other.</li><li>We verified that nodes registered with the cluster and show Ready status.</li><li>We tested pod deployment and cross-node pod communication.</li></ol><p>The cluster is now fully functional. You can deploy workloads, and they will be scheduled across the two worker nodes. Pods can communicate with each other regardless of which node they are on.</p><p>There are still some things missing that you would want in a production cluster. We do not have DNS, so pods cannot resolve service names. We do not have a storage solution for persistent volumes. We do not have an ingress controller for external access. 
But the core cluster is working.</p><p>In Part 4, we will run a comprehensive smoke test to verify all aspects of the cluster, including secrets encryption, service exposure via NodePort, kubectl exec and logs, and more. We will also discuss what we learned and what you might want to explore next.</p><p>Before moving on, make sure you can run these commands successfully from the jumpbox:</p><pre>kubectl get nodes<br>kubectl get pods -A<br>kubectl get namespaces</pre><p>The nodes should show Ready. There should be no pods in any namespace (we deleted the test deployment). The namespaces should include default, kube-system, kube-public, and kube-node-lease.</p><h3>The Fourth Part: Smoke Test and Conclusion</h3><p>We have come a long way. In Part 1, we set up our lab environment with four virtual machines. In Part 2, we created certificates, configured etcd, and brought up the control plane. In Part 3, we installed containerd, kubelet, and kube-proxy on the worker nodes and configured pod networking.</p><p>Now it is time to verify that everything actually works. A smoke test is a quick set of checks that exercises the major functionality of a system. If the smoke test passes, you can be reasonably confident that the system is working correctly. If it fails, you know something is wrong and can start debugging.</p><p>We will test data encryption, pod deployment, kubectl commands like port-forward, logs, and exec, service exposure via NodePort, and cross-node pod communication. 
By the end, you will have confidence that your hand-built cluster is fully functional.</p><p>Make sure you are on the jumpbox and can communicate with the cluster:</p><pre>vagrant ssh jumpbox</pre><p>Verify kubectl is working:</p><pre>kubectl get nodes</pre><p>You should see both worker nodes in Ready state:</p><pre>NAME     STATUS   ROLES    AGE   VERSION<br>node-0   Ready    &lt;none&gt;   30m   v1.32.0<br>node-1   Ready    &lt;none&gt;   29m   v1.32.0</pre><p>If the nodes show NotReady or you see connection errors, go back to Part 3 and verify that all services are running on each node.</p><p>In Part 2, we configured the API server to encrypt secrets before storing them in etcd. Let us verify this is actually working.</p><p>Create a test secret:</p><pre>kubectl create secret generic smoke-test-secret \<br>  --from-literal=username=admin \<br>  --from-literal=password=supersecretpassword</pre><p>Verify the secret was created:</p><pre>kubectl get secret smoke-test-secret</pre><p>You should see:</p><pre>NAME                TYPE     DATA   AGE<br>smoke-test-secret   Opaque   2      5s</pre><p>Now let us retrieve the secret through kubectl and verify we can read it:</p><pre>kubectl get secret smoke-test-secret -o jsonpath=&#39;{.data.password}&#39; | base64 -d<br>echo</pre><p>This should print “supersecretpassword”. The base64 decode is necessary because Kubernetes stores secret values as base64-encoded strings in the API response.</p><p>But here is the important part: let us check how the secret is actually stored in etcd. SSH into the server:</p><pre>ssh root@server</pre><p>Query etcd directly for the secret:</p><pre>etcdctl get /registry/secrets/default/smoke-test-secret | hexdump -C | head -30</pre><p>Look at the output carefully. 
Near the beginning, you should see something like:</p><pre>00000030  79 0a 6b 38 73 3a 65 6e  63 3a 61 65 73 63 62 63  |y.k8s:enc:aescbc|<br>00000040  3a 76 31 3a 6b 65 79 31  3a                       |:v1:key1:|</pre><p>The “k8s:enc:aescbc:v1:key1” prefix indicates that the data is encrypted using AES-CBC with the key named “key1” that we configured in the encryption config. The actual secret values (username and password) are not visible in plaintext.</p><p>If you see “k8s:enc:aescbc” in the output, encryption is working correctly. If you see the plaintext values “admin” or “supersecretpassword” anywhere in the hexdump, something is wrong with the encryption configuration.</p><p>Exit back to jumpbox:</p><pre>exit</pre><p>This encryption protects your secrets if someone gains access to etcd directly, through a backup, or by stealing the disk. They would need the encryption key to decrypt the data. Of course, anyone with kubectl access and appropriate RBAC permissions can still read secrets through the API, but that is expected and controlled through authorization policies.</p><p>Let us test that the scheduler correctly places pods across nodes and that containerd can pull images and run containers.</p><p>Create a deployment with multiple replicas:</p><pre>kubectl create deployment nginx --image=nginx:latest --replicas=3</pre><p>Watch the pods come up:</p><pre>kubectl get pods -o wide -w</pre><p>You should see the pods transition from Pending to ContainerCreating to Running. The scheduler should distribute them across both nodes. 
Press Ctrl+C once all pods show Running.</p><p>Let us verify the distribution:</p><pre>kubectl get pods -o wide</pre><p>You should see something like:</p><pre>NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE<br>nginx-77b4fdf86c-2xkpm   1/1     Running   0          45s   10.200.0.2   node-0<br>nginx-77b4fdf86c-8fwqz   1/1     Running   0          45s   10.200.1.2   node-1<br>nginx-77b4fdf86c-kx9rm   1/1     Running   0          45s   10.200.0.3   node-0</pre><p>The exact distribution depends on the scheduler’s algorithm, but you should see pods on both nodes. Notice that each pod has an IP from its node’s pod CIDR: pods on node-0 get IPs from 10.200.0.0/24, and pods on node-1 get IPs from 10.200.1.0/24.</p><p>If pods are stuck in Pending, check if there are any events:</p><pre>kubectl describe pod &lt;pod-name&gt;</pre><p>Look at the Events section at the bottom. Common issues include insufficient resources, image pull failures, or scheduler errors.</p><p>If pods are stuck in ContainerCreating, the issue is usually with containerd or CNI. SSH into the affected node and check the kubelet logs:</p><pre>journalctl -u kubelet --no-pager -n 100</pre><p>The kubectl port-forward command creates a tunnel from your local machine to a pod. This is useful for debugging and accessing services that are not exposed externally.</p><p>First, get the name of one of the nginx pods:</p><pre>POD_NAME=$(kubectl get pods -l app=nginx -o jsonpath=&#39;{.items[0].metadata.name}&#39;)<br>echo $POD_NAME</pre><p>Now create a port forward. We will forward local port 8080 to the pod’s port 80:</p><pre>kubectl port-forward $POD_NAME 8080:80 &amp;</pre><p>The ampersand runs the command in the background so we can continue using the terminal. 
You should see:</p><pre>Forwarding from 127.0.0.1:8080 -&gt; 80</pre><p>Now test the connection:</p><pre>curl -s http://127.0.0.1:8080 | head -5</pre><p>You should see the beginning of the nginx welcome page:</p><pre>&lt;!DOCTYPE html&gt;<br>&lt;html&gt;<br>&lt;head&gt;<br>&lt;title&gt;Welcome to nginx!&lt;/title&gt;<br>&lt;style&gt;</pre><p>This proves that port forwarding is working. The request went from curl to kubectl, through the API server, to the kubelet on the target node, and finally to the nginx container.</p><p>Clean up the port forward:</p><pre>kill %1</pre><p>If you see “Handling connection for 8080” messages, that is normal. It just means the port forward handled a request.</p><p>The kubectl logs command retrieves the stdout and stderr output from a container. This is essential for debugging application issues.</p><p>First, let us generate some log entries by making requests to nginx. We will use the pod IP directly:</p><pre>POD_IP=$(kubectl get pod $POD_NAME -o jsonpath=&#39;{.status.podIP}&#39;)<br>echo &quot;Pod IP: $POD_IP&quot;</pre><p>Make a few requests to generate access logs. 
We need to do this from a node that can reach the pod network, so let us SSH to server (which has routes to both pod networks):</p><pre>ssh root@server &quot;curl -s http://$POD_IP &gt; /dev/null&quot;<br>ssh root@server &quot;curl -s http://$POD_IP &gt; /dev/null&quot;<br>ssh root@server &quot;curl -s http://$POD_IP &gt; /dev/null&quot;</pre><p>Now retrieve the logs:</p><pre>kubectl logs $POD_NAME</pre><p>You should see nginx startup messages followed by access log entries:</p><pre>/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration<br>/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/<br>...<br>192.168.10.100 - - [10/Jan/2026:15:30:45 +0000] &quot;GET / HTTP/1.1&quot; 200 615 &quot;-&quot; &quot;curl/7.88.1&quot; &quot;-&quot;<br>192.168.10.100 - - [10/Jan/2026:15:30:46 +0000] &quot;GET / HTTP/1.1&quot; 200 615 &quot;-&quot; &quot;curl/7.88.1&quot; &quot;-&quot;<br>192.168.10.100 - - [10/Jan/2026:15:30:47 +0000] &quot;GET / HTTP/1.1&quot; 200 615 &quot;-&quot; &quot;curl/7.88.1&quot; &quot;-&quot;</pre><p>The access logs show the requests we made. The source IP is 192.168.10.100 (the server node) because that is where we ran curl from.</p><p>You can also follow logs in real-time with the -f flag:</p><pre>kubectl logs -f $POD_NAME &amp;</pre><p>Make another request:</p><pre>ssh root@server &quot;curl -s http://$POD_IP &gt; /dev/null&quot;</pre><p>You should see a new log line appear. Stop the log follow:</p><pre>kill %1</pre><p>The kubectl exec command runs a command inside a running container. 
This is invaluable for debugging when you need to inspect the container’s filesystem, check running processes, or test network connectivity from inside the container.</p><p>Run a simple command:</p><pre>kubectl exec $POD_NAME -- nginx -v</pre><p>You should see:</p><pre>nginx version: nginx/1.27.3</pre><p>The double dash separates kubectl arguments from the command to run in the container.</p><p>Now let us get an interactive shell:</p><pre>kubectl exec -it $POD_NAME -- /bin/bash</pre><p>You are now inside the nginx container. Let us explore:</p><pre>hostname</pre><p>This shows the pod name.</p><pre>cat /etc/os-release | head -3</pre><p>This shows the base image OS (Debian for the official nginx image).</p><pre>ps aux</pre><p>You should see the nginx master and worker processes.</p><pre>ls /usr/share/nginx/html/</pre><p>This is where nginx serves files from.</p><p>Let us check the network configuration inside the container:</p><pre>cat /etc/resolv.conf</pre><p>You should see:</p><pre>search default.svc.cluster.local svc.cluster.local cluster.local<br>nameserver 10.32.0.10<br>options ndots:5</pre><p>This shows that Kubernetes configured the pod to use 10.32.0.10 as its DNS server. We have not deployed CoreDNS yet, so DNS resolution will not work, but kubelet correctly configured the pod.</p><p>Exit the container:</p><pre>exit</pre><p>So far we have only accessed pods from inside the cluster network. In a real application, you need to expose services to external users. 
One way to do this is with a NodePort service, which opens a port on every node that forwards traffic to the pods.</p><p>Create a NodePort service for our nginx deployment:</p><pre>kubectl expose deployment nginx --type=NodePort --port=80</pre><p>Check the service:</p><pre>kubectl get service nginx</pre><p>You should see something like:</p><pre>NAME    TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE<br>nginx   NodePort   10.32.0.123   &lt;none&gt;        80:31234/TCP   5s</pre><p>The PORT(S) column shows “80:31234/TCP”, meaning the service listens on port 80 internally (ClusterIP) and port 31234 on every node (NodePort). The exact NodePort number is randomly assigned from the range 30000–32767.</p><p>Let us extract the NodePort:</p><pre>NODE_PORT=$(kubectl get service nginx -o jsonpath=&#39;{.spec.ports[0].nodePort}&#39;)<br>echo &quot;NodePort: $NODE_PORT&quot;</pre><p>Now we can access nginx through any node’s IP address on this port. From the jumpbox:</p><pre>curl -s http://192.168.10.101:$NODE_PORT | head -5</pre><p>This goes to node-0. You should see the nginx welcome page.</p><p>Try node-1 as well:</p><pre>curl -s http://192.168.10.102:$NODE_PORT | head -5</pre><p>Same result. The NodePort works on every node, regardless of which node the pod is actually running on. Kube-proxy handles the routing automatically using iptables rules.</p><p>The server node (control plane) is different. We did not install kube-proxy there, so nothing on the server handles NodePort traffic, and this request is expected to fail (with -s, curl prints nothing):</p><pre>curl -s http://192.168.10.100:$NODE_PORT | head -5</pre><p>NodePort access works only on machines that run kube-proxy, which in our cluster means the two worker nodes.</p><p>Let us also verify the ClusterIP works. 
The ClusterIP is only accessible from within the cluster network (from nodes or pods). SSH to a worker node and test:</p><pre>CLUSTER_IP=$(kubectl get service nginx -o jsonpath=&#39;{.spec.clusterIP}&#39;)<br>echo &quot;ClusterIP: $CLUSTER_IP&quot;<br>ssh root@node-0 &quot;curl -s http://$CLUSTER_IP | head -3&quot;</pre><p>You should see the nginx welcome page. The ClusterIP is a virtual IP that kube-proxy maps to the backend pods.</p><p>This test verifies that our CNI configuration and routing are working correctly for cross-node pod communication.</p><p>First, let us identify pods on different nodes:</p><pre>kubectl get pods -o wide</pre><p>Pick one pod on node-0 and one on node-1. Let us store their names and IPs:</p><pre>POD_ON_NODE0=$(kubectl get pods -o jsonpath=&#39;{.items[?(@.spec.nodeName==&quot;node-0&quot;)].metadata.name}&#39; | awk &#39;{print $1}&#39;)<br>POD_ON_NODE1=$(kubectl get pods -o jsonpath=&#39;{.items[?(@.spec.nodeName==&quot;node-1&quot;)].metadata.name}&#39; | awk &#39;{print $1}&#39;)<br><br>IP_ON_NODE0=$(kubectl get pod $POD_ON_NODE0 -o jsonpath=&#39;{.status.podIP}&#39;)<br>IP_ON_NODE1=$(kubectl get pod $POD_ON_NODE1 -o jsonpath=&#39;{.status.podIP}&#39;)<br><br>echo &quot;Pod on node-0: $POD_ON_NODE0 ($IP_ON_NODE0)&quot;<br>echo &quot;Pod on node-1: $POD_ON_NODE1 ($IP_ON_NODE1)&quot;</pre><p>Now exec into the pod on node-0 and try to reach the pod on node-1:</p><pre>kubectl exec $POD_ON_NODE0 -- curl -s --connect-timeout 5 http://$IP_ON_NODE1 | head -3</pre><p>You should see the nginx welcome page. This proves that:</p><ol><li>The pod on node-0 can send packets to the 10.200.1.0/24 network.</li><li>The routing on node-0 correctly forwards these packets to node-1.</li><li>Node-1 receives the packets and delivers them to the correct pod.</li><li>The response makes it back through the same path.</li></ol><p>Let us test the reverse direction:</p><pre>kubectl exec $POD_ON_NODE1 -- curl -s --connect-timeout 5 http://$IP_ON_NODE0 | head -3</pre><p>Same result. 
Bidirectional cross-node pod communication is working.</p><p>If these tests fail with a timeout, the most likely issue is the static routes we added in Part 3. SSH into each node and verify the routes are still present:</p><pre>ssh root@node-0 &quot;ip route | grep 10.200&quot;<br>ssh root@node-1 &quot;ip route | grep 10.200&quot;</pre><p>You should see routes for the other node’s pod CIDR. If the routes are missing, re-add them:</p><pre>ssh root@node-0 &quot;ip route add 10.200.1.0/24 via 192.168.10.102&quot;<br>ssh root@node-1 &quot;ip route add 10.200.0.0/24 via 192.168.10.101&quot;</pre><p>Let us verify that containerd is working correctly by using crictl, the CRI command-line tool.</p><p>SSH into node-0:</p><pre>ssh root@node-0</pre><p>List running containers:</p><pre>crictl ps</pre><p>You should see the nginx containers. The output shows container ID, image, creation time, state, name, and pod ID.</p><p>List all pods known to containerd:</p><pre>crictl pods</pre><p>This shows pods managed by the CRI. 
Each pod has a pause container (the infrastructure container that holds the network namespace) plus the application containers.</p><p>Check container details:</p><pre>CONTAINER_ID=$(crictl ps -q | head -1)<br>crictl inspect $CONTAINER_ID | head -50</pre><p>This shows detailed information about the container including its configuration, mounts, and runtime state.</p><p>Check containerd service status:</p><pre>systemctl status containerd --no-pager</pre><p>It should show active (running) with no errors.</p><p>Exit back to jumpbox:</p><pre>exit</pre><p>Let us clean up the resources we created during testing:</p><pre>kubectl delete deployment nginx<br>kubectl delete service nginx<br>kubectl delete secret smoke-test-secret</pre><p>Verify everything is gone:</p><pre>kubectl get all</pre><p>You should see only the kubernetes service in the default namespace:</p><pre>NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE<br>service/kubernetes   ClusterIP   10.32.0.1    &lt;none&gt;        443/TCP   2h</pre><p>Let us step back and appreciate what we accomplished. We built a fully functional Kubernetes cluster from scratch, without using any automation tools. Here is what our cluster consists of:</p><p>On the jumpbox, we have our administration environment with kubectl configured to communicate with the cluster. This machine holds all our certificates and serves as the entry point for cluster management.</p><p>On the server node, we run the control plane components. etcd stores all cluster state in a key-value format, with secrets encrypted at rest using AES-CBC. The kube-apiserver handles all API requests, authenticating clients using TLS certificates, authorizing requests using RBAC, and persisting changes to etcd. The kube-controller-manager runs control loops that watch for changes and reconcile current state with desired state. 
The kube-scheduler assigns pods to nodes based on resource requirements and constraints.</p><p>On node-0 and node-1, we run the worker node components. containerd manages the container lifecycle, pulling images and creating containers using runc. kubelet watches the API server for pod assignments, tells containerd to start containers, sets up volumes and networking, and reports status back to the control plane. kube-proxy maintains iptables rules for service routing, ensuring that traffic to ClusterIP and NodePort services reaches the correct pods.</p><p>For networking, we use the CNI bridge plugin which creates a Linux bridge on each node and assigns pod IPs from node-specific CIDR ranges. Static routes between nodes enable cross-node pod communication.</p><p>The entire system is secured with TLS. Every component has its own certificate for authentication. The API server verifies client certificates against the cluster CA. Components authenticate to the API server using their certificates, and RBAC rules control what each component can do.</p><p>While our cluster is functional, there are several things you would need to add before running production workloads.</p><p>DNS is not configured. Pods cannot resolve service names like “nginx.default.svc.cluster.local”. In production, you would deploy CoreDNS as a cluster add-on. CoreDNS runs as pods in the kube-system namespace and provides DNS resolution for services and pods. Without it, applications that rely on service discovery by name will not work.</p><p>High availability is not implemented. We have a single control plane node, which is a single point of failure. If the server goes down, the cluster becomes unmanageable. Production clusters typically run three or more control plane nodes with etcd distributed across them. The API server, controller manager, and scheduler can all run in HA mode with leader election.</p><p>Load balancing for the API server is missing. 
Clients connect directly to the single API server at 192.168.10.100:6443. In production, you would put a load balancer in front of multiple API servers so clients can connect to any of them.</p><p>Persistent storage is not available. Our cluster has no storage provisioner, so PersistentVolumeClaims cannot be fulfilled. Production clusters need a storage solution like local-path-provisioner for development, or cloud provider integrations like EBS CSI driver for AWS, or distributed storage like Ceph or Longhorn.</p><p>Ingress controller is not deployed. NodePort services work for basic external access, but they require knowing the node IP and port. An ingress controller like nginx-ingress or Traefik provides HTTP routing based on hostnames and paths, TLS termination, and other features needed for web applications.</p><p>Monitoring and logging are not configured. There is no way to see cluster-wide metrics or aggregate logs from all pods. Production clusters run Prometheus for metrics collection, Grafana for visualization, and something like Loki or the ELK stack for log aggregation.</p><p>Network policies are not enforced. Any pod can communicate with any other pod. In production, you would want network policies to restrict traffic between namespaces and applications. This requires a CNI plugin that supports network policies, like Calico or Cilium, rather than the simple bridge plugin we used.</p><p>Node auto-scaling is not possible. If you need more capacity, you have to manually provision new nodes. Managed Kubernetes services integrate with cloud auto-scaling groups to automatically add or remove nodes based on demand.</p><p>Certificate rotation is not automated. Our certificates expire in one year. Before that happens, you need to regenerate and redistribute them. Production setups use cert-manager or built-in certificate rotation to handle this automatically.</p><p>Throughout this tutorial, you may have encountered issues. 
Here are some common problems and how to diagnose them.</p><p>If nodes show NotReady status, check the kubelet logs on the affected node. SSH into the node and run journalctl -u kubelet -n 100. Look for error messages about CNI, container runtime, or certificate issues. Also check that containerd is running with systemctl status containerd. A common cause is mismatched cgroup drivers between containerd and kubelet.</p><p>If pods are stuck in Pending state, describe the pod with kubectl describe pod to see events. If the event says “no nodes available to schedule pods”, check if nodes are Ready. If it mentions insufficient resources, the nodes may not have enough CPU or memory. If it mentions taints, the nodes may have taints that the pod does not tolerate.</p><p>If pods are stuck in ContainerCreating state, the issue is usually with image pulling or CNI setup. Check kubelet logs on the node where the pod is scheduled. Common issues include the node being unable to reach the container registry, or CNI plugin misconfiguration. Verify the CNI config files exist in /etc/cni/net.d/ and the CNI binaries exist in /opt/cni/bin/.</p><p>If pods cannot communicate across nodes, verify the static routes are in place. On each node, run ip route and confirm there is a route to the other node’s pod CIDR. If routes are missing, re-add them. Also check that IP forwarding is enabled with sysctl net.ipv4.ip_forward, which should return 1.</p><p>If kubectl commands fail with certificate errors, the issue is with your kubeconfig or certificates. Common errors include “certificate signed by unknown authority” which means the CA does not match, or “certificate has expired” which means you need to regenerate certificates. Verify your kubeconfig points to the correct CA and client certificates.</p><p>If the API server is not responding, SSH into the server and check systemctl status kube-apiserver. Look at the logs with journalctl -u kube-apiserver -n 100. 
Common issues include etcd not running, wrong certificate paths, or binding to the wrong address.</p><p>If etcd is not working, check its status with systemctl status etcd and logs with journalctl -u etcd -n 100. Verify the data directory exists and has correct permissions. A common issue is disk space running out, which causes etcd to stop accepting writes.</p><p>When you are done experimenting with the cluster, you can destroy the virtual machines to free up resources.</p><p>Exit from the jumpbox if you are connected:</p><pre>exit</pre><p>From your Mac, in the kubernetes-the-hard-way directory:</p><pre>vagrant destroy -f</pre><p>This stops and deletes all four virtual machines. The -f flag skips the confirmation prompt.</p><p>If you want to start over, you can run vagrant up again to create fresh VMs and repeat the tutorial.</p><p>To clean up the Vagrant box images that were downloaded:</p><pre>vagrant box list<br>vagrant box remove bento/debian-12</pre><p>This removes the base image, freeing up disk space. It will be re-downloaded if you run vagrant up again.</p><p>Now that you understand how Kubernetes works at a fundamental level, here are some directions you might explore.</p><p>Deploy CoreDNS to enable service discovery. The Kubernetes documentation has guides for deploying CoreDNS as a cluster add-on. Once deployed, pods will be able to resolve service names.</p><p>Try a more advanced CNI plugin like Calico or Cilium. These provide network policies, better performance, and additional features. You would replace the bridge CNI configuration with the new plugin’s configuration.</p><p>Add a third worker node. This involves creating a new VM, generating certificates for it, installing the worker components, and adding the appropriate routes. It is good practice and helps you understand the scaling process.</p><p>Set up etcd clustering with three nodes. 
This requires running etcd on multiple machines with peer discovery and reconfiguring the API server to connect to all etcd endpoints.</p><p>Explore the Kubernetes API directly. Use curl with your client certificate to make requests to the API server. This helps you understand what kubectl is doing under the hood.</p><p>Read through the source code of kubelet or kube-scheduler. These are written in Go and are reasonably readable. Understanding the code deepens your knowledge of how Kubernetes actually works.</p><p>Try kubeadm and compare. After doing everything manually, run through the kubeadm installation process and see how much it automates. You will appreciate the tooling more and understand what it is doing.</p><p>I want to share some personal thoughts on going through this process.</p><p>Building a cluster from scratch is frustrating. There are so many moving parts, and when something does not work, it can be hard to figure out why. I spent hours debugging certificate issues that turned out to be typos in filenames. I lost time to routing problems that were caused by forgetting to add routes after a VM reboot.</p><p>But the frustration is worth it. When you use kubeadm or a managed service, Kubernetes is a black box. You run a command and a cluster appears. When something breaks, you have no idea where to look. After building a cluster manually, you understand what is inside the box. You know that kubelet talks to containerd over a Unix socket. You know that kube-proxy creates iptables rules. You know that etcd stores everything and that the API server is the gateway to it all.</p><p>This knowledge is invaluable for troubleshooting. When a pod is stuck, you know to check kubelet logs. When services are not working, you know to check kube-proxy and iptables. When the cluster is slow, you know to check etcd performance. 
You can reason about problems instead of just searching for solutions.</p><p>The certificate infrastructure was the hardest part for me to understand. Why do we need so many certificates? Why does the API server need both a server certificate and a client certificate? It took a while for me to grasp that the API server acts as a server when kubectl connects to it, but acts as a client when it connects to kubelet or etcd. Each role needs its own certificate.</p><p>Networking was the second hardest part. The interaction between CNI plugins, iptables rules, and IP routing is complex. Drawing diagrams helped me understand the packet flow. If you are confused, try tracing what happens when pod A on node-0 sends a packet to pod B on node-1. Follow the packet through the veth interface, the bridge, the routing table, the physical network, and back up through the other node.</p><p>One thing that surprised me is how much of Kubernetes is just Linux. Containers are Linux namespaces and cgroups. Networking is Linux bridges and iptables. Storage is Linux mount points. Kubernetes orchestrates these primitives but does not reinvent them. Understanding Linux systems gives you a huge advantage in understanding Kubernetes.</p><p>You have now built a Kubernetes cluster the hard way. You generated certificates, configured etcd, set up the control plane, installed worker nodes, and verified everything works. You understand what each component does and how they communicate.</p><p>This knowledge will serve you well throughout your career with Kubernetes. Whether you are debugging production issues, designing cluster architecture, or evaluating managed services, you now have a mental model of what is happening under the hood.</p><p>The cluster you built is not production-ready, but it is real. It runs containers, schedules pods, and routes traffic. You can deploy applications to it and watch them run. 
That is a significant accomplishment.</p><p>From here, you might add the missing pieces to make it more production-like. Or you might tear it down and rebuild it to reinforce what you learned. Or you might move on to using kubeadm or managed services with your new understanding. Whatever you choose, you are now a more capable Kubernetes practitioner.</p><p>Thank you for following along. I hope this tutorial was helpful. If you found errors or have suggestions for improvement, I would appreciate hearing about them. Building something complex and explaining it clearly is challenging, and there is always room to do better.</p><p>Good luck with your Kubernetes journey.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Kubernetes Cluster From Scratch — Part 1 and Part 2]]></title>
            <link>https://sigridjin.medium.com/building-a-kubernetes-cluster-from-scratch-overview-and-prerequisites-498ed989fd45?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/498ed989fd45</guid>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Sat, 10 Jan 2026 17:49:21 GMT</pubDate>
            <atom:updated>2026-01-10T18:25:06.163Z</atom:updated>
            <content:encoded><![CDATA[<h3>Building a Kubernetes Cluster From Scratch — Part 1 and Part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NKoG-Bju8rtHgFe8e30Scg.png" /></figure><blockquote>See the repository if you want to follow my tutorial! <a href="https://github.com/sigridjineth/k8s-hard-way">https://github.com/sigridjineth/k8s-hard-way</a></blockquote><h3>The first part</h3><p>This is the first part of a comprehensive tutorial series where we will build a fully functional Kubernetes cluster without using any automation tools like kubeadm or kubespray. By the end of this series, you will have manually installed and configured every single component that makes up a Kubernetes cluster, giving you deep insight into how all the pieces fit together.</p><p>I spent about a week going through this process myself, and I can tell you that it was both frustrating and enlightening. There were moments when I questioned why anyone would do this manually when tools exist to automate everything. But once the cluster came to life and I could actually trace the path of a request from kubectl all the way to a running container, it all made sense. The understanding you gain from this exercise is something you simply cannot get from running a single command that sets everything up for you.</p><p>If you have been working with Kubernetes for a while, you have probably used managed services like EKS, GKE, or AKS. Maybe you have spun up local clusters with minikube or kind. These tools are fantastic for getting things done quickly, but they hide an enormous amount of complexity from you.</p><p>Here is the thing: when something breaks in production, and it will break eventually, you need to understand what is actually happening under the hood. Why is that pod stuck in Pending? Why can pods on one node not talk to pods on another node? 
Why is the API server rejecting your requests with a cryptic certificate error?</p><p>When you build a cluster by hand, you touch every certificate, every configuration file, every systemd unit. You see exactly how the API server authenticates requests, how the scheduler decides where to place pods, how kubelet communicates with the container runtime. This knowledge becomes invaluable when you need to troubleshoot real problems.</p><p>There is another practical benefit too. If you ever work in an air-gapped environment where you cannot pull images from the internet or use managed services, knowing how to set up Kubernetes manually becomes a necessity rather than a nice-to-have skill.</p><p>By the end of this tutorial series, we will have a working Kubernetes cluster with the following components:</p><p>One control plane node running etcd, kube-apiserver, kube-controller-manager, and kube-scheduler. In production you would want multiple control plane nodes for high availability, but for learning purposes a single node is sufficient.</p><p>Two worker nodes running containerd as the container runtime, kubelet as the node agent, and kube-proxy for network proxying.</p><p>A jumpbox that serves as our administration host. This is where we will generate certificates, create configuration files, and run kubectl commands. Think of it as a bastion host that provides secure access to the cluster.</p><p>The network setup uses a private network in the 192.168.10.0/24 range for communication between nodes. Each worker node gets its own pod CIDR: node-0 gets 10.200.0.0/24 and node-1 gets 10.200.1.0/24. Services will use the 10.32.0.0/24 range.</p><p>Before we begin, let me be clear about what you need. This tutorial is written for a MacBook Pro, and I have tested it on both Intel and Apple Silicon machines. The instructions should work on either, though I will point out any differences when they matter.</p><p>You will need at least 16GB of RAM on your machine. 
We are going to run four virtual machines simultaneously, and while each one is not huge, they add up. If you have only 8GB, you might be able to squeeze by with reduced memory allocations, but I would not recommend it.</p><p>You should have at least 20GB of free disk space. The base VM images plus all the Kubernetes binaries will eat through storage faster than you might expect.</p><p>I am assuming you are comfortable with the command line and have a basic understanding of networking concepts like IP addresses, subnets, and routing. You should know what TCP/IP is and have at least a vague idea of how TLS certificates work. If terms like “private key” and “certificate authority” sound completely foreign, you might want to do some background reading first.</p><p>You should also have some familiarity with Kubernetes concepts. I am not going to explain what a pod is or why you would use a deployment. If you have never used kubectl before, I would suggest playing with minikube first to get the basics down.</p><p>Let us start by installing the tools we need on your Mac. We will use Homebrew for package management because it makes everything so much easier.</p><p>If you do not have Homebrew installed yet, open Terminal and run this:</p><pre>/bin/bash -c &quot;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)&quot;</pre><p>Follow the prompts to complete the installation. If you are on Apple Silicon, Homebrew installs to /opt/homebrew, so you may need to add it to your PATH. The installer will tell you exactly what to do.</p><p>Now let us install VirtualBox. This is our virtualization platform that will run the virtual machines:</p><pre>brew install --cask virtualbox</pre><p>On macOS, you will likely see a prompt about allowing system extensions. Go to System Settings, then Privacy and Security, and approve the Oracle extension. 
You may need to reboot your Mac after this.</p><p>Once VirtualBox is installed, verify it works:</p><pre>VBoxManage --version</pre><p>You should see something like 7.0.14r161095 or similar. The exact version does not matter much as long as it is relatively recent.</p><p>Next, install Vagrant. This tool lets us define our virtual machine configuration in code and spin up multiple VMs with a single command:</p><pre>brew install --cask vagrant</pre><p>Verify the installation:</p><pre>vagrant --version</pre><p>You should see Vagrant 2.4.1 or something close to that.</p><blockquote>A quick note for Apple Silicon users: VirtualBox added support for ARM-based Macs starting with version 7.0. However, the ARM support is still maturing, and you might encounter occasional quirks. If you run into serious issues, an alternative is to use UTM or VMware Fusion, but those would require modifying the instructions throughout this tutorial. For most people, VirtualBox works fine.</blockquote><p>Let us create a dedicated directory for this project. I like to keep all my lab work organized:</p><pre>mkdir -p ~/kubernetes-the-hard-way<br>cd ~/kubernetes-the-hard-way</pre><p>This directory will contain our Vagrantfile, initialization scripts, and eventually all the certificates and configuration files we generate.</p><p>Before we create the Vagrantfile, let me explain what we are going to build and why.</p><p>We need four machines total. The jumpbox is a lightweight machine that we will use for administration. It does not need much horsepower since it just runs command-line tools. We will give it 2 CPU cores and 1.5GB of RAM.</p><p>The server machine is our control plane node. It will run etcd, the API server, and the other control plane components. These are memory-hungry processes, so we allocate 2 CPU cores and 2GB of RAM.</p><p>The two worker nodes, node-0 and node-1, will run our actual workloads. Each gets 2 CPU cores and 2GB of RAM. 
In a real cluster you would probably have more workers and give them more resources, but for learning purposes this is plenty.</p><p>All four machines will run Debian 12, which is a stable and well-supported Linux distribution. Kubernetes works on many Linux flavors, but Debian gives us a clean and predictable environment.</p><p>Now, about networking. Vagrant automatically creates a NAT network for each VM so they can access the internet. This network uses the 10.0.2.0/24 range, and every VM gets 10.0.2.15 on its first network interface. This might seem odd since they all have the same IP, but it works because each VM has its own isolated NAT network.</p><p>The second network interface is where the real action happens. We create a private network in the 192.168.10.0/24 range where all our VMs can talk to each other directly. The jumpbox gets 192.168.10.10, the server gets 192.168.10.100, and the workers get 192.168.10.101 and 192.168.10.102.</p><p>Now let us create the Vagrantfile that defines our virtual machines. 
Create a new file:</p><pre>cat &gt; Vagrantfile &lt;&lt; &#39;EOF&#39;<br>BOX_IMAGE = &quot;bento/debian-12&quot;<br>BOX_VERSION = &quot;202502.21.0&quot;<br><br>Vagrant.configure(&quot;2&quot;) do |config|<br><br>  config.vm.define &quot;jumpbox&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/k8s-hardway&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;jumpbox&quot;<br>      vb.cpus = 2<br>      vb.memory = 1536<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.hostname = &quot;jumpbox&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.10&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: 2210, auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;init.sh&quot;<br>  end<br><br>  config.vm.define &quot;server&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/k8s-hardway&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;server&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.hostname = &quot;server&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.100&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: 2200, auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;init.sh&quot;<br>  end<br><br>  config.vm.define &quot;node-0&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/k8s-hardway&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;node-0&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.hostname = &quot;node-0&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.101&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: 2201, auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;init.sh&quot;<br>  end<br><br>  config.vm.define &quot;node-1&quot; do |subconfig|<br>    subconfig.vm.box = BOX_IMAGE<br>    subconfig.vm.box_version = BOX_VERSION<br>    subconfig.vm.provider &quot;virtualbox&quot; do |vb|<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--groups&quot;, &quot;/k8s-hardway&quot;]<br>      vb.customize [&quot;modifyvm&quot;, :id, &quot;--nicpromisc2&quot;, &quot;allow-all&quot;]<br>      vb.name = &quot;node-1&quot;<br>      vb.cpus = 2<br>      vb.memory = 2048<br>      vb.linked_clone = true<br>    end<br>    subconfig.vm.hostname = &quot;node-1&quot;<br>    subconfig.vm.network &quot;private_network&quot;, ip: &quot;192.168.10.102&quot;<br>    subconfig.vm.network &quot;forwarded_port&quot;, guest: 22, host: 2202, auto_correct: true, id: &quot;ssh&quot;<br>    subconfig.vm.synced_folder &quot;./&quot;, &quot;/vagrant&quot;, disabled: true<br>    subconfig.vm.provision &quot;shell&quot;, path: &quot;init.sh&quot;<br>  end<br><br>end<br>EOF</pre><p>Let me walk through what this file does because there are some important details.</p><ol><li>BOX_IMAGE specifies which base image to use. We use bento/debian-12, which is a well-maintained Debian image from the Bento project. The version pin ensures we all use the same image and avoid surprises.</li><li>The linked_clone setting is a nice optimization. Instead of creating a full copy of the base disk for each VM, VirtualBox creates a differencing disk that only stores changes. This saves disk space and speeds up VM creation significantly.</li><li>The nicpromisc2 setting puts the second network interface in promiscuous mode. This allows it to see all network traffic on the virtual network, which is necessary for our cluster networking to work properly.</li><li>We disable the synced folder because we do not need it and it can cause issues on some setups. We will transfer files between machines using scp instead.</li></ol><p>Each VM has a forwarded port for SSH. This means you can SSH to localhost:2210 to reach the jumpbox, localhost:2200 for the server, and so on. Vagrant uses these for its own SSH connections.</p><p>Each VM needs some initial configuration when it first boots. We handle this with a shell script that Vagrant runs automatically. 
Create the init.sh file:</p><pre>cat &gt; init.sh &lt;&lt; &#39;EOF&#39;<br>#!/usr/bin/env bash<br><br>echo &quot;&gt;&gt;&gt; Starting initial configuration &lt;&lt;&lt;&quot;<br><br>echo &quot;[1/7] Configuring shell environment&quot;<br>echo &#39;alias vi=vim&#39; &gt;&gt; /etc/profile<br>echo &#39;export HISTTIMEFORMAT=&quot;%F %T &quot;&#39; &gt;&gt; /etc/profile<br>ln -sf /usr/share/zoneinfo/UTC /etc/localtime<br><br>echo &quot;[2/7] Disabling AppArmor&quot;<br>systemctl stop apparmor 2&gt;/dev/null<br>systemctl disable apparmor 2&gt;/dev/null<br><br>echo &quot;[3/7] Disabling swap&quot;<br>swapoff -a<br>sed -i &#39;/swap/s/^/#/&#39; /etc/fstab<br><br>echo &quot;[4/7] Installing required packages&quot;<br>apt-get update -qq<br>apt-get install -y -qq tree git jq curl wget vim sshpass net-tools dnsutils &gt; /dev/null 2&gt;&amp;1<br><br>echo &quot;[5/7] Setting root password&quot;<br>echo &quot;root:kubernetes&quot; | chpasswd<br><br>echo &quot;[6/7] Configuring SSH&quot;<br>sed -i &#39;s/^#*PasswordAuthentication.*/PasswordAuthentication yes/&#39; /etc/ssh/sshd_config<br>sed -i &#39;s/^#*PermitRootLogin.*/PermitRootLogin yes/&#39; /etc/ssh/sshd_config<br>systemctl restart sshd<br><br>echo &quot;[7/7] Configuring /etc/hosts&quot;<br>cat &gt;&gt; /etc/hosts &lt;&lt; HOSTS<br>192.168.10.10  jumpbox<br>192.168.10.100 server server.kubernetes.local<br>192.168.10.101 node-0 node-0.kubernetes.local<br>192.168.10.102 node-1 node-1.kubernetes.local<br>HOSTS<br><br>echo &quot;&gt;&gt;&gt; Initial configuration complete &lt;&lt;&lt;&quot;<br>EOF<br>chmod +x init.sh</pre><p>Let me explain each step because these settings are important for Kubernetes to work correctly.</p><p>Disabling AppArmor might seem like a security regression, and in production you would want to keep it enabled and properly configured. For this learning exercise, we disable it to avoid potential conflicts with container runtimes. 
AppArmor can interfere with how containerd manages containers if not configured correctly.</p><p>Disabling swap is effectively a requirement for our setup. With its default configuration, the kubelet refuses to start if swap is enabled. The reason is that Kubernetes needs to accurately track memory usage for scheduling decisions, and swap makes this unpredictable. When a node runs low on memory, Kubernetes wants to evict pods rather than have them silently slow down due to swapping.</p><p>The packages we install are utilities we will need throughout the tutorial. jq is essential for parsing JSON output from kubectl and etcdctl. sshpass lets us script SSH connections with passwords, which is convenient for automation. The networking tools help with debugging.</p><p>Setting a root password and enabling password authentication makes it easy to SSH between machines. In production you would use key-based authentication only, but for a lab environment this is much more convenient.</p><p>The /etc/hosts entries let us refer to machines by name instead of IP address. This is simpler than setting up a DNS server and works perfectly for our small cluster.</p><p>Now we are ready to create our virtual machines. Make sure you are in the project directory and run:</p><pre>vagrant up</pre><p>This command reads the Vagrantfile, downloads the base image if needed, creates all four VMs, and runs the initialization script on each one. The first run takes a while because it needs to download the Debian image, which is several hundred megabytes. Subsequent runs are much faster because the image is already cached locally and linked clones avoid copying the full disk.</p><p>You should see output for each VM as it boots and runs the provisioning script. If everything goes well, after about 5–10 minutes you will have four running virtual machines.</p><p>Check the status of your VMs. If any VM shows as not running, check the VirtualBox application for error messages. 
Common issues include not enough RAM, VirtualBox extensions not approved in system settings, or conflicts with other virtualization software.</p><pre>$ vagrant status<br><br>Current machine states:<br><br>jumpbox                   running (virtualbox)<br>server                    running (virtualbox)<br>node-0                    running (virtualbox)<br>node-1                    running (virtualbox)</pre><p>The jumpbox is our command center for the rest of this tutorial. Let us connect to it:</p><pre>vagrant ssh jumpbox</pre><p>You are now inside the jumpbox VM. Since we configured the init script to add entries to /etc/hosts, you can ping the other machines by name:</p><pre>ping -c 3 server<br>ping -c 3 node-0<br>ping -c 3 node-1</pre><p>All three should respond successfully. If any of them fail, check that the VMs are running and that the private network was set up correctly.</p><p>While we are here, let us verify the system configuration:</p><pre>cat /etc/os-release | grep PRETTY_NAME</pre><p>You should see Debian GNU/Linux 12 (bookworm).</p><p>Check that swap is disabled:</p><pre>swapon --show</pre><p>This should produce no output, indicating swap is off.</p><p>Verify the network configuration:</p><pre>ip addr show eth1</pre><p>You should see 192.168.10.10 as the IP address for the jumpbox.</p><p>Now let us test SSH connectivity to the other machines. From the jumpbox:</p><pre>ssh root@server hostname</pre><p>Enter the password “kubernetes” when prompted. You should see “server” printed as the hostname. Test the worker nodes too:</p><pre>ssh root@node-0 hostname<br>ssh root@node-1 hostname</pre><p>If all three commands work, your lab environment is set up correctly.</p><p>Typing the password every time gets old fast. Let us set up SSH keys so we can connect without passwords.</p><p>On the jumpbox, generate an SSH key pair:</p><pre>ssh-keygen -t ed25519 -N &quot;&quot; -f ~/.ssh/id_ed25519</pre><p>The ed25519 algorithm is modern and secure. 
The -N “” flag sets an empty passphrase, which is fine for a lab environment.</p><p>Now copy the public key to all other machines:</p><pre>ssh-copy-id -o StrictHostKeyChecking=no root@server<br>ssh-copy-id -o StrictHostKeyChecking=no root@node-0<br>ssh-copy-id -o StrictHostKeyChecking=no root@node-1</pre><p>Enter the password “kubernetes” for each one. After this, you can SSH without a password:</p><pre>ssh root@server &quot;echo &#39;SSH key auth working&#39;&quot;<br>ssh root@node-0 &quot;echo &#39;SSH key auth working&#39;&quot;<br>ssh root@node-1 &quot;echo &#39;SSH key auth working&#39;&quot;</pre><p>All three should print the message without asking for a password.</p><p>Throughout this tutorial, we will need to reference information about our machines: their IP addresses, hostnames, and which pod CIDR they should use. Let us create a configuration file that captures all this information.</p><p>On the jumpbox:</p><pre>cat &gt; ~/machines.txt &lt;&lt; &#39;EOF&#39;<br>192.168.10.100 server server.kubernetes.local<br>192.168.10.101 node-0 node-0.kubernetes.local 10.200.0.0/24<br>192.168.10.102 node-1 node-1.kubernetes.local 10.200.1.0/24<br>EOF</pre><p>The format is: IP address, short hostname, FQDN, and pod CIDR (for worker nodes only). We will reference this file in later scripts to avoid hardcoding values everywhere.</p><p>Before we finish this part, let us download all the binaries we will need. This takes a while, so it is better to do it now while we discuss what each component does.</p><p>First, let us set up a directory structure:</p><pre>mkdir -p ~/downloads<br>cd ~/downloads</pre><p>Now determine your architecture. 
On the jumpbox:</p><pre>ARCH=$(uname -m)<br>if [ &quot;$ARCH&quot; = &quot;x86_64&quot; ]; then<br>  ARCH=&quot;amd64&quot;<br>elif [ &quot;$ARCH&quot; = &quot;aarch64&quot; ]; then<br>  ARCH=&quot;arm64&quot;<br>fi<br>echo &quot;Architecture: $ARCH&quot;</pre><p>Set the Kubernetes version we will use:</p><pre>KUBE_VERSION=&quot;v1.32.0&quot;<br>ETCD_VERSION=&quot;v3.5.17&quot;<br>CONTAINERD_VERSION=&quot;2.0.1&quot;<br>RUNC_VERSION=&quot;v1.2.4&quot;<br>CNI_VERSION=&quot;v1.6.2&quot;</pre><p>Download the Kubernetes control plane binaries:</p><pre>echo &quot;Downloading kube-apiserver...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kube-apiserver&quot;<br><br>echo &quot;Downloading kube-controller-manager...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kube-controller-manager&quot;<br><br>echo &quot;Downloading kube-scheduler...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kube-scheduler&quot;</pre><p>Download kubectl, which we will use for cluster administration:</p><pre>echo &quot;Downloading kubectl...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kubectl&quot;</pre><p>Download the worker node binaries:</p><pre>echo &quot;Downloading kubelet...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kubelet&quot;<br><br>echo &quot;Downloading kube-proxy...&quot;<br>curl -sLO &quot;https://dl.k8s.io/${KUBE_VERSION}/bin/linux/${ARCH}/kube-proxy&quot;</pre><p>Download etcd, the distributed key-value store that Kubernetes uses for all cluster state:</p><pre>echo &quot;Downloading etcd...&quot;<br>curl -sLO &quot;https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz&quot;<br>tar -xzf etcd-${ETCD_VERSION}-linux-${ARCH}.tar.gz<br>mv etcd-${ETCD_VERSION}-linux-${ARCH}/etcd .<br>mv etcd-${ETCD_VERSION}-linux-${ARCH}/etcdctl .<br>rm -rf 
etcd-${ETCD_VERSION}-linux-${ARCH}*</pre><p>Download containerd, the container runtime:</p><pre>echo &quot;Downloading containerd...&quot;<br>curl -sLO &quot;https://github.com/containerd/containerd/releases/download/v${CONTAINERD_VERSION}/containerd-${CONTAINERD_VERSION}-linux-${ARCH}.tar.gz&quot;<br>mkdir containerd-temp<br>tar -xzf containerd-${CONTAINERD_VERSION}-linux-${ARCH}.tar.gz -C containerd-temp<br>mv containerd-temp/bin/* .<br>rm -rf containerd-temp containerd-${CONTAINERD_VERSION}-linux-${ARCH}.tar.gz</pre><p>Download runc, the low-level container runtime that containerd uses:</p><pre>echo &quot;Downloading runc...&quot;<br>curl -sLO &quot;https://github.com/opencontainers/runc/releases/download/${RUNC_VERSION}/runc.${ARCH}&quot;<br>mv runc.${ARCH} runc</pre><p>Download the CNI plugins for container networking:</p><pre>echo &quot;Downloading CNI plugins...&quot;<br>curl -sLO &quot;https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz&quot;<br>mkdir cni-plugins<br>tar -xzf cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz -C cni-plugins<br>rm cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz</pre><p>Download crictl, a command-line tool for interacting with CRI-compatible container runtimes:</p><pre>echo &quot;Downloading crictl...&quot;<br>CRICTL_VERSION=&quot;v1.32.0&quot;<br>curl -sLO &quot;https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz&quot;<br>tar -xzf crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz<br>rm crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz</pre><p>Make all binaries executable:</p><pre>chmod +x kube-apiserver kube-controller-manager kube-scheduler kubectl<br>chmod +x kubelet kube-proxy<br>chmod +x etcd etcdctl<br>chmod +x containerd containerd-shim-runc-v2 containerd-stress ctr<br>chmod +x runc crictl</pre><p>Verify we have everything:</p><pre>ls -la</pre><p>You should see all the binaries we 
downloaded. Let us also check that they run:</p><pre>./kubectl version --client<br>./etcdctl version</pre><p>Both should print version information without errors.</p><p>Let me briefly explain what each of these components does, because understanding their roles is key to understanding Kubernetes as a whole.</p><ol><li>The kube-apiserver is the front door to your cluster. Every interaction with Kubernetes, whether from kubectl, the kubelet, or external systems, goes through the API server. It validates requests, authenticates users, and stores the resulting state in etcd.</li><li>The kube-controller-manager runs a collection of controllers, each responsible for maintaining a particular aspect of cluster state. The deployment controller watches for deployment changes and creates or updates replica sets. The replica set controller ensures the right number of pods are running. The node controller monitors node health. There are many more, but they all follow the same pattern: watch for changes, compare current state to desired state, and take action to reconcile differences.</li><li>The kube-scheduler watches for newly created pods that have no node assigned and selects a node for them to run on. It considers resource requirements, affinity rules, taints and tolerations, and many other factors when making scheduling decisions.</li><li>The kubelet is the agent that runs on every worker node. It receives pod specifications from the API server and ensures the containers described in those specs are running and healthy. It works with the container runtime to actually start and stop containers.</li><li>The kube-proxy runs on every node and maintains network rules that allow communication to pods from inside and outside the cluster. When you create a Service, kube-proxy sets up the rules that route traffic to the right pods.</li><li>etcd is a distributed key-value store that holds all cluster state. 
Every object you create in Kubernetes, every pod, every service, every secret, is stored in etcd. It uses the Raft consensus algorithm to maintain consistency across multiple nodes, though in our setup we will run just a single etcd instance.</li><li>containerd is a container runtime that manages the complete container lifecycle: pulling images, creating containers, attaching storage, setting up networking, and so on. It implements the Container Runtime Interface (CRI) that Kubernetes uses to interact with container runtimes.</li><li>runc is a lower-level tool that actually creates and runs containers according to the OCI (Open Container Initiative) specification. containerd uses runc under the hood.</li><li>The CNI plugins handle container networking. When the kubelet asks the container runtime to create a pod, the runtime calls a CNI plugin to set up the network namespace, assign an IP address, and configure routing.</li></ol><p>At this point, your lab environment is fully set up. You have four virtual machines that can communicate with each other, SSH keys configured for password-less access, and all the Kubernetes binaries downloaded and ready to deploy.</p><p>Let us do a final verification before we move on. Exit from the jumpbox if you are still connected, then verify you can reach it:</p><pre>exit<br>vagrant ssh jumpbox -c &quot;echo &#39;Environment ready&#39;&quot;</pre><p>You should see “Environment ready” printed.</p><h3>The Second Part</h3><h3>Building a Kubernetes Cluster From Scratch: Setting Up etcd and Control Plane</h3><p>In Part 1, we created our lab environment with four virtual machines and downloaded all the Kubernetes binaries. Now comes the real work: setting up the certificate infrastructure and bootstrapping the control plane.</p><p>This part is long and dense. We will generate nine certificates with their private keys, create configuration files for every component, and bring up etcd and the three control plane services. Take your time working through this. 
If something does not work, go back and check the previous steps carefully. Certificate issues are the number one cause of problems during Kubernetes installation, and a single typo can cause hours of debugging.</p><p>Before we start generating certificates, let me explain why Kubernetes needs so many of them.</p><p>Kubernetes uses TLS certificates for two purposes: encryption and authentication. Every connection between components is encrypted using TLS, which prevents eavesdropping. But more importantly, certificates prove identity. When the kubelet connects to the API server, it presents a certificate that says “I am the kubelet for node-0.” The API server checks that this certificate was signed by a trusted authority before accepting the connection.</p><p>This is called mutual TLS, or mTLS. In regular TLS, like when you browse a website, only the server proves its identity. In mTLS, both sides present certificates and verify each other. This is critical for security in a distributed system like Kubernetes where many components communicate over the network.</p><p>Each certificate contains a Common Name (CN) and optionally an Organization (O) field. Kubernetes uses these fields for identity. The CN becomes the username, and the O becomes the group membership. For example, a certificate with CN=system:kube-scheduler tells the API server that requests using this certificate come from the scheduler component.</p><p>Here is what we need to create.</p><ol><li>A Certificate Authority (CA) that signs all other certificates. Every component trusts certificates signed by this CA.</li><li>An admin client certificate for cluster administration. This is what kubectl will use.</li><li>A kubelet client certificate for each worker node. These identify the kubelets to the API server.</li><li>Client certificates for kube-controller-manager, kube-scheduler, and kube-proxy. 
Each component authenticates to the API server with its own certificate.</li><li>A server certificate for the API server. This proves to clients that they are talking to the real API server.</li><li>A key pair for signing service account tokens. This is not a certificate per se, but the API server uses it to sign and verify tokens.</li></ol><p>First, connect to the jumpbox where we will do all our work:</p><pre>vagrant ssh jumpbox</pre><p>Create a directory for our certificates and configuration files:</p><pre>mkdir -p ~/certs ~/configs ~/units<br>cd ~/certs</pre><p>We will generate all certificates in the certs directory, configuration files in configs, and systemd unit files in units.</p><p>The Certificate Authority is the root of trust for our cluster. Every other certificate will be signed by the CA, and every component will be configured to trust certificates signed by it.</p><p>First, create the CA configuration file:</p><pre>cat &gt; ca-config.json &lt;&lt; EOF<br>{<br>  &quot;signing&quot;: {<br>    &quot;default&quot;: {<br>      &quot;expiry&quot;: &quot;8760h&quot;<br>    },<br>    &quot;profiles&quot;: {<br>      &quot;kubernetes&quot;: {<br>        &quot;usages&quot;: [&quot;signing&quot;, &quot;key encipherment&quot;, &quot;server auth&quot;, &quot;client auth&quot;],<br>        &quot;expiry&quot;: &quot;8760h&quot;<br>      }<br>    }<br>  }<br>}<br>EOF</pre><p>This configuration defines a signing profile called “kubernetes” that we will use for all certificates. The expiry is set to 8760 hours, which is one year. 
In production you might want longer-lived CA certificates and shorter-lived component certificates, but for learning purposes one year is fine.</p><p>Now create the CA certificate signing request:</p><pre>cat &gt; ca-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;Kubernetes&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;Kubernetes&quot;,<br>      &quot;OU&quot;: &quot;CA&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><p>The CN (Common Name) is “Kubernetes” and the O (Organization) is also “Kubernetes”. The location fields (C, L, ST) do not affect how Kubernetes treats the certificate; they are descriptive metadata in the subject, filled in here by convention.</p><p>We need the cfssl tools to generate certificates. Install them:</p><pre>curl -sLO https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_amd64<br>curl -sLO https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_amd64<br>chmod +x cfssl_1.6.5_linux_amd64 cfssljson_1.6.5_linux_amd64<br>sudo mv cfssl_1.6.5_linux_amd64 /usr/local/bin/cfssl<br>sudo mv cfssljson_1.6.5_linux_amd64 /usr/local/bin/cfssljson</pre><p>If you are on Apple Silicon and your VMs are ARM-based, use the arm64 versions instead:</p><pre>curl -sLO https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_arm64<br>curl -sLO https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_arm64<br>chmod +x cfssl_1.6.5_linux_arm64 cfssljson_1.6.5_linux_arm64<br>sudo mv cfssl_1.6.5_linux_arm64 /usr/local/bin/cfssl<br>sudo mv cfssljson_1.6.5_linux_arm64 
/usr/local/bin/cfssljson</pre><p>Verify cfssl is working:</p><pre>cfssl version</pre><p>Now generate the CA certificate and private key:</p><pre>cfssl gencert -initca ca-csr.json | cfssljson -bare ca</pre><p>This creates three files: ca.pem (the certificate), ca-key.pem (the private key), and ca.csr (the certificate signing request, which we do not need). Let us verify:</p><pre>ls -la ca*</pre><p>You should see ca-key.pem, ca.pem, and ca.csr along with the JSON files we created.</p><p>You can examine the CA certificate:</p><pre>openssl x509 -in ca.pem -text -noout | head -20</pre><p>You will see the certificate details including the subject (CN=Kubernetes) and that it is a CA certificate.</p><p>The admin certificate is what we will use with kubectl to manage the cluster. It needs to be part of the system:masters group, which has full cluster admin privileges.</p><pre>cat &gt; admin-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;admin&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:masters&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><p>Notice that the Organization (O) is set to “system:masters”. This is a special group in Kubernetes that has unrestricted access to the cluster. The CN is “admin”, which will be the username.</p><p>Generate the certificate:</p><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -profile=kubernetes \<br>  admin-csr.json | cfssljson -bare admin</pre><p>This creates admin.pem and admin-key.pem.</p><p>Each kubelet needs its own certificate. The certificate identifies which node the kubelet is running on. 
Kubernetes uses a specific naming convention: the CN must be system:node:&lt;nodename&gt; and the O must be system:nodes.</p><p>For node-0:</p><pre>cat &gt; node-0-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;system:node:node-0&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:nodes&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -hostname=node-0,node-0.kubernetes.local,192.168.10.101 \<br>  -profile=kubernetes \<br>  node-0-csr.json | cfssljson -bare node-0</pre><p>The hostname flag adds Subject Alternative Names (SANs) to the certificate. These are the names and IP addresses that are valid for this certificate. 
The kubelet certificate needs the node’s hostname and IP address.</p><p>For node-1:</p><pre>cat &gt; node-1-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;system:node:node-1&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:nodes&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -hostname=node-1,node-1.kubernetes.local,192.168.10.102 \<br>  -profile=kubernetes \<br>  node-1-csr.json | cfssljson -bare node-1</pre><p>The kube-controller-manager authenticates to the API server using its own certificate:</p><pre>cat &gt; kube-controller-manager-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;system:kube-controller-manager&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:kube-controller-manager&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -profile=kubernetes \<br>  kube-controller-manager-csr.json | cfssljson -bare kube-controller-manager</pre><p>Generate Kube-Proxy Client Certificate:</p><pre>cat &gt; kube-proxy-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;system:kube-proxy&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      
&quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:node-proxier&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -profile=kubernetes \<br>  kube-proxy-csr.json | cfssljson -bare kube-proxy</pre><p>Generate Scheduler Client Certificate:</p><pre>cat &gt; kube-scheduler-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;system:kube-scheduler&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;system:kube-scheduler&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -profile=kubernetes \<br>  kube-scheduler-csr.json | cfssljson -bare kube-scheduler</pre><p>The API server certificate is special because it needs many Subject Alternative Names. Clients connect to the API server using various names and IP addresses, and the certificate must be valid for all of them.</p><pre>cat &gt; kubernetes-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;kubernetes&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;Kubernetes&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><p>Now we need to specify all the SANs. 
The API server needs to be reachable by:</p><ul><li>The hostname “server” and “server.kubernetes.local”</li><li>The IP address 192.168.10.100</li><li>The internal cluster IP 10.32.0.1 (the first IP in the service CIDR)</li><li>The localhost addresses for internal communication</li><li>The special name “kubernetes” and its variations.</li></ul><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -hostname=10.32.0.1,192.168.10.100,server,server.kubernetes.local,127.0.0.1,localhost,kubernetes,kubernetes.default,kubernetes.default.svc,kubernetes.default.svc.cluster,kubernetes.default.svc.cluster.local \<br>  -profile=kubernetes \<br>  kubernetes-csr.json | cfssljson -bare kubernetes</pre><p>You can verify the SANs were included:</p><pre>openssl x509 -in kubernetes.pem -text -noout | grep -A1 &quot;Subject Alternative Name&quot;</pre><p>Service accounts use a different authentication mechanism. Instead of certificates, they use tokens signed by a key pair. 
The API server signs tokens with the private key, and components verify tokens using the public key.</p><pre>cat &gt; service-account-csr.json &lt;&lt; EOF<br>{<br>  &quot;CN&quot;: &quot;service-accounts&quot;,<br>  &quot;key&quot;: {<br>    &quot;algo&quot;: &quot;rsa&quot;,<br>    &quot;size&quot;: 2048<br>  },<br>  &quot;names&quot;: [<br>    {<br>      &quot;C&quot;: &quot;US&quot;,<br>      &quot;L&quot;: &quot;Portland&quot;,<br>      &quot;O&quot;: &quot;Kubernetes&quot;,<br>      &quot;OU&quot;: &quot;Kubernetes The Hard Way&quot;,<br>      &quot;ST&quot;: &quot;Oregon&quot;<br>    }<br>  ]<br>}<br>EOF</pre><pre>cfssl gencert \<br>  -ca=ca.pem \<br>  -ca-key=ca-key.pem \<br>  -config=ca-config.json \<br>  -profile=kubernetes \<br>  service-account-csr.json | cfssljson -bare service-account</pre><p>Let us make sure we generated everything:</p><pre>ls -la *.pem</pre><p>You should see these files:</p><pre>admin-key.pem<br>admin.pem<br>ca-key.pem<br>ca.pem<br>kube-controller-manager-key.pem<br>kube-controller-manager.pem<br>kube-proxy-key.pem<br>kube-proxy.pem<br>kube-scheduler-key.pem<br>kube-scheduler.pem<br>kubernetes-key.pem<br>kubernetes.pem<br>node-0-key.pem<br>node-0.pem<br>node-1-key.pem<br>node-1.pem<br>service-account-key.pem<br>service-account.pem</pre><p>That is 18 .pem files total (9 certificates and 9 private keys).</p><p>Now we need to copy the appropriate certificates to each machine. Each machine only gets the certificates it needs.</p><p>Copy certificates to the server (control plane):</p><pre>scp ca.pem ca-key.pem kubernetes.pem kubernetes-key.pem \<br>  service-account.pem service-account-key.pem \<br>  root@server:~/</pre><p>Copy certificates to node-0:</p><pre>scp ca.pem node-0.pem node-0-key.pem root@node-0:~/</pre><p>Copy certificates to node-1:</p><pre>scp ca.pem node-1.pem node-1-key.pem root@node-1:~/</pre><p>Notice that worker nodes only get the CA certificate (to verify other certificates) and their own node certificate. 
They do not get any private keys except their own.</p><p>Kubernetes components use kubeconfig files to locate and authenticate to the API server. A kubeconfig file contains the cluster’s CA certificate, the API server address, and client credentials.</p><p>We need to generate kubeconfig files for kubelet (one per node), kube-proxy, kube-controller-manager, kube-scheduler, and admin.</p><p>Let us move to the configs directory:</p><pre>cd ~/configs</pre><p>First, copy kubectl to a location where we can use it:</p><pre>sudo cp ~/downloads/kubectl /usr/local/bin/<br>sudo chmod +x /usr/local/bin/kubectl</pre><p>Now let us set a variable we will use repeatedly:</p><pre>KUBERNETES_API_SERVER=&quot;https://192.168.10.100:6443&quot;</pre><p>In the kubectl commands, certificate paths are written with $HOME rather than ~, because bash does not expand a tilde that appears after --flag=.</p><p>Generate Kubelet Kubeconfig Files:</p><p>For node-0:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=${KUBERNETES_API_SERVER} \<br>  --kubeconfig=node-0.kubeconfig</pre><pre>kubectl config set-credentials system:node:node-0 \<br>  --client-certificate=$HOME/certs/node-0.pem \<br>  --client-key=$HOME/certs/node-0-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=node-0.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=system:node:node-0 \<br>  --kubeconfig=node-0.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=node-0.kubeconfig</pre><p>For node-1:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=${KUBERNETES_API_SERVER} \<br>  --kubeconfig=node-1.kubeconfig</pre><pre>kubectl config set-credentials system:node:node-1 \<br>  --client-certificate=$HOME/certs/node-1.pem \<br>  --client-key=$HOME/certs/node-1-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=node-1.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  
--user=system:node:node-1 \<br>  --kubeconfig=node-1.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=node-1.kubeconfig</pre><p>Generate Kube-Proxy Kubeconfig:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=${KUBERNETES_API_SERVER} \<br>  --kubeconfig=kube-proxy.kubeconfig</pre><pre>kubectl config set-credentials system:kube-proxy \<br>  --client-certificate=$HOME/certs/kube-proxy.pem \<br>  --client-key=$HOME/certs/kube-proxy-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=kube-proxy.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=system:kube-proxy \<br>  --kubeconfig=kube-proxy.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=kube-proxy.kubeconfig</pre><p>Generate Controller Manager Kubeconfig:</p><p>The controller manager runs on the same host as the API server, so it can use localhost.</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=https://127.0.0.1:6443 \<br>  --kubeconfig=kube-controller-manager.kubeconfig</pre><pre>kubectl config set-credentials system:kube-controller-manager \<br>  --client-certificate=$HOME/certs/kube-controller-manager.pem \<br>  --client-key=$HOME/certs/kube-controller-manager-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=kube-controller-manager.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=system:kube-controller-manager \<br>  --kubeconfig=kube-controller-manager.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=kube-controller-manager.kubeconfig</pre><p>Generate Scheduler Kubeconfig:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=https://127.0.0.1:6443 \<br>  
--kubeconfig=kube-scheduler.kubeconfig</pre><pre>kubectl config set-credentials system:kube-scheduler \<br>  --client-certificate=$HOME/certs/kube-scheduler.pem \<br>  --client-key=$HOME/certs/kube-scheduler-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=kube-scheduler.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=system:kube-scheduler \<br>  --kubeconfig=kube-scheduler.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=kube-scheduler.kubeconfig</pre><p>Generate Admin Kubeconfig:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=https://127.0.0.1:6443 \<br>  --kubeconfig=admin.kubeconfig</pre><pre>kubectl config set-credentials admin \<br>  --client-certificate=$HOME/certs/admin.pem \<br>  --client-key=$HOME/certs/admin-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=admin.kubeconfig</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=admin \<br>  --kubeconfig=admin.kubeconfig</pre><pre>kubectl config use-context default --kubeconfig=admin.kubeconfig</pre><p>Distribute Kubeconfig Files. Copy the appropriate kubeconfig files to each machine.</p><p>To the server:</p><pre>scp admin.kubeconfig kube-controller-manager.kubeconfig kube-scheduler.kubeconfig root@server:~/</pre><p>To node-0:</p><pre>scp node-0.kubeconfig kube-proxy.kubeconfig root@node-0:~/</pre><p>To node-1:</p><pre>scp node-1.kubeconfig kube-proxy.kubeconfig root@node-1:~/</pre><p>Generate the Data Encryption Config: Kubernetes can encrypt secret data at rest in etcd. This means that even if someone gains access to the etcd data files, they cannot read your secrets without the encryption key.</p><p>First, generate a random encryption key:</p><pre>ENCRYPTION_KEY=$(head -c 32 /dev/urandom | base64)<br>echo $ENCRYPTION_KEY</pre><p>Save this key somewhere safe. 
You will need it if you ever have to recover your cluster.</p><p>Now create the encryption config:</p><pre>cat &gt; encryption-config.yaml &lt;&lt; EOF<br>kind: EncryptionConfiguration<br>apiVersion: apiserver.config.k8s.io/v1<br>resources:<br>  - resources:<br>      - secrets<br>    providers:<br>      - aescbc:<br>          keys:<br>            - name: key1<br>              secret: ${ENCRYPTION_KEY}<br>      - identity: {}<br>EOF</pre><p>This configuration tells Kubernetes to encrypt secrets using AES-CBC with the key we generated. The identity provider at the end is a fallback that allows reading unencrypted secrets (for backward compatibility during migration).</p><p>Copy the encryption config to the server:</p><pre>scp encryption-config.yaml root@server:~/</pre><p>Now we begin the actual installation. etcd is the first component we set up because everything else depends on it. etcd stores all cluster state, including pod definitions, service configurations, secrets, and more.</p><p>SSH into the server:</p><pre>ssh root@server</pre><p>Create the necessary directories:</p><pre>mkdir -p /etc/etcd /var/lib/etcd<br>chmod 700 /var/lib/etcd</pre><p>Copy the certificates into place:</p><pre>cp ca.pem kubernetes.pem kubernetes-key.pem /etc/etcd/</pre><p>Next, install the etcd binaries. They are still on the jumpbox, so exit back to it first:</p><pre>exit</pre><p>From jumpbox, copy the binaries:</p><pre>scp ~/downloads/etcd ~/downloads/etcdctl root@server:~/</pre><p>SSH back to server:</p><pre>ssh root@server</pre><p>Install the binaries:</p><pre>mv etcd etcdctl /usr/local/bin/</pre><p>Verify:</p><pre>etcd --version<br>etcdctl version</pre><p>Now create the systemd unit file for etcd. We are going to run etcd with HTTP for simplicity in this tutorial. 
In production, you would use HTTPS with client certificate authentication, but that adds complexity that can obscure the core concepts:</p><pre>cat &gt; /etc/systemd/system/etcd.service &lt;&lt; EOF<br>[Unit]<br>Description=etcd<br>Documentation=https://github.com/etcd-io/etcd<br><br>[Service]<br>Type=notify<br>ExecStart=/usr/local/bin/etcd \\<br>  --name server \\<br>  --data-dir=/var/lib/etcd \\<br>  --listen-peer-urls http://127.0.0.1:2380 \\<br>  --listen-client-urls http://127.0.0.1:2379 \\<br>  --initial-advertise-peer-urls http://127.0.0.1:2380 \\<br>  --advertise-client-urls http://127.0.0.1:2379 \\<br>  --initial-cluster-token etcd-cluster-0 \\<br>  --initial-cluster server=http://127.0.0.1:2380 \\<br>  --initial-cluster-state new<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Let me explain the key options:</p><ol><li>The name flag gives this etcd member a unique name. In a multi-node cluster, each member needs a different name.</li><li>The data-dir is where etcd stores its data. We set restrictive permissions on this directory earlier.</li><li>The listen-peer-urls is where etcd listens for communication from other etcd members. We use 127.0.0.1 because we have only one member.</li><li>The listen-client-urls is where etcd listens for client connections (from the API server).</li><li>The initial-cluster defines all members of the cluster. Since we have only one member, it just lists our server.</li></ol><p>Start and enable etcd:</p><pre>systemctl daemon-reload<br>systemctl enable etcd<br>systemctl start etcd</pre><p>Check the status:</p><pre>systemctl status etcd --no-pager</pre><p>It should show “active (running)”. 
If it shows failed, check the logs:</p><pre>journalctl -u etcd --no-pager -n 50</pre><p>Verify etcd is working by listing members:</p><pre>etcdctl member list</pre><p>You should see one member. Check the cluster health:</p><pre>etcdctl endpoint health</pre><p>It should report as healthy.</p><p>The API server is the central hub of Kubernetes. Every other component talks to it, and it is the only component that talks directly to etcd.</p><p>First, create the directory for Kubernetes configuration:</p><pre>mkdir -p /etc/kubernetes/config /var/lib/kubernetes</pre><p>Move the certificates, encryption config, and kubeconfigs into place:</p><pre>mv ca.pem ca-key.pem kubernetes.pem kubernetes-key.pem \<br>  service-account.pem service-account-key.pem \<br>  encryption-config.yaml /var/lib/kubernetes/</pre><pre>mv kube-controller-manager.kubeconfig /var/lib/kubernetes/<br>mv kube-scheduler.kubeconfig /var/lib/kubernetes/<br>mv admin.kubeconfig /var/lib/kubernetes/</pre><p>Copy the API server binary from jumpbox. Exit to jumpbox first.</p><p>From jumpbox:</p><pre>scp ~/downloads/kube-apiserver ~/downloads/kube-controller-manager \<br>  ~/downloads/kube-scheduler ~/downloads/kubectl root@server:~/</pre><p>SSH back:</p><pre>ssh root@server</pre><p>Install the binaries:</p><pre>mv kube-apiserver kube-controller-manager kube-scheduler kubectl /usr/local/bin/<br>chmod +x /usr/local/bin/kube-apiserver /usr/local/bin/kube-controller-manager \<br>  /usr/local/bin/kube-scheduler /usr/local/bin/kubectl</pre><p>Now create the API server systemd unit file. 
This is where things get interesting because there are a lot of flags.</p><pre>cat &gt; /etc/systemd/system/kube-apiserver.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes API Server<br>Documentation=https://github.com/kubernetes/kubernetes<br><br>[Service]<br>ExecStart=/usr/local/bin/kube-apiserver \\<br>  --advertise-address=192.168.10.100 \\<br>  --allow-privileged=true \\<br>  --audit-log-maxage=30 \\<br>  --audit-log-maxbackup=3 \\<br>  --audit-log-maxsize=100 \\<br>  --audit-log-path=/var/log/audit.log \\<br>  --authorization-mode=Node,RBAC \\<br>  --bind-address=0.0.0.0 \\<br>  --client-ca-file=/var/lib/kubernetes/ca.pem \\<br>  --enable-admission-plugins=NamespaceLifecycle,NodeRestriction,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota \\<br>  --etcd-servers=http://127.0.0.1:2379 \\<br>  --event-ttl=1h \\<br>  --encryption-provider-config=/var/lib/kubernetes/encryption-config.yaml \\<br>  --kubelet-certificate-authority=/var/lib/kubernetes/ca.pem \\<br>  --kubelet-client-certificate=/var/lib/kubernetes/kubernetes.pem \\<br>  --kubelet-client-key=/var/lib/kubernetes/kubernetes-key.pem \\<br>  --runtime-config=api/all=true \\<br>  --service-account-key-file=/var/lib/kubernetes/service-account.pem \\<br>  --service-account-signing-key-file=/var/lib/kubernetes/service-account-key.pem \\<br>  --service-account-issuer=https://192.168.10.100:6443 \\<br>  --service-cluster-ip-range=10.32.0.0/24 \\<br>  --tls-cert-file=/var/lib/kubernetes/kubernetes.pem \\<br>  --tls-private-key-file=/var/lib/kubernetes/kubernetes-key.pem \\<br>  --v=2<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>That is a lot of flags. Let me explain the important ones:</p><ol><li>The advertise-address is the IP that will be advertised to other cluster members. This is where the API server is reachable.</li><li>The authorization-mode is set to Node,RBAC. 
Node authorization allows kubelets to access resources they need. RBAC enables role-based access control.</li><li>The client-ca-file specifies the CA used to verify client certificates. Any client certificate signed by this CA will be accepted for authentication.</li><li>The enable-admission-plugins lists admission controllers that process requests. NamespaceLifecycle ensures you cannot create objects in non-existent namespaces. NodeRestriction limits what kubelets can modify. ServiceAccount ensures pods get service account tokens automatically.</li><li>The etcd-servers tells the API server where to find etcd.</li><li>The encryption-provider-config points to our encryption configuration for secrets.</li><li>The kubelet-certificate-authority, kubelet-client-certificate, and kubelet-client-key are used when the API server connects to kubelets (for logs, exec, etc.).</li><li>The service-account-key-file holds the public key used to verify service account tokens, and the service-account-signing-key-file holds the private key used to sign them.</li><li>The service-cluster-ip-range defines the IP range for ClusterIP services. We use 10.32.0.0/24, which gives us 254 possible service IPs.</li><li>The tls-cert-file and tls-private-key-file are the API server’s own certificate and key.</li></ol><p>Let us set up the Controller Manager. 
Create the systemd unit file:</p><pre>cat &gt; /etc/systemd/system/kube-controller-manager.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Controller Manager<br>Documentation=https://github.com/kubernetes/kubernetes<br><br>[Service]<br>ExecStart=/usr/local/bin/kube-controller-manager \\<br>  --bind-address=0.0.0.0 \\<br>  --cluster-cidr=10.200.0.0/16 \\<br>  --cluster-name=kubernetes \\<br>  --cluster-signing-cert-file=/var/lib/kubernetes/ca.pem \\<br>  --cluster-signing-key-file=/var/lib/kubernetes/ca-key.pem \\<br>  --kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \\<br>  --leader-elect=true \\<br>  --root-ca-file=/var/lib/kubernetes/ca.pem \\<br>  --service-account-private-key-file=/var/lib/kubernetes/service-account-key.pem \\<br>  --service-cluster-ip-range=10.32.0.0/24 \\<br>  --use-service-account-credentials=true \\<br>  --v=2<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>The cluster-cidr is the range used for pod IPs across the entire cluster. We use 10.200.0.0/16, which will be subdivided into smaller ranges for each node.</p><p>The cluster-signing-cert-file and cluster-signing-key-file let the controller manager sign certificates for kubelet certificate rotation.</p><p>The leader-elect flag enables leader election. In a multi-master setup, only one controller manager is active at a time. 
With a single master, this still works fine.</p><p>Next, let us set up the scheduler.</p><p>First, create the scheduler configuration file:</p><pre>cat &gt; /etc/kubernetes/config/kube-scheduler.yaml &lt;&lt; EOF<br>apiVersion: kubescheduler.config.k8s.io/v1<br>kind: KubeSchedulerConfiguration<br>clientConnection:<br>  kubeconfig: &quot;/var/lib/kubernetes/kube-scheduler.kubeconfig&quot;<br>leaderElection:<br>  leaderElect: true<br>EOF</pre><p>Now create the systemd unit file:</p><pre>cat &gt; /etc/systemd/system/kube-scheduler.service &lt;&lt; EOF<br>[Unit]<br>Description=Kubernetes Scheduler<br>Documentation=https://github.com/kubernetes/kubernetes<br><br>[Service]<br>ExecStart=/usr/local/bin/kube-scheduler \\<br>  --config=/etc/kubernetes/config/kube-scheduler.yaml \\<br>  --v=2<br>Restart=on-failure<br>RestartSec=5<br><br>[Install]<br>WantedBy=multi-user.target<br>EOF</pre><p>Now let us start all three services:</p><pre>systemctl daemon-reload<br>systemctl enable kube-apiserver kube-controller-manager kube-scheduler<br>systemctl start kube-apiserver kube-controller-manager kube-scheduler</pre><p>Wait a few seconds for them to start, then check their status:</p><pre>systemctl status kube-apiserver --no-pager<br>systemctl status kube-controller-manager --no-pager<br>systemctl status kube-scheduler --no-pager</pre><p>All three should show “active (running)”. If any show failed, check the logs:</p><pre>journalctl -u kube-apiserver --no-pager -n 50<br>journalctl -u kube-controller-manager --no-pager -n 50<br>journalctl -u kube-scheduler --no-pager -n 50</pre><p>Let us verify everything is working. 
First, check that the API server is responding:</p><pre>kubectl cluster-info --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>You should see:</p><pre>Kubernetes control plane is running at https://127.0.0.1:6443</pre><p>Check the component statuses:</p><pre>kubectl get componentstatuses --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><blockquote>Note: The componentstatuses endpoint is deprecated in newer Kubernetes versions and may show warnings or incomplete information. A better way to verify is to check if services are responding:</blockquote><pre>kubectl get --raw=&#39;/readyz?verbose&#39; --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>This should show many checks, all passing.</p><p>Check the default namespace and services:</p><pre>kubectl get namespaces --kubeconfig /var/lib/kubernetes/admin.kubeconfig<br>kubectl get services --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>You should see the default, kube-system, kube-public, and kube-node-lease namespaces. The kubernetes service in the default namespace should exist, pointing to 10.32.0.1.</p><p>When the API server needs to connect to kubelets (for logs, exec, port-forward, etc.), it authenticates using its kubernetes certificate. 
We need to create RBAC rules that allow this.</p><p>Create a ClusterRole with the necessary permissions:</p><pre>cat &gt; /tmp/kube-apiserver-to-kubelet.yaml &lt;&lt; EOF<br>apiVersion: rbac.authorization.k8s.io/v1<br>kind: ClusterRole<br>metadata:<br>  annotations:<br>    rbac.authorization.kubernetes.io/autoupdate: &quot;true&quot;<br>  labels:<br>    kubernetes.io/bootstrapping: rbac-defaults<br>  name: system:kube-apiserver-to-kubelet<br>rules:<br>  - apiGroups:<br>      - &quot;&quot;<br>    resources:<br>      - nodes/proxy<br>      - nodes/stats<br>      - nodes/log<br>      - nodes/spec<br>      - nodes/metrics<br>    verbs:<br>      - &quot;*&quot;<br>---<br>apiVersion: rbac.authorization.k8s.io/v1<br>kind: ClusterRoleBinding<br>metadata:<br>  name: system:kube-apiserver<br>roleRef:<br>  apiGroup: rbac.authorization.k8s.io<br>  kind: ClusterRole<br>  name: system:kube-apiserver-to-kubelet<br>subjects:<br>  - apiGroup: rbac.authorization.k8s.io<br>    kind: User<br>    name: kubernetes<br>EOF</pre><pre>kubectl apply -f /tmp/kube-apiserver-to-kubelet.yaml --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>This creates a ClusterRole that allows access to kubelet endpoints and binds it to the “kubernetes” user, which is the identity in the API server’s certificate.</p><p>Let us make sure secret encryption is working. Create a test secret:</p><pre>kubectl create secret generic test-secret --from-literal=mykey=mydata \<br>  --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>Now query etcd directly to see how it is stored:</p><pre>etcdctl get /registry/secrets/default/test-secret | hexdump -C | head -20</pre><p>You should see “k8s:enc:aescbc:v1:key1” at the beginning of the data, indicating it is encrypted. 
The actual secret value should not be visible in plaintext.</p><p>Clean up the test secret:</p><pre>kubectl delete secret test-secret --kubeconfig /var/lib/kubernetes/admin.kubeconfig</pre><p>The control plane is ready, but we cannot run any workloads yet. There are no worker nodes registered with the cluster. If you try to run a pod now, the scheduler will not find any nodes to place it on.</p><p>Before moving on, exit from the server and verify you can reach the API server from the jumpbox:</p><pre>exit</pre><p>From jumpbox, let us create a kubeconfig that points to the server’s external IP:</p><pre>kubectl config set-cluster kubernetes-the-hard-way \<br>  --certificate-authority=$HOME/certs/ca.pem \<br>  --embed-certs=true \<br>  --server=https://192.168.10.100:6443 \<br>  --kubeconfig=$HOME/.kube/config</pre><pre>kubectl config set-credentials admin \<br>  --client-certificate=$HOME/certs/admin.pem \<br>  --client-key=$HOME/certs/admin-key.pem \<br>  --embed-certs=true \<br>  --kubeconfig=$HOME/.kube/config</pre><pre>kubectl config set-context default \<br>  --cluster=kubernetes-the-hard-way \<br>  --user=admin \<br>  --kubeconfig=$HOME/.kube/config</pre><pre>kubectl config use-context default --kubeconfig=$HOME/.kube/config</pre><p>Now test it:</p><pre>kubectl get nodes</pre><p>You should see “No resources found” because there are no worker nodes yet. That is expected.</p><pre>kubectl get namespaces</pre><p>You should see the default namespaces listed.</p><p>In Part 3, we will set up the worker nodes. We will install containerd as the container runtime, configure kubelet and kube-proxy, and set up pod networking so containers on different nodes can communicate with each other. By the end of Part 3, you will be able to run actual workloads on your cluster.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Handling POSIX Signals in Rust]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://sigridjin.medium.com/handling-posix-signals-in-rust-fac42c33e5b6?source=rss-9e186deee13f------2"><img src="https://cdn-images-1.medium.com/max/2600/1*PXCH3vKrr5-K3_eZ0J2zPg.jpeg" width="2816"></a></p><p class="medium-feed-snippet">What Are Signals?</p><p class="medium-feed-link"><a href="https://sigridjin.medium.com/handling-posix-signals-in-rust-fac42c33e5b6?source=rss-9e186deee13f------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://sigridjin.medium.com/handling-posix-signals-in-rust-fac42c33e5b6?source=rss-9e186deee13f------2</link>
            <guid isPermaLink="false">https://medium.com/p/fac42c33e5b6</guid>
            <dc:creator><![CDATA[Sigrid Jin]]></dc:creator>
            <pubDate>Tue, 23 Dec 2025 06:42:12 GMT</pubDate>
            <atom:updated>2025-12-23T06:42:12.245Z</atom:updated>
        </item>
    </channel>
</rss>