GrepMyMind - Medium

Argo CD’s ApplicationSet: Dynamic Deployments Across The Fleet

Mike Tougeron — Mon, 11 Mar 2024 18:11:35 GMT

A series of square cubes representing a deployment across multiple Kubernetes clusters. Generated with https://firefly.adobe.com/

Argo CD provides a wide variety of methods to deploy your application(s) to Kubernetes clusters. An Application defines the source of the deployment and the single cluster you want to deploy it to, while an ApplicationSet allows you to deploy an Application across multiple clusters. Let’s start simple and break down each piece until we have a full-fledged, dynamic deployment system.

We’ll start with the following Application that deploys the k8s-pvc-tagger using Helm.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  destination:
    namespace: k8s-pvc-tagger 
    server: https://my-clusters-cluster.example.com
  project: default
  source:
    chart: k8s-pvc-tagger
    repoURL: https://mtougeron.github.io/helm-charts/
    targetRevision: 2.0.8
    helm:
      releaseName: k8s-pvc-tagger

This is great if you’re deploying to a single cluster but what about if you have more than one? You could create multiple Application resources but that would be a pain and isn’t scalable. This is where Argo CD’s ApplicationSet comes into play.

An ApplicationSet (docs) allows you to automatically generate a list of Application resources to deploy in a templated fashion. For example, you can use an ApplicationSet to deploy the same Application to multiple clusters or use it to deploy your application based on pull requests to a repository. In this post, I’ll show you how to use the ApplicationSet generators (docs) to do several powerful deployments to ease the way you use Argo CD.

Before we get into specific examples, it’s important to understand how an ApplicationSet works. You can use a variety of generators to determine what the deployment should look like. These generators can be based on resources like files or directories in a repo, open pull requests, or labels on the Kubernetes clusters that are registered in Argo CD. Generators can be combined with the matrix or merge generators to build more complex selection criteria. In addition, the git generator can read values out of the files it matches and make them available to the template. I’ll go through these different generators in the sections below.

In my opinion, the simplest of generators is the clusters generator (docs) that allows you to deploy to multiple clusters based on the cluster labels assigned when a Kubernetes cluster was added to Argo CD. This generator filters the available clusters based on those labels and creates an Application resource for each cluster that is found. Here’s an example of that approach.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: 'k8s-pvc-tagger-{{.name}}'
    spec:
      destination:
        namespace: k8s-pvc-tagger 
        server: '{{.server}}'
      project: default
      source:
        chart: k8s-pvc-tagger
        repoURL: https://mtougeron.github.io/helm-charts/
        targetRevision: 2.0.8
        helm:
          releaseName: k8s-pvc-tagger

There are two things to note here. First, it’s important to use the {{.name}} variable (or similar) in the Application name so that each generated Application resource is unique. Otherwise you will have conflicts and that’s never a good thing. Second, you’ll see the {{.server}} variable that defines the server URL of the cluster the Application is being deployed to. The rest looks like it did with a standard Application.
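For context, the labels that the selector matches live on the cluster Secret that Argo CD stores for each registered cluster. Here is a minimal sketch of such a Secret; the cluster name, server URL, and credentials are hypothetical placeholders, not values from my environment.

```yaml
# Hypothetical cluster Secret; name, server, and credentials are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-us-east
  namespace: argocd
  labels:
    # Marks the Secret as an Argo CD cluster definition.
    argocd.argoproj.io/secret-type: cluster
    # The label the clusters generator selects on.
    environment: production
type: Opaque
stringData:
  name: prod-us-east                        # rendered as {{.name}}
  server: https://prod-us-east.example.com  # rendered as {{.server}}
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```

With a cluster like this registered, the template above would produce an Application named k8s-pvc-tagger-prod-us-east that targets https://prod-us-east.example.com.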

But what if you wanted to deploy different versions of k8s-pvc-tagger based on the environment? After all, it’s always good to test in non-prod first, right? An ApplicationSet allows for this as well. In this example, we’re defining that the stage environment should run version 2.0.8 while production is still running 2.0.7. We’re able to use the templating options to dynamically decide which version to run.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: stage
      values:
        version: 2.0.8
  - clusters:
      selector:
        matchLabels:
          environment: production
      values:
        version: 2.0.7
  template:
    metadata:
      name: 'k8s-pvc-tagger-{{.name}}'
    spec:
      destination:
        namespace: k8s-pvc-tagger 
        server: '{{.server}}'
      project: default
      source:
        chart: k8s-pvc-tagger
        repoURL: https://mtougeron.github.io/helm-charts/
        targetRevision: '{{.values.version}}'
        helm:
          releaseName: k8s-pvc-tagger

Let’s take this a step further though. What if we wanted dev to use a pre-release version, stage to run 2.0.8 and production to run 2.0.7? This gets a little more complicated because Helm with Argo CD doesn’t allow you to install a chart without defining a version. This means we need to get a bit tricky and toggle between installing the Helm chart from git or the chart repository.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: dev
      values:
        version: HEAD
  - clusters:
      selector:
        matchLabels:
          environment: stage
      values:
        version: 2.0.8
  - clusters:
      selector:
        matchLabels:
          environment: production
      values:
        version: 2.0.7
  template:
    metadata:
      name: 'k8s-pvc-tagger-{{.name}}'
    spec:
      destination:
        namespace: k8s-pvc-tagger 
        server: '{{.server}}'
      project: default
      source:
        chart: '{{if ne .values.version "HEAD"}}k8s-pvc-tagger{{end}}'
        path: '{{if eq .values.version "HEAD"}}charts/k8s-pvc-tagger{{end}}'
        repoURL: '{{if ne .values.version "HEAD"}}https://mtougeron.github.io/helm-charts/{{else}}https://github.com/mtougeron/k8s-pvc-tagger{{end}}'
        targetRevision: '{{.values.version}}'
        helm:
          releaseName: k8s-pvc-tagger

In that example, if the .values.version is HEAD we set an empty value for chart and instead set the path to the Helm chart in git. Similarly we toggle between the chart repository and the git repo in the repoURL field. This is handy for doing an automated pre-release to the dev clusters whenever main is updated but staggering the deployments to the stage and production environments.
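To make the toggle concrete, here is roughly how the source section of that template would render for a dev cluster and for a stage cluster. This is an illustrative sketch of the rendered fields only, grouped under made-up dev and stage keys; it is not a manifest to apply.

```yaml
# Illustrative rendering only; the top-level keys are just labels for this sketch.
dev:     # values.version is HEAD, so the chart is read from the git repo
  source:
    chart: ''
    path: charts/k8s-pvc-tagger
    repoURL: https://github.com/mtougeron/k8s-pvc-tagger
    targetRevision: HEAD
stage:   # values.version is 2.0.8, so the chart comes from the chart repository
  source:
    chart: k8s-pvc-tagger
    path: ''
    repoURL: https://mtougeron.github.io/helm-charts/
    targetRevision: 2.0.8
```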

If we wanted to alter the values for the Helm chart based on the cluster’s environment, we can go one step further with the templating and set the valuesObject in the Application's template. In this example, which assumes each generator also sets an environment entry under values alongside version, we’ll run the dev environment with debug: true so that we can see more details in the logs. We’ll also adjust the amount of CPU requested because we run a larger cluster in production than we do in the other environments.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  ...
  template:
    spec:
      source:
        helm:
          valuesObject:
            debug: '{{if eq .values.environment "dev"}}true{{end}}'
            resources:
              requests:
                cpu: '{{if eq .values.environment "production"}}100m{{else}}50m{{end}}'
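The snippet above elides the generators, so here is one way the .values.environment parameter could be supplied. This is a sketch that assumes we add an environment entry to each clusters generator’s values; the key name is my assumption and any name works as long as the template uses the same one.

```yaml
spec:
  goTemplate: true
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: dev
      values:
        # Assumed key; exposed to the template as .values.environment
        environment: dev
        version: HEAD
  - clusters:
      selector:
        matchLabels:
          environment: production
      values:
        environment: production
        version: 2.0.7
```

Since the clusters generator also exposes the cluster’s labels, the template could instead compare against .metadata.labels.environment and skip the extra values entry.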

Following the idea to the next level, let’s run a version of a deployment based on a PR. This is helpful for testing changes before those changes are merged. In this scenario we’ll use the helm-guestbook chart that Argo CD provides.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: guestbook
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - pullRequest:
      github:
        owner: argoproj
        repo: argocd-example-apps
        labels:
        - ok-to-test
  template:
    metadata:
      name: 'guestbook-{{.branch_slug}}-{{.number}}'
      labels:
        branch: 'guestbook-{{.branch_slug}}'
    spec:
      destination:
        namespace: 'guestbook-{{.branch_slug}}-{{.number}}'
        server: https://kubernetes.default.svc
      project: default
      source:
        path: helm-guestbook
        repoURL: https://github.com/argoproj/argocd-example-apps
        targetRevision: '{{.head_sha}}'
        helm:
          releaseName: guestbook
          valuesObject:
            ingress:
              hosts:
              - 'guestbook-{{.branch_slug}}-{{.number}}.example.com'
      syncPolicy:
        syncOptions:
        - CreateNamespace=true

Breaking down the changes by section, you’ll see the generators section is now using pullRequest (docs). In our case, we’re using GitHub for the source but it supports other providers like GitLab and Bitbucket. The generator defines the repo that the PRs come from and only creates an Application for a PR if the PR has the label ok-to-test on it. This helps prevent tests from running unless you’ve authorized them.

In the next section you’ll see that it uses .branch_slug and .number to make the name unique per PR. You might have also noted that we added labels to the metadata so that we can filter in the Argo CD UI to all the Application resources created for a branch in the repo for guestbook. Most importantly, the targetRevision is set to the .head_sha so that it uses the code from the PR’s revision.

In the valuesObject we dynamically assign the hosts so that each PR has its own URL to test against. Other values can be customized as well so that the deployment for the PR best represents the changes being made.

Lastly, the spec.destination.namespace is unique per branch & PR as well. This allows for each PR to be deployed into its own Kubernetes Namespace for isolation. In order for this to work it also needs to have the CreateNamespace=true option set.
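As a rough illustration of the pieces described above, here is how the name, namespace, and revision fields might render for a hypothetical PR number 42 opened from a branch called feature/new-ui. The PR number and branch are made up, and I’m assuming the slug for that branch comes out as feature-new-ui.

```yaml
# Illustrative rendering for a hypothetical PR #42 from branch "feature/new-ui".
metadata:
  name: guestbook-feature-new-ui-42
spec:
  destination:
    # One namespace per PR, created on sync because of CreateNamespace=true
    namespace: guestbook-feature-new-ui-42
    server: https://kubernetes.default.svc
  source:
    path: helm-guestbook
    repoURL: https://github.com/argoproj/argocd-example-apps
    # Filled in from .head_sha, the head commit of the PR
    targetRevision: '<head commit SHA of the PR>'
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
```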

The merge generator is pretty cool IMO because it can allow for filtering the clusters found from the clusters generator based on the values found in the git generator. Let’s take an example where you have 100 clusters but for some reason you want to only install the k8s-pvc-tagger Helm chart into 10 of them. You could label each cluster with a flag that defines which clusters run that app. However, if you decided to add or remove it from a cluster you would have to change the labels on that cluster, which is generally a more operations-focused task. Wouldn’t it be easier to just drop a values file into a directory of a git repo and have it automatically installed? Or have a single file that defines which version of a Helm chart to install?

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - merge:
      mergeKeys:
      - name
      generators:
      - clusters:
          selector:
            matchLabels:
              argocd.argoproj.io/secret-type: cluster
      - git:
          repoURL: https://github.com/mtougeron/my-deploy-repo
          revision: HEAD
          files:
          - path: "clusters/*.yaml"
    selector:
      matchExpressions:
      - key: k8s-pvc-tagger
        operator: Exists
  template:
    metadata:
      name: 'k8s-pvc-tagger-{{.name}}'
    spec:
      destination:
        namespace: k8s-pvc-tagger 
        server: '{{.server}}'
      project: default
      source:
        chart: k8s-pvc-tagger
        repoURL: https://mtougeron.github.io/helm-charts/
        targetRevision: '{{index . "k8s-pvc-tagger"}}'
        helm:
          releaseName: k8s-pvc-tagger

In the clusters directory of the mtougeron/my-deploy-repo repository there is one YAML file per cluster. Each file contains the name of the cluster along with the charts to install on it and their versions.

name: my-cluster-name
k8s-pvc-tagger: 2.0.8
guestbook: HEAD
some-other-app: 1.2.3

Argo CD will first get the list of clusters that exist and merge that list with the files found in that directory. It will then filter that list to the files that have a variable called k8s-pvc-tagger. Lastly, it uses the value of that variable to set the targetRevision to install.
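To picture the result, here is a sketch of the parameter set the template might see for one cluster once the two generators have been merged on name. The server URL and the file name clusters/my-cluster-name.yaml are hypothetical, and the clusters generator contributes more parameters than shown here.

```yaml
# Illustrative merged parameters for one cluster; values are hypothetical.
name: my-cluster-name                          # merge key present in both generators
server: https://my-cluster-name.example.com    # from the clusters generator
k8s-pvc-tagger: 2.0.8                          # from clusters/my-cluster-name.yaml; satisfies the selector
guestbook: HEAD                                # from clusters/my-cluster-name.yaml
some-other-app: 1.2.3                          # from clusters/my-cluster-name.yaml
```

A cluster without a matching file keeps only the parameters from the clusters generator, so it has no k8s-pvc-tagger key and the selector drops it.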

While not specific to an ApplicationSet, a feature that I really like in Argo CD is the ability to use sources instead of source for an Application. This allows you to use more than one repository in your deployment. Why would you want this you ask? A common practice is to use an open source Helm chart but have your own configuration repository. Let’s say I had a configuration repository that contains my values file(s) for the Helm chart.

├── guestbook
├── k8s-pvc-tagger
│   ├── dev.yaml
│   ├── production.yaml
│   └── stage.yaml
└── some-other-app

Now I want to use these Helm values files when rendering the chart via Argo CD. I set up two sources (instead of using source): one for the Helm chart and one that references my-config-repo where the values file(s) live. The values files are stored in the values directory and broken down by chart. The ref field aliases the my-config-repo repository as $values so that the first source can pull files from it. Because $values resolves to the root of that repository, the valueFiles entry spells out the full path to the file.

spec:
  template:
    spec:
      sources:
      - repoURL: https://mtougeron.github.io/helm-charts/
        chart: k8s-pvc-tagger
        targetRevision: 2.0.8
        helm:
          releaseName: k8s-pvc-tagger
          valueFiles:
            - values.yaml
            - '$values/values/k8s-pvc-tagger/{{.metadata.labels.environment}}.yaml'
      - repoURL: https://github.com/mtougeron/my-config-repo
        targetRevision: HEAD
        ref: values

As you see in that example, it also dynamically points to the values file for the environment label set for the cluster in Argo CD.
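For illustration, the dev values file in the config repository could look something like the following. The contents are hypothetical and simply reuse the debug and CPU request settings from the earlier valuesObject example.

```yaml
# values/k8s-pvc-tagger/dev.yaml in my-config-repo (hypothetical contents)
debug: true
resources:
  requests:
    cpu: 50m
```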

When you put it all together, as seen below, you have a powerful way to dynamically filter and set the version of the charts you want to install on each cluster.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: k8s-pvc-tagger
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: dev
      values:
        version: HEAD
  - merge:
      mergeKeys:
      - name
      generators:
      - clusters:
          selector:
            matchLabels:
              argocd.argoproj.io/secret-type: cluster
              environment: stage
          values:
            # Any value other than HEAD: the chart version comes from the cluster's file in git
            version: pinned
      - git:
          repoURL: https://github.com/mtougeron/my-config-repo
          revision: HEAD
          files:
          - path: "clusters/*.yaml"
    selector:
      matchExpressions:
      - key: k8s-pvc-tagger
        operator: Exists
  - merge:
      mergeKeys:
      - name
      generators:
      - clusters:
          selector:
            matchLabels:
              argocd.argoproj.io/secret-type: cluster
              environment: production
          values:
            # Any value other than HEAD: the chart version comes from the cluster's file in git
            version: pinned
      - git:
          repoURL: https://github.com/mtougeron/my-config-repo
          revision: HEAD
          files:
          - path: "clusters/*.yaml"
    selector:
      matchExpressions:
      - key: k8s-pvc-tagger
        operator: Exists
  template:
    metadata:
      name: 'k8s-pvc-tagger-{{.name}}'
    spec:
      destination:
        namespace: k8s-pvc-tagger
        server: '{{.server}}'
      project: default
      sources:
      - repoURL: '{{if ne .values.version "HEAD"}}https://mtougeron.github.io/helm-charts/{{else}}https://github.com/mtougeron/k8s-pvc-tagger{{end}}'
        chart: '{{if ne .values.version "HEAD"}}k8s-pvc-tagger{{end}}'
        path: '{{if eq .values.version "HEAD"}}charts/k8s-pvc-tagger{{end}}'
        # dev tracks HEAD in git; stage and production use the version from the cluster's file
        targetRevision: '{{if eq .values.version "HEAD"}}HEAD{{else}}{{index . "k8s-pvc-tagger"}}{{end}}'
        helm:
          releaseName: k8s-pvc-tagger
          valueFiles:
            - values.yaml
            - '$values/values/k8s-pvc-tagger/{{.metadata.labels.environment}}.yaml'
      - repoURL: https://github.com/mtougeron/my-config-repo
        targetRevision: HEAD
        ref: values

Hopefully you’ve found these examples helpful and agree that using an ApplicationSet is a powerful way to do deployments. If you have any questions, I’m available on the CNCF slack and I’d be happy to provide more details. You can also watch some of my talks (GitOps Me Some of That! Managing Hundreds of Clusters with Argo CD or Hundreds of Clusters Sitting in a Tree with Argo CD) on the same subject as well.

Argo CD’s ApplicationSet: Dynamic Deployments Across The Fleet was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kubernetes clusters for everyone using vcluster

Mike Tougeron — Mon, 20 Dec 2021 18:37:46 GMT

I recently started playing around with a powerful Kubernetes tool called vcluster from Loft Labs. vcluster provides an easy way of creating virtual Kubernetes clusters inside of a regular cluster but scoped within a namespace. What’s really neat is that the resources created can still be restricted by the host cluster’s RBAC, quotas and other security policies. While I’ve only started to scratch the surface of what vcluster can do, I can already see some long-term, high-impact use-cases.

Custom Operators use-case

In my environment, tenant users are not allowed to create or modify the CustomResourceDefinitions (CRDs) that Custom Operators use (they can run the operator but not manage the CRDs). They have to go through a ticketing & deployment process, which adds overhead for the SRE team to review and delays for the development team. It’s unfortunately a lose-lose situation but the multi-tenancy restrictions and requirements must be respected at all times. It’s even more difficult to develop a new Custom Operator where the CRD hasn’t been fully designed yet. Tools like kind are useful for doing development on these operators but they don’t always let you test with the full application stack running alongside it.

But this is where vcluster can come to the rescue! There can be a cluster where the tenant team has a namespace, with appropriate RBAC permissions, from which they can launch a virtual Kubernetes cluster using vcluster. The team can then do their development using that virtualized cluster. Where things get interesting is if this is taken to the next level and the production application is run inside of the vcluster. Suddenly the development team can run their custom operator however they’d like and it would still respect the multi-tenant nature of the host cluster. For me running production workloads with vcluster is still a long ways off but the potential for it has me super excited. Even if running the entire application stack isn’t feasible, running the custom operator could be.

Development environment use-case

Another powerful option for vcluster is to manage developer workflows. Just as git allows you to easily toggle between branches, vcluster could allow you to toggle between virtual clusters per feature branch. Each feature branch could be fully deployed in isolation from the others. When the feature branch work is finished, the virtual cluster is removed and all resources are automatically cleaned up. Because the pods are actually created in the host cluster’s namespace, the host cluster’s namespace quotas still apply. This protects against the ever frustrating resource creep that can happen.

Install and run vcluster

Setting up a virtual cluster using vcluster is wickedly straight-forward. There’s a little bit of prep work from the cluster operator team and then the tenants can manage the vcluster themselves.

Setting up the namespace

Start by creating a new namespace, in this case we’ll call it team-touge, and give the tenant the necessary access.

kubectl create namespace team-touge

Create a RoleBinding that gives users admin access to the team-touge namespace. Depending on your environment you may want to use something more restrictive than admin but for my use-case it is appropriate.

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: team-admin
  namespace: team-touge
subjects:
  - kind: Group
    name: some-team-group-name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io

After access to the host cluster namespace has been given, the tenants can take over and manage everything from this point forward.
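Since the host cluster’s namespace quotas still apply to everything the virtual cluster schedules, the operator team might also add a ResourceQuota to the namespace at this point. This is a minimal sketch; the quota name and the numbers are hypothetical and should be sized for your own tenants.

```yaml
# Hypothetical quota for the tenant namespace; the limits are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-touge-quota
  namespace: team-touge
spec:
  hard:
    # Caps every pod in the namespace, including those synced from the virtual cluster
    pods: "50"
    requests.cpu: "8"
    requests.memory: 16Gi
```

Keep in mind that once a quota covers CPU or memory requests, every pod in the namespace has to declare those requests, or a LimitRange has to supply defaults for them.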

Setup vcluster

After the vcluster CLI is installed, we need to create a YAML config file that tells vcluster to run as non-root.

# vcluster.yaml
securityContext:
  runAsUser: 12345
  runAsNonRoot: true

We then use this config file when launching the virtual cluster called touge inside of the team-touge namespace.

$> vcluster create touge -n team-touge -f vcluster.yaml
[info]   execute command: helm upgrade touge vcluster --repo https://charts.loft.sh --version 0.5.0-alpha.7 --kubeconfig /var/folders/rn/hrzkvjz5325dvtxz2ztzyf480000gp/T/1416406882 --namespace team-touge --install --repository-config='' --values /var/folders/rn/hrzkvjz5325dvtxz2ztzyf480000gp/T/3591452462 --values vcluster.yaml
[done] √ Successfully created virtual cluster touge in namespace team-touge. Use 'vcluster connect touge --namespace team-touge' to access the virtual cluster

After a few seconds we now have some new pods running in the team-touge namespace.

$> kubectl get pods -n team-touge
coredns-7bbd4f6c46-pvqcg-x-kube-system-x-touge  1/1  Running 0 1m16s
touge-0                                         2/2  Running 0 3m30s

The touge-0 pod is vcluster with the k3s control-plane and the coredns-7bbd4f6c46-pvqcg-x-kube-system-x-touge pod is the coredns deployment from inside the virtual cluster. Now that vcluster is running we can connect to that virtual cluster via port-forwarding. Because the connection goes through the host cluster, you can configure RBAC rules for whether or not a user is allowed to connect to it at all. You could also set up an Ingress for connectivity if you’d prefer.

$> vcluster connect touge -n team-touge
[done] √ Virtual cluster kube config written to: ./kubeconfig.yaml. You can access the cluster via `kubectl --kubeconfig ./kubeconfig.yaml get namespaces`
[info]   Starting port-forwarding at 8443:8443
Forwarding from 127.0.0.1:8443 -> 8443
Forwarding from [::1]:8443 -> 8443
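Because the connection relies on port-forwarding in the host namespace, access can be narrowed with an ordinary Role. The following is a sketch under the assumption that connecting needs to find the pod, open a port-forward, and read the kubeconfig Secret; the Role name is hypothetical and the exact permissions vcluster needs may differ by version.

```yaml
# Hypothetical Role; bind it only to the users who should reach the virtual cluster.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: vcluster-connect
  namespace: team-touge
rules:
  # Locate the vcluster pod
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  # Open the port-forward to it
  - apiGroups: [""]
    resources: ["pods/portforward"]
    verbs: ["create"]
  # Read the kubeconfig Secret stored in the host namespace (assumed requirement)
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
```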

In another window I configure my environment to use this newly generated kube config and get a list of pods in the kube-system namespace.

$> export KUBECONFIG=$(pwd)/kubeconfig.yaml
$> kubectl get pods -n kube-system

NAME                       READY   STATUS    RESTARTS   AGE
coredns-7bbd4f6c46-pvqcg   1/1     Running   0          7m31s

If you look back, you’ll see this pod name matches the coredns-7bbd4f6c46-pvqcg-x-kube-system-x-touge pod that we saw earlier. vcluster will sync the pods (but not the deployment) and delegate scheduling to the host cluster. That way it ends up on a real node and can serve real workloads.

Inspecting the virtual cluster

An interesting aspect of the way vcluster works is how it creates “fake” nodes that the workloads run on. Once a pod is scheduled on the host cluster’s node, that node will appear inside of the virtual cluster as well with partial data. For example, the IP address of the node will be different and the node Conditions are specific to the virtual cluster’s management of the node. The only pods it shows as running on the node are those that are running inside the virtual cluster. This keeps the data segmented from the host cluster and any other vclusters that may be running.

$> kubectl get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
vmss-agent-worker-touge-cmixj000000   Ready    <none>   10m   v1.20.11+k3s2

$> kubectl describe node vmss-agent-worker-touge-cmixj000000
Name:               vmss-agent-worker-touge-cmixj000000
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=fake-vmss-agent-worker-touge-cmixj000000
                    kubernetes.io/os=linux
                    vcluster.loft.sh/fake-node=true

Non-terminated Pods:          (1 in total)
  Namespace                   Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                        ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-7bbd4f6c46-pkdgp    100m (0%)     1 (6%)      70Mi (0%)        170Mi (0%)     10m

Let’s create a secret in the virtual cluster’s kube-system namespace.

$> kubectl create secret generic my-secret -n kube-system --from-literal=TEST=foo
secret/my-secret created

$> kubectl get secret -n kube-system my-secret
NAME        TYPE     DATA   AGE
my-secret   Opaque   1      20s

But if we try to access this secret from the host cluster it won’t exist.

$> kubectl get secret -n kube-system my-secret
Error from server (NotFound): secrets "my-secret" not found

$> kubectl get secret -n team-touge my-secret
Error from server (NotFound): secrets "my-secret" not found

Only resources such as pods, services and ingresses are sync’d with the host cluster by default. Additional resources can be sync’d between the host and virtual cluster but it may need additional RBAC permissions that a standard user won’t have.

Installing Vault Secrets Operator

As I’ve talked about before, I love the Vault Secrets Operator so let’s install it on the new virtual cluster.

If you remember, in the host cluster, we gave users the ability to admin the team-touge namespace but we didn’t give them access to create new namespaces. However, they do have admin access to the virtual cluster (more on that later) so a new namespace can still be created. This new namespace exists only within the virtual cluster.

$> kubectl create namespace vault-secrets-operator
namespace/vault-secrets-operator created

Next we’ll install the chart repositories and the charts. PLEASE NOTE that this is NOT how you would install Vault in a production environment.

$> helm repo add hashicorp https://helm.releases.hashicorp.com
"hashicorp" has been added to your repositories
$> helm repo add ricoberger https://ricoberger.github.io/helm-charts
"ricoberger" has been added to your repositories

$> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ricoberger" chart repository
...Successfully got an update from the "hashicorp" chart repository
Update Complete. ⎈Happy Helming!⎈

$> helm install --namespace vault-secrets-operator vault hashicorp/vault

# Follow the steps from https://learn.hashicorp.com/tutorials/vault/kubernetes-minikube?in=vault/kubernetes#initialize-and-unseal-vault to unseal Vault

$> helm install --namespace vault-secrets-operator --set environmentVars\[0\].name=VAULT_TOKEN --set environmentVars\[0\].value=your-vault-token vault-secrets-operator ricoberger/vault-secrets-operator

We can now see that the application is installed along with the CRDs.

$> kubectl get crd vaultsecrets.ricoberger.de
NAME                         CREATED AT
vaultsecrets.ricoberger.de   2021-12-19T23:35:48Z

If we go back to the host cluster, you can see that the CRD is not installed there.

$> unset KUBECONFIG
$> kubectl get crd vaultsecrets.ricoberger.de
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "vaultsecrets.ricoberger.de" not found

A note on security

I’m still digging into this but one of the things about vcluster that isn’t clear to me is the cluster access permissions. It appears to me that all users connecting with vcluster connect have admin access. You can restrict access to the vcluster by restricting who can port-forward or access the kubeconfig secret in the host namespace, but that blanket access/restriction is limiting in scope. While this may be acceptable for a demo/testing environment, you’ll want to make sure you have something more robust before using it for anything beyond that. There may be something already in place for this but I’m not seeing it at first glance.

On the plus side with security, because the virtualized cluster still has pods running within the host cluster, any rules you may have in place with a PodSecurityPolicy (note those are now deprecated) or Open Policy Agent are still enforced. If you don’t allow a pod to be deployed on a control-plane node via an OPA policy, that pod will fail to launch when vcluster creates it.

Events:
  Type     Reason     Age                 From        Message
  ----     ------     ----                ----        -------
  Warning  SyncError  17s (x14 over 59s)  pod-syncer  Error syncing to physical cluster: admission webhook "mutating-webhook.openpolicyagent.org" denied the request: The Pod spec.tolerations[] contains a toleration for a control plane Node taint, but the Pod is not within an approved control plane Namespace

Where to next?

In this post I focused on the custom operators use-case but there are a few others that I’m playing around with. I’d really like to be able to deploy, as close as possible, the deployments from my core cluster into a virtual cluster for CI/CD. That would greatly speed up my testing as the cloud resources wouldn’t need to be created each time in order to test with a clean environment. It would also be nice to have teams test deploying their full application stack inside a virtual cluster for things like canary or blue/green deployments. Being able to fully isolate and test different versions of the applications could provide a lot of value to the application teams.

I’ll definitely be watching this project closely over the coming months!

Kubernetes clusters for everyone using vcluster was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

PagerDuty OncallStatus for MacOS

Mike Tougeron — Mon, 16 Aug 2021 16:08:42 GMT

I’m now happy to introduce the new open source application OncallStatus for PagerDuty on MacOS!

Being oncall is never fun and rarely easy. On top of that it can also be _noisy_. The phone alerts are always loud as they try to get your attention. Beyond that, when I’m already at the computer the noise from my phone bothers me immensely. Pre-covid when I was in the office it would disturb my teammates and while working from home it creates excess stress that I don’t need. I’d rather have my computer notify me instead. Rather than complain and do nothing, I still bitched but did something about it.

The OncallStatus app was my solution to this problem. This new app works with your PagerDuty oncall schedules and notifies you when a new incident has been created.

You can filter the events based on high & low priority or the escalation level you are assigned to. No need to get that extra visual noise for the low priority stuff. Now I can set my phone to silent when I’m at my desk and the noise is much easier on my mind.

Written in Go

As I’ve been learning/working with Go lately I decided to try and use that to write this app. Using the cgo bindings it is possible to call Cocoa functions and have native MacOS functionality built into your Go app. I started with using a nice library called menuet but unfortunately it hasn’t been kept up-to-date and had a tendency to crash a lot on startup. Looking for another library that could do menu bar updates I came across getlantern/systray. While not as feature rich as menuet, it worked AND it promised cross-platform compatibility. While I haven’t been able to work on the Linux & Windows support yet, I’m happy that it should be possible.

Unfortunately when it came to notifications the only cross-platform library I could find was severely lacking in functionality. So in the spirit of an MVP I decided to stick with just MacOS to start. I took a bit of code from menuet and some from a utility called go-notifier. This allowed me to create a notification that was clickable and would open a web browser straight to the incident page. Because most of this code was Cocoa it took me _a long time_ to get it working. It’s not the greatest code but it does get the job done. I know I’m going to have to go back and refactor it at some point.

A neat idea that I learned from menuet is checking GitHub for new releases of OncallStatus. When the tag used for the current running version doesn’t match the latest release on GitHub it creates a notification letting you know there is a new version. This could be done with just a few lines of code thanks to google/go-github.
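The release check described above can be sketched roughly as follows. The app itself is written in Go and uses google/go-github; this is only a stdlib Python illustration of the same idea, and the function names (`latest_release_tag`, `update_available`) and the injectable `opener` parameter are hypothetical, not taken from OncallStatus.

```python
import json
from urllib.request import Request, urlopen


def latest_release_tag(owner, repo, opener=urlopen):
    """Return the tag name of the latest GitHub release for owner/repo."""
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}/releases/latest",
        headers={"Accept": "application/vnd.github+json"},
    )
    # The opener is injectable so the function can be exercised without network access.
    with opener(req, timeout=10) as resp:
        return json.load(resp)["tag_name"]


def update_available(current_tag, latest_tag):
    """Per the description above: any mismatch counts as a new version."""
    return current_tag != latest_tag
```

A caller would compare the tag baked into the running build against `latest_release_tag(...)` and raise a notification when `update_available` returns true.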

Authentication

When it comes to authentication, security is important. The login process connects to PagerDuty using Proof Key for Code Exchange (PKCE). The PagerDuty API PKCE docs did a decent job of explaining how to use it. At the moment the app is only requesting read access to PagerDuty but at some point I’d like to expand this to write access as well so that you can acknowledge the alerts directly from the app. Because PKCE is part of an OAuth workflow, I never had to touch the user’s username/password. The authentication and authorization all happen on the PagerDuty side.
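For readers who haven’t used PKCE, the client-side part is small: generate a random code verifier, then send its SHA-256 hash (the code challenge) with the authorization request. Here is a minimal sketch of the S256 method from RFC 7636, written in Python for brevity rather than the Go the app uses. The function names are my own and this is not the app’s actual code.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes=32):
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier):
    """S256 challenge: base64url(SHA-256(verifier)) with the padding removed."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The verifier stays on the client and is only sent when exchanging the authorization code for a token, which is what lets a desktop app skip storing a client secret.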

Next I had to come up with a way to securely store the resulting API Token. I went with keybase/go-keychain which allows me to save the data into the MacOS keychain. For the cross-platform work I eventually want to do it will also allow me to save to a Linux keychain as well.

Configuring the OncallStatus.App bundle

Because this is a MacOS application I had to set up an app bundle that can be run instead of a command line tool. I found this article https://medium.com/@mattholt/packaging-a-go-application-for-macos-f7084b00f6b5 that walked me through how to set up the OncallStatus.app directory and files. I followed the steps for the icons, plist & folder structure. However, I found an easier tool for creating DMG files so I was able to set up some automation for that step.

https://medium.com/media/1c3e1d0e5aa1f77970d39391d34f5f13/href

I was able to get an appropriately licensed image to use as the application icon from Adobe Stock. At this point I could run OncallStatus as a normal MacOS app. Unfortunately the security warnings and restrictions still kept the app from being usable.

Build, Signing & Notarizing

In order to distribute the application without all the security warnings showing up when someone tries to run OncallStatus I had to pay Apple a $99 yearly fee in order to sign & notarize it. I really hope I start making more apps like this one in order to make the cost worth it. It’s pretty annoying that Apple makes you pay to join the Developer Program in order to sign & notarize apps.

Once I paid the toll, I could use gon to do the signing & notarizing in an automated fashion. First I set up gon to sign the OncallStatus.app and the binary OncallStatus.app/Contents/MacOS/OncallStatus. I had originally tried to do the notarizing at the same time but I had trouble getting the DMG created properly when I did it that way. The binary file kept ending up in the DMG top level instead of just the OncallStatus.app. Instead, after signing the app files, I create the DMG file, then send that to Apple to get notarized.

What’s really cool is that I can do all of this via a few GitHub Actions. It’s even able to automatically upload & attach the newly created DMG file to the release when I create the new version tag. The more I use GitHub Actions the deeper I fall in love with them.

What’s coming next

As with many things time is going to be a key factor. Right now my priority is to pass the CKS exam. But once that’s out of the way I’d like to expand the functionality so that it works on Linux desktops. Several of my friends & co-workers run Linux as their primary machine and I’d like to be able to support them. I’d also like to clean up the notifications code as it’s using a deprecated library that will eventually go away. If I ever get more comfortable with Cocoa (or whatever language) for MacOS UI development I’d like to allow users to acknowledge the alert directly from the notification.

For now however, I hope you give it a try and like it. Please don’t hesitate to give feedback here or as a GitHub issue.

PagerDuty OncallStatus for MacOS was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

It’s all about the data; a journey into Kubernetes CSI on AWS

Mike Tougeron — Mon, 15 Mar 2021 17:10:32 GMT

Over the last several weeks I’ve taken a trip into the world of Kubernetes storage; both the Container Storage Interface (CSI) and Container Attached Storage (CAS). I’ve talked with folks in the CAS space before but for whatever reason the power of it never really settled into my brain until recently. The idea of this journey started picking up steam when I realized that the in-tree storage plugins were deprecated and no new enhancements were being made to them starting with Kubernetes 1.20. When I discovered that simply switching from gp2 to gp3 volumes meant I had to start using the AWS CSI Driver I realized I was behind the times. This desire for a simple change opened the door and the next thing I knew I was on an adventure of potentially significant impact.

The journey started with the new AWS EBS volume types, but then sped into some code trying to fix an open issue in the aws-ebs-csi-driver, jumped up into VolumeSnapshots, spun around to creating PVCs from snapshots, and rounded the corner into doing an OpenEBS proof of concept. By the time I was done I was exhausted but full of excitement for the future possibilities.

aws-ebs-csi-driver

Please note that all references are assuming that you are running Kubernetes 1.17+ though I’ve only been running this configuration on a 1.19 cluster.

What is it?

The aws-ebs-csi-driver is a CSI storage plugin that replaces the in-tree storage plugin for AWS EBS volumes. These plugins are what are used when you request a new PVC and the EBS volumes get created behind the scenes. In the beginning the storage plugins were part of the base Kubernetes repo/app but over time that has evolved. The creation of the CSI spec has opened the door not only for new CAS providers but also the cloud providers. The speed at which storage drivers can be iterated on can grow significantly because it is out of band from the core Kubernetes releases.

If you’re looking at it for the first time it can be daunting to understand what each component of the app does or even why you need to run it at all! At a very high level, you run a Deployment for the controller (with its many sidecars), a DaemonSet (also with sidecars) on every node, and optionally the snapshot controller.

The ebs-csi-controller has several sidecars in its deployment but the main container is the ebs-plugin. This is where the code lives that interacts with the AWS APIs to create, delete, resize, etc. the EBS volumes when a Persistent Volume is created. This is the code you’ll see when you go to https://github.com/kubernetes-sigs/aws-ebs-csi-driver.

The other sidecars essentially contain boilerplate logic that handles the communication and coordination between the ebs-plugin and the Kubernetes API. For example, the csi-resizer sidecar watches for PVC edits and notifies the ebs-plugin, over a socket using gRPC, with the necessary data so that it can resize the EBS volume via the AWS API. The sidecars allow the ebs-plugin driver to focus on just the storage volume functionality and not have to re-implement a lot of the same Kubernetes API interactions. I highly recommend reading the Kubernetes CSI Sidecar Containers portion of the documentation to get a better idea of what each one does. I don’t think you necessarily need to know them in-depth but a general understanding of what each piece does is really helpful.

Along with the controller there is the ebs-csi-node that runs as a DaemonSet on each node. It is responsible for mounting and unmounting a volume from the node when requested by the kubelet. This is what makes the volume available for the Pods to use.

How to set it up

The aws-ebs-csi-driver requires AWS API access in order to manage the EBS volumes. I recommend something like kube2iam to handle this and to not use access keys. The official documentation has an example IAM policy and it looks like this.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AttachVolume",
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:DeleteSnapshot",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeInstances",
        "ec2:DescribeSnapshots",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumesModifications",
        "ec2:DetachVolume",
        "ec2:ModifyVolume"
      ],
      "Resource": "*"
    }
  ]
}

Once you have the IAM role configured you can launch the controllers via the Helm chart.

helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver

helm repo update

helm upgrade --install aws-ebs-csi-driver \
    --namespace kube-system \
    --set enableVolumeScheduling=true \
    --set enableVolumeResizing=true \
    --set 'podAnnotations.iam\.amazonaws\.com/role'=ROLE_ARN \
    --set 'node.podAnnotations.iam\.amazonaws\.com/role'=ROLE_ARN \
    aws-ebs-csi-driver/aws-ebs-csi-driver

After it has been applied you’ll see the pods running in the kube-system namespace.

NAME                                  READY STATUS    RESTARTS AGE
ebs-csi-controller-85bc6d8897-lt5xk   6/6   Running   0        3m7s
ebs-csi-controller-85bc6d8897-v542j   6/6   Running   0        3m7s
ebs-csi-node-66dt6                    3/3   Running   0        3m7s
ebs-csi-node-9424k                    3/3   Running   0        3m7s
ebs-csi-node-b9mnd                    3/3   Running   0        3m7s
ebs-csi-node-gd6d6                    3/3   Running   0        3m7s
ebs-csi-node-hr4qt                    3/3   Running   0        3m7s
ebs-csi-node-jjbcj                    3/3   Running   0        3m7s

You will also see the CSIDriver installed on your cluster.

$> kubectl get csidriver

NAME              ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
ebs.csi.aws.com   true             false            Persistent   21m

The CSIDriver is what you use when creating the StorageClass so that Kubernetes knows which CSI storage plugin should be used. This means that you can have more than one storage plugin running on your cluster at the same time! For example, in my case I have the in-tree storage plugins, the aws-ebs-csi-driver plugin, and OpenEBS (from the POC that I’ll discuss in a future blog post) all running nicely together.

How to use it

Now that the controller and node pods are running and the CSIDriver is created you can create the StorageClass(es) your users will use.

apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: ebs.csi.aws.com # <-- The same name as the CSIDriver
metadata:
  name: gp3
parameters: # <-- parameters for this CSIDriver
  encrypted: "true"
  type: gp3
allowVolumeExpansion: true
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: ebs.csi.aws.com
metadata:
  name: gp3-6000iops
parameters:
  encrypted: "true"
  type: gp3
  throughput: "250" # <-- StorageClass parameter values must be strings
  iops: "6000" # <-- For volumes 1TB-2TB in size or needing more iops
allowVolumeExpansion: true
volumeBindingMode: Immediate

From an end-user perspective, the new gp3 storage class is used the same way as the storage classes backed by the in-tree storage plugins.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: touge-pvc
spec:
  storageClassName: gp3
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Let’s follow the process and inspect the results.

$> kubectl apply -f touge-pvc.yaml
persistentvolumeclaim/touge-pvc created

$> kubectl get pvc touge-pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
touge-pvc   Bound    pvc-a2cc33c6-f5d5-425f-bd1e-0902b82bbcec   10Gi       RWO            gp3            10s

$> kubectl describe pv pvc-a2cc33c6-f5d5-425f-bd1e-0902b82bbcec
Name:              pvc-a2cc33c6-f5d5-425f-bd1e-0902b82bbcec
Labels:            
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp3
Status:            Bound
Claim:             default/touge-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.ebs.csi.aws.com/zone in [us-west-2c]
Message:           
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-0f06e363f467b87bd
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1615144050357-8081-ebs.csi.aws.com
Events:

$> aws ec2 describe-volumes --volume-ids vol-0f06e363f467b87bd --region us-west-2
{
    "Volumes": [
        {
            "Attachments": [],
            "AvailabilityZone": "us-west-2c",
            "CreateTime": "2021-03-07T21:42:30.268Z",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-west-2:REDACTED:key/REDACTED",
            "Size": 10,
            "SnapshotId": "",
            "State": "available",
            "VolumeId": "vol-0f06e363f467b87bd",
            "Iops": 3000,
            "Tags": [
                {
                    "Key": "kubernetes.io/created-for/pv/name",
                    "Value": "pvc-a2cc33c6-f5d5-425f-bd1e-0902b82bbcec"
                },
                {
                    "Key": "kubernetes.io/created-for/pvc/namespace",
                    "Value": "default"
                },
                {
                    "Key": "kubernetes.io/created-for/pvc/name",
                    "Value": "touge-pvc"
                },
                {
                    "Key": "CSIVolumeName",
                    "Value": "pvc-a2cc33c6-f5d5-425f-bd1e-0902b82bbcec"
                }
            ],
            "VolumeType": "gp3",
            "MultiAttachEnabled": false,
            "Throughput": 125
        }
    ]
}

VolumeSnapshots

VolumeSnapshots are a pretty cool feature that’s possible with CSI. You can do things like take a snapshot of a volume and then restore the PVC from that snapshot if your data becomes corrupt. An interesting use-case would be to create a nightly snapshot of your dev database and allow users to create a PersistentVolumeClaim (PVC) from that snapshot to use in their personal testing. The snapshot doesn’t even need to come from inside Kubernetes!

Enabling VolumeSnapshots

It’s an add-on to the default setup so the first thing you need to do is install the CSI Snapshotter CRDs. After installing the Snapshotter CRDs you can add --set enableVolumeSnapshot=true to the Helm install command from above and a new StatefulSet, ebs-snapshot-controller, will be running.

It uses a VolumeSnapshotClass to know which CSI Plugin the snapshot requests go to so let’s create one.

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: ebs-csi-aws
driver: ebs.csi.aws.com # <-- The CSIDriver we defined previously
deletionPolicy: Delete

Creating VolumeSnapshots

To create a new VolumeSnapshot create a resource on the cluster for it.

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: touge-snapshot
spec:
  volumeSnapshotClassName: ebs-csi-aws
  source:
    persistentVolumeClaimName: touge-pvc

This will trigger the snapshotting process, the aws-ebs-csi-driver will be notified, and it will create a snapshot in AWS for the EBS volume that is backing the PVC. Once again, let’s follow the process and inspect the results.

$> kubectl apply -f touge-snapshot.yaml 
volumesnapshot.snapshot.storage.k8s.io/touge-snapshot created

$> kubectl describe volumesnapshot touge-snapshot
Name:         touge-snapshot
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1beta1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2021-03-07T21:50:40Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generation:  1
  Resource Version:  135554
  Self Link:         /apis/snapshot.storage.k8s.io/v1beta1/namespaces/default/volumesnapshots/touge-snapshot
  UID:               7d32eca6-2015-4a6e-a5b6-3ec48ca68005
Spec:
  Source:
    Persistent Volume Claim Name:  touge-pvc
  Volume Snapshot Class Name:      ebs-csi-aws
Status:
  Bound Volume Snapshot Content Name:  snapcontent-7d32eca6-2015-4a6e-a5b6-3ec48ca68005
  Creation Time:                       2021-03-07T21:51:13Z
  Ready To Use:                        true
  Restore Size:                        10Gi

If we look at the Status we will see Bound Volume Snapshot Content Name: snapcontent-7d32eca6-2015-4a6e-a5b6-3ec48ca68005. This tells us which VolumeSnapshotContent was created for our VolumeSnapshot. The VolumeSnapshotContent is a resource that is created by the snapshot controller that represents the data the CSI Plugin created.

$> kubectl describe volumesnapshotcontents snapcontent-7d32eca6-2015-4a6e-a5b6-3ec48ca68005
Name:         snapcontent-7d32eca6-2015-4a6e-a5b6-3ec48ca68005
Namespace:    
Labels:       
Annotations:  
API Version:  snapshot.storage.k8s.io/v1beta1
Kind:         VolumeSnapshotContent
Metadata:
  Creation Timestamp:  2021-03-07T21:50:40Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  Generation:  1
  Resource Version:  135553
  Self Link:         /apis/snapshot.storage.k8s.io/v1beta1/volumesnapshotcontents/snapcontent-7d32eca6-2015-4a6e-a5b6-3ec48ca68005
  UID:               0049e16f-2196-456b-9460-319ee24b3a15
Spec:
  Deletion Policy:  Delete
  Driver:           ebs.csi.aws.com
  Source:
    Volume Handle:             vol-0f06e363f467b87bd
  Volume Snapshot Class Name:  ebs-csi-aws
  Volume Snapshot Ref:
    API Version:       snapshot.storage.k8s.io/v1beta1
    Kind:              VolumeSnapshot
    Name:              touge-snapshot
    Namespace:         default
    Resource Version:  133849
    UID:               7d32eca6-2015-4a6e-a5b6-3ec48ca68005
Status:
  Creation Time:    1615153873000000000
  Ready To Use:     true
  Restore Size:     10737418240
  Snapshot Handle:  snap-05bc0ec2f3a65b7be

Here is where we see Snapshot Handle: snap-05bc0ec2f3a65b7be that tells us the SnapshotID in AWS.

$> aws ec2 describe-snapshots --snapshot-ids snap-05bc0ec2f3a65b7be --region us-west-2
{
    "Snapshots": [
        {
            "Description": "Created by AWS EBS CSI driver for volume vol-0f06e363f467b87bd",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-west-2:REDACTED:key/REDACTED",
            "OwnerId": "REDACTED",
            "Progress": "100%",
            "SnapshotId": "snap-05bc0ec2f3a65b7be",
            "StartTime": "2021-03-07T21:51:13.115Z",
            "State": "completed",
            "VolumeId": "vol-0f06e363f467b87bd",
            "VolumeSize": 10,
            "Tags": [
                {
                    "Key": "CSIVolumeSnapshotName",
                    "Value": "snapshot-7d32eca6-2015-4a6e-a5b6-3ec48ca68005"
                }
            ]
        }
    ]
}
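The Status block of the VolumeSnapshotContent reports Creation Time and Restore Size as raw integers (nanoseconds since the Unix epoch and bytes), which makes them hard to compare with the AWS output by eye. A small sketch for decoding them, with helper names of my own choosing:

```python
from datetime import datetime, timezone


def ns_to_utc(ns):
    """Convert nanoseconds since the Unix epoch to an ISO-8601 UTC string."""
    seconds = ns // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def bytes_to_gib(n):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30


print(ns_to_utc(1615153873000000000))  # 2021-03-07T21:51:13Z
print(bytes_to_gib(10737418240))       # 10.0
```

Both values line up with the StartTime (2021-03-07T21:51:13.115Z) and VolumeSize (10) that AWS reports for the snapshot.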

Creating a PVC from an Existing AWS Snapshot

Let’s go through the example use-case of taking an existing AWS snapshot and creating a PVC from it for someone to use inside Kubernetes.

First we need to create the VolumeSnapshotContent that references the AWS snapshot. Using the AWS console I created snap-002e544b538087ec1 from an EBS volume that I had. To show the power of this, the volume & snapshot were created outside of Kubernetes.

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  name: my-imported-snapshot
spec:
  volumeSnapshotRef:
    kind: VolumeSnapshot
    name: my-imported-snapshot
    namespace: default 
  source:
    snapshotHandle: snap-002e544b538087ec1 # <-- snapshot to import
  driver: ebs.csi.aws.com
  deletionPolicy: Delete
  volumeSnapshotClassName: ebs-csi-aws

Then we need to create the VolumeSnapshot that uses that VolumeSnapshotContent.

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: my-imported-snapshot
  namespace: default 
spec:
  volumeSnapshotClassName: ebs-csi-aws
  source:
    volumeSnapshotContentName: my-imported-snapshot

And apply it to the cluster.

$> kubectl apply -f touge-import-snapshot.yaml 
volumesnapshotcontent.snapshot.storage.k8s.io/my-imported-snapshot created
volumesnapshot.snapshot.storage.k8s.io/my-imported-snapshot created

$> kubectl get volumesnapshotcontent
NAME                                               AGE
my-imported-snapshot                               4m31s

$> kubectl get volumesnapshot
NAME                   AGE
my-imported-snapshot   4m56s

The VolumeSnapshot is now available to be used to create the PersistentVolumeClaim.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-imported-snapshot-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: my-imported-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

This is then applied to the cluster and we will have a new PVC that we can mount on our Pod.

$> kubectl apply -f touge-pvc-from-snapshot.yaml 
persistentvolumeclaim/my-imported-snapshot-pvc created

$> kubectl get pvc my-imported-snapshot-pvc
NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
my-imported-snapshot-pvc   Bound    pvc-1ff63250-5a4f-442f-9907-171b69569c2b   10Gi       RWO            gp3            26s

$> kubectl describe pv pvc-1ff63250-5a4f-442f-9907-171b69569c2b
Name:              pvc-1ff63250-5a4f-442f-9907-171b69569c2b
Labels:            
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp3
Status:            Bound
Claim:             default/my-imported-snapshot-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.ebs.csi.aws.com/zone in [us-west-2a]
Message:           
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-03b42ee74d7fd4f4e
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1615144050357-8081-ebs.csi.aws.com

Where I plan to go from here…

Once I had played around with the aws-ebs-csi-driver for a few days I went ahead and implemented it in the production clusters. I haven’t yet gotten around to migrating the in-tree volumes to the new CSI based ones but there’s a bit of time for that.

Following my excitement I ended up doing a proof of concept using OpenEBS. I’m really excited about the possibilities there and will be writing about that setup soon.

Introducing the k8s-aws-ebs-tagger

Mike Tougeron — Tue, 12 Jan 2021 20:26:04 GMT

https://stock.adobe.com/images/silhouettes-of-taggers-drawing-graffiti/67460951

EDIT: The k8s-aws-ebs-tagger was renamed to k8s-pvc-tagger as the scope of the project was expanded to include more than just aws-ebs volumes. Don’t worry, it’s still backwards compatible.

The k8s-aws-ebs-tagger brings tagging to the AWS EBS volumes created by Kubernetes PersistentVolumeClaims (PVC). This new utility enables you to set arbitrary tags on the EBS volume so that you can better categorize and report on the state of your AWS resources. Having proper cost control tags can help you keep a handle on your AWS billing and resource utilization.

Let’s dive into how to install and use it.

Install the k8s-aws-ebs-tagger

The container images are released both on DockerHub and GitHub Container Registry and are built for both linux/amd64 and linux/arm64.

The first thing needed is an AWS IAM Role that is allowed to add & delete tags from EBS volumes. I recommend using kube2iam for assigning the role to the Pod(s) instead of using AWS access key/secrets.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:volume/*"
            ]
        }
    ]
}

Once you have the IAM Role you can get the app running via its Helm chart. Be sure to check the default values and adjust as appropriate for your environment.

helm repo add mtougeron https://mtougeron.github.io/helm-charts/
helm repo update
helm install k8s-aws-ebs-tagger mtougeron/k8s-aws-ebs-tagger

If you want it to only watch a single namespace you can set the watchNamespace value for the chart but it still needs a ClusterRole in order to get the volume ID from the PersistentVolume. Currently it only supports watching a single namespace or all namespaces (#9) but I plan on updating this soon.

Configuring the tags to set

The first approach is to use the (optional) --default-tags command line flag that takes a json encoded string of key/value pairs. It uses these tags as the base set of tags to add to all EBS Volumes when a PersistentVolumeClaim is added or updated. This is useful if you always want to add a fixed tag to all EBS volumes created in the cluster (or namespace). For example, you may want all volumes to have the tag Environment=Production.

The default tags can be extended by the aws-ebs-tagger/tags annotation on the PVC. This annotation also takes a json encoded string of key/value pairs and uses them for tags on the volume. This can be used to extend the list of tags you want set as well as override the default values.

Take for example, this deployment and PVC

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-aws-ebs-tagger
spec:
...
  template:
    spec:
      containers:
        - name: k8s-aws-ebs-tagger
          args:
            # Quoted so YAML does not read the ": " inside the JSON as a mapping
            - '--default-tags={"Environment": "Production"}'
...
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example1
  annotations:
    aws-ebs-tagger/tags: |
      {"Database": "true"}
...

The resulting EBS volume will have the tags Environment=Production and Database=true.

Let’s say that for databases you have a different Environment tag. You could use the same Deployment as above but on the Database PVCs you override the Environment tag in the annotation.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db1
  annotations:
    aws-ebs-tagger/tags: |
      {"Database": "true", "Environment": "DBProduction"}
...

And the resulting tags will be Environment=DBProduction and Database=true.
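The precedence between the default tags and the annotation boils down to a dictionary merge where the annotation wins. A rough Python sketch of that behavior, not the tagger’s actual code (the `merge_tags` name and its signature are hypothetical):

```python
import json
from typing import Dict, Optional


def merge_tags(default_tags_json: str, annotation_json: Optional[str]) -> Dict[str, str]:
    """Start from the default tags, then let the PVC annotation add or override keys."""
    tags = json.loads(default_tags_json) if default_tags_json else {}
    if annotation_json:
        tags.update(json.loads(annotation_json))
    return tags


defaults = '{"Environment": "Production"}'
print(merge_tags(defaults, '{"Database": "true"}'))
print(merge_tags(defaults, '{"Database": "true", "Environment": "DBProduction"}'))
```

The first call keeps Environment=Production and adds Database=true; the second replaces the Environment value, matching the two examples above.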

If you have a PVC that you don’t want tagged at all you can use the aws-ebs-tagger/ignore annotation and no tags will be processed for that volume.

Currently you can only use fixed values for the tags. However I’m working on updating that in the near future to allow for templated tag values (#15).

JSON vs multiple annotations vs comma delimited values

I went back and forth over this a lot when I was first thinking about writing this app. There are pros & cons to each approach but the deciding factor was that I needed a tag name with a : in it and I couldn’t use that in an annotation name. I also considered using a comma delimited list of tags but that made it difficult to allow the , in the value of the tag. With those restrictions the greatest flexibility was to use a json string of key/value pairs. That unfortunately leaves the user with the hassle of hand-coding json key/value pairs for their tags (which I really hate) but it’s the best I could think of.
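To make the comma problem concrete, here is a small sketch comparing a naive comma-delimited parser with JSON. The `key=value,key=value` format and the `parse_comma_delimited` helper are hypothetical and only exist to illustrate the trade-off; the app never supported them.

```python
import json


def parse_comma_delimited(raw):
    """Naive 'key=value,key=value' parser; breaks when a value contains a comma."""
    tags = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


# The value of "notes" contains a comma, so the naive parser splits it in two.
naive = parse_comma_delimited("team:owner=storage,notes=db, primary")
print(naive)

# JSON keeps both the ':' in the key and the ',' in the value intact.
as_json = json.loads('{"team:owner": "storage", "notes": "db, primary"}')
print(as_json)
```

The naive parser truncates the notes value to "db" and invents an empty "primary" tag, while the JSON form round-trips cleanly.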

Thoughts on the future

While the CSI Drivers are the latest & future of storage on Kubernetes there are still a lot of us not using them yet. Even if you are using the aws-ebs-csi-driver it still has an open issue to allow adding arbitrary tags. Until the CSI driver is updated this utility app should provide the desired tagging.

Completing #9 & #15 will allow handling of the last scenarios I can think of to make this usable by the masses. However, if you have any suggestions for improvement I’m happy to hear them!

5 ways to handle AWS API rate-limiting

Mike Tougeron — Mon, 14 Dec 2020 20:33:31 GMT

When dealing with AWS API rate-limiting there are a few tips & tricks that I find helpful. If your environment is like mine and you have a lot of code interacting with the AWS APIs, sometimes poorly, handling the default rate-limiting without errors is important.

Top AWS API calls in a typical hour

Python’s Tenacity

I’ve found that Tenacity for Python is a life saver. Tenacity is a general purpose library that automates retry logic. By decorating your functions Tenacity will automatically retry, with behavior determined by the decoration, when an exception is raised. In the code below it automatically retries the API call 10 times after waiting with a random exponential backoff.

https://medium.com/media/d1ab55cb1b4f1bf352570583f2656564/href

The great thing is that it only takes a little bit of effort to refactor your code to take advantage of it. All you have to do is make each AWS API call a function and put the decorator on it. With the reraise=True option your existing error handling will continue to work as it is coded now.
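If you’d like to see what that kind of decorator does under the hood, here is a stdlib-only sketch of retrying with a random exponential backoff. It is not Tenacity’s implementation; in Tenacity the equivalent behavior would come from options such as stop_after_attempt, wait_random_exponential and reraise=True. The `retry_with_backoff` name and its parameters are my own.

```python
import functools
import random
import time


def retry_with_backoff(max_attempts=10, base=1.0, cap=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, sleeping a random, exponentially growing time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original exception
                    # Random wait between 0 and an exponentially growing ceiling.
                    sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        return wrapper
    return decorator
```

Because the final failure re-raises the original exception, any try/except blocks you already have around the call keep working unchanged.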

AWS Go SDK’s CustomRetryer

The AWS Go SDK also has some default retry logic built-in. In addition to the defaults, it allows you to customize when to do a retry and how often to do it. Once you initialize your session with the CustomRetryer it will automatically be used.

https://medium.com/media/dae3e47296fd5df1dc1706d0c2c9072a/href

The nice thing about this approach is that it also lets you set custom logic for when it should do the retries.

Caching

Sometimes the data you query from AWS can be fairly static. For example, the KubernetesCluster or Environment tags on my Kubernetes EC2 instances never change. Instead of making an API call every time I need to know the tag values I can save them to a local file or to ElastiCache and reference them from there first. If the cached value doesn’t exist, the script can fall back to making the API calls.

https://medium.com/media/89a02a9c1aca7b9f382b9560a186d2d6/href
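As a rough illustration of the local-file variant, here is a minimal read-through cache sketch. The `get_cached` helper, its `fetch` callback and the cache file layout are all assumptions made for the example; there is no expiry because the cached values are assumed never to change.

```python
import json
import os


def get_cached(key, fetch, cache_path):
    """Return the cached value for key; on a miss call fetch(key) and persist it."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if key not in cache:
        cache[key] = fetch(key)  # the only place an AWS API call would happen
        # Write to a temp file and rename so readers never see a partial file.
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(cache, fh)
        os.replace(tmp_path, cache_path)
    return cache[key]
```

Here `fetch` would wrap whatever API call looks up the instance tags; swapping the file for ElastiCache only changes where the dictionary is stored.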

Instance Metadata API

If you haven’t been keeping up with what’s available via the local metadata API features it’s probably time for a look. With the newer instance types (e.g., the c/m/r 5-series) more data is available than there was before. It unfortunately still doesn’t have my most queried resource (tags) but it still does have useful information. For example, on a 5-series instance, you no longer have to run aws ec2 describe-instance-status to find out if there are upcoming maintenance events. Instead you can query the metadata API for that information at http://169.254.169.254/latest/meta-data/events/maintenance/scheduled. This change alone saved many thousands of API calls an hour across the fleet.
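A minimal sketch of reading that endpoint from Python follows. It assumes the endpoint returns a JSON array of event objects with a State field, and that the instance still allows token-less (IMDSv1) requests; instances that enforce IMDSv2 need a session token header first. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

EVENTS_URL = "http://169.254.169.254/latest/meta-data/events/maintenance/scheduled"


def fetch_scheduled_events(opener=urlopen, timeout=2):
    """Fetch scheduled maintenance events from the instance metadata API."""
    # Short timeout: the link-local address only answers on an EC2 instance.
    with opener(EVENTS_URL, timeout=timeout) as resp:
        return json.load(resp)


def active_events(events):
    """Keep only events whose State is 'active' (field name is an assumption)."""
    return [event for event in events if event.get("State") == "active"]
```

Since the call never leaves the instance, it doesn’t count against the account’s EC2 API rate limits.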

Requesting a limit increase

This one is kind of cheating but it is actually possible and the docs even say so. In practice though you’ll almost always get a response about doing backoffs and retries first. If you can make a good business case for why you should get a limit increase they can do that. For example, I ran into a situation where the Kubernetes external-dns provider was making too many requests per second when it was running on all of my clusters. There wasn’t a way for me to adjust it so AWS had to increase the limit (slightly) on the account.

In the end…

Unfortunately it all boils down to whether or not you should retry or if you should even make the API calls. Thankfully the approaches I described are fairly easy to implement. As you may have guessed from the stats shown we’re still getting rate-limited for some applications. This is currently a whack-an-app process where we are reducing the calls across quite a few applications a little at a time.

5 ways to handle AWS API rate-limiting was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kubernetes Cluster Autoscaler — In for a penny, in to infinity

Mike Tougeron — Wed, 11 Nov 2020 18:01:03 GMT

I had an interesting conversation with a coworker in another business unit the other week where we were talking about instance types and planning for unknown workload sizes in our Kubernetes clusters. They asked what memory-to-cpu ratios my team used to decide the instance types to run for our clusters. I had to call timeout and talk about why ratios didn’t matter because I was using the cluster-autoscaler. I realized we needed to take a step back and go over the philosophy I use when running workloads on Kubernetes. I figured it might make an interesting blog post so here we are.

Like a lot of people, the Kubernetes clusters I manage look like they were set up by kubeadm for the control-plane with multiple apiservers, etcd, etc. Maybe you run etcd on a separate tier of VMs, maybe you don’t, but in the grand scheme of things we’re all doing things pretty much the same way. Where it starts to get interesting is with the worker nodes. Some people run one instance type, others run multiple types. Some people have a fixed size of the cluster and others run the cluster-autoscaler. As you probably got from the title, I’m one of those who use the cluster-autoscaler and rely on it heavily for my workloads.

From an overly simplistic perspective, as a cluster operator I don’t really care what someone requests for memory & cpu for their applications. I care a great deal from an SRE and/or business perspective but let’s ignore those hats for the moment.

What matters is that the Scheduler can schedule the requested workload.

What matters to me is that the cluster has the resources to run the workloads that rely on it. Obviously we can’t run thousands of nodes just in case something needs it one day; that’s where the cluster-autoscaler comes in. The cluster-autoscaler watches for when Pods fail to schedule due to unavailable resources and scales the cluster nodes so that those resources become available. It can also reschedule workloads from under-utilized nodes so that the cluster scales down to a smaller size. On AWS it can do this scaling via node templates or AWS AutoScalingGroups and in my environments we heavily rely on AutoScalingGroups.

Much like my colleague, when I first created Kubernetes clusters I thought the ratio for cpu & memory was important. We were not sure what the workload requirements were going to be like so we went with the generic m5d.8xlarge instance type. This of course worked and the cluster ran fine. The problem was that we were having trouble getting the cluster utilization numbers above 10-15%. The binpacking wasn’t fitting the Pods in a way that used most of the resources for a node. I ended up spending a bunch of time working with teams to make sure they were setting appropriate resource requests/limits and educating people about how to figure that out on their own. The education portion was time well spent but overall the effort barely moved the needle. We got up to ~23% memory & 7% cpu utilization which means all that work barely made an impact.

After about a year of running this way we wanted to start running the majority of the development cluster on spot instances. The cluster costs were rising and it needed attention before it spiraled out of control. Enabling a pool of spot nodes wasn’t hard as the code already had the concept of multiple worker tiers. I had to add some information about spot pricing but otherwise it was good to go.

https://medium.com/media/2149d8bc6dd9579a086fca7a569d2c37/href

But a few days later alerts started firing that the cluster was not scaling due to a lack of spot capacity. The cluster-autoscaler was doing its job and trying to scale; there just wasn’t spot capacity of the instance type we were running. The quick & dirty solution was to add another pool of spot instances of a different size. This worked for a day or two before the alerts started again.

Should I create another node pool to fix it?

This is when I started to realize I was looking at the design of the node tiers wrong. I didn’t care if the workload ran on a c5d.4xlarge, m5d.4xlarge, or r5d.16xlarge. I just wanted it to run! So I copy/pasted some Terraform code around and I had 12 worker tiers; one for each instance type running spot. I tested this out in dev and the spot capacity problem went away.

It wasn’t long until it was noticed that the dev cluster’s memory utilization was up to 43%. After discussing it in chat for a bit with my team the light bulb went off. The cluster-autoscaler was picking different instance types based on the size of the workload that needed to be scheduled. The least-waste config option was doing the work by picking what instance type would utilize most of its resources. Sometimes it went on a m5d.4xlarge and sometimes a c5d.4xlarge based on the blended score of CPU & memory.
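To show why the choice flips between instance types, here is a simplified Python sketch of a least-waste style score. It is not the cluster-autoscaler's actual code: it scores a single pending request against whole-instance capacity (ignoring allocatable overhead and existing Pods) and assumes the score is the sum of the unused CPU and memory fractions.

```python
# vCPU count and memory (GiB) for the instance types mentioned above
INSTANCE_TYPES = {
    "c5d.4xlarge": (16, 32),
    "m5d.4xlarge": (16, 64),
    "r5d.16xlarge": (64, 512),
}


def waste_score(requested_cpu, requested_mem_gib, node_cpu, node_mem_gib):
    """Sum of the unused CPU and memory fractions; lower is better."""
    wasted_cpu = (node_cpu - requested_cpu) / node_cpu
    wasted_mem = (node_mem_gib - requested_mem_gib) / node_mem_gib
    return wasted_cpu + wasted_mem


def least_waste(requested_cpu, requested_mem_gib, instance_types=INSTANCE_TYPES):
    """Pick the instance type that fits the request with the least waste."""
    candidates = {
        name: waste_score(requested_cpu, requested_mem_gib, cpu, mem)
        for name, (cpu, mem) in instance_types.items()
        if cpu >= requested_cpu and mem >= requested_mem_gib
    }
    if not candidates:
        return None  # nothing is big enough for this request
    return min(candidates, key=candidates.get)
```

With these numbers, a request for 12 cpu and 24 GiB lands on a c5d.4xlarge, while 8 cpu and 56 GiB lands on an m5d.4xlarge because the c5d no longer fits.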

Why isn’t this being done in production?

I don’t remember who first asked the question about the production cluster’s utilization but it was quickly decided that the same approach, except using non-spot, should be done in production too. Less than 24 hours and 36 AutoScalingGroups later (12 instance types * 3 AZs) the production clusters were running the same sort of configuration. We didn’t force workloads to reschedule immediately so it took a bit of time for the impact to be seen. The charts eventually started showing over 57% memory utilization and ~25% cpu utilization. It wasn’t perfect but it was a 2–3x improvement over what was there before!

The max memory & CPU usage over an hour

Once spot workloads were added into production there were 72 AutoScalingGroups being managed by the cluster-autoscaler in each cluster. The switch to using so many AutoScalingGroups made a few shifts in thought necessary in order for it to not be overwhelming.

Treat the worker nodes as pieces of compute not cattle.

I found that it can be a different concept to treat something as a piece of compute instead of a herd of cattle. Even if each VM can be replaced, destroyed, or auto-remediated on a whim they have a type and are associated with a grouping. This makes them less ephemeral than something abstract like cpu or memory. Once the thought patterns shifted the code shifted as well. Our deployment script doesn’t need to check that the AutoScalingGroup replaced a node; instead, it can check whether or not the cluster has Pods that can’t be scheduled. No longer does the script need to check whether or not there is existing capacity for rescheduling; it can simply evict the Pods, respecting the PodDisruptionBudget, and trust that the cluster-autoscaler will scale up as necessary. As an added bonus the cluster deployment time went from around 6 hours down to 2–3 hours without noticeable impact on the running workloads.
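The "does the cluster have Pods that can’t be scheduled" check can be sketched in a few lines of Python. This is not our deployment script; it is an illustration that assumes the input is the parsed JSON from kubectl get pods --all-namespaces -o json and relies on the standard PodScheduled condition with reason Unschedulable.

```python
def unschedulable_pods(pod_list):
    """Return namespace/name for Pending pods the scheduler could not place."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for condition in status.get("conditions", []):
            # The scheduler records a failed placement as
            # PodScheduled=False with reason Unschedulable.
            if (
                condition.get("type") == "PodScheduled"
                and condition.get("status") == "False"
                and condition.get("reason") == "Unschedulable"
            ):
                metadata = pod["metadata"]
                stuck.append(f'{metadata["namespace"]}/{metadata["name"]}')
                break
    return stuck
```

A deployment script could poll this after each eviction and only move on once the list is empty again.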

I can now put back on my other hats and start caring about things like why the cpu usage is so much lower than the cpu requested.

I’m able to focus on the things that can impact the bottom line and costs of running these workloads. I can put my attention to application performance and spend my time optimizing the way the apps run on Kubernetes.

Essentially the cluster-autoscaler lets you care about the workloads and not worry about the type of compute they run on. In the end, isn’t that what’s really important?

Kubernetes Cluster Autoscaler — In for a penny, in to infinity was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Live migrating a Kubernetes cluster across VPCs without downtime

Mike Tougeron — Thu, 22 Oct 2020 18:22:35 GMT

Recently I ran into an IP conflict with another team’s Kubernetes cluster: their pod network CIDR block conflicted with the CIDR block of the VPC my cluster (as well as legacy EC2 instances) was in. My team’s cluster could talk to their cluster over VPC peering but they couldn’t talk to me the same way. We didn’t want to put any of the application ingresses on the public internet and due to internal limitations we couldn’t extend my VPC’s CIDR block. The only solution that could be found was to set up a VPC with a different CIDR block. This is easy enough to handle for the EC2 instances outside the Kubernetes cluster but live migrating a cluster without downtime was a bit of a challenge. Due to the application deployment pipeline the clusters have become pets to the engineering teams. That introduces a set of problems where spinning up & migrating to a different cluster isn’t possible without a significant time investment across many teams. Doing this migration without downtime seemed daunting but as the scope was defined it started to become a reasonable goal.

NOTE: At the last minute it was decided to not do this migration. The steps I describe below were executed several times in the lab environment without issue but it never went to the production migration stage. :(

Not all work ends up in production and this turned out to be one of those times. However, I still think the process is worth sharing. :)

What the cluster looked like

I’m a big fan of HashiCorp’s Infrastructure as Code toolset and it works well for my environment. For cluster deployments I build an AMI using Packer, deploy it to the AWS AutoScalingGroup(s) with Terraform, and then run a custom Python script to cycle the nodes. The Terraform code has 3 main components: a module (worker-common) for shared resources (security groups, LBs, DNS, etc), a module (control-plane) for the control-plane, and a module (worker-{pool}) for the worker nodes. There are around 72 AutoScalingGroups (one per instance-type per AZ plus on-demand vs spot), about 200 worker nodes, and many thousands of Pods running on the cluster I needed to migrate.

The Terraform code essentially looked like this:

https://medium.com/media/b93b68ee259d5b92403e8c842be96a46/href

The problems

With Terraform the configuration is declarative and because of assumptions made in the code I had a few problems to solve.

Can’t duplicate names for InstanceProfiles, AutoScalingGroups, and LoadBalancers
Can’t register instances to a TargetGroup that is in a different VPC
SecurityGroups can’t be used across VPCs; though they can be referenced in rules
The DNS entries were CNAMEs to the LBs not A records

Walking through the solution steps

I broke the problem into pieces and tackled each one individually. The basic order of operations was something like this. I’ll be walking through each piece below.

Create the new VPC
Setup the shared resources in both VPCs
Create additional worker nodes in the new VPCs
Route traffic to both VPCs
Route traffic to just the new VPC
Migrate the workloads to the nodes in the new VPC
Migrate the control-plane nodes to the new VPC
Clean up the old resources

Again, this seems like a lot to do but it turned out to not be as much as I had originally thought it would be. One of the things that made it easier was a rich set of tags on all our AWS resources. I was able to clearly reference them in the Terraform code and look them up with data sources accordingly.

Creating the new VPC

The first thing was creating the new VPC. This was pretty straightforward except for coming up with a name that everyone was happy with. Both VPCs were peered with each other so that everything could talk privately to each other.

https://medium.com/media/e135889b75173d2da312358866f58818/href

Setting up shared resources

Once that was created, I started to tackle the resource naming pattern and that too turned out to be pretty easy to solve. I added a new parameter to the worker-common Terraform module that allowed me to add a suffix to each resource name. By default the variable was empty so no suffix was added and the existing resources were not impacted. I could then set up a worker-common-migration module that creates the new resources in the new VPC. Because I didn’t need to move the InstanceProfile (it’s not VPC specific) I added a flag for whether or not the code should create it.

https://medium.com/media/f3f7bfeb4ebf88913b8c398771f98f0e/href

I put in a bit of a hack for the SecurityGroups so that during the migration the “shared” SecurityGroup references both the old & the new IDs. This is possible because the new VPC is peered to the old VPC.

Now I could run Terraform so the LBs & SecurityGroups would be created in the new VPC. Once the Terraform run is completed the new “common” resources exist but nothing is using them yet.

The worker nodes

Next up was creating worker nodes in the new VPC. Unlike with worker-common, I didn’t need to create a migration module for the workers. I just needed to add a new one (x24!) that referenced the newly created resources.

https://medium.com/media/40c40b9b0e942795f070e11145a883e2/href

Before I could run terraform apply though I needed to update the Ingress controller. I had to change the Service to externalTrafficPolicy: Cluster so when the ingress controller started running on the nodes in the new VPC the traffic would still be routed to them. Remember, I can’t add the new workers into the same TargetGroup as the existing ones because they are in a different VPC. This created an increase in latency for every request through the ingress controller because the NodePort had to be proxied but it was low enough, and for such a short period of time, that it was considered acceptable. After the new workers were created, I added a taint to the old AutoScalingGroups so that no new Pods would be scheduled onto them.

for node in $(aws ec2 describe-instances --filters "Name=tag:KubernetesCluster,Values=CLUSTER_NAME" "Name=vpc-id,Values=OLD_VPC_ID" | jq -r '.Reservations[].Instances[].PrivateDnsName'); do
    kubectl taint nodes "$node" migration=migration:NoSchedule
done

Now that I had new workers in the new VPC and they were all able to talk to each other in both VPCs, I was ready to cut over DNS and point it to the new LBs. So far, the total migration time has been only ~20 minutes and I’m meeting my goal of zero downtime.

Migrating the workloads

At this stage, I need to start shifting the workloads from workers in the old VPC to the workers in the new VPC. I manually evicted the Ingress controller pods which moved them onto workers in the new VPC. This allowed me to switch back to externalTrafficPolicy: Local and get the ingress latency back to normal. A couple minutes later this was done and I could start moving the live workloads. Luckily for me, our Python deployment script allows for cycling nodes based on a label filter. I kicked off the job and it started draining the old worker nodes. As each node was drained the workloads automatically shifted only onto the new worker nodes thanks to the taint I had added. Moving workloads while respecting the PodDisruptionBudgets can be slow and for the live cluster it was expected to take about 4 hours. In the lab cluster this step took ~30 minutes.

Code cleanup

While I still had the control-plane to migrate I wanted to clean up the Terraform code to start removing the dueling modules. I flipped the worker-common-migration module to use create_instance_profile = "true" and did the opposite in the original worker-common module. I then moved the resources in the Terraform state from one module to the other.

terraform state mv module.worker-common.aws_iam_instance_profile.this module.worker-common-migration.aws_iam_instance_profile.this

terraform state mv module.worker-common.aws_iam_role.readonly module.worker-common-migration.aws_iam_role.readonly

terraform state mv module.worker-common.aws_iam_role.this module.worker-common-migration.aws_iam_role.this

terraform state mv module.worker-common.aws_iam_role_policy.this module.worker-common-migration.aws_iam_role_policy.this

# etc, etc

I pointed all the worker modules to the new migration module’s output using sed.

find . -type f -name "*.tf" -not -path '*/.terraform/*' -exec gsed -i 's/module\.worker-common\.instance_profile_id/module.worker-common-migration.instance_profile_id/g' {} +

I removed the original worker-common.tf and changed the module source that the worker-common-migration.tf was pointing to.

module "worker-common-migration" {
  source = "../modules/worker-common"
  # the rest of the code is the same
}

Even though the module was now called worker-common-migration the code it is using is now the same as all of the other clusters. The parameters still point to the new VPC but the code used is the same and that’s the important part for future development & maintenance. The next Terraform run removed the old AutoScalingGroups and SecurityGroups as they were no longer needed. Now, all that is left is to do the control-plane.

The control-plane

With the control-plane things start to get tricky. I can’t have more than one node using the same etcd volume and writing to it at the same time. This means that I am going to have to stop one of the nodes, recreate it in the new VPC and then bring it live with the other nodes still running. I can only do one at a time if I don’t want to lose quorum on etcd. Because of the way the control-plane security groups were created inside of the same module that creates the control-plane nodes I wasn’t able to do the same sort of trick as I did with the workers. Instead, I used a variable called extra_security_groups that could be used to attach an extra SecurityGroup to the control-plane nodes. I broke a rule and manually created a SecurityGroup in the existing VPC that had the same rules and manually attached it to each control-plane node. This meant it was now safe for Terraform to delete the original SecurityGroup and recreate it in the new VPC.
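The one-at-a-time constraint falls out of etcd’s majority quorum. A quick sketch of the arithmetic, assuming a three-member etcd cluster (the function names are just for illustration):

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failure_tolerance(members):
    """How many members can be down while the cluster keeps quorum."""
    return members - quorum(members)


# With three members, quorum is 2, so only one node can be offline at a time.
three_node_quorum = quorum(3)
three_node_tolerance = failure_tolerance(3)
```

Stopping a second member before the first one has rejoined would drop a three-member cluster below quorum and block writes to the Kubernetes API.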

The code calling the module was then updated to point to the new VPC and subnets (e.g., data.aws_subnet.private-migration.*.id). Instead of running a general terraform apply I needed to run each migration step using Terraform’s -target flag for the resources that I wanted to migrate first.

terraform apply -target module.control-plane.aws_security_group.control-plane -target module.control-plane.aws_security_group_rule.control-plane-egress -target ... -target ... # etc etc

However, I ran into the problem of the LB that runs in front of the control-plane nodes. I needed to be able to balance across 2 different VPCs and that isn’t possible with a TargetGroup. Managing the DNS entries and changing them at precisely the right time during the migration was difficult with our code setup so I decided to break another rule and manually updated the DNS entry. I changed it from a CNAME pointing to the LB to an A record for the first control-plane node in the new VPC. This enabled me to keep full uptime on the api calls made from outside the cluster.

I removed the extra_security_groups parameter and ran another targeted apply to recreate just a single set of control-plane resources.

terraform apply -target module.control-plane.aws_autoscaling_group.this.2 -target module.control-plane.aws_autoscaling_group.etcd.2 -target module.control-plane.aws_launch_template.this.2 -target module.control-plane.data.template_file.user_data.2 # etc etc

This left me with one part of the control-plane in the new VPC and the rest in the old. Now that I had part of the control-plane running in the new VPC I could safely have Terraform, through another -target apply command, recreate the LB in the new VPC. It would have just the single apiserver node in it but that’s okay because external calls to the Kubernetes API are pretty low and it could handle the load. The internal calls use the kubernetes.default Service and are unaffected by these changes. Once the LB had been recreated, I was able to switch the DNS back to the configuration that is a CNAME to the LB.

I ran the terraform apply -target ... -target ... again for the next piece of the control-plane and that was moved as well. Rinse & repeat one more time and the control-plane was running in the new VPC!

The migration of the control-plane was a lot more manual than I prefer but it got the job done in about 30 minutes. The majority of that time was waiting on the AWS resources to be created and to come online. In general, because of the way we build our AMIs, it takes 4–7 minutes from when a node starts to boot until it is ready in the Kubernetes cluster.

More cleanup

At this stage the cluster was fully migrated but the code was messy. I made another pass through the code and got rid of all the data.aws_subnet.private-migration code. It was updated to use the new VPC only in the lookups and the references were pointed back to the original data.aws_subnet.private. The worker-common-migration name for the module in the app will always be there. Well… unless it starts to bug me too much and I do all the terraform state mv commands to move it but that seems like a lot of risk for no real value.

However, all the AutoScalingGroups still have names with the -migration suffix and that could be considered confusing if someone was looking at the AWS resources. I set up another set of workers without the asg_suffix and set a taint on the ones with -migration the same way I did during the migration. All new workloads will now go onto these new ASGs and slowly drain off the old over time or whenever the next deployment is released; there’s no benefit to cycling the cluster again now. Once all the workloads are off the -migration tier I’ll remove that code from Terraform as well.

Final thoughts

In short, this was a huge pain to go through but I’m kind of glad that I had to do it. I’m disappointed that it never went all the way through to production but that’s the way things work sometimes. I think I learned more about the way the code worked than I did writing it in the first place. Sounds strange to say that but it’s true. When I wrote most of the original code I had to think about how things related to each other but in the 2 years since then I’ve never really had to think about it.

In the future, I’m looking forward to when IPv6 is inside our VPCs and data centers. Assuming you don’t do something … unique … that’ll make IP conflicts a thing of the past. I’m not sure when we’ll end up doing this but thankfully Kubernetes has made this possible when the time comes.

Live migrating a Kubernetes cluster across VPCs without downtime was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Mental health & tech

Mike Tougeron — Sat, 10 Oct 2020 17:32:42 GMT

For a variety of reasons people in the US don’t like talking about mental health. There’s a fear that there will be negative reactions and impact from talking about it. From the simple “I’m stressed out today” to the complex “I’m feeling super depressed today” it is all something that we don’t talk about. I think that’s a load of BS and I would love to see it changed. Mental health is no different from a broken arm or a twisted knee. It’s all about your body’s overall health and there are ways in which a medical professional can help you deal with it or adapt to it. If there’s no stigma in needing a pill for low blood pressure then there isn’t one for depression.

Okay, end-rant. :)

I suffer from depression, bipolar and ADD. My mental health issues were not officially diagnosed until I was in my late 20s but in hindsight they were there all my life. I remember having good days & bad days but I always associated them with being situational and related to something that recently happened. As an adult, I learned that the frequency of good & bad days (or multiple times a day) were a sign of being bipolar.

Frequently when I talk about my mental health in a public forum I get responses like:

Be careful what you say, it might come back & hurt you.
Are you sure you should tell people that?
Just don’t say anything about me when you talk about it.

But what makes it all worth it when I put my vulnerabilities out there, is when I also hear:

Thank you, it helped me realize that I could get help too.
I wanted to let you know that I saw a doctor about it last week because of what you said.
Can I get your help today?

It also helps me remember that I’m not alone. It makes me feel good that I might have contributed to someone’s life and possibly even made it better.

What does this have to do with technology? Well, I thought I’d share some of the tech I’ve used over the years to help me manage. It may or may not help others but the more information that’s out there and the more mental health is discussed in general the better off we will all be.

RescueTime (referral link): I use RescueTime to help keep track of what I’m spending my time on at work.

Am I coding or am I visiting social media? Am I spending my time in meetings this week or did I work on something where I felt a sense of accomplishment? RescueTime’s reports help me get some good visibility into where I’ve been so that I can understand the why of where I’m at.

Fitbit: I use Fitbit to help keep track of how I’m sleeping.

A lot of people use Fitbit to track their steps & exercise but it also does a pretty good job of tracking sleep at night. I don’t sleep very well in general but some nights are worse than others. Knowing how well I slept the night before helps me better determine if I’m having a mood swing due to lack of sleep and should take a nap. Or am I cranky due to lack of sleep vs a fit of depression.

Alarm Clock: I have 3 alarms set each day for mental health.

My obsessive nature and ADD makes it easy for me to lose track of time. Using these basic alarms I remind myself to self-care and check-in on my current state of well-being.

Other recommendations: I haven’t personally used these tools (yet!) but they came highly recommended from people I trust & respect. The tools look useful so I’m sharing them too. The first is Mindfulness Coach from the VA and the other is Talkspace. If you use either of them let me know what you think.

Hopefully this will help someone feel more comfortable with themselves and feel like they’re not alone. Mental health is important and like anything else it takes care, feeding & attention to maintain a consistent level of health. The more we talk & share our challenges, successes and general state the better the world will be.

If you ever need to talk or want to know more about my experiences, feel free to reach out to me and I’ll be there to listen.

Mental health & tech was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Initial thoughts on Kubecost

Mike Tougeron — Mon, 05 Oct 2020 15:26:17 GMT

Recently I had the opportunity to install Kubecost on several of the AWS clusters I manage. The tldr is that it was a very helpful and useful system. But to be honest, my initial thoughts were leaning towards the negative until I got it all set up. IMO, like a lot of start-up products, the documentation isn’t the greatest. I felt kind of overwhelmed by what needed to be done and the names of the project vs the docs vs the GitHub repo didn’t exactly match up (kubecost vs cost-model vs cost-analyzer).

BUT, and this is pretty huge, the Kubecost team was great to work with. They got on a video call, walked us through what I was doing wrong, and helped bridge the gap around what I didn’t understand. And they have a Slack channel to help as well. Once I understood how the components worked together I was good to go. I sent them my feedback about the documentation and hopefully that’ll help the next person who comes along.

As part of my setup, I kept Kubecost isolated from the system Prometheus data so I had to set up a dedicated Prometheus. This meant that even though Kubecost was set up and running, the data wasn’t very useful until there was a day or two’s worth loaded into the system. I didn’t spend the time to use the exact pricing (RIs, spot, etc); the default pricing model was sufficient for this POC. Once I had the data history the usage patterns instantly became clear. I knew within minutes what namespaces I needed to look at in order to cost-optimize.

At a glance I knew that there was no way that nginx-ingress should account for 15% of a cluster’s costs.

A screenshot of reporting data from the Kubecost dashboard

When I looked at the deployment I saw it was requesting 10cpu and 3 pods for an internal cluster that barely gets any traffic. I adjusted the resources requested and the cluster autoscaler quickly reduced the number of workers running. I already had a win and I was just getting started!

Diving deeper into a different cluster I found that a staging namespace was costing more than the production namespace by more than 3x. Another quick click and I saw that the namespace was using 17TB of disk!

Screenshot from the namespace specific details

Turns out there was some performance testing done where the number of replicas was increased super high but when the test was stopped the PVCs were overlooked and not cleaned up.

Some of the applications are pretty dialed-in with resource requests & limits while others are not. Thankfully, Kubecost was able to help with this as well.

Using the namespace details dashboard I was able to see Kubecost’s resource recommendations vs what was set in the manifests. In this screenshot I was able to identify that a Thanos sidecar for Prometheus was misconfigured with 30Gi instead of 30Mi of memory. 30Gi of memory for each Prometheus pod across a dozen or more clusters and you’re talking about some real money.
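To put a number on how far off that one-letter typo is, here is a small Python sketch that converts Kubernetes-style binary memory quantities to bytes. It only handles the Ki/Mi/Gi/Ti suffixes and whole numbers, which is an intentional simplification of the real quantity format.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_bytes(quantity):
    """Convert a quantity such as '30Mi' or '30Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


def overrequest_factor(actual, intended):
    """How many times larger the actual request is than the intended one."""
    return memory_bytes(actual) / memory_bytes(intended)


# 30Gi instead of 30Mi is a 1024x over-request per sidecar.
sidecar_factor = overrequest_factor("30Gi", "30Mi")
```

That is 1024 times the intended memory reserved for every Prometheus pod, which is why it showed up so clearly in the recommendations.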

One of the features I like is the ability to allocate “shared” costs proportionally across all applications. We run an ELK (with fluentd) stack for the container logs as a system service that all teams can use. We wouldn’t otherwise be running these Kubernetes-specific ELK stacks so we consider them part of the cost of doing business on the platform.

Settings page

With Kubecost I’m able to distribute the costs of the control-plane and ELK, based on their labels, to each of the applications using the Kubernetes platform. I considered allocating the kube-system namespace the same way but I want to know how much these system components & daemonsets were costing us to run.
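Proportional distribution is simple enough to sketch by hand. The Python below is only an illustration of the idea, not Kubecost’s implementation, and the application names and dollar amounts used with it are made up: each application picks up a share of the shared cost in proportion to its own direct cost.

```python
def allocate_shared_cost(app_costs, shared_cost):
    """Add each app's proportional share of shared_cost to its direct cost."""
    total = sum(app_costs.values())
    if total == 0:
        return dict(app_costs)  # nothing to weight by, so leave costs as-is
    return {
        app: cost + shared_cost * cost / total
        for app, cost in app_costs.items()
    }


# Hypothetical monthly numbers: two apps plus $200 of control-plane and ELK.
example = allocate_shared_cost({"checkout": 300, "search": 100}, 200)
```

Here checkout carries 75% of the direct spend, so it also picks up 75% of the shared $200.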

To set up Kubecost, I used Helm and the cost-analyzer-helm-chart chart from Kubecost. Because I was planning on running this in several clusters (at least 7) I created a wrapper-chart where I could set custom default values for all my deployments as well as a few custom resources for our setup. I kept the default values for the resource requests & limits but that was a mistake. For my larger clusters I was constantly being evicted & OOM’d. If you can, deploy it without limits first to see what’s actually being used and then set the requests/limits accordingly. In the wrapper-chart we also create our Ingress and cert-manager certificates. We create the Ingress in the wrapper so that we can use a shared library chart we have. Lastly we try and set as many of the kubecostProductConfigs variables as possible; though we found some were not able to be set via the Helm chart. One thing to watch out for with this chart is that the Kubecost team modified the values.yaml for the sub-charts distributed with it.

The good:

Easy to use once set up
Provides quick insights into where you’re spending $$
Able to distribute shared costs proportionally across applications

The bad:

Can be hard to find what you need in the documentation

Verdict:

Install it and try it! If your experience is anything like mine, the cost savings will more than make up for the price.

Initial thoughts on Kubecost was originally published in GrepMyMind on Medium, where people are continuing the conversation by highlighting and responding to this story.