Stories by Kevin Yang on Medium

How we learned to improve Kubernetes CronJobs at Scale (Part 2 of 2)

Kevin Yang — Wed, 05 Aug 2020 19:06:22 GMT

Fr. Dougal McGuire / CC BY-SA (https://creativecommons.org/licenses/by-sa/2.0) https://commons.wikimedia.org/wiki/File:Bultaco_engine_exploded_view.jpg

This is Part 2 of a two-part blog series on Improving Kubernetes CronJobs at Lyft. If you haven’t already, check out Part 1.

It became clear that Kubernetes CronJobs out-of-the-box were not going to be an easy-to-use, drop-in replacement for running repeated, scheduled tasks. If we wanted to move all of Lyft’s crons onto Kubernetes confidently, we needed to address not only the technical shortcomings of CronJobs, but also the human experience of using them. Namely, we needed to:

Listen to our developers to understand what questions they wanted answered about their crons:

“Did my cron run?” (“Did the application code execute?”)
“Did it run successfully?”
“How long did the cron take to execute?” (“How long did it take the application code to execute?”)

Scale platform support by making Kubernetes CronJobs easier to reason about, their life cycles well-understood, and the platform / application boundary clear.

Instrument our platform with built-in metrics and alerts to reduce the amount of bespoke alarm configurations and duplicated cron wrapper scripts that developers need to write and maintain.

Build tooling to make it easy to not only recover from failures but test new CronJob configuration as well.

Fix long-standing, technical issues in Kubernetes like the TooManyMissedStarts bug that require manual intervention to remedy and cause an important failure scenario (when startingDeadlineSeconds is missed) to silently fail.

Solution

We solved these problems by:

Exposing observability that not only enables developers to debug their CronJobs, but allows platform engineers to define and monitor Service Level Objectives (SLOs) as well.
Adding a tool to make ad hoc invocations of a CronJob easy in our Kubernetes stack.
Fixing those long-standing issues inside Kubernetes itself.

CronJob Metrics and Alerts

An example of a dashboard generated by the platform to monitor a particular CronJob

We instrumented our Kubernetes CronJob stack with the following metrics that are emitted for all CronJobs at Lyft:

started.count — This is a counter that is incremented specifically when the application container of a CronJob invocation starts for the first time. This answers the question: “Did the application code execute?”

{success, failure}.count— These are counters that are incremented when a given CronJob invocation reaches a terminal state (when a Job has finished running and the jobcontroller no longer tries to execute it). These answer the question: “Did it run successfully?”

scheduling-decision.{invoke, skip}.count — These are counters that expose the decisions the cronjobcontroller makes when invoking a CronJob. In particular, skip.count helps answer: “Why is my cron not running?” and is parametrized by the following reason labels:

reason = concurrencyPolicy — The cronjobcontroller skipped invoking a CronJob because doing so would be a violation of its ConcurrencyPolicy .
reason = missedDeadline — The cronjobcontroller skipped invoking a CronJob because it has missed the invocation window defined by .spec.startingDeadlineSeconds .
reason = error — This is a catch-all for other errors encountered when trying to invoke a CronJob.

app-container-duration.seconds — This is a timer that measures the wall-time of the application container. It answers the question: “How long did it take the application code to execute?” This timer deliberately does not include the time taken to schedule a Pod, start up sidecars, etc. Those steps are owned by the platform team and are covered by start delay instead.

start-delay.seconds — This is a timer that measures start delay. This metric, when aggregated across the platform, enables platform engineers to not only quantify, monitor, and tune the performance of the platform, but also begin to define SLOs for things like start delay and maximum cron schedule frequency.
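To make the scheduling-decision counters concrete, here is a minimal sketch of how a controller-like loop might choose between invoke and skip and record the reason. It is an illustration, not Lyft's patch or the upstream cronjobcontroller: the `decide` function, the in-memory `counters` store, and the order of the checks are assumptions, and the catch-all `error` reason is not modeled.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical in-memory stand-in for a stats client; keys are (metric, reason).
counters: Counter = Counter()

INVOKE = "scheduling-decision.invoke.count"
SKIP = "scheduling-decision.skip.count"


def decide(scheduled: datetime, now: datetime, *, job_running: bool,
           concurrency_policy: str = "Allow",
           starting_deadline_seconds: Optional[int] = None) -> str:
    """Return "invoke" or "skip" and increment the matching counter."""
    # The invocation window closed before the controller got to this CronJob.
    if (starting_deadline_seconds is not None
            and now - scheduled > timedelta(seconds=starting_deadline_seconds)):
        counters[(SKIP, "missedDeadline")] += 1
        return "skip"
    # A previous invocation is still running and overlapping runs are forbidden.
    if job_running and concurrency_policy == "Forbid":
        counters[(SKIP, "concurrencyPolicy")] += 1
        return "skip"
    counters[(INVOKE, "")] += 1
    return "invoke"


t0 = datetime(2020, 6, 22, 0, 0, 0)
on_time = decide(t0, t0 + timedelta(seconds=5), job_running=False)
too_late = decide(t0, t0 + timedelta(seconds=90), job_running=False,
                  starting_deadline_seconds=60)
overlapping = decide(t0, t0 + timedelta(seconds=5), job_running=True,
                     concurrency_policy="Forbid")
```

Because every skip carries a reason label, a single counter can answer both "did it run?" and "why not?" without any knowledge of the cron schedule.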

With these metrics, we were then able to create default alerts that notified developers when:

Their CronJob did not run when it was supposed to (rate(scheduling-decision.skip.count) > 0)
Their CronJob failed (rate(failure.count) > 0)

Developers no longer need to maintain their own alerts and metrics for crons on Kubernetes as the platform provides them built-in.

Ad-hoc Cron Run

We adapted kubectl create job test-job --from=cronjob/ into our internal CLI tool, which every engineer at Lyft uses to interact with their services on Kubernetes. This makes it simple to invoke a CronJob ad hoc in order to:

Recover from intermittent CronJob failures.
Reproduce and debug run-time failures at a more convenient time than 3:00 AM, when you can inspect CronJob, Job, and Pod events in real time, instead of trying to catch the failure in the act.
Test run-time configuration when developing a new CronJob or migrating an existing Unix cron without waiting for the cron schedule to pass by.
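Conceptually, an ad hoc run copies the Job template out of the CronJob and submits it as a one-off Job. The sketch below shows that transformation on plain dictionaries. It is not Lyft's CLI: the helper name `job_from_cronjob` and the sample manifest are hypothetical, and the `cronjob.kubernetes.io/instantiate` annotation is included on the assumption that it mirrors what kubectl sets for manually created Jobs.

```python
import copy
from typing import Any, Dict

# Assumed to match the marker kubectl adds to manually instantiated Jobs.
MANUAL_ANNOTATION = "cronjob.kubernetes.io/instantiate"


def job_from_cronjob(cronjob: Dict[str, Any], job_name: str) -> Dict[str, Any]:
    """Build a one-off Job manifest from a CronJob's spec.jobTemplate."""
    template = cronjob["spec"]["jobTemplate"]
    metadata = copy.deepcopy(template.get("metadata", {}))
    metadata["name"] = job_name
    metadata["namespace"] = cronjob["metadata"].get("namespace", "default")
    metadata.setdefault("annotations", {})[MANUAL_ANNOTATION] = "manual"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": metadata,
        # Deep copy so that edits to the Job never leak back into the CronJob.
        "spec": copy.deepcopy(template["spec"]),
    }


# Hypothetical CronJob manifest used only for illustration.
example_cronjob = {
    "metadata": {"name": "nightly-report", "namespace": "reports"},
    "spec": {
        "schedule": "0 0 * * *",
        "jobTemplate": {
            "metadata": {"labels": {"app": "nightly-report"}},
            "spec": {"backoffLimit": 2, "template": {"spec": {"containers": []}}},
        },
    },
}

test_job = job_from_cronjob(example_cronjob, "test-job")
```

The resulting Job runs with the same Pod spec as a scheduled invocation, which is what makes it useful for reproducing failures.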

Fixing TooManyMissedStarts

We fixed the TooManyMissedStarts bug so that CronJobs would no longer get “stuck” after 100 missed starts in a row. In addition to removing the need for manual intervention, this patch also allowed us to actually monitor when startingDeadlineSeconds is exceeded. Hats off to Vallery Lancey for designing and writing this patch, as well as Tom Wanielista for helping come up with the algorithm. There is currently an open PR in Kubernetes to upstream this patch.

Implementation Deep Dive: Cron Monitoring

Where in the life cycle of Kubernetes CronJobs we have added instrumentation to emit metrics

Alerts that don’t depend on cron schedules

One of the tricky parts about implementing an alert on missed cron invocations is dealing with cron schedules (crontab.guru is tremendously helpful for deciphering these!). For example, consider a cron schedule like:

# At every 5th minute

*/5 * * * *

To instrument this cron, you might increment a counter metric every time the cron finishes (or use a cron wrapper). In your alerting system, you would then write a conditional query that says, “Look back through a 60 minute window, and alert me if the counter increased by less than 12”. Problem solved, right?

What if instead you had a cron schedule like:

# At minute 0 past every hour from 9 through 17 on every day-of-week from Monday through Friday.

# in other words, “business hours” (9–5, Mon-Fri)

0 9-17 * * 1-5

Now you need to get fancy with your query, or maybe your alerting system has some features that allow you to only be alerted during “business hours”. Regardless, these examples illustrate that coupling the cron schedule to the alert definition has several downsides:

1. Changing the cron schedule means changing the alert.
2. Some cron schedules require complex time series queries to replicate.
3. Crons that don’t start exactly on time will require some amount of “grace period” to be built in to the query to minimize false positives.

#2 alone makes generating default alerts for all crons on a platform a very difficult task, and #3 is especially pertinent to distributed platforms like Kubernetes CronJob, where start delay is non-negligible. Alternatively, there are solutions that use dead man’s switches, which still require coupling the cron schedule to the alert, and/or anomaly detection algorithms, which require learning expectations over time and thus don’t work immediately for new CronJobs or for changes to a cron schedule.
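To see why a schedule-coupled alert is awkward, consider what the alerting query has to know: the number of runs expected in its lookback window. The sketch below counts them with a deliberately simplified cron matcher. The function names are illustrative, only `*`, `*/n`, `a-b`, single values, and comma lists are handled, and cron's special rule for combining day-of-month with day-of-week is ignored.

```python
from datetime import datetime, timedelta


def field_matches(expr: str, value: int) -> bool:
    """Match one cron field: '*', '*/n', 'a-b', 'a', or a comma list of those."""
    for part in expr.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def expected_runs(schedule: str, start: datetime, end: datetime) -> int:
    """Count the minutes in [start, end) at which `schedule` fires."""
    minute, hour, dom, month, dow = schedule.split()
    t = start.replace(second=0, microsecond=0)
    if t < start:
        t += timedelta(minutes=1)
    count = 0
    while t < end:
        cron_dow = (t.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
        if (field_matches(minute, t.minute) and field_matches(hour, t.hour)
                and field_matches(dom, t.day) and field_matches(month, t.month)
                and field_matches(dow, cron_dow)):
            count += 1
        t += timedelta(minutes=1)
    return count


monday_10am = datetime(2020, 6, 22, 10, 0)   # a Monday
saturday_10am = datetime(2020, 6, 27, 10, 0)
one_hour = timedelta(hours=1)

every_five = expected_runs("*/5 * * * *", monday_10am, monday_10am + one_hour)
business_mon = expected_runs("0 9-17 * * 1-5", monday_10am, monday_10am + one_hour)
business_sat = expected_runs("0 9-17 * * 1-5", saturday_10am, saturday_10am + one_hour)
```

The same 60-minute window expects 12 runs for the first schedule, 1 for the second on a weekday, and 0 on a Saturday, so a fixed threshold such as "fewer than 12" cannot be reused across schedules.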

Another way of looking at the problem is to ask: what does it mean when a cron is supposed to run but hasn’t?

In Kubernetes, barring bugs in the cronjobcontroller or your control-plane being down (the latter of which should be very obvious if you are monitoring your cluster correctly), this means that the cronjobcontroller evaluated the CronJob, determined (according to the cron schedule) that it needed to be invoked, yet still deliberately chose not to.

Sound familiar? This is exactly what our scheduling-decision.skip.count metric captures! Now, we only need to check for changes in rate(scheduling-decision.skip.count) in order to alert someone that a CronJob was supposed to run but hasn’t.

This solution decouples the cron schedule from the alert itself, which yields several advantages:

No need to re-configure alerts when cron schedules change.
No complex time series queries and conditionals.
Easy to generate default alerts for all CronJobs on the platform.

This, combined with the other time series and alerts mentioned previously, helps paint a more complete and easier-to-understand picture of a CronJob’s state.
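A schedule-independent alert reduces to one question: did the skip counter move during the lookback window? Here is a minimal sketch, assuming the alerting system hands over the cumulative counter samples for the window in time order. The function name is illustrative and counter resets are not handled.

```python
from typing import Sequence


def should_alert(skip_counter_samples: Sequence[int]) -> bool:
    """Alert if the cumulative skip counter increased anywhere in the window."""
    pairs = zip(skip_counter_samples, skip_counter_samples[1:])
    return any(later > earlier for earlier, later in pairs)


quiet_window = should_alert([3, 3, 3, 3])
missed_run = should_alert([3, 3, 4, 4])
```

Nothing in this check refers to the cron schedule, so the same rule can be generated for every CronJob on the platform.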

Implementing the start delay timer

Due to the complex nature of CronJob life cycles, we needed to be precise about where in our stack we added instrumentation in order to accurately measure this metric. This boiled down to capturing 2 timestamps:

T1: When the cron is expected to run (as dictated by the cron schedule).
T2: When the application code actually begins executing.

Then, start delay = T2 - T1. For T1, we added some code to the cron invocation logic in the cronjobcontroller itself to write the expected start time as a .metadata.Annotation on Job objects that the cronjobcontroller creates when invoking a CronJob. Then we can consume it directly from any API client by issuing a basic GET Job request.

T2 is trickier to get right. Because we are interested in capturing the tightest bound of start delay, we want T2 to be the first time the application container starts running. If instead we recorded T2 at any application container start (including restarts), then our start delay would include application code execution time as well. To accomplish this, when we detected that the application container for a given Job transitioned to Running for the first time, we wrote another .metadata.Annotation to the Job object, essentially creating a distributed lock so that future application container starts for a given Job would be ignored, and only the timestamp of the first start would be recorded.
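The set-once behavior for T2 can be sketched as follows. The annotation keys are hypothetical, since the article does not name the real ones, and a plain dictionary stands in for the Job's annotations. Against a real API server the write would also need to be safe under concurrent updates, for example by relying on Kubernetes' resourceVersion-based optimistic concurrency.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical annotation keys, used only for illustration.
EXPECTED_START = "example.com/expected-start"    # T1, written at invocation time
FIRST_APP_START = "example.com/first-app-start"  # T2, written at most once


def record_first_start(annotations: Dict[str, str], started_at: datetime) -> bool:
    """Record T2 only if no earlier start exists. Returns True if written."""
    if FIRST_APP_START in annotations:
        return False  # a restart: keep the original timestamp
    annotations[FIRST_APP_START] = started_at.isoformat()
    return True


def start_delay_seconds(annotations: Dict[str, str]) -> Optional[float]:
    """Return T2 - T1 in seconds, or None if either timestamp is missing."""
    if EXPECTED_START not in annotations or FIRST_APP_START not in annotations:
        return None
    t1 = datetime.fromisoformat(annotations[EXPECTED_START])
    t2 = datetime.fromisoformat(annotations[FIRST_APP_START])
    return (t2 - t1).total_seconds()


job_annotations = {
    EXPECTED_START: datetime(2020, 6, 22, 0, 0, 0, tzinfo=timezone.utc).isoformat(),
}
first = record_first_start(
    job_annotations, datetime(2020, 6, 22, 0, 0, 22, tzinfo=timezone.utc))
restart = record_first_start(
    job_annotations, datetime(2020, 6, 22, 0, 5, 0, tzinfo=timezone.utc))
delay = start_delay_seconds(job_annotations)
```

The restart at 00:05:00 is ignored, so the measured start delay stays at the 22 seconds between the expected start and the first application container start.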

Impact

Since rolling out these features and bug fixes, we’ve received a lot of positive feedback from our developers. To summarize, developers using our Kubernetes CronJob platform:

No longer need to maintain their own bespoke monitoring and alerts.
Can have high confidence that their CronJobs are working, and will be alerted when they are not.
Can easily recover from failures and test new CronJobs in this environment using our ad-hoc CronJob invoking tool.
Can understand the performance of their application code (using the app-container-duration.seconds timer metric).

Additionally, platform engineers now have another dimension (start delay) for measuring the user experience and performance of the platform.

Finally (and perhaps the biggest win), by building richer observability to make CronJob state easier to reason about, developers and platform engineers can now debug issues together while looking at the same data, and more often than not, developers can diagnose and solve issues entirely by themselves using the tools the platform provides.

Conclusion

Orchestrating distributed, scheduled tasks is hard. Kubernetes CronJobs are just one of many ways of doing so. While they are far from perfect, CronJobs can work at scale if you are willing to invest the time and effort into adding observability, understanding how they can fail, and building what your users need.

Lyft is hiring! If you’re passionate about Kubernetes and building infrastructure platforms, read more about them on our blog and join our team!

Note: There is an open Kubernetes Enhancement Proposal (KEP) to address the shortcomings of CronJobs and graduate them to GA.

Big thanks to Rithu John, Scott Lau, Scarlett Perry, Julien Silland, and Tom Wanielista for their help in reviewing this blog series.

How we learned to improve Kubernetes CronJobs at Scale (Part 2 of 2) was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we learned to improve Kubernetes CronJobs at Scale (Part 1 of 2)

Kevin Yang — Mon, 03 Aug 2020 18:51:24 GMT

At Lyft, we chose to move our server infrastructure onto Kubernetes, a distributed container orchestration system, in order to take advantage of automation, have a solid platform we can build upon, and lower overall cost with efficiency gains.

Distributed systems can be difficult to reason about and understand, and Kubernetes is no exception. Despite the many benefits of Kubernetes, we discovered several pain points while adopting Kubernetes’ built-in CronJob as a platform for running repeated, scheduled tasks. In this two-part blog series, we will dive deep into the technical and operational shortcomings of Kubernetes CronJob at scale and share what we did to overcome them.

Part 1 (this article) of this series discusses in detail the shortcomings we’ve encountered using Kubernetes CronJob at Lyft. In part 2, we share what we did to address these issues in our Kubernetes stack to improve usability and reliability.

Who is this for?

Users of Kubernetes CronJob
Anyone building a platform on top of Kubernetes
Anyone interested in running distributed, scheduled tasks on Kubernetes
Anyone interested in learning about Kubernetes usage at scale in the real-world
Kubernetes contributors

What will you gain from reading this?

Insight into how parts of Kubernetes (in particular, CronJob) behave at scale in the real-world.
Lessons learned from using Kubernetes as a platform at a company like Lyft, and how we addressed the shortcomings.

Prerequisites

Basic familiarity with the cron concept
Basic understanding of how CronJob works, specifically the relationship between the CronJob controller, the Jobs it creates, and their underlying Pods, in order to better understand the CronJob deep-dives and comparisons with Unix cron later in this article.
Familiarity with the sidecar container pattern and what it is used for. At Lyft, we make use of sidecar container ordering to make sure that runtime dependencies like Envoy, statsd, etc., packaged as sidecar containers, are up and running prior to the application container itself.

Background & Terminology

The cronjobcontroller is the piece of code in the Kubernetes control-plane that reconciles CronJobs
A cron is said to be invoked when it is executed by some machinery (usually in accordance with its schedule)
Lyft Engineering operates on a platform infrastructure model where there is an infrastructure team (henceforth referred to as platform team, platform engineers, or platform infrastructure) and the customers of the platform are other engineers at Lyft (henceforth referred to as developers, service developers, users, or customers). Engineers at Lyft own, operate, and maintain what they build, hence “operat-” is used throughout this article.

CronJobs at Lyft

Today at Lyft, we run nearly 500 cron tasks with more than 1500 invocations per hour in our multi-tenant production Kubernetes environment.

Repeated, scheduled tasks are widely used at Lyft for a variety of use cases. Prior to adopting Kubernetes, these were executed using Unix cron directly on Linux boxes. Developer teams were responsible for writing their crontab definitions and provisioning the instances that run them using the Infrastructure As Code (IaC) pipelines that the platform infrastructure team maintained.

As part of a larger effort to containerize and migrate workloads to our internal Kubernetes platform, we chose to adopt Kubernetes CronJob* to replace Unix cron as a cron executor in this new, containerized environment. Like many, we chose Kubernetes for many of its theoretical benefits, one of which is efficient resource usage.

Consider a cron that runs once a week for 15 minutes. In our old environment, the machine running that cron is sitting idle 99.85% of the time. With Kubernetes CronJob, compute resources (CPU, memory) are only used during the lifetime of a cron invocation. The rest of the time, Kubernetes can efficiently use those resources to run other CronJobs or scale down the cluster altogether. Given the previous method for executing cron tasks, there was much to gain by transitioning to a model where jobs are made ephemeral.
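The 99.85% figure follows directly from the number of minutes in a week, as this short calculation shows (the variable names are illustrative):

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes
busy_minutes = 15               # one 15-minute run per week

idle_fraction = 1 - busy_minutes / MINUTES_PER_WEEK
idle_percent = round(idle_fraction * 100, 2)
```

Fifteen minutes is roughly 0.15% of a week, which leaves the dedicated machine idle for the remaining 99.85%.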

The platform and developer ownership boundary in Lyft’s K8s stack

Since adopting Kubernetes as a platform, developer teams no longer provision and operate their own compute instances. Instead, the platform engineering team is responsible for maintaining and operating the compute resources and runtime dependencies used in our Kubernetes stack, as well as generating the Kubernetes CronJob objects themselves. Developers need only configure their cron schedule and application code.

This all sounds good on paper, but in practice, we discovered several pain points in moving crons away from the well-understood environment of traditional Unix cron to the distributed, ephemeral environment of Kubernetes using CronJob.

* While CronJob was, and still is (as of Kubernetes v1.18), a beta API, we found that it fit the bill for the requirements we had at the time, and further, it fit in nicely with the rest of the Kubernetes infrastructure tooling we had already built.

What’s so different about Kubernetes CronJob (versus Unix cron)?

A simplified sequence of events and K8s software components involved in executing a Kubernetes CronJob

To better understand why Kubernetes CronJobs can be difficult to work with in a production environment, we must first discuss what makes CronJob different. Kubernetes CronJobs promise to run like cron tasks on a Linux or Unix system; however, there are a few key differences in their behavior compared to a Unix cron: Startup Performance and Failure handling.

Startup Performance

We begin by defining start delay to be the wall time from expected cron start to application code actually executing. That is, if a cron is expected to run at 00:00:00, and the application code actually begins execution at 00:00:22, then the particular cron invocation has a start delay of 22 seconds.

Traditional Unix crons experience very minimal start delay. When it is time for a Unix cron to be invoked, the specified command just runs. To illustrate this, consider the following cron definition:

# run the date command at midnight every night

0 0 * * * date >> date-cron.log

With this cron definition, one can expect the following output:

# date-cron.log

Mon Jun 22 00:00:00 PDT 2020

Tue Jun 23 00:00:00 PDT 2020

On the other hand, Kubernetes CronJobs can experience significant start delays because they require several events to happen prior to any application code beginning to run. Just to name a few:

1. cronjobcontroller processes and decides to invoke the CronJob
2. cronjobcontroller creates a Job out of the CronJob’s Job spec
3. jobcontroller notices the newly created Job and creates a Pod
4. Kubernetes admission controllers inject sidecar Container specs into the Pod spec*
5. kube-scheduler schedules the Pod onto a kubelet
6. kubelet runs the Pod (pulling all container images)
7. kubelet starts all sidecar containers*
8. kubelet starts the application container*

* unique to Lyft’s Kubernetes stack

At Lyft, we found that start delay was especially compounded by #1, #5, and #7 once we reached a certain scale of CronJobs in our Kubernetes environment.

Cronjobcontroller Processing Latency

To better understand where this latency comes from, let’s dive into the source code of the built-in cronjobcontroller. Through Kubernetes 1.18, the cronjobcontroller simply lists all CronJobs every 10 seconds and does some controller logic over each. The cronjobcontroller implementation does so synchronously, issuing at least 1 additional API call for every CronJob. When the number of CronJobs exceeds a certain amount, these API calls begin to be rate-limited client-side. The latencies from the 10 second polling cycle and API client rate-limiting add up and contribute to a noticeable start delay for CronJobs.
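A back-of-the-envelope model shows how the two sources of latency combine. This is a rough sketch rather than a measurement: the function name is illustrative, the `client_qps` value of 20 is a hypothetical rate limit chosen for the example, and the model ignores burst allowances and the time spent on the controller logic itself.

```python
def worst_case_controller_delay(num_cronjobs: int,
                                poll_interval_s: float = 10.0,
                                calls_per_cronjob: int = 1,
                                client_qps: float = 20.0) -> float:
    """Rough upper bound on the delay added before the last CronJob in a pass.

    Assumes the CronJob became due just after a poll started (so it waits a
    full interval) and is handled last in a pass whose API calls are spaced
    out by the client-side rate limiter.
    """
    api_time_s = num_cronjobs * calls_per_cronjob / client_qps
    return poll_interval_s + api_time_s


small_cluster = worst_case_controller_delay(20)
large_cluster = worst_case_controller_delay(500)
```

Under these assumptions the controller adds at most about 11 seconds with 20 CronJobs but about 35 seconds with 500, which is the scale mentioned earlier in this article.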

Scheduling Cron Pods

Due to the nature of cron schedules, most crons are expected to run at the top of the minute (XX:YY:00). For example, an @hourly cron is expected to execute at 01:00:00, 02:00:00, and so on. In a multi-tenant cron platform with lots of crons scheduled to run every hour, every 15 minutes, every 5 minutes, etc., this produces hot-spots where lots of crons need to be invoked simultaneously. At Lyft, we noticed that one such hot-spot is the top of the hour (XX:00:00). These hot-spots can put strain on control-plane components involved in the happy-path of CronJob execution, like the kube-scheduler and kube-apiserver, and expose additional client-side rate-limiting in them, causing start delay to increase noticeably.
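The hot-spot effect is easy to reproduce by counting how many crons fire at each minute of an hour. The fleet below is hypothetical and the function name is illustrative; intervals are assumed to divide 60 evenly.

```python
from collections import Counter
from typing import Iterable


def starts_per_minute(intervals_minutes: Iterable[int]) -> Counter:
    """Count how many crons fire at each minute (0-59) of a single hour."""
    load: Counter = Counter()
    for interval in intervals_minutes:
        for minute in range(0, 60, interval):
            load[minute] += 1
    return load


# Hypothetical fleet: 100 hourly crons, 50 every 15 minutes, 20 every 5 minutes.
fleet = [60] * 100 + [15] * 50 + [5] * 20
load = starts_per_minute(fleet)
top_of_hour = load[0]
quarter_past = load[15]
five_past = load[5]
```

Every interval lines up at minute 0, so the top of the hour carries the whole fleet (170 starts) while minute 15 sees 70 and minute 5 only 20.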

Additionally, if you do not provision compute for peak demand (and/or use a cloud-provider for compute instances) and instead use something like cluster autoscaler to dynamically scale nodes, then node launch times can contribute additional delays to launching CronJob Pods.

Pod Execution: Non-application Containers

Once a CronJob Pod has successfully scheduled onto a kubelet, the kubelet needs to pull and execute the container images of all sidecars and the application itself. Due to the way Lyft uses sidecar ordering to gate application containers, if any of these sidecar containers are slow to start, or need to be restarted, they will propagate additional start delay.

To summarize, each of these events that happen prior to application code actually executing combined with the scale of CronJobs in a multi-tenant environment can introduce noticeable and unpredictable start delay. As we will see later on, this start delay can negatively affect the behavior of a CronJob in the real-world by causing CronJobs to miss runs.

Container Failure handling

It is good practice to monitor the execution of crons. With Unix cron, doing so is fairly straight-forward. Unix crons interpret the given command with the specified $SHELL, and, when the command exits (whether successful or not), that particular invocation is done. One rudimentary way of monitoring a Unix cron then is to introduce a command-wrapper script like so:

https://medium.com/media/e83d72d76221ce17bd65c607298f1bf0/href

With Unix cron, stat-and-log will be executed exactly once per complete cron invocation, regardless of the $exitcode. One can then use these metrics for simple alerts on failed executions.
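The embedded script is not reproduced in this text, so here is a minimal sketch of the same idea, written in Python rather than shell. The `stat_and_log` body and the `run_wrapped` helper are hypothetical stand-ins: they run a command once, capture its exit code, and report exactly once per invocation.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def stat_and_log(exit_code: int, duration_s: float) -> None:
    """Hypothetical stand-in for the stat-and-log step: emit one record."""
    outcome = "success" if exit_code == 0 else "failure"
    print(f"cron.{outcome} exit_code={exit_code} duration_s={duration_s:.3f}")


def run_wrapped(command: Sequence[str],
                report: Callable[[int, float], None] = stat_and_log) -> int:
    """Run `command` once, then report its exit code and wall time once."""
    started = time.monotonic()
    exit_code = subprocess.run(command).returncode
    report(exit_code, time.monotonic() - started)
    return exit_code


# Example: wrap a command that exits with status 3.
failing_command = [sys.executable, "-c", "raise SystemExit(3)"]
```

Calling `run_wrapped(failing_command)` reports a single failure record and returns the command's exit code, which is the once-per-invocation behavior described above.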

With Kubernetes CronJob, where there are retries on failures by default and an execution can have multiple failure states (Job failure and container failure), monitoring is not as straightforward.

Using a similar script in an application container and with Jobs configured to restart on failure, a CronJob will instead repeatedly execute and spew metrics and logs up to a BackoffLimit number of times on failure, introducing lots of noise to a developer trying to debug it. Additionally, a naive alert using the first failure from the wrapper script can be un-actionable noise as the application container may recover and complete successfully on its own.

Alternatively, you could alert at the Job level instead of the application container level using an API-layer metric for Job failures like kube_job_status_failed from kube-state-metrics. The drawback of this approach is that an on-call won’t be alerted until the Job has reached the terminal failure state once BackoffLimit has been reached, which can be much later than the first application container failure.
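To get a feel for how much later a Job-level alert can fire, the retry delays can be estimated with a simplified model. It assumes the behavior described in the Kubernetes Job documentation, where failed Pods are recreated with an exponential back-off (10s, 20s, 40s, ...) capped at six minutes and the default backoffLimit is 6. The function name is illustrative, and the model ignores the run time of each attempt and any container restarts handled by the kubelet.

```python
from typing import List


def backoff_delays(backoff_limit: int = 6, base_s: int = 10,
                   cap_s: int = 360) -> List[int]:
    """Delays (in seconds) before each retry, doubling up to the cap."""
    return [min(base_s * 2 ** attempt, cap_s) for attempt in range(backoff_limit)]


delays = backoff_delays()
minimum_wait_s = sum(delays)
```

With the defaults, the delays alone add up to 630 seconds, so a Job-level failure alert can trail the first container failure by more than ten minutes even before execution time is counted.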

What causes CronJobs to fail intermittently?

Non-negligible start delay and retry-on-failure loops contribute additional delay that can interfere with the repeated execution of Kubernetes CronJobs. For frequent CronJobs, or those with long application execution times relative to idling time, this additional delay can carry over into the next scheduled invocation. If the CronJob has ConcurrencyPolicy: Forbid set to disallow concurrent runs, then this carry-over causes future invocations to not execute on-time and get backed up.

Example timeline (from the perspective of the cronjobcontroller) where startingDeadlineSeconds is exceeded for a particular hourly CronJob — the CronJob misses its run and won’t be invoked until the next scheduled time

A more sinister scenario that we observed at Lyft where CronJobs can miss invocations entirely is when a CronJob has startingDeadlineSeconds set. In that scenario, when start delay exceeds the startingDeadlineSeconds, the CronJob will miss the run entirely. Additionally, if the CronJob also has ConcurrencyPolicy set to Forbid, a previous invocation’s retry-on-failure loop can also delay the next invocation, causing the CronJob to miss as well.

The Real-world operational burden of Kubernetes CronJobs

Since beginning to move these repeated, scheduled tasks onto Kubernetes, we found that using CronJob out-of-the-box introduced several pain-points from both the developers’ and the platform team’s points of view that began to negate the benefits and cost-savings we initially chose Kubernetes CronJob for. We soon realized that neither our developers nor the platform team were equipped with the necessary tools for operating and understanding the complex life cycles of CronJobs.

Developers at Lyft came to us with lots of questions and complaints when trying to operate and debug their Kubernetes CronJobs like:

“Why isn’t my cron running?”
“I think my cron stopped running. How can I tell if my cron is actually running?”
“I didn’t know the cron wasn’t running, I just assumed it was!”
“How do I remedy X failed cron? I can’t just ssh in and run the command myself.”
“Can you explain why this cron seemed to miss a few schedules between X and Y [time periods]?”
“We have X (large number) of crons, each with their own alarms, and it’s becoming tedious/painful to maintain them all.”
“What is all this Job, Pod, and sidecar nonsense?”

As a platform team, we were not equipped to answer questions like:

How do we quantify the performance characteristics of our Kubernetes Cron platform?
What is the impact of on-boarding more CronJobs onto our Kubernetes environment?
How does running multi-tenant Kubernetes CronJobs perform compared to single-tenant Unix cron?
How do we begin to define Service-Level-Objectives (SLOs) to communicate with our customers?
What do we monitor and alarm on as platform operators to make sure platform-wide issues are tended to quickly with minimal impact on our customers?

Debugging CronJob failures is no easy task, and often requires an intuition for where failures happen and where to look to find proof. Sometimes this evidence can be difficult to dig up, such as logs in the cronjobcontroller which are only logged at a high verbosity log-level. Or the traces simply disappear after a certain time period and make debugging a game of “whack-a-mole”, such as Kubernetes Events on the CronJob, Job, and Pod objects themselves, which are only retained for one hour by default. None of these methods is easy to use, and none scales well from a support point of view as more and more CronJobs are added to the platform.

In addition, sometimes Kubernetes would just quit when a CronJob had missed too many runs, requiring someone to manually “un-stick” the CronJob. This happens in real-world usage more often than you would think, and became painful to remedy manually each time.

This concludes the dive into the technical and operational issues we’ve encountered using Kubernetes CronJob at scale. In Part 2 we share what we did to address these issues in our Kubernetes stack to improve the usability and reliability of CronJobs.

As always, Lyft is hiring! If you’re passionate about Kubernetes and building infrastructure platforms, read more about them on our blog and join our team!

How we learned to improve Kubernetes CronJobs at Scale (Part 1 of 2) was originally published in Lyft Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.