State Farm Engineering Blog - Medium

Then and Now: The Seismic Shift Happening Today in Test Automation

State Farm Engineering — Wed, 13 May 2026 17:29:45 GMT

Introduction

For as long as I can remember, the automation engineer was the loop, not in the loop. Every step of the test automation process required direct human execution: learning framework syntax from documentation and videos, manually translating requirements into test cases, using inspector tools to find every single element, writing each line of code, and updating documentation separately from the codebase. The human had to do all the work, not just guide it.

Today, we’re experiencing a fundamental shift. The human is now in the loop: guiding, reviewing, and making decisions while AI handles the repetitive execution. We’re not removed from the process; we’re elevated within it, focusing on judgment and strategy rather than tedious implementation details.

Not long ago, many aspects of UI test automation seemed held back by their very nature. We were trying to shove a square peg in a round hole: creating rigid, step-based, computer-friendly tests to interface with a constantly shifting frontend designed for human users. Now, issues that have kept us scowling at a dimly lit computer screen trying to click a button that’s right there, finally have answers.

Let’s examine three key areas where these answers are impactful: finding and healing locators, writing test cases and code, and everyone’s favorite — documentation.

AI is a Silver Bullet, Isn’t It?

AI is certainly the buzzword of the day in the tech community, and the testing sphere is no different. However, let’s not pretend this is a silver bullet. It’s how we apply AI that makes the difference. Every organization wants to race for AI to be the great fix for all our issues. But when used incorrectly we could be looking at spending more time for lesser quality (yikes!).

I like to think of AI as a great junior developer that doesn’t need any bathroom breaks. Lots of potential, is capable of great work, but needs tasks framed in a particular way. You don’t want to leave out important details, and you can’t assume they know things your team takes for granted. Often, you’ll want to give a bit of the why behind the ask as well. If they can buy into a vision, when you don’t explicitly spell something out, they might be able to use their knowledge to fill in the gaps.

Then and Now: Finding Locators & Self-Healing When They Fail

The “Then” — Brittle Locators Were a Constant Battle

Remember the feeling? You open your test results Monday morning to find that 30% of your suite is red; not because of actual bugs, but because the development team changed a button’s accessibility label or moved that heading yet again (not that we ever change our minds in tech). The frustrating part? The element you’re looking for is still right there on the screen, easily visible to human eyes, but your test can’t find it.

Screen elements are notoriously difficult to locate:

Every UI change means updating locators across multiple test files
Difficult screens frequently lead to brittle locators
Different platforms have their own unique challenges

The cycle was exhausting. Find the broken locator, inspect the new page structure, update the code, run the test, repeat. In my career of leading various automation engineering teams, I’ve seen locator maintenance often take 20–30% of automation engineers’ time and sometimes more. Not the creative problem-solving we signed up for, just tediously tracking down moved buttons and renamed fields.

The “Now” — Intelligent Self-Healing Locators

We are now leveraging AI to automatically heal broken locators in real time, with multiple fallback strategies and confidence scoring. Modern frameworks can detect when a locator fails and intelligently search for the element using contextual understanding. We’ve taken an open-source framework called Rocket Fuel and added self-healing capabilities. While running automation tests for our State Farm iOS and Android applications, sometimes a locator fails. This could be due to a code change, a dynamic page, locators found for Android but not yet iOS, or any other reason. We then make a call to a large language model (LLM) API with a prompt containing the failure context, paired with the layout of the screen (as an image or as text) and any available element metadata, and the model returns candidate matches. Candidates that meet the threshold set by the programmer are tried in order of confidence. Tests can then continue execution with the healed locator, as if it never broke.

Here are some questions we asked and ideas we implemented in our framework:

Capture context — What are we looking for? What’s currently on screen? What info do I have to describe what I need?
Try AI image matching — try to match by image or use layout to describe what you need. You can even pass this into the next suggestion
Try AI text analysis — source of the page or a good text description can be used to generate locators
Try traditional approaches — You don’t always need AI! To save some tokens, try standard variations on broken locators. Maybe all you need is a different by or to try Login instead of Log In.
Give your strategies a weight so you can rank confidence (1–100, low/medium/high confidence, etc.)
Cache for current run, output after — you can reuse healing during the test run, then save the suggestions after. These can then be automatically updated or set for human review.
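
The ideas in this list can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the Rocket Fuel implementation: every name here (Candidate, Strategy, healLocator, llmStandIn, and the locator strings) is hypothetical, and the LLM-backed strategy is replaced by a hard-coded stand-in so the example runs without a network call. In a real framework the strategies would likely be asynchronous.

```typescript
// Hypothetical types for illustration; none of these come from a real framework.
interface Candidate {
  locator: string;
  confidence: number; // raw score from the strategy, 1-100
}

interface FailureContext {
  brokenLocator: string;
  screenText: string[]; // simplified stand-in for the page source or layout
}

interface Strategy {
  name: string;
  weight: number; // 0..1, how much this strategy is trusted
  suggest(ctx: FailureContext): Candidate[];
}

// Traditional, token-free strategy: match visible text that differs only
// in case or spacing (for example "Log In" for "login_button").
const labelVariations: Strategy = {
  name: "label-variations",
  weight: 0.9,
  suggest(ctx) {
    const wanted = ctx.brokenLocator.replace(/_button$/, "").toLowerCase();
    return ctx.screenText
      .filter((text) => text.replace(/\s+/g, "").toLowerCase() === wanted)
      .map((text) => ({ locator: `text=${text}`, confidence: 100 }));
  },
};

// Stand-in for an LLM-backed strategy; a real one would send the failure
// context and screen layout to a model and parse its suggestions.
const llmStandIn: Strategy = {
  name: "llm-stand-in",
  weight: 0.7,
  suggest: () => [{ locator: "id=submit_button", confidence: 95 }],
};

function healLocator(
  ctx: FailureContext,
  strategies: Strategy[],
  threshold: number,
  cache: Map<string, string>, // reused for the whole test run
  exists: (locator: string) => boolean, // probes the live screen
): string | null {
  const cached = cache.get(ctx.brokenLocator);
  if (cached !== undefined) return cached;

  // Weight every suggestion, drop those below the threshold, rank the rest.
  const ranked: Candidate[] = [];
  for (const strategy of strategies) {
    for (const c of strategy.suggest(ctx)) {
      const confidence = c.confidence * strategy.weight;
      if (confidence >= threshold) ranked.push({ locator: c.locator, confidence });
    }
  }
  ranked.sort((a, b) => b.confidence - a.confidence);

  // Try candidates in order of confidence; cache the first one that works.
  for (const candidate of ranked) {
    if (exists(candidate.locator)) {
      cache.set(ctx.brokenLocator, candidate.locator);
      return candidate.locator;
    }
  }
  return null;
}

const runCache = new Map<string, string>();
const failure: FailureContext = {
  brokenLocator: "login_button",
  screenText: ["Username", "Password", "Log In"],
};
const onScreen = (locator: string) =>
  locator === "text=Log In" || locator === "id=submit_button";

const healed = healLocator(failure, [llmStandIn, labelVariations], 60, runCache, onScreen);
console.log(healed); // "text=Log In"
```

After the run, the contents of the cache can be written out as suggestions for automatic update or human review.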

A new spin on a classic problem

Imagine you have a login button that your test uses called “login_button”. After a UI update, it is decided “submit_button” better reflects its multi-purpose nature, and the button gets updated.

When your test runs, the original locator fails. Instead of failing the test for someone to examine later, self-healing activates. The AI analyzes the current page and finds a button with “Log In” text in the appropriate context, near username and password fields. It returns a new locator with high confidence and your test continues successfully. Later the test needs the same button again. After failing, the new locator is pulled from cache and the test continues. After execution, the new locator is written to a file for human review.

These tests can be run automatically per new build to find broken locators before you’ve gotten your coffee. In our framework, what used to take 30 minutes of manual work now takes less than a minute — a 97% reduction in maintenance time.

The beauty is you can control when healing runs. During active development when things change frequently, let it heal automatically. Once tests are hardened and stable, you might want to disable it or gate it behind a review process. You stay in control of how much automation you want.

Writing Test Cases & Automation Code

The “Then” — Writing Automation Was a Surprisingly Manual Effort

This workflow was highly pattern-based and repetitive. Screen elements were tedious and error-prone to track, with little creative license; you just needed to find them reliably. The elements themselves were highly volatile, changing with every UI update.

The efficiency question loomed: “Not bad, but is it efficient?” We all knew the answer.

The “Now” — AI Handles Translation, We Provide Knowledge

UI automation has been trying to automate the creation of test cases for as long as I can remember. Flawless record and playback sits at the top of every other vendor’s feature list. And why not? Just watching your tests run feels like a screen recording already! Surely building software to capture that would be easy. Right? Right… As we know, there’s a lot going on under the hood that makes this much harder than it appears. Coming from a programming background, I never loved the idea of losing the level of control that code provides, but the idea of easily building out your test cases by testing what you would anyway is so compelling. Thankfully, we no longer have to choose. Given the limited number of options for interacting with software (tap, scroll, type, click, etc.) and a solid framework, UI automation is well positioned for quality code generation.

Let’s dive into some specifics for enabling automation engineers to supercharge their workflow. Custom instructions are essentially teaching materials for AI. These are documents that explain your team’s coding patterns, naming conventions, and architectural decisions. Framework-aware extensions, such as Model Context Protocol (MCP) servers or agent skills, extend AI assistants with specialized capabilities specific to your automation framework, allowing them to generate code that follows your exact patterns rather than generic examples. Using custom instructions and framework-aware tools, AI can understand your framework patterns: how to structure page objects, when to create reusable methods versus inline actions, platform differences (perhaps get locators on iOS after getting them for Android), and naming conventions for test methods and variables, just to name a few. An MCP server or agent skills can also standardize many common tasks in automation. This not only helps generate code but also helps onboard less experienced automation engineers. There’s a whole blog’s worth of content on when to use which tool, so I won’t get into it here. Instead, play around with what you have available to you and see what yields the best results. In my experience, AI-assisted test generation can cut test writing time by roughly 50%, and there’s plenty of room to grow.

This shift doesn’t replace us, but lets us shift focus. Instead of spending hours finding elements with inspector tools and writing boilerplate code, we’re building on a generated foundation of the team’s design and adding the nuanced business logic that requires human understanding. The edge cases, the business rule validations, the error handling that goes beyond happy paths are where human insight matters.

I believe maintenance and debugging time is the biggest hidden cost in our test efforts. Noticing patterns, assimilating information, and stepping through code snippets are a large time sink, yet essential in debugging. AI can shoulder much of the load here. When tests fail, self-healing handles locator issues automatically. AI can also help analyze failure patterns from reports, try a fix, run the test again, and show me the outcome without me doing anything. Instead of spending time on “why can’t we find this button anymore?”, we are able to move our focus closer to how users experience an application. Does this feel easy to use? Did the payment amount calculate correctly? Did the workflow follow the expected path? Can I use what we learned years ago when we built a similar feature? That’s where my expertise adds value.

Then and Now: Documenting Tests & Integration with Test Management

The “Then” — Documentation Drift was Inevitable

We often have barely enough time to write and maintain our tests, meaning keeping up with the test plan can feel like a pipe dream. Then you add on test automation and the complexity balloons. My test case might say “automated: yes” but does that include negative cases? Why is this code different from the documentation? Was this case never updated after the redesign?

It comes down to the question that has haunted many a team: “what is automation actually covering?” The question is simple enough to ask but faces many hurdles to answer. For example:

Automation execution reports are often time consuming to read and tedious to manually tie back to stories
Documentation is forever outdated
Discrepancies between code and test plan
Duplicated tests because someone didn’t know it was covered, or coverage gaps because they thought it was
Writing documentation is often the lowest priority and skipped

The “Now” — Code Becomes the Source of Truth

The problem has always been, I have the automation code I want to write, but how do I keep up with all the steps needed to share what it’s doing back to the team? What if we don’t need many (if any) steps to communicate what our tests are covering? At that point, documentation takes care of itself and there’s nothing to keep up with. We don’t need to worry about updating because it updates as we update our code. You might be thinking “that sounds great but how can we get there?” I’m glad you asked.

If we flip the relationship so that automation code is the source of truth and documentation is generated from it, then all we need is to translate our code into easily human readable test cases. Turns out, that’s a process AI is really good at.

Our Code First Documentation System:

AI is key here: it takes the automated test steps and outputs them to a file as readable test cases. Automatically creating and updating those cases in your test case tracking system keeps your tests always up to date.
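
As a rough sketch of that flow, the snippet below builds a prompt from a test's source, hands it to a summarizer, and shapes the reply into a record that keeps the link from user story to test steps to automation code. All names here (AutomatedTest, TestCaseDoc, Summarize, documentTest, the story ID and file path) are hypothetical, and the summarizer is passed in as a plain function so no particular LLM or test management API is implied.

```typescript
// Hypothetical shapes; no real tracker or LLM API is implied.
interface AutomatedTest {
  name: string;
  storyId: string; // user story the test traces back to
  sourcePath: string; // where the automation code lives
  source: string; // the automation code itself
}

interface TestCaseDoc {
  title: string;
  storyId: string;
  sourcePath: string;
  steps: string[];
}

// Stand-in for an LLM call: takes a prompt, returns numbered steps as text.
type Summarize = (prompt: string) => string;

function buildPrompt(test: AutomatedTest): string {
  return [
    "Rewrite this automated UI test as numbered, plain-English test steps.",
    "Describe only what the code does; do not add steps.",
    "",
    test.source,
  ].join("\n");
}

function documentTest(test: AutomatedTest, summarize: Summarize): TestCaseDoc {
  const steps = summarize(buildPrompt(test))
    .split("\n")
    .map((line) => line.replace(/^\s*\d+[.)]\s*/, "").trim()) // drop "1." or "1)"
    .filter((line) => line.length > 0);
  return {
    title: test.name,
    storyId: test.storyId,
    sourcePath: test.sourcePath,
    steps,
  };
}

// Usage with a canned reply in place of a real model
const loginTest: AutomatedTest = {
  name: "Login with valid credentials",
  storyId: "STORY-123",
  sourcePath: "tests/login.spec.ts",
  source: "loginPage.open(); loginPage.signIn(user); homePage.expectVisible();",
};
const cannedReply: Summarize = () =>
  "1. Open the login screen\n2. Sign in with a valid user\n\n3) Verify the home screen is shown";

const doc = documentTest(loginTest, cannedReply);
console.log(doc.steps);
```

Because the record is regenerated from the code on each run, the steps change whenever the test changes, which is the property the code-first approach relies on.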

The transformation:

Before: “Is this test case automated?”

Check the code repository (if you know where to look)
Ask team members (if they remember)
Hope for an accurate answer

Now: “Is this test case automated?”

Look at automation stories in the issue tracker
No uncertainty about what is covered; code is essentially the documentation
Clear link: User story → Test steps → Automation code → Automated Test Case

This isn’t revolutionary; approaches like Behavior Driven Development (BDD) have explored documentation driven by code for years. However, a system like this traditionally required various tradeoffs. For example, teams may add an extra layer like Gherkin, try manually keeping up with documentation, or employ a business analyst to help keep documentation synced.

However, now that we have a tool that can reliably transform code into English steps, we’ve unlocked new possibilities. We can have code independent from a dedicated English layer and still generate documentation from it. Using code as the source of truth, we are no longer building something that tries to match a list of requirements we’re constantly playing catch up with. Instead, when our team asks if something is automated, we show them stories that are as good as code.

Conclusion

So, what does all this add up to?

The seismic shift in test automation isn’t just about AI; it’s about rethinking our entire workflow with AI as an enabler. Through this transformation, we’ve seen a 50% reduction in test case writing time, near-elimination of locator maintenance, and a solution to the seemingly unsolvable problem of keeping documentation visible and up to date. As automation engineers, we used to be the loop; now we are in the loop, guiding, reviewing, and directing as we delegate tedious tasks to where they belong: our tools.

Despite the promising returns so far, we are just getting started. We are still experimenting with automation fixing and retry, live smart recovery beyond locators, sophisticated image matching, and more AI-assisted debugging. At the rate things are advancing, it feels like there will be new ideas to try out as soon as I finish typing. We are truly experiencing an unprecedented rate of technology growth. All we can do is our best to grow with it and to be flexible.

Remember: AI is that great junior developer who doesn’t need bathroom breaks. Give it good instructions, clear patterns, and architectural guidance. Review its work critically. Use it to handle the tedious, repetitive work that none of us enjoyed anyway, freeing you to focus on the interesting challenges: the business logic, edge cases, and creative problem-solving that actually need human insight.

We have the technology. The question isn’t whether to adopt AI in your testing workflow; it’s which problem you’ll solve first. What are you going to build?

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.

Information contained in this article may not be representative of actual use cases. The views expressed in the article are personal views of the author and are not necessarily those of State Farm Mutual Automobile Insurance Company, its subsidiaries and affiliates (collectively “State Farm”). Nothing in the article should be construed as an endorsement by State Farm of any non-State Farm product or service.

Then and Now: The Seismic Shift Happening Today in Test Automation was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Responsive Labeling with polygon-labeler — State Farm Open Source

State Farm Engineering — Mon, 06 Apr 2026 14:31:02 GMT

By Michael Keller

Introduction

At State Farm, we always try to be a good neighbor and today we are excited to announce we are giving back to the open source community with a new NPM package for labeling polygons called polygon-labeler. In this post, we’ll explore its features, demonstrate how to use it, and highlight what sets it apart from other polygon labeling solutions.

Problem

Placing labels on polygons sounds simple: find the center of a polygon and render a label there. In practice, this logic produces incorrect label placements or labels that are invisible to the user. At State Farm, we operate our own internal mapping PaaS and required a solution that was both accurate and high performing for labeling polygons within our platform. While we explored existing packages like polylabel and Turf, each presented limitations that inspired me to develop polygon-labeler. This lightweight, specialized package consistently computes optimal label points for both individual and grouped polygon features. In this post, I’ll cover common labeling challenges, explain how polygon-labeler addresses them, and show you how to integrate it into your mapping application.

polygon-labeler

A package for generating a single ideal location to label polygons. This is useful for mapping libraries like Maplibre GL that will generate multiple labels for polygons by default. The package uses an algorithm to find the visual center of polygons and supports grouping features by properties, handling multi-polygons, and viewport clipping for optimal performance.

Centroids

When trying to label a set of polygons, the first solution that people may gravitate towards is using the centroid of each polygon. While on paper this may sound great, there are many pitfalls to using centroids. Turf.js provides a method called centroid that allows you to easily compute the centroid by passing in each feature.

Why centroids are a poor default:

Centroids are calculated via the mean of all vertices within the polygon. For concave shapes, centroids often lie outside the polygon entirely.
For MultiPolygons and features with disjoint islands, like Hawaii, a single centroid can land in the water between the islands or inside a tiny island instead of the largest visible landmass.
Map panning and tiling make the problem worse when using centroids. If you pre-compute centroids on full geometries but display a clipped view, the centroid may fall outside the clipped area or, even worse, not be rendered at all.
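
To make the first pitfall concrete, here is a small self-contained sketch that computes a vertex-mean centroid for a U-shaped polygon and checks whether it lands inside. The helper names (vertexMean, isInside) and the sample shape are illustrative assumptions; this is not Turf's code, only the same idea written out by hand with a standard ray-casting test.

```typescript
type Position = [number, number]; // [x, y] or [lng, lat]

// Mean of the ring's vertices (the repeated closing vertex is excluded)
function vertexMean(ring: Position[]): Position {
  const pts = ring.slice(0, -1);
  let sumX = 0;
  let sumY = 0;
  for (const [x, y] of pts) {
    sumX += x;
    sumY += y;
  }
  return [sumX / pts.length, sumY / pts.length];
}

// Ray-casting test: count how many edges a horizontal ray from the point crosses
function isInside(point: Position, ring: Position[]): boolean {
  const [px, py] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses =
      (yi > py) !== (yj > py) &&
      px < ((xj - xi) * (py - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// A U-shaped (concave) polygon with a notch between x = 2 and x = 4
const uShape: Position[] = [
  [0, 0], [6, 0], [6, 6], [4, 6], [4, 2], [2, 2], [2, 6], [0, 6], [0, 0],
];

const centroid = vertexMean(uShape);
console.log(centroid); // [3, 3.5]
console.log(isInside(centroid, uShape)); // false: the point sits in the notch
console.log(isInside([1, 3], uShape)); // true: a point in the left arm
```

A label rendered at [3, 3.5] would float in the gap of the U rather than on the shape itself.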

Point on Feature

Using a point that is guaranteed to lie on the feature, for example pointOnFeature from Turf, is a common alternative to centroids because it guarantees the returned location is on the geometry. However, this approach has its own pitfalls when used for labeling.

Why point on feature is a poor default:

The point may end up on the polygon’s outer edge or on a narrow sliver.
For complex polygons, pointOnFeature often returns a point on a small polygon rather than the visually dominant area.
Points on edges can cause label collision detection to behave poorly, which leads to polygons not being labeled.

These conditions lead to labels that overlap unrelated features, sit off the polygon, or are invisible on the current map viewport. For readable maps and consistent UX, we need a label position that is visually central and lies on or inside the polygon that is actually visible to the user.

Why should I use this package over polylabel?

polylabel is a great package, and we even use it inside polygon-labeler to find the pole of inaccessibility (a visually centered interior point) for a polygon. However, it solves only one of the many problems with creating dynamic labels. polylabel is effective at finding the polygon’s most interior point, but it doesn’t address several key challenges involved in labeling polygons within web mapping applications, covered below.

Key Challenges

Single Polygon Geometry

polylabel works on one polygon at a time. For MultiPolygons, that means a point on every tiny island or disjoint part instead of a single point on the visually dominant polygon.

Viewport Awareness

polylabel does not clip polygons to the current map view. This may lead to the label being generated outside of the current viewport. In the example below, we can see the label was generated outside of the map viewport and would not be visible to the user.

Fallback Handling

In some cases, polylabel cannot guarantee a point within the polygon. This edge case can lead to labels being placed in neighboring polygons, in unintended locations, or nowhere at all. For the polygon below, for example, the result that follows it contains only NaN values.

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "coordinates": [
          [
            [
              -111.1815838,
              45.7768091
            ],
            [
              -111.1816944,
              45.776695
            ],
            [
              -111.1812484,
              45.7764847
            ],
            [
              -111.1811378,
              45.7765988
            ],
            [
              -111.1815838,
              45.7768091
            ]
          ]
        ],
        "type": "Polygon"
      }
    }
  ]
}

[NaN, NaN, distance: NaN]
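
A result like the one above can be caught with a cheap guard before it reaches the map. The sketch below is an assumption about how such a guard might look, not the package's actual code: isFinitePosition and labelPointOrFallback are hypothetical helpers, and the fallback is supplied by the caller (polygon-labeler itself falls back to a point on the feature, as described later in the post).

```typescript
type Position = [number, number];

// True only when both coordinates are real, finite numbers
function isFinitePosition(p: ArrayLike<number>): boolean {
  return Number.isFinite(p[0]) && Number.isFinite(p[1]);
}

// Use the candidate when it is usable; otherwise ask the caller for a fallback
function labelPointOrFallback(
  candidate: ArrayLike<number>,
  fallback: () => Position,
): Position {
  return isFinitePosition(candidate) ? [candidate[0], candidate[1]] : fallback();
}

// Hypothetical fallback for illustration: a fixed point the caller knows is acceptable
const knownGoodPoint = (): Position => [-111.1814, 45.7766];

const fromNaN = labelPointOrFallback([NaN, NaN], knownGoodPoint);
const fromValid = labelPointOrFallback([2, 2], knownGoodPoint);
console.log(fromNaN, fromValid);
```

A finite check alone does not prove the point is inside the polygon, so in practice it would sit alongside a point-in-polygon test.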

Design goals for polygon-labeler

When building polygon-labeler, the design goals were to:

Guarantee the point is inside the polygon.
Always return a point that is visually central to the most relevant polygon geometry.
Respect grouping and unique identifiers so multi-part features are labeled where users expect them.
Support clipping to the current map view so labels remain visible and meaningful.

How polygon-labeler works

The package determines optimal label placement through the following steps:

Group Features: Features are grouped by a unique identifier property (e.g., “state_name”). All polygons with the same identifier are collected together.
Clip to Viewport: Polygons are clipped to the current map view bounds, and only the parts that fall within the view are retained. This ensures labels are visible on the map.
Find Largest Polygon: For each group, the package calculates the area of all polygons and identifies the largest polygon by area. This ensures the label is placed on the most prominent part of the feature.
Calculate Pole Of Inaccessibility: Using polylabel, determine the most distant internal point from the polygon outline for the largest polygon.
Validate Position: The package checks whether the point falls within the polygon boundaries. If it does, the pole of inaccessibility is used as the label point. If it does not, a point on the feature is calculated and used instead.
Return GeoJSON: The result is a GeoJSON FeatureCollection of Point features, each representing an optimal label location with the specified property attached.
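
The grouping and largest-polygon steps above can be sketched without any dependencies. This is a simplified illustration under stated assumptions rather than the package's source: PolygonPart, ringArea, polygonArea, and largestPartPerGroup are hypothetical names, and area is computed with the planar shoelace formula, which may differ from whatever area calculation the package uses internally.

```typescript
type Position = [number, number];
type Ring = Position[];

// One polygon belonging to a feature; rings[0] is the outer ring, the rest are holes
interface PolygonPart {
  id: string; // value of the unique identifier property
  rings: Ring[];
}

// Shoelace formula for the area of a closed ring (planar approximation)
function ringArea(ring: Ring): number {
  let sum = 0;
  for (let i = 0; i < ring.length - 1; i++) {
    const [x1, y1] = ring[i];
    const [x2, y2] = ring[i + 1];
    sum += x1 * y2 - x2 * y1;
  }
  return Math.abs(sum) / 2;
}

// Outer ring area minus the area of any holes
function polygonArea(rings: Ring[]): number {
  let area = 0;
  rings.forEach((ring, index) => {
    area += index === 0 ? ringArea(ring) : -ringArea(ring);
  });
  return area;
}

// Group parts by identifier and keep only the largest part in each group
function largestPartPerGroup(parts: PolygonPart[]): Map<string, PolygonPart> {
  const best = new Map<string, PolygonPart>();
  for (const part of parts) {
    const current = best.get(part.id);
    if (!current || polygonArea(part.rings) > polygonArea(current.rings)) {
      best.set(part.id, part);
    }
  }
  return best;
}

// Two islands that share an identifier, plus one unrelated polygon
const smallIsland: PolygonPart = {
  id: "islands",
  rings: [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]],
};
const bigIsland: PolygonPart = {
  id: "islands",
  rings: [[[5, 5], [5, 8], [8, 8], [8, 5], [5, 5]]],
};
const mainland: PolygonPart = {
  id: "mainland",
  rings: [[[0, 0], [0, 4], [4, 4], [4, 0], [0, 0]]],
};

const winners = largestPartPerGroup([smallIsland, bigIsland, mainland]);
console.log(winners.get("islands") === bigIsland); // true
```

The pole of inaccessibility would then be computed only for each winning part, so a group of islands gets one label on its largest member.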

Installation

npm install @statefarmins/polygon-labeler

Usage

import { getLabelPoints } from '@statefarmins/polygon-labeler';
import type { FeatureCollection } from 'geojson';

const featureCollection: FeatureCollection = {
    type: "FeatureCollection",
    features: [
        {
            type: "Feature",
            geometry: {
                type: "Polygon",
                coordinates: [[[0, 0], [0, 4], [4, 4], [4, 0], [0, 0]]],
            },
            properties: { name: "Area1", id: "1" },
        },
    ],
};

// Define map bounds
const southWest = { lat: -10, lng: -10 };
const northEast = { lat: 10, lng: 10 };

// Get label points
const labelPoints = getLabelPoints(
  featureCollection,
  'name',
  'id',
  southWest,
  northEast
);

console.log(labelPoints);

Output

{
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": { 
                "type": "Point", 
                "coordinates": [ 2, 2 ] 
            },
            "properties": { "name": "Area1" }
        }
    ]
}

Example Image

Maplibre GL Usage

import maplibregl from 'maplibre-gl';
import type { GeoJSONSource } from 'maplibre-gl';
import type { FeatureCollection } from 'geojson';
import { getLabelPoints } from '@statefarmins/polygon-labeler';

type LatLng = { lat: number; lng: number };

// Module-level cache of the last view that labels were computed for
let lastBounds: { sw: LatLng; ne: LatLng } | null = null;
let lastZoom: number | null = null;

/**
 * Determine if labels should be recalculated
 *
 * @name shouldRecalculateLabels
 * @param {maplibregl.Map} map A maplibre map
 * @param {number} tolerance The amount of allowed shift in map bounds in degrees
 * @returns {boolean} if labels should be recalculated
 */
function shouldRecalculateLabels(map: maplibregl.Map, tolerance = 0.00001): boolean {
    const bounds = map.getBounds();
    const zoom = map.getZoom();
    const sw = { lat: bounds.getSouth(), lng: bounds.getWest() };
    const ne = { lat: bounds.getNorth(), lng: bounds.getEast() };

    // Compare against null explicitly so a cached zoom of 0 is not treated as missing
    if (lastBounds === null || lastZoom === null) return true;
    if (Math.abs(lastZoom - zoom) > 0.01) return true;

    return (
        Math.abs(lastBounds.sw.lat - sw.lat) > tolerance ||
        Math.abs(lastBounds.sw.lng - sw.lng) > tolerance ||
        Math.abs(lastBounds.ne.lat - ne.lat) > tolerance ||
        Math.abs(lastBounds.ne.lng - ne.lng) > tolerance
    );
}

/**
 * Generate unique labels for each feature
 *
 * @name generatePolygonLabels
 * @param {maplibregl.Map} map A maplibre map
 * @param {string} layerName The name of the polygon layer to query
 * @param {string} labelField The field used to label each polygon
 * @param {string} uniqueIdentifierField The field used to distinguish unique features
 * @returns {void}
 */
function generatePolygonLabels(map: maplibregl.Map, layerName: string, labelField: string, uniqueIdentifierField: string): void {
    if (!shouldRecalculateLabels(map)) {
        return;
    }

    // Remember the view these labels were computed for
    const bounds = map.getBounds();
    const sw = { lat: bounds.getSouth(), lng: bounds.getWest() };
    const ne = { lat: bounds.getNorth(), lng: bounds.getEast() };
    lastBounds = { sw, ne };
    lastZoom = map.getZoom();

    const polygonFeatures = map.queryRenderedFeatures({
        layers: [layerName]
    });

    const polygonFeatureCollection: FeatureCollection = {
        type: "FeatureCollection",
        features: polygonFeatures
    };

    const polygonLabelPoints = getLabelPoints(
        polygonFeatureCollection,
        labelField,
        uniqueIdentifierField,
        sw,
        ne
    );

    // Assumes a GeoJSON source named `${layerName}_geojson` holds the label points
    const labelSource = map.getSource(`${layerName}_geojson`) as GeoJSONSource | undefined;
    labelSource?.setData(polygonLabelPoints);
}

/**
 * Reset the zoom and bounds
 *
 * @name clearCache
 */
function clearCache() {
    lastBounds = null;
    lastZoom = null;
}

// `map` is assumed to be an existing maplibregl.Map instance
map.on('idle', () => {
    generatePolygonLabels(map, "polygons", "name", "id");
});

Common pitfalls and recommendations

Even with an optimized labeling solution, there are a few best practices to keep in mind when integrating polygon-labeler into your mapping application.

Avoid static label computation

Computing labels only once for an entire dataset and reusing them across all map views is a common mistake. As users pan and zoom, the visible portion of a polygon changes dramatically. A label point calculated for the full geometry may fall outside the current viewport, leaving users with unlabeled features. Instead, either clip geometries to the viewport before computing labels or recalculate labels dynamically as the view changes. The Maplibre GL example above demonstrates this pattern using the shouldRecalculateLabels function to trigger updates only when the map bounds shift significantly.

Handle small polygons gracefully

For very small polygons, such as tiny islands or narrow slivers, the visual center may be virtually indistinguishable from the centroid. In these edge cases, the sophisticated pole-of-inaccessibility calculation provides little benefit over simpler methods. When dealing with such features, prioritize label visibility and rely on your mapping library’s built-in collision detection to ensure labels remain readable. Consider adjusting label priority or styling to prevent small polygon labels from cluttering the map at certain zoom levels.

Cache and throttle updates

Recalculating labels on every frame or minor map movement can degrade performance, especially with large datasets. Implement caching strategies and tolerance thresholds to avoid unnecessary recomputation. The example code demonstrates this with lastBounds and lastZoom tracking, only recalculating when the view has shifted beyond a defined tolerance.
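
If labels are recomputed from a high-frequency event rather than from idle, a time-based throttle can complement the tolerance check. The helper below is a generic leading-edge throttle written for illustration; throttle and its injectable now parameter are assumptions of this sketch, not part of polygon-labeler or MapLibre.

```typescript
// Wrap fn so it runs at most once per intervalMs.
// Returns true when the call went through and false when it was skipped.
function throttle<A extends unknown[]>(
  fn: (...args: A) => void,
  intervalMs: number,
  now: () => number = Date.now, // injectable clock, handy for testing
): (...args: A) => boolean {
  let lastRun = -Infinity;
  return (...args: A): boolean => {
    const current = now();
    if (current - lastRun < intervalMs) return false;
    lastRun = current;
    fn(...args);
    return true;
  };
}

// Usage sketch: the callback stands in for a label recalculation
let recalculations = 0;
const throttledRecalculate = throttle(() => {
  recalculations += 1;
}, 200);

throttledRecalculate(); // runs
throttledRecalculate(); // skipped, still inside the 200 ms window
console.log(recalculations); // 1
```

A leading-edge throttle can skip the final position of a pan, so it is worth pairing with one last recalculation once the map settles.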

npm Link

polygon-labeler

Responsive Labeling with polygon-labeler — State Farm Open Source was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The State Farm® Android Team’s Journey to 99.9+% Crash-Free Sessions

State Farm Engineering — Mon, 02 Mar 2026 15:31:00 GMT

By Andrew Erickson

Introduction

The State Farm Android team fiercely defends a 99.9+% crash-free rate for our flagship State Farm® App. We do this through a team culture that focuses on availability and resiliency for our customers and a set of coding best practices that ensure a seamless experience when using the app.

In this post, we’ll take you through our journey of achieving and maintaining a 99.9+% crash-free rate for the State Farm Mobile App, with an insider view of our strategies for monitoring and preventing crashes through everyday engineering and testing practices.

A new start: Measuring success

A great starting point to our crash journey was in 2017 when the Android team faced a huge engineering endeavor to rebuild our flagship app “Pocket Agent” (which originally launched on Google Play in 2010) with a new app identity. Our new app, the State Farm App, featured a new design system, architectural patterns and features. Around the time of our efforts to build the app, Google Firebase Crashlytics emerged as an essential tool on our tool belt, getting virtually 100% coverage on most crashes. Now that we had a true measure of our crash-free rating, it was game on.

When you can measure it, you can improve it.

As we prepared for the big launch of our app redesign, we found several new ways to discover crashes and work towards a more resilient app. When we finally went live at the end of 2017, our app’s key features like paying a bill, filing a claim, and viewing policy info were all going strong and highly available. We aimed for 99%, and we hit 99.1% crash-free. While we hit our first goal, what we actually had was a wealth of new data through Crashlytics.

We didn’t stop at 99.1% though; we learned, we modernized, and we became experts at building an incredibly reliable mobile app… eventually reaching a remarkable, sustained 99.99% crash-free rating! At times, we even peaked at an incredible 100% crash-free, with Crashlytics reporting fewer than one crash per 10,000 users. This is no small feat, considering our app has nearly 5 million installs on active devices. According to the Google Play Console, the State Farm Mobile App ranks significantly higher in stability compared to most apps within the Insurance category.

Today, the State Farm App has more than 400 unique screens, calling more than 200 unique endpoints. Every release, we implement new features, enhance existing ones, and increase stability through bug fixes and modernizing our tech frameworks. The engineering team is composed of 15 native Android engineers, and works side-by-side with a dedicated team of manual and automation testers who help ensure every new app release is stable for our customers.

Mobile chaos engineering (destructive testing)

To fix things, you need to know how your users can break them, before they break them. While we can code to happy golden path scenarios at our desks, we needed new tools in our tool belts to simulate the real world, and the myriad conditions our app is used in. We doubled down on testing efforts and made a sport out of finding novel ways to find crashes before our customers did.

Recovering from Android process termination

We discovered that combining two Android system Developer Options, 1) “Do not keep activities” (DKA) and 2) setting the background process limit to zero (0PL), was a reliable way of simulating Android process termination. These are settings users can find in their device’s Android OS system settings after enabling developer mode. It’s not uncommon for users “in the wild” to have these settings enabled, thinking they may enhance device performance, when they actually increase the likelihood of unexpected app behavior for many of their apps.

Enabling these settings revealed to us a more direct way of simulating Android process termination (when the system ends your app’s process due to current resource constraints). For us, this meant that when a user resumed on the same screen days later, but without the data the screen assumed was there, our app would crash. These conditions are also similar to what happens when a user downgrades an app permission via Settings while our app is in the background. There are also other techniques to simulate and/or cause process termination within Android Studio via Logcat (i.e., Force stop application, Kill process and Crash application).

When process death occurs, the Android OS will still resume the user on the last screen they saw, even if it was last seen days, weeks, or months ago. So a great rule of thumb is to test each screen and how it recovers when a user backgrounds and restores the app with DKA and 0PL enabled. This often uncovers issues when screens don’t independently load their own screen data (e.g., Screen B assuming Screen A called required APIs, then crashing when Screen B gets restored after process death) as well as testing the consequences of long-running async processing finishing after a view has been destroyed.

Simulating unpredictable user behavior

This became an endless well of new problems to solve before our big release. In the real world, our users don’t have the strong connections we have at our desks. They answer phone calls while using the app, they tap around and navigate rapidly, and they run on thousands of different Android devices, all of which reveal crash risks through detached Fragments, localization quirks and device-specific nuances.

Through DKA + 0PL, we started finding that the crashes and the stack traces we found in the Developer Console and Crashlytics were finally replicable, locally and en masse, by simulating what customers often did: open our app from the background after hours, days or months since the last time they used it. What it revealed were state and session management issues, e.g., assuming any given screen had its required data in memory when, in fact, the Android OS had started it from a clean slate.

This was the beginning of our destructive testing efforts. We started asking ourselves:

What happens if I background the screen or navigate away while a service is running?
What happens if I rotate?
What happens if I double-tap or tap three buttons at the same time?
What happens when users have non-US English locales?

What we found were new problems to solve and more techniques for earning higher crash-free rates.

API chaos testing through stubbing responses (test every scenario)

Another essential aspect of destructive testing is to test how our app responds to sporadic API failures, so that if things go wrong retrieving and submitting data, our app is ready to gracefully handle anything thrown our way. A key part of this is an in-house “Stub” framework where we can mock the behavior of all 200+ API calls in a highly detailed way, through specifying HTTP status, response time, payload data, and even simulating what happens on subsequent calls to the same service to test recovery from errors. Our stub test suite has more than 2,400 unique scenarios and that number grows each sprint.

"sample1" : {
    "description" : "Sample API fails on first load, then recovers",
    "expectation" : "User sees an error message. When the error is tapped, or user revisits the screen, then the API is successful and data is loaded.",
    "map" : [
        {"status":500, "matcher":".*/endpoint/sample", "file":"error", "isVariablePayload":true, "sleep":5000},
        {"status":200, "matcher":".*/endpoint/sample", "file":"samplePayload", "isVariablePayload":true, "sleep":500}
    ]
}
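One way to read a scenario like this is that repeated calls to the same endpoint step through the matching entries in order. The Kotlin sketch below models only that idea. StubRule, StubScenario and nextResponse are hypothetical names, and the "advance one entry per call, then stay on the last one" behavior is an assumption made for illustration, not a description of the in-house framework.

```kotlin
// Hypothetical model of one entry in a scenario's "map" array.
data class StubRule(
    val status: Int,
    val matcher: Regex,
    val file: String,
    val sleepMs: Long,
)

// Hypothetical scenario: serves matching rules in order, one per call,
// and keeps serving the last matching rule once the list is exhausted.
class StubScenario(private val rules: List<StubRule>) {
    private val callCounts = mutableMapOf<String, Int>()

    fun nextResponse(url: String): StubRule? {
        val matching = rules.filter { it.matcher.matches(url) }
        if (matching.isEmpty()) return null
        val callIndex = callCounts.getOrDefault(url, 0)
        callCounts[url] = callIndex + 1
        return matching[minOf(callIndex, matching.lastIndex)]
    }
}

fun main() {
    val scenario = StubScenario(
        listOf(
            StubRule(500, Regex(".*/endpoint/sample"), "error", 5000),
            StubRule(200, Regex(".*/endpoint/sample"), "samplePayload", 500),
        )
    )
    val url = "https://example.test/endpoint/sample"
    println(scenario.nextResponse(url)?.status) // first call: the error rule
    println(scenario.nextResponse(url)?.status) // second call: the recovery rule
}
```

Keeping the call counter per URL lets one scenario describe "fails, then recovers" for several endpoints at once without the entries interfering with each other.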

Monkeying around

One last wily tool we discovered was Android’s “monkey testing” ADB commands, which allowed us to send tens of thousands of randomly triggered UI events (taps, swipes, system interactions). While coding at our desks, we’re a sample size of one, and we can never predict how our millions of users will use the app or how the thousands of Android devices, each with their own performance specs and limitations, will perform in the real world.

How we monitor releases and what we look for

The Android team releases a new version of the app to the Play Store every three weeks. Every release typically has a combination of new features, feature enhancements, bug fixes and technical upgrades.

After rounds of dedicated iterative manual testing from engineers and testers as well as our automation team running extensive regression test suites, how do we know that our release is stable and performing well in the public once it goes live?

Phased rollouts and early monitoring

We start with a phased rollout for a week on Google Play, which allows us to get the latest version out to the public, but at a controlled pace, so that if an early issue arises, we can halt the release, fix it, and continue forward.

When a rollout begins, Google Firebase Crashlytics alert emails for new crashes and velocity spikes let us know if something requires urgent action within the first hour. Typically, we look for multiple users impacted once, and especially multiple users impacted multiple times (crash looping). This allows us to estimate potential impact and decide what has potential to be an outbreak crash (e.g., will 10, 100, 1,000 users crash in the full release cycle? Does the crash have potential to self-resolve?). The context of a feature matters too: Is it a feature used a thousand times per month or a thousand times per minute? When we clearly see a potential major issue, we’ll chat, collaborate, talk about impacts and decide if it’s best to halt the release and fix now.

Two sources of truth: Crashlytics and the Developer Console

It was also surprising to learn that you need both Google Firebase Crashlytics and the Google Play Developer Console to see the fullest picture of crashes. The Developer Console often catches crashes that Crashlytics cannot, such as crashes in native code as well as crashes that occur prior to the initialization of the Crashlytics SDK. So checking both places regularly is key to staying on top of stability. For both platforms, prior to release, we also run a “Crashlytics health check” that helps us verify that crash reporting is working as expected and that our obfuscation mapping has successfully been uploaded (so we can easily decipher crashes in the consoles when they occur).

Pre-release readiness review with all Android engineers

Even before releases, we monitor crashes in a separate Crashlytics test project and verify any major issues introduced in development during our sprint have been properly addressed in our pre-release “Android Readiness Review”. We make sure any open crashes are fixed or closed if necessary. The Android Readiness Review is also a great chance for the team to spend dedicated time reviewing each other’s features on a release build, create follow-up issues, and make sure our release is ready for our customers in the coming days. We started this Review in response to a few rough patches with crashes a few years back, but have kept it going and continued evolving it, as it is always fruitful for discussion and resiliency.

Signals from even more channels

In addition to crash-oriented metrics, we heavily monitor analytics through multiple dedicated channels, including Adobe Analytics, Splunk and a customer-feedback platform, in addition to Play Store reviews and customer support channels. So even if our crash data suggests app health, we have multiple signals that help give us a full picture.

Fix now or fix later

When we see lower volume edge case crashes that don’t necessarily require an immediate fix, we still try to prioritize quick fixes in our next sprint to avoid having crashes pile up. Even if just a single user crashed once, if it’s an easy fix (they usually are), then we fix it while working on other sprint feature work. While we always try to replicate all crashes we attempt to fix, it’s not always possible, so we’ll make sure the issue is addressed. Closing crashes in the Crashlytics console and then getting subsequent alert emails about a resurfacing crash is a great way to know when we need to dig further.

Crash out loud in dev, log in prod

Our golden rule: Users should not experience a crash. The “should never happens” may indeed happen, and if they do, we handle them gracefully. Engineers and testers are equally keyed in on covering our happy paths thoroughly and then bulletproofing through destructive testing. For the “should never happen” edge cases, we want to know as soon as possible in sprint development when they do occur. One technique is to explicitly throw in debug builds (when BuildConfig.DEBUG is true), while logging troubleshooting info to Crashlytics using non-fatal event logs in release builds.

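if (theImpossibleHappened) {
    // Debug builds: fail loudly so the problem is caught during sprint development.
    if (BuildConfig.DEBUG) throw IllegalStateException(…)
    // Release builds: record a non-fatal event instead of crashing the user.
    CrashlyticsNonFatalEventLogger.log("theImpossibleHappened ${moreContextDetailsAboutWhatHappened}")
}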
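The same rule can be pulled into a small helper so call sites stay short and the behavior is easy to unit test. This is a minimal sketch, not our production code: ImpossibleStateReporter is a hypothetical name, isDebugBuild stands in for BuildConfig.DEBUG, and logNonFatal stands in for whatever wrapper forwards to Crashlytics (for example, one that calls FirebaseCrashlytics.getInstance().recordException).

```kotlin
// Hypothetical helper: fail loudly in debug builds, log a non-fatal in release builds.
class ImpossibleStateReporter(
    private val isDebugBuild: Boolean,             // stand-in for BuildConfig.DEBUG
    private val logNonFatal: (Throwable) -> Unit,  // stand-in for a Crashlytics wrapper
) {
    fun report(message: String) {
        val error = IllegalStateException(message)
        if (isDebugBuild) throw error
        logNonFatal(error)
    }
}

fun main() {
    val logged = mutableListOf<Throwable>()

    // Release behavior: nothing is thrown, one non-fatal is recorded.
    val releaseReporter = ImpossibleStateReporter(isDebugBuild = false, logNonFatal = { logged.add(it) })
    releaseReporter.report("vehicle list was empty after a successful load")
    println(logged.size) // 1

    // Debug behavior: the same condition throws immediately.
    val debugReporter = ImpossibleStateReporter(isDebugBuild = true, logNonFatal = { logged.add(it) })
    println(runCatching { debugReporter.report("same condition in a debug build") }.isFailure) // true
}
```

Injecting the flag and the logger keeps the helper free of Android dependencies, so both branches can be exercised in a plain JVM test.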

Kotlin non-null assertions (!!) and nullability annotations in Java

As we migrated more Java to Kotlin, we had some hidden problems: Java fields that should have been annotated @Nullable but weren’t, particularly data deserialized from network calls. Kotlin code is equipped with null-safety, but it can’t handle nulls safely if it doesn’t know about them. As we saw crashes happen when our mobile API layer changed data optionality, one thing became clear: avoid !! like the plague. While we moved hundreds of thousands of lines of code from Java to Kotlin, it was key to mark @Nullable on the Java code that remained in the meantime.

One thing we learned is never assume a field will be returned. What this means to us: assume all API field data could potentially be null.

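class SampleSaferVehicleJavaModel {
    @Nullable
    final String year;
    @Nullable
    final String make;
    @Nullable
    final String model;

    // final fields must be assigned, so the sample needs a constructor to compile
    SampleSaferVehicleJavaModel(@Nullable String year, @Nullable String make, @Nullable String model) {
        this.year = year;
        this.make = make;
        this.model = model;
    }
}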

data class SampleSaferVehicleKotlinModel(
    val year: String?,
    val make: String?,
    val model: String?,
)

Along these lines, while we were all learning Kotlin, we needed to become smarter about casts: never assuming data is 100% guaranteed to be present or of the expected type, and preferring the safe cast operator (as?) with a sensible default.

fun processResponse(responseData: Any?) {
    val unsafe = responseData as Boolean // crashes if responseData is null or not a Boolean
    val unsafeNullable = responseData as Boolean? // allows null, but still crashes if responseData is not a Boolean
    val isSuccessful = responseData as? Boolean ?: false // handles null and non-Booleans safely, with a default value of false
}
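To make the difference concrete, here is a small runnable sketch of the safe-cast-with-default pattern next to the unsafe cast. The function name parseIsSuccessful is hypothetical and exists only for this illustration.

```kotlin
// Hypothetical helper showing the safe-cast-with-default pattern.
fun parseIsSuccessful(responseData: Any?): Boolean =
    responseData as? Boolean ?: false

fun main() {
    println(parseIsSuccessful(true))   // true
    println(parseIsSuccessful(null))   // false
    println(parseIsSuccessful("true")) // false: a String is not a Boolean

    // The unsafe cast throws instead of falling back to a default.
    val raw: Any? = "true"
    println(runCatching { raw as Boolean }.isFailure) // true
}
```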

Feature flags and Feature Blocks

Our apps fully embrace feature flagging as well as the idea of Feature Blocks. If we see a new problem emerge, we can quickly toggle any related feature flags through Google Firebase Remote Config at a granular level (app version specific). Our Feature Block framework lets us block more than 60 unique features within our app with a high level of customization. At a high level, Feature Blocks are a way to tell customers, “This feature is still here, but we’re working on fixing an issue”. This allows custom messaging and offering alternatives to customers, such as navigating to the equivalent feature on statefarm.com or through a call-in channel. So in a crash outbreak scenario, our app could temporarily turn off a feature while letting non-impacted features continue working as usual. This framework also gives us an opportunity to ask users to update to the newest version of the app on the Play Store to use features that have been repaired.

Even when things remain stable for long periods of time, feature flags and feature blocks provide us with flexibility, safety, and confidence that when something goes wrong, we can impact the fewest number of users possible.

// Feature flagging sample for quickly toggling features remotely
if (FeatureFlags.SAMPLE_LOCAL_FLAG.isEnabled && FirebaseRemoteConfigFeatureFlag.RELATED_SAMPLE_REMOTE_FLAG.isEnabled()) {
    handleAutoClaimSelected()
} else {
    handleAutoClaimSelectedLegacy()
}

Lifecycle safety

Before we moved to more modern patterns like MVVM and Jetpack Compose, detached Fragments were a leading source of crashes in our MVP and XML-based architecture.

A simple example:

A user submits a preference update (which takes 4 seconds on a poor connection).
After 1 second, the user decides to navigate back.
The Activity and Fragment are popped from the back stack.
Once the 4 seconds have passed, the Presenter calls back to the View, and the View incorrectly assumes the Activity is still present.

With detached Fragments, calls to Fragment#getString, Fragment#getContext, Fragment#startActivity as well as showing error AlertDialogs would all cause a crash. Ultimately, this revealed flaws where our Presenters were living longer than they should have while also leaking Views. So we got better at testing and removing View callbacks, while leveraging WeakReferences when it made sense.
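The idea of dropping View callbacks and holding the View weakly can be sketched in plain Kotlin. PreferenceView and PreferencePresenter are hypothetical names for the preference-update example, and this is an illustration of the pattern rather than our actual Presenter code.

```kotlin
import java.lang.ref.WeakReference

// Hypothetical MVP contract for the preference-update example.
interface PreferenceView {
    fun showSaved()
}

// Hypothetical Presenter that holds its View weakly and drops it on detach,
// so a late callback becomes a no-op instead of touching a detached Fragment.
class PreferencePresenter {
    private var viewRef: WeakReference<PreferenceView>? = null

    fun attach(view: PreferenceView) {
        viewRef = WeakReference(view)
    }

    fun detach() {
        viewRef = null
    }

    // Called when the (slow) network request finally completes.
    fun onPreferenceSaved() {
        viewRef?.get()?.showSaved()
    }
}

fun main() {
    var callbacks = 0
    val view = object : PreferenceView {
        override fun showSaved() { callbacks++ }
    }
    val presenter = PreferencePresenter()
    presenter.attach(view)
    presenter.onPreferenceSaved() // View attached: callback delivered
    presenter.detach()            // user navigated back
    presenter.onPreferenceSaved() // late callback: safely ignored
    println(callbacks) // 1
}
```

The explicit detach() does most of the work here. The WeakReference is a second line of defense that keeps the Presenter from leaking the View if a detach call is ever missed.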

In more modern architecture, Compose helps us better manage state and how our data layer triggers updates to the UI, particularly with StateFlow#collectAsStateWithLifecycle.

Even lifecycle-safe coding practices can be vulnerable to edge cases, either in the form of crashes or unexpected behavior. So a few useful extensions we built:

Extension: fun NavController.navigateSafely(…)

NavControllers can be prone to edge case crashes, particularly when users double- or triple-tap buttons while device resources are low. Each of our NavController#navigate invocations flows through an exception-catching extension. Our extensive manual and automation testing efforts will catch any happy path issues, while the extension covers us on the edge cases.

fun NavController.navigateSafely(…) {
    try {
        this.navigate(…)
    } catch (navigationException: Exception) {
        Logger.e(TAG, Log.getStackTraceString(navigationException))
    }
}

Extension: fun LifecycleOwner.isAtLeastStarted()

When navigating from Compose UIs, we perform a check to verify that the UI is in an interactive lifecycle state that is ready to navigate. This also prevents multiple destinations from stacking up in the event a user multi-taps different actions on a screen simultaneously.

fun LifecycleOwner.isAtLeastStarted(): Boolean {
    return lifecycle.currentState.isAtLeast(Lifecycle.State.STARTED)
}

Remote crash absorbing through UncaughtExceptionHandler

In rare cases, background crashes emerge in the wild that aren’t user-facing but nonetheless still impact crash reporting metrics. App crashes that occur in the background are imperceptible to users, but may interrupt any ongoing background processing work. These types of crashes can happen with new OS releases or mid-release changes in dynamic libraries. While we never want to fail silently, sometimes we choose to absorb crashes that occur entirely in the background and would have no user impact.

We do this through setting a custom UncaughtExceptionHandler. Through Google Firebase Remote Config, we can configure different crash signatures to match on based on their stack trace. In some cases, we may absorb a targeted crash and log the instance to our analytics suite. In other cases, where there’s potential for user recovery, we’ll intercept the crash to show resolution advice in a quick Toast message, before allowing the Android OS to continue with standard crash handling behavior.

[
  {
    "checkForCrashMessageContaining": "can't deliver broadcast",
    "toastMessage": "",
    "absorbCrash": true,
    "developerNotes": "Handle Google IssueTracker bug /245258072 for Android 13"
  }
]
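A handler driven by rules like the one above could look roughly like the following sketch. CrashRule and CrashAbsorbingHandler are hypothetical names, the rule fields simply mirror the sample config, and the onAbsorbed and showToast callbacks stand in for analytics logging and an Android Toast. Parsing the Remote Config payload is left out.

```kotlin
// Hypothetical rule mirroring the fields in the config sample.
data class CrashRule(
    val checkForCrashMessageContaining: String,
    val toastMessage: String,
    val absorbCrash: Boolean,
)

// Sketch of a handler that absorbs crashes matching a rule and forwards everything else.
class CrashAbsorbingHandler(
    private val rules: List<CrashRule>,
    private val delegate: Thread.UncaughtExceptionHandler?,
    private val onAbsorbed: (Throwable) -> Unit,
    private val showToast: (String) -> Unit,
) : Thread.UncaughtExceptionHandler {

    override fun uncaughtException(thread: Thread, error: Throwable) {
        val trace = error.stackTraceToString()
        val rule = rules.firstOrNull {
            trace.contains(it.checkForCrashMessageContaining, ignoreCase = true)
        }
        if (rule != null && rule.toastMessage.isNotEmpty()) showToast(rule.toastMessage)
        if (rule != null && rule.absorbCrash) {
            onAbsorbed(error) // e.g. log to analytics, then swallow the crash
            return
        }
        delegate?.uncaughtException(thread, error) // standard crash handling
    }
}

fun main() {
    val absorbed = mutableListOf<Throwable>()
    var forwarded = 0
    val handler = CrashAbsorbingHandler(
        rules = listOf(CrashRule("can't deliver broadcast", "", absorbCrash = true)),
        delegate = Thread.UncaughtExceptionHandler { _, _ -> forwarded++ },
        onAbsorbed = { absorbed.add(it) },
        showToast = { },
    )
    handler.uncaughtException(Thread.currentThread(), RuntimeException("can't deliver broadcast"))
    handler.uncaughtException(Thread.currentThread(), RuntimeException("something else"))
    println("absorbed=${absorbed.size} forwarded=$forwarded") // absorbed=1 forwarded=1
}
```

In an app, a handler like this would be installed with Thread.setDefaultUncaughtExceptionHandler, passing the previous default handler as the delegate so Crashlytics and the OS still see every crash that is not absorbed. Absorbing only makes sense for work that can be abandoned, since the thread that threw is already terminating by the time the handler runs.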

Trust but verify 3rd party dependency safety and external integrations

Once our team nearly eliminated crashes from our own business logic, lifecycle, or implementation issues, crashes from third-party libraries became a top source of crashes. While we vet the technical quality of dependencies we bring in, the magnitude of users, devices, and device conditions means vendors can’t always guarantee things will be 100% stable. So a general approach on the team is to wrap calls to third-party SDK functions in try/catch blocks and log caught exceptions to Google Firebase Crashlytics via non-fatal logging for observability.
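As a minimal sketch of that wrapping approach, the helper below runs a block, reports any exception through a logging callback, and returns a fallback value. safeSdkCall is a hypothetical name, and logNonFatal stands in for a Crashlytics non-fatal logger. The String-to-Int parsing in main is only a stand-in for a vendor SDK call that may throw.

```kotlin
// Hypothetical wrapper: run a third-party SDK call, report failures as non-fatals,
// and fall back to a default value instead of crashing.
inline fun <T> safeSdkCall(
    fallback: T,
    logNonFatal: (Throwable) -> Unit,
    block: () -> T,
): T = try {
    block()
} catch (e: Exception) {
    logNonFatal(e)
    fallback
}

fun main() {
    val logged = mutableListOf<Throwable>()
    val ok = safeSdkCall(fallback = -1, logNonFatal = { logged.add(it) }) { "42".toInt() }
    val failed = safeSdkCall(fallback = -1, logNonFatal = { logged.add(it) }) { "not a number".toInt() }
    println("$ok $failed ${logged.size}") // 42 -1 1
}
```

Catching Exception rather than Throwable is a deliberate choice in this sketch: errors such as OutOfMemoryError still propagate instead of being hidden behind a fallback.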

When things do go wrong, it’s important for our team to establish a tight, rapid feedback loop of reporting crashes to our vendor partners so they have visibility on issues, create tickets and prioritize fixes in their products. The State Farm Android team is actually often one of the first companies to surface crashes to vendors, which has a nice benefit of improving stability in the larger Android ecosystem. Reporting SDK crashes helps keep the provider/consumer relationship strong and creates a two-way value proposition.

Aside from library dependencies, it’s also important to double-check and safely handle launches to third party apps for browser destinations, maps, contacts, calendars and more, since Android users are free to disable and/or uninstall any given app.

Ongoing challenges and conclusion

Our app continues to grow in features and users, and our technology continues to evolve. The latest evolution of our app was the merger of the Drive Safe & Save app with the State Farm Mobile App, which brought telematics capabilities and accident detection to our users. New challenges include working with sensors, background data processing and additional partner integrations.

Despite our evolving engineering and testing best practices, it’s always unpredictable what can surface through dependency updates and updates in the Android ecosystem (new OS versions, updates to WebView, Chrome, Maps and more). What remains constant is our team’s ability to adapt and carry a team culture of providing the best possible user experience for our customers, keeping an everyday engineering focus on code safety and stability, and recovering from the unexpected.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers

The State Farm® Android Team’s Journey to 99.9+% Crash-Free Sessions was originally published in State Farm Engineering Blog on Medium.

From Tension to Trust: Rethinking How Architects and Engineers Work Together

State Farm Engineering — Fri, 13 Feb 2026 16:02:56 GMT

By Ben Justick

Introduction

Engineers and Architects have both played important roles in designing and building solutions for as far back as I can remember. However, the working dynamics between these roles have changed a lot over the past 10+ years as the industry and technology have advanced at a rapid pace around us. The days of groups of Architects huddling together in a room for weeks to come up with the “perfect” design to hand over to Engineering teams have come and gone, replaced by teams of Engineers collaborating in real time to iterate on a design. As Engineers have continued to grow their architectural acumen and responsibilities, I’ve observed some quiet tension building between some groups of Architects and Engineers as both struggle to adapt and find ways of working together that are mutually beneficial. Some Architects have a tendency to lean harder into the past and produce more design documentation in a bubble, primarily to be consumed by other Architects. Meanwhile, some Engineers are starting to produce the design documentation they need by themselves, without engaging an Architect.

There’s a way to address this expanding rift, but it’s going to take changes in mindset and behavior from both sides. I’ve personally observed and experienced how a great partnership between an Architect and an Engineer can be a true game changer when it comes to taking a great idea and making it a reality. In this post, I’ll expand on the key behaviors and practices that I feel are the most essential to maximize this partnership for mutual success.

What Makes Us So Different?

In order to better understand how we can work together, we need to spend a bit of time digging into what makes us different. I realize that not all Engineers and Architects will fall into these patterns and there may be some overlap in tendencies depending on the person. However, these are the key tendencies that I’ve observed over the years of working with Architects and Engineers in many different areas of focus and contexts.

Key Tendencies of Engineers

Excel at coming up with new concepts and ideas that make use of new technologies.
Prefer to convey new ideas by writing code.
Tend to care more about building out end-to-end solutions rather than navigating organizational dynamics and edge cases.
Very productive working independently.
Tend to think more practically and near-term.

Key Tendencies of Architects

Excel at taking complex ideas and making them easy for others to understand.
Prefer to convey new ideas by creating diagrams and visualizations.
Tend to think a lot about organizational alignment and agreements that need to be made across area boundaries to build out an end-to-end solution, and non-functionals/edge cases that could “break” a solution.
Very productive working in groups and leading group collaboration activities.
Tend to think more abstractly and long-term.

Are you seeing a pattern emerging here? Engineers and Architects both want to design and build great solutions at the end of the day, but the things they tend to care about the most and their strengths are almost at direct odds with each other. Being aware of these strengths and weaknesses is the first step in learning how to maximize this partnership in a way that leans in to the things that each role does best while also working proactively towards growth.

Collaboration for Acceleration

A few years ago, I moved to a new area in our organization and was paired up with a very talented Engineer that was working on the design and build of a solution that was targeted at enabling a new bundled quote and purchase experience in the Customer Channel. He had been working for a couple of months on a new API that would be needed to enable this experience and had made significant initial progress on the API interface design and writing some of the code that would enable this design. My initial assignment was to work with him to refine the API interface design, but we soon realized we could do a lot more than that.

This new API was targeted to be fully built and owned by a product team that had a lot of experience in developing frontend components, but limited experience with developing backend APIs. Handing over an API interface design spec along with some partially written code to this team and saying “get to work” would certainly have been a slow, confusing mess. However, this Engineer realized that he could lean in to my skillset to help the target team get up to speed on what needed to be built, so that they could hit the ground running as effectively as possible. This started with some detailed code walkthroughs/reviews to gain an understanding of how the code was structured, what code was already written, and what code was left to build out. This gave me what I needed to create both high level and detail level architecture documentation, aimed first at getting the team to see the full scope of the solution we needed to build, and then at easing them into the details of the specific work that was needed. I walked the team through sequence diagrams instead of code, so they could understand the problem from a logical perspective before trying to think about how to adjust to a new context and syntax. By working together, we were able to rapidly bring this team up to speed with the design and the work they needed to do next.

Obviously, this approach worked really well, or I wouldn’t be telling this story, but what are the key behaviors that led to that success?

From the perspective of the Engineer, it was:

Acknowledgement that architectural documentation could help bring the team up to speed with the vision and details faster than with just the API interface spec and code alone.
Willingness to invest time in walking through the details of the code to bring an Architect up to speed with the details.

From the perspective of the Architect, it was:

Being vulnerable to step outside of the typical architecture comfort zone to build an understanding of the details of a code base.
Willingness to go beyond the initial parameters of the assignment to create detailed solution architecture documentation that could help to accelerate a product team.

Getting the Organization Aligned to Enable Big Ideas

More recently, I’ve been working as part of our Enterprise Architecture area, and I’ve observed how the same types of collaborative behaviors between Architects and Engineers can be applied at a different level to achieve great outcomes. A few months ago, one of our Principal Engineers informally met with me and a couple of other Architects to talk about the high level vision for a new solution that would enable end-to-end traceability across our systems. Over the course of this initial conversation, it was clear that this Engineer had already spent a good amount of time researching and thinking about this and had already jumped ahead to identifying detailed solution and design ideas. However, multiple leaders across the department needed to become aligned on a common vision and outcomes, or these ideas were going to remain just that… ideas. We already had several teams that owned solutions that provided a portion of the capabilities that were needed for traceability, and there were some perceived overlaps in scope of ownership/responsibility across these teams. There were also several other areas across the department that were very interested in specific outcomes related to a solution like this, and some of them had started to build out their own “homegrown” solutions to address the parts of the scope they cared about the most.

We needed to get our department aligned on what the problem was, who the key players in the space were, the key gaps, and next steps that needed to start moving forward in the short term to begin to enable the longer term vision. This is where Architecture was the perfect fit. A small group of Architects and I were able to create a set of visuals in a relatively short timeframe that effectively broke down the problem space and provided a recommended near term plan of action for 1st and 2nd line leaders across the department to react to. As we iterated on our work, we shared our draft documentation back with the Principal Engineer that initially reached out to ensure we were on the right track and make updates based on feedback. Our visuals included:

A scope breakdown to clarify what was core to the problem space vs. non-core/adjacent scope.
A view of the different products in the current state that were solving for parts of the problem that showcased the points of overlap between these solutions.
A gap analysis to show where we currently had traceability across different platforms within one or more current state solutions vs. where we had gaps.

We shared our output as a pre-read to the 1st and 2nd line leader group, and though we set up an hour to discuss and walk through the documentation, we only needed 30 minutes because the documentation spoke for itself. It helped to clear up the different ideas of what was meant by “traceability” in everyone’s heads and illustrate exactly where the current state overlaps in products were, rather than everyone going off of a general feeling that there was redundancy. It also helped to get everyone focused on the gaps that needed to be addressed collectively by the teams that were already working in this space and provided a clear plan of action for bringing these teams together to work on near term next steps.

Would it have been possible for our Principal Engineer to get the key players in the department aligned and focused without engaging Architects to help with articulating the vision? Maybe, but it likely would have taken a lot longer with significantly more potential for organizational swirl. Instead, an Engineer decided to engage Architecture for help with one of the things that Architects do best, and was able to rapidly gain organizational support for a big idea.

Giving Architecture a Reality Check

All of the examples I’ve provided so far have been stories of Engineers taking the initiative to engage with Architects. So, what does it look like when the engagement starts in the other direction? To be honest, it’s hard for me to highlight one specific story because it happens so frequently and usually has a really short feedback loop. So, I’ll highlight how this usually goes more abstractly to illustrate the general pattern.

As an Architect, I get asked to think through a lot of different high level problems at a conceptual level, and I’m often asked to create some initial conceptual architecture diagrams to illustrate possible high level design options. These are often the “boxes and lines” diagrams that a lot of Engineers aren’t the biggest fans of, because they often can provide a false impression of the complexity or amount of work necessary to actually enable the solution. It’s these types of situations where early engagement with some trusted Engineers can result in the feedback needed to either make some key adjustments to a proposed conceptual architecture or to even eliminate possible design options before they get shared with other Engineers, Architects, or Leadership audiences.

There are several Engineers I’ve collaborated with over the years that I’ll reach out to for informal architecture review requests in situations like this. In many cases, I find it ideal to review initial design concepts with an Engineer that may be fairly close to the problem space and a different Engineer that might not be familiar with the scope of the problem space at all. This helps to vet the conceptual design from a practical perspective to ensure that there aren’t major technical feasibility issues or missing points of integration that need to be represented to better illustrate the potential complexity, while also ensuring that the design is easy to understand and consume by a more general engineering audience. It would be impossible for me to count how many times I’ve reached out to an Engineer with “got a few minutes to look at something…”, and within a few minutes of hopping on a call, some critical insights are shared with me that either make or break the design. While it requires me to be vulnerable about draft designs that aren’t fully “baked” yet, it’s so much better to find out a critical flaw in a design early and either adjust or scrap it entirely rather than continuing to take time and effort iterating on something just to find out later with a much wider audience that you should have done your homework.

Building Connections Organically

Over the course of this post so far, I’ve provided you with some practical examples of the mutual benefit to be gained through close collaboration between Architects and Engineers. However, unless you’re used to already working in that way, it takes some focused and intentional effort to break through those invisible barriers of misunderstanding. This starts with a mindset shift to be more vulnerable on both sides. For Architects this often means letting go of the potential fear of harsh criticism and/or being willing to admit when you don’t have a detailed enough understanding of how the code works and taking the next step to reach out to an Engineer for help. For Engineers this often starts with the acknowledgement that there’s value in good design documentation and having a willingness to engage with an Architect to help with creating the documentation you need to articulate your ideas to the masses. Sometimes a seemingly small behavioral and/or mindset shift is all that’s needed to reignite the spark of collaboration between an Architect and an Engineer. This may mean that you need to lean away from some of your existing tendencies for a bit and lean into someone else’s tendencies to demonstrate that you value their perspective and contributions. This starts to build a foundation of trust that will begin to go both ways over time, but you can’t expect the other person to acquiesce and lean into your tendencies first.

When you’re ready to take that next step to build or foster a connection, sometimes all it takes is reaching out to that Architect or Engineer you’ve worked with on a recent effort and asking, “Hey, any chance you have a few minutes to take a look at something I’m working on?”. The person on the other side of that request will more than likely feel honored that you reached out to them for their opinion, and likely will end up returning the favor.

Finding someone to reach out to could prove more challenging if you are new to an organization or if your organization is structured in a way where Architects and Engineers don’t often collaborate organically. In these cases you may need to rely on your manager, a mentor, or a peer for help with identifying someone that could be a good new connection for you, and you may want to approach this in the form of a cross-mentoring opportunity to begin with. These present a great opportunity for Architects to possibly take a deeper dive into a code base to build their low level design acumen, and/or for Engineers to expand their high level design acumen and diagramming skills in a setting that’s all about growth and learning.

Leadership’s Role in Creating and Fostering Perfect Pairings

While most of this post has been focused on the behaviors of Architects and Engineers, leaders also play an essential role in ensuring strong collaboration and partnership. Leaders should invest in understanding the key tendencies of the Architects and Engineers within their purview to look for opportunities to pair people together that may already have some overlaps in tendencies. This will create an ideal situation where the Architects and Engineers involved should be able to collaborate naturally and hit the ground running when it comes to sorting out and driving the work that needs to get done. Reflecting on the examples above, this was the case with the first example, where a strong strategic leader was able to pair me up with an Engineer that was a great fit for me, and I was able to start producing results almost immediately when I moved to the new area.

In an ideal world, leaders would always be able to create perfect pairings, but there are many situations where the people within a leader’s purview may be limited, and the Architects and Engineers may have very strong leans towards their respective tendencies. In these situations, it’s recommended leaders be more engaged and proactive in fostering and building the partnership and collaboration they would like to see between roles. This may involve taking on a more proactive role in leading initial working sessions to assist with breaking down the work and ensuring it gets assigned appropriately. It may also involve being proactive in showcasing and highlighting the benefits of the contributions and output that each role is providing in context to the overall work effort. While some Architects and Engineers may be able to effectively showcase and advocate for the value of their work, in these situations sometimes an extra boost from a leader that really sees the value of both roles is what is needed to break down pre-existing barriers and reshape people’s perspectives.

Conclusion

Over the course of this post, I’ve provided a breakdown of why the working relationships between Architects and Engineers often can have so much tension and several real-life examples of why it’s vital to change individual behaviors and mindsets to work through that tension to get to a place of trust if you want to maximize the value of both roles within your organization. In order to truly be effective, this involves change at the individual level for Architects and Engineers, and strong proactive leadership that understands both the tendencies and the value that each role can provide in context to the work at hand. While changes in mindset usually occur over time, small behavioral changes can start right away. So, what are you waiting for? Take some small action today to build or foster a connection. A small action today will eventually lead to more significant changes in mindsets over time that will result in trust and collaboration between Architects and Engineers that will lead to amazing outcomes.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.

From Tension to Trust: Rethinking How Architects and Engineers Work Together was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unifying Our Mobile Experience — How State Farm Integrated Telematics Into Its Flagship App

State Farm Engineering — Thu, 04 Sep 2025 13:45:22 GMT


By Scott Anderson and Travis Kessinger

We’ve had the privilege of working on the State Farm Mobile app for quite some time — Scott on Android (all the way back when the app was called Pocket Agent!) and Travis on iOS. Over the years, the app has undergone major transformations, from a complete rewrite to add tablet support, to a full redesign in 2017 — the same year it got its current name. In 2020, we added support for dark mode and rebranded the app to align with State Farm’s modernized vision.

Starting in September 2023, we faced our next big challenge: merging our standalone telematics app, Drive Safe & Save®, into the State Farm Mobile app. This effort spanned both Android and iOS platforms, requiring innovative solutions to deliver a unified experience across millions of devices.

In this post, we’ll share why we made this move, how we planned a seamless migration for users, the engineering challenges (and wins!) behind the scenes, and what we learned along the way. Plus, we’ll reveal the real-world impact with hard numbers and reflect on how this integration paves the way for future telematics innovation at State Farm.

Why Combine the Apps? Listening to Our Customers

Let’s talk about the two apps at the center of this effort. First, there’s Drive Safe & Save, a telematics app designed to empower drivers to personalize their auto insurance premiums based on how they drive. Using driving data, the app offers feedback to help users improve their driving habits. It also includes a feature called Accident Assistance, which detects crashes and can automatically notify emergency services, reducing response times and potentially saving lives.

Then there’s the State Farm Mobile app, our flagship app. It’s built around three key areas that customers rely on: Insurance, Claims, and Billing and Payments. It’s the go-to app for managing policies, viewing and downloading insurance cards, paying bills, and starting claims all in one place.

So why combine these apps? Simply put, it’s what our customers wanted. In a 2021 survey, about 80% of respondents said they preferred a single, integrated app experience. At the time, most users were only interacting with one app: the State Farm Mobile app or Drive Safe & Save, but not both. By merging the two, we’re giving more customers access to more features while simplifying their experience. For users, this means fewer apps to manage and more value in one place. For us as engineers, it meant tackling a unique and challenging migration to make this a reality.

The Challenge: Migrating Millions Without Missing a Beat

Planning the Migration: Strategy and Rollout

To ensure the migration proceeded smoothly, we implemented a controlled rollout strategy based on users’ auto policy state. This allowed us to implement the migration flow with a subset of users before expanding to the broader user base.

On both the Android and iOS platforms, the rollout was managed using Firebase Remote Config, where we maintained a list of auto policy states eligible for migration. This configuration allowed us to dynamically update rollout criteria without requiring app updates or causing disruptions for users.
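A minimal sketch of the kind of gating check described above, assuming the eligible states arrive as a comma-separated string of two-letter codes. The value format and the names parseEligibleStates and isEligibleForMigration are hypothetical; in the app the raw value would be read from Firebase Remote Config rather than passed in directly.

```kotlin
// Hypothetical sketch: rawConfigValue stands in for a value fetched from remote configuration.

/** Parses a comma-separated list such as "IL, tx,OH" into a normalized set of state codes. */
fun parseEligibleStates(rawConfigValue: String): Set<String> =
    rawConfigValue.split(',')
        .map { it.trim().uppercase() }
        .filter { it.isNotEmpty() }
        .toSet()

/** Returns true when the user's auto policy state is in the current rollout list. */
fun isEligibleForMigration(policyState: String, eligibleStates: Set<String>): Boolean =
    policyState.trim().uppercase() in eligibleStates
```

Because the list is data rather than code, widening the rollout in this sketch only means publishing a longer string; the client logic stays unchanged.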

Guiding Users Through a Seamless Transition

When a user in the rollout group launches the Drive Safe & Save app, they are greeted with an “It’s moving day!” screen. This screen includes a button to initiate the migration process by launching the State Farm Mobile app. At this point, the Drive Safe & Save app continues to record trips and provide Accident Assistance to ensure there’s no interruption in functionality until the migration is complete.

Tapping the “Go to the State Farm app” button launches a Firebase Dynamic Link that directs users to the State Farm Mobile app, where the migration flow begins.

Permission Acceptance and App Deactivation

After logging into the State Farm Mobile app, users are required to accept all necessary permissions to enable trip recording. After the user accepts the permissions, a configured intent is broadcast to the Drive Safe & Save app, signaling it to deactivate.

Deactivation involves:

Turning off trip recording in the Drive Safe & Save app.
Disabling the Accident Assistance feature, which is now handled by the State Farm Mobile app.

Balancing Complexity and User Experience

One of the biggest challenges in this migration process was balancing technical complexity with user experience. The goal was to make the migration flow as intuitive as possible while maintaining safeguards to ensure data integrity and continuity of features. By leveraging Firebase tools and carefully designed app interactions, we were able to achieve a migration experience that was controlled and user-friendly.

Engineering at Scale: How We Made It Happen

Modern Mobile Architecture: Under the Hood

To successfully merge Drive Safe & Save into the State Farm Mobile app, we needed a solid foundation that could support new features, scale to millions of users, and allow for rapid development on both Android and iOS. This migration was more than just moving code — it was an opportunity to modernize and align our architectural patterns across both platforms.

We focused on:

Modularization: Continuing to break features into independent modules for better code ownership and parallel development.
Feature Flagging: Using local and remote configuration to safely control rollout and minimize risk for users.
Reactive State Management: Adopting modern, reactive patterns to keep UI and data in sync.
Modern UI Frameworks: Leveraging Jetpack Compose and SwiftUI to accelerate development and improve maintainability, even as we integrated with our established codebases.
Robust Security and Privacy: Ensuring all telematics features were migrated with strict attention to user permissions and data protection.

With this foundation in place, our Android team concentrated on Jetpack Compose and Model-View-ViewModel (MVVM) to deliver scalable, maintainable features, while our iOS team integrated a new tab using SwiftUI within our established UIKit app for a seamless user experience. In the following sections, we’ll share some technical details and insights from each platform.

Android: Jetpack Compose, MVVM, and Type-Safe Navigation

When we began developing the new Drive Safe & Save feature within the Android version of the State Farm Mobile app, the team already had some experience implementing features with Jetpack Compose and Model-View-ViewModel (MVVM) architecture using StateFlow. This migration effort gave us the opportunity to build on that foundation and gain even more valuable experience with these tools.

The Drive Safe & Save feature includes over 30 screens. By implementing consistent patterns across these screens, we were able to significantly improve development speed, quality, and maintainability. Below are a few ways we leveraged Compose and MVVM principles during this project:

Type-Safe Navigation with the Navigation Component for Compose

For the Drive Safe & Save feature, we used the Navigation Component for Compose to manage navigation between composables. This allows us to take advantage of type-safe navigation, reducing the risk of runtime errors and improving code readability:

private fun navigateToVehicleDetailsScreen(lifecycleOwner: LifecycleOwner, uniqueVehicleKey: String, navHostController: NavHostController) {
    // Skip navigation when the host lifecycle is not at least started.
    if (!lifecycleOwner.isAtLeastStarted()) {
        SFLogger.d(TAG, "onNavigateToVehicleDetailScreen called, but lifecycle not at least started: not navigating")
        return
    }

    // The destination is a typed object that carries its argument, not a hand-built route string.
    val route = DssNavigationDestination.VehicleDetailsTO(uniqueVehicleKey)

    SFLogger.d(TAG, "Navigating to $route")
    navHostController.navigateSafely(route)
}
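
To illustrate why typed destinations reduce runtime errors, here is a minimal, framework-free sketch. Destination and RecordingNavigator are hypothetical stand-ins, not the app’s DssNavigationDestination or the Navigation Component’s NavHostController. The point is only that arguments travel as constructor parameters the compiler checks, rather than being spliced into a route string.

```kotlin
// Hypothetical stand-ins for illustration only.
sealed interface Destination {
    data object Landing : Destination
    data class VehicleDetails(val uniqueVehicleKey: String) : Destination
}

// Records destinations instead of driving a real NavHost, so the sketch runs anywhere.
class RecordingNavigator {
    private val visited = mutableListOf<Destination>()

    fun navigate(destination: Destination) {
        visited += destination
    }

    fun current(): Destination? = visited.lastOrNull()
}
```

With this shape, forgetting the vehicle key or passing a value of the wrong type fails at compile time instead of surfacing as a malformed route on a device.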

Reusable Composables for Consistency and Efficiency

To ensure consistency across screens, we embraced creating reusable composables. These composables are prefixed with “Sfma” to standardize their naming to indicate reusability. For example, we leveraged an SfmaCard reusable composable for the “About your discount” screen:

SfmaCard(
    sideMarginResourceId = baseR.dimen.sfma_screen_side_margin_always_zero,
    topBottomMarginResourceId = baseR.dimen.sfma_screen_side_margin_always_zero,
    backgroundColor = SfmaCardBackgroundColor.GRAY,
    strokeColor = SfmaCardStrokeColor.NONE,
) {
    Text(
        modifier = Modifier.padding(24.dp),
        text = stringResource(id = R.string.dss_about_your_discount_reminder_body),
        style = sfmaTextStyleBody(),
    )
}

SfmaCard is now being used hundreds of times in the app, helping to ensure consistency and maintainability.

State Management with StateFlow Emitting Repositories

Our repositories emit StateFlow to provide a stream of state updates for ViewModels. This ensures that the flow of data from the repository to the ViewModel and eventually the UI is seamless and efficient.

class DssAuthIndexRepository : WebServicesManager.WebServiceCallback, RemoveServiceListenerCallback {

    private val _dssAuthIndexStateTOMutableStateFlow = MutableStateFlow(DssAuthIndexStateTO())
    val dssAuthIndexStateTOStateFlow = _dssAuthIndexStateTOMutableStateFlow.asStateFlow()

...

Screen State Defined with Sealed Interfaces

To manage screen-specific state, we use sealed interfaces. For example, the DssVehicleDetailsScreenState interface encapsulates the various states the screen can be in, helping to simplify state management and eliminate warnings.

sealed interface DssVehicleDetailsScreenState : Serializable {
    data object LoadingTO : DssVehicleDetailsScreenState
    data class ContentTO(val dssVehicleDetailsContentTO: DssVehicleDetailsContentTO) : DssVehicleDetailsScreenState
    data class ErrorTO(var appMessages: Set = mutableSetOf(), val vehicleDetailsErrorReason: VehicleDetailsErrorReason) : DssVehicleDetailsScreenState
}
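
As a simplified illustration of what a sealed state type buys, the sketch below uses a hypothetical ScreenState with three cases. It is not the app’s DssVehicleDetailsScreenState, and the payloads are invented for the example.

```kotlin
// Simplified, hypothetical state type; the real screen state carries richer payloads.
sealed interface ScreenState {
    data object Loading : ScreenState
    data class Content(val vehicleName: String) : ScreenState
    data class Error(val reason: String) : ScreenState
}

// No `else` branch: because `when` is used as an expression over a sealed type,
// the compiler requires every subtype to be handled. Adding a new state later
// turns each unhandled `when` into a compile error.
fun render(state: ScreenState): String = when (state) {
    ScreenState.Loading -> "Loading..."
    is ScreenState.Content -> "Vehicle: ${state.vehicleName}"
    is ScreenState.Error -> "Something went wrong: ${state.reason}"
}
```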

ViewModels and StateFlow for Reactive Data Handling

Our ViewModels use StateFlow to expose state updates to the UI. This helps keep the architecture reactive and ensures that the composables always reflect the latest data.

class DssVehicleDetailsViewModel(private val uniqueVehicleKey: String, private val savedStateHandle: SavedStateHandle) : ViewModel() {

    val screenStateTOStateFlow = 
        savedStateHandle.getStateFlow(KEY_SCREEN_STATE_TO, DssVehicleDetailsScreenState.LoadingTO)

Composables for Screen Composition

Finally, our screen composables consume the ViewModel state and render the UI accordingly:

val screenStateTO by viewModel.screenStateTOStateFlow.collectAsStateWithLifecycle()
...
when (screenStateTO) {
    DssVehicleDetailsScreenState.LoadingTO -> {
        SfmaLoading(
            loadingConfigurationTO = 
                LoadingConfigurationTO.LoadingWithDelayedTextConfigTO(stringResource(id = R.string.dss_landing_loading_label)),
        )
    }
    is DssVehicleDetailsScreenState.ContentTO -> {
        DssVehicleDetailsScreenContent(
            scaffoldPaddingValues = scaffoldPaddingValues,
            contentTO = screenStateTO,
            onDiscountTapped = onDiscountTapped,
            onAddOdometerReadingTapped = onAddOdometerReadingTapped,
            onOrderNewBeaconTapped = onOrderNewBeaconTapped,
            onPairNewBeaconTapped = onPairNewBeaconTapped,
        )
    }
    is DssVehicleDetailsScreenState.ErrorTO -> {
        when (screenStateTO.vehicleDetailsErrorReason) {
            VehicleDetailsErrorReason.DSS_AUTH_INDEX -> onDssAuthIndexTechError()
        }
    }
}

By embracing Jetpack Compose and MVVM, we modernized our Android development approach, resulting in a seamless and reliable Drive Safe & Save integration within the State Farm Mobile app.

iOS: Blending SwiftUI into a UIKit Legacy

The iOS State Farm Mobile app has been around for some time now. Of course this means the app started out using UIKit for its user interface. Over time, as we updated our minimum supported iOS version (currently iOS 16) and gained experience with SwiftUI, we began integrating SwiftUI into the app. With the migration of the Drive Safe & Save functionality into the State Farm Mobile app, one of the first decisions was where to place this functionality. Prior to the migration, we had 5 tabs: Overview, Insurance, Claims, Finances, and More. The More tab had little functionality so we decided to remove it and create a new tab called Safe & Save for all of the new features.

The Setup

Around the time we started the Drive Safe & Save migration, we were also starting to break our code into more manageable pieces. For the Drive Safe & Save functionality, we decided to create a target that would contain all of it. While there are still lots of pieces of functionality in our main State Farm target, creating a new target allowed for better overall code organization.

For our UI related changes, we have an existing UITabBarController that was modified to include this new tab. Since our app is UIKit-based, we used UIHostingController. We created a DSSHostingController with content called DSSLandingView. This DSSHostingController lives in the State Farm target, allowing it to navigate to views in the State Farm target, such as the profile and preferences screen. For example, the following function is in DSSHostingController:

func didTapProfileAndPreferences() {
   self.performSegue(withIdentifier: Segue.profileAndPreferences.identifier, sender: nil)
}

Managing State

The DSSLandingView populates its content from an API call. There are multiple states a user could be in with their vehicles. State Farm is an insurance company that offers more than just auto products, so it's valid for a user to have no vehicles. A user can also have one or more vehicles, with some being eligible for Drive Safe & Save, some enrolled, and some not eligible. In the view model, we have a published property representing this state as an enum:

enum DriveSafeSaveLandingState {
   case determining
   case enrolled
   case notEligible
   ...
}

And in the view:

var body: some View {
   switch state {
   case .determining:
      EmptyView()
   case .enrolled:
      EnrolledView()
   case .notEligible:
      NotEligibleView()
   }
}

Navigation

Navigation is another thing SwiftUI makes extremely easy to handle. We used .navigationDestination wherever we need to push views. For example:

.navigationDestination(for: ProfileDestination.self) { destination in
   switch destination {
   case .communicationSettings:
      DSSCommunicationSettingsView()
   case .contactUs:
      DSSContactUsView()
   case .helpTopics:
      FAQTopicsView()
   case .aboutTheApp:
      AboutTheAppView()
   case .profilesAndServices:
      ProgramsAndServicesView()
   }
}

This integration of a new SwiftUI tab into our existing UIKit-based iOS app allowed us to deliver Drive Safe & Save features with a modern, flexible user interface, all while maintaining seamless navigation and a consistent user experience within the State Farm Mobile app.

Improving the User Experience: Before and After

Merging Drive Safe & Save into the State Farm Mobile app was never just about reducing the number of apps on a user’s phone — it was about making every interaction simpler, more intuitive, and more valuable.

Before the migration:

Users who wanted to enroll in Drive Safe & Save or view their telematics data needed to download, log in to, and manage a separate app.
Many State Farm customers were unaware of Drive Safe & Save, or missed out on features like trip feedback and Accident Assistance simply because they weren’t using both apps.
Switching between apps to manage policies, pay bills, and access telematics features created friction and increased the likelihood of missing important information.

After the migration:

Everything is in one place: users can enroll in Drive Safe & Save, access driving feedback, manage policies, pay bills, and start claims all from the State Farm Mobile app.
Drive Safe & Save features are now more prominent and accessible, leading to increased enrollments and engagement.
The migration flow was carefully designed to ensure users didn’t lose access to critical features like trip recording and Accident Assistance, so the transition felt seamless.
Unified navigation and consistent UI patterns make it easier for users to discover and use new features.
With fewer apps to juggle, users have a more streamlined, reliable, and satisfying State Farm experience.

By bringing everything together under one app, we’ve not only simplified the customer journey, but also set a new baseline for what users can expect from their State Farm app going forward.

Lessons Learned: What Worked and What We’d Do Differently

No major migration comes without a few surprises. Along the way, we encountered unexpected challenges, uncovered opportunities for smarter solutions, and learned valuable lessons about both engineering and project management. In this section, we’ll highlight a few of the key insights and takeaways that will help guide us in future efforts.

Geocoding at Scale: How We Turbocharged the Trips List with Smart Caching

One of the most interesting technical challenges we tackled during the Android migration was optimizing the performance of our Trips Landing screen. This screen displays all trips taken by users and other drivers on their policy over the past 30 days. For larger households, this can mean well over 100 trips — each with its own data to load.

A key piece of information we display for each trip is the destination city, which we derive by reverse geocoding the trip’s ending latitude and longitude using Android’s Geocoder class. While the destination is being fetched, we show a loading state in place of the city name.

The Problem: Geocoder Bottlenecks

When displaying a large number of trips, we initially encountered performance issues related to reverse geocoding each destination in real time. This led to slow loading times for users with extensive trip histories, in part due to external service rate limits and caching behaviors.

The Solution: Lazy Loading and Smarter Caching

Credit goes to fellow engineer Andrew Erickson, who devised an innovative two-part solution that made the Trips Landing screen performant:

1. On-Demand Geocoding

Rather than processing every trip’s destination at once, we now trigger geocoding only for destinations that are likely to be viewed soon. This reduces unnecessary processing and network calls, especially when users scroll quickly through their trip history.
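
The post does not say how “likely to be viewed soon” is decided. One plausible reading, sketched here under that assumption, is to geocode only the rows in the visible window plus a small prefetch margin. TripRow, tripsToGeocode, and the default margin of 5 are all hypothetical.

```kotlin
// Hypothetical model: a trip row whose destination city may not be resolved yet.
data class TripRow(
    val id: String,
    val latitude: Double,
    val longitude: Double,
    val city: String? = null,
)

/**
 * Returns the trips that should be geocoded now: rows inside the visible window,
 * extended by [prefetch] rows on each side, that do not have a city yet.
 */
fun tripsToGeocode(
    trips: List<TripRow>,
    firstVisible: Int,
    lastVisible: Int,
    prefetch: Int = 5,
): List<TripRow> {
    if (trips.isEmpty() || firstVisible > lastVisible) return emptyList()
    val from = (firstVisible - prefetch).coerceAtLeast(0)
    val to = (lastVisible + prefetch).coerceAtMost(trips.lastIndex)
    if (from > to) return emptyList()
    return trips.subList(from, to + 1).filter { it.city == null }
}
```

In a list UI, the visible indices would come from the list’s scroll state, and rows that scroll past quickly never trigger a lookup at all.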

2. Optimized Caching

We enhanced our caching mechanisms to better handle repeat destinations and minor GPS variations. By grouping similar locations and leveraging in-memory storage, we minimize redundant geocoding requests and improve response times.
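
The grouping strategy is not spelled out in the post. A common way to make nearby GPS fixes share a cache entry is to round coordinates to a fixed grid before using them as a key, which is what this sketch assumes. CityCache, the 3-decimal precision (about 110 m of latitude), and the 500-entry limit are hypothetical choices, and the reverseGeocode lambda stands in for a call to Android’s Geocoder.

```kotlin
import kotlin.math.pow
import kotlin.math.roundToLong

/** In-memory LRU cache keyed by coordinates rounded to a grid, so nearby fixes share one entry. */
class CityCache(decimals: Int = 3, private val maxEntries: Int = 500) {
    private val scale = 10.0.pow(decimals)

    // accessOrder = true keeps least-recently-used entries first, so they are evicted first.
    private val cache = object : LinkedHashMap<Pair<Long, Long>, String>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<Pair<Long, Long>, String>?): Boolean =
            size > maxEntries
    }

    /** Number of times the underlying geocoder was actually invoked. */
    var lookups = 0
        private set

    private fun key(lat: Double, lon: Double): Pair<Long, Long> =
        (lat * scale).roundToLong() to (lon * scale).roundToLong()

    /** Returns the cached city for this grid cell, or calls [reverseGeocode] and caches a non-null result. */
    fun cityFor(lat: Double, lon: Double, reverseGeocode: (Double, Double) -> String?): String? {
        val k = key(lat, lon)
        cache[k]?.let { return it }
        lookups++
        return reverseGeocode(lat, lon)?.also { cache[k] = it }
    }
}
```

Rounding is a deliberate trade-off: two points on opposite sides of a cell boundary still produce separate lookups, but repeat destinations such as a home driveway or office parking lot collapse to a single entry.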

The Impact

Thanks to Andrew’s combination of on-demand loading and optimized caching, we slashed unnecessary Geocoder calls and cut down on loading times — even for users with very large trip histories. Users now see their trip destinations populate quickly, and the Trips Landing screen remains fast and responsive.

Optimizing destination city loading on the Trips Landing screen was a great example of how thoughtful engineering and innovative solutions turned a sluggish feature into one that feels seamless for users. These improvements translate directly to a more polished and reliable experience for our users.

Reducing Friction in Permission Handling

For Drive Safe & Save to work correctly, the Android version of the app needs several permissions from the user during onboarding. Shortly after release, we noticed a high drop-off rate on the location permission screen. We suspected that users were downgrading the location permission from the settings screen, for example choosing “Don’t allow” and then switching back to “Allow all the time.” On Android, downgrading a permission this way causes the OS to kill the app process.

The State Farm Mobile app’s security logic returns users to the login screen after a process death. As a result, users who downgraded a permission had to log in again and navigate back to the Drive Safe & Save tab, creating a major friction point.

The Solution: Restore Sessions After Permission Downgrades

We added logic to detect permission downgrades and, when possible, restore the user’s authenticated session. This allowed users to pick up where they left off without having to log in again.
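
The post does not describe how downgrades are detected or under what conditions a session may be restored, so the following is only one possible shape for that decision. It assumes the app persists a snapshot of the last observed permission level and a session expiry time before the process dies. LocationAccess, PermissionSnapshot, and shouldRestoreSession are hypothetical names.

```kotlin
// Declaration order matters: later entries grant more access than earlier ones.
enum class LocationAccess { DENIED, WHILE_IN_USE, ALWAYS }

/** Hypothetical snapshot persisted while the user is still signed in. */
data class PermissionSnapshot(val access: LocationAccess, val sessionValidUntilMillis: Long)

/**
 * Decides whether a cold start looks like the result of a permission downgrade
 * rather than an ordinary process death, and the prior session is still valid.
 */
fun shouldRestoreSession(last: PermissionSnapshot?, current: LocationAccess, nowMillis: Long): Boolean {
    if (last == null) return false
    val downgraded = current < last.access // enums compare by declaration order
    return downgraded && nowMillis < last.sessionValidUntilMillis
}
```

Bounding the restore by a session expiry keeps the shortcut from weakening the login requirement for process deaths that have nothing to do with permissions.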

These changes led to an improvement in onboarding completion rates. Monitoring analytics and implementing Firebase non-fatal events helped us quickly identify and confirm the root cause of the drop-off, reinforcing the value of closely tracking critical user flows.

Estimating the Unknown is Hard

At the outset of the migration, we underestimated just how challenging it would be to predict our delivery timeline. Our initial approach to project management was rough around the edges — we were dealing with shifting requirements, new technical hurdles, and the complexity of coordinating two platforms. As a result, our story tracking and estimation lacked the rigor and clarity needed for a project of this scale.

After a few sprints of missed estimates and unclear progress, we realized we needed a better system. We invested in more disciplined story management: breaking down work into smaller, well-defined stories, setting clearer acceptance criteria, and regularly updating progress. We improved communication between engineers, product owners, and stakeholders to ensure everyone had a shared understanding of priorities and blockers.

With improved visibility into our backlog and progress, we could finally provide more accurate timelines. This new level of transparency also made it easier to make the case for bringing on additional engineering talent — helping us stay on track and meet our objectives.

The lesson: big migrations demand more than just technical skill — they require intentional, evolving project management practices to keep everything moving forward.

Need for Continuous Regression Testing

As development progressed, many stories impacted the same areas of the app’s codebase. After several sprints, both testers and engineers occasionally discovered that features completed in previous sprints had defects. This highlighted the need for a plan to maintain the quality of previously completed work throughout the project.

To address this, our testing team committed to ongoing regression testing for the duration of the migration effort. Each sprint, they revisited and validated features from previous sprints to ensure that recent changes had not introduced new issues. This continuous regression testing helped us catch and resolve defects early. It was easier for us engineers to resolve defects that were introduced recently.

This proactive approach to regression testing ensured that quality remained a top priority throughout the migration. By continuously validating previous work, we minimized the risk of defects slipping through and preserved high quality as new features were implemented.

The Results: Adoption, Stability, and Satisfaction

Millions of users have successfully migrated from the standalone Drive Safe & Save app to the State Farm Mobile app. This seamless transition has resulted in a significant increase in app adoption and engagement, with more users exploring features and returning to the app regularly. Here’s a look at the impact so far:

User Growth: Since the migration began, the State Farm Mobile app has seen an approximate 20% increase in active users.
Drive Safe & Save Enrollments: Monthly Drive Safe & Save mobile app initiated enrollments have doubled.
Drive Safe & Save Trip Recording: 85% of all Drive Safe & Save trips are now recorded through the State Farm Mobile app instead of the legacy app. This number is projected to reach over 90% as more users complete the transition.
Accident Assistance: Enrollments doubled.
Exceptional Stability: Despite the complexity of the migration and the influx of new users, the app continues to deliver a stable experience, with an average crash-free rate of 99.98% on both platforms.
Customer Satisfaction: Across both Android and iOS, the app maintains an impressive average customer satisfaction rating of 92.6%.

The Road Ahead: Expanding Telematics

This migration was a large effort involving collaboration across numerous teams at State Farm. Engineers, designers, product owners, testers, and other stakeholders all worked together to achieve this milestone. For both of us, being part of such a successful and collaborative effort has been one of the highlights of our careers.

As proud as we are of this achievement, we know this is only the beginning. The integration of telematics into the State Farm Mobile app opens up exciting new possibilities for innovation. We’re just scratching the surface of what telematics can do to empower users, help improve driving habits, and enhance safety. The future is bright for telematics in the State Farm Mobile app, and we’re ready to continue driving forward!

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.

Unifying Our Mobile Experience — How State Farm Integrated Telematics Into Its Flagship App was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

DevSecFinOps: The Challenge of Implementing a Secure and Cost-Effective Container-Based CI/CD…

State Farm Engineering — Thu, 17 Jul 2025 15:01:31 GMT

DevSecFinOps: The Challenge of Implementing a Secure and Cost-Effective Container-Based CI/CD System

By Eddie Northcutt and Lane Leake

Introduction

Running CI/CD at scale can feel like juggling on a unicycle; it’s all about balance. One misconfiguration and everything comes crashing down: timeouts, long build times, and unhappy engineers. At State Farm, we process millions of CI/CD jobs each month with a self-managed GitLab and GitLab CI/CD implementation. In this article, we’ll share how we do it using a combination of custom GitLab runners leveraging containers with the Sysbox runtime, and EC2 auto-scaling.

DevSecFinOps: The Challenge of Implementing a Secure and Cost-Effective Container-Based CI/CD System

When you’re pushing thousands of commits a day and each commit triggers multiple pipelines, high concurrency is just the first challenge. We also need:

Fast Feedback: Engineers shouldn’t have to wait hours for a build to complete or for their job to even start.
Security & Isolation: Each build environment must be ephemeral and isolated to protect against malicious code or accidental resource hijacking.
Cost Optimization: At millions of jobs per month, even slight inefficiencies can multiply into major bills.
Flexibility: Engineers should be able to rapidly develop prototypes based on new technology and not be constrained by CI/CD infrastructure.

Given all these requirements, it may seem like we’re trying to have our cake and eat it too; however, that’s simply the reality of modern software development at scale. GitLab Runner supports a variety of compute engines (executors), and some of these allow us to use containers, which greatly increases the ease of supporting additional software stacks. Containers also introduce new layers of complexity and vulnerabilities that must be addressed.

Our Architecture in a Nutshell

While this simplified diagram might seem like a lot of moving parts for running some containers, continue reading to learn why this is more challenging than it might appear.

Challenge # 1: Container Security and Isolation

Security is a critical aspect of modern software development, especially when it comes to containerization. Honestly, it’s really hard to get right. One of the significant challenges we faced was a container escape vulnerability identified during penetration testing (pen test). The Pen Test team successfully executed a Docker container escape, gaining access to the root EC2 system. This incident highlighted the urgent need for enhanced security measures in our containerized environments. Ultimately, the issue was due to the fact that we were creating privileged containers in order to run Docker-in-Docker (DinD) for our GitLab runners. While this is a common pattern, using DinD on shared infrastructure poses significant security risks.

Traditionally, your only options to solve this problem were to use a privileged container, bind mount the host’s Docker socket (equally insecure), migrate to virtual machines and manage those, or use a tool like Kaniko to build images without DinD. However, these solutions either compromise security, limit functionality, or require significant changes to existing workflows. All things we wanted to avoid.

Thinking Outside the Box with Sysbox

To address this critical finding, we implemented Sysbox. If you haven’t heard of Sysbox, it’s an open-source container runtime developed by Nestybox (since acquired by Docker) that enhances container isolation by utilizing Linux user namespaces and virtualizing portions of procfs and sysfs. This allows containers to run system-level software seamlessly and securely. Acting as a "container supercharger," Sysbox enables existing container managers and orchestrators to deploy containers with hardened isolation, without requiring modifications to workflows or images, all while coexisting with other container runtimes on the same host.

In other words, it allows containers to run workloads typically reserved for virtual machines, without compromising security. Ultimately, it enables the deployment of our Docker-in-Docker setups without requiring privileged containers, mitigating potential security risks.

Playing Nice in the ~Sand~Sysbox

Thankfully, this works quite well out of the box with GitLab Runner. All we need to do is configure the runner’s config.toml to use the Sysbox runtime by ensuring the runners.docker.runtime is set to sysbox-runc. Below is an example configuration that illustrates how to set up the GitLab runner with Sysbox:

concurrent = 1
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "runner-01234"
  unhealthy_interval = "15m0s"
  url = "https://private-instance.gitlab.com/"
  # ...rest of the configuration...
  executor = "docker"
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      AccessKey = "[REDACTED]"
      SecretKey = "[REDACTED]"
      BucketName = "[REDACTED]"
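  [runners.docker]
    # Run job containers with the Sysbox runtime instead of the default runtime.
    runtime = "sysbox-runc"
    # Privileged mode is no longer needed for Docker-in-Docker.
    privileged = false
    # ...rest of the configuration...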

When this is set, the GitLab runner starts the job in a system container powered by Sysbox, granting it specific capabilities that allow it to spin up an inner container image running a Docker daemon, much like a Docker-in-Docker setup. Our build container then interacts with this Docker daemon, enabling us to continue using Docker without sacrificing security, all while further enhancing the isolation between the EC2 host and the job container. As a bonus, we were able to make this change without requiring any modifications to our users’ existing CI/CD pipelines, which is a huge win for all of us.

Challenge # 2: Slow EC2 Auto-Scaling

While Sysbox solved our security challenges, we still faced performance issues with our EC2 auto-scaling setup. Engineers expect automated pipelines to be fast, efficient, and reliable. Waiting for infrastructure to be provisioned can be a bottleneck, as each time a new EC2 host is scaled up it needs to run through the initial cloud-init process. In our observations, this added roughly 3–5 minutes of job time in the worst cases. Additionally, these machines are cordoned off after 20 jobs to ensure reliability, meaning that the process of creating a new machine is repeated frequently.

Pulling Ourselves Up by Our Bootstraps

To solve this problem, we created a custom AMI that is pre-loaded with all the tools and configuration needed to start the machine. Now, our EC2 instances can be provisioned in under 60 seconds. This results in a significant improvement over the previous 180–300 seconds.

This change has resulted in a substantial reduction in pipeline job durations and wait times, allowing developers to focus more on coding and less on waiting. Additionally, provisioning is more reliable, as dependencies are bundled with the AMI and we ensure each machine is created using the same tooling.

Time is Money

As some of you know all too well, part of the EC2 pricing is based on the time the instance is running, so this change has also resulted in a decrease in our AWS spending. We estimate that this improvement saves approximately 187,200 compute minutes each month and 6,240 minutes each day.
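
As a rough sanity check on those two figures, the sketch below shows how they relate and what they would imply about launch volume. The 30-day month, the 60-second new boot time, and the idea that the whole saving comes from faster boots alone are assumptions made for illustration, so the launch counts are an order-of-magnitude estimate rather than measured numbers.

```python
# Sanity check on the savings figures quoted above.
DAILY_MINUTES_SAVED = 6_240
DAYS_PER_MONTH = 30  # assumption: a 30-day month

monthly_minutes_saved = DAILY_MINUTES_SAVED * DAYS_PER_MONTH

# Assumption: a launch used to take 180-300 s and now takes about 60 s.
OLD_BOOT_SECONDS = (180, 300)
NEW_BOOT_SECONDS = 60


def implied_launches_per_day(old_boot_seconds: int) -> float:
    """Launches per day needed to save DAILY_MINUTES_SAVED from boot time alone."""
    saved_minutes_per_launch = (old_boot_seconds - NEW_BOOT_SECONDS) / 60
    return DAILY_MINUTES_SAVED / saved_minutes_per_launch


if __name__ == "__main__":
    print(f"Monthly minutes saved: {monthly_minutes_saved:,}")
    for old in OLD_BOOT_SECONDS:
        launches = implied_launches_per_day(old)
        print(f"Old boot time {old}s -> about {launches:,.0f} launches per day")
```

Under these assumptions, the daily figure multiplied by 30 matches the monthly figure, and the saving per launch falls between 2 and 4 minutes.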

With infrastructure spin-up times slashed, we could finally turn our attention to another major operational concern: controlling the ballooning costs and performance impacts associated with networking and container image management.

Challenge # 3: Networking Costs

While the previous challenges were primarily focused on security and performance, we also had to address the cost of networking. With millions of CI/CD jobs running each month, the data transfer costs can add up quickly. To throw salt in the wound, many engineering teams fall into the trap of creating “Swiss army knives” for their CI/CD needs, resulting in Docker images that are often larger than necessary. This not only increases the complexity of the build process but also leads to significant challenges: AWS Network Data Transfer fees, performance issues from pulling large images, and poor network design relying on NAT Gateways to reach Docker registries.

Lean, Mean, CI/CD Machines

To solve these problems, we encouraged the optimization of GitLab CI/CD images and categorized them into build-time and runtime images. Encouraging build-time images to be single-purpose, as well as promoting the use of reusable CI/CD components with very specific uses, has reduced CI/CD image sizes and the number of “One-Image-To-Rule-Them-All” images. We also created VPC Endpoints to one of our SaaS container registry providers, which dramatically reduced our NAT Gateway expenses and even improved network performance when retrieving public images.

Cache in the Bank: Custom Docker Registry Proxy

While these changes helped, they didn’t have the impact we were hoping for. Given the container-first approach of our CI/CD solution, another challenge is retrieving and storing container images used in jobs. Additionally, there is the network cost associated with pulling these images from external registries, especially when images need to be pulled frequently due to the ephemeral nature of the machines. AWS NAT Gateway costs are no joke and, at scale, add up quickly. To address this, we implemented a Custom Docker Registry Proxy that caches frequently accessed images pulled from our private GitLab container registry, reducing the need for repeated data transfers and minimizing costs for our most commonly pulled images.

Our implementation of this solution is inspired by the great work @rpardini has done with the Docker Registry Proxy project. This solution acts as a man-in-the-middle (MitM) intercepting proxy based on Nginx, positioned between the GitLab Shared AWS Runners and the primary GitLab container registry housing CI/CD images used at State Farm.

Once a request to the GitLab registry is intercepted, we cache large blob/layer requests, which tend to incur significant latency and data transfer costs. Future requests for the same blob/layer are served from the cache, reducing the need to transfer data from upstream registries. We do not cache manifests, as that allows us to see if the image has changed and ensure we are only pulling blobs that we do not already have cached.
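
The blob-versus-manifest decision can be sketched as a small routing function. This is an illustrative Python sketch, not the Nginx configuration the proxy actually uses; `blob_cache_key` is a hypothetical name, and the only external fact it relies on is the registry HTTP API v2 path layout (`/v2/<name>/blobs/<digest>` and `/v2/<name>/manifests/<reference>`).

```python
import re
from typing import Optional

# Registry HTTP API v2 paths:
#   /v2/<name>/blobs/<digest>        (image layers and config)
#   /v2/<name>/manifests/<reference> (tag or digest)
_REGISTRY_PATH = re.compile(
    r"^/v2/(?P<name>.+)/(?P<kind>blobs|manifests)/(?P<ref>[^/]+)$"
)


def blob_cache_key(method: str, path: str) -> Optional[str]:
    """Return a cache key for cacheable blob downloads, otherwise None."""
    if method.upper() != "GET":
        return None  # uploads and other methods always go upstream
    match = _REGISTRY_PATH.match(path)
    if match is None or match["kind"] != "blobs":
        return None  # manifests are never cached, so tag changes are seen
    # Blobs are addressed by digest, so the digest identifies the bytes.
    return match["ref"]
```

Because a blob is addressed by the digest of its content, a cached copy cannot go stale, while a manifest behind a tag such as `latest` can change at any time. That is why only blobs are safe to serve from the cache.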

One of the benefits of our implementation over the original project’s is that GitLab’s access controls are still enforced. If a pipeline user lacks permission to access a Docker registry image, they will receive a 403 error, even if the image is available in the cache. As a bonus, it allows us to control which registries are proxied and cached, so we can exclude registries we do not want to cache, such as internal Elastic Container Registries.
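
The ordering that makes this work is "authorize first, then consult the cache". The sketch below illustrates only that ordering. How the real proxy asks GitLab for an authorization decision is not described here, so `is_authorized`, `fetch_upstream`, and `serve_blob` are hypothetical stand-ins.

```python
from typing import Callable, Dict, Tuple


def serve_blob(
    digest: str,
    token: str,
    is_authorized: Callable[[str, str], bool],
    cache: Dict[str, bytes],
    fetch_upstream: Callable[[str], bytes],
) -> Tuple[int, bytes]:
    """Check access first, then serve the blob from cache or upstream."""
    if not is_authorized(token, digest):
        return 403, b""  # denied even when the blob is already cached
    if digest not in cache:
        cache[digest] = fetch_upstream(digest)  # cache miss: fill the cache
    return 200, cache[digest]
```

With this ordering, a cache hit never bypasses the permission check, because the cache lookup only happens after the request has been authorized.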

After this solution had some time to bake in and the cache had been populated, we saw a significant reduction in the number of requests made to the GitLab registry. This has led to a substantial decrease in our AWS NAT Gateway costs, as well as improved performance for our CI/CD pipelines. The Custom Docker Registry Proxy has become an essential component of our CI/CD infrastructure, allowing us to efficiently manage container images while keeping costs under control. Most recently we are seeing:

~93% cache hit rate
~25–30TB of data served per day from the cache

This solution has not only helped us slash our network costs, but also improved the speed of our CI/CD pipelines by reducing the time it takes to pull container images. By caching frequently accessed CI/CD images, we have minimized the networking hops required for pulling a Docker image for GitLab pipeline jobs.

Conclusion

Operating CI/CD at scale is a continuous journey of balancing security, performance, and cost, with each decision introducing new considerations and opportunities for innovation. At State Farm, we’ve architected a solution that leverages containerization, advanced runtimes like Sysbox, custom AMIs, and network optimizations to create a robust, secure, and highly performant CI/CD ecosystem. Along the way, we’ve encountered and solved real-world challenges that many organizations face as they scale their development pipelines.

Our experience has shown that success at scale isn’t about a single tool or breakthrough, but about layering solutions that reinforce each other. By focusing on secure isolation with Sysbox, speeding up infrastructure provisioning with custom AMIs, and reducing network and storage costs through caching and image optimization, we’ve been able to deliver a developer experience that is both agile and sustainable.

As CI/CD requirements continue to evolve, so too will our architecture. We’re excited to keep exploring new ways to empower our engineering teams while keeping security and costs in check. We hope our journey inspires others facing similar challenges, and we welcome your thoughts, questions, and war stories in the comments below.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.

DevSecFinOps: The Challenge of Implementing a Secure and Cost-Effective Container-Based CI/CD… was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Rendering Equation: Client + Server + Framework

State Farm Engineering — Mon, 19 May 2025 14:36:30 GMT

By Jordan Leeper

Introduction

When I first started my career 12+ years ago, I had no idea how client-server interactions worked. I often see new software engineers that have learned a JavaScript framework in college or in a bootcamp lacking some critical understanding of how their frontend application interacts with its API(s). I wish that I could have had someone explain to me some of those basic concepts, like what is client-side rendering (CSR)? What is server-side rendering (SSR)? How do they work together? How do they work for different frameworks? Hopefully after reading this, you will feel confident in answering these questions!

The Basics

Regardless of CSR or SSR, the browser interacts with the server the same way.

Here’s a high-level breakdown:

The client (browser) sends a GET request to the server (for an application)
The server responds with an HTML file
The client (browser) requests any additional assets referenced in the HTML file, such as scripts and stylesheets

Almost no HTML elements exist in the file, because it is all built out at runtime when the browser loads the application’s JavaScript. As you can see, the <script> tag will run to load the JavaScript which constructs the views for the application. This is true for every CSR framework that dynamically creates HTML via JavaScript in the browser.
CSR is often very appealing because of the abundance of frameworks available and how easy they are to use. Generally, they manage routing in the browser instead of going back and forth to a server to generate HTML. They also provide an excellent developer experience when paired with a tool like Vite that can manage hot reloading and file bundling. Another advantage of a CSR framework is that the final result of the build is a set of static files that don’t need to be run on a server. This makes deploying and managing a UI as simple as uploading files to cloud storage rather than managing containers or servers.
Server-side Rendering (SSR)
Server-side rendering is when the view (HTML) is created by the server and then sent to the client (browser).
When it comes to SSR, this is the tried-and-true experience that many web developers have used since the beginning. The browser makes an HTTP GET request for a file and the server returns a fully built out HTML file filled with elements that the browser uses. That HTML page may sometimes use some client-side JavaScript for things like form validation, animations, reactivity, and occasionally adding an element to the UI, but generally most of the HTML is already there and was provided by the server. Things like Spring JSP, Node + EJS, .NET Razor, and PHP are probably some frameworks that come to mind.
These days, most front-end developers don’t actually mean those frameworks when talking about SSR. They are usually referring to frameworks that combine both CSR and SSR together. For our context, let’s consider SSR to refer to frameworks that provide those capabilities like Next.js or Nuxt or SvelteKit.
So, what is it then?
Basically, an SSR framework allows you to render the HTML on a server and send it to the client so that the content renders immediately. Then, the client-side part of the framework (such as React, Vue, etc.) kicks in. Now you are able to take advantage of your CSR framework’s features, such as event handling and easy ways to template your UI. Let’s keep digging in!
SSR Frameworks
These SSR frameworks allow you to use the CSR framework/library you might be familiar with, but also take advantage of all of the features of SSR!
Other frameworks that offer SSR built in are:
- Angular
- Astro
Each of these frameworks offers different pros and cons and is at a different stage in terms of what features it offers. For example, at the time of writing, only Next.js offers a fully robust Incremental Static Regeneration capability. This can be set up manually with other frameworks, but requires more overhead.
With Angular, for example, enabling SSR takes only one CLI command to add to your existing non-SSR Angular project: ng add @angular/ssr. The advantage here is that you don't have to migrate your CSR application to a completely different framework to enable SSR like you would need to do with React to Next.js or Vue to Nuxt.
The CSR + SSR Problem
Like I mentioned earlier in the post, CSR frameworks look for an element in the HTML and then mount onto it. They will start to append and build out the HTML via JavaScript. When this happens with an already fully created HTML view that was provided via SSR, the HTML would actually get destroyed and then re-created. This results in a flash of the page as the content is visible for a split second, before it’s destroyed by the JavaScript and then added back.
The CSR + SSR Solution = Hydration
How can we fix this page flashing? The answer is something you’ve probably heard about. Hydration!
Hydration, while complex to implement, is actually quite simple to understand. Rather than the CSR framework destroying the already rendered HTML elements, let’s provide them to the framework instead! JavaScript often needs to apply interactivity to the HTML by attaching event listeners to buttons, the window, or other elements. This is where the concept of hydration comes from, since the page is not interactive until it’s been “watered” by the JavaScript that has been loaded. This also prevents the page from flashing, since the CSR framework doesn’t rebuild the HTML elements.
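
To make the difference concrete, here is a toy model in Python rather than a real browser or framework. `Element`, `mount_by_replacing`, and `hydrate` are hypothetical names invented for this sketch; real frameworks do far more work, such as matching server markup against the component tree.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Element:
    """A toy stand-in for a DOM node."""
    tag: str
    text: str
    listeners: Dict[str, Callable[[], None]] = field(default_factory=dict)


def mount_by_replacing(container: List[Element], on_click: Callable[[], None]) -> None:
    """Naive CSR mount: discard the server-rendered nodes and rebuild them."""
    rebuilt = [Element(el.tag, el.text, {"click": on_click}) for el in container]
    container.clear()          # the content disappears here (the "flash")
    container.extend(rebuilt)  # and then comes back as brand-new nodes


def hydrate(container: List[Element], on_click: Callable[[], None]) -> None:
    """Hydration: keep the existing nodes and only attach listeners."""
    for el in container:
        el.listeners["click"] = on_click
```

In this model, `hydrate` leaves every server-rendered node in place and only makes it interactive, while `mount_by_replacing` produces visually identical but entirely new nodes, which is what causes the visible flash.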
What advantages does SSR have?
SSR still plays a very crucial role in front-end development. While React, Angular, and Vue might be sufficient for most UI applications, there are several gaps that SSR fills.
1. SSR applications can often provide improved performance when it comes to first contentful paint (FCP) since the browser does not need to download JavaScript in order to display the views.
2. When building out complex and performance heavy visuals or elements, it can often be faster to do so on the server which generally has more powerful resources available than a user’s browser.
3. Search Engine Optimization (SEO) was a very common use case, but perhaps not anymore. It was generally thought that search engine crawlers that traverse web pages to index them had a lot of problems with JavaScript-heavy (CSR) pages, since they might not wait for the view to be constructed (among other things). However, this article from Vercel demonstrates with evidence from over 37k different HTML pages that JavaScript-heavy pages may no longer have as many issues when it comes to Google’s page indexing and SEO processing (at least for the Googlebot crawler).
Infrastructure
Running an SSR framework locally is usually very easy. The complex part comes in the big differences required when deploying to a test/QA/production environment. The standard approach for a CSR app is to use something like Amazon S3, Cloudflare Pages, etc. that acts as a simple file store, since we don’t actually need a server. However, this is no longer possible with SSR, because a server needs to construct the HTML for each request and we can’t just rely on static content being delivered via cloud storage.
Many providers offer capabilities that can help simplify SSR infrastructure management. Vercel offers SSR capabilities for Next.js, SvelteKit, Nuxt, and Astro. NuxtHub offers full stack capabilities for Nuxt applications. AWS has several services that could enable SSR applications such as Amazon EKS, Amazon ECS, or AWS Amplify. Firebase offers App Hosting for Next.js and Angular applications that leverages Google Cloud Platform behind the scenes.
One important note is that using a cloud storage solution to serve static content, such as a CSR app, is generally cheaper than running a server which hosts an SSR application. It is also often much simpler to set up and maintain.
Other acronyms?
Oftentimes, several other concepts are talked about when it comes to SSR and I think it’s critical to understand what they are. SSR occurs at runtime. When a web request reaches the server, it will build out the HTML response dynamically and send it back to the client. Sometimes we might want to build out the HTML ahead of time since it won’t change per request. This is where Static Site Generation (SSG) and Incremental Static Regeneration (ISR) come into play. Many SSR frameworks offer these capabilities so that it will lessen the load on the server and improve efficiency.
Static Site Generation (SSG)
Static site generation is when the pages of your application are pre-rendered at build time rather than rendered dynamically by the server when requested. Many frameworks are able to do both SSR and SSG in the same application! Angular does this out of the box. This can be a very efficient approach since all of the work is done up front. However, it can be cumbersome to make changes to content, since every change requires a new build.
Incremental Static Regeneration (ISR)
ISR allows your applications to periodically regenerate pages that were created via SSG. This means that you no longer need to do a redeploy and rebuild of your statically generated pages to pick up new content as long as the content is pulled dynamically from a source, such as a content management system (CMS), at build time. Not all frameworks offer ISR, but it is becoming more popular.
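
The difference between the three approaches comes down to when the render step runs. Below is a framework-agnostic sketch in Python; `render`, `SsrHandler`, `SsgSite`, and `IsrSite` are hypothetical names, not any framework's API. It also simplifies ISR by regenerating a stale page during the request, whereas a real framework may serve the stale page while regenerating in the background.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def render(path: str, content: str) -> str:
    """Stand-in for a framework's HTML rendering step."""
    return f"<html><body><h1>{path}</h1><p>{content}</p></body></html>"


class SsrHandler:
    """SSR: render on every request."""

    def __init__(self, load_content: Callable[[str], str]) -> None:
        self.load_content = load_content

    def handle(self, path: str) -> str:
        return render(path, self.load_content(path))


class SsgSite:
    """SSG: render every page once, at build time."""

    def __init__(self, paths: Iterable[str], load_content: Callable[[str], str]) -> None:
        self.pages: Dict[str, str] = {p: render(p, load_content(p)) for p in paths}

    def handle(self, path: str) -> str:
        return self.pages[path]


class IsrSite:
    """ISR: serve a stored page, re-rendering it once it is too old."""

    def __init__(
        self,
        load_content: Callable[[str], str],
        revalidate_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.load_content = load_content
        self.revalidate_seconds = revalidate_seconds
        self.clock = clock
        self.pages: Dict[str, Tuple[float, str]] = {}

    def handle(self, path: str) -> str:
        now = self.clock()
        cached = self.pages.get(path)
        if cached is None or now - cached[0] >= self.revalidate_seconds:
            cached = (now, render(path, self.load_content(path)))
            self.pages[path] = cached
        return cached[1]
```

If the underlying content changes, `SsrHandler` reflects it on the next request, `SsgSite` keeps serving the old page until it is rebuilt, and `IsrSite` picks up the change once the revalidation window has passed.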
What does State Farm do?
In general, State Farm enables teams to use the tools which best fit the problems they are trying to solve. This means that each of our applications has the potential to be built with a different framework depending on the team’s knowledge, the product area, and the different ways we need to interact with customers. State Farm has a large number of UI frameworks in use including Angular, Astro, Ember, Next.js, React, Vue, Spring MVC + JSP, etc.
Some of the common patterns I recommend to our engineering teams are:
- Do you have a form heavy application with a lot of inputs? Save yourself some time and use a framework that enables two-way data binding like Vue or Angular.
- Spinning up a quick project or proof of concept? React might be an easy option since it has many learning resources and open-source dependencies.
- Building a large, enterprise application with many engineers and you need to have an easy time managing updates and technical debt? Use Angular. It prioritizes backwards compatibility and easy version updates.
- Do you need to build a UI where first page load speed is super critical? Use an SSR framework like Next.js or Angular with SSR.
Review — when to use CSR or SSR?
This is a question a lot of teams ask me, but I’m hopeful that now that we have more of a grasp on their differences, we can determine what makes the most sense to use. Let’s review.
When to use SSR
- You have an application that needs to display as soon as possible. We can achieve this by sending pre-rendered HTML to the browser and not needing to load JavaScript to display the UI.
- Your application needs to maximize its SEO statistics. While the aforementioned article from Vercel is promising, our best bet is to keep things easy for web crawlers and send all of our content on the first load of the HTML.
- Your application needs to do heavy data processing which might cause very slow load times or interactions in the client. Process the data on the server rather than relying on the browser.
When to use CSR
- For most other scenarios, using CSR is the right choice. It’s cheaper to host, easier to manage the infrastructure, and makes for a great user experience.
Exceptions…
- One scenario where it might make sense to use something other than CSR would be for certain types of pages that don’t require much user data or interactivity. For example, marketing pages that display static content. In this scenario, you might want to consider using SSG. Oftentimes JavaScript frameworks such as Angular and Astro can let you choose what routes of your application can be pre-rendered at build time.
- Another scenario might be a need for a limited backend API or Backend for Frontend (BFF). By creating an SSR application, you could also include one or more API routes that the client calls. However, this can grow complicated. As API needs grow and change, you should be cognizant of when pulling those capabilities into their own managed service makes more sense.
Conclusion
Understanding client-server requests and how your framework operates helps teams with debugging, building efficient applications, and providing the best experience possible to users. Those basics, along with an understanding of CSR vs SSR, empower teams to craft their ideal application. Now that you’ve had a chance to learn more, you can ask yourself: Do I need SSR for my application or will CSR suffice? Can my application’s content be pre-rendered with SSG? Is the cost of running an SSR application server worth it versus a CSR application hosted in static cloud storage? Thanks for reading and let’s get building!
To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.
Information contained in this article may not be representative of actual use cases. The views expressed in the article are personal views of the author and are not necessarily those of State Farm Mutual Automobile Insurance Company, its subsidiaries and affiliates (collectively “State Farm”). Nothing in the article should be construed as an endorsement by State Farm of any non-State Farm product or service.
The Rendering Equation: Client + Server + Framework was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Level Up Your Design Practices

State Farm Engineering — Mon, 21 Apr 2025 19:16:37 GMT

By Ben Justick

Introduction

As an Architect, I’m always thinking about how to articulate a technical design or architectural change that’s in my head in a way that can be easily consumed and understood to accelerate engineering efforts. When you’ve been doing this for a long time, the most effective design techniques and practices become almost second nature. However, I’ve found that less experienced engineers and architects often struggle with articulating their design in a way that others on the team can understand. Typically, the right pieces are there, but they are organized in a way that doesn’t complete the puzzle. In this post, I intend to cover the approach and best practices I use in practically every architecture and design assignment. My hope is that you gain a practical guide to design that you can start using on your next assignment and that leaves your team impressed with the end result.

Step 1 — Do Your Homework

When you first get a design assignment, there’s a very strong possibility that you are going to want to immediately start drawing up design diagrams or creating design documentation. This is something I’m often guilty of myself, especially if I’m excited about a new product or capability and want to start sharing ideas. However, spending a bit of time up front for some homework to research the problem space and to connect with key people almost always yields a high return on investment. There’s a common approach I like to follow when it comes to doing this homework.

I always start by thinking through the questions that I have about the problem space and writing those down:

Some questions are typically targeted at gathering a robust understanding of the current state application and/or services that may need to change with a special focus on understanding the key points of integration within a current state application or service context.
Other questions are based on initial assumptions of what may need to change or be built to solve the problem at hand.

For each question I use different approaches (or a combination of approaches) to get the answer:

Review existing documentation and code. — Almost every design will involve some changes to, or an articulation of, current state components. Oftentimes, there is design documentation you can reference to get up to speed with the current state, but for more detailed changes, or if there is a lack of documentation, spending some time reviewing code is necessary.
Find out who the key people are, and determine the questions you need to ask them. — For every IT product, there’s usually an expert that either designed it, wrote the code for it, or at the very least understands how it works. Determining who your key contacts are, and how best to engage with them to get the information you need will be critical to the success of most design efforts.
Some questions require industry related insight and perspective. — You can start with some industry research on the internet, but ideally there are some key people within your organization that have already spent a bunch of time researching a specific service or technology and might have even gone as far as a proof of concept. Reach out to these resources to ask about the information that they have and if there’s anything conclusive based on their work.

This part of the work leans heavily on developing and maturing soft skills that are prone to being overlooked in tech centric organizations. Learning how to adapt to different communication styles and work with people across the organization takes time and practice, but it’s well worth the investment.

Step 2 — Outline the Context and Problem Statement

One common mistake that I often see inexperienced Engineers or Architects make is starting off design documentation with a complex design diagram. While a great design diagram is an essential part of almost every design assignment, some context is typically necessary to get the reader to a place where they can make sense of what the diagram is trying to articulate. Leading in with some background context on why you’re working on this design assignment and the problem it intends to solve helps to guide the reader into your headspace and set up the more detailed design that’s forthcoming. This section should intentionally be kept brief, and oftentimes a short bulleted list is the most effective way to introduce all of the key historical context and the general problem statement the reader needs to be aware of.

Step 3 — Clarify the Scope and Assumptions

I know that most of you are eager to get to a diagram, but trust me, spending a bit of time to clarify what’s in vs. out of scope, and/or key assumptions that are part of your design will make for a more concise and focused design diagram. This section should really be intended to draw the boundaries around the things you plan to cover as part of your design vs. the things that were considered as part of thinking through the design, but don’t necessarily need to be depicted because they either aren’t changing or are existing points of integration that are depicted in detail in other diagrams. By incorporating this section, you are further guiding the reader into your headspace and getting them prepared for what they are going to see next. Similar to the Context section, this section should be kept as brief as possible and a bulleted list usually is the most effective technique.

Step 4 — Create a Beautiful Diagram

Think about a major purchase you’ve made recently. You likely visited several manufacturer websites while researching that purchase, and made immediate subconscious judgements about the relative quality of the products based solely on the aesthetics of each website. Similar subconscious judgements are made within the first few seconds of seeing a new design diagram. A diagram that is ugly and challenging to consume might leave you questioning the accuracy of the information it is attempting to depict, or wondering about the overall quality of the product(s). Subconsciously, a disorganized, ugly diagram is going to make you think the solution itself is disorganized and ugly, regardless of how good it may actually be. In contrast, beautiful diagrams immediately instill confidence in the accuracy and quality of the design.

Simply put, when it comes to design diagrams, aesthetics matter. Let’s take a moment to review some best practices I’ve collected and used over the years, and then explore how we can apply them to make more beautiful diagrams.

Best Practices

Borrow from another diagram — Find a diagram, created by you or someone else, that is similar to the one you want to create, and reuse its symbols, shapes, colors, and spacing.
Use colors and symbols strategically — Use color to group like and contrasting items in the design, to emphasize a key focus area on the diagram, and/or to highlight changes to an existing system. Too much color can create noise or be a distraction, so spend some time finding a balance of colors that is pleasing to look at and makes the information easy to consume. Also, always include a legend to explain what the different colors and symbols mean.
Use numbering to articulate flow/steps or for informational call-outs — For end-to-end flow diagrams, use numbering to articulate the sequence of steps. Numbering may also be helpful in other types of diagrams if there are additional details to expand on in text.
Use abstraction strategically — Use abstraction to make the diagram less messy, as long as it doesn’t remove important information from the diagram.
Connection arrows should be used consistently — Use all straight arrows or all curved arrows, but not a combination of both. Also, try not to cross arrows if possible.
Connection arrows should be labeled to clarify ambiguity — For example, an unlabeled connection arrow from an API to a Datastore could mean many different things (create, read, update, delete, or some combination). Use labeling to clarify the intent of the connection as needed.
Make the alignment and spacing of objects as symmetrical as possible — Symmetry helps to minimize visual distraction and facilitates more rapid consumption of the information depicted.
Approach your diagram like an artist — This practice is more abstract, but is important to take your diagrams to the next level. Everyone has different stylistic preferences, but experiment with options to create your own visual style that makes your diagrams stand out. Over time you may develop a signature style that others may come to admire and replicate.

Theoretical Example Case Study

I’m going to depict a fairly generic system topology diagram that has UI, API, and Data layers. I created all of these example diagrams using Draw.io. The components on the diagram are also labeled generically to keep the example simple. (i.e. — In real life UIs and APIs would have explicit names/labels.)

A Not So Beautiful Diagram

For this first example, I’ve created a diagram that goes against several of the best practices above just to illustrate how poor aesthetic choices make a diagram very challenging to consume.

A not so beautiful diagram

My guess is that you’ve come across diagrams that look something like this several times in your career. It’s both challenging to consume and has informational gaps due to poor design choices. For example, this diagram probably has you asking questions like:

How does the Data Access API interact with the databases depicted? Is it reading, writing, or some combination of both?
Is there some meaning behind the different colors used for the APIs and UIs?
The call to Downstream API 1 has a chain of calls to other APIs that then calls off to the same databases that the Data Access API calls. How are these calls different from the calls the Data Access API is making?

A Better Version

The version below uses some of the best practices to make the diagram more consumable.

A better diagram

What are some of your immediate reactions to this diagram after reviewing the first example? You were likely able to start consuming the information right away rather than spending the first few moments deciphering what the diagram was trying to depict. I’m assuming it also made you feel more confident about the accuracy and quality of the design. A few notes on some of the key best practices that were applied:

Colors, symbols, and connection arrows were applied consistently and a Legend was added.
The detailed call flow behind Downstream API 1 was abstracted to make the diagram less busy, and a numbered call-out was used to offer a point of departure to those details. Note: This assumes that these details aren’t essential to be depicted in context to this diagram and a more detailed end-to-end diagram for Downstream API 1 is available.
Some of the connection arrows were labeled to reduce ambiguity around the purpose/intent of the connections.
Alignment changes were made to make the diagram more symmetrical.

Making it Stand Out

There’s nothing wrong with the updated example above. It follows best practices to optimize readability and consumption, and likely represents a good enough stopping point for sharing the design with or across a few product teams. However, what if your goal is to create a diagram that’s going to be shared at the Enterprise level, or possibly embedded within a webpage to share outside of your organization? You might want to consider some additional aesthetic enhancements to really make your diagram stand out. The example below highlights a few techniques for stylizing your diagram. Everyone’s personal style is unique, so experiment with what resonates with you and create your own signature style!

A more beautiful diagram with style added

A few notes on the techniques that were used to give this diagram more of a hand-drawn style:

A few different handwritten fonts were selected and applied. (Draw.io allows you to apply fonts from the Google Fonts Library.)
“Sketch” styling was selected for some objects and connectors.
Connectors were switched from “Sharp” to “Rounded”.
A background object was added to help make the stylistic choices stand out more.

Next Steps & Concluding Thoughts

After articulating your design diagram(s), there are several additional sections that could make sense to add to your design documentation, depending on context. You may need to elaborate on design details related to the diagram; break down high-level features/stories, sizing, and ownership of work based on the diagram; or explore alternatives for design decisions that still need to be made. Regardless of the next steps that apply to your specific design assignment, there are a couple of additional key best practices to apply as you start to wrap up the initial draft of your design documentation:

Review and revise your design documentation with peers and experts in the problem space to ensure that the design is accurate and easy to understand. Design should be approached as a collaborative process rather than something done in isolation.
Refine the story you want to tell. — At the end of the day, your design should tell a story. In some cases, it will be the story of why changes need to be made in order to achieve a new business goal or priority, while in other cases it might be a more detailed “how to” manual to help a team deliver some high priority work faster. Revisions to your design documentation should be made with your story and intended audience in mind.

Following this approach takes time and experience to hone, but you should start to see immediate benefits as soon as your next design assignment if you take the time and effort to apply it. I hope this also gives you a better understanding and appreciation for the important aspects of design that go beyond just creating an amazing design diagram, while also providing you with the best practices and techniques you need to create that diagram.

To learn more about technology careers at State Farm, or to join our team, visit https://www.statefarm.com/careers.

Level Up Your Design Practices was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Adopting GraphQL Federation at State Farm — Our First Baby Steps

State Farm Engineering — Mon, 17 Feb 2025 18:02:28 GMT

Adopting Federated GraphQL at State Farm — Our First Baby Steps

By Austin Mehmet and Brian Vanderbusch

Overview

State Farm’s digital strategy has evolved over the years, from the early days when we had a simple static home page to today, where numerous APIs and multiple separate frontends are stitched together to provide a cohesive customer experience. A core part of our web architecture relies on numerous Backends for Frontends (BFFs) that provide central orchestration of REST APIs to ensure our customers get a unified experience across our digital landscape. These API orchestration layers have served us well, but ongoing maintenance has become more cumbersome as the downstream APIs upgrade and evolve. Some of our larger orchestration APIs may have been generic enough for multiple clients to consume, but they never grew into enterprise-wide solutions. While this architecture is functional, it is far from optimal. Recognizing the need for a more efficient solution, we turned our attention to GraphQL and a federated architecture.

The Shift to Federated GraphQL

GraphQL was not new to State Farm. It had been used in isolated pockets within the organization but had never achieved widespread adoption. Federated GraphQL changed our perspective entirely. Why? Because it allows us to unify our numerous web services under a new concept: the supergraph. The supergraph acts as a single point of entry across our wide sprawl of APIs. It enables clients to request only the data they need without having to navigate a maze of web services that span paradigms like REST, SOAP, GraphQL, and gRPC. It also opens the door to no longer maintaining massive, monolithic backend-for-frontend style APIs and instead letting our downstream domain areas power the client-facing applications. In theory, that removes large chunks of our architecture, which could improve performance and reduce cost.

We saw other benefits as well, such as enabling security at the field level, increasing developer productivity, and gaining flexibility to adapt to new business features. But perhaps the most important benefit was building and maintaining a single unified schema that describes common business terminology to our clients. It isn’t uncommon for different areas to call the same concept different things. A simple field like addressLine1 might be called street1 or just line1, and a concept like a vehicle might be represented as vehicle, car, or, in some systems, generically as an insurableRisk. Apollo GraphQL’s approach to federation lets us rein some of that in, giving us a more unified schema for modeling our business processes and a common language we can all speak.
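To make the naming problem concrete, here is a minimal Python sketch of the kind of alias mapping a facade layer might apply before exposing data through a unified schema. The alias table and the `normalize` helper are hypothetical illustrations built only from the field names mentioned above; they are not State Farm's actual schema, and they are not part of Apollo's tooling.

```python
# Hypothetical alias table: each canonical field name maps to the
# variants that different systems might use for the same concept.
# Treating insurableRisk as a vehicle is a simplification for illustration.
CANONICAL_ALIASES = {
    "addressLine1": {"addressLine1", "street1", "line1"},
    "vehicle": {"vehicle", "car", "insurableRisk"},
}

# Invert the table once so each incoming key is a single dict lookup.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in CANONICAL_ALIASES.items()
    for alias in aliases
}


def normalize(record: dict) -> dict:
    """Rename known aliases to canonical names; pass other keys through.

    If a record contains two aliases of the same canonical field,
    the one that appears later in the record wins.
    """
    return {ALIAS_TO_CANONICAL.get(key, key): value for key, value in record.items()}


if __name__ == "__main__":
    legacy = {"street1": "1 Main St", "car": {"make": "Ford"}, "zip": "61710"}
    print(normalize(legacy))
```

In a federated graph this renaming would more likely live in each subgraph's resolvers or schema design than in a shared lookup table, so treat the table as a way to picture the problem rather than as a recommended design.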

The Proof of Concept

We decided to validate this approach with a proof of concept. Collaborating closely with Apollo and leveraging their product GraphOS, we set out to build a thin slice of what we believed would showcase the full potential of federated GraphQL at State Farm. Our concept focused on modeling a policy retrieval system. This system allowed us to fetch policy details, agent details, claims details, and customer details all through a single supergraph. We ended up building 10 subgraphs that exposed 15 queries, 1 mutation, 63 types, and 378 fields. Pages that previously required multiple network requests to gather the data needed to paint a screen could now get it with one call against our supergraph. Most of the subgraphs built were facade layers that sat on top of existing REST APIs. This was the simplest and quickest way to demonstrate the value of the supergraph to our organization. Much of the complexity we had to wade through was reorganizing existing schema into cleaner, consumable parts and enabling the federation of our entities. In terms of languages and frameworks, we mainly stuck to TypeScript with Apollo Server but also enjoyed branching off that path and trying Java with DGS and Go with gqlgen. Many of our internal services are already written in Java with Spring, so DGS offered a different approach: embedding a GraphQL interface into existing APIs rather than spinning up new services.

Example query plan from our PoC

Challenges We Faced

Like any transformative project, our journey with federated GraphQL was not without its challenges:

Complexity of Integration: Integrating numerous existing services into a unified supergraph required significant effort. Each service had its own intricacies and dependencies that needed to be carefully managed. In the context of our concept, some constraints of our legacy systems introduced unexpected complexities in tasks like policy retrieval across our various domains.
Schema Management: Coordinating schema across various teams and services presented some challenges. Initially, we identified opportunities to improve the consumer-friendliness of our existing REST and SOAP schemas as most were never really intended to be exposed to a UI. We spent a lot of effort working back with these various areas to rethink that schema. We also needed tooling and processes to handle schema evolution and versioning. Apollo’s GraphOS helped solve that problem with their schema proposal changes feature which allowed us to have complete conversations on supergraph schema. We also encountered some difficulties with composition and collaboration within a shared type pool. Specifically, aligning on consistent naming conventions for types used by different teams was an area that required extra coordination.
Security: Ensuring the security of our federated graph was paramount. We are still exploring ways to implement stringent access controls and data validation mechanisms that protect sensitive information, using @policy directives via a coprocessor.
Scale: When faced with a large task, such as creating a sustainable supergraph for a large organization, it is important to take extra time to plan and keep the long term goal in mind from the beginning. We knew that our proof of concept would need to be focused on 10 teams spread across our IT operations workforce, and they would become the catalyst for expansion to hundreds of teams in the near future. We started small and very intentional around scope, so that we could build a framework for rapid expansion.
Forging New Ground: Our teams span across a vast landscape of products and purposes. In our discovery process, we found many supergraphs at various other organizations that were limited either in their scope or in the size of their graph development workforce. We needed to find a platform and also a vendor that would be willing to grow with us as we test the capabilities and limits of the supergraph concepts.

Where We Are Now

The success of our proof of concept convinced us of the value of federated GraphQL, and we decided to commit to this new approach. Many of the challenges listed above pointed to the need for some form of governance team. Additionally, because supporting Apollo federation requires some infrastructure setup and maintenance (running the Apollo Router), our next step was to build a dedicated platform team around GraphQL at State Farm. This involved assembling a new team and laying down the foundational infrastructure to facilitate easy onboarding for development teams across the organization. This team is responsible for ensuring collaboration across engineering teams, maintaining a clean federated schema, managing any shared infrastructure, and continuing to advocate for growing the supergraph. We also continue to work closely with Apollo and fully leverage their GraphOS product for our GraphQL federation journey. We have already put one use case into production that enables online customer channel business lines quoting, and we have many more on the horizon.

Currently, our focus is on:

Gaining Additional Adoption: We are actively promoting the benefits of federated GraphQL within State Farm and encouraging more teams to adopt this new approach. We have four more use cases we are onboarding onto our production supergraph along with actively attracting more.
Increasing Education and Training: With GraphQL being new to so many areas of State Farm, our platform team is building an education and training plan that leverages self-paced materials and the Apollo Odyssey courses to get teams up to speed.
Building Out Automation: To streamline the development process, we are investing in automation for our GraphQL platform. This includes automated shared infrastructure deployments, automated schema validation, CI/CD pipelines for subgraphs, testing tools, and monitoring tools.
Exploring Apollo Connectors for REST APIs: We are diving deep into a new feature called Apollo Connectors. This feature promises to further enhance our ability to integrate existing REST services into our federated graph seamlessly. One of the challenges we face is difficulty getting work prioritized when it involves another team. With Apollo Connectors, we see a large potential in the ability to grow our supergraph without impacting these teams. Streamlining supergraph development and showing a faster return is the goal.

The Future Holds Promise

Our journey with federated GraphQL is just beginning. As we continue to expand our GraphQL footprint at State Farm, we are excited about the possibilities that lie ahead. In the next six months, we expect to achieve significant growth and learning and plan to share our experiences, insights, and best practices with the community.

Stay tuned for more updates on our progress as we navigate this exciting transformation. Federated GraphQL has opened up new avenues for innovation at State Farm, and we look forward to seeing where it takes us.

Adopting GraphQL Federation at State Farm — Our First Baby Steps was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

When the Spark Execution Plan Gets Too Big

State Farm Engineering — Mon, 13 Jan 2025 18:02:05 GMT

By Hunter Mitchell

Background

Apache Spark has been a dominant big data processing tool for around a decade. Its popularity stems from its multi-language support, fault-tolerance, performance, and scalability. The core data structure used by Spark is the RDD (Resilient Distributed Dataset), which represents an immutable collection of objects. One of Spark’s unique features is its execution plan, which gets evaluated lazily to allow optimization once an “action” is called (i.e., a result is returned). Keeping the entire execution plan in memory is what allows Spark to be fault-tolerant; that way, it can rebuild the data if a worker is lost. However, the execution plan can become quite extensive when performing many data transformations, potentially leading to memory bottlenecks.

I’ve been a Data Engineer for around 3 years and have been working with Spark for most of that time, but I hadn’t run into this problem until I joined State Farm a year ago. This is due to the heavily nested nature of State Farm’s data, which requires us to perform many transformations on it. After these transformations, we noticed the applications would run slower than expected, and the Spark UI would fail to load when we tried to view the execution plan DAG (Directed Acyclic Graph). Then, I learned I could break up the execution plan to alleviate this problem. Creating a breakpoint in the lineage of the data allows us to free up the memory retained by all of its previous transformations. After researching the various ways to do this, I wanted to perform an in-depth analysis of these different options and understand how each one works.
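Before getting into Spark itself, a toy model may help show why a long lineage is costly and what a breakpoint buys you. The `ToyLazyDataset` class below is purely hypothetical and is not how Spark is implemented; it only mimics the idea that transformations are recorded rather than run, and that materializing the result lets the recorded steps be dropped.

```python
from typing import Callable, List, Optional


class ToyLazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(
        self,
        source: List[int],
        lineage: Optional[List[Callable[[int], int]]] = None,
    ):
        self.source = source
        # Every recorded transformation is retained until it is dropped.
        self.lineage = list(lineage or [])

    def map(self, fn: Callable[[int], int]) -> "ToyLazyDataset":
        # Transformations are lazy: record the step, compute nothing yet.
        return ToyLazyDataset(self.source, self.lineage + [fn])

    def collect(self) -> List[int]:
        # An action replays the whole lineage against the source data.
        rows = self.source
        for fn in self.lineage:
            rows = [fn(row) for row in rows]
        return rows

    def checkpoint(self) -> "ToyLazyDataset":
        # Materialize the current result and start over with an empty lineage.
        return ToyLazyDataset(self.collect(), [])


if __name__ == "__main__":
    ds = ToyLazyDataset([1, 2, 3])
    for _ in range(50):
        ds = ds.map(lambda x: x + 1)
    truncated = ds.checkpoint()
    print(len(ds.lineage), len(truncated.lineage))  # 50 0
    print(truncated.collect())  # [51, 52, 53]
```

The trade-off is the same one discussed throughout this article: the truncated dataset is cheaper to carry around, but it can no longer be rebuilt from its original source if the materialized rows are lost.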

In this article I’ll be exploring what happens to the Spark execution plan when implementing the following techniques on a transformed dataframe:

Caching
Checkpointing
Local Checkpointing
Temporarily Writing to Disk
Rebuilding From the RDD

Scenario

Disclaimer: Unless otherwise stated, I will be referring to the PySpark DataFrame API. The Dataset API and the RDD API may have slight differences.

To illustrate this, let’s consider the following simple dataframe:

df = spark.range(100)

Now let’s perform a self cross-join to force a wide transformation:

df_transformed = df.withColumnRenamed("id", "id1").crossJoin(df.withColumnRenamed("id", "id2"))
df_transformed.show()

Now we have a Cartesian product of the numbers 0 through 99:

+---+---+
|id1|id2|
+---+---+
|  0|  0|
|  0|  1|
|  0|  2|
|  0|  3|
|  0|  4|
|  0|  5|
|  0|  6|
|  0|  7|
|  0|  8|
|  0|  9|
|  0| 10|
|  0| 11|
|  0| 12|
|  0| 13|
|  0| 14|
|  0| 15|
|  0| 16|
|  0| 17|
|  0| 18|
|  0| 19|
+---+---+
only showing top 20 rows

Here’s what the Spark execution plan for this dataframe looks like up to this point:

Note: I’m using the SQL section of the UI to get the execution plan DAG, as it tends to provide a better visual than the Jobs/Stages tabs.

Although this is just one transformation, you can see how the DAG can grow large if you’re doing many. For each transformation, Spark stores intermediate results, metrics, shuffle information, and execution context metadata, all of which consume memory.

Let’s say we want to perform further analysis on this dataframe. If you’re like me, your first thought would be to cache the dataframe.

Where Caching Falls Short

Caching is the standard approach when planning to reuse a dataframe, and for good reason. It allows quick access to the dataframe by storing the data at the storage level of your choosing. The problem with caching comes from the fact that it retains the history of the execution plan prior to the cache, which can slow down performance if the history is long. As previously mentioned, this is for fault-tolerance purposes. Let’s see an example to prove this:

df_cache = df_transformed
df_cache.cache()
df_cache.count() # force caching
df_cache.show()

Here’s what the execution plan DAG looks like from the show() action:

As you can see, the execution plan prior to the cache is still there.

Despite not breaking up the execution plan, caching should still be the go-to method when needing to reuse a dataframe. I don’t suggest replacing all instances of caching with one of the methods outlined below, though you may consider combining them in certain scenarios.

What Other Options Do We Have?

Checkpoint

The first option we have is dataframe checkpointing, not to be confused with Spark Streaming checkpointing. Dataframe checkpointing materializes and stores the dataframe’s underlying RDD to a directory. This can be a local temp directory, distributed filesystem like HDFS, or object store like S3. There, it is replicated to ensure fault-tolerance. For this to work, you must specify the directory to store the RDD files. You must also reassign the dataframe from the output of the checkpoint method to properly retrieve the checkpointed dataframe.

Here’s how this works in practice:

df_checkpoint = df_transformed
spark.sparkContext.setCheckpointDir("./checkpoint") # specify location of RDD files
df_checkpoint = df_checkpoint.checkpoint() # returns checkpointed dataframe
df_checkpoint.show()

Note: The checkpoint method can also be lazily evaluated, like caching. To do this, set the eager argument to False.

This stores the serialized RDD files in the ./checkpoint/ directory. Now let's see what the execution plan DAG looks like from the show().

Voila! It’s now reading from the stored RDD files, and has essentially forgotten about what happened prior to that. The total size on disk of the RDD files from this sample dataframe was 425 KB.

A few other notes:

The stored RDD files will not automatically disappear after your SparkSession ends, so it may require extra cleanup depending on your cluster setup and where you store the files.
When checkpointing an RDD directly (not the dataframe), it’s recommended to cache the RDD before checkpointing. This is because there are actually 2 jobs happening on the RDD, requiring recomputation. For more info on RDD checkpointing, check out the docs.
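For the cleanup point above, one option when the checkpoint directory lives on the local filesystem is to wrap it in a context manager so it is removed when the work is done. This is a minimal sketch using only the Python standard library; `temporary_checkpoint_dir` is a hypothetical helper name, and it will not work as written for HDFS or S3 locations, which need their own deletion mechanism.

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_checkpoint_dir(prefix: str = "spark-checkpoint-"):
    """Yield a fresh local directory and delete it, with its contents, on exit."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # ignore_errors keeps a cleanup failure from masking the original error.
        shutil.rmtree(path, ignore_errors=True)


# Intended usage (commented out because it needs a live SparkSession):
#
# with temporary_checkpoint_dir() as checkpoint_dir:
#     spark.sparkContext.setCheckpointDir(checkpoint_dir)
#     df_checkpoint = df_transformed.checkpoint()
#     df_checkpoint.show()
```

Any dataframe checkpointed inside the `with` block becomes unusable once the block exits, so all actions on it need to finish first. Spark's configuration documentation also lists a `spark.cleaner.referenceTracking.cleanCheckpoints` setting that may be worth evaluating as an alternative; I have not covered it in this article.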

LocalCheckpoint

Similar to checkpointing, local checkpointing also materializes the dataframe’s RDD. However, this gets stored in memory rather than on disk, making it perform faster. This makes it work the same as caching, but without keeping the prior execution plan. The storage level that gets used is MEMORY_AND_DISK_DESER, just like the cache default, though it doesn't appear to be configurable. The problem with local checkpointing is that it's not fault-tolerant, meaning that if you lose an executor, the RDD will not be able to be reconstructed because it doesn't have the transformations that created it. This is especially troublesome if:

you’re using dynamic allocation within your cluster, which can drop executors
you’re using auto-scaling with a service like AWS Glue/EMR, which can also drop workers
you’re using AWS Spot Instances for your workers, which can be interrupted or shut down at any time

I recommend avoiding dataframe local checkpointing except for one-off cases, like ad hoc analysis notebooks where you can afford for things to fail.

Here’s how this looks in code:

df_local_checkpoint = df_transformed
df_local_checkpoint = df_local_checkpoint.localCheckpoint() # need to reassign the df just like checkpoint
df_local_checkpoint.show()

Here’s what the execution plan DAG looks like:

It looks just like the checkpointing DAG, which is to be expected. One difference is that the stored RDD files are bigger than they were when checkpointing — for this example they take up 840 KB of memory. This is likely due to them not being serialized, which provides quicker access to them.

Temporary Write to Disk

The third option is explicitly writing the dataframe to disk and reading it back. This allows flexibility because you can write the dataframe anywhere, even your preferred database. It also allows storage-optimized file types like Parquet, which occupy less space and may incur lower external storage costs as well. Here’s how you can implement this:

df_temp_write = df_transformed
df_temp_write.write.mode("overwrite").parquet("./temp_write")
df_temp_write = spark.read.parquet("./temp_write")
df_temp_write.show()

Which provides the following DAG:

This is slightly different from the other DAGs because it needs to translate the parquet files back to an RDD. Therefore, if you plan on re-using this dataframe heavily, you may want to cache it after reading it from the temporary location.

This option, like checkpointing, will likely require manual cleanup of the files. However, there is less data to clean up because of the optimized storage. For this example, the Parquet files on disk amounted to just 18 KB! It’s also worth mentioning that the order of the dataframe can be different when reading it back, so you may need to sort it before writing the files.
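The write, read back, then cache pattern described in this section can be wrapped in a small helper. The sketch below is one way to do it; `break_lineage_with_temp_write` is a hypothetical name, and the function takes the SparkSession and dataframe as arguments rather than importing PySpark itself, so it only assumes the standard `write.mode(...).parquet(...)`, `read.parquet(...)`, and `cache()` calls used earlier in this article.

```python
def break_lineage_with_temp_write(spark, df, path, cache_after_read=True):
    """Write df to Parquet at path, read it back, and optionally cache it.

    spark is expected to be a SparkSession and df a PySpark DataFrame.
    The caller is responsible for deleting the files at path afterwards.
    """
    # Overwrite any leftovers from a previous run at the same location.
    df.write.mode("overwrite").parquet(path)
    # The re-read dataframe's plan starts at the Parquet scan.
    fresh_df = spark.read.parquet(path)
    # cache() is lazy: the data is cached by the first action that follows.
    return fresh_df.cache() if cache_after_read else fresh_df


# Intended usage (commented out because it needs a live SparkSession):
#
# df_temp_write = break_lineage_with_temp_write(spark, df_transformed, "./temp_write")
# df_temp_write.count()  # forces the cache to populate
```

Keeping the cache step optional is deliberate: if the dataframe is only reused once or twice, the extra memory may not be worth it.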

Rebuilding the Dataframe from the RDD

I saw a few examples online of people saying that recreating the dataframe from the RDD and the schema will effectively break up the execution plan just like the other options. Let’s see if this works:

df_rebuild = df_transformed
df_rebuild = spark.createDataFrame(df_rebuild.rdd, df_rebuild.schema)
df_rebuild.show()

When we look at the DAG in the SQL section of the Spark UI, it looks like it works:

However, if we look at the DAG from the Job section of the Spark UI, we’ll see a different story:

This appears to show the original df_transformed job's execution plan with a few extra steps of converting it to an RDD and back to a dataframe. I believe the SQL section is misleading because the transformation is actually happening with the RDDs, which is a lower level than the SQL interface. Since this doesn't actually break up the execution plan, this isn't a viable option.
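If you would rather not rely on the UI alone, another way to inspect what a dataframe still carries is to look at the lineage string of its underlying RDD. The helper below is a small sketch around PySpark's `RDD.toDebugString()`, which returns bytes (or None); `rdd_lineage_lines` is a hypothetical name, and I am not claiming specific output for any of the dataframes in this article, only that comparing the line counts between strategies can be a useful sanity check.

```python
def rdd_lineage_lines(df):
    """Return the lines of df.rdd.toDebugString() as a list of strings.

    df is expected to be a PySpark DataFrame. Accessing df.rdd adds its own
    conversion steps, so compare results between dataframes rather than
    reading a single count in isolation.
    """
    debug = df.rdd.toDebugString()
    if debug is None:
        return []
    # PySpark returns bytes here; decode defensively in case it is already str.
    text = debug.decode("utf-8") if isinstance(debug, bytes) else debug
    return text.splitlines()


# Intended usage (commented out because it needs a live SparkSession):
#
# print(len(rdd_lineage_lines(df_rebuild)))
# print(len(rdd_lineage_lines(df_checkpoint)))
```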

Comparison

Here’s a simple table to show the main differences in the techniques we’ve covered:

In an attempt to get a direct comparison of how well these methods perform, I designed an experiment. The experiment creates a dataframe, runs a bunch of transformations on it to build up the execution plan, then iteratively performs a number of actions which reuse the dataframe. Before the last step, I implemented each strategy outlined above to see how it impacted the resulting processing time.

One hurdle I encountered when designing this was Spark implicitly storing some of the results, which would make the first test always run slower than the following tests. To get past this, I introduced randomness to the actions so that each test would calculate slightly different aggregates.
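The shape of that experiment can be sketched as a small timing harness. This is an illustration of the approach, not the code I actually ran: `time_strategy`, `prepare`, and `action` are hypothetical names, and including the strategy's own setup cost in the measured time is an assumption of this sketch.

```python
import random
import time


def time_strategy(prepare, action, n_actions, seed=None):
    """Time one call to prepare() plus n_actions randomized calls to action().

    prepare() returns the dataset to reuse, for example a dataframe after
    caching or checkpointing. action(dataset, threshold) runs one aggregate;
    the random threshold makes each call compute something slightly different
    so earlier results are less likely to be reused.
    """
    # seed=None draws fresh randomness per run, so different strategies do
    # not end up computing identical aggregates.
    rng = random.Random(seed)
    start = time.perf_counter()
    dataset = prepare()
    results = [action(dataset, rng.random()) for _ in range(n_actions)]
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    # Plain Python stand-ins so the harness can be tried without Spark.
    data = list(range(1000))
    elapsed, sums = time_strategy(
        prepare=lambda: data,
        action=lambda rows, t: sum(r for r in rows if r > t * 1000),
        n_actions=5,
    )
    print(f"{elapsed:.4f}s for {len(sums)} actions")
```

With Spark, `prepare` would apply one of the strategies above to the transformed dataframe, and `action` would run a filter and sum with the random threshold and then trigger an action such as `collect()`.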

The 3 major variables which I expect to impact the performance are:

Size of the dataframe
Size of the execution plan
Number of times we reuse the dataframe

Therefore, I ran this experiment with a few different values for each of these. Here are the results:

The first column # Dataframe Rows simply represents how big the dataframe is. The second column # Dataframe Transformations represents how big the Spark execution plan is. The third column # Dataframe Actions is how many times I reuse the dataframe after implementing each strategy. This is all local, so I ran into memory errors when trying to go past these numbers. The Baseline Time column is how long the code took when I didn't implement any strategy.

A few interesting things to note with these results:

Caching performed relatively poorly in each test, and even worse than the baseline in two of them
Local checkpointing, despite its risks, outperformed the others in all tests
Checkpointing performed slightly better than temp writes in each test; however, temp writes seemed to improve relative to checkpointing as we increased our variables
Rebuilding the dataframe performed okay when our dataframe was small, but severely degraded as it grew

Of course, these results will not always hold true. Another potential factor which I didn’t explore is the complexity of the dataframe actions we’re performing after the strategy. In my experiment, we run a simple filter and sum aggregation on a single column, but doing something more complex might change which strategy performs the best.

The full code I used for my experiment can be found here.

Which to Use?

As with most Spark applications, there’s not one solution that will work the best for everyone. Therefore, I recommend experimenting with the above methods, and determining which provides the most uplift for your situation. With that being said, there are a few conclusions we can accurately draw:

Caching should still be the default technique when you need to reuse a dataframe which doesn’t have a big execution plan.
Local Checkpointing is generally the most efficient strategy to break up your execution plans. However, there are major drawbacks when it comes to fault-tolerance.
The next best option is typically either checkpointing or temporarily writing to disk. These tend to perform similarly.
Rebuilding the dataframe from the RDD does not properly break up the execution plan and should be avoided.

For our application, we found that temp writes with caching afterwards was the best option.

Departing Thoughts

Breaking up the execution plan can help speed up any Spark applications where you are reusing a transformed dataframe. This approach works by removing the unnecessary information stored from all previous transformations, which frees up memory. Therefore, it tends to be most effective when there have been many transformations on the dataframe or when you’re reusing it many times. These strategies often perform better than simply caching, which keeps the metadata from previous transformations. For us, breaking up the execution plan provided more than 40% reduction in cost over caching. We found this strategy to be useful in our data processing pipelines, but I have also seen mentions of it being used in MLlib and GraphX apps. Hopefully it can benefit your Spark apps as well!

Please don’t hesitate to reach out with any questions or thoughts.

All of the code I used is available on my github.

If you’re interested in researching this yourself, I found the following resources especially helpful:

When the Spark Execution Plan Gets Too Big was originally published in State Farm Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.