<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Medium Engineering - Medium]]></title>
        <description><![CDATA[Stories from the team building Medium. - Medium]]></description>
        <link>https://medium.engineering?source=rss----2817475205d3---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Medium Engineering - Medium</title>
            <link>https://medium.engineering?source=rss----2817475205d3---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 20 Apr 2026 03:43:05 GMT</lastBuildDate>
        <atom:link href="https://medium.engineering/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Making AI Write Android Code Our Way: A Practical Guide to Agent Skills]]></title>
            <link>https://medium.engineering/making-ai-write-android-code-our-way-a-practical-guide-to-agent-skills-4e7b085d8e50?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/4e7b085d8e50</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Pierrick CAEN]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 08:25:04 GMT</pubDate>
            <atom:updated>2026-03-17T08:25:02.968Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n5sxsldnVThq58wvtSSInA.png" /><figcaption>Generated by DALL-E</figcaption></figure><p>Turning knowledge into reusable AI agent instructions for a small, fast-moving team.</p><p>We&#39;re a small Android team at Medium, just a handful of engineers maintaining and evolving the Medium Android app. Our codebase follows Clean Architecture with Kotlin, Jetpack Compose, Hilt, Apollo GraphQL, and a growing number of feature modules. Like most Android teams, we have strong opinions about how code should be structured: where ViewModels get their data, how analytics events flow, how feature flags are checked, what a &quot;new screen&quot; looks like from Fragment to preview function.</p><p>The problem? Those opinions lived in PR review comments, Slack threads, and the heads of engineers who&#39;d been around long enough to know the patterns. When AI coding assistants arrived, they could generate Kotlin code but not <em>our</em> Kotlin code. The output was generic. It missed our conventions, our component library, our testing style.</p><p>Six months ago we started using <a href="https://cursor.com/">Cursor</a> as our companion IDE. What changed the game wasn&#39;t Cursor itself, it was <strong>skills</strong> and <strong>AGENTS.md</strong>: a way to encode our team&#39;s playbook so the AI follows it every time.</p><p>This post walks through what we built, how we structured it, and what impact it&#39;s had.</p><h3><strong>The Foundation: AGENTS.md as Project Context</strong></h3><p>Before skills, we wrote an AGENTS.md file at the root of our Android project. 
Think of it as a README for the AI, a document that&#39;s automatically loaded into context whenever an Agent works on our code.</p><p>Our AGENTS.md covers:</p><ul><li><strong>Architecture overview</strong>: Module structure (data, domain, design, feature modules), layer responsibilities</li><li><strong>Key patterns</strong>: How we do dependency injection (Hilt), state management (StateFlow + SharedFlow), navigation (centralized Router), repository pattern (Apollo + Result&lt;T&gt;)</li><li><strong>Conventions</strong>: Compose best practices, ViewModel patterns, testing strategy</li><li><strong>Common commands</strong>: Gradle tasks for building, testing, and running Detekt</li></ul><p>This gives the Agent baseline awareness of our project. When it generates a ViewModel, it already knows to use @HiltViewModel, StateFlow, and @Immutable sealed interfaces. When it creates a test, it reaches for MockK and Turbine instead of Mockito and LiveData.</p><p>But AGENTS.md is passive context. For multi-step, opinionated workflows, we needed something more structured.</p><h3><strong>Skills: Step-by-Step Playbooks for the AI</strong></h3><p>An Agent skill is a Markdown file (stored in .agents/skills/) that teaches the AI a specific workflow. It&#39;s not a template, it&#39;s closer to a runbook: &quot;here are the files involved, here&#39;s the order of operations, here are the patterns to follow, here&#39;s the checklist to verify.&quot;</p><p>We&#39;ve built <strong>13 skills</strong> so far. They fall into four categories.</p><h4>Scaffolding Skills - &quot;Create This From Scratch&quot;</h4><p>These are the highest-leverage skills. They replace the 30-60 minutes an engineer spends setting up boilerplate for a new screen, module, or layer component.</p><p>create-compose-screen: Our most detailed skill. 
It walks through creating a ViewModel with assisted injection, listener interfaces (in separate files), a Screen composable with a @VisibleForTesting overload, previews for every state, and test tags. A single prompt like <em>&quot;create a new screen for user notifications&quot;</em> produces 6-8 files that follow our exact patterns.</p><p>The skill specifies structure, such as two versions of every screen composable:</p><pre>// ViewModel-injecting version<br>@Composable<br>internal fun MyFeatureScreen(<br>    itemId: String,<br>    referrerSource: String,<br>    listener: MyFeatureListener,<br>    viewModel: MyFeatureViewModel = hiltViewModel { factory: MyFeatureViewModel.Factory -&gt;<br>        factory.create(itemId = itemId, referrerSource = referrerSource)<br>    },<br>)<br><br>// @VisibleForTesting version (for previews and tests - no ViewModel dependency)<br>@VisibleForTesting<br>@Composable<br>internal fun MyFeatureScreen(<br>    viewState: MyFeatureViewModel.ViewState,<br>    dialogState: MyFeatureViewModel.DialogState?,<br>    snackbarHostState: SnackbarHostState,<br>    listener: MyFeatureInternalListener,<br>)</pre><p>Without the skill, the AI consistently generates a single composable tightly coupled to the ViewModel, which makes previews and UI tests painful.</p><p>create-feature-module: Handles directory structure, build.gradle.kts with the correct plugins and base script, settings.gradle.kts registration, and app-level dependency wiring.</p><p>create-use-case and create-repository: Enforce our Clean Architecture layers. Use cases always use operator fun invoke(), return Result&lt;T&gt;, log with our Logger, and track analytics on success. 
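</p><p>As a rough sketch of what that convention produces (the use case, repository, Logger, and Tracker names below are hypothetical illustrations, not our actual code), a generated use case looks something like this:</p><pre>class FollowAuthorUseCase @Inject constructor(<br>    private val repository: AuthorRepository,<br>    private val logger: Logger,<br>    private val tracker: Tracker,<br>) {<br>    // Single entry point: operator fun invoke() returning Result&lt;T&gt;<br>    suspend operator fun invoke(authorId: String): Result&lt;Unit&gt; =<br>        repository.followAuthor(authorId)<br>            .onSuccess { tracker.trackAuthorFollowed(authorId) } // analytics on success<br>            .onFailure { logger.e(it, &quot;Failed to follow author $authorId&quot;) } // log failures<br>}</pre><p>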
Repositories use @Singleton, safeExecuteNotNull (our Apollo wrapper), and support FetchPolicy.</p><h4>Migration Skills - &quot;Modernize This Code&quot;</h4><p>We&#39;re in the middle of two long-running migrations, and skills let the AI do the mechanical work.</p><p>material3-migration: Contains an exhaustive mapping table of Material 2 to Material 3 component replacements (60+ components). It covers scaffold changes, LocalMinimumInteractiveComponentSize, theme references, and the subtle naming conventions in our design system (MediumScaffold becomes MediumScaffold3, imports shift from component to component3). Without this skill, the AI would have no way to know that MediumPullRefreshIndicator becomes MediumPullToRefreshBox; that mapping isn&#39;t in any public documentation.</p><p>compose-viewmodel-migration: Guides migrating screens from the old pattern (Fragment creates ViewModel, passes streams to composable) to the new pattern (composable creates ViewModel via hiltViewModel with assisted injection). It covers the BundleInfo pattern, listener splitting, and the @VisibleForTesting overload.</p><h4>Pattern Enforcement Skills - &quot;Do It the Right Way&quot;</h4><p>Some patterns are subtle enough that even experienced engineers occasionally get them wrong. These skills exist to prevent specific categories of bugs and review comments.</p><p>viewmodel-flags-usage: Our most opinionated skill. Feature flags must be checked <em>once</em> at ViewModel initialization and saved as a private val. The result is passed through ViewState as a boolean. Never check flags in composables. Never recompute in a Flow.</p><pre>// Check once at init, save - screen won&#39;t change during use<br>private val isAddressBookEnabled: Boolean = flags.isEnabled(Flag.ENABLE_ADDRESS_BOOK)</pre><p>source-referrer-tracking: Defines the chain in which a screen&#39;s source becomes the next screen&#39;s referrerSource. 
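</p><p>Illustrated with hypothetical screen and field names (none of these identifiers come from our codebase), the chain looks like this:</p><pre>// Screen A identifies itself as &quot;post_page&quot; via the source field in its ViewState...<br>val viewState = ViewState(/* ... */ source = &quot;post_page&quot;)<br><br>// ...and hands that same value to the next screen as its referrerSource.<br>router.navigateToAuthorProfile(<br>    authorId = authorId,<br>    referrerSource = viewState.source,<br>)</pre><p>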
The skill explains SourceParameter serialization, the convention that source should be the last parameter in ViewState data classes, and the anti-pattern of accidentally passing referrerSource forward instead of source.</p><p>implement-analytics-event: Covers the full lifecycle: proto registration in Wire config, tracker interface in core, default implementation in app, Hilt binding, SourceNames constants, and the reportScreenViewed() pattern with deduplication.</p><h4>Workflow Skills - &quot;Handle This Repetitive Task&quot;</h4><p>add-deeplink: Our deeplink handler is a 1000+ line first-match-wins dispatcher. The skill explains the ordering rules (narrow before wide, fragment matches before path matches), provides five patterns (among them simple path, path + fragment + auth, and regex-based dynamic segments), and specifies the test template including both logged-in and logged-out variants.</p><p>add-medium-uri: Three files must be updated in a specific order (interface, NoOp, default implementation) with naming conventions that vary by URL domain. Small task, but easy to get wrong without the skill.</p><p>check-and-add-translations: Finds missing translations across all modules by diffing values/ against values-X/, then adds them with the correct typography conventions (typographic apostrophe, never escaped).</p><p>write-unit-tests: Defines our testing conventions: backtick test names as human-readable sentences, Given/When/Then structure, MockK annotations, MainDispatcherRule for coroutine testing, Turbine for Flow assertions, Robolectric for Compose UI tests, and always wrapping screens in MediumTheme3.</p><h3><strong>What We Learned</strong></h3><p><strong>Skills are living documents.</strong> We’ve iterated on most skills 3–5 times. The first version of create-compose-screen didn&#39;t mention the listener splitting pattern. The add-deeplink skill originally lacked the SUSI destination rule. 
Each time we caught a pattern break, we updated the skill.</p><p><strong>Specificity beats generality.</strong> The skills that work best are hyper-specific to our codebase. material3-migration is essentially a lookup table. add-deeplink describes the exact ordering of our handler. These aren&#39;t portable to other projects — and that&#39;s the point.</p><p><strong>Skills compound.</strong> A single feature request might trigger create-feature-module, then create-compose-screen, then create-use-case, then create-repository, then implement-analytics-event, then write-unit-tests. Each skill handles its slice correctly. The AI chains them based on what you ask for.</p><p><strong>Consistency is the real win.</strong> With a small team, the risk isn&#39;t that code is bad; it&#39;s that it&#39;s inconsistent. One engineer checks flags in a Flow, another checks at init. Skills eliminate that drift. Every new screen looks structurally identical, regardless of who (or what) wrote it.</p><p><strong>Speed is the visible win.</strong> Setting up a new screen with ViewModel, listeners, composable, previews, test tags, and tests used to take most of a morning. Now it takes a prompt and a review pass.</p><h3><strong>Skills can also be written by an Agent, not just developers</strong></h3><p>You don&#39;t have to write skills yourself. An Agent can write them for you.</p><p>The process works by having the Agent ask you the right questions, then observe your existing code to draft the skill. Here&#39;s how a typical session looks:</p><blockquote><em>&quot;I want to create a skill for adding a new analytics event. Can you walk me through how you usually do it?&quot;</em></blockquote><p>The Agent asks:</p><ul><li>Which files are involved, and in what order?</li><li>Are there naming conventions to follow?</li><li>What&#39;s the checklist you mentally run before opening a PR?</li><li>Can you point me to a recent PR where you did this correctly?</li></ul><p>That last question is key. 
<strong>PR examples are the fastest way to ground a skill.</strong> When you share a PR link (or paste the diff), the Agent can reverse-engineer the pattern: what changed, in which files, in what order, and what the code structure looks like. It then drafts the skill as a Markdown runbook, which you review and refine.</p><p>A good prompt to get started:</p><blockquote><em>&quot;Look at this PR: [link]. I want to write an Agent skill that teaches an AI to reproduce this pattern from scratch. Ask me any clarifying questions you need, then write the skill file.&quot;</em></blockquote><p>The bar for a useful skill isn&#39;t perfection on the first try. It&#39;s one less PR review comment next week.</p><h3><strong>How to Start</strong></h3><p>If you&#39;re on an Android team (or any team with strong conventions), here&#39;s how we&#39;d recommend starting:</p><ol><li><strong>Write your </strong>AGENTS.md<strong> first.</strong> Document your architecture, patterns, and conventions. This is the foundation.</li><li><strong>Start with scaffolding skills.</strong> Pick your most boilerplate-heavy task (for us: new screens) and write a skill for it. Include the checklist, the file structure, and code patterns.</li><li><strong>Add migration skills for active migrations.</strong> If you&#39;re migrating from Material 2 to Material 3, from RxJava to Coroutines, or from XML to Compose, encode the mapping.</li><li><strong>Encode your review feedback.</strong> Every time you leave the same PR comment twice, consider writing a skill for it.</li><li><strong>Keep skills in your repo.</strong> Ours live in .agents/skills/ and are version-controlled. When patterns change, the skills change with them.</li></ol><h3><strong>What&#39;s Next</strong></h3><p>Our AGENTS.md currently carries a lot of weight. It describes our architecture, patterns, conventions, testing strategy, and common commands, all in one file. 
That worked as a starting point, but it has limits: everything is loaded into context all the time, even when only a fraction is relevant to the task at hand.</p><p>Our next step is breaking AGENTS.md into dedicated Agent rules: scoped, file-aware instructions that activate only when relevant. For example:</p><ul><li>A rule for Compose conventions that activates when editing *.kt files under ui/ packages</li><li>A rule for repository patterns that activates when working in data/ directories</li><li>A rule for testing conventions that activates when editing files under src/test/</li><li>A rule for ViewModel patterns (state management, SavedStateHandle, error handling) scoped to ViewModel files</li></ul><p>This is the <strong>natural</strong> evolution: AGENTS.md gives the AI everything upfront; rules give it the right knowledge at the right time. Smaller context windows, more precise output.</p><p>Skills teach the AI how to write code. Commands go further — they automate workflows. One we&#39;re actively working on: a <strong>release diff command</strong> that compares the current release branch to the previous one, summarizes the changelog (new features, bug fixes, migrations), and creates a Linear ticket with the formatted release notes. Today that&#39;s a manual process: someone digs through git log, writes up the changes, copies them into Linear. A command could do it in seconds. We see potential for other commands too: generating weekly team reports from merged PRs, auditing a feature module&#39;s dependency graph, or preparing QA checklists from the diff.</p><p><strong>Other explorations:</strong></p><ul><li>Skills for more complex workflows</li><li>Skills for functional testing</li><li>Skills for Compose screen testing with more sophisticated interaction patterns</li></ul><p>The “bet” we&#39;re making is simple: the value of an AI coding assistant scales with how much of your team&#39;s knowledge you can encode into its context. 
Skills are how we&#39;re doing that.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4e7b085d8e50" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/making-ai-write-android-code-our-way-a-practical-guide-to-agent-skills-4e7b085d8e50">Making AI Write Android Code Our Way: A Practical Guide to Agent Skills</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[We’re Hiring a Principal Backend Engineer to Shape the Future of Medium]]></title>
            <link>https://medium.engineering/were-hiring-a-principal-backend-engineer-to-shape-the-future-of-medium-a0d7896b3717?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/a0d7896b3717</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[medium]]></category>
            <category><![CDATA[backend-development]]></category>
            <category><![CDATA[backend]]></category>
            <category><![CDATA[hiring]]></category>
            <dc:creator><![CDATA[Michael Margolis]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 17:19:00 GMT</pubDate>
            <atom:updated>2025-10-29T18:25:08.051Z</atom:updated>
            <content:encoded><![CDATA[<p>When <a href="https://medium.com/@yipe/why-im-joining-medium-0b0479080e18">I joined Medium</a>, it was because I still believe in the internet as a place for ideas. Not noise, not outrage, but content that connects people through shared curiosity.</p><p>Medium’s mission is to deepen understanding and spread ideas that matter. We’re building the best place for reading and writing online: A space that rewards clarity, authenticity, and craft over clickbait and engagement hacks.</p><p>That vision only works if the technology beneath it is as thoughtful as the writing above it. That’s where <strong>you</strong> come in. <em>Yes, you!</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*v_FWVO8mT1JDGUxb" /><figcaption>Photo by <a href="https://unsplash.com/@yanu?utm_source=medium&amp;utm_medium=referral">Yannick Pulver</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>The Role</h3><p>We are starting the search for a <a href="https://job-boards.greenhouse.io/medium/jobs/4048233009"><strong>Principal Backend Engineer</strong></a> at Medium, our most senior IC level and a leadership position at the company. In this role, you will lead the evolution of Medium’s core backend, the systems that power reading, writing, and discovery for millions of users every day.</p><p>This is a <strong>deeply</strong> <strong>hands-on</strong>, <strong>high-impact role</strong>. 
You’ll help <strong>craft the strategy for where we want to be as an engineering organization on the backend</strong>, and be accountable for the <strong>quality, standards, and evolution of the Medium backend platform</strong>.</p><p>You will:</p><ul><li><strong>Partner</strong> with engineering, design, data, and executive leadership to define and deliver the future of Medium.</li><li><strong>Make</strong> our platform not just reliable, but genuinely delightful to build on.</li><li><strong>Modernize</strong> a large, living codebase so our teams can move faster and ship with confidence.</li><li><strong>Lead</strong> through influence by setting standards, reviewing critical code and designs, and mentoring engineers across teams.</li><li><strong>Shape</strong> how we build, ensuring our architecture, practices, and systems are strong, scalable, and aligned with Medium’s long-term goals.</li><li><strong>Balance </strong>vision with execution by building, debugging, and refining systems yourself to model technical excellence and thoughtful engineering craft.</li></ul><p>And you can do this work wherever you do your best thinking, whether that’s a home office, a beachside Airbnb, a rainy coastal cabin in the Pacific Northwest, or your favorite café with great Wi-Fi and even better croissants. This role is <strong>fully remote </strong>in the US<strong>.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*N2DjaEJ0Ffyr5Ysz" /><figcaption>Photo by <a href="https://unsplash.com/@proskurovskiy?utm_source=medium&amp;utm_medium=referral">Volodymyr Proskurovskyi</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Why This Role Matters</h3><p>I’ve been lucky enough to work with some truly great Principal Engineers in my career, the kind of people who bend the trajectory of a company. 
They don’t just write brilliant code, they <strong>change how everyone around them thinks about building</strong>. They spot patterns before they calcify into problems, simplify where others complicate, and bring calm, clear reasoning to complex systems and high-stakes decisions. When they get it right, everyone moves faster, builds better, and understands the system (and the mission) more deeply.</p><p>That’s what makes this role special. It’s not just about scale, it’s about <strong>stewardship</strong>.</p><h3>About You</h3><p>You find joy in making <strong>complex systems feel simple</strong>, in building things that <strong>help others move faster and think deeper</strong>. You leave codebases better than you found them, teams more confident than before you joined them, and products more human in how they behave.</p><p>You’ve <strong>built and scaled</strong> systems that millions rely on, and you’ve learned that reliability isn’t just uptime, it’s <strong>trust</strong>. You’ve wrestled with technical debt, fan-outs, and edge cases galore, and found ways to turn them into leverage, not legacy.</p><p>You lead with curiosity, not control. You ask great questions, create shared understanding, and help others see the system as clearly as you do.</p><p>And through it all, you never forget who it’s for: the writers and readers who come to Medium to make sense of the world.</p><h3>Why Medium</h3><p>To me, Medium has always been a rare and wonderful corner of the internet. A place where people come to slow down, think deeply, reflect, and share real human experiences.</p><p>In an era where misinformation, AI, and bots are reshaping public discourse, it’s <strong>more important than ever to elevate real human voices</strong>. 
That’s only possible if the technology underneath is <strong>fast, stable, and built for longevity</strong> so we can keep Medium human-centered, sustainable, and ready for what’s next.</p><p>If you’re a builder who loves the elegance of distributed systems and the beauty of human stories, <a href="https://job-boards.greenhouse.io/medium/jobs/4048233009">we’d love to meet you</a>.</p><p>Let’s build a better internet together. One that rewards ideas, not division.</p><p><strong>👉 </strong><a href="https://job-boards.greenhouse.io/medium/jobs/4048233009"><strong>Apply here</strong></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*93eNUm5UN7tYJayn8iXfeA.jpeg" /><figcaption>Source: <a href="https://www.linkedin.com/posts/medium-com_some-publishing-platforms-refuse-to-moderate-activity-7349463524403757056-QMfB/">Medium</a></figcaption></figure><ul><li><a href="https://medium.com/jobs-at-medium/work-at-medium-959d1a85284e">Work at Medium</a></li><li><a href="https://medium.com/me/following-feed/publications/2817475205d3">Following - Medium Engineering</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a0d7896b3717" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/were-hiring-a-principal-backend-engineer-to-shape-the-future-of-medium-a0d7896b3717">We’re Hiring a Principal Backend Engineer to Shape the Future of Medium</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Medium Android App — Migrating from Apollo Kotlin 3 to 4: Lessons Learned]]></title>
            <link>https://medium.engineering/medium-android-app-migrating-from-apollo-kotlin-3-to-4-lessons-learned-ff8d0d861cdb?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/ff8d0d861cdb</guid>
            <category><![CDATA[graphql]]></category>
            <category><![CDATA[apollo-client]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[kotlin]]></category>
            <dc:creator><![CDATA[Pierrick CAEN]]></dc:creator>
            <pubDate>Mon, 06 Oct 2025 08:18:42 GMT</pubDate>
            <atom:updated>2025-10-06T08:21:08.045Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DVZBoA0hHs4SHvqF" /><figcaption>Photo by <a href="https://unsplash.com/@mario?utm_source=medium&amp;utm_medium=referral">Mario Verduzco</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Medium Android App — Migrating from Apollo Kotlin 3 to 4: Lessons Learned</h3><p>In this post, I’ll share my experience migrating the Medium Android app from Apollo Kotlin version 3 to version 4, including the challenges I encountered and how I solved them to improve our GraphQL implementation.</p><h3>Understanding Our Apollo Cache Implementation</h3><p>Before diving into the migration, it’s important to understand how we use Apollo’s cache in the Medium Android app. Our app relies heavily on Apollo’s normalized cache for several critical purposes:</p><p><strong>Performance Optimization</strong>: We use FetchPolicy.CacheFirst as our default strategy, which means we always try to serve data from the cache first before making network requests. This significantly reduces loading times and provides a smooth user experience, especially when users navigate between screens that display similar content.</p><p><strong>Real-time Updates:</strong> We use Apollo’s watch() functionality extensively to observe cache changes and automatically update our UI when data changes. This is particularly useful for features like:</p><ul><li>Live clap counts on posts</li><li>Real-time follower updates</li><li>Post viewed updates</li><li>and more…</li></ul><h3>Starting the Migration</h3><p>The initial plan was straightforward: update Apollo Kotlin from version 3 to 4. 
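</p><p>Concretely, the bump boils down to swapping the Gradle coordinates to the new group id and plugin id described below (the version numbers here are illustrative, not the exact ones we used):</p><pre>// build.gradle.kts - before (Apollo Kotlin 3)<br>plugins {<br>    id(&quot;com.apollographql.apollo3&quot;) version &quot;3.8.5&quot;<br>}<br>dependencies {<br>    implementation(&quot;com.apollographql.apollo3:apollo-runtime:3.8.5&quot;)<br>}<br><br>// after (Apollo Kotlin 4) - new group id, plugin id, and package name<br>plugins {<br>    id(&quot;com.apollographql.apollo&quot;) version &quot;4.0.0&quot;<br>}<br>dependencies {<br>    implementation(&quot;com.apollographql.apollo:apollo-runtime:4.0.0&quot;)<br>}</pre><p>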
The IntelliJ plugin made this process seem simple at first glance.</p><h3>Key Changes in Apollo Kotlin 4</h3><ul><li><strong>Group id / plugin id / package name:</strong> Apollo Kotlin 4 uses a new identifier (com.apollographql.apollo) for its maven group id, Gradle plugin id, and package name. This change from com.apollographql.apollo3 allows running version 4 alongside version 3 if needed. Source: <a href="https://www.apollographql.com/docs/kotlin/migration/4.0">Apollo Kotlin Migration Guide - Group id / plugin id / package name</a></li><li><strong>Exception handling:</strong> Apollo 4 has a new way of handling exceptions. Instead of throwing exceptions directly, they’re now passed through ApolloResponse. Source: <a href="https://www.apollographql.com/docs/kotlin/migration/4.0#fetch-errors-do-not-throw">Apollo Kotlin Migration Guide — Fetch errors do not throw</a></li><li><strong>ApolloCompositeException:</strong> In Apollo Kotlin 3, when both cache and network operations failed with a CacheFirst policy, you’d get an ApolloCompositeException containing both errors. Apollo Kotlin 4 simplifies this by throwing only the primary exception while adding any secondary failures as suppressed exceptions, making error handling more straightforward. Source: <a href="https://www.apollographql.com/docs/kotlin/migration/4.0#apollocompositeexception-is-not-thrown">Apollo Kotlin Migration Guide — ApolloCompositeException is not thrown</a></li></ul><h3>The Challenge: Cache Miss Exceptions</h3><p>After making the initial changes, I encountered a major issue: CacheMissException errors were appearing throughout our UI wherever we used watch(). 
This was happening because Apollo 4 passes all exceptions to ApolloResponse instead of silently ignoring them when using FetchPolicy.CacheFirst.</p><p>This exposed an underlying issue: our cache configuration wasn’t optimal.</p><h3>Fixing the Cache Implementation</h3><p>Our app was using <a href="https://www.apollographql.com/docs/kotlin/caching/programmatic-ids">Programmatic cache IDs</a>, but the implementation had issues:</p><ul><li>Some IDs were missing from our CacheKeyGenerator</li><li>Some cache key generation logic was incorrect</li><li>These problems led to frequent cache misses</li></ul><h3>Solution: Declarative Cache IDs &amp; __typename on all Operations</h3><p>I decided to switch from Programmatic to <a href="https://www.apollographql.com/docs/kotlin/caching/declarative-ids">Declarative cache IDs</a>, which made it significantly easier to match IDs to types and fields. Here’s how I implemented it:</p><pre># Types<br>extend type Catalog @typePolicy(keyFields: &quot;id&quot;)<br>extend type CatalogViewerEdge @typePolicy(keyFields: &quot;id&quot;)<br>extend type Collection @typePolicy(keyFields: &quot;id&quot;)<br>extend type CollectionViewerEdge @typePolicy(keyFields: &quot;id&quot;)<br>extend type Post @typePolicy(keyFields: &quot;id&quot;)<br>extend type PostViewerEdge @typePolicy(keyFields: &quot;id&quot;)<br>extend type User @typePolicy(keyFields: &quot;id&quot;)<br>extend type UserViewerEdge @typePolicy(keyFields: &quot;id&quot;)<br># etc<br><br># Fields<br>extend type Query @fieldPolicy(forField: &quot;collection&quot;, keyArgs: &quot;id&quot;)<br>extend type Query @fieldPolicy(forField: &quot;post&quot;, keyArgs: &quot;id&quot;)<br>extend type Query @fieldPolicy(forField: &quot;publication&quot;, keyArgs: &quot;id&quot;)<br>extend type Query @fieldPolicy(forField: &quot;user&quot;, keyArgs: &quot;id&quot;)<br># etc</pre><p>To further improve cache hit rates, I added the __typename to all operations by configuring it in the Gradle build 
file:</p><pre>apollo {<br>    service(&quot;service&quot;) {<br>        addTypename.set(&quot;always&quot;)<br>    }<br>}</pre><h3>Cache Update Extension Functions</h3><p>To further improve our caching logic, I created extension functions for updating cache fragments that make both reading and writing more intuitive:</p><pre>internal suspend inline fun &lt;D : Fragment.Data&gt; ApolloStore.updateCache(<br>    fragment: Fragment&lt;D&gt;,<br>    cacheKey: CacheKey,<br>    customScalarAdapters: CustomScalarAdapters = CustomScalarAdapters.Empty,<br>    cacheHeaders: CacheHeaders = CacheHeaders.NONE,<br>    publish: Boolean = true,<br>    crossinline block: (cachedData: D) -&gt; D,<br>): Set&lt;String&gt; {<br>    val cachedFragment = getCachedFragment(<br>        fragment = fragment,<br>        cacheKey = cacheKey,<br>        customScalarAdapters = customScalarAdapters,<br>        cacheHeaders = cacheHeaders,<br>    ) ?: return emptySet()<br><br>    return writeFragment(<br>        fragment = fragment,<br>        cacheKey = cacheKey,<br>        fragmentData = block(cachedFragment),<br>        customScalarAdapters = customScalarAdapters,<br>        cacheHeaders = cacheHeaders,<br>        publish = publish,<br>    )<br>}<br><br>suspend fun &lt;D : Fragment.Data&gt; ApolloStore.getCachedFragment(<br>    fragment: Fragment&lt;D&gt;,<br>    cacheKey: CacheKey,<br>    customScalarAdapters: CustomScalarAdapters = CustomScalarAdapters.Empty,<br>    cacheHeaders: CacheHeaders = CacheHeaders.NONE,<br>): D? 
= runCatching {<br>    readFragment(<br>        fragment = fragment,<br>        cacheKey = cacheKey,<br>        customScalarAdapters = customScalarAdapters,<br>        cacheHeaders = cacheHeaders,<br>    )<br>}<br>.onFailure { e -&gt;<br>    when (e) {<br>        is CacheMissException -&gt; Timber.e(e, &quot;Cache miss on fragment $fragment with cache key $cacheKey.&quot;)<br>        is ApolloException -&gt; Timber.e(e, &quot;Cache read error on fragment $fragment with cache key $cacheKey.&quot;)<br>        else -&gt; Timber.e(e, &quot;Unexpected error while reading fragment $fragment with cache key $cacheKey.&quot;)<br>    }<br>}<br>.getOrNull()</pre><p>These extension functions significantly improved the readability of our cache manipulation code and provided better error handling for cache operations.</p><h3>Testing and Handling Cache Exceptions</h3><p>To ensure our Apollo response handling was correct, I first wrote unit tests for our methods that transform ApolloResponse into Kotlin Result types.</p><pre>@Test<br>fun `safeWatch with CacheMissException and FetchPolicy#CacheFirst should return success flow`() = runTest {<br>    // Given<br>    val requestUuid = Uuid.randomUUID()<br>    val cacheResponse = ApolloResponse.Builder(<br>        operation = IsFollowingCatalogQuery(&quot;CATALOG_ID&quot;),<br>        requestUuid = requestUuid,<br>    )<br>        .exception(CacheMissException(CacheKey(Catalog.type.name, &quot;CATALOG_ID&quot;).toString()))<br>        .build()<br>    val data = IsFollowingCatalogQuery.Data(<br>        catalogById = IsFollowingCatalogQuery.CatalogById(<br>            __typename = Catalog.type.name,<br>            catalogFollowData = CatalogFollowData(<br>                __typename = Catalog.type.name,<br>                id = &quot;CATALOG_ID&quot;,<br>                viewerEdge = CatalogFollowData.ViewerEdge(<br>                    __typename = CatalogViewerEdge.type.name,<br>                    id = &quot;VIEWER_EDGE_ID&quot;,<br>                    isFollowing = true,<br>                ),<br>            ),<br>        )<br>    )<br>    val networkResponse = ApolloResponse.Builder(<br>        operation = IsFollowingCatalogQuery(&quot;CATALOG_ID&quot;),<br>        requestUuid = requestUuid,<br>    )<br>        .data(data)<br>        .build()<br>    every { mockApolloCall.watch() } returns flowOf(cacheResponse, networkResponse)<br><br>    // When<br>    mockApolloCall.safeWatch(FetchPolicy.CacheFirst) { it }.test {<br>        // Then<br>        val result: Result&lt;IsFollowingCatalogQuery.Data&gt; = awaitItem()<br>        assertTrue(result.isSuccess)<br>        assertEquals(expected = data, actual = result.getOrNull())<br>        assertNull(result.exceptionOrNull())<br>        awaitComplete()<br>        ensureAllEventsConsumed()<br>    }<br>}</pre><p>Then I fixed the exception propagation in our watchers with this approach:</p><ul><li>We added fetchPolicy and refetchPolicy parameters to our watchers with default values. These defaults match those used by the Apollo Kotlin SDK.</li><li>We transform the ApolloResponse into a Result, enabling us to handle either Success or Failure cases.</li><li>If the fetchPolicy is CacheFirst or CacheAndNetwork, we skip the CacheMissException, as the Network step will emit afterwards with either a Success or an ApolloNetworkException.</li><li>If the fetchPolicy is NetworkFirst, we skip the ApolloNetworkException, as the Cache step will emit afterwards with either a Success or a CacheMissException.</li></ul><pre>inline fun &lt;D : Query.Data, R&gt; ApolloCall&lt;D&gt;.safeWatch(<br>    fetchPolicy: FetchPolicy = FetchPolicy.CacheFirst,<br>    refetchPolicy: FetchPolicy = FetchPolicy.CacheOnly,<br>    crossinline transform: (D) -&gt; R,<br>): Flow&lt;Result&lt;R&gt;&gt; = this<br>    .fetchPolicy(fetchPolicy)<br>    .refetchPolicy(refetchPolicy)<br>    .watch()<br>    .mapNotNull { response -&gt;<br>        val result = response.toResult(transform)<br>        val exception = 
result.exceptionOrNull()<br><br>        when {<br>            exception is CacheMissException &amp;&amp;<br>                (fetchPolicy == FetchPolicy.CacheFirst || fetchPolicy == FetchPolicy.CacheAndNetwork) -&gt;<br>                null<br>            exception is ApolloNetworkException &amp;&amp; fetchPolicy == FetchPolicy.NetworkFirst -&gt;<br>                null<br>            else -&gt;<br>                result<br>        }<br>    }</pre><p>The key insight here is that different fetch policies have different fallback strategies, and our exception handling needs to respect these strategies. By filtering out expected exceptions that will be followed by either success or a different type of exception, we ensure that our UI only receives meaningful errors that require user attention.</p><h3>Completing the Migration</h3><p>After resolving these cache-related issues, I was finally able to complete the migration to Apollo 4:</p><ul><li>Replaced executeV3() with execute()</li><li>Updated watch() calls to remove fetchThrows = true</li><li>Fixed ApolloCompositeException handling</li></ul><h3>Additional Improvements: Custom Type Adapters</h3><p>While diving deep into the Apollo documentation, I also discovered we could use type adapters for scalar values. 
I implemented this for our Currency scalar:</p><pre>import com.apollographql.apollo.api.Adapter<br>import com.apollographql.apollo.api.CustomScalarAdapters<br>import com.apollographql.apollo.api.json.JsonReader<br>import com.apollographql.apollo.api.json.JsonWriter<br>import java.util.Currency<br><br>object CurrencyAdapter : Adapter&lt;Currency&gt; {<br>    override fun fromJson(reader: JsonReader, customScalarAdapters: CustomScalarAdapters): Currency =<br>        Currency.getInstance(reader.nextString())<br><br>    override fun toJson(writer: JsonWriter, customScalarAdapters: CustomScalarAdapters, value: Currency) {<br>        writer.value(value.currencyCode)<br>    }<br>}</pre><h3>Future Improvements</h3><h4>HTTP Batching</h4><p><a href="https://www.apollographql.com/docs/kotlin/advanced/query-batching">HTTP batching</a> allows multiple GraphQL operations to be sent in a single HTTP request, reducing network overhead. This is particularly useful for applications that execute multiple queries simultaneously, as it can significantly improve performance by reducing the number of network requests. We are currently using HTTP batching on our Web platform without encountering any issues.</p><h4>Persisted Queries</h4><p><a href="https://www.apollographql.com/docs/kotlin/advanced/persisted-queries">Persisted Queries</a> improve network performance by sending a query hash instead of the full query text. This reduces payload size and can improve security. The server maintains a mapping of hashes to query strings, allowing it to execute the appropriate query when it receives a hash. Note that implementing Persisted Queries requires backend support.</p><h3>Conclusion</h3><p>What began as a simple version upgrade became a comprehensive overhaul of our GraphQL implementation. By switching to Declarative cache IDs, adding __typename to all operations, and properly handling cache exceptions, we&#39;ve significantly improved the cache hit rate of our Apollo GraphQL integration.</p><p>The key takeaway: when upgrading Apollo Kotlin, be prepared to revisit your caching strategy. 
The improvements in version 4 expose issues that might have been hidden in version 3, but fixing them leads to a more robust implementation.</p><hr><p><a href="https://medium.engineering/medium-android-app-migrating-from-apollo-kotlin-3-to-4-lessons-learned-ff8d0d861cdb">Medium Android App — Migrating from Apollo Kotlin 3 to 4: Lessons Learned</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Engineering stories behind the Medium Daily Digest Algorithm: Part 2]]></title>
            <link>https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-2-c977ad0b134f?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/c977ad0b134f</guid>
            <category><![CDATA[database]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[bloom-filter]]></category>
            <category><![CDATA[dynamodb]]></category>
            <dc:creator><![CDATA[Raphael Montaud]]></dc:creator>
            <pubDate>Thu, 28 Aug 2025 11:31:36 GMT</pubDate>
            <atom:updated>2025-09-09T15:43:13.996Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mpEOt9tgqfJVgyNYZFT3vg.jpeg" /></figure><h4>How we made our filtering 10x cheaper by removing our Bloom Filters</h4><p>Bloom Filters are great tools for fast and cheap filtering. They also come with plenty of problems and can easily get expensive and cumbersome. We switched to user-based direct database queries, which made our filtering cheaper and easier to maintain. Here’s the full breakdown of that migration.</p><blockquote><strong>Intro</strong>: This is a 4-part series breaking down improvements to the algorithm behind Medium’s Daily Digest over the past year. When we started this work, the Digest was suboptimal — and since it’s a huge distribution surface, reaching millions of readers every day, we started working on incremental improvements.</blockquote><blockquote>By the end of these projects, the digest was 10% more likely to convert users to paying members, less expensive to run, more flexible, and easier to maintain, and it now provides higher-quality recommendations for all our users, including our “power readers”.</blockquote><blockquote>This is told through the lens of our engineering team tackling a series of challenges one by one. Medium has a small team but we operate on a big scale. We’re working our way through some technical debt and at the same time, striving to provide the best experience for our readers. 
This is the source of many interesting challenges.</blockquote><blockquote>I hope this series helps you understand how the recommendations algorithm works and can help others who are facing similar technical challenges.</blockquote><p>This is probably the most technical story in the series, but I will keep it as simple as possible and hopefully this is interesting for non-technical readers too.</p><h3>Some Concepts</h3><p>Here’s a little cheat sheet with some concepts you may need to follow along with this story.</p><figure><img alt="Hand-drawn cheat sheet explaining Medium’s platform and recommendation system. Shows how Medium curates content through a 3-stage process: Source (pulls stories from various sources), Filter (removes duplicates/already read), and Rank (scores stories to predict user interest). Includes 4 recommendation surfaces: Daily Digest email, Homepage feed, push notifications, and post-reading suggestions called “Recire.”" src="https://cdn-images-1.medium.com/max/1024/1*9V1FpdLmgfQgEc8YDA9xFQ.png" /><figcaption>You may need this to understand the rest of this post</figcaption></figure><h3>Bloom Filters at Medium</h3><p>A lot of the filters I mention in this series are backed by Bloom Filters (I’ve described some of those filtering rules in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807">Part 1</a> if you haven’t read it already). 
We use Bloom filters to remove stories we think won’t interest readers from their feeds and other recommendations:</p><p><strong>For example:</strong></p><ul><li>our “muted” filter removes all stories from writers that you have muted</li><li>our “read” filter removes all stories that you have already read</li><li>our “presentation” filter removes all stories that have already been presented to you 3 times or more</li></ul><p>These all rely on Bloom Filters.</p><h3>So what’s a Bloom Filter?</h3><p>I asked Claude to summarize Bloom filters in a really simple way and it went with a funny analogy I’m going to try here.</p><p>A Bloom filter is like a super-efficient bouncer at a club who has a really good memory but isn’t perfect. It lets you do two things:</p><ol><li>let someone into the club<br> → in code that would be an add(string) function</li><li>ask if someone is in the club. There are two possible answers to this:<br> → “yes, probably”<br> → “no, definitely not”<br> → in code that would be a check(string) --&gt; bool function</li></ol><p>That may not sound super useful put like that, but we’ll see next that it’s actually kinda well suited for recommendation systems.</p><p>At Medium we’re using it to store information such as “user a read post x” or “user a muted user b”. We add those to the “club” as strings, like read|user_a|post_x. Later on, when we want to know if user a has already read post x, we just ask our “bouncer”: is read|user_a|post_x in the club?</p><figure><img alt="Flow diagram showing Medium’s Bloom Filter system. Two user actions feed into the green Bloom Filter: “User reads a story” adds ‘read|user_a|post_x’ and “User mutes a writer” adds ‘muted|user_a|user_b’. The filter checks if actions are “in da club” and connects to recommendation algorithm for filtering user content." 
src="https://cdn-images-1.medium.com/max/1024/1*o_zsHWREij_C_OeehtOGNg.png" /><figcaption>How we use Bloom Filters to filter out muted writers and already read stories from user feeds</figcaption></figure><h3>The scale of the filtering</h3><p>What’s nice with Bloom filters is that they store information very efficiently and can handle huge numbers of requests per second. We don’t really need to know more about the inner workings of Bloom filters for this series but you can read more about it <a href="https://systemdesign.one/bloom-filters-explained/">here</a> (and it has some Excalidraw schemas too 🤌).</p><p>When we’re building a feed for a user (the digest for example), we’re sometimes sourcing up to 5000 stories as the initial “shortlist”. This shortlist goes through many different steps and when we’re done, there’s only a handful of stories, ready to be sent in your digest 🙌</p><p>Ideally we should filter out stories as early as possible in the process. If we take the filters I’ve listed above, we’re asking ourselves 3 questions for every story in the shortlist:</p><ul><li>did the user read this story already?</li><li>is the writer of the story muted by the user?</li><li>was the user presented this story more than 3 times already?</li></ul><p>So that can add up to 15k questions in total for a single user feed, and we process thousands of feeds per second. So that’s more than 15M questions per second. 
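In code, the two operations from the bouncer analogy can be sketched like this (a toy bit-array-plus-hashes version for illustration, not Medium’s actual implementation):</p>

```kotlin
// Toy Bloom filter: a bit array plus k seeded hash functions.
// Illustrative sketch only — not Medium's production filter.
class BloomFilter(
    private val bits: Int = 1 shl 20, // size of the bit array
    private val hashes: Int = 5,      // number of hash functions
) {
    private val bitSet = java.util.BitSet(bits)

    // Derive k different indexes for one item by salting its hash with a seed.
    private fun index(item: String, seed: Int): Int =
        Math.floorMod("$item#$seed".hashCode(), bits)

    // "Let someone into the club."
    fun add(item: String) {
        for (seed in 0 until hashes) bitSet.set(index(item, seed))
    }

    // "Yes, probably" (true) or "no, definitely not" (false).
    fun check(item: String): Boolean =
        (0 until hashes).all { seed -> bitSet.get(index(item, seed)) }
}

fun main() {
    val club = BloomFilter()
    club.add("read|user_a|post_x")
    println(club.check("read|user_a|post_x")) // true: a Bloom filter has no false negatives
    println(club.check("read|user_a|post_z")) // false, except for a tiny false-positive chance
}
```

<p>A real deployment sizes the bit array and hash count for a target false-positive rate. 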
You can see how that can get out of hand and become expensive very quickly, so we’re going to need some solid infrastructure to handle this.</p><p>Fortunately our Bloom filter “bouncer” is able to answer an insane number of questions per second for pretty cheap, which is exactly what we’re looking for.</p><h3>Bloom Filter downsides</h3><p>The biggest downside for us is that even though Bloom filters are super memory efficient, we overused them so much by adding literally billions of items that they started getting really, really big and expensive. Like I mentioned before, there are only two operations you can do on a Bloom filter: add or check. There’s no way to delete any data from the filter (remove people from the club, if we stick with the bouncer analogy). And so the club can only grow in size; there’s no way to perform routine cleanups for information that we don’t need anymore.</p><p>The only way to reduce the size of the club is to start a new club with a new bouncer and then retire the old club and bouncer.</p><p><strong>There are also a few more downsides:</strong></p><ul><li>you can’t list the items that are stored (you can’t get a list of people who are in the club…). So we can’t go and see what’s stored for a given user, which makes it really hard to debug issues with the filters</li><li>you can’t remove anyone from the club, which means that you can’t change your mind. For example if a user “unmutes” a writer, there’s no way to reflect that in the Bloom filter. That writer will be muted forever for that user. Yes, that’s very janky.</li><li>this is derived data. For example the user “mutes” are stored in a proper database, and anytime there is a mute action we need to forward that information to the Bloom filter. This is inconvenient and introduces data drift issues and more complexity overall.</li></ul><figure><img alt="Diagram showing mute/unmute workflow with databases and Bloom Filter. 
Top: “User mutes a writer” updates database with timestamp and adds entry to Bloom Filter (shown as crosshatched “Black Box”). Bottom: “User unmutes a writer” updates database but shows “not possible” arrow to Bloom Filter, illustrating that Bloom Filters can’t remove entries once added." src="https://cdn-images-1.medium.com/max/1024/1*GDwjQOf0c7dHaVoFzo1y8Q.png" /><figcaption>Bloom filters are not databases. You can’t see what’s inside and you can’t remove items</figcaption></figure><p>All in all, there were too many downsides so we decided to explore a different approach.</p><h3>Replacing the “Muted Filter”</h3><p>When we build the feed for a given user, it happens in real time. It can be when you load the Medium homepage, or when we trigger the daily digest generation. We’re typically in the following situation:</p><ul><li>we have a shortlist of 1000 stories</li><li>we want to filter out the stories from writers that are muted by the user</li></ul><p><strong>There are two ways to go about this:</strong></p><p><strong>A. perform a lookup in the database table</strong> that keeps track of muted writers for the 1000 (userID, writerID) pairs and check the mutedAt attribute</p><figure><img alt="Database lookup diagram titled “Pairwise Lookups on Database” showing inefficiency of checking 1000 user-writer pairs individually. Database table shows userID, writerID, and mutedAt columns with sample data. Arrow shows recommendation algorithm requesting all pairs (user_a, writer_1) through (user_a, writer_1000) separately, demonstrating performance bottleneck." src="https://cdn-images-1.medium.com/max/1024/1*O21BbwJT2M1LZFePFSxHYA.png" /><figcaption>Approach A. For each pair we do a database lookup</figcaption></figure><p>This is extremely expensive, which is why we have to introduce a Bloom filter, where lookups are faster and cheaper.</p><figure><img alt="diagram titled “Pairwise Lookups with Bloom Filter”. 
Recommendation algorithm queries Bloom Filter with 1000 muted user-writer pairs like ‘muted|user_a|writer_1’ through ‘muted|user_a|writer_1000’ asking “Are those ‘in da club’?”" src="https://cdn-images-1.medium.com/max/1024/1*DfgSJkxOj5GoiuK4DHbHHw.png" /><figcaption>Approach A. but with a Bloom filter</figcaption></figure><p><strong>B. or fetch ALL of the muted writers for the current user</strong> and then cross-reference that with your shortlist of stories</p><figure><img alt="Database query diagram titled “With Direct Database query” showing average of ~1 database items read. Database table with userID, writerID, mutedAt columns. Arrow shows recommendation algorithm requesting “Get all writers muted by user_a” and database responding with “writer_1, writer_5”. Single efficient query instead of multiple lookups." src="https://cdn-images-1.medium.com/max/1024/1*ZlHYC_dAifZy8aj94CENRA.png" /><figcaption>Approach B. fetch all muted writers and then cross-reference with the shortlist</figcaption></figure><p>Approach B has a massive advantage: <strong>most users do not mute anyone</strong>. So on average, we’re reading a very small amount of data from the database. If we were to do that with DynamoDB, B is a Query that retrieves on average less than 1 item from the DB for each feed. <strong>This is very cheap and fast</strong>. For “power muters” — users who mute massive numbers of writers, in the thousands — we can still handle this in real time although with higher latencies.</p><p>So this approach immediately obliterates the need for a Bloom filter. We have a fast, cheap, reliable way to filter out muted writers. We can also use our “ground truth” database directly, no need for derived data with all the headaches this involves.</p><p>So that’s one less Bloom filter! Let’s move on to the next one.</p><h3>Replacing the “Presentation Filter”</h3><p>The question becomes a bit trickier when we look at the “presentation” filter. 
We’re typically in a situation where:</p><ul><li>we have a shortlist of 1000 stories</li><li>we need to remove all the stories that were presented to the user in a feed (in the past) 3 times or more</li></ul><figure><img alt="Diagram showing Medium feed presentation tracking. Left side shows user feed with three articles (resumes, travel, reading topics) labeled as post_a, post_b, post_d. Arrows show each presentation increments a counter (+1). Right side shows “Presentation Counter” tracking counts for each user-post pair, connected to recommendation algorithm applying “Presentation Filter” to exclude posts shown 3+ times." src="https://cdn-images-1.medium.com/max/1024/1*Ap1enCqRVv1ofHoXziOq6Q.png" /><figcaption>We keep track of how many times posts were presented to a user in a feed. When we build a new feed we make sure to exclude stories that were already presented several times</figcaption></figure><p>We’re currently doing that filtering using a Bloom filter. But you might wonder how we can even do that with Bloom filters. Remember, there are only two operations you can do with a Bloom filter:</p><ul><li>add someone to the club</li><li>ask if someone is in the club</li></ul><p>Bloom filters were not built to maintain counters, only true/false information. So we have to hack our way around it by layering them. In terms of clubs and bouncers, it’s like we have a big festival with a bouncer. Inside the festival there’s a private club with another bouncer. And inside that private club there’s a VIP zone with another bouncer… Ultimately we only want to know if someone is inside the VIP zone, so we only need to ask the VIP bouncer. 
But we still need the two other bouncers to keep track of who’s eligible to get in the VIP zone…</p><p>Going back to Bloom filters: we use 3 layers of Bloom filters, on top of each other, with each one encoding the information “post was presented to user x times”.</p><figure><img alt="Flowchart titled “Implementing a counter with Bloom Filters” showing how multiple Bloom Filters track presentation frequency. When post is presented to user, system checks three Bloom Filters: A (≥1 presentation), B (≥2 presentations), C (≥3 presentations). Flow shows: check Filter A, if yes check Filter B, if no add to Filter C. Each filter represents different presentation count thresholds." src="https://cdn-images-1.medium.com/max/1024/1*OAtMCmXPd0Me5A6Bd-gyRQ.png" /><figcaption>How we maintain counters with Bloom filters. An event listener updates the Bloom filters to maintain the counter when a post presentation happens. That’s just to count to 3. Imagine if you need to count to 10…</figcaption></figure><p>From the recommendations algorithm’s perspective it’s fairly simple: we just look up the (user, post) pairs in the Bloom filter that encodes the “3” value of the counter (we just ask the VIP bouncer).</p><figure><img alt="Diagram titled “Presentation Filter with Bloom Filter” showing recommendation algorithm querying Bloom Filter C with user-post pairs (user_a, post_1) through (user_a, post_1000) asking “Are those ‘in da club’?” Bloom Filter C contains posts presented at least three times to the user, enabling efficient filtering during recommendation process." src="https://cdn-images-1.medium.com/max/1024/1*4UDl9CUje2EgrgGaM4pruw.png" /></figure><blockquote>NB: that’s the implementation in place at Medium. I don’t know what went into consideration when building it this way. But FYI there are other (probably better) ways to build counters with Bloom Filters. 
<a href="https://systemdesign.one/bloom-filters-explained/#counting-bloom-filter">It’s possible to use bit-arrays for example</a>.</blockquote><p><strong>So that’s for the Bloom filter implementation. Now how can we handle that differently?</strong></p><p>The user-based approach involves fetching all of the “presentation history” of a user (i.e. all of the posts that were presented to the user in a feed before) and then going over that list to compute presentation counts for each story.</p><figure><img alt="Direct Database Query diagram showing flow from database table (userID, postID, presentedAt columns) through query “get all presentations for user_a” to list of posts (post_1, post_4, post_13, post_3), then counting to create totals (post_1: 1, post_4: 2, post_3: 6), finally applying threshold to filter out posts that should be filtered." src="https://cdn-images-1.medium.com/max/1024/1*kiKSxLxxpjt9LOcrwKVBgA.png" /></figure><p>That’s a little bit more complex than the “muted filter”, because:</p><ul><li>a lot of users have really massive presentation histories, in the tens of thousands or even more. Those are readers who come to Medium every day and are exposed to many recommendations</li><li>on average though, users don’t have that many posts in their presentation histories, typically fewer than 100. That’s because this includes many less engaged users who only came a handful of times on Medium, and they bring the average down.</li></ul><p>So the average cost is going to be higher than for the muted filter because we’re going to retrieve more data from the database <strong>AND</strong> some users have such big presentation histories that it’s not possible to fetch them in real time. So how can we tackle that?</p><p>Here, we’re saved by the fact that this filtering rule is not a hard requirement. Nothing says that “3 presentations is the absolute max” for a given (user, post). We built this rule to make feeds more diverse and less repetitive. 
We chose 3 as a reasonable default, but it’s okay if in some cases we reach higher counts.</p><p>So we can go with a “best effort” strategy here. We fetch the most recent 5k presentations for the user and simply act as if everything before that never existed. DynamoDB queries let us fetch thousands of items quickly, in ~100ms. So that simple solution is very acceptable in terms of functionality and latencies. There are scenarios where the filter will not be doing its job properly, but they’re limited to certain edge cases with minimal impact on the user.</p><p>We tested this approach and found that the costs were reasonable. This solution gives us more flexibility; for instance, we can now easily control and play with the maximum number of presentations (is 3 the right threshold? we’ll see that in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6">Part 3</a>). Down the line that will lead to a better experience for users. Now that this filter has been replaced, let’s move on to the final filter, which was the most challenging to migrate.</p><h3>Replacing the “Read Filter”</h3><p>Here’s what we’re trying to do:</p><ul><li>we have a shortlist of 1000 stories</li><li>we need to remove all the stories that the user has already read</li></ul><p>Very similar to the “Presentation Filter”, the user-based approach for this filter involves fetching the entire reading history of the user. We’re in the same situation where most of the time the user’s reading history is relatively small, with fewer than 100 posts on average. But we do have some users with massive reading histories.</p><blockquote>Fun fact, one of our top readers is our very own Harris Sockel who was in charge of the Medium Newsletter, and had 22k stories in his reading history just for 2024!</blockquote><p>This time, filtering already read posts is a hard requirement of the recommendations algorithm, no way around it. 
We don’t want to send you an email or a push notification about a post you’ve already read.</p><p><strong>So what do we do?</strong> For this filter specifically we decided to do the filtering in two stages. Very early in the feed building process, we fetch the most recent posts in the user’s reading history (the last 5k posts you’ve read). For the vast majority of users, this will capture their entire reading history. And we use that to filter out already read posts from the feed (at this point, we can have up to 5k posts in the feed). But that’s not enough: we need to guarantee that there are no read posts in the final results for all users.</p><p>To explain this further, we’ll need to go into our recommendation algorithm in a little bit more detail:</p><figure><img alt="Medium feed creation flowchart titled “Creating a Feed” showing process from multiple sources (followed writers, followed pubs, deep retrieval model) flowing through Source, Early Filter, Aggregate, Rank, and Final Filter stages. Side annotations show “Up to 5k posts” after Source, “Up to 1k posts” after Aggregate, and “About 10 posts” after Final Filter, demonstrating progressive narrowing." src="https://cdn-images-1.medium.com/max/1024/1*DvOB1kih8g6xfTl-7tQ1ZA.png" /><figcaption>A more detailed view of the recommendations algorithm</figcaption></figure><p>You can see that there are two different filtering steps. What we did was split our “Read Filter” into two different implementations and add each one at a different step:</p><figure><img alt="Two Stage Filtering diagram showing Early Filter and Final Filter stages. Early Filter shows post shortlist being filtered by “Fetch Already Read Posts” to create filtered shortlist, reducing from “Up to 5k posts.” Final Filter shows similar process with “Fetch User-Post Read State” filtering posts, ending with “About 10 posts.” Demonstrates efficient filtering at different stages." 
src="https://cdn-images-1.medium.com/max/1024/1*Vi7jAHinbo-yqrRq6uprMw.png" /></figure><p>We’re doing user-based queries in the early filtering step. This is nice because costs do not depend on the number of posts in the shortlist. Once we’re down to only a few posts, we can perform pairwise (user, post) lookups.</p><p>This solution works really well and doesn’t add too much complexity to the recs logic. This did require some gymnastics to maintain the requirements while keeping costs, recs performance and latencies under control. But this is much easier to control and debug than the previous implementation with Bloom Filters. It also allows us to support things like clearing the reading history for a given user (you can do that from <a href="https://medium.com/me/lists/reading-history">this page</a> to get a “recs fresh start” on Medium). This is a functionality that didn’t work well with the Bloom Filter implementation.</p><h3>Conclusion</h3><p>All in all we were able to get rid of our Bloom filters entirely, resulting in big cost savings. <strong>The new implementation is ten times cheaper than what our Bloom filters were costing us.</strong></p><p>This isn’t necessarily a fair comparison since our Bloom filters had grown out of proportion. If we had migrated from scratch to new Bloom instances, the Bloom filter implementation might have been cheaper. It was hard to evaluate it in our situation because of the way we built our Bloom filters: we used a single instance and string prefixes to manage all of the different filters.</p><p>If we go back to our club and bouncer analogy, you can see it this way: let’s say there are different crowds that go to the club, techno lovers, jazz enthusiasts, disco heads…</p><p>A. you can build a single club that hosts everyone. A single bouncer has the responsibility to memorize all the people in the club</p><p>B. 
or you can build one club for each crowd, each one with a different bouncer</p><p>If one day you don’t care about jazz enthusiasts anymore, with B. you can just retire the jazz club and its bouncer. But with A. you still need the bouncer for the other crowds, so you can’t retire the bouncer, who is still holding the information about all the jazz enthusiasts that entered the club. Down the line you might be paying for a club that’s bigger than you need with approach A. <strong>With approach B. you pay exactly for what you need!</strong></p><p><strong>With Bloom filters it looks like this:</strong></p><figure><img alt="Comparison of Single vs Multiple Bloom Instance architectures. Top shows one Bloom Filter receiving both read and muted data, queried by recommendation algorithm with a `read` or `muted` prefix. Bottom shows separate “Read” and “Muted” Bloom Filters. No need for string prefixes" src="https://cdn-images-1.medium.com/max/1024/1*AdT_565BROXqhN0LTVwhPA.png" /><figcaption>Decoupling filters into separate instances is probably much easier to maintain. When you stop using a filter you can just delete the instance. Our pattern of a single monolith Bloom instance forced us to keep paying until we had migrated all of the filters to newer instances (or to another implementation).</figcaption></figure><p>We were working with a single gigantic Bloom Filter, which meant that we weren’t able to retire anything and our Bloom instance was ever-growing.</p><p>The bottom line is that this project helped us get our costs down and migrate to newer, more flexible implementations that are much easier to maintain. 
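</p><p>As a rough illustration of the two-stage setup described above (the store and function names here are hypothetical, not our actual code):</p>

```python
# Hypothetical sketch of two-stage read filtering.
# All names are illustrative, not Medium's actual implementation.

RECENT_HISTORY_LIMIT = 5000  # most recent reads fetched in the early stage


class InMemoryReadStore:
    """Stand-in for the real storage layer, for illustration only."""

    def __init__(self):
        self._reads = {}  # user_id -> list of post_ids, most recent first

    def record_read(self, user_id, post_id):
        self._reads.setdefault(user_id, []).insert(0, post_id)

    def recent_post_ids(self, user_id, limit):
        # Stage 1 uses a single user-based query: its cost does not
        # depend on the size of the feed shortlist.
        return self._reads.get(user_id, [])[:limit]

    def has_read(self, user_id, post_id):
        # Stage 2 uses pairwise (user, post) lookups, affordable
        # because only a handful of finalists remain.
        return post_id in self._reads.get(user_id, [])


def early_filter(user_id, shortlist, store):
    """Drop posts found in the user's recent reading history (up to 5k)."""
    recent = set(store.recent_post_ids(user_id, RECENT_HISTORY_LIMIT))
    return [p for p in shortlist if p not in recent]


def final_filter(user_id, finalists, store):
    """Guarantee no read post survives, via per-post lookups."""
    return [p for p in finalists if not store.has_read(user_id, p)]
```

<p>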
With our newer implementations:</p><ul><li>costs are easy to understand and are under control</li><li>filtering costs don’t depend on the initial size of the feed “shortlist”</li><li>there’s no need to maintain derived data</li></ul><p>This also laid the groundwork for experiments on our filtering rules that we’ll see in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6"><strong>Part 3: Hard vs Soft Filtering and how this applies to Medium’s Recommendation System</strong></a><strong>.</strong></p><p>Thank you for reading; stay tuned for the next installments of this series by following the <a href="https://medium.engineering/">Medium Eng Blog</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c977ad0b134f" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-2-c977ad0b134f">Engineering stories behind the Medium Daily Digest Algorithm: Part 2</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Engineering stories behind the Medium Daily Digest Algorithm: Part 1]]></title>
            <link>https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/909a7ca5e807</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Raphael Montaud]]></dc:creator>
            <pubDate>Tue, 26 Aug 2025 11:31:37 GMT</pubDate>
            <atom:updated>2025-09-04T17:39:21.452Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OJipelgT4x24BzI4gtjFIQ.jpeg" /></figure><h4>How we made our email story recommendations better</h4><p>In this Part 1, you’ll understand how we improved one of the main ways our users are exposed to our product and how that led to <strong>a massive 7% increase in the average reading time</strong> for digest users.</p><blockquote><strong>Intro</strong>: This is a 4-part series breaking down improvements to the algorithm behind Medium’s Daily Digest over the past year. When we started this work, the Digest was suboptimal — and since it’s a huge distribution surface, reaching millions of readers every day, we started working on incremental improvements.</blockquote><blockquote>By the end of these projects, the digest was 10% more likely to convert users to paying members, less expensive to run, more flexible and easier to maintain, and it’s now providing higher quality recommendations for all our users, including our “power readers”.</blockquote><blockquote>This is told through the lens of our engineering team tackling a series of challenges one by one. Medium has a small team but we operate on a big scale. We’re working our way through some technical debt and at the same time, striving to provide the best experience for our readers. This is the source of many interesting challenges.</blockquote><blockquote>I hope this series helps you understand how the recommendation algorithm works and can help others who are facing similar technical challenges.</blockquote><h3>Some Concepts</h3><p>Here’s a little cheat sheet with some concepts you may need to follow along with this story.</p><figure><img alt="Hand-drawn cheat sheet explaining Medium’s platform and recommendation system. 
Shows how Medium builds content recommendations through a 3-stage process: Source (pulls stories from various sources), Filter (removes duplicates/already read), and Rank (scores stories to predict user interest). Includes 4 recommendation surfaces: Daily Digest email, Homepage feed, push notifications, and post-reading suggestions called “Recirc.”" src="https://cdn-images-1.medium.com/max/1024/1*9V1FpdLmgfQgEc8YDA9xFQ.png" /><figcaption>You may need this to understand the rest of this post</figcaption></figure><h3>The Discovery</h3><p>A little while back, Leigh, our Machine Learning Engineer and model training guru, started noticing something weird. The recommended stories in his digest were consistently not great matches for his reading profile. At the same time, the recommended stories in his homepage feed (what we call the “For You” feed) were consistently very well targeted.</p><figure><img alt="Side-by-side comparison of Leigh’s Medium feeds. Left shows “For You” feed with AI/recommendation systems stories (generative AI color schemes, Pinterest’s text-to-SQL, candidate ranking models, data team roles). Right shows Daily Digest email with broader data science topics (AI talent, workplace advice, Simpson’s Paradox, Nobel Prize decisions). Demonstrates personalized vs. curated content differences." src="https://cdn-images-1.medium.com/max/1024/1*tXfCC3EG_e14-cx-y3UzCg.png" /></figure><p>This was a bit puzzling and unexpected. Those two recommendation surfaces rely on exactly the same algorithm. We source stories the same way, and we rank them using the same model and features. The only big difference lies in the filtering step.</p><h3>The Investigation</h3><p>At this point, we thought maybe this was a bug, or maybe something in the filtering step had run amok specifically for the digest. We started digging into the different filters we use for the digest vs the ones we use for the homepage feed. We have many different filtering rules. 
For example, the “Filter Read” simply filters out all the stories that the user has already read. This one is applied on most recommendation surfaces. But some other filters are specific to certain recommendation surfaces:</p><figure><img alt="Comparison of Medium’s filtering systems. Left side shows “Homepage Specific Filters” with one “Presentation Filter” that removes posts shown three times prior. Right side shows “Digest Specific Filters” with three filters: Digest Title Filter (excludes posts used in previous email subjects), Sent in Opened Digest Filter (excludes posts from opened emails), and Digest Backoff Filter (excludes posts from last 7 days’ digests)." src="https://cdn-images-1.medium.com/max/1024/1*3GOqIEAFNJd_ZBAVKaX2HA.png" /><figcaption>Filtering rules differ between the Homepage Feed and the Digest</figcaption></figure><p>One of the reasons why those two surfaces use different filtering is that the reporting does not work the same:</p><ul><li>on the homepage we can track exactly what posts were presented to the user</li><li>with the digest email we have access to less information:<br> → we know if the email was opened (thanks to a tracking pixel, explained below)<br> → we know what posts we sent in a given digest<br> → we know which posts were clicked in a given digest</li></ul><p>Looking at this, we started realizing that the filtering applied to the digest is probably too aggressive. Every day, the algorithm sources the best 15 stories on the entire platform for the reader. But as soon as we detect an email was opened, we will never send the posts it contains to that reader again. That can be a bummer in some situations, for example:</p><ul><li>the user opens the email but doesn’t scroll past the top 5 stories</li><li>all the stories in that digest will never be sent again, although 10 of them didn’t even get a chance to be presented to the reader</li></ul><figure><img alt="Diagram showing Medium’s digest story selection over time. 
Day 1: top-ranked stories (1–15) go to Day 1 Digest, with lower-ranked stories available. Day 4: top three story groups are crossed out (filtered), so Day 4 Digest uses stories ranked 45–60. Shows how filtering prevents repetition and pushes digests to use progressively lower-ranked content over consecutive days." src="https://cdn-images-1.medium.com/max/1024/1*cuK8iqlKupWijKcow5EtPQ.png" /><figcaption>Illustration to show how we quickly burn through the best stories for a given user if we detect that they open their daily digest emails. After a few weeks the recommendations are less and less relevant to the user as there are fewer and fewer eligible stories for their digest. This is a simplification: the recommendations are not stale and there’s new supply coming in every day, but not necessarily enough to compensate for what we’re burning through</figcaption></figure><p>Having worked on email open tracking before, we also knew that there were other elements at play here that were potentially making things even worse.</p><h3>Apple Mail Privacy Protection</h3><p>In 2021, Apple introduced Mail Privacy Protection. The idea was to better protect user data and to prevent email open tracking.</p><p>When we send the digest, we also include a pixel tracker in the email. That’s a tiny image URL that we send as part of the mail content, and when the email is loaded, the email client has to fetch this image. The URL is engineered in a way that when it’s loaded, we know exactly which email loaded it and so we can record the email as “opened”.</p><figure><img alt="Email tracking diagram showing flow from sender to user. When user opens tracking email, a tiny 1x1 pixel tracking image loads from the tracking server, sending “email was opened” data back to sender. 
Illustrates how invisible tracking pixels monitor email engagement" src="https://cdn-images-1.medium.com/max/1024/1*orhtPFsZV-Cze_0VOUO8Vg.png" /><figcaption>Tracking pixels are used by email senders to know if a user has opened an email</figcaption></figure><p>Apple’s Mail Privacy Protection has completely destroyed this concept. Apple now preloads all your emails directly from its servers, and your email client communicates only with the Apple servers. From our point of view, it means that pretty much all the emails from Apple users are marked as “opened”.</p><figure><img alt="Postmark diagram showing Apple Mail Privacy Protection. Email flows from sender to Apple’s proxy server, then to recipient. When recipient opens email, click and open data goes to Apple’s proxy instead of back to sender. Bottom text reads “Recipient activity is invisible to the sender,” illustrating how Apple blocks email tracking." src="https://cdn-images-1.medium.com/max/1024/1*Oist485PgypueAPxatOQFQ.png" /></figure><p>So a lot of our “email open events” are actually “remote servers opening your email for you so that no one can tell if you actually opened it or not” events. We’re still not sure about the exact figures, but several estimates suggest that 50 to 80% of our “email open events” are fake.</p><p>That means a lot of our email open events are “fake news”. And those 15 stories that we carefully selected for you will never be eligible for your digest again — even if a lot of users didn’t actually open the email.</p><p>So it does look like we’re filtering way too aggressively. At this point we were confident that this filtering rule was the culprit and that it was responsible for degrading the quality of the recommendations in the digest. Now we just needed to come up with a plan to fix it.</p><h3>The Plan</h3><p>We went over the filtering rules in place and decided we’d modify a few things:</p><figure><img alt="Before/after comparison of Medium’s digest filtering. 
Left shows three filters: Digest Title Filter (unchanged), Sent in Opened Digest Filter (red, removes posts from opened digests), and Digest Backoff Filter (filters posts from last 7 days). Right shows updated filters with red filter removed and backoff reduced from 7 to 4 days. Labels show “removed” and “modified.”" src="https://cdn-images-1.medium.com/max/1024/1*YKPx5kU-FQq14-Ee68KtPg.png" /><figcaption>We removed the “Sent in Opened Digest Filter” and scaled down the backoff logic from 7 to 4 days</figcaption></figure><p>This is now a much less aggressive setup. It means that there will be more repetition in digests; the hope is that the best stories are able to make their way to the user and that they don’t get “disqualified forever” too early. Several things are in place to ensure that the digests are not too repetitive and evolve over time:</p><figure><img alt="Cross-digest diversification diagram with two sections. Left shows “Filtering Rules”: Digest Title Filter (excludes posts used in previous email subjects) and Digest Backoff Filter (excludes posts from last 4 days’ digests). Right shows “Organic Evolution of recs”: Newly Published Stories and User Profile Evolution explaining how changing content and evolving user preferences naturally diversify recommendations." src="https://cdn-images-1.medium.com/max/1024/1*qYR-qNlFJ0nJ-3g_YT9yKA.png" /><figcaption>We expect those filtering rules and organic evolution of recs to bring diversification between the daily digests</figcaption></figure><figure><img alt="Diagram showing Medium digest story eligibility rules. “Digest Stories” box contains two items: “rank 1 story” with arrow pointing to “Never eligible again for the digest” and “rank 2 to 15 stories” with arrow pointing to “Not eligible for the next 4 days.” Shows how top story gets permanent exclusion while others get temporary 4-day exclusion." 
src="https://cdn-images-1.medium.com/max/1024/1*CrHn-hhmwjeTbiTSCAxs8g.png" /><figcaption>A digest contains 15 stories. The first one is used as the email subject, which is why it’s not eligible for the digest anymore</figcaption></figure><p>We were still a little bit wary about the potential repetitiveness of digests. We wondered if users were going to notice and complain about that, so we gave a heads-up to the support team to be on the lookout for such complaints.</p><p>We put that to the test in an A/B test and got incredible results very quickly.</p><p><strong>Users in the experiment:</strong></p><ul><li>were <strong>10% more likely to convert</strong> to paying members</li><li><strong>read 7% more</strong> on Medium</li></ul><p><strong>A massive win for Medium!</strong></p><p>Readers didn’t seem to notice any repetitiveness in the digests, so we decided to ship without thinking too much about it.</p><p>We were so happy about the results that we immediately went on the hunt for more filtering rules that we could tweak.</p><p>We first had to dig into the implementation of our filters (cf. <a href="https://medium.com/medium-eng/engineering-stories-behind-the-medium-daily-digest-algorithm-part-2-c977ad0b134f">Part 2: How we made our filtering 10x cheaper by removing our Bloom Filters</a>) before we could start making more changes (cf. <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6">Part 3: “Hard vs Soft Filtering” and how this applies to Medium’s Recommendation System</a> and <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-4-ec7136f21acd">Part 4: Cross-Digest diversification</a>).</p><h3>Some Final Thoughts</h3><p>So it looks like the Apple Mail Privacy Program has been affecting our recommendation systems since 2021. 
Not all Apple Mail users were in the program initially; the number of enrolled users has been scaling up over the years, and so has the blast radius on digest users. My recommendation to you, if you have any features that rely on email open tracking in your product, is to <strong>immediately audit the potential effects of those “fake email opens”</strong>.</p><blockquote>To be clear, we are absolutely in favor of user privacy measures. The Apple privacy program has forced us to re-evaluate the way we build our recommendations in a less intrusive way, which is a good thing.</blockquote><p>We recently made the <a href="https://medium.com/blog/the-email-digest-is-how-millions-of-medium-readers-find-stories-now-weve-brought-it-to-the-app-c38f08463014">daily digest available in the Medium App</a>, a feature much requested by our users. On top of that, this is an opportunity to get better tracking than what we get from email clients. Down the line this could help us improve the recommendations we put in those digests.</p><p>Thank you for reading; stay tuned for the next installments of this series by following the <a href="https://medium.engineering/">Medium Eng Blog</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=909a7ca5e807" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807">Engineering stories behind the Medium Daily Digest Algorithm: Part 1</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Engineering stories behind the Medium Daily Digest Algorithm: Part 4]]></title>
            <link>https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-4-ec7136f21acd?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/ec7136f21acd</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Raphael Montaud]]></dc:creator>
            <pubDate>Mon, 25 Aug 2025 18:31:30 GMT</pubDate>
            <atom:updated>2025-09-03T11:31:38.188Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OfHQPW0iVi3UVpfnk_fPqQ.jpeg" /></figure><h4>Cross-Digest diversification</h4><p>In this Part 4, we’ll see how we went from investigating a few complaints from digest power users to improving our digest recommendations across the board.</p><blockquote><strong>Intro</strong>: This is a 4-part series breaking down improvements to the algorithm behind Medium’s Daily Digest over the past year. When we started this work, the Digest was suboptimal — and since it’s a huge distribution surface, reaching millions of readers every day, we started working on incremental improvements.</blockquote><blockquote>By the end of these projects, the digest was 10% more likely to convert users to paying members, less expensive to run, more flexible and easier to maintain, and it’s now providing higher quality recommendations for all our users, including our “power readers”.</blockquote><blockquote>This is told through the lens of our engineering team tackling a series of challenges one by one. Medium has a small team but we operate on a big scale. We’re working our way through some technical debt and at the same time, striving to provide the best experience for our readers. This is the source of many interesting challenges.</blockquote><blockquote>I hope this series helps you understand how the recommendation algorithm works and can help others who are facing similar technical challenges.</blockquote><h3>Some Concepts</h3><p>Here’s a little cheat sheet with some concepts you may need to follow along with this story.</p><figure><img alt="Hand-drawn cheat sheet explaining Medium’s platform and recommendation system. Shows how Medium curates content through a 3-stage process: Source (pulls stories from various sources), Filter (removes duplicates/already read), and Rank (scores stories to predict user interest). 
Includes 4 recommendation surfaces: Daily Digest email, Homepage feed, push notifications, and post-reading suggestions called “Recirc.”" src="https://cdn-images-1.medium.com/max/1024/1*9V1FpdLmgfQgEc8YDA9xFQ.png" /><figcaption>You may need this to understand the rest of this post</figcaption></figure><h3>User Complaints</h3><p>After we shipped all the changes mentioned in the previous installments of this series, we started seeing some support tickets coming in related to the digest:</p><figure><img alt="Two user complaint messages titled “User Complaints.” First from Daniel complains about daily newsletter “regurgitating the same articles over and over again” across various authors and content domains. Second from Omar notes recommendation quality has “fallen off dramatically” with repeated recommendations and asks if Medium changed their algorithm, requesting they revert to earlier version." src="https://cdn-images-1.medium.com/max/1024/1*GLhDqy_9lj6rgRafuU3SBg.png" /><figcaption>User complaints that started off our investigation</figcaption></figure><p>I think we should appreciate the level of thoughtfulness our users put into those support tickets. We review those carefully and we take pride in reading and answering every support ticket.</p><p>Those were forwarded to the recommendation team and we immediately thought that maybe we had oversteered too much when we removed some of the filtering rules for the digest.</p><p>User digests were too repetitive and the issue was particularly noticeable for our digest power users. 
We started investigating the issue in search of a quick solution.</p><h3>The Problem</h3><p>As we’ve seen in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807"><strong>Part 1: How we made our email story recommendations better</strong></a>, there are a few things in place to make the digest less repetitive:</p><figure><img alt="Diagram showing Medium digest story eligibility rules. “Digest Stories” box contains two items: “rank 1 story” with arrow pointing to “Never eligible again for the digest” and “rank 2 to 15 stories” with arrow pointing to “Not eligible for the next 4 days.” Shows how top story gets permanent exclusion while others get temporary 4-day exclusion." src="https://cdn-images-1.medium.com/max/1024/1*CrHn-hhmwjeTbiTSCAxs8g.png" /><figcaption>A digest contains 15 stories. The first one is used as the email subject, which is why it’s not eligible for the digest anymore</figcaption></figure><figure><img alt="Cross-digest diversification diagram with two sections. Left shows “Filtering Rules”: Digest Title Filter (excludes posts used in previous email subjects) and Digest Backoff Filter (excludes posts from last 4 days’ digests). Right shows “Organic Evolution of recs”: Newly Published Stories and User Profile Evolution explaining how changing content and evolving user preferences naturally diversify recommendations." src="https://cdn-images-1.medium.com/max/1024/1*BaQbRpLUWmtcbfvrhicYAA.png" /><figcaption>Filtering rules and organic evolution of recs is what makes every digest different</figcaption></figure><p>We dug into the data and found that for the vast majority of users the issue was minimal, barely noticeable: their digests are composed mostly of stories that have never been sent in a prior digest.</p><p>But for the users who were complaining, the issue was very visible. 
<strong>Some posts got sent up to 8 times to the same user!</strong> On average, for those users, posts in a given digest had been sent twice in a prior digest. So there was a massive issue with those users’ digests, seriously damaging their experience.</p><h3>The fix</h3><p>Fixing this is a little bit tricky. As we’ve seen in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807"><strong>Part 1: How we made our email story recommendations better</strong></a>, there’s no way to tell if a user has truly opened their daily digest email because of Apple’s Mail Privacy Protection.</p><p>If we assume that all emails are “opened” and do some filtering based on that, we will undo all the gains we got in <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-1-909a7ca5e807"><strong>Part 1</strong></a>. Those are big wins that we’re not ready to say goodbye to.</p><p>So how do we get out of this situation? There is one thing that we are able to track and that we haven’t leveraged yet. When a reader clicks on a story in their email digest, they navigate to Medium and we report the ID of the email they came from. We can use that to our advantage. Whenever a user clicks a story in a digest, we can be sure that they actually opened the email.</p><p>We decided to put some logic in place to keep track of which digest emails have at least one post that was clicked.</p><figure><img alt="Digest Database flowchart showing user action “User Reads a Post” leading to decision diamond “Is the referrer a digest email?” If yes, updates Digests Database (emailID, sentAt, hasPostRead columns) by setting hasPostRead=true. If no, leads to “Do Nothing.” Database shows sample entries with dates from July 2025 and true/NA values for hasPostRead field." 
src="https://cdn-images-1.medium.com/max/1024/1*SAdlXaO1jI32oIEh_pmi4g.png" /><figcaption>Links to posts in emails are engineered so that we can keep track of the emailID when the user navigates to Medium</figcaption></figure><p>So now we have a way to list some digests that the user has “definitely opened” with the posts that they contained.</p><p>With that new information, we can set up a new filter in our recommendation stack. When building the feed, we fetch all of those “definitely opened” digests for this user and we count total occurrences of each post.</p><p>We can then filter out posts above a certain threshold. That rule would translate to something like:</p><ul><li>if we have sent this story in more than x digests that the user has “definitely opened”, then we will never send it to that user again</li></ul><figure><img alt="HasPostRead Digest Filter diagram showing recommendation algorithm querying Digests Database for “user_a with hasPostRead = true.” Database table shows userID, sentAt, hasPostRead, and postIDs columns. Algorithm builds counts for each post, then applies threshold filtering to exclude posts that appear frequently in opened digests." src="https://cdn-images-1.medium.com/max/1024/1*mo5lgwFYXOy32prr88H2hw.png" /><figcaption>When building the feed, we can look at the previously sent digests to create a filtering rule based on the <strong>hasPostRead</strong> attribute</figcaption></figure><p>In order to find the right threshold, we ran some tests to measure the impact on the user experience. For different user samples, we computed how many posts would be filtered out from their digest with this new rule, depending on the value of the threshold:</p><figure><img alt="Data table showing threshold experiment results with three threshold values (1, 2, 3) and three user groups. Shows average values for Random Sample Free Users (0.15, 0.03, 0.01), Random Sample Member Users (1.44, 0.47, 0.22), and Users who complained (4.6, 1.5, 1.1). 
Higher thresholds generally show lower values across all groups." src="https://cdn-images-1.medium.com/max/1024/1*jJ-hbsl2WZlcZy8TRi-iTg.png" /><figcaption>Number of posts filtered from the digest depending on the threshold</figcaption></figure><p><strong>Here are the takeaways from those results:</strong></p><ul><li>free users’ digests are barely affected by this new filtering rule. Whatever the threshold, less than 1 story would be filtered out from their digests (a digest is 15 stories) if we started applying that filter today</li><li>members would be moderately affected with a threshold of 1. With a higher threshold they would barely be affected</li><li>complaining users would be greatly affected by this change with a threshold of 1, with almost 5 stories filtered out from each of their digests. It seems like that would introduce enough diversification to fix their issues with repetitive digests</li></ul><p>So that test helped us decide that a threshold of 1 is what we’re looking for, and it gave us confidence that this is a solution worth exploring:</p><ul><li>it seems to fix things for users who had an issue</li><li>it has a marginal impact on the majority of users</li></ul><p>Also, we’re leveraging new, reliable data in our recs algorithm, so there’s a chance that this will improve recommendations overall.</p><p>We decided to put it to the test in an A/B test.</p><h3>Results</h3><p>Results for this A/B test were really good, as <strong>we observed a statistically significant increase in reading time</strong> for the users in the experiment. So it seems like this had a positive impact on the experience of all our users overall, not just the small percentage of users that were the most impacted. Members had a bigger increase in reading time, which makes sense as they were more impacted according to our tests from the previous section.</p><p>We didn’t see any complaints on digest diversification since we shipped this and we’re pretty happy about the new state of things. 
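</p><p>For illustration, the threshold rule above could be sketched like this (field and function names are hypothetical, not our actual code):</p>

```python
# Hypothetical sketch of the "definitely opened" digest filter.
# Field names (has_post_read, post_ids) are illustrative.
from collections import Counter

SEND_COUNT_THRESHOLD = 1  # value chosen from the threshold experiment


def filter_overexposed_posts(candidate_posts, sent_digests):
    """Drop any post already sent in more than SEND_COUNT_THRESHOLD
    digests that the user definitely opened (i.e. clicked a post in)."""
    counts = Counter()
    for digest in sent_digests:
        if digest["has_post_read"]:  # at least one click proves a real open
            counts.update(digest["post_ids"])
    return [p for p in candidate_posts if counts[p] <= SEND_COUNT_THRESHOLD]
```

<p>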
We love that our users are so open and ready to share their feedback with us — this helped us build a better product for all Medium users.</p><p>This is a new “hard filter”, which is not necessarily the best approach here. It’s likely that turning this into a soft filter (cf. <a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6"><strong>Part 3: Hard vs Soft Filtering and how this applies to Medium’s Recommendation System</strong></a>) would yield better results.</p><p>Another possible element to explore is the position of stories in the digest. It’s important because the top stories are the first to be displayed, and it’s possible that a user didn’t scroll the entire email and so didn’t get a chance to see the posts at the bottom of the digest. We could imagine some new rules leveraging the position in the email to make this filter more fine-grained.</p><p><strong>Thank you for following this series till the end; I hope it helped you understand a few concepts.</strong></p><p>Overall these projects helped our users read more and made them more likely to convert to paying members. Our algorithm is also less expensive to run, more flexible and easier to maintain, and it’s now providing higher quality recommendations for all our users, including our “power readers”.</p><p>I hope this gave you a clearer picture of our recommendation systems and how we try to improve them step by step. 
Hopefully that gave you some ideas to improve your own systems too!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec7136f21acd" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-4-ec7136f21acd">Engineering stories behind the Medium Daily Digest Algorithm: Part 4</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Engineering stories behind the Medium Daily Digest Algorithm: Part 3]]></title>
            <link>https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/6fbf1512e6e6</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Raphael Montaud]]></dc:creator>
            <pubDate>Mon, 25 Aug 2025 18:31:24 GMT</pubDate>
            <atom:updated>2025-09-03T17:02:02.359Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yOR2ymyOFAsIzWJbUT508g.jpeg" /></figure><h4>Hard vs Soft Filtering and how this applies to Medium’s Recommendation System</h4><p>In this Part 3 we’ll see how we modified one of our hard filtering rules and attempted to turn it into a machine learning based “soft filter”.</p><blockquote><strong>Intro</strong>: This is a 4-part series breaking down improvements to the algorithm behind Medium’s Daily Digest over the past year. When we started this work, the Digest was suboptimal — and since it’s a huge distribution surface, reaching millions of readers every day, we started working on incremental improvements.</blockquote><blockquote>By the end of these projects, the digest was 10% more likely to convert users to paying members, less expensive to run, more flexible, easier to maintain, and it’s now providing higher-quality recommendations for all our users, including our “power readers”.</blockquote><blockquote>This is told through the lens of our engineering team tackling a series of challenges one by one. Medium has a small team but we operate on a big scale. We’re working our way through some technical debt and at the same time, striving to provide the best experience for our readers. This is the source of many interesting challenges.</blockquote><blockquote>I hope this series helps you understand how the recommendation algorithm works and can help others who are facing similar technical challenges.</blockquote><h3>Some Concepts</h3><p>Here’s a little cheat sheet with some concepts you may need to follow along with this story.</p><figure><img alt="Hand-drawn cheat sheet explaining Medium’s platform and recommendation system. Shows how Medium curates content through a 3-stage process: Source (pulls stories from various sources), Filter (removes duplicates/already read), and Rank (scores stories to predict user interest). 
Includes 4 recommendation surfaces: Daily Digest email, Homepage feed, push notifications, and post-reading suggestions called “Recire.”" src="https://cdn-images-1.medium.com/max/1024/1*9V1FpdLmgfQgEc8YDA9xFQ.png" /><figcaption>You may need this to understand the rest of this post</figcaption></figure><h3>Conference time</h3><p>Back in 2024, Leigh and I were in Bari, Italy for the annual RecSys conference, an international conference on Recommender Systems. In between some panzerotti and cappuccinos in the old town patios we managed to go to a talk or two.</p><figure><img alt="Two-panel photo showing hands holding coffee items. Left panel shows hands holding a small white coffee cup. Right panel shows hands holding a white coffee cup from above, revealing a very short dark espresso" src="https://cdn-images-1.medium.com/max/1024/1*ViWxObDb3i3SfYtBBrr1FQ.png" /><figcaption>surprisingly small cups of coffee. And surprisingly small amounts of water</figcaption></figure><figure><img alt="Conference venue entrance with blue banner reading “18th ACM Conference on Recommender Systems” in Bari, Italy, October 14–18 2024. Modern building interior visible with attendees in background. Small fluffy dog sits in foreground outside the venue entrance." src="https://cdn-images-1.medium.com/max/1024/1*Q-QN4kylcNrKHPC3K3bKVg.png" /><figcaption>Nala (my dog) attending the conference</figcaption></figure><figure><img alt="Small fluffy cream-colored dog sitting at outdoor café table sniffing a plate of pasta on the table." src="https://cdn-images-1.medium.com/max/1024/1*BJZEaUdUULNzDk6w5SNNtA.png" /><figcaption>Nala realizing she’s been given her own plate of orecchiette alla bolognese</figcaption></figure><p>It had been a while since we’d been to a conference and it was a great way to get our heads out of the day-to-day grind and focus on some big-picture stuff.</p><p>There were many interesting talks but one that stood out for us was Chris Johnson’s. 
He gave some insights into the recommendation algorithm at Indeed and mentioned something about hard vs soft filtering that resonated with us.</p><h3>Hard Filtering</h3><p>In recommendation systems, you want to make sure you filter out some items from a user’s recommendations for many different reasons.</p><p><strong>For example:</strong></p><ul><li>a YouTube video that you have already watched</li><li>a Medium story from a writer that you have muted or blocked</li><li>a job posting that doesn’t match your requirements</li></ul><p>Those filtering rules are hard filters. It’s an “all or nothing” scenario. If the (user, item) pair passes a certain condition, then it will be filtered out.</p><p>A failure in one of those filters would be considered a bug and users would probably report it.</p><p>But there are some filtering rules that are not associated with a “feature”. They are just rules that are in place because we think they make the recommendations better.</p><p><strong>For example at Medium:</strong></p><ul><li>any story that we’ve presented to you 3 times or more in a feed is not eligible to be recommended to you anymore (we call that our “Presentation Filter”)</li><li>in certain feeds like the “Trending” feed, stories past a certain “age” are completely removed (we call that our “Old Filter”)</li></ul><figure><img alt="Table comparing filters with “Is it a feature?” column. Lists 5 filters: Read Filter (filters already read posts) — Yes, Muted Filter (filters muted writers’ posts) — Yes, Presentation Filter (filters posts shown 3+ times) — No, Old Filter (filters posts over x months old) — No, Digest Title Filter (filters posts used in previous email subjects) — Yes." src="https://cdn-images-1.medium.com/max/1024/1*_jQb8-1gljUJTgydN3sUKQ.png" /></figure><p>Chris Johnson’s point (at least my understanding of it) is that, as much as possible, the <strong>hard filters should be associated with a “feature”</strong> (i.e. a user expectation or a product specification). 
The other filters should be transformed into soft filters.</p><h3>So what’s a soft filter?</h3><p>Instead of a yes or no rule, a soft filter applies a rule in a more continuous way.</p><p>To illustrate that, let’s say we want to do a “trending” feed showing all the stories that are trending on Medium. We want this feed to show recently published stories that are popular. <strong>How can we create that fresh and trending experience for our users?</strong></p><p><strong>To build that feed we need two things:</strong></p><ul><li>pick a pool of stories that are eligible to show up in the feed</li><li>find a way to rank them so that we can select the “top 10” that will be displayed to the user</li></ul><p><strong>A. Trending feed with a hard age filter:</strong></p><p>One way to go about it is to say:</p><ul><li>we want only recent stories so we’ll select only stories published in the last 7 days<br> → NB: this is a hard filter: the story is either recent enough or too old</li><li>we want the most popular stories so we’ll rank them by the total number of claps they have received, and select the top 10</li></ul><p><strong>Now you have 10 recent, popular stories on Medium which will create a “trending feed” experience for the user.</strong></p><p>This works nicely but it has a few blind spots:</p><ul><li>we will prefer a story with 700 claps that’s 6 days old over a story that’s 2 days old but has 600 claps — which is not great because the one with 600 claps is clearly more promising and arguably more “trending”</li><li>a story that has 2k claps and that’s 8 days old will be excluded by our hard filter, although we could argue this one is very “trending”.</li></ul><p>So with this hard filter rule there’s a big bias when we build our feed. It’s not fair to stories that were published very recently, and it’s not fair to stories that are just above the threshold.</p><p><strong>B. 
Trending feed with a soft age filter:</strong></p><p>To counter those blind spots we can implement a soft filter on the age of posts. The idea is to include the age of posts in our ranking formula in a continuous way.</p><p>Previously we were ranking just based on score = number_of_claps. Let’s see what happens if we decide to rank stories based on score = number_of_claps - 100 * age_of_story_in_days instead.</p><figure><img alt="Hard vs Soft Filtering comparison showing three columns: original post list (Posts A-D with claps/age), Hard Filter that removes Post C for being “too old” (crossed out), and Soft Filter that ranks posts by “claps minus 100 times age” formula, reordering them by calculated scores (1200, 400, 100, 50) while keeping all posts." src="https://cdn-images-1.medium.com/max/1024/1*_v7xRpR9GE-Db5Lex9kuMg.png" /></figure><p>For the same number of claps, a story’s score will go down as it ages. It will therefore start ranking lower and lower compared to other “newer” stories. Eventually, even a recently published story with zero claps will outperform it. That makes sense: users come to the “trending” feed to see recent stories.</p><p>In ranking A, it’s a yes-or-no function: a story is either young enough or too old; <strong>that’s what we call a hard filter</strong>. In ranking B, a story slowly goes down in the ranks as it ages; <strong>that’s what we call a soft filter</strong>.</p><p>In practice we don’t always have to design formulas like that. We can use a machine learning algorithm to predict a score for each (user, post) pair and have this score act as a soft filter. 
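The two rankings above can be sketched in a few lines of Python. This is a toy example using the posts from the diagram (Posts A–D); the 100-claps-per-day penalty is just the illustrative formula from above, not our production ranker:

```python
# Toy posts from the example: claps and age in days.
posts = [
    {"id": "A", "claps": 700, "age_days": 6},
    {"id": "B", "claps": 600, "age_days": 2},
    {"id": "C", "claps": 2000, "age_days": 8},
    {"id": "D", "claps": 150, "age_days": 1},
]

def rank_hard(posts, max_age_days=7):
    """Hard filter: drop anything older than the cutoff, then rank by claps."""
    eligible = [p for p in posts if p["age_days"] <= max_age_days]
    return sorted(eligible, key=lambda p: p["claps"], reverse=True)

def rank_soft(posts, penalty_per_day=100):
    """Soft filter: keep everything; age continuously lowers the score."""
    return sorted(posts, key=lambda p: p["claps"] - penalty_per_day * p["age_days"], reverse=True)

print([p["id"] for p in rank_hard(posts)])  # Post C is excluded outright
print([p["id"] for p in rank_soft(posts)])  # Post C ranks first despite its age
```

Note how the soft version reproduces the scores from the figure (1200, 400, 100, 50) and keeps Post C in play.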
Let’s see how this works in the next section.</p><h3>Machine Learning model as a soft filter</h3><p>Let’s dig a little into machine learning models and how we can use them as soft filters.</p><p><strong>Our Machine Learning ranking algorithms can be summarized like so:</strong></p><ul><li>they have access to information on the user and the post</li><li>their goal is to predict the likelihood of certain events (e.g. the likelihood that you will click on the story preview, or clap for the story, etc.)</li><li>they rely on historical data to train and adjust their predictions</li></ul><figure><img alt="Machine Learning Ranking diagram showing post data (700 claps, python/ML tags, 17 days old) and user data (follows python/design, 6 months old) feeding into ML Model. Model outputs likelihoods (Click: 35%, Clap: 5%, Dislike: 1%) multiplied by weights (Click: 1, Clap: 30, Dislike: -100) to produce final Score: 0.85." src="https://cdn-images-1.medium.com/max/1024/1*HSmzx-RPL5hMOmZQjljk2g.png" /><figcaption>We trained our model so that it can be fed user and post information and send back some event likelihoods. We associate a weight with each event. The weight is the translation of “how badly we want this to happen”. A click is good, but a clap is much better. A “dislike” is really bad.</figcaption></figure><p>So we don’t have to filter out all the stories past a certain age, or create handmade rules to prioritize more recent stories. We can rely on our Machine Learning model to learn the patterns and interactions between the different variables. The model should be able to sort out the impact of the post age by learning from the historical data.</p><p>And it should then be able to predict how that affects the different likelihoods. All things being equal, increasing the “age” value of the post should decrease the score. 
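The scoring step in the diagram can be reproduced with the figure’s numbers. A minimal sketch (the event names, likelihoods, and weights come from the figure; the real model and weighting are more involved):

```python
# Weights express "how badly we want this to happen" (values from the figure).
WEIGHTS = {"click": 1, "clap": 30, "dislike": -100}

def final_score(likelihoods, weights=WEIGHTS):
    """Weighted sum of the model's predicted event likelihoods."""
    return sum(likelihoods[event] * weights[event] for event in weights)

# Likelihoods predicted for the (user, post) pair in the figure.
score = final_score({"click": 0.35, "clap": 0.05, "dislike": 0.01})
print(round(score, 2))  # 0.85
```

Because age feeds into the predicted likelihoods, an older post ends up with a lower score through exactly this arithmetic, with no explicit age rule.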
This means that the Machine Learning ranking step can act as a soft filter.</p><figure><img alt="Machine Learning soft filtering comparison showing two identical posts (700 claps, python/ML tags) with same user (follows python/design, 6 months old). Left post is 17 days old with positive Score: 0.85 (green). Right post is 780 days old with negative Score: -4.6 (red), demonstrating how ML ranking acts as soft filter by scoring rather than removing content." src="https://cdn-images-1.medium.com/max/1024/1*a4xT4we8LV_Kf8CXxfX8DA.png" /><figcaption>All things being equal, increasing the age of a post should decrease the score. The machine learning model is therefore acting as a soft filter.</figcaption></figure><p>This “soft filter” approach has many advantages in theory. There are more stories available at the ranking step, which means that the ranker has more leeway. It might be able to find “old gems” that wouldn’t make it into user feeds otherwise. The model has more possibilities and more information and so it should be able to produce better recommendations.</p><p>If we go back to our previous “Trending feed”, we had excluded a potentially great story with 2k claps just because it was one day older than the limit; this wouldn’t happen with a soft filter. Also, we don’t have to choose and tune a maximum age limit (is 7 days the best limit? Should we do 3 days? 15 days?). With a machine learning ranking we can just let the model prioritize recommendations based on the metrics we care about.</p><figure><img alt="Hard Filter diagram showing ranked post list with Post C (2000 claps, 8 days old) crossed out in red with annotation “Post C is too old.” Three remaining posts shown: Post A (700 claps, 6 days), Post B (600 claps, 2 days), Post D (150 claps, 1 day). 
Bottom text reads “Ranked by number of claps, no more than 7 days old.”" src="https://cdn-images-1.medium.com/max/889/1*qgwHGGsm-DArNkYhUl6Jnw.png" /><figcaption>post C is potentially a great recommendation but was excluded because of the hard filter</figcaption></figure><h3>Which hard filters can we convert into a soft filter?</h3><p>Considering all this, we went back through our different rules, trying to see which of those hard filters could be turned into soft filters. We thought our “Presentation Filter” was a good candidate for that. This filter is a hard rule to remove any posts that have already been presented three times to that same user.</p><figure><img alt="Diagram showing Medium feed presentation tracking. Left side shows user feed with three articles (resumes, travel, reading topics) labeled as post_a, post_b, post_d. Arrows show each presentation increments a counter (+1). Right side shows “Presentation Counter” tracking counts for each user-post pair, connected to recommendation algorithm applying “Presentation Filter” to exclude posts shown 3+ times." src="https://cdn-images-1.medium.com/max/1024/1*Ap1enCqRVv1ofHoXziOq6Q.png" /><figcaption>We keep track of how many times posts were presented to a user in a feed. When we build a new feed we make sure to exclude stories that were already presented several times</figcaption></figure><p>So the idea is that we should remove that hard rule and delegate the decision to the machine learning algorithm. Like your friend who still can’t believe you haven’t watched Friends yet, it may keep recommending the same thing over and over again if it thinks it’s a really good recommendation. Or maybe not, and it will know after a single presentation that it’s time to stop.</p><h3>The experiment</h3><p>To test our theory we performed an A/B/C test. This was particularly simple because our Machine Learning ranking model already has access to the right information (the previous number of presentations). 
So all we needed was to test different values for our hard filter threshold:</p><ul><li>control kept the threshold at 3 previous presentations</li><li>experiment group A got a threshold at 5. More leeway, but keeps a hard limit</li><li>experiment group B got a threshold at 10. That’s almost infinite leeway for the machine learning model. We just kept the hard rule at 10 to prevent any edge case behaviour that could be particularly annoying for the end user.</li></ul><p>Results came in and they were… disappointingly flat.</p><ul><li>users’ average reading time stayed mostly flat</li><li>conversions to paying members also stayed mostly flat</li></ul><p>We did see an increase in the average number of “prior presentations”. That means that we had more “repeat” recommendations, i.e. the model does think that it’s worth it to present the same stories several times to the same user. But that didn’t have a big enough impact on the users to show in the metrics.</p><p>So unfortunately we were not able to conclude anything further than this.</p><p>But that did leave us free to decide based on what felt best from a product perspective. We decided to ship a threshold at 5 presentations of the same story to a reader.</p><p><strong>This means that:</strong></p><ul><li>we still apply a hard filtering rule</li><li>but it does give more leeway to the model (between 0 and 5 presentations, the model is free to decide which stories are worth recommending again to a user). Although it didn’t show up in the metrics in this experiment, future iterations of the model may take advantage of that extra degree of freedom.</li></ul><p>A threshold at 10 is what should give the best results in theory. But we thought it was too high: it doesn’t feel right to let the algorithm recommend the same story to a user more than 5 times. We could argue that it’s a “feature” that we don’t show the same recommendation to a user above that threshold. 
(At some point you need to give up and accept that your friend will never watch Friends and sadly miss out on the best show ever.)</p><figure><img alt="Table comparing presentation filter thresholds with trade-offs. Shows three options: 3 presentations (control) with minimum freedom/annoyance, 5 presentations (experiment A) with average freedom/annoyance that “feels right”, and 10 presentations (experiment B) with maximum freedom but maximum risk of being annoying." src="https://cdn-images-1.medium.com/max/1024/1*kWZm8-pEucWuAVsj4ihHFA.png" /></figure><p>Even though we didn’t get a big win here, the overall reflection on our filtering rules was super interesting and will stick with us. In future work we’ll make sure to have the hard vs soft discussion when we think about implementing new filtering rules.</p><p>A next step might be to re-evaluate our hard filtering on the age of stories. We are still using this on some recommendation surfaces. Although everything is in place for the soft filtering to apply to the age of posts, our model often gets it wrong and recommends old stories that are not relevant anymore. Our readers are (rightfully) very vocal about those out-of-date recommendations and so we need to maintain that hard filtering on certain recommendation surfaces.</p><p>YouTube, for example, is fantastically good at recommending “good old stuff”, sometimes surprising me with videos that are 7 or 10 years old. That gives our team a good objective. It would be amazing to get better at this and be able to dig out all of the amazing stories that were published on Medium over the years.</p><p><strong>Thanks for reading! You can stay tuned for the next installments of this series by following the </strong><a href="https://medium.engineering/"><strong>Medium Eng Blog</strong></a><strong>. 
In </strong><a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-4-ec7136f21acd"><strong>Part 4</strong></a><strong> we’ll explore how we made digests more engaging by diversifying the daily emails.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6fbf1512e6e6" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/engineering-stories-behind-the-medium-daily-digest-algorithm-part-3-6fbf1512e6e6">Engineering stories behind the Medium Daily Digest Algorithm: Part 3</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Taming Post Claps]]></title>
            <link>https://medium.engineering/taming-post-claps-273d97ce1ced?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/273d97ce1ced</guid>
            <category><![CDATA[dynamodb]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[bug-bounty]]></category>
            <dc:creator><![CDATA[Ryan Lattanzi]]></dc:creator>
            <pubDate>Tue, 01 Oct 2024 13:06:01 GMT</pubDate>
            <atom:updated>2024-10-01T13:06:01.443Z</atom:updated>
<content:encoded><![CDATA[<h4>The Two Billion Claps Bug</h4><h3>TL;DR</h3><p>A user was able to exploit a race condition in our backend system to manipulate clap counts on posts. Users are supposed to only be able to clap between 0 and 50 times for a given post, but this hack allowed them to go outside those bounds (both above and below). Our fix leverages <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.ConditionExpressions.html">DynamoDB condition expressions</a> and <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html">strongly consistent reads</a> to block updates on data that have been manipulated after a read, but before a write. Additionally, we implemented an eventually consistent clap rectification solution for those posts that were already affected by this bug.</p><h3>Catching The Bug</h3><p>We became aware of this problem thanks to a user who brought us an extremely detailed writeup of a roundabout way to manipulate claps. We appreciate the help and are grateful that it was flagged to us!</p><h3>Clarifying A Few Things</h3><p>Some of the claims in the user’s report should be addressed to avoid confusion:</p><blockquote>Essentially, Medium’s Partner Program payments directly depend on claps. The more claps you receive, the more money you will make.</blockquote><p>This is actually untrue. <strong>The Partner Program (V4) rewards posts by the number of people who clapped for a post, not the number of claps itself.</strong> Ignoring all other factors, a post with 10 unique clappers at 10 claps each will earn more than a post with 5 unique clappers at 50 claps each. 
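To make the clappers-versus-claps distinction concrete, here is a toy computation (hypothetical users and clap counts, not real earnings math):

```python
# (user, claps) pairs for two hypothetical posts.
post_x = [(f"user{i}", 10) for i in range(10)]  # 10 clappers, 10 claps each
post_y = [(f"user{i}", 50) for i in range(5)]   # 5 clappers, 50 claps each

def unique_clappers(clap_events):
    """Earnings and recommendations key off this, not the raw clap total."""
    return len({user for user, _ in clap_events})

def total_claps(clap_events):
    return sum(claps for _, claps in clap_events)

# post_x has fewer total claps (100 vs 250) but more unique clappers (10 vs 5),
# so all else being equal it earns more.
```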
You can read more about the partner program <a href="https://blog.medium.com/new-partner-program-incentives-focus-on-high-quality-human-writing-7335f8557f6e">here</a>.</p><p>Furthermore, our recommendation algorithms also depend on the <em>number of clappers</em>, not the number of claps.</p><blockquote>Further, imagine not only zeroing the current counts but future counts, indefinitely. Any future likes of that user won’t show up.</blockquote><p>It is true that this hack can make it seem that a post has -200 claps on our backend system. If 3 different users then come and each clap 50 times, the clap count will still be -50 (all negative clap count values show up as 0 in the UI). But, <strong>the 3 users who clapped for the post are not lost, so this post will still earn the same amount of money as if the post displayed all 150 claps. </strong>This is what is meant by “the bug only affects the UI”.</p><h3><strong>How Severe Was This Bug?</strong></h3><p>The user claimed this bug was significant. We don’t <em>quite</em> agree.</p><p>There was some merit to the user’s point that “users might feel less inclined to vote/click, etc. on zero-clapped articles.” It would also be frustrating for the writer to see a discrepancy if the post displayed 0 claps but the stats page showed &gt;1 clappers.</p><p>However, since this bug doesn’t affect post earnings or recommendations, we don’t feel it was as significant as the user claimed.</p><p>We also have to point out that a normal user cannot reproduce this in the UI; it was only accessible to the user via software scripting.</p><h3>What Exactly Was Going On?</h3><p>In our backend system, a few things occur when the clap endpoint is hit:</p><ol><li>We check to see if the user has clapped for the post before. If so, we add the existing clap count to the incoming claps (capped at 50). Otherwise, it’s a clean slate.</li><li>We update the record in the database to be <em>existing_clap_count + incoming_clap_count</em>. 
<strong>This is where the problem lies. </strong>Let’s dive in.</li></ol><h4>Unconditional Writes</h4><p>These types of writes will simply read the existing value in a database, do whatever updates/writes you command, and save the result. But, what happens if two people are trying to update an item at the same time?</p><p>See below, where Alice and Bob are both updating the price of an item. They both read the same price initially, but depending on several factors (some of which, such as network latency, are out of our control), Bob’s write wins only because by chance it occurred after Alice’s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/551/1*H6qcbIQQoOkMe2Zhyf0Wog.png" /><figcaption><a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate">https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate</a></figcaption></figure><p>Now let’s shift gears to clap counts. Let’s say a user has already clapped for a post 30 times, and we get a burst of 10 requests, each with the user adding 20 more claps to the post. Since these arrive at almost the same time, some (could be all, but let’s say 5) requests will read the existing record with 30 claps. For these 5 requests, adding 20 more is completely valid. So, these 5 will each add 20 claps, resulting in a user clapping 130 times for the post!</p><p>The other 5 requests might come after these writes and see 130 claps. But this is an invalid number (&gt;50) so no operation will occur.</p><h3>So, How Can We Fix This?</h3><p>Is there a way to update the existing record if and only if the record contains a value that we expect? In other words, how can we ensure we don’t go over 50 claps if multiple requests are sent concurrently?</p><p>The answer is yes! 
Dynamo’s <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.ConditionExpressions.html">condition expressions</a> for conditional writes achieve exactly what we need to handle concurrency. Going back to Alice and Bob, a conditional write will only occur if price = 10, ridding us of the race condition seen above:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/548/1*eL4IyqoKvut5Iyp_j1iXUQ.png" /><figcaption><a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate">https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate</a></figcaption></figure><p>You can imagine this works fantastically for our clap count incrementing as well. From our example above, we only want to add 20 claps if the existing clap count is the initially read 30 claps. For each request that comes in, we:</p><ol><li>Read the existing clap count from the database. We use <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html">strongly consistent reads</a> to ensure the record is not stale.</li><li>Using a condition expression, only update the record if the clap count is equal to what was read in step 1. This ensures that no other process has manipulated the data in between this request’s read and write operations.</li></ol><p>So, our 5 concurrent requests all read 30 claps from the database. The request that operates the fastest will succeed and set the value to 50. 
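This read-then-conditionally-write flow can be simulated in miniature. Below, a toy in-memory store stands in for DynamoDB, with a lock modeling the per-item atomicity that conditional updates rely on (names and structure are illustrative, not our production code):

```python
import threading

class ClapStore:
    """Toy stand-in for a table supporting conditional writes."""
    def __init__(self, count=0):
        self._count = count
        self._lock = threading.Lock()  # models DynamoDB's per-item atomicity

    def read(self):
        # Stands in for a strongly consistent read.
        with self._lock:
            return self._count

    def conditional_write(self, expected, new):
        # Succeeds only if the stored value still equals `expected`,
        # mirroring a ConditionExpression; otherwise the write is rejected.
        with self._lock:
            if self._count != expected:
                return False
            self._count = new
            return True

def clap(store, incoming):
    existing = store.read()                                          # step 1
    store.conditional_write(existing, min(existing + incoming, 50))  # step 2

store = ClapStore(count=30)  # the user has already clapped 30 times
threads = [threading.Thread(target=clap, args=(store, 20)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.read())  # always 50: the cap holds regardless of interleaving
```

Any request whose expectation is stale simply loses its write, which is exactly the behavior we want here.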
Subsequent requests that attempt to write will fail since the value has been updated to 50, not the expected 30.</p><h3>Cleaning Up Borked Posts</h3><blockquote>“While none of us relish dwelling on past mistakes, sometimes revisiting them is the only way to ensure a smoother future.” — ChatGPT</blockquote><p>Obviously, this came to our attention because there were a few people who had already messed with posts. The conditional writes will prevent these from happening in the future, but what about the past?</p><p>It was necessary to run a backfill script to rectify borked posts. This script simply identified records in our database that contained clap counts outside of the [0, 50] bound and cleaned them up. However, because of the way our pipeline system operates, changes will only be reflected when a new event occurs on a post, such as a read, view, or clap. We estimate that this solution will affect the claps on about 14k users’ stories, either increasing or decreasing (😬) them.</p><h3>Results</h3><p>The results were 🔥 🔥.</p><h4>Catching Conditional Errors</h4><p>We are noticing an uptick in our logs of the error:</p><pre>(ConditionalError): The conditional request failed</pre><p>which means our condition expressions are working!</p><p>Now, this doesn’t mean we are now discovering a bunch of “hackers” who have been flying under our noses… We run on a distributed platform rife with network failures and retries. So it’s highly likely that most of these are innocent errors caused by said network failures. 
Nonetheless, they are still good to catch because even innocent errors can lead to inaccurate data!</p><h4>Historical Clap Cleanup</h4><p>Here is an extraordinary case that our clap cleanup fixed.</p><blockquote><strong><em>The ol’ 6 views, 8 reads, and 2B+ claps trick…</em></strong></blockquote><p>This is a fun one that has floated around internally…and here is a snapshot of the metrics as of this writing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/400/1*zSlnqp4ziKJRo0Gja2Rxcw.png" /><figcaption>Before…….is this real life??</figcaption></figure><p>Unfortunately (fortunately…?) we’ve uncovered the truth:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/324/1*9dXxLsqECjXBcryoJr8vJg.png" /><figcaption>After….slight decrease in claps</figcaption></figure><h3>Looking Forward</h3><p>As always, products evolve. This bug has caught our eye and motivated us to reconsider “What should claps <em>be</em>?”</p><p>Do not fear if this solution updates your post clap counts! Your partner program earnings will not be affected! But hopefully, you will appreciate the dedication to data quality as much as we do.</p><p>Happy writing!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=273d97ce1ced" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/taming-post-claps-273d97ce1ced">Taming Post Claps</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learnings from optimising 22 of our most expensive Snowflake pipelines]]></title>
            <link>https://medium.engineering/learnings-from-optimising-22-of-our-most-expensive-snowflake-pipelines-5ea6fcf57356?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/5ea6fcf57356</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[cloud-hosting-costs]]></category>
            <category><![CDATA[optimisation]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Raphael Montaud]]></dc:creator>
            <pubDate>Mon, 30 Sep 2024 10:06:01 GMT</pubDate>
            <atom:updated>2024-09-30T10:06:01.017Z</atom:updated>
            <content:encoded><![CDATA[<p>We recently spent a sprint focused on reducing our Snowflake costs. During this sprint, we investigated 22 of our most expensive pipelines (in terms of Snowflake costs), one by one. In total we merged 56 changes, and in this post we’ll be laying out the optimisations that worked best for us.</p><p>Most of these changes were just common sense and didn’t involve any advanced data engineering techniques. Still, we’re always making mistakes (and that’s okay!) and hopefully this post will help readers avoid a few of the pitfalls we encountered.</p><blockquote>⚠️ Medium is now 14 years old. Our team has inherited a tech stack that has a long history, some flaws and technical debt. Our approach to this problem was pragmatic; we’re not trying to suggest big pipeline re-designs to reach a perfect state, but rather consider our tech stack in its current state and figure out the best options to cut costs quickly. Our product evolves, and as we create new things we can remove some old ones, which is why we don’t need to spend too much time re-factoring old pipelines. We know we’ll rebuild those from scratch at some point with new requirements, better designs and more consideration for costs and scale.</blockquote><h3>Do we need this?</h3><p>In a legacy system, there are often old services that just “sound deprecated.” For example, we have a pipeline called medium_opportunities, which I had never heard of in 3 years at Medium. After all, it was last modified in 2020… For each of those suspect pipelines we went through <strong>a few questions:</strong></p><ul><li><strong>Do we need this at all??</strong> Through our investigation, we did find a few pipelines that were costing us more than $1k/month and that were used by… nothing.</li><li>A lot of our Snowflake pipelines will simply run a Snowflake query and overwrite a Snowflake table with the results. 
For those, the question is: <strong>Do we need all of the columns?</strong> For pipelines we cannot delete, we identified the columns that were never used by downstream services and started removing them. In some cases, this removed the most expensive bottlenecks and cut costs massively.</li><li>If it turns out the expensive part of your pipeline is needed for some feature, you should question <strong>whether that feature is really worth the cost, or whether you can trade off some cost by downgrading the feature without hurting it too much</strong>. (Of course, there are situations where it’s just an expensive and necessary feature…)</li><li><strong>Is the pipeline schedule aligned with our needs?</strong> In our investigation we were able to save a bunch of money just by moving some pipelines from running hourly to daily.</li></ul><h4>An example:</h4><p>A common workflow among our pipelines involves computing analytics data in Snowflake and exporting it to transactional SQL databases on a schedule. One such pipeline was running on a daily schedule to support a feature of our internal admin tool. Specifically, it gave some statistics on <strong><em>every</em></strong> user’s reading interests (which we sometimes use when users complain about their recommendations).</p><p>It turns out this was quite wasteful since this feature wasn’t used daily by the small team who relies on it (maybe a couple of times per week). So, we figured <strong>we could do away with the pipeline and the feature</strong>, and replace it with an on-demand dashboard in our data visualization tool. That way, the data is computed only when needed for a <strong><em>specific</em></strong> user. It might require the end user to wait a few minutes for the data, but it’s massively cheaper because we only pay when somebody triggers a query. 
It’s also less code to maintain and a data viz dashboard is much easier to update and adapt to our team’s needs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z9Zq7xqtOCw4az0wrXup_w.png" /><figcaption>Old vs new architecture for this example</figcaption></figure><h4>To conclude this section, here are a few takeaways that I think you can activate right away at your company:</h4><ul><li><strong>Make sure your analytics tool has a way to sync with GitHub</strong>. Our data scientist <a href="https://medium.com/u/d5c3ddbb2482">gustavo</a> set that up for us with Mode and it has been massively helpful for quickly identifying whether tables are used in our data visualisations.</li><li><strong>Make sure you document each pipeline</strong>. Just one or two lines can save hours for the engineers who will be looking at this in 4 years like it’s an ancient artifact. I can’t tell you the amount of old code we find every week with zero docs and no description or comments in the initial PR 🤦</li><li><strong>Deprecate things as soon as you can</strong>. If you migrate something, the follow-up PRs to remove the old code and pipelines should be part of the project planning from the start!</li><li><strong>Avoid </strong>select *<strong> statements as much as possible</strong>. Those make it hard to track which columns are still in use and which ones can be removed without downstream effects.</li></ul><h3>Filtering is key</h3><p>By using <a href="https://docs.snowflake.com/en/user-guide/ui-query-profile">Snowflake Query Profile</a> we were able to drill down on each pipeline and find the expensive table scans in our queries. (We’ll publish another blog post about the tools we used for this project later on). Snowflake is extremely efficient at pruning queries and that’s something we had to leverage to keep our costs down. We’ve found many examples where the data was eventually filtered out from the query, but Snowflake was still scanning the entire table. 
So if we have one key piece of advice here, it’s that the <strong>filtering should be very explicit</strong> in order to make it easier for Snowflake to apply the pruning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/575/1*B4oEDmc1-HErmTicYipuMw.png" /><figcaption>Snowflake’s query profile tool</figcaption></figure><h4>Sometimes Snowflake needs a tip</h4><p><strong>Here’s an example:</strong> Let’s say that we want to get the top 30 posts published in the last 30 days that got the most views in the first 7 days after being published. Here’s a simple query that would do this:</p><pre>select post_id, count(*) as n_views<br>from events<br>       join posts using (post_id)<br>-- only look at view events<br>where event_name = &#39;post.clientViewed&#39;<br>  -- only look at views on the first seven days after publication<br>  and events.created_at between to_timestamp(posts.first_published_at, 3) and to_timestamp(posts.first_published_at, 3) + interval &#39;7 days&#39;<br>  -- only look at posts published in the last 30 days<br>  and to_timestamp(posts.first_published_at, 3) &gt; current_timestamp - interval &#39;30 days&#39;<br>group by post_id<br>order by n_views desc<br>limit 30</pre><p>If we look at the query profile we can see that 11% of the partitions from the events table were scanned. That’s more than expected. It seems like Snowflake didn’t figure out that it can filter out all the events that are older than 30 days.</p><p><strong>Let’s see what happens if we help Snowflake a little bit:</strong></p><p>Here I’m adding a mathematically redundant condition: events.created_at &gt; current_timestamp - interval &#39;30 days&#39;. 
Mathematically, we don’t need this condition, because created_at ≥ published_at ≥ current_timestamp - interval &#39;30 days&#39; ⇒ created_at ≥ current_timestamp - interval &#39;30 days&#39;.</p><pre>select post_id, count(*) as n_views<br>from events<br>       join posts using (post_id)<br>-- only look at view events<br>where event_name = &#39;post.clientViewed&#39;<br>  -- only look at views on the first seven days after publication<br>  and events.created_at between to_timestamp(posts.first_published_at, 3) and to_timestamp(posts.first_published_at, 3) + interval &#39;7 days&#39;<br>  -- only look at posts published in the last 30 days<br>  and to_timestamp(posts.first_published_at, 3) &gt; current_timestamp - interval &#39;30 days&#39;<br>  -- mathematically doesn&#39;t change anything<br>  and events.created_at &gt; current_timestamp - interval &#39;30 days&#39;<br>group by post_id<br>order by n_views desc<br>limit 30</pre><p>Still, this helps Snowflake a bunch and we’re now only scanning 0.5% of our massive events table and the overall query is now 5 times faster to run!</p><h4>Simplify your predicates</h4><p>Here’s another example where you can help Snowflake optimise pruning.</p><p>If you have some complex predicates in your filtering rule, Snowflake may have to scan and evaluate all of the rows, although that could be avoided with pruning.</p><p>The following query scans 100% of the partitions in our posts table:</p><pre>select *<br>from posts<br>-- only posts published in the last 7 days<br>-- (That&#39;s an odd way to write it, I know.<br>-- This is to illustrate how predicates can impact performance) <br>where datediff(&#39;hours&#39;, to_timestamp(published_at, 3), current_timestamp - interval &#39;7 days&#39;) &lt; 0</pre><p>If you simplify this just a little bit, Snowflake will be able to understand that partition pruning is possible:</p><pre>select *<br>from posts<br>-- only posts published in the last 7 days<br>where to_timestamp(published_at, 3) &gt; 
current_timestamp - interval &#39;7 days&#39;</pre><p>This query scanned only a single partition when I tested it!</p><p>In practice Snowflake will be able to prune entire partitions as long as you are using simple predicates. If you are comparing columns to results of subqueries, then Snowflake will not be able to perform any pruning (<a href="https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions#query-pruning">cf Snowflake docs</a>, and <a href="https://community.snowflake.com/s/article/Pruning-behavior-with-nondeterministic-predicates-on-clustered-tables">this other post</a> mentioning this). In that case you should store your subquery result in a variable and then use that variable in your predicate.</p><blockquote>💡 An even better version of this is to filter raw fields against constants. That is the best way to ensure that Snowflake will be able to perform optimal pruning in my opinion. This is my take on how this is being optimised under the hood, as I couldn’t find any sources confirming this, so take this with a grain of salt.</blockquote><blockquote>- Suppose we store a field called published_at which is a unix timestamp (e.g. 1466945833883)</blockquote><blockquote>- Snowflake stores min(published_at) and max(published_at) for each micro-partition</blockquote><blockquote>- If you have a predicate on to_timestamp(published_at) (e.g. where to_timestamp(published_at) &gt; current_timestamp() - interval &#39;7 days&#39;) then Snowflake must compute to_timestamp(min(published_at)) and to_timestamp(max(published_at)) for each partition.</blockquote><blockquote>- If, however, you have a predicate comparing the raw published_at value to a constant, then it&#39;s easier for Snowflake to prune partitions. For example, by setting sevenDaysAgoUnixMilliseconds = date_part(epoch_millisecond, current_timestamp() - interval &#39;7 days&#39;) , our filter becomes where published_at &gt; $sevenDaysAgoUnixMilliseconds. 
This requires no computation from Snowflake on the partition metadata.</blockquote><blockquote>In the more general case, Snowflake can only eliminate partitions if it knows that the transformation f you are applying to your raw field is monotonic (published_at &gt; x ⇒ f(published_at) &gt; f(x) holds only if f is strictly increasing). It’s not always obvious which functions are monotonic. For instance, to_timestamp and startswith are increasing functions. ilike and between are non-monotonic a priori.</blockquote><h3>Work with time windows</h3><p>Let’s say we are computing some stats on writers. We’ll scan some tables to get the total number of views, claps and highlights for each writer.</p><p>With the current state of a lot of our workflows, if we want to look at all-time stats, we must scan the entire table on every pipeline run (that’s something we need to work on but that’s out of scope here). If our platform’s usage increases linearly, our views, claps and highlights tables will keep growing faster than linearly, causing our costs to grow faster than linearly as well, because we scan more and more data every time the pipeline executes. Theoretically, these costs would eventually surpass the revenue generated by a linearly growing user base.</p><p>We must move away from these ever-growing full-table queries because they are highly inefficient and incur a lot of waste at scale. We can do this by migrating to queries based on sliding time windows. 
If we look at engagement received by writers only over the last 3 months, then our costs will grow linearly with our platform’s usage, which is much more acceptable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KYqkvHy-YDydLCJy4U0t_Q.png" /></figure><p><strong>But this can have some product implications:</strong></p><ul><li><strong>In our recommendations system</strong>: when looking for top writers to recommend to a user, this new guideline could potentially miss out on writers who are now inactive but were very successful in the past, since we’ll be filtering for stats only for the past few months. But it turns out this is aligned with what we prefer for our recommendations; we would rather encourage users to follow writers who are still actively writing and getting engagement on their posts.</li><li><strong>In the features we implement</strong>: we used to have a “Top posts of all time” feed for each Medium tag. We have since removed this feature for unrelated reasons. In the future, I think that we would advise against features like this and prefer a time window approach (“Top posts this month”).</li><li><strong>In the stats we compute and display to our users:</strong> with this new guideline we may have weaker guarantees on some statistics. For example: there’s a pipeline where we look at recent engagement on Newsletter emails. For each new engagement we record, we look up the newsletter in our sent_emails table. Previously, we would scan the entirety of that massive table to retrieve engagements for all newsletters. But, for cost’s sake, we now only look back at engagements for emails sent in the past 60 days. This means that engagement received on a Newsletter more than 60 days after it was sent will not be taken into account on the Newsletter stats page. This has negligible impact on the stats (&lt;1% change) but we wanted to be transparent about that with our writers. 
We added a disclaimer at the top of the newsletter page.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iOwjYc1wv0PQh6wgA0Av7Q.png" /><figcaption>The top of the Newsletter Stats page</figcaption></figure><p>Thanks to this small disclaimer we were able to cut costs by $1,500/month on this pipeline.</p><h3>Factorise the expensive logic:</h3><p>Modularity is a cornerstone of all things software, but sometimes it can get a bit murky when applied to datasets. Theoretically, it’s easy. In practice, we find that duplicate code in an existing legacy data model doesn’t necessarily live side by side — and it’s not always just a matter of code refactoring; it may require building intermediate tables and dealing with constraints on pipeline schedules.</p><p>However, we were able to identify some common logic and modularize these datasets by dedicating some time to dive deep into our pipelines. Even if it doesn’t seem feasible, slowly working through similar pipelines and documenting their logic is a good place to start. We would highly recommend putting an effort into this — it can really cut down compute costs.</p><h3>Play around with the Warehouses:</h3><p>Snowflake provides many different warehouse sizes. Our pipelines can be configured to use warehouses from size XS to XL. Each size is twice as powerful as the previous one, but also twice as expensive per minute. If a query is perfectly parallelisable, it should run twice as fast on the next size up and therefore cost the same.</p><p>That’s not the case for most queries though, and we’ve saved thousands by playing around with warehouse sizes. In many cases, we’ve found that down-scaling reduced our costs by a good factor. Of course, we need to accept that the query may take longer to run.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WbsQ2sbu2DWWT-x3RGHHxQ.png" /><figcaption>With perfect parallelisation the query is faster as you increase power, so the costs are constant. 
With imperfect parallelisation the gains on the query time are more and more marginal as you increase power, so your costs (= time × power) increase</figcaption></figure><h3>What’s next?</h3><p>First off, we’ll be following up with a post laying out the different tools that helped us identify, prioritise and track our Snowflake cost reduction efforts. And we’ll be detailing that so that you can set those up at your company too.</p><h4>New tools, new rules</h4><p>We’ve built some new tools during this sprint and we’ll be using them to monitor cost increases and track down the guilty pipelines.</p><p>We’ll also make sure to enforce all the good practices we’ve outlined in this post, and we’ll keep a link to this post somewhere in our docs for future reference.</p><h4>Wait, we’re underspending now?</h4><p>So apparently we went a bit too hard on those cost reduction efforts and we’re now spending fewer credits than we committed to in our Snowflake contract. Nobody is necessarily <em>complaining</em> about this “issue”… but it’s nice to know we have some wiggle room to experiment with more advanced features that Snowflake has to offer. So, we are going to do just that.</p><p>One area that could use some love is our events system. The current state involves an hourly pipeline to batch load these events into Snowflake. But, we could (and most definitely should) do better than that. <a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview">Snowpipe Streaming</a> offers a great solution for low-latency loading into Snowflake tables, and the <a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-kafka">Snowflake Connector for Kafka</a> is an elegant abstraction that leverages the Streaming API under the hood instead of requiring us to write our own custom Java application. More to come on this in a future blog post!</p><h4>The 20/80 rule</h4><p>I think this applies to this project. 
There are tons of other pipelines we should investigate and we can probably get some marginal savings on each of them. But it will probably take twice as much time for half the outcome… We’ll be evaluating our priorities, but I already know there are other areas of our backend we can focus on that will yield bigger and quicker wins.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*ShgrTkJWTIkW-j_weG8NMQ.png" /><figcaption>A poor attempt at illustrating the 20/80 rule</figcaption></figure><h4>Modularize datasets for re-use</h4><p>Although we put some effort into this already, there is certainly a lot more to do. Currently, all of our production tables live in the PUBLIC schema, whether they are source or derived tables, which doesn’t make discovering data very intuitive. We are exploring applying the <a href="https://medium.com/@valentin.loghin/implementing-medallion-architecture-in-snowflake-4e1539d23c09">Medallion Architecture</a> pattern to our Snowflake environment for better table organization and self-service discovery of existing data. Hopefully this will lay a better foundation for modularity!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5ea6fcf57356" width="1" height="1" alt=""><hr><p><a href="https://medium.engineering/learnings-from-optimising-22-of-our-most-expensive-snowflake-pipelines-5ea6fcf57356">Learnings from optimising 22 of our most expensive Snowflake pipelines</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When I told 4,091 writers they weren’t getting paid]]></title>
            <link>https://medium.engineering/when-i-told-4-091-writers-they-werent-getting-paid-b42b8e55ca43?source=rss----2817475205d3---4</link>
            <guid isPermaLink="false">https://medium.com/p/b42b8e55ca43</guid>
            <category><![CDATA[database]]></category>
            <category><![CDATA[rds]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Jacob Bennett]]></dc:creator>
            <pubDate>Wed, 25 Sep 2024 16:46:59 GMT</pubDate>
            <atom:updated>2024-09-25T16:46:59.540Z</atom:updated>
            <content:encoded><![CDATA[<h4>Subtle database errors and how we recovered</h4><p>On September 5, 2024, our team turned on the <a href="https://blog.medium.com/weve-added-77-countries-to-the-medium-partner-program-827a574fcdf0">new Partner Program payments system</a> in production.</p><p>And we immediately sent an email to every partner saying they weren’t going to get paid. 😨</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BH7ANhQM9ukdmTyfbsbGMQ.png" /></figure><p>This wasn’t a SEV1. But it was a very visible bug on a <em>very</em> sensitive part of our platform. We had dozens of tickets come in and a few passionate posts expressing how incompetent that one engineer is (🙋‍♂)️. We figured out the problem, and it ended up being more subtle than I first thought.</p><h3>Some context on the Partner Program payroll system</h3><p>All of the logic related to “how much money should we send a partner” is scoped to a single user at a time. By the time this runs each month, earnings data has already been calculated on a daily level. The “payroll” work amounts to a simple flow of “get the amount we owe a user, then send it to that user.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ALRXHfC00y3VIbEdaDvASw.png" /></figure><p>We did add one additional piece to this processor that increased the complexity over previous iterations: If a user’s unpaid earnings are less than $10 (USD), don’t create a Pending Transfer. Instead, accrue their balance and notify them that their balance will roll over. 
Once a user has reached the $10 minimum, pay them their entire account balance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qeCvdeoQb5Jssra6KTwkAw.png" /></figure><p>Here’s a simplified snippet from the codebase (the entry point to this script is RunUserPayroll).</p><pre>func (a *Service) RunUserPayroll(ctx context.Context, userID string, r model.TimeRange, batchID string) error {<br>    // Step 1: Aggregate their earnings from last month.<br>    err := a.createPayrollCredit(ctx, userID, r, batchID)<br>    if err != nil {<br>        return fmt.Errorf(&quot;creating payroll credit: %w&quot;, err)<br>    }<br><br>    // Step 2: Pay the user all of their unpaid earnings.<br>    _, err = a.createTransferRequest(ctx, userID)<br>    if err != nil {<br>      return fmt.Errorf(&quot;creating pending transfer: %w&quot;, err)<br>    }<br><br>    return nil<br>}<br><br>func (a *Service) createPayrollCredit(ctx context.Context, userID string, r model.TimeRange, batchID string) error {<br>    // Get the amount the user earned that we haven&#39;t rolled up yet.<br>    credit, err := a.calculatePayrollCredit(ctx, userID, r)<br>    if err != nil {<br>        return fmt.Errorf(&quot;calculating payroll credit: %w&quot;, err)<br>    }<br><br>    // If the user has not earned any money, we don&#39;t need to create a credit, we can exit early<br>    if credit.IsZero() {<br>        return nil<br>    }<br><br>    // Roll up the user&#39;s earnings into a credit<br>    err = a.payroll.CreatePartnerProgramMonthlyCredit(ctx, &amp;model.PartnerProgramMonthlyCredit{<br>        ID:        uuid.New().String(),<br>        UserID:    userID,<br>        Period:    r,<br>        CreatedAt: time.Now(),<br>        Amount:    credit,<br>        Note:      &quot;Partner Program Monthly Credit&quot;,<br>    }, batchID)<br>    if err != nil {<br>        return fmt.Errorf(&quot;creating audit credit: %w&quot;, err)<br>    }<br><br>    return nil<br>}<br><br>func (a *Service) 
createTransferRequest(ctx context.Context, userID string) (*model.Transfer, error) {<br>    // Get the user&#39;s current balance, which will now include the credit from this payroll run<br>    balance, err := a.accountant.GetUserAccountBalance(ctx, userID)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;getting user account balance: %w&quot;, err)<br>    }<br><br>    // If the user&#39;s current balance is above the minimum transferable threshold, we can create<br>    // a pending transfer for the user<br>    meetsThreshold, err := balance.GreaterThanOrEqual(a.config.MinTransferableAmount)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;checking if user balance meets minimum transferable threshold: %w&quot;, err)<br>    }<br>    if !meetsThreshold {<br>        log.Info(ctx, &quot;User balance is below minimum transferable threshold, no transfer created&quot;, log.Tags{&quot;user_id&quot;: userID, &quot;balance&quot;: logAmount(balance)})<br>        err = a.userNotifier.NotifyUserThresholdNotMet(ctx, userID)<br>        if err != nil {<br>            log.Warn(ctx, &quot;Failed to notify user of threshold not met&quot;, log.Tags{&quot;user_id&quot;: userID, &quot;error&quot;: err.Error()})<br>        }<br>        return nil, nil<br>    }<br><br>    // Everything looks good, create the transfer.<br>    transferRequest := transfers.NewTransferRequest(balance, userID)<br>    transfer, err := a.transfers.CreateTransferRequest(ctx, transferRequest)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;creating transfer request: %w&quot;, err)<br>    }<br><br>    return transfer, nil<br>}</pre><p>The error we ran into is already in this code snippet. Have you noticed it yet?</p><h3>“The Incident”</h3><p>We ran the first steps of the payroll system at 11:45am PT. 
As we watched the logs and metrics in Datadog, two things happened.</p><p>First, we started to see <em>a lot</em> of INFO-level logs that said &quot;User balance is below minimum transferable threshold, no transfer created&quot; (you can see the log line in the snippet above). This INFO log by itself is not cause for alarm — if a user doesn’t meet the minimum transferable threshold, this is a valid state.</p><p>While those logs were spiking, we got pinged by Fer from User Services:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QlSD_geNYQrjJE1vIUrA1g.png" /></figure><p>This is an actual problem and a cause for alarm.</p><p>We immediately cancelled the payroll run and dug into what was going on.</p><p>The first thing we noticed was the number of users we “successfully” processed was equal to the number of INFO logs I mentioned earlier. That meant 100% of users were going through the code path for failing to meet the minimum payout threshold.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UTdoONxwIJ_-iy44t8Vr-A.png" /></figure><p>That narrowed the problem area dramatically. We immediately pulled up the code—the same snippet from above—and walked through the logic step-by-step. Another engineer and I had an “aha!” moment at the same time: We are making a database <em>write</em> immediately followed by a database <em>read</em>. This might be a <a href="https://en.wikipedia.org/wiki/Race_condition">race condition</a> on the database.</p><p>The call to a.payroll.CreatePartnerProgramMonthlyCredit writes the last month’s earnings as a credit record (increasing the account balance) to the RDS cluster. The call to a.accountant.GetUserAccountBalance reads from the same table to get all unpaid earnings credits.</p><p>Reading from a database you just wrote to is a safe thing to do most of the time. But when it isn’t, it can be a subtle bug to track down. In our case, the bug came from how we configured our database. 
Most of our production RDS clusters are configured to use at least three <a href="https://aws.amazon.com/rds/features/read-replicas/">read replicas</a>. This architecture allows us to scale reads and writes separately. Latency between data written to the primary node and that same data being replicated to all read replicas is low enough for most applications. In my experience it’s available in a few seconds at most.</p><p><em>As a side note: We didn’t catch this before production because we don’t use read replicas in non-production environments. That will probably change now.</em></p><p>But the Partner Program payroll system didn’t actually want separate reads and writes. We wanted the same data we just wrote to the database immediately (&lt;2ms later). That data hadn’t been propagated from the primary node to the read replicas, so 100% of database queries were returning an empty dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lR3IxcsVm2eaXrs32ph3WQ.png" /></figure><p>The steps that affected this flow were:</p><ol><li>Write the monthly credit to the primary cluster database.</li><li>Read all monthly credits from the read replicas.</li><li>(RDS) Replicate the monthly credit from the primary database to the read replicas.</li></ol><p>The race condition was caused in steps 2 and 3. We were querying for data that hadn’t been replicated yet.</p><h3>So what was the fix?</h3><p>There’s not a one-size-fits-all solution to fix every race condition. 
The fix we implemented here was to first fetch all of the unpaid credits and <em>then</em> create the new unpaid credit for last month’s earnings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ml1ficxGVe9BGHUyqj5BpQ.png" /></figure><p>The RunUserPayroll function now looks like this:</p><pre>func (a *Service) RunUserPayroll(ctx context.Context, userID string, r model.TimeRange, batchID string) error {<br>    // Step 1: Get the user&#39;s current balance.<br>    balance, err := a.accountant.GetUserAccountBalance(ctx, userID)<br>    if err != nil {<br>        return fmt.Errorf(&quot;getting user account balance: %w&quot;, err)<br>    }<br><br>    // Step 2: Aggregate their earnings from last month.<br>    createdCredit, err := a.createPayrollCredit(ctx, userID, r, batchID)<br>    if err != nil {<br>        return fmt.Errorf(&quot;creating payroll credit: %w&quot;, err)<br>    }<br><br>    // Step 3: Add the new credit to their prior balance<br>    // (simplified: with a money type this is an Add call, skipped when no credit was created)<br>    balance = balance + createdCredit<br><br>    // Step 4: Pay the user all of their unpaid earnings.<br>    _, err = a.createTransferRequest(ctx, userID, balance)<br>    if err != nil {<br>      return fmt.Errorf(&quot;creating pending transfer: %w&quot;, err)<br>    }<br><br>    return nil<br>}<br><br>func (a *Service) createPayrollCredit(ctx context.Context, userID string, r model.TimeRange, batchID string) (*money.Money, error) {<br>    // Get the amount the user earned that we haven&#39;t rolled up yet.<br>    credit, err := a.calculatePayrollCredit(ctx, userID, r)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;calculating payroll credit: %w&quot;, err)<br>    }<br><br>    // If the user has not earned any money, we don&#39;t need to create a credit, we can exit early<br>    if credit.IsZero() {<br>        return nil, nil<br>    }<br><br>    // Roll up the user&#39;s earnings into a credit<br>    err = a.payroll.CreatePartnerProgramMonthlyCredit(ctx, 
&amp;model.PartnerProgramMonthlyCredit{<br>        ID:        uuid.New().String(),<br>        UserID:    userID,<br>        Period:    r,<br>        CreatedAt: time.Now(),<br>        Amount:    credit,<br>        Note:      &quot;Partner Program Monthly Credit&quot;,<br>    }, batchID)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;creating audit credit: %w&quot;, err)<br>    }<br><br>    return credit, nil<br>}<br><br>func (a *Service) createTransferRequest(ctx context.Context, userID string, amount *money.Money) (*model.Transfer, error) {<br>    // If the user&#39;s current balance is above the minimum transferable threshold, we can create<br>    // a pending transfer for the user<br>    meetsThreshold, err := amount.GreaterThanOrEqual(a.config.MinTransferableAmount)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;checking if amount meets minimum transferable threshold: %w&quot;, err)<br>    }<br>    if !meetsThreshold {<br>        log.Info(ctx, &quot;Amount is below minimum transferable threshold, no transfer created&quot;, log.Tags{&quot;user_id&quot;: userID, &quot;balance&quot;: logAmount(amount)})<br>        err = a.userNotifier.NotifyUserThresholdNotMet(ctx, userID)<br>        if err != nil {<br>            log.Warn(ctx, &quot;Failed to notify user of threshold not met&quot;, log.Tags{&quot;user_id&quot;: userID, &quot;error&quot;: err.Error()})<br>        }<br>        return nil, nil<br>    }<br><br>    // Everything looks good, create the transfer.<br>    transferRequest := transfers.NewTransferRequest(amount, userID)<br>    transfer, err := a.transfers.CreateTransferRequest(ctx, transferRequest)<br>    if err != nil {<br>        return nil, fmt.Errorf(&quot;creating transfer request: %w&quot;, err)<br>    }<br><br>    return transfer, nil<br>}</pre><p>Race conditions are subtle, and they’re hard to identify without experiencing them firsthand. 
If you have a method of identifying them earlier, please let me know!</p><p>And if you nerd out about fixing race conditions, <a href="https://medium.com/jobs-at-medium/work-at-medium-959d1a85284e">we’re hiring!</a></p><hr><p><a href="https://medium.engineering/when-i-told-4-091-writers-they-werent-getting-paid-b42b8e55ca43">When I told 4,091 writers they weren’t getting paid</a> was originally published in <a href="https://medium.engineering">Medium Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>