Stories by Aryan Khurana on Medium

Our production database was failing under load. Here’s the one-line fix.

Aryan Khurana — Thu, 14 May 2026 17:51:10 GMT

psycopg3 auto-prepares statements by default. Under PgBouncer transaction pooling, that quietly destroys you.

Production was broken for our app Rezzy. Database calls were failing under load, and we had no idea why.

The errors weren’t constant. Light traffic was fine. But once requests stacked up, things started falling apart at the database layer. Queries that had run thousands of times without issue were suddenly throwing errors. We kept looking at our code. Nothing had changed.

That’s the worst kind of bug. The kind where you’re staring at code that looks correct, in an environment that worked yesterday, and the logs are pointing at something you don’t fully understand yet.

What We’re Working With

Rezzy runs on Google Cloud Run. If you haven’t used it, Cloud Run is Google’s managed container platform. It auto-scales based on traffic, which is great for a product with spiky load. More requests come in, more instances spin up. Things quiet down, instances scale back.

The database side of that story gets complicated fast. Every Cloud Run instance maintains its own pool of connections to Postgres. As instances multiply under load, so do connections. Postgres isn’t designed to handle that at scale. Each connection is a process on the server. Memory, file descriptors, overhead. The database starts struggling well before you run out of application capacity.

So we connect through Supabase’s transaction pooler URL, which runs PgBouncer under the hood. PgBouncer sits between your app and Postgres, maintaining a small fixed set of real connections and handing them out as needed. We run it in transaction pooling mode, which means a client only holds a real Postgres connection for the duration of a single transaction. The moment that transaction commits, the connection goes back in the pool for someone else to use.

This is an extremely efficient setup. Ten thousand clients can share twenty Postgres connections. For Cloud Run, it’s the right call.

It’s also what made the bug so hard to find.

Our ORM is SQLModel, which sits on top of SQLAlchemy and uses psycopg3 as the actual Postgres driver. That chain matters. Each layer has its own behavior, and one of them was working against us without us knowing.

How a Query Actually Travels to Postgres

Before getting into what broke, it helps to understand the full journey a query takes. Most engineers have a rough mental model of this, but the details matter here.

When your code calls something like session.execute(query), here's the real sequence:

Your ORM hands the SQL string down to psycopg3. psycopg3 either opens a network connection to the database or grabs an existing one from its pool. It then sends the query over that connection using Postgres’s wire protocol. Postgres receives the bytes, parses the SQL into a syntax tree, figures out an execution plan (which indexes to scan, how to order joins, etc.), and runs it. Results come back over the wire and get decoded into Python objects.

That parse and plan step happens every single time. For a query that runs hundreds of times per second, Postgres is doing the same work over and over on the same SQL string.

Prepared statements exist to short-circuit that. You send Postgres a query once, it parses and plans it, caches the result under a name, and from then on you just send the name plus parameter values. Postgres skips straight to execution.

In raw SQL it looks like this:

PREPARE my_query (int) AS SELECT CAST($1 AS INTEGER);

EXECUTE my_query(1);
EXECUTE my_query(2);
EXECUTE my_query(3);

DEALLOCATE my_query;

Good optimization. Less repeated work. Faster queries at volume.

Here’s the thing: psycopg3 does this automatically. You don’t write any of that PREPARE code yourself. The driver watches how many times a query runs on a connection, and after 5 executions it prepares the statement silently, in the background. The next execution and every one after that uses the cached plan.

That threshold is controlled by a setting called prepare_threshold. The default is 5. Most people never touch it. We hadn't.

Why This Explodes With Transaction Pooling

Prepared statements are connection-local. When psycopg3 sends PREPARE _pg3_0 AS ... on connection A, that statement lives only on connection A, in the memory of that specific Postgres backend process. Connection B has never heard of it.

In session pooling mode, this doesn’t matter. Your client holds the same Postgres connection for its entire session, so prepared statements accumulate and stay available.

In transaction pooling mode, your connection changes after every transaction. PgBouncer is actively shuffling real Postgres connections between clients to maximize utilization. You might get connection A for one transaction and connection C for the next.

Here’s the exact sequence that was killing us:

Transaction 1 borrows Postgres connection A from the pool. psycopg3 runs the query. It’s the 5th execution, so the driver automatically prepares it: PREPARE _pg3_0 AS .... Transaction commits. Connection A goes back into the pool.

Some other client picks up connection A. Does their thing. Returns it.

Transaction 2 starts. It borrows connection B. psycopg3 still has _pg3_0 in its local cache and sends EXECUTE _pg3_0 to connection B. Connection B has never seen _pg3_0, so Postgres rejects it with an error along the lines of: prepared statement "_pg3_0" does not exist.

There’s a second failure mode that’s even more confusing. Two different clients can each try to prepare the same statement name on the same connection:

Transaction 1 from client 1 borrows connection A. psycopg3 prepares _pg3_0. Returns connection A to the pool. Transaction 1 from client 2 then picks up connection A. Its psycopg3 instance also decides it's time to prepare its first query, also naming it _pg3_0. Postgres responds: prepared statement "_pg3_0" already exists. psycopg surfaces that as a DuplicatePreparedStatement error.

Both failure modes have the same root cause: the driver tracks prepared statement state on a connection it doesn’t actually own persistently. From psycopg3’s perspective, it has a connection and a cache. It has no visibility into PgBouncer shuffling that connection to a dozen other clients between transactions.

Under light traffic, you rarely hit this. The pool might consistently hand you the same connection by chance. Under real load, with concurrent requests and PgBouncer actively multiplexing, the collision rate climbs and the errors stack up fast. Which is exactly what we saw. Fine in development, fine under low traffic, broken in production under load.
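To make both failure modes concrete without a database, here is a toy simulation. It is a sketch, not psycopg3's or PgBouncer's real logic: Backend, Driver, and simulate are made-up names, the round-robin assignment is a deterministic stand-in for PgBouncer's shuffling, and the only behaviors modeled are "prepared statements live on the backend" and "the driver remembers names it thinks it prepared."

```python
import itertools
from typing import Dict, Optional


class Backend:
    """Stand-in for one real Postgres connection; prepared statements live here."""

    def __init__(self) -> None:
        self.prepared: Dict[str, str] = {}

    def prepare(self, name: str, sql: str) -> None:
        if name in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" already exists')
        self.prepared[name] = sql

    def execute_prepared(self, name: str) -> None:
        if name not in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" does not exist')


class Driver:
    """Toy client-side driver that auto-prepares after N executions."""

    def __init__(self, prepare_threshold: Optional[int]) -> None:
        self.prepare_threshold = prepare_threshold
        self.counts: Dict[str, int] = {}
        self.names: Dict[str, str] = {}  # sql -> name the driver believes exists
        self._ids = itertools.count()

    def run(self, backend: Backend, sql: str) -> None:
        if self.prepare_threshold is None:
            return  # plain query: nothing is remembered between transactions
        if sql in self.names:
            backend.execute_prepared(self.names[sql])
            return
        self.counts[sql] = self.counts.get(sql, 0) + 1
        if self.counts[sql] >= self.prepare_threshold:
            name = f"_pg3_{next(self._ids)}"
            backend.prepare(name, sql)
            self.names[sql] = name


def simulate(prepare_threshold: Optional[int], clients: int = 3,
             backends: int = 2, transactions: int = 60) -> int:
    """Return how many transactions failed with a prepared-statement error."""
    pool = [Backend() for _ in range(backends)]
    drivers = [Driver(prepare_threshold) for _ in range(clients)]
    errors = 0
    for i in range(transactions):
        driver = drivers[i % clients]
        backend = pool[i % backends]  # a different backend on each transaction
        try:
            driver.run(backend, "SELECT 1")
        except RuntimeError:
            errors += 1
    return errors


print("errors with prepare_threshold=5:   ", simulate(5))
print("errors with prepare_threshold=None:", simulate(None))
```

With the threshold set, the toy clients start tripping over each other's statement names as soon as they reach their fifth execution. With None there is no state to get out of sync, so the error count stays at zero.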

Verifying It

Before touching anything in production, we wanted to confirm what was actually happening. We wrote a small script that runs the same query 10 times on one connection, then queries pg_prepared_statements directly to see if psycopg3 had prepared anything:

import asyncio
from sqlalchemy import text
from src.database.postgres.postgres_client import PostgresClient

async def main() -> None:
    client = PostgresClient()
    try:
        async with client.engine.connect() as conn:
            for _ in range(10):
                await conn.execute(
                    text("SELECT CAST(:value AS INTEGER)"),
                    {"value": 1},
                )
            result = await conn.execute(
                text(
                    """
                    SELECT name, statement
                    FROM pg_prepared_statements
                    WHERE name LIKE '_pg3_%'
                    ORDER BY name
                    """
                )
            )
            rows = result.fetchall()
        print(f"Prepared statements found: {len(rows)}")
        for name, statement in rows:
            print(f"{name}: {statement}")
        assert not rows, "psycopg prepared statements are still being created"
    finally:
        await client.engine.dispose()


if __name__ == "__main__":
    asyncio.run(main())

Running this before the fix: prepared statements showing up in the results, _pg3_0 and friends, exactly as expected. The driver was preparing statements we never asked it to prepare.

The Fix

One line in the engine configuration:

return create_async_engine(
    self.database_url,
    connect_args={
        "prepare_threshold": None
    },
)

Setting prepare_threshold to None tells psycopg3 to never automatically prepare any statement. Every query goes out as full SQL every time. The driver stays stateless across connection boundaries, which is exactly what you need when those boundaries are being managed by a pooler.

Run the verification script again after the fix: zero prepared statements. The assert passes.

We deployed it. The errors stopped.

What You’re Giving Up

Disabling prepared statements isn’t free. You lose the parse and plan cache, so Postgres does that work on every query. For most web application queries, this overhead is small relative to actual execution time and network latency. Postgres’s planner is fast. The tradeoff is worth it.

There is an alternative path. PgBouncer 1.21 added experimental support for proxying prepared statements in transaction mode (controlled by its max_prepared_statements setting), so the pooler itself intercepts and manages them on behalf of clients. It works in some setups, but there are still edge cases around DEALLOCATE behavior with psycopg3 that can bite you. Disabling auto-preparation is the simpler and more reliable fix if you're on transaction pooling.

The other option is switching to session pooling mode, which gives each client a persistent Postgres connection and makes prepared statements work correctly. The cost is that session mode is far less efficient for connection multiplexing. Depending on your traffic patterns and connection budget, that may or may not be acceptable.

Key Takeaways

psycopg3 prepares statements automatically after 5 executions by default. This is an optimization, but it assumes the driver has a persistent connection underneath it.
PgBouncer’s transaction pooling assigns a different Postgres connection per transaction. Prepared statements prepared on one connection don’t exist on the next one your driver gets handed.
Setting prepare_threshold=None in your psycopg3 connect args disables auto-preparation entirely and keeps the driver stateless, which is what you need under a transaction pooler.
This bug hides under low traffic and surfaces under real load. If your database errors appear only when things get busy and the queries themselves look correct, this is worth checking immediately.
Querying pg_prepared_statements directly is the fastest way to confirm whether the fix actually worked. Write the test before and after.

Our production database was failing under load. Here’s the one-line fix. was originally published in Beyond Localhost on Medium.

How Supabase Actually Signs Your JWTs, and Why It Matters

Aryan Khurana — Wed, 13 May 2026 00:06:01 GMT

A breakdown of HS256 vs ES256, JWKS, and what you are really trusting when you accept a JWT.

If you are building with Supabase, you are using JWTs for auth. They show up in your cookies, your authorization headers, your RLS policies. But most people have a pretty fuzzy mental model of what is actually happening under the hood. In this blog we will walk through all of it.

Signing is not encryption

Before anything else, this distinction needs to be clear because it trips a lot of people up.

Encryption scrambles data so nobody can read it without a key. Signing leaves the data completely readable but attaches a tamper-proof seal. JWTs use signing, not encryption.

You can decode the payload of any JWT right now without any key at all:

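# JWT segments are base64url with the trailing "=" padding stripped; one "=" is added back here so base64 accepts the input
echo "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWUsImlhdCI6MTUxNjIzOTAyMn0=" | base64 --decode
# {"sub":"1234567890","name":"John Doe","admin":true,"iat":1516239022}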

The payload is not a secret. What the signature protects is the integrity of that payload. It lets your server say “this was created by whoever holds the signing key, and nothing has been changed since.” That is what you are verifying on every authenticated request.
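The same decoding in Python, as a small sketch using only the standard library. The helper name decode_segment is mine. The only wrinkle is that JWT segments are base64url with the padding stripped, so it has to be added back before decoding.

```python
import base64
import json


def decode_segment(segment: str) -> dict:
    """Decode one base64url JWT segment (header or payload) into a dict."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


payload_segment = (
    "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwi"
    "YWRtaW4iOnRydWUsImlhdCI6MTUxNjIzOTAyMn0"
)
claims = decode_segment(payload_segment)
print(claims["name"], claims["admin"])
```

No key appears anywhere in that code, which is the whole point: reading a JWT requires nothing, trusting it requires checking the signature.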

If you actually need confidential claims in a JWT, you want JWE (JSON Web Encryption). That is a different standard entirely, and Supabase does not use it by default.

The symmetric approach: HS256

HS256 stands for HMAC using SHA-256.

SHA-256 is a hash function. Feed it any input, you get back a fixed 256-bit fingerprint. Same input always produces the same output, and you cannot reverse it. HMAC takes SHA-256 and mixes in a secret key, so the signature is essentially HMAC-SHA256(secret, base64url(header) + "." + base64url(payload)), computed over the encoded segments exactly as they appear in the token.

To verify a JWT, the server recomputes that value using its own copy of the secret and checks whether it matches the signature in the token. Match means the token is valid. No match means 401.

This is called symmetric because the same key does both jobs. Creating a signature requires the secret. Verifying one also requires the secret. Both sides have to hold the exact same key.
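Here is roughly what that looks like with nothing but Python's standard library. Treat it as a sketch to show the symmetry, not something to ship: the function names are mine, and a real verifier (use a maintained JWT library) also has to check the alg header, exp, and other claims.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    """Signature check only; claim validation is deliberately left out."""
    try:
        header_b64, payload_b64, signature_b64 = token.encode("ascii").split(b".")
    except ValueError:  # wrong number of segments, or non-ASCII input
        return False
    expected = hmac.new(secret, header_b64 + b"." + payload_b64, hashlib.sha256).digest()
    # Constant-time comparison, so timing does not leak how much matched.
    return hmac.compare_digest(b64url(expected).encode("ascii"), signature_b64)


secret = b"demo-secret-not-a-real-one"
token = sign_hs256({"sub": "user-1", "role": "authenticated"}, secret)
print(verify_hs256(token, secret))           # same secret verifies
print(verify_hs256(token, b"wrong-secret"))  # a different secret does not
```

Notice that sign_hs256 and verify_hs256 take the same secret argument. Anything that can call one can call the other.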

In a Supabase project, this means you copy the JWT secret from your dashboard and paste it into your environment. Now it lives in at least two places: Supabase’s auth server and your own infrastructure. Every service that needs to verify tokens needs a copy.

This works, and a lot of production apps run on it just fine. But the failure mode is worth thinking about. If that secret leaks:

An attacker can forge JWTs for any user, including ones that do not exist
They can craft tokens with role: service_role and bypass all your row-level security
You have to rotate the secret immediately, which means updating every environment that has a copy and invalidating every active user session at the same time

Symmetric is not inherently insecure. The issue is that the more places the secret lives, the more surface area you have to defend. And when it fails, it fails completely.

The asymmetric approach: ES256

ES256 uses elliptic curve cryptography instead of a shared secret. Rather than one key that both sides hold, you have two mathematically linked keys.

The private key lives on Supabase’s auth server. It never leaves. You cannot download it, and you never see it as a developer. It is used to create signatures.

The public key is meant to be shared. It can only verify signatures, not create them. The math linking the two keys makes this strictly one-directional: having the public key gives you no ability to produce a valid signature.

This is the meaningful shift from HS256. In the symmetric model, anything that can verify a token holds enough information to forge one. In the asymmetric model, verification and signing are decoupled. Your server can verify tokens without touching anything that could be used to mint a fake one.

Supabase uses ES256 specifically, which is elliptic curve DSA on the P-256 curve. Compared to RSA, elliptic curve gives you equivalent security with smaller keys and faster operations. That matters because JWT verification is happening on every single authenticated request.

One thing worth knowing: Supabase historically defaulted to HS256 and added asymmetric JWT support more recently. Which algorithm your project uses depends on when it was created and your project settings. Check your dashboard if you are not sure.

Since anyone doing verification needs the public key, Supabase exposes it at a standard endpoint called a JWKS (JSON Web Key Set):

https://your-project.supabase.co/auth/v1/.well-known/jwks.json

The response is a JSON object containing an array of public keys. Each entry looks something like this:

{
  "kty": "EC",
  "crv": "P-256",
  "x": "...",
  "y": "...",
  "kid": "abc123",
  "alg": "ES256"
}

kty and crv tell you it is an elliptic curve key on the P-256 curve. x and y are the actual public key coordinates. kid is a short identifier for this specific key.

Every JWT Supabase issues includes a kid field in its header (the first segment, not the payload):

{ "alg": "ES256", "typ": "JWT", "kid": "abc123" }

Verification works like this: read the kid from the header, find the matching key in the JWKS, verify the signature using that key. If the kid is not in your cache, refetch the JWKS endpoint. This is also how key rotation works without breaking existing tokens. If the key still is not there after a refetch, the token is invalid.
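The lookup half of that flow can be sketched in a few lines. Assumptions to flag: JwksCache and read_header are names I made up, fetch_jwks is whatever function you use to GET the JWKS URL and parse the JSON, and the actual ES256 signature check is left to a JWT library because the standard library has no ECDSA.

```python
import base64
import json
from typing import Callable, Dict, Optional


def read_header(token: str) -> dict:
    """Decode the first JWT segment (the header) without verifying anything."""
    segment = token.split(".")[0]
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


class JwksCache:
    """Caches public keys by kid and refetches once when a kid is unknown."""

    def __init__(self, fetch_jwks: Callable[[], dict]) -> None:
        self._fetch_jwks = fetch_jwks      # hypothetical: returns {"keys": [...]}
        self._keys: Dict[str, dict] = {}

    def _refresh(self) -> None:
        self._keys = {key["kid"]: key for key in self._fetch_jwks().get("keys", [])}

    def key_for(self, token: str) -> Optional[dict]:
        kid = read_header(token).get("kid")
        if kid not in self._keys:
            self._refresh()                # covers first use and key rotation
        return self._keys.get(kid)         # None means: reject the token
```

A production version would also want to limit how often an unknown kid can trigger a refetch, otherwise anyone sending garbage tokens can make your server hammer the JWKS endpoint.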

Who actually does the verifying

This is something that confuses people coming from HS256.

In the HS256 world, verification can happen on Supabase’s servers because they hold the shared secret. In the ES256 world, you verify tokens yourself using the public key. That is kind of the whole point of going asymmetric.

Your server fetches the JWKS on startup, caches the public keys in memory, and on every authenticated request it reads the kid from the incoming JWT, looks up the right key, and runs the ES256 verification locally. No round-trip to Supabase, just math. The only time it hits the JWKS endpoint again is when a key rotates and a kid shows up that is not in the cache yet.

The upside is that even if someone somehow got your public key, they gain nothing. The only thing that could let someone forge tokens is the private key, which never left Supabase’s auth server.

Key takeaways

JWTs are signed, not encrypted. The payload is readable by anyone. Do not put secrets in it.
HS256 uses one shared secret for both signing and verification. Any party that can verify a token holds enough information to forge one, and a leaked secret means rotating everything immediately.
ES256 splits signing and verification into separate keys. Your application only ever holds the public key, which cannot be used to forge tokens.
JWKS is how public keys get distributed. Your server fetches and caches the JWKS, and verification happens locally on every request.
In the asymmetric model, you are doing the JWT verification yourself, not Supabase.
Check which algorithm your project is actually using. The default has changed over time, and older projects may still be on HS256.

How Supabase Actually Signs Your JWTs, and Why It Matters was originally published in Beyond Localhost on Medium.

What a Hackathon Workshop Taught Us About Async Database Architecture in FastAPI

Aryan Khurana — Sat, 02 May 2026 11:07:06 GMT

How we diagnosed and fixed connection pool exhaustion in a production FastAPI + SQLAlchemy stack.

First, a quick win: as of writing this, Rezzy just hit 1,000 users. 🎉

Now let me tell you about the time we almost tanked the whole thing in front of a room full of developers.

We recently shipped a feature that’s essentially Cursor for resumes. You chat with an AI agent, ask it to rewrite sections, and it understands you. It pulls from your profile, remembers context across sessions, and writes changes back in real time.

This feature talks to our Postgres database a lot. Every message gets stored, profile data gets fetched, updates get written back. It worked perfectly in local testing and staging. We shipped it.

Then we sponsored a hackathon.

The Incident

At the hackathon we ran a hands-on workshop where we gave attendees free access to the product and walked them through it live. The whole point was to get a room full of developers actually using Rezzy in real time.

And that’s exactly what they did.

Mid-workshop, everything stopped working. Every button. Every API call. Pure timeouts across the board.

We were fairly confident it was connection pool exhaustion but we had no time to investigate properly mid-workshop. So we did the only thing you can do in that moment: redeployed the backend, restarted our Cloud Run service, and watched everything come back online.

Until it broke again three hours later.

We kept redeploying as a band-aid until we actually sat down and fixed the root cause. This post is that story.

Connection Pools: A Quick Primer

Before getting into what went wrong, let me make sure we’re on the same page about how connection pools work because the rest of this post depends on it.

Opening a raw TCP connection to Postgres is expensive. You don’t want to do it fresh on every API request. So SQLAlchemy maintains a pool of pre-opened connections that your app borrows and returns.

Two numbers control pool capacity:

pool_size is the number of connections kept open and idle, ready to use at any time.
max_overflow is how many extra connections can be opened on top of that during a burst of traffic.

Together they define the maximum number of simultaneous Postgres connections your app can have per instance. Once you hit that ceiling, new requests have to wait. If nothing returns a connection before pool_timeout fires, the request fails.

A connection is either:

Checked in meaning it’s idle in the pool and available.
Checked out meaning it’s actively being used by a request.

The pool starts empty and warms up lazily as requests come in. That’s the model. When it works well you barely think about it. When it breaks, everything breaks at once.
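If the ceiling-plus-timeout behavior feels abstract, this toy model shows it with plain asyncio. It is not SQLAlchemy's pool: ToyPool is a made-up class that only borrows the three parameter names, and the numbers are tiny so it runs in a fraction of a second.

```python
import asyncio
from typing import List


class ToyPool:
    """A semaphore standing in for a connection pool with a hard ceiling."""

    def __init__(self, pool_size: int, max_overflow: int, pool_timeout: float) -> None:
        self._slots = asyncio.Semaphore(pool_size + max_overflow)
        self._timeout = pool_timeout

    async def run(self, hold_seconds: float) -> str:
        try:
            # Wait for a free connection, but only up to pool_timeout.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._timeout)
        except asyncio.TimeoutError:
            return "timeout"
        try:
            await asyncio.sleep(hold_seconds)  # the request holding its connection
            return "ok"
        finally:
            self._slots.release()              # check the connection back in


async def main() -> List[str]:
    pool = ToyPool(pool_size=2, max_overflow=1, pool_timeout=0.05)
    # Five concurrent requests, each holding a connection longer than pool_timeout.
    return await asyncio.gather(*(pool.run(0.2) for _ in range(5)))


results = asyncio.run(main())
print(results)
```

Three requests fit under the ceiling of 2 + 1. The other two wait, nothing comes back in time, and they fail. Scale the numbers up and that is the workshop outage.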

What Was Actually Wrong

The logs confirmed pool exhaustion. But the real question was why connections weren’t being returned. That came down to four compounding problems.

Problem 1: Sync Database Access Inside an Async App

Our backend is FastAPI, which is fully async. But the database layer wasn’t. We were using a synchronous SQLAlchemy engine with sync Session objects inside async request handlers.

# Sync engine in an async app
return create_engine(self.database_url)

# Sync session dependency
def get_db() -> Generator[Session, None, None]:
    yield from postgres_client.get_session()

When a sync database call runs directly inside an async def handler, it blocks the entire event loop while it waits on I/O. No other coroutines can run. Requests pile up. Connections stay checked out longer than they should. Under concurrent load this compounds really fast.
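You can watch this happen with a heartbeat task. In this sketch time.sleep stands in for a sync database call and asyncio.sleep for an awaited one. All the function names are mine, and the durations are arbitrary.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def heartbeat(ticks: List[float]) -> None:
    """Records a timestamp every 10ms, whenever the event loop lets it run."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def blocking_handler() -> None:
    time.sleep(0.2)           # stand-in for a sync DB call: the loop is frozen


async def async_handler() -> None:
    await asyncio.sleep(0.2)  # stand-in for an awaited DB call: the loop is free


async def longest_stall(handler: Callable[[], Awaitable[None]]) -> float:
    """Run a handler next to the heartbeat and return the longest gap between ticks."""
    ticks: List[float] = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.03)  # let the heartbeat get going
    await handler()
    await asyncio.sleep(0.03)
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


blocked = asyncio.run(longest_stall(blocking_handler))
free = asyncio.run(longest_stall(async_handler))
print(f"longest stall with sync call:  {blocked:.3f}s")
print(f"longest stall with async call: {free:.3f}s")
```

During the sync call the heartbeat misses the whole window. In a real app, that heartbeat is every other request on the instance.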

Problem 2: Blocking I/O in a High-Frequency Middleware Path

We had rate-limiting middleware that ran on every single inbound API request. Inside that middleware there was a synchronous database lookup.

# Sync DB call inside async middleware
with Session(self._engine) as session:
    row = session.get(Subscription, uid)

Middleware is a multiplier. Inefficiencies here hit every request, not just specific endpoints. A blocking DB call in middleware under concurrent traffic is a really reliable way to starve your event loop.

Problem 3 (The biggest culprit): Long-Lived Sessions Across AI Workflows

The AI agent feature was holding a single database session open for the entire duration of an AI workflow. That workflow includes model inference, tool calls, and streaming responses which can easily run 10 to 30 seconds end to end.

The thing is you don’t need a database connection for any of that. You need it for the roughly 50ms you’re actually reading or writing data. Holding a checked-out connection while waiting on a model response is pure waste and under concurrent usage that waste adds up quickly.

Problem 4: Multiple Engines, Multiple Pools

Different parts of the app were each instantiating their own SQLAlchemy engines independently.

This is a subtle but important mistake. Every SQLAlchemy engine owns its own connection pool. If your workers and your API handlers each have separate engines, you can blow past Postgres’s connection limit even when each individual pool looks healthy in isolation. You end up with more total connections than you intended and no single place to observe or control them.
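The arithmetic is worth writing down once. The instance and engine counts below are made-up illustration numbers, not ours. The pool_size of 5 and max_overflow of 10 are SQLAlchemy's defaults for its standard queue pool, and 100 is Postgres's stock max_connections.

```python
def worst_case_connections(instances: int, engines_per_instance: int,
                           pool_size: int = 5, max_overflow: int = 10) -> int:
    """Upper bound on simultaneous Postgres connections across the whole service."""
    return instances * engines_per_instance * (pool_size + max_overflow)


POSTGRES_MAX_CONNECTIONS = 100  # stock Postgres default

scattered = worst_case_connections(instances=10, engines_per_instance=3)
single = worst_case_connections(instances=10, engines_per_instance=1)

print(f"three engines per instance: up to {scattered} connections")
print(f"one engine per instance:    up to {single} connections")
print(f"server limit:               {POSTGRES_MAX_CONNECTIONS}")
```

Every extra engine multiplies the worst case, and autoscaling multiplies it again. Each pool can look perfectly healthy while the total is several times what the server will accept.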

Reproducing It Locally

Before fixing anything, I needed to be able to reproduce the exhaustion reliably in a local environment. There’s no point guessing at a fix if you can’t verify it actually works.

To simulate the load I used hey, a straightforward HTTP load testing tool.

brew install hey

# Warm up, 20 concurrent requests each holding a connection for 20 seconds
hey -n 20 -c 20 -t 60 "http://localhost:8000/your-slow-endpoint?seconds=20"

# Reproduce exhaustion, enough concurrent load to exceed pool_timeout
hey -n 30 -c 30 -t 90 "http://localhost:8000/your-slow-endpoint?seconds=60"

The -n flag is total requests, -c is concurrent workers, and -t is client-side timeout. The goal with the second command is to have enough concurrent sessions holding connections long enough that the waiting requests hit pool_timeout. That's exactly the production failure mode we were seeing.

With a reliable way to reproduce the error on demand I could now validate fixes instead of shipping and hoping.

The Fix

The fix was not “increase the pool size.” That would have been a band-aid. The real problem was how the app was using the pool. Holding connections too long, blocking the event loop with sync I/O, and scattering pool ownership across the codebase. Here’s what we actually changed.

Fix 1: Go Fully Async, End to End

We replaced the sync engine with an async one using create_async_engine, switched to AsyncSession, and converted the FastAPI dependency from a sync generator to an async one.

# Before
def get_db() -> Generator[Session, None, None]:
    yield from postgres_client.get_session()

# After
async def get_db() -> AsyncGenerator[AsyncSession, None]:
    async for session in postgres_client.get_session():
        yield session

This sounds like a small change but it cascades through the entire codebase. Every repository method, service call, and controller now has to await its database operations. It's a lot of mechanical work but it's non-negotiable if you want your async framework to actually behave like one.

The payoff is that instead of blocking the event loop during database I/O, the app now yields control while the query is in flight. Other requests can make progress and connections come back to the pool faster.

Fix 2: Fix the Middleware Hot Path

# Before, sync DB call inside async middleware
with Session(self._engine) as session:
    row = session.get(Subscription, uid)

# After, non-blocking
async with self._session_factory() as session:
    row = await session.get(Subscription, uid)

This was the highest leverage fix per line of code. Because this middleware ran on every API request, unblocking it here had a bigger impact than fixing it anywhere else in the stack.

Fix 3: Scope Sessions Tightly in AI Workflows

For the AI agent we stopped injecting a long-lived session and instead passed in the session factory so the service opens a connection only when it actually needs one.

@asynccontextmanager
async def _db_context(self) -> AsyncGenerator[SomeService, None]:
    async with self.session_factory() as db:
        yield SomeService(db, SomeRepository(db))

Now a connection is checked out for the duration of a database operation, not for the duration of a model call or streaming response. This is the right mental model for AI-adjacent workflows. Borrow a connection, do the DB work, return it immediately, then go do the slow AI stuff.
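To see how much difference the scoping makes, here is a toy comparison that counts how many connections are checked out at once. Nothing here touches SQLAlchemy: CountingPool is a made-up counter, the sleeps stand in for database work and a model call, and the timings are scaled down so it finishes quickly.

```python
import asyncio
from contextlib import asynccontextmanager


class CountingPool:
    """Tracks how many sessions are open at once; no real connections involved."""

    def __init__(self) -> None:
        self.checked_out = 0
        self.peak = 0

    @asynccontextmanager
    async def session(self):
        self.checked_out += 1
        self.peak = max(self.peak, self.checked_out)
        try:
            yield
        finally:
            self.checked_out -= 1


async def db_work() -> None:
    await asyncio.sleep(0.002)   # stand-in for a quick read or write


async def model_call() -> None:
    await asyncio.sleep(0.1)     # stand-in for slow inference or streaming


async def long_lived(pool: CountingPool) -> None:
    async with pool.session():   # one session held across the whole workflow
        await db_work()
        await model_call()
        await db_work()


async def tightly_scoped(pool: CountingPool) -> None:
    async with pool.session():   # borrow, do the DB work, return
        await db_work()
    await model_call()           # no connection held while waiting on the model
    async with pool.session():
        await db_work()


async def peak_for(workflow, users: int = 20, arrival_gap: float = 0.01) -> int:
    pool = CountingPool()

    async def user(i: int) -> None:
        await asyncio.sleep(i * arrival_gap)  # users arrive one after another
        await workflow(pool)

    await asyncio.gather(*(user(i) for i in range(users)))
    return pool.peak


long_peak = asyncio.run(peak_for(long_lived))
scoped_peak = asyncio.run(peak_for(tightly_scoped))
print(f"peak checked-out, long-lived sessions: {long_peak}")
print(f"peak checked-out, scoped sessions:     {scoped_peak}")
```

Same users, same work, same arrival pattern. The only thing that changes is whether the connection is held during the model call, and that alone decides how many connections the workload needs at its peak.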

Fix 4: One Engine, One Pool

We consolidated all database access across API handlers, middleware, and background workers to use a single shared client instance and its session factory. No more independent engines scattered across the codebase.

One pool, one place to configure it, one place to observe it.

Fix 5: Make Pool Behavior Explicit

We moved pool configuration out of SQLAlchemy’s implicit defaults and into explicit environment-backed settings. Things like pool size, overflow capacity, timeout behavior, and whether connections should be health-checked before use are now deliberate decisions we can tune without touching code.

The specific values are something you should calibrate for your own workload and infrastructure. The point is that pool behavior should be something you consciously decide, not something a library quietly decides for you.
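One way to structure that, as a sketch. The DB_POOL_* environment variable names and the PoolSettings class are hypothetical, not what our codebase uses. The five keyword names are real SQLAlchemy engine parameters, and the fallbacks shown are the library's own defaults, written out so they are visible.

```python
import os
from dataclasses import asdict, dataclass
from typing import Mapping


@dataclass(frozen=True)
class PoolSettings:
    pool_size: int
    max_overflow: int
    pool_timeout: float
    pool_recycle: int
    pool_pre_ping: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "PoolSettings":
        """Read pool settings from the environment (variable names are hypothetical)."""
        return cls(
            pool_size=int(env.get("DB_POOL_SIZE", "5")),
            max_overflow=int(env.get("DB_POOL_MAX_OVERFLOW", "10")),
            pool_timeout=float(env.get("DB_POOL_TIMEOUT", "30")),
            pool_recycle=int(env.get("DB_POOL_RECYCLE", "-1")),
            pool_pre_ping=env.get("DB_POOL_PRE_PING", "false").lower() in {"1", "true", "yes"},
        )

    def engine_kwargs(self) -> dict:
        """Keyword arguments to splat into the engine constructor."""
        return asdict(self)


settings = PoolSettings.from_env({"DB_POOL_SIZE": "3", "DB_POOL_PRE_PING": "true"})
print(settings.engine_kwargs())
```

The dict from engine_kwargs() can then be passed along as create_async_engine(url, **settings.engine_kwargs()), so changing pool behavior becomes a config change and not a deploy.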

Key Takeaways

If you’re building async backends with AI workflows, here’s what to carry forward.

Async all the way through or not at all. One blocking database call in a hot path negates the benefits of an async framework. This is easy to miss in FastAPI because it will happily accept sync dependencies and route handlers without complaining. It won’t tell you you’re doing it wrong.

Middleware is a multiplier. Code running on every request has outsized impact on performance. Any blocking I/O there will surface under load before you feel it anywhere else in the stack.

AI workflows and long-lived DB sessions don’t mix. If your code is waiting on a model response, a tool call, or a streaming output, it should not be holding a database connection. Open it for the data operation, close it immediately, then go do the slow AI work.

One engine, one pool. If different parts of your app each create their own SQLAlchemy engine you have multiple pools with no unified view of connection usage. Centralize it.

Make your pool observable and tunable. If you can’t see how many connections are checked in versus checked out at any given moment you’re flying blind. And if your pool settings are implicit library defaults you have no leverage when things go wrong in production.

What a Hackathon Workshop Taught Us About Async Database Architecture in FastAPI was originally published in Beyond Localhost on Medium.

Setting Up Production Logging in a FastAPI Microservice

Aryan Khurana — Sat, 18 Apr 2026 17:44:33 GMT

A practical walkthrough of why colorful console logs break in production, and how to replace them with structured, contextual logging using structlog.

The Inspiration

I run a company called Rezzy, where we help engineers land interviews by building industry-standard resumes and cover letters using AI trained on real recruiters and hiring managers. Our stack is varied, but every backend microservice is written in Python with FastAPI.

When I first set everything up, I had a nice little logging setup going — the default uvicorn logger paired with colorama for pretty, colorful output. It looked great in the terminal, and it worked.

That setup was fine in development. But we’ve grown a lot since then. Our user base is doubling every month, and more users means more bugs. Bug reports have been rolling in lately, and that is where my logging setup started to fall apart.

Our services run on GCP, so I’ve been using the log viewer that comes with it to dig through logs. The problem was that my logging was not built for that kind of environment. It worked for small scripts and dev APIs, but for production troubleshooting, it was nowhere near enough.

Everything That Was Broken

There was no structure. I had optimized for “pretty,” not parseable. Since nothing was emitted as JSON, the log parsers in third-party visualizers and search tools could not make sense of any of it. That alone was a dealbreaker.
There was no request context. When multiple users hit the app at the same time, I had no way to trace a single request across all the logs it produced. Everything just smeared together.
There was no user-level tracking. Since I never attached a user_id to log entries, I had no idea which logs belonged to which user. Debugging a specific bug report was basically guesswork.
ANSI color codes were always on. My custom UvicornLikeFormatter unconditionally wrapped every field in Fore.* / Style.RESET_ALL. In production, logs end up in Cloud Run, CloudWatch, Datadog, or Loki looking like this: \x1b[92m2026-…\x1b[0m | \x1b[94mINFO\x1b[0m | ..
That breaks parsers, breaks search, and looks awful in log UIs. Colors should only turn on when sys.stderr.isatty() is true, or when PYTHON_ENV != "PROD".
Every module imported the same root logger. That meant record.name was always root. I had lost the ability to filter or route by module. The convention is to call logging.getLogger(__name__) per module, or use a helper like get_logger(name).
Third-party loggers were either silent or spammy. I set disable_existing_loggers: False and overrode uvicorn, which is fine, but I only configured the root logger and uvicorn itself. Everything else - SQLAlchemy, httpx, boto/botocore, OpenAI, Anthropic, LangChain, LangGraph, Stripe, Redis - was left to its defaults. A real production config pins the noisy ones (sqlalchemy.engine, botocore, httpx, urllib3, langchain, openai._base_client) to WARNING.
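The color problem in particular has a small fix. Here is a minimal sketch of the gating check, assuming a helper named should_use_color (a hypothetical name, not part of the original formatter) and the PYTHON_ENV convention described above. It is deliberately stricter than the rule stated earlier: colors are off in PROD and also off whenever the stream is not an interactive terminal.

```python
from typing import Optional, TextIO


def should_use_color(stream: TextIO, env: Optional[str]) -> bool:
    """Return True only for an interactive terminal outside production.

    `env` stands in for the PYTHON_ENV setting; the helper name is hypothetical.
    """
    if env == "PROD":
        # Production logs are read by parsers, never by a human at a TTY.
        return False
    # Streams that do not expose isatty() are treated as non-interactive.
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())
```

A formatter could call should_use_color(sys.stderr, env) once at startup and skip the Fore.* wrapping whenever it returns False.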

Building a Proper Logging Setup

Alright, intro’s done. Let’s get into the actual tutorial on how to build production-grade logging for a FastAPI app.

Step 1: The Logging Config File

Create a single logging config file that gets shared across the entire application. Call it whatever you want — I use src/common/logger.py.

The first thing to know is that we’re using a library called structlog. It’s the backbone of this whole setup. It gives you structured logs (as dicts), a clean processor pipeline, and plays nicely with the standard library’s logging module.

Step 2: Quiet the Noisy Libraries

Before anything else, set per-library log levels. You don’t want to see every library’s internal chatter — it drowns out your own logs. If you let them all emit at INFO, your production output becomes 95% noise and 5% signal.

This map turns them down:

_LIBRARY_LEVELS = {
    "sqlalchemy.engine": "WARNING",
    "botocore": "WARNING",
    "httpx": "WARNING",
    "urllib3": "WARNING",
    "langchain": "WARNING",
    "openai._base_client": "WARNING",
    # ...add others as needed
}

Each library gets a level that matches how useful its output actually is.

Step 3: The setup_logging Function

This is called once, on app startup.

First thing we do is force stdout to flush after every newline instead of buffering in chunks:

try:
    sys.stdout.reconfigure(line_buffering=True)
except AttributeError:
    # reconfigure doesn't exist on all stream types — fall back to default buffering
    pass

This matters because when the app runs inside Docker, logs sit in memory until the buffer fills up, and then everything gets flushed at once. That is terrible for debugging — you’ll miss logs right up until the moment of a crash. The try/except exists because reconfigure isn’t available on every stream type, and if it fails we just accept the default buffering.

Next, pull the environment and log level from your settings:

env = settings.PYTHON_ENV
level = settings.LOG_LEVEL

Step 4: The Shared Processors

Now we get to the most important part — the shared processors.

A processor is a function that takes a log event (a dict) and returns a modified dict. They run in order, like a pipeline, for every log that flows through the system. Here is what each one does:

merge_contextvars - Merges anything bound via structlog.contextvars.bind_contextvars(...). This is how per-request data (like user_id and request_id) gets attached automatically to every log inside that request. The middleware (coming up later) is what actually binds these values.
add_log_level - Pretty self-explanatory: attaches the log level (info, warning, error, etc.) to the event dict.
TimeStamper(fmt="iso", utc=True) - Adds a UTC timestamp. Always use UTC. If you log in local time, cross-service debugging becomes a nightmare the moment your services run in different regions.
StackInfoRenderer - If you pass stack_info=True to a log call, this renders the stack trace. Occasionally useful when you want context without an exception.
format_exc_info - If a log call includes exception info (e.g., logger.exception(...) or exc_info=True), this turns the traceback into a readable string. Without it, exceptions just… don’t show up in your logs. In production you might prefer dict_tracebacks for structured tracebacks, but format_exc_info works fine for both pretty and JSON output.
CallsiteParameterAdder - Attaches the source location (module, function, line number) to every log. So when you see an entry, you know exactly where in the code it came from. This is the equivalent of the %(module)s:%(funcName)s:%(lineno)d you had in the old stdlib formatter.

Then we pick the renderer based on environment:

if env == "PROD":
    renderer = structlog.processors.JSONRenderer()
else:
    renderer = structlog.dev.ConsoleRenderer(colors=True)

Production gets JSON so parsers can index it properly. Development gets colorful, human-readable output.

Step 5: Configure structlog

structlog.configure(
    processors=shared_processors + [
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    wrapper_class=structlog.make_filtering_bound_logger(level),
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

A quick walk-through of what each argument does:

We hand structlog all the shared processors, plus wrap_for_formatter — which is the glue that lets structlog and the stdlib logging module collaborate.
wrapper_class=make_filtering_bound_logger(level) means the logger filters out events below the configured level before running any processors. Faster, and keeps your pipeline clean.
logger_factory=LoggerFactory() makes structlog loggers use stdlib loggers underneath. This is the key to unifying everything: stdlib logs (from uvicorn, SQLAlchemy, etc.) and structlog logs (from your code) both flow through the same pipeline.
cache_logger_on_first_use=True is a small performance win. Once get_logger("foo") is called, the configured logger is cached.

Step 6: The stdlib Side of the Pipeline

Now we wire up the standard library side:

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    structlog.stdlib.ProcessorFormatter(
        processor=renderer,
        foreign_pre_chain=shared_processors,
    )
)

Two key pieces here:

StreamHandler(sys.stdout) - Writes to stdout, not stderr. Stdout is the right default for container environments.
ProcessorFormatter - A special formatter that bridges stdlib logging into structlog’s processor system.

Inside the formatter:

processor=renderer is the final renderer (JSON in prod, Console in dev).
foreign_pre_chain=shared_processors is the important bit. Log events that originate from stdlib loggers - uvicorn, SQLAlchemy, Stripe - are “foreign” because they didn’t come through structlog. This argument runs them through the same processor pipeline before rendering.

The result: whether a log comes from structlog.get_logger(__name__).info(...) in your code, or from uvicorn’s stdlib logger, it ends up formatted identically. One unified output format. This unification is the whole reason the structlog setup is slightly more involved - but it’s worth it.

Step 7: Wire Up the Root Logger

root = logging.getLogger()
root.handlers.clear()
root.addHandler(handler)
root.setLevel(level)

What each line does:

logging.getLogger() with no args returns the root logger - the parent of every other stdlib logger.
root.handlers.clear() removes any handlers added elsewhere (by uvicorn’s default setup, basicConfig, whatever). This prevents duplicate log lines, which is critical because uvicorn adds its own handlers if you let it.
root.addHandler(handler) routes all logs through our unified handler.
root.setLevel(level) filters at the root at our configured level.

By default, child loggers in Python propagate events up to the root. So once the root is configured, every library’s logger automatically uses our handler. That’s why we don’t have to configure each library individually — they inherit.

Step 8: Per-Library Level Overrides

for name, lib_level in _LIBRARY_LEVELS.items():
    lib_logger = logging.getLogger(name)
    lib_logger.setLevel(lib_level)
    lib_logger.propagate = True

For each entry in the map: grab the logger, set its level (so it filters its own noise at the source, before it ever reaches the root), and make sure propagate = True so events that pass the filter still bubble up.

Why set both? Two reasons:

setLevel(WARNING) means the library won’t even emit INFO/DEBUG events. They’re discarded at the source, which is more efficient.
propagate = True (the default, but set explicitly here for safety) means whatever does pass the filter flows up to the root handler for formatting.

The explicit propagate = True is slightly defensive. Some libraries set propagate = False on their own loggers to avoid double-logging, which would quietly break our setup. Forcing it to True guarantees that every logger listed in the map sends its output through our handler; loggers that are not in the map keep whatever propagate value their library chose.
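The interaction between setLevel and propagate is easy to check with the standard library alone. The sketch below uses an in-memory handler and a made-up library name (noisy_lib); both are assumptions for illustration only.

```python
import logging


class ListHandler(logging.Handler):
    """Collects messages in memory so the demo can inspect what got through."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(
            f"{record.name}:{record.levelname}:{record.getMessage()}"
        )


root = logging.getLogger()
root.handlers.clear()
capture = ListHandler()
root.addHandler(capture)
root.setLevel(logging.INFO)

# Hypothetical noisy library pinned to WARNING, as in the override loop.
lib = logging.getLogger("noisy_lib")
lib.setLevel(logging.WARNING)
lib.propagate = True

lib.info("internal chatter")               # dropped by the library logger itself
lib.warning("connection pool exhausted")   # passes, then propagates to root
logging.getLogger("app.orders").info("order_created")  # inherits INFO from root

print(capture.messages)
```

Only the warning from noisy_lib and the application's INFO event reach the root handler; the library's INFO line never leaves its own logger.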

Step 9: The Factory Function

def get_logger(name: str) -> structlog.stdlib.BoundLogger:
    return structlog.get_logger(name)

A thin wrapper so your app code imports get_logger from this module instead of calling structlog.get_logger directly. A few reasons this is worth it:

Consistent entry point. One place to change logging behavior app-wide.
Typed return value. IDEs and type checkers know exactly what you’re getting back.
Less coupling to structlog. If you ever want to swap the underlying library, you change one file.

Usage is pretty simple:

from src.common.logger import get_logger

logger = get_logger(__name__)

logger.info("order_created", order_id=123, total=49.99)

Wrapping Up

That’s the setup. The short version of what changed:

Logs are now structured JSON in production, readable color output in development.
Every log carries request context and user identity automatically.
Noisy third-party libraries are pinned to WARNING, so they stop burying real signal.
stdlib loggers and structlog loggers emit through a single unified pipeline.
Tracebacks, timestamps, and call sites are attached to every entry.

It’s more code than my original pretty-logger setup, but the payoff is huge. Bug reports that used to take me an hour to track down now take a few minutes, because I can filter Cloud Logging by user_id, pull every log from the problem request, and actually see what happened.

If you’re running a FastAPI app in production and still relying on uvicorn defaults, I’d strongly encourage you to make the switch.

Catch you in the next one.

Setting Up Production Logging in a FastAPI Microservice was originally published in Beyond Localhost on Medium, where people are continuing the conversation by highlighting and responding to this story.

Diary of an MLH SWE Fellow — Week 12 (Final Week) + My Experience

Aryan Khurana — Sat, 09 Aug 2025 18:37:54 GMT

This blog marks the final entry in a series I’ve been posting over the past three months. It began with my desire to document my experiences as an SWE Fellow at the MLH Fellowship, and in this blog, I’ll be reflecting on my last week and my overall journey as a fellow.

In my previous blog, I mentioned a PR I had submitted that was under review. That PR was eventually approved and merged, and the issue was closed. We also had our end-of-program demo with RBC, where we showcased everything we had accomplished since the mid-program demo, and it went really well. Our final pod meeting was a chance for everyone to share their experiences.

Beforehand, each of us was sent a question via email to answer during the meeting. My question was: “You have a unique experience in both SRE and SWE. Where does your heart lie moving forward? How do these skill sets complement each other?” This was because I currently work full-time as an SRE at RBC. I shared my experiences, learned from others, and had a lot of fun during the discussion.

They congratulated us on graduating, and honestly, these past three months flew by so quickly that I didn’t even realize it.

Here’s everything I worked on so far:

Issues: https://github.com/apache/airflow/issues?q=is%3Aissue%20state%3Aclosed%20assignee%3AAryanK1511
Pull Requests: https://github.com/apache/airflow/pulls?q=is%3Apr+is%3Aclosed+author%3AAryanK1511

Along the way, I made new friends, gained valuable open-source experience (especially on such a large project), and fulfilled what was once a dream, joining and completing the MLH Fellowship.

This journey has been nothing short of amazing, and I hope you’ve enjoyed following this blog series as much as I’ve enjoyed writing it.

Diary of an MLH SWE Fellow — Week 10 and 11

Aryan Khurana — Mon, 04 Aug 2025 12:08:08 GMT

Prepping for the Final Week

Introduction

This blog builds on my previous post, where I shared that I was tackling my most complex issue yet. Although it started as a UI-related task, it required several backend prerequisites that hadn’t been implemented. As a result, I ended up opening five additional pull requests to introduce the necessary API changes.

Before I dive into the technical details, I want to talk about our recent pod merge standup. All our pod members joined, and we played Skribbl together. It was a blast! We also have our end-of-program demo coming up next week (Week 12).

I decided to combine my updates for Weeks 10 and 11 into one post, since I spent most of both weeks working on the same issue. I’m now basically done with the fellowship. There might be one more PR, but my main work is complete.

Adding API Support for Filtering DAGs by Bundle Name and Version

Issue Link: https://github.com/apache/airflow/issues/53739

To address this, I first had to learn how to create DAG bundles so I could test the existing functionality. This was tricky, because as developers, we typically run Airflow using Breeze, and the documentation doesn’t always translate directly to that setup. You really need a solid understanding of Breeze to get this working.

After about two hours of troubleshooting, I managed to create two DAG bundles. Once that was done, I figured out how to create DAG bundle versions, and finally reached the point where I could test the API routes.

I implemented the required functionality in both the UI and public API routes. For anyone unfamiliar: in Airflow, there are two folders for API routes. The public routes are backward compatible and used by external consumers, while the UI routes are for the Airflow UI and don’t guarantee backward compatibility. I made changes to both, wrote tests, ran pre-commit hooks (which generated spec files and other outputs), and pushed my changes.

Here’s the link to my PR: https://github.com/apache/airflow/pull/54004

I received some feedback asking me to add one more test. That turned out to be quite challenging, since the test itself was pretty complex. After some struggling, I finally figured it out and requested another review.

With my PR submitted, I polished up my demo slides for the final presentation and reviewed them with our pod leader. Our final demo is on Tuesday, and I’m really looking forward to it.

Diary of an MLH SWE Fellow — Week 09

Aryan Khurana — Fri, 25 Jul 2025 02:03:49 GMT

Tackling Complex API & UI Changes in Apache Airflow

Introduction

In my last blog, I mentioned I was hunting for bigger challenges in Apache Airflow, ideally wrapping up my term with a complex issue. Well, mission accomplished, because I landed one that’s both fun and massive. It’s a mix of backend API changes and frontend UI updates, and it’s going to be my main focus for the next few weeks.

Restoring Legacy Filters in Airflow’s New UI

Here’s the background:
Apache Airflow’s frontend used to be built on Flask AppBuilder, but has now moved to a more modern ReactJS UI. While this upgrade is awesome, it came with a tradeoff: the new UI lost a few handy filters from the old DAGs (Directed Acyclic Graphs) view. The Airflow team wants to bring these filters back.

The API powering these filters also shifted, moving from Flask routes to FastAPI. Some of the routes required for these filters don’t exist yet. So, before we can build the filters in the UI, we first have to implement all the backend pieces!

If you want the full details, check out the main issue on GitHub:
Restore legacy filters in the new DAGs view

My Plan of Action

This project is essentially split into two main parts:

API work: Add all the required endpoints and filtering logic in the FastAPI backend.
UI work: Once the backend is ready, wire up the new filters in the React frontend.

To keep things manageable, I broke the main task into several sub-issues, each focused on a specific filter. Here are the sub issues that I have opened so far:

Add has_import_errors filter to Core API GET /dags endpoint: Issue #53536
Add API support for filtering DAGs by timetable type: Issue #53738
Add API support for filtering DAGs by bundle name and version: Issue #53739
Enhance API support for filtering DAGs with asset-based schedules: Issue #53740
Add API support for filtering unscheduled DAGs: Issue #53741

The idea is to finish the API work for all these sub-issues first, then update the UI in one fell swoop, since adding the frontend filters will mostly just be a matter of calling these new API endpoints.

The first sub-issue did not go too well

I dove in with the first issue, adding a has_import_errors filter. Seemed straightforward, but it wasn’t.

Here’s what I discovered:

The dags table in the Airflow database has a has_import_errors field.
But if a DAG fails to import (e.g., a Python error in the DAG file), it doesn’t get added to the dags table at all.
Instead, failed DAGs are tracked in a completely separate table called import_error.
In the Airflow UI, these broken DAGs show up in a separate “Import Errors” section, not mixed in with valid DAGs.

Quick code snippet to show what I mean:

from airflow import DAG
from datetime import datetime

# This line will break the import!
this_will_throw_an_error

default_args = {'start_date': datetime(2023, 1, 1)}

dag = DAG(
    'broken_dag',
    default_args=default_args,
    schedule=None,  # "schedule" replaced "schedule_interval", which Airflow 3 removed
)

If you try to add a broken DAG like this, it does not appear in the dags table. But you’ll see it in the UI’s Import Errors section, thanks to the import_error table.

I flagged this to the maintainers, since it means there’s no way to filter for has_import_errors in the dags table: the broken DAGs just aren’t there!

Here’s what I asked:

Given this setup, it seems like adding an API/UI filter on has_import_errors wouldn’t actually be useful, since the broken DAGs aren’t in the dags table to be filtered in the first place, they’re already shown separately in the UI.

Should we still go ahead with this filter, or just close the issue as “not planned”? Happy to move on to the other filters in the meantime!

The response:

“Yeah, maybe just disregard that field for now and focus on other filters.
It’s weird, I wasn’t able to identify where that DAG attribute was set. I’ll need to do more digging.”

So that settles it: this filter isn’t worth implementing right now. (If you’re curious, this is a pretty common scenario in open source: you plan for something, dig in, and sometimes realize the original plan doesn’t make sense.)

What’s Next

With that detour out of the way, I’m shifting focus to the other API filters I opened.

My goal:

Week 10: Make serious progress on implementing these API routes.
Week 11: Continue API work and maybe start wiring up the UI.
Week 12: Finish the UI, tie everything together, and (hopefully!) have an awesome end-of-term demo to show off.

Stay tuned for updates, I’ll share more details as I chip away at these features.

Diary of an MLH SWE Fellow — Week 07 and 08

Aryan Khurana — Wed, 16 Jul 2025 21:54:43 GMT

The Not-So-Glamorous Weeks of the Fellowship

Introduction

Hey everyone! I’m rolling weeks 07 and 08 into one blog post because, honestly, things have been pretty slow on my Apache Airflow journey lately. Not a lot got done, so instead of forcing two lackluster updates, I figured I’d give you a real snapshot of the past couple weeks and set the stage for the final stretch.

We’re now kicking off week 09 of the 12-week MLH Fellowship, which means there are only four weeks left, including this one. My personal goal is to close out at least two more issues and get two more PRs merged before the end-of-program demo since I wanna finish strong.

The Mid-Program Demo (And My Subway Saga)

After a few reschedules, we finally had our mid-program demo with stakeholders from RBC (shoutout to them for sponsoring the project!). The demo went… alright. We walked through what we’ve built so far, highlighted two PRs we’re proud of, and talked about what we’ve learned as MLH fellows.

I had to present my part of the demo from a Subway joint (the sandwich place) because I had a health card appointment right after. My laptop connection dropped mid-demo, so I had to awkwardly ask the Subway employee for the WiFi password. Luckily, he hooked me up and I got to finish my presentation, but it wasn’t exactly the smoothest experience.

Why the Slump?

Honestly, these two weeks felt like a slump. Airflow’s a massive, slow-moving project, and lately, there just haven’t been many beginner-friendly or mid-level issues to pick up. I was also tied up with some other commitments, so my productivity on Airflow took a hit. All three of us working on Airflow for this fellowship felt like we didn’t have anything super “flashy” to show in the demo, which was a bit of a bummer.

Scouting for Issues

But I’m back and motivated! I spent some time digging through Airflow’s GitHub issues and found a few that look interesting. Here are the ones I commented on today:

I’m also eyeing another interesting one (#52660) and planning to chat with my mentor about it soon. Fingers crossed I can get something cool assigned to me so I can wrap up the fellowship on a high note.

Wrapping Up

So yeah, the past two weeks were pretty quiet, but I’m ramping back up. I’m committed to finishing off strong, tackling some fun new issues, and hopefully having something awesome to show at the final demo. Stay tuned!

MLH Fellowship: Reflections at the Halfway Point (Weeks 1–6)

Aryan Khurana — Tue, 01 Jul 2025 17:03:19 GMT

How six weeks of open-source, mentorship, and setbacks have shaped my path as an SWE Fellow

Time really does fly! With Week 6 wrapped up and Week 7 underway, I’m officially halfway through my journey as an SWE Fellow in the MLH Fellowship. When I started back in May, July felt far off, but here we are.

The MLH Fellowship is a 12-week program where students contribute to open-source projects. For me, that project has been Apache Airflow, with sponsorship from RBC (Royal Bank of Canada). These past six weeks have been full of learning, building, and collaborating. In this post, I’m taking a step back to reflect on everything I’ve experienced so far.

What I’ve Been Working On

Since joining, I’ve really immersed myself in the Airflow ecosystem, learning how it works under the hood, navigating a huge codebase, and connecting with an amazing community of maintainers and contributors.

Issues Tackled

In total, I’ve picked up 9 issues in the Airflow repo so far:

Closed: 4
In Progress: 3
Closed as Not Planned: 2 (admittedly, it stings a bit to see these closed after investing time digging through the code, but it’s all part of the open-source process)

Here are the links to the issues I have worked on:

Completed:

In Progress:

Closed as Not Planned:

Pull Requests

On the PR front, I’ve raised a total of 5 PRs:

Merged: 4
Closed: 1 (because the related issue was closed as not planned)

Here are links to my Pull Requests:

Merged:

Closed:

Providers/HTTP: Add Debug logging for improved troubleshooting in the HTTP Provider Hook

Reflection

While working on open source has taught me a ton, I have to admit, it can get pretty lonely sometimes. When you get stuck and there’s nobody immediately around to help, it’s easy to feel like giving up on an issue. Luckily, in our case, we had mentors assigned to us who are actual Apache Airflow maintainers. Having someone to fall back on and ask questions really made a difference and kept me going whenever I hit a roadblock.

I think one thing that worked in my favor was spending a huge amount of time at the beginning just learning about the product and how to use it. I created sample DAGs, ran Airflow both as a developer and a user, and really tried to understand its capabilities. I remember binging YouTube videos from Airflow conferences where contributors talked about the architecture and how to get started, which helped a lot. I even took a Udemy course just to nail down the basics. All of this took longer than I expected, and honestly, it was pretty demoralizing when my first PR didn’t get merged, since we realized that feature wasn’t really needed. I’d worked on it for a week, only for it never to see the light of day.

Since then, I’ve faced similar situations, twice, in fact! I’ve learned that’s just the nature of open source. You might raise an issue or work on a feature thinking it’s valuable, but the project’s priorities might not align, or the change is too complicated for what’s needed, and it gets closed as “not planned.” Early on, I remember getting stuck on two issues at once and feeling really discouraged. By the end of week 4, I still had zero closed PRs or issues. But I kept pushing, and now, at the end of week 6, I have four under my belt.

Honestly, I’m not that proud of the specific changes I’ve made so far, because they haven’t been the highest-impact contributions. In fact, just today, another issue I’d invested a lot into, learning the whole auth manager flow and mechanism, was closed as “not planned.” That one was supposed to be my biggest contribution yet, and I was hoping to show it off in our upcoming midterm presentation with Airflow maintainers and RBC stakeholders.

But looking back, I realize all the time I spent learning and struggling in the first half was actually preparing me for what’s next. I’m now ready to tackle bigger, more challenging issues in the second half of the fellowship.

My goal for the rest of the program

My goal is to take on more responsibility, dig even deeper into the Airflow codebase, and make a real impact in the community.

The MLH Fellowship has been a lot of fun, and I’ve met some amazing people along the way. I’m looking forward to seeing how much I can grow in these next two months, and I hope to make some truly meaningful contributions soon.

Diary of an MLH SWE Fellow — Week 06

Aryan Khurana — Sat, 28 Jun 2025 17:56:37 GMT

Two more PRs Merged

Introduction

I mentioned in my last blog that I was finally gaining momentum, and contributing to Airflow is starting to feel genuinely fun. This week, I picked up two more issues, raised PRs for both, and successfully got them merged. That brings my total to 4 merged PRs in the MLH Fellowship so far, which is great news because MLH typically expects at least 3–4 merged pull requests by the end of the program. So now, let’s talk about the issues!

Issue 1: Snowflake JSON Decode Error

Link: https://github.com/apache/airflow/issues/52079

The first issue I tackled was relatively straightforward. Some providers in Airflow, like the Snowflake SQL API, fetch JSON responses from APIs. However, when the APIs fail, users were encountering unhandled JSON decode errors, and there was no retry mechanism in place.

This issue requested better error handling and a retry mechanism using the tenacity library. I dove into the problem, only to realize that someone else had already indirectly fixed it in a different PR: #51463.

Since their fix wasn’t directly tied to this issue, there was no test coverage verifying that it resolved the problem. So I wrote a couple of test cases to validate the behavior and raised my own PR which was eventually merged and officially closed the issue.

Link to my PR: https://github.com/apache/airflow/pull/52118

Issue 2: AWS Glue Retry and Error Handling

Link: https://github.com/apache/airflow/issues/52152

After that, one of the maintainers asked me if I’d be interested in working on a similar issue, this time involving the AWS Glue provider. I agreed and took it on. Although it was fairly straightforward, it was the most complex issue I’ve worked on so far, since my earlier PRs didn’t involve many code changes.

This issue required actual logic changes, so I used what I learned from the previous issue and added retry mechanisms, proper error handling, and test coverage to ensure everything worked as expected.

Thanks to the experience I’ve gained, the development workflow felt much smoother this time. I’ve gotten used to:

Spinning up the dev environment using Breeze
Making code changes
Running relevant tests with pytest
Running the pre-commit hooks
Committing and pushing changes and eventually raising a PR following best practices.

Link to my PR: https://github.com/apache/airflow/pull/52262

Conclusion

I didn’t mention it in this blog, but I did pick up another issue and spent some time working on it. Unfortunately, I hit a roadblock, so I’ll save the details for a future post once I figure out a solution. That said, I’m planning to take on more challenging issues moving forward. We’re officially halfway through the fellowship, and I’m excited to see what I can accomplish in the second half!